1. ## Segmenting after block-sorting

It's just a thought.

AFAIK, segmentation is today done before the actual BWT and the later stages. Because of this, it's impossible to accurately estimate the compression effect of a particular segmentation scheme without actually running the full compression for every candidate scheme.

I have an idea:
- Assume we have computed some potentially good segmentation schemes and want to discover their effect on compression ratio,
- We do a full BWT of the entire data, keeping the original position of each byte,
- With that information, we can compute the BWT of the blocks of any one segmentation scheme in a single linear pass.

E.g.:

Let's assume we have done the full block-sort and obtained:

Code:
```
5 2 8 9 4 6 3 1 7
a s a f i a p e w
```
Assume a segmentation scheme that divides the block into two blocks of lengths 4 and 5. The first block will then contain only the suffixes with indices 1..4 and the second block those with indices 5..9, so we can split the original block into two in a single linear pass:

Code:
```
2 4 3 1
s i p e

5 8 9 6 7
a a f a w
```
Of course this is a little wrong, i.e. the corresponding symbols at segment boundaries will change, since within a segment the suffixes are effectively truncated at the segment's end and the symbol preceding a segment's first position lies outside the segment.

Let me know whether you understand the idea, and what you think about it. I think that even if this method isn't suitable for actual compression (it would make compression too slow), it should be helpful when searching for good segmentation algorithms: once we compute the suffix array for the full block, we can derive the suffix arrays of the segments quickly, with a simple linear pass over the full suffix array per segmentation scheme.
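A minimal Python sketch of the idea (function names are mine, and the BWT here is a naive cyclic-rotation sort, not a real suffix-array construction): compute the full BWT once while keeping each row's original position, then derive a segment's approximate BWT by a linear filter on those positions, exactly as in the example above. The boundary symbols carry the caveat mentioned above.

```python
def full_bwt_with_positions(text):
    """Naive BWT via sorted cyclic rotations.
    Returns (positions, symbols): 1-based original positions of the
    sorted rotations, and the last-column (BWT) symbols."""
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    positions = [i + 1 for i in order]
    symbols = [text[(i - 1) % n] for i in order]
    return positions, symbols

def split_bwt(positions, symbols, boundary):
    """Approximate per-segment BWTs by a single linear pass over the
    full BWT: keep each row in the segment its original position falls
    into.  Symbols near segment boundaries may differ from a true
    per-segment BWT (the 'a little wrong' caveat)."""
    first = [(p, s) for p, s in zip(positions, symbols) if p <= boundary]
    second = [(p, s) for p, s in zip(positions, symbols) if p > boundary]
    return first, second
```

For "banana$" the full pass gives BWT "annb$aa" with positions [7, 6, 4, 2, 1, 5, 3]; splitting at boundary 3 then yields the two filtered blocks without re-sorting anything.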

2. ## Thanks:

Bulat Ziganshin (14th May 2016)

3. Well, it's certainly interesting and may be worth trying.
I'd start by estimating the codelength of each symbol in BWT, with a good postcoder.
Then process these codelengths in original order (before BWT), and find the point where
Code:
```
accumulated_codelength / n >= coef * (total_codelength - accumulated_codelength) / (total - n)
```
with coef=1.2 or something like that, i.e. symbols from the block tail are expected to have a considerable redundancy when encoded in the wrong context.
So then we can split the block into [0..n) + [n..total) parts and either repeat the scan recursively,
or (better) recompute BWT for the parts first.
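The scan above could be sketched like this (a hypothetical helper of mine; the stopping rule and the coef=1.2 default are taken directly from the inequality above):

```python
def find_split(codelengths, coef=1.2):
    """Scan estimated codelengths in original (pre-BWT) order and return
    the first split point n where the head's average codelength reaches
    coef times the tail's average, i.e.
      acc/n >= coef * (total - acc) / (len - n).
    Returns None if no such point exists."""
    total = sum(codelengths)
    count = len(codelengths)
    acc = 0.0
    for n in range(1, count):
        acc += codelengths[n - 1]
        if acc / n >= coef * (total - acc) / (count - n):
            return n
    return None
```

A found n would split the block into [0..n) + [n..total), after which the scan can be repeated recursively, or the BWT recomputed for the parts first.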

But there's a problem.
Segmentation is done mainly because BWT has to work with blocks anyway.
In other words, this kind of segmentation, even if it were perfect, can't be applied if you don't have enough memory for the full BWT.
Also, it's relatively slow and requires a lot of memory. The extra 1N for codelengths can be saved by making a weird out-of-order postcoder
(scan in source order, but use BWT contexts), but you'd need another 5N anyway.

And btw, the segmentation provided by a simple low-order CM model (like Shkarin's seg_file) seems good enough as it is...