It's just a thought.
AFAIK, segmenting is currently done before the actual BWT transform and the further stages. Because of this it's impossible to accurately simulate the compression of a particular segmenting scheme without actually doing a full compression run for every scheme.
I have an idea:
- Assume we have computed some potentially good segmenting schemes and want to discover their effect on compression ratio,
- We do a full BWT on the entire data, keeping the original position with each byte,
- With this information we can compute the BWT of the segmented blocks of any one segmentation scheme in a single linear pass (see the sketch after the example below),
E.g., let's assume we have done the full block sort and obtain:
Code:
5 2 8 9 4 6 3 1 7
a s a f i a p e w
Assume a segmenting scheme which divides the block into two blocks of lengths 4 and 5. The first block will then contain only the suffixes with indices 1..4, and the second block those with indices 5..9. So we can split the original sorted block into two in one linear pass easily:
Code:
2 4 3 1
s i p e
5 8 9 6 7
a a f a w
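
Here is a minimal Python sketch of that linear pass (the function and variable names are just for illustration, not any existing implementation); it reproduces the example above:
Code:
# "sa" holds the 1-based suffix start positions of the full block sort,
# "bwt" the corresponding symbols, "boundaries" the end position of each
# segment (all names are hypothetical).
def split_sa(sa, bwt, boundaries):
    """Derive per-segment suffix orders (and their symbols) from the full
    suffix array in one pass.  Boundary symbols are NOT fixed up, so the
    result is only approximate, as noted below."""
    # Map each text position to its segment index.
    seg_of = {}
    start = 1
    for seg_id, end in enumerate(boundaries):
        for pos in range(start, end + 1):
            seg_of[pos] = seg_id
        start = end + 1

    segments = [([], []) for _ in boundaries]
    for pos, sym in zip(sa, bwt):        # one linear pass over the full SA
        idx, syms = segments[seg_of[pos]]
        idx.append(pos)
        syms.append(sym)
    return segments

# Reproduces the example: two segments of lengths 4 and 5.
sa  = [5, 2, 8, 9, 4, 6, 3, 1, 7]
bwt = list("asafiapew")
for idx, syms in split_sa(sa, bwt, boundaries=[4, 9]):
    print(idx, "".join(syms))
# -> [2, 4, 3, 1] sipe
#    [5, 8, 9, 6, 7] aafaw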
Of course this is a little wrong, i.e. the corresponding symbols at the segment boundaries will change (in a real per-segment BWT they would wrap around within the segment rather than point into the neighbouring one).
Let me know whether you understand the idea and what you think about it. I think that even if this method isn't suitable for actual compression (because it would make compression take too long), it should be helpful when searching for good segmentation algorithms. Once we compute the Suffix Array for the full block, we can derive the Suffix Arrays for the segments quickly, with one simple linear pass over the full Suffix Array per segmentation scheme.
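
For example, a rough sketch of how the split could be used to rank candidate schemes (split_sa is the helper from the sketch above; the order-0 entropy here is only a crude stand-in for the real post-BWT stages):
Code:
import math
from collections import Counter

def order0_bits(symbols):
    # Order-0 entropy estimate (in bits) of a symbol sequence.
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())

def score_scheme(sa, bwt, boundaries):
    # Approximate coded size of one scheme: sum the entropy of every
    # segment's derived (approximate) BWT output.
    return sum(order0_bits(syms) for _, syms in split_sa(sa, bwt, boundaries))

# Compare a few candidate schemes against the same full block sort.
sa  = [5, 2, 8, 9, 4, 6, 3, 1, 7]
bwt = list("asafiapew")
for scheme in ([9], [4, 9], [3, 6, 9]):
    print(scheme, round(score_scheme(sa, bwt, scheme), 2))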