
Thread: Segmenting after block-sorting

  1. #1
    Join Date
    Jun 2009
    Kraków, Poland
    Thanked 136 Times in 104 Posts

    Segmenting after block-sorting

    It's just a thought.

    AFAIK, segmenting is today done before the actual BWT transform and the later stages. Because of this, it's impossible to accurately estimate the compression ratio of a particular segmenting scheme without actually doing a full compression run for every segmenting scheme.

    I have an idea:
    - Assume we have computed some potentially good segmenting schemes and want to discover their effect on compression ratio.
    - We do a full BWT on the entire data, keeping the original position of each byte.
    - With that information, we can compute the BWT of the blocks of any one segmentation scheme in a single linear pass.


    Let's assume we have done the full block-sort and obtained:

    5 2 8 9 4 6 3 1 7
    a s a f i a p e w
    Assume a segmenting scheme which divides the block into two blocks of lengths 4 and 5. Then the first block will contain only the suffixes with indices 1..4, and the second block those with indices 5..9. So we can easily split the original block in two in a single linear pass:

    2 4 3 1
    s i p e
    5 8 9 6 7
    a a f a w
    Of course this is slightly wrong, i.e. the corresponding symbols at the segment boundaries will change.

    Let me know whether the idea is clear, and what you think about it. I think that even if this method isn't suitable for actual compression (because it would make compression take too long), it should be helpful when searching for good segmentation algorithms. Once we compute the suffix array of the full block, we can derive the suffix arrays of the segments quickly, with one simple linear pass over the full suffix array per segmentation scheme.
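    To make the linear pass concrete, here is a minimal sketch (the function name and representation are my own) that reproduces the worked example above. As noted, the result is only an approximation of each segment's true suffix array, since suffixes that the full sort compared across a segment boundary may be ordered differently by a true per-segment sort:

    ```python
    import bisect

    def split_suffix_array(sa, ends):
        """Split a full suffix array into per-segment suffix orders.

        sa:   full suffix array (1-based text positions, as in the example)
        ends: last index of each segment, e.g. [4, 9] for lengths 4 and 5
        Each output list keeps the relative order of the full suffix array,
        so the whole split is a single linear pass over sa.
        """
        parts = [[] for _ in ends]
        for pos in sa:  # one linear pass over the full suffix array
            # bisect_left finds the first segment whose end index >= pos
            parts[bisect.bisect_left(ends, pos)].append(pos)
        return parts

    # The example from the post: full block-sort order and its symbols.
    sa      = [5, 2, 8, 9, 4, 6, 3, 1, 7]
    symbols = list("asafiapew")
    sym_of  = dict(zip(sa, symbols))

    first, second = split_suffix_array(sa, [4, 9])
    print(first,  [sym_of[p] for p in first])   # [2, 4, 3, 1] with s i p e
    print(second, [sym_of[p] for p in second])  # [5, 8, 9, 6, 7] with a a f a w
    ```

    This reproduces exactly the two sub-blocks shown above, and the cost per segmentation scheme is one pass over the full suffix array.
    
    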

  2. Thanks:

    Bulat Ziganshin (14th May 2016)

  3. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Kharkov, Ukraine
    Thanked 1,397 Times in 802 Posts
    Well, it's certainly interesting and may be worth trying.
    I'd start by estimating the codelength of each symbol in BWT, with a good postcoder.
    Then process these codelengths in original order (before BWT), and find the point where
    accumulated_codelength/n >= coef*(total_codelength-accumulated_codelength)/(total-n)
    with coef=1.2 or something like that
    i.e. symbols from the block tail are expected to have considerable redundancy when encoded in the wrong context.
    So then we can split the block into [0..n) + [n..total) parts and either repeat the scan recursively,
    or (better) recompute BWT for the parts first.
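    A sketch of that scan, assuming the per-symbol codelengths have already been estimated by the postcoder (the function name and the inequality's greedy first-hit behaviour are my reading of the description above):

    ```python
    def find_split(codelengths, coef=1.2):
        """Scan codelengths in original (pre-BWT) order and return the
        first n where the head's average codelength exceeds coef times
        the tail's average, i.e.
            acc / n >= coef * (total - acc) / (len - n)
        Returns None if no such point exists."""
        total = float(sum(codelengths))
        acc = 0.0
        # stop one short of the end so the tail is never empty
        for n, cl in enumerate(codelengths[:-1], start=1):
            acc += cl
            if acc / n >= coef * (total - acc) / (len(codelengths) - n):
                return n  # split into [0..n) and the remainder
        return None

    # Expensive head, cheap tail: a split is suggested early.
    print(find_split([8.0, 8.0, 8.0, 8.0, 2.0, 2.0, 2.0, 2.0]))  # 1
    # Flat codelengths: no split point is found.
    print(find_split([2.0] * 8))  # None
    ```

    As written this triggers at the first qualifying point; in practice one would then recurse on the parts or, as suggested above, recompute the BWT for each part first.
    
    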

    But there's a problem.
    Segmentation is done mainly because BWT has to work with blocks anyway.
    In other words, this kind of segmentation, even if it was perfect, can't be applied if you don't have enough memory for BWT.
    Also, it's relatively slow and requires a lot of memory. The extra 1N for codelengths can be saved by making a weird out-of-order postcoder
    (scan in source order, but use BWT contexts), but you'd need another 5N anyway.

    And btw, the segmentation provided by a simple low-order CM model (like Shkarin's seg_file) seems good enough as it is...

