
Thread: BWT/ST compression

  1. #1
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts

    Well, I have an idea for my own BWT/ST-based file compressor. In brief, I want to use some standard/well-known sorting with a large enough block (16-64 MB), plus my own CM as the output encoder.
    That said, BWT is very new to me, and I know we have some real BWT specialists here. The question is: which sorting method is the best so far - i.e. stable enough on very redundant files a la PIC, fp.log, ...?
    Plus, post your own thoughts...


  2. #2 encode
    ST seems to be slower at decompression...

    Well, actually the basic BWT idea is quite simple:

    1. define your own "compare" function;

    2. initialize a pointer structure with ptr[i] = i;

    3. run sort() or qsort() or ...;

    4. store the block size and the "min_element" (primary index) in the stream;

    5. encode the transformed block.
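    The recipe above can be sketched in C++ as a naive illustration only: it compares whole rotations, so it is hopelessly slow on redundant data, but it shows the transform and where the "min_element"/primary index comes from:

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Naive BWT of one block: sort the indices of all rotations, output the
// character preceding each sorted rotation (the last matrix column), and
// record where the unrotated string ended up (the "min_element"/primary
// index that must be stored in the stream).
std::pair<std::string, size_t> bwt(const std::string& s) {
    size_t n = s.size();
    std::vector<size_t> ptr(n);
    std::iota(ptr.begin(), ptr.end(), 0);          // ptr[i] = i
    std::sort(ptr.begin(), ptr.end(), [&](size_t a, size_t b) {
        for (size_t k = 0; k < n; ++k) {           // compare rotations a, b
            unsigned char ca = s[(a + k) % n], cb = s[(b + k) % n];
            if (ca != cb) return ca < cb;
        }
        return a < b;                              // identical rotations
    });
    std::string last(n, 0);
    size_t primary = 0;
    for (size_t i = 0; i < n; ++i) {
        if (ptr[i] == 0) primary = i;              // row of the original string
        last[i] = s[(ptr[i] + n - 1) % n];
    }
    return {last, primary};
}
```

    For "banana" this yields the last column "nnbaaa" with primary index 3. A real compressor replaces the generic comparison sort with a proper suffix-sorting algorithm, which is exactly what the rest of the thread is about.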

  3. #3
    Member
    Join Date
    May 2008
    Location
    France
    Posts
    48
    Thanks
    1
    Thanked 1 Time in 1 Post
    Quote Originally Posted by encode View Post
    Well, actually the basic BWT idea is quite simple:
    define own "compare" function
    initialize pointer structure; each ptr[i]=i
    do sort() or qsort() or ...
    store in stream block size, "min_element"
    encode transformed block
    Usually, an MTF step after sorting a block gives better compression when using a statistical method (say, an arithmetic coder) for the encoding. I found that radix sort (as suggested by Burrows and Wheeler) is often used instead of quicksort, but the poor implementation in BigCrunch didn't show improvements when changing the algorithm.

    I've never heard of ST (Schindler Transform?) before, so if you plan to give more details, I'd be happy to read more.
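    For reference, the MTF step mentioned above can be sketched like this (a straightforward illustration, not tuned code):

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Move-to-front: each symbol is replaced by its current rank in a
// recency list, so the long runs typical of BWT output become runs of
// zeros that an order-0 statistical coder handles well.
std::vector<uint8_t> mtf(const std::vector<uint8_t>& in) {
    std::vector<uint8_t> table(256);
    std::iota(table.begin(), table.end(), 0);   // table[i] = i
    std::vector<uint8_t> out;
    out.reserve(in.size());
    for (uint8_t c : in) {
        auto it = std::find(table.begin(), table.end(), c);
        uint8_t rank = static_cast<uint8_t>(it - table.begin());
        out.push_back(rank);
        table.erase(it);                        // move symbol to the front
        table.insert(table.begin(), c);
    }
    return out;
}
```

    For example, the bytes of "aaaba" become the ranks 97, 0, 0, 98, 1 - runs collapse into zeros after the first occurrence.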

  4. #4 encode
    Google for "Schindler Transform".

    Actually, here we're talking about modern BWT output encoding - i.e. instead of using plain MTF or the like, use a modern CM model specially tuned for BWT output. Search through the forum for Shelwien's posts...

    An additional question is about the simplicity of the BWT stage... I want things as simple as possible!

  6. #6 encode
    Very briefly tested encoding of the BWT output (using Shelwien's BWT calculator). No MTF. An order-1 model provides better results than order-0.
    Maybe one of the reasons is:

    if the order-1 context is 'a', then another 'a' is more probable.

    Because BWT output contains long character repetitions a la aaaaaaaabaaaa.

    Maybe such non-zero-order encoding is better than MTF plus an order-0 coder.

    Well, to find out I must write a simple BWT first...
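    The intuition above can be checked numerically. The sketch below (an illustration using empirical frequencies, not an adaptive coder) computes ideal order-0 and order-1 code lengths; on run-heavy strings like the example, the order-1 figure comes out lower:

```cpp
#include <cmath>
#include <map>
#include <string>

// Ideal code length in bits under an order-0 model (symbol frequencies).
double order0_bits(const std::string& s) {
    std::map<char, int> cnt;
    for (char c : s) ++cnt[c];
    double bits = 0;
    for (auto& [c, n] : cnt) bits -= n * std::log2((double)n / s.size());
    return bits;
}

// Ideal code length under an order-1 model (frequencies conditioned on
// the previous character); the first symbol is ignored for simplicity.
double order1_bits(const std::string& s) {
    std::map<char, std::map<char, int>> cnt;
    std::map<char, int> ctx;
    for (size_t i = 1; i < s.size(); ++i) {
        ++cnt[s[i - 1]][s[i]];
        ++ctx[s[i - 1]];
    }
    double bits = 0;
    for (auto& [p, m] : cnt)
        for (auto& [c, n] : m) bits -= n * std::log2((double)n / ctx[p]);
    return bits;
}
```

    On "aaaaaaaabaaaa" the order-1 count is smaller, matching the observation that after an 'a' another 'a' is much more probable.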

  7. #7
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    To my surprise, it is.

    But the contexts producing good compression should be different compared to a normal order-N model, since you deal with a sequence of characters under some very long context (in analogy to bit histories).

  8. #8 encode
    Yep, we may use a few dynamically mixed models - say order-0, order-1, and some models with tricky contexts, a la the higher bits of the previous character, etc.
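    As an illustration of what "dynamically mixed" can mean, here is a minimal two-model linear mixer with a single adaptive weight. The learning rate 0.02 is an arbitrary value for the sketch, not anything from this thread; real CM engines typically use logistic mixing over many inputs:

```cpp
// Minimal adaptive linear mixing of two bit-probability models: the
// weight drifts toward whichever model predicts the observed bits
// better. An illustrative sketch, not a production CM mixer.
struct Mixer2 {
    double w = 0.5;                          // weight of model 1
    double mix(double p1, double p2) const { return w * p1 + (1 - w) * p2; }
    void update(double p1, double p2, int bit) {
        double p = mix(p1, p2);
        w += 0.02 * (bit - p) * (p1 - p2);   // gradient step on coding error
        if (w < 0) w = 0;
        if (w > 1) w = 1;                    // keep the mix a valid probability
    }
};
```

    If model 1 keeps predicting the observed bits well and model 2 stays at 0.5, the weight converges toward model 1.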

  9. #9 encode
    BTW, Shelwien's BWT calculator is SLOOOW... It seems that files like PIC freeze the program!

  10. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    > ST seems to be slower at decompression...

    That's implementation-dependent. At least my inverse ST2 is surely
    faster than the usual inverse BWT.

    > Well, actually the basic BWT idea is quite simple:

    I wouldn't call that the "BWT idea". It's just an implementation,
    and it doesn't explain anything.
    Well, I'd describe BWT like this:
    1. By sorting the string shifts (= contexts), we acquire, as the last
    column of the sorted shift matrix, a sequence of symbols similar to
    context histories in its statistical properties.
    (Note that using normal lexicographic order for sorting means
    that we work with the context histories of _right_ contexts;
    it's similar to processing the file with a CM in reverse order.)
    2. This last column includes all the source symbols, so to transform
    it back into the source string we only need to know the permutation.
    3. If we sort the symbols in the last column, it turns into the first
    column - and this is enough to compute the necessary permutation when
    a lexicographic sort is used.
    Another explanation: having the first and last columns of the shift
    matrix, we actually have all the symbol digrams from the source, which
    can be sorted to restore the 2nd column; then we'd have all trigrams,
    etc.
    4. There are a lot of similar context-separating bijective
    transformations, but not all of them allow a convenient reverse
    transformation.
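    Point 3 can be sketched in C++ as the usual counting-sort inverse transform (an illustration; `primary` is the stored index of the row holding the unrotated string):

```cpp
#include <string>
#include <vector>

// Inverse BWT via the first/last-column argument above: a counting sort
// of the last column yields the first column, and matching occurrences
// gives next[], the permutation that walks the original string forward.
std::string ibwt(const std::string& last, size_t primary) {
    size_t n = last.size();
    std::vector<size_t> count(256, 0), start(256, 0), next(n);
    for (unsigned char c : last) ++count[c];
    for (size_t c = 1; c < 256; ++c) start[c] = start[c - 1] + count[c - 1];
    for (size_t i = 0; i < n; ++i)
        next[start[(unsigned char)last[i]]++] = i;  // j-th smallest lives at row i
    std::string out(n, 0);
    size_t p = next[primary];
    for (size_t i = 0; i < n; ++i) {
        out[i] = last[p];                           // char preceding row p's start
        p = next[p];                                // step one position forward
    }
    return out;
}
```

    Note the cache-hostile pointer chasing through next[] - this random walk over the whole block is why inverse BWT on large blocks is slow.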

    > define own "compare" function

    That would be much faster with STL.
    Though not practical for a compressor anyway.

    > do sort() or qsort() or ...

    That would hang forever on a file with long matches.

    > The question is, what sorting method is the best so far
    > i.e. stable enough on very redundant files

    It doesn't seem like there's a single best method.
    All BWT compressor developers just invent their own tricks.
    There's also a lot of interesting articles about that.
    I can only say that qsort is not the best choice for sure -
    fast implementations afaik use radix sort as a first stage.

  11. #11 Shelwien
    Here's another "BWT calculator" for you, a more convenient one:
    http://shelwien.googlepages.com/DC099307.rar
    if you run it like "dc -d -fcpretodxlb README.TXT", its output will be
    a simple BWT, but it also allows you to play with a lot of different
    options.

  12. #12 encode
    Quote Originally Posted by Shelwien View Post
    Here's another "BWT calculator" for you, a more convenient one:
    http://shelwien.googlepages.com/DC099307.rar
    if you run it like "dc -d -fcpretodxlb README.TXT" its output would
    be simple BWT, but it also allows to play with a lot of different
    options.
    Thanks!
    Also I will take a look at BBB... It is simple, and that is OK...

  13. #13 encode
    VadimV, please join the discussion!

  14. #14
    Expert

    Join Date
    May 2008
    Location
    Moscow
    Posts
    5
    Thanks
    5
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by encode View Post
    VadimV, please join the discussion!
    Well

    It seems there are three fastest modern sorting algorithms using a similar (almost the same) "divide-and-conquer" technique: msufsort (Michael Maniscalco), itssort (Yuta Mori) and archon/dark (Dima Malyshev). These algorithms differ mainly in the way they handle redundancy (anchors in archon, or the inverse suffix array for trapping repetitions in msufsort, for example). Some other recent linear methods (Ko & Aluru, Kärkkäinen & Sanders, Nong, Zhang & Chan) are noticeably slower on most common files.

    It is worth mentioning a very cache-effective merge sort (Edgar Binder), but it drastically slows down on long repetitions. And so on...

    As for compression, personally I like 3 different methods, each for its own property:

    1) Direct coding (dgca, uhbc). BTW, dgca uses an order-2 model, and it is quite enough on most files.
    2) Distance coding (dc, ybs, bssc, dark). I think it is the most effective method on text files, and its potential is not exhausted yet (both in compression and speed).
    3) WFC (grzip, uhbc), as the best of the "list-update" family.

    M03 and CM are too slow for my taste.

  15. #15 VadimV
    Quote Originally Posted by Shelwien View Post
    At least my inverse ST2 is surely faster than usual inverse BWT.
    Sure. Due to the fundamental inability to avoid cache misses in inverse BWT on large blocks.

  16. #16 encode
    BTW, do we have a simplest/smallest BWT implementation, a la the one posted at the compression.ru forum? I'm just a fan of simple things...

    And what do you think about BBB's sorting (5N fast mode)? It looks simple enough and works really well for me!

  17. #17 VadimV
    Quote Originally Posted by encode View Post
    BTW, do we have a simplest/smallest BWT implementations a la that one posted at compression.ru forum? I'm just a fan of simple things...
    Deep shallow?
    http://roquefort.di.unipi.it/~ferrax...ixArray/ds.tgz

    Quote Originally Posted by encode View Post
    And what do you think about BBB's sorting (5N fast mode)? - It looks simple enough and works really fine for me!
    AFAIK, MS stable_sort() uses a merge sort algorithm instead of the quicksort used in sort(): O(N log N) in the best case, and O(N (log N)^2) in the worst case (when no extra memory is available).

  18. #18 encode
    Quote Originally Posted by VadimV View Post
    M03 and CM are too slow in my taste
    Well, I have just started writing my own BWT-based compressor. It will use stable_sort() and my own fast CM. I hope I can prove that CM can be both fast and strong at compression.

  19. #19 VadimV
    Quote Originally Posted by encode View Post
    Well, I just started writing my own BWT-based compressor. It will use stable_sort() and my own Fast CM. I hope I can prove that CM can be fast and strong at compression.
    Simple CM is the same as direct coding.
    My mistake earlier - gca uses only an order-1 model. And my own experiments with order-2 gained about 1% in compression. With context mixing, of course.

    BTW, a substantial speed question is how to encode long runs... For example, a distance coder doesn't need that at all. And any form of RLE inevitably worsens compression.

  20. #20 encode
    BTW, the BWT layer of my new compressor is already done and works perfectly. It might be SLOW at compression, but it is very fast at decompression!

    My weapon is my CM encoder - I have some very cool unreleased things which I may use here...

  21. #21 encode

    A draft version of my compressor is done; I hope it will be released soon - I just need more testing/tuning/experimenting...

    Well, it looks the way a BWT compressor should - it is *VERY* efficient on text files and BMP pictures. It is not really efficient on binary data, and in some extreme cases it may just freeze due to unstable sorting. No filters added. The block size is fixed at 32 MB. Anyway, my goal is to show BWT+CM performance.

    The results of *VERY DRAFT* version written today:

    calgary.tar -> 804,310 bytes (785 KB)
    canterbury.tar -> 471,590 bytes (460 KB)

    book1 -> 216,230 bytes (211 KB)

    world95.txt -> 483,195 bytes (471 KB)
    fp.log -> 566,853 bytes (553 KB)
    rafale.bmp -> 762,686 bytes (744 KB)
    a10.jpg -> 831,093 bytes (811 KB)

    ENWIK8 -> 22,495,936 bytes (21.4 MB)
    ENWIK9 -> 196,323,970 bytes (187 MB)

    Huh, finally I wrote my own BWT...


  22. #22 VadimV
    Quote Originally Posted by encode View Post
    Huh, finally I wrote my own BWT...
    Congrats! A very good start!

    Quote Originally Posted by encode View Post
    book1 -> 216,230 bytes (211 KB)
    Results from other BWT compressors with CM (i.e. direct coding):

    uhbc 1.0: 224,584 (0:33)
    dgcac 1.10: 218,480 (0:44)
    gca 0.90k: 217,757 (0:3
    my experimental order-2 coder: 214,591 (0:50)
    bbb: 213,162 (2:31)
    Last edited by VadimV; 19th June 2008 at 15:14.

  23. #23 encode
    Did some experiments. Well, BWT-output encoding is not as simple a task as it looks at first sight.

    Simple things definitely do not work here.

    MTF is far worse than a smart CM model.

    Currently, I'm still experimenting with different models and adaptive mixing. The results became slightly better in most cases.

    To be continued...

  24. #24 encode
    book1 -> 214,920 bytes

    calgary.tar -> 798,857 bytes

    BTW, my compressor in some cases beats even BBB, being *MUCH* faster!

    Anyway, development in progress...

  25. #25 encode
    BTW, std::stable_sort() needs about 2N additionally, which means the final memory usage will be 5N + 2N = 7N...

  26. #26 encode

    Step by step, the compressor becomes more and more interesting. The current version beats BBB on calgary.tar and on many other files. I think it might become my flagship compression engine, instead of BALZ. The CM back-end inside is crazy - five models mixed with four dynamic context-sensitive mixers - and at the same time this heavy construction works relatively fast, at least much faster than BBB...

    Well, another thing I want to try, or even add to the compressor, is automatic block sizing.

    BWT performs really badly on mixed data - for example, compressing each SFC file separately gives better results than compressing a single sfc.7z with all the files. The opposite holds for BALZ - the compressed size of sfc.7z is smaller than if we compress each file separately.

    I think we may somehow determine each block's size and compress it independently...

  27. #27 toffer
    (Large-dictionary) LZ matching doesn't depend (that drastically, for RO) on a local context - that's why merging the whole source set together almost always helps (it should fit into the window).

    To help context-related algorithms in such a case you need segmentation - is that what you meant? Lowering the BWT block size will help too, but this is a kind of "heuristic" segmentation. It would be better to identify segments based on their statistics and compress each segment independently of the others.

  28. #28 encode
    Quote Originally Posted by toffer View Post
    To help context related algorithms in such a case you need segmentation, is that what you meant?
    Correct.

    Well, currently I'm experimenting with an LZP heuristic. I keep a table of characters accessed via a hash, and a threshold. If the number of false predictions grows larger than some defined threshold, we cut the block at that point, like with a knife. Such a thing works! I'm just experimenting with the threshold and with the context for the table.
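    A rough sketch of this kind of cut detector follows. All constants - table size, window length, threshold, the order-4 byte context - are invented for the illustration, not taken from the post:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// LZP-style segmentation heuristic: predict each byte from a hashed
// context table and count false predictions per window; a window where
// predictions mostly fail marks a statistics change, so we cut there.
std::vector<size_t> find_cuts(const std::vector<uint8_t>& data) {
    const int TABLE_BITS = 16;
    const size_t WINDOW = 4096;        // bytes per miss-counting window
    const size_t THRESHOLD = 3000;     // cut when predictions mostly fail
    std::vector<uint8_t> table(1u << TABLE_BITS, 0);
    std::vector<size_t> cuts;
    uint32_t ctx = 0;                  // last 4 bytes as the context
    size_t misses = 0, window_start = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        uint32_t h = (ctx * 2654435761u) >> (32 - TABLE_BITS);
        if (table[h] != data[i]) ++misses;   // false prediction
        table[h] = data[i];
        ctx = (ctx << 8) | data[i];
        if (i - window_start + 1 == WINDOW) {
            if (misses > THRESHOLD) cuts.push_back(i + 1);
            misses = 0;
            window_start = i + 1;
        }
    }
    return cuts;
}
```

    Uniform data produces no cuts, while a window of essentially unpredictable bytes gets a cut point at its end.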

  29. #29 toffer
    Something like this could be used to roughly detect a local change in the statistics. But if you want a good "identification rate" you need higher LZP orders; the drawback is that those appear infrequently, hence the detection will happen too late.

    You should collect statistics under an order-2/3 context (just a rough guess, to find a compromise between precision and dead time) and join blocks with similar distributions. To measure similarity one could use a scalar product.

    Just ideas - that's how I would do it...
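    The scalar-product idea can be sketched with plain byte histograms (a simplification: the post suggests statistics collected under an order-2/3 context, while this sketch uses order-0 counts):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Compare two blocks by the normalized scalar (cosine) product of their
// byte histograms: values near 1 mean similar statistics, near 0 mean
// the blocks use disjoint symbol sets.
double similarity(const std::vector<uint8_t>& a, const std::vector<uint8_t>& b) {
    std::vector<double> ha(256, 0), hb(256, 0);
    for (uint8_t c : a) ha[c] += 1;
    for (uint8_t c : b) hb[c] += 1;
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < 256; ++i) {
        dot += ha[i] * hb[i];
        na += ha[i] * ha[i];
        nb += hb[i] * hb[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12);  // avoid 0/0
}
```

    Blocks scoring near 1 would be joined into one segment; a low score suggests a segment boundary between them.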

  30. #30 encode
    Ideally we should not do segmentation on text files, while at the same time we should find and separate different files/data inside the input stream - for example, different resources inside an EXE file...
