
Thread: A new, minimally-sorted BWT variant

#1 by nburns (Member, San Diego)

    I've been experimenting with a new invertible BWT-like transform. Like the BWT, it groups together all of the maximal contexts (suffix tree nodes), so it's realized by a suffix tree in the sense of [1]. Unlike the BWT, it doesn't use lexicographic order to determine an ordering among contexts. A total order is derived from two constraints:

    • Every suffix is next to (at least) one of its longest matches (by trailing context)
    • Earlier suffixes come before later suffixes

    The output of the transform is a pair of (transformed string, index), like the BWT. The index gives the position of the last symbol in the transformed string. The first symbol in the input string has null context (it doesn't wrap around), and the null context is placed first, so the first symbol of the input and transform are always the same, and decoding starts at the beginning. The first symbol determines the second context (after the null context), so the second symbol of the input and transform are also always the same.

    Ex. "BANANAS" => ("BANNSAA", 4)
    "ABRACADABRA" => ("ABCDBRRAAAA",
    "QWERTY" => ("QWERTY", 5)

    The transform was inspired by PPM, LZ and other algorithms that compress in the original order. It's also somewhat similar to the Schindler transform, in the sense that that transform falls back on the positional order to break ties, so it's a mix of sort and positional orders. Unlike the ST, this uses unbounded contexts, and prefers the positional order to the sort order. It would be possible to make a bounded-context variant of this, too.

    Overall, this transform compresses very similarly to the BWT, but on some types of data, this beats the BWT by about 1%. That seems to indicate that some data (e.g., source code) benefits slightly from being ordered positionally rather than lexicographically.

    One thing that became clear while I was developing this is that the BWT's ordering makes decoding very simple. To decode this transform, it seems necessary to materialize the suffix tree as you go. It doesn't have the property that symbols in the first and last columns are in the same relative order, and, in fact, I realized that that property is unique to the BWT: having it implies that a transform is completely sorted. So, to branch out significantly beyond the BWT, it's necessary to break that invariant and complicate the decoder. But I believe an unlimited number of new transforms are possible, and this experiment provides an example of how you might write encoders and decoders for them.
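
    For reference, the simple BWT decoder being referred to is the textbook LF-mapping inverse, sketched below; it relies on exactly that first/last-column property and is not a decoder for this transform.

    Code:
    // Textbook inverse of the ordinary BWT via the LF-mapping. It leans on the
    // property discussed above: equal symbols appear in the same relative order
    // in the first and last columns, so one counting pass over the last column
    // is enough to walk the original string backwards. This does NOT invert the
    // minsort transform, whose decoder rebuilds the suffix tree instead.
    #include <iostream>
    #include <string>
    #include <vector>
    
    static std::string ibwt(const std::string& L, size_t idx) {
        const size_t n = L.size();
        std::vector<size_t> occ(256, 0), rank(n);
        for (size_t i = 0; i < n; ++i)
            rank[i] = occ[(unsigned char)L[i]]++;       // # of L[i] seen before i
        std::vector<size_t> C(256, 0);
        for (size_t c = 1; c < 256; ++c)
            C[c] = C[c - 1] + occ[c - 1];               // # of symbols < c in L
        std::string S(n, '\0');
        size_t row = idx;                               // row of the original string
        for (size_t k = n; k-- > 0; ) {                 // fill S from the end
            S[k] = L[row];
            row = C[(unsigned char)L[row]] + rank[row]; // LF-mapping step
        }
        return S;
    }
    
    int main() {
        std::cout << ibwt("BNNSAAA", 3) << "\n";        // prints BANANAS
        return 0;
    }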

    I've attached code and 64-bit Windows binaries.

    1. Paolo Ferragina, Raffaele Giancarlo, Giovanni Manzini, and Marinella Sciortino. 2005. Boosting textual compression in optimal linear time. J. ACM 52, 4 (July 2005), 688-713.
    Attached Files
    Last edited by nburns; 9th July 2013 at 19:56.


#2 by Matt Mahoney (Expert, Melbourne, Florida, USA)
    Tried compiling in 32-bit Windows.
    Code:
    C:\tmp\minsort-transform>make
    make -C minsort_transform
    make[1]: Entering directory `/cygdrive/c/tmp/minsort-transform/minsort_transform'
    make[1]: Nothing to be done for `all'.
    make[1]: Leaving directory `/cygdrive/c/tmp/minsort-transform/minsort_transform'
    
    make -C minsort_rev_transform
    make[1]: Entering directory `/cygdrive/c/tmp/minsort-transform/minsort_rev_transform'
    gcc -O2 -Wall -DNDEBUG   -c -o minsort_rev_transform.o minsort_rev_transform.c
    minsort_rev_transform.c:7:18: fatal error: Judy.h: No such file or directory
     #include <Judy.h>
                      ^
    compilation terminated.
    make[1]: *** [minsort_rev_transform.o] Error 1
    make[1]: Leaving directory `/cygdrive/c/tmp/minsort-transform/minsort_rev_transform'
    make: *** [minsort_rev_transform] Error 2
    The forward transform compiled OK, so I was able to compare the transform with BWT using the same context model (order 0-1 ISSE chain).

    4,067,439 english.dic
    1,186,312 english-s4.3ci1.zpaq
    4,067,439 english.min
    1,139,925 english.min-s4.0ci1.zpaq

    2,988,578 world95.txt
    480,423 world95-s4.3ci1.zpaq
    2,988,578 world95.min
    484,116 world95.min-s4.0ci1.zpaq

    Method s4.3 selects streaming mode, 16 MB block size and BWT. Method s4.0 is the same without BWT. .min is the output of minsort_transform.exe. english.dic (from the Maximum Compression corpus) is a sorted list of English words known to compress poorly with BWT. world95.txt is more uniform, so BWT does better. I also tried book1 and enwik7 (the first 10 MB of enwik8), but both gave output files whose sizes differed from the input, so I didn't test them further.

    768,771 BOOK1
    177,643 book1.min

    10,000,000 enwik7
    10,116,439 enwik7.min

    The files at least look like I would expect a BWT to look. enwik8 ran for about 30 minutes and crashed with no output.


#3 by nburns (Member, San Diego)
    That version of the reverse transform depends on 64-bit Judy arrays. I've attached one that uses C++ maps and is 32-bit clean. It's slower.

    In 32-bit mode, it's likely to run out of memory, and it's generally not very well tested.

    I'll try to reproduce the errors you found.

    Thanks.

    Edit: Here's a link to the output for enwik8: https://www.dropbox.com/s/t46xp9c9yuwaw5h/enwik8min.7z
    It will run out of memory in 32-bit mode for certain.
    Attached Files
    Last edited by nburns; 10th July 2013 at 20:39.

#4 by Matt Mahoney (Expert, Melbourne, Florida, USA)
    Here are some results on enwik8. BWT does a little better than minsort.

    100,000,000 enwik8.min
    21,246,081 enwik8-s7.3ci1.zpaq
    21,267,515 enwik8.min-s7.0ci1.zpaq

    I also read the paper on boosting text compression. I'm not sure how it relates to your algorithm. The paper describes theoretical results for an improved compression algorithm for BWT compression, the first that doesn't use MTF+RLE. (I guess that's true because the paper was written in 2005 and I wrote BBB in 2006). The idea was to partition the BWT output into substrings that share common low-order contexts, then compress each substring using Huffman and run length coding. It considers 4 cases for each substring:

    1. All characters are the same.
    2. All characters except one are the same.
    3. Most probable character has p >= 1/2.
    4. Most probable character has p < 1/2.

    In case 1 it encodes the character. In case 2 it encodes the two characters and the position of the odd one. In case 3 it uses a run length code for the most common character and a Huffman code for the rest. The run length uses an Elias Gamma code. In case 4 it uses a Huffman code. There are also codes to give the length or end marker of each substring and the Huffman tables.
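
    For reference, the Elias gamma code used for those run lengths is easy to sketch; this is just that one piece, not the paper's full coder:

    Code:
    // Elias gamma code used for the run lengths in case 3: for n >= 1, emit
    // floor(log2 n) zero bits, then the binary representation of n (which always
    // starts with a 1 bit). Just this one piece, not the boosting algorithm.
    #include <iostream>
    #include <stdint.h>
    #include <string>
    
    static std::string elias_gamma(uint64_t n) {         // n must be >= 1
        int bits = 0;
        for (uint64_t v = n; v > 1; v >>= 1) ++bits;     // bits = floor(log2 n)
        std::string out(bits, '0');                      // unary length prefix
        for (int i = bits; i >= 0; --i)                  // binary of n, MSB first
            out += ((n >> i) & 1) ? '1' : '0';
        return out;
    }
    
    int main() {
        const uint64_t ns[] = {1, 2, 3, 4, 5, 17};
        for (size_t i = 0; i < sizeof(ns) / sizeof(ns[0]); ++i)
            std::cout << ns[i] << " -> " << elias_gamma(ns[i]) << "\n";  // 1 -> 1, 4 -> 00100, ...
        return 0;
    }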

    The paper gives only theoretical bounds on the compressed size but no experimental results on real benchmarks. It would be interesting to compare with context models like those used in BBB and zpaq. I use an order 0-1 ISSE chain in zpaq and found during development that MTF and RLE made compression worse. BBB uses a complex model with a CM and 6 component ACM (SSE) chain and gets a little better compression than zpaq.

    Anyway, some more BWT results. For zpaq I tried varying-length ISSE chains and a mixer on the end. Times are on a 2.0 GHz T3200. Block size is 100 MB, except for bzip2, which only goes to 0.9 MB.

    29,008,758 enwik8.bz2, 22s
    21,965,953 enwik8-s7.3c.zpaq, 64s (order 0)
    21,246,081 enwik8-s7.3ci1.zpaq, 81s (order 1)
    21,094,125 enwik8-s7.3ci1.1.zpaq, 107s (order 2)
    21,055,449 enwik8-s7.3ci1.1.1.zpaq, 123s (order 3)
    21,058,599 enwik8-s7.3ci1.1.1.1.zpaq, 140s (order 4)
    21,012,895 enwik8-s7.3ci1.1.1.1m.zpaq, 172s (order 4 with mixer on the end)
    20,847,290 enwik8-cfm100.bbb, 286s
    20,825,972 enwik8-b100.bcm, 52s
    20,783,324 enwik8-b100.bsc, 31s
    20,513,744 enwik8-p1.nz, 17s
    Last edited by Matt Mahoney; 11th July 2013 at 02:25. Reason: added bcm, nanozip

#5 by nburns (Member, San Diego)
    The problem with files getting bigger is due to line-ending conversions. I corrected that on the Google Code repo and attached a new tarball.

    Quote Originally Posted by Matt Mahoney View Post
    I also read the paper on boosting text compression. I'm not sure how it relates to your algorithm. The paper describes theoretical results for an improved compression algorithm for BWT compression, the first that doesn't use MTF+RLE. (I guess that's true because the paper was written in 2005 and I wrote BBB in 2006). The idea was to partition the BWT output into substrings that share common low-order contexts, then compress each substring using Huffman and run length coding. It considers 4 cases for each substring:

    1. All characters are the same.
    2. All characters except one are the same.
    3. Most probable character has p >= 1/2.
    4. Most probable character has p < 1/2.

    In case 1 it encodes the character. In case 2 it encodes the two characters and the position of the odd one. In case 3 it uses a run length code for the most common character and a Huffman code for the rest. The run length uses an Elias Gamma code. In case 4 it uses a Huffman code. There are also codes to give the length or end marker of each substring and the Huffman tables.

    The paper gives only theoretical bounds on the compressed size but no experimental results on real benchmarks. It would be interesting to compare with context models like those used in BBB and zpaq. I use an order 0-1 ISSE chain in zpaq and found during development that MTF and RLE made compression worse. BBB uses a complex model with a CM and 6 component ACM (SSE) chain and gets a little better compression than zpaq.
    The paper is kind of old, and only semi-relevant, but they say that their analysis applies to any transform that is realized by a suffix tree, as they put it. This transform fits that definition, so that helps to explain why the compression performance is similar to the BWT.

    Quote Originally Posted by Matt Mahoney View Post
    Anyway, some more BWT results. For zpaq I tried varying-length ISSE chains and a mixer on the end. Times are on a 2.0 GHz T3200. Block size is 100 MB, except for bzip2, which only goes to 0.9 MB.

    29,008,758 enwik8.bz2, 22s
    21,965,953 enwik8-s7.3c.zpaq, 64s (order 0)
    21,246,081 enwik8-s7.3ci1.zpaq, 81s (order 1)
    21,094,125 enwik8-s7.3ci1.1.zpaq, 107s (order 2)
    21,055,449 enwik8-s7.3ci1.1.1.zpaq, 123s (order 3)
    21,058,599 enwik8-s7.3ci1.1.1.1.zpaq, 140s (order 4)
    21,012,895 enwik8-s7.3ci1.1.1.1m.zpaq, 172s (order 4 with mixer on the end)
    20,847,290 enwik8-cfm100.bbb, 286s
    20,783,324 enwik8-b100.bsc, 31s

    XML tends to favor the BWT. Sorting must somehow be useful in that case. A lot of less structured data favors this transform.

    I'd like to try fiddling with the order to see if bigger gains are possible. I have a few ideas to try.
    Attached Files

#6 by Matt Mahoney (Expert, Melbourne, Florida, USA)
    Win32 g++ still can't find Judy.h. But here are some more results comparing minsort and BWT with various ISSE chain lengths with and without mixers. Each group is sorted by size.

    21,010,746 enwik8-s7.3ci1.1.1m.zpaq
    21,012,895 enwik8-s7.3ci1.1.1.1m.zpaq
    21,041,879 enwik8-s7.3ci1.1m.zpaq
    21,055,449 enwik8-s7.3ci1.1.1.zpaq
    21,058,599 enwik8-s7.3ci1.1.1.1.zpaq
    21,094,125 enwik8-s7.3ci1.1.zpaq
    21,160,740 enwik8-s7.3ci1m.zpaq
    21,246,081 enwik8-s7.3ci1.zpaq
    21,965,953 enwik8-s7.3c.zpaq

    21,045,234 enwik8.min-s7.0ci1.1.1.1m.zpaq
    21,045,637 enwik8.min-s7.0ci1.1.1m.zpaq
    21,077,332 enwik8.min-s7.0ci1.1m.zpaq
    21,086,771 enwik8.min-s7.0ci1.1.1.zpaq
    21,090,282 enwik8.min-s7.0ci1.1.1.1.zpaq
    21,122,323 enwik8.min-s7.0ci1.1.zpaq
    21,194,916 enwik8.min-s7.0ci1m.zpaq
    21,267,515 enwik8.min-s7.0ci1.zpaq
    21,978,806 enwik8.min-s7.0c.zpaq

    Edit: wrong file, sorry. minsort_rev_transform.exe in minsort_rev_32bit.zip runs. But Win32 g++ complains it can't find sys/mmap.h
    Last edited by Matt Mahoney; 11th July 2013 at 03:50.

#7 by nburns (Member, San Diego)
    Quote Originally Posted by Matt Mahoney View Post
    Win32 g++ still can't find Judy.h. But here are some more results comparing minsort and BWT with various ISSE chain lengths with and without mixers. Each group is sorted by size.
    It uses Judy arrays. They compiled fine for me under Cygwin; I haven't tried building them for Windows without Cygwin. For mmap, I guess it needs to be built with Cygwin or ported. I've found memory mapping to be fast and convenient, and I think Cygwin correctly translates it to the Windows equivalent. I guess you are not a unix guy. I use a Linux VM even on my Windows PC, and if I didn't have that, I'd use Cygwin. I think it's easier to develop in, and the bash shell is much nicer than the Windows cmd.exe prompt. Normally I try to limit dependencies on nonstandard libraries like Judy, but it was expedient in this case.

    At some point I'll try to refactor the reverse transform to use an O(1) array lookup instead of a Judy array, because it shouldn't need any kind of associative array. There are other algorithmic improvements too, and I think it should end up being O(n). So far my priority has been just making it work and not getting lost in the complexity; putting aside things like portability helps with that. I see the code primarily as a way to demonstrate and prove the concepts, not as the end product per se. I'd like to get a sense of how interesting and useful it is before putting more time into refinements. And I invite anyone to contribute if they want, or to implement it themselves.

    Ultimately, I'd like to understand how much potential there is in improving these permutations. The BWT uses lexicographic ordering because it's simple and elegant, not because it's inherently good for compression, and not because it's a precondition for being invertible, as this conclusively demonstrates. All of the effort so far has gone into dealing with the permutation that the BWT emits, but what if you have direct control over the permutation itself? Maybe it's more efficient to reduce the error than to model it. The better you can make the permutation, the less touching up it will need.

    Quote Originally Posted by Matt Mahoney View Post
    The paper gives only theoretical bounds on the compressed size but no experimental results on real benchmarks. It would be interesting to compare with context models like those used in BBB and zpaq. I use an order 0-1 ISSE chain in zpaq and found during development that MTF and RLE made compression worse. BBB uses a complex model with a CM and 6 component ACM (SSE) chain and gets a little better compression than zpaq.


    I thought the paper was helpful in understanding the compressibility of the BWT when I was still fairly new to it. I don't think the ideas are state of the art. Basically, the idea is to break up the BWT output by contexts using a suffix tree. The potential problem there is that suffix tree nodes are much too granular, so you'd be splitting runs. They show that you can use a dynamic-programming approach to decide which nodes to combine, so that a non-adaptive encoding like Huffman still achieves the global optimum. Other papers published later showed that you can achieve the same benefit more simply with wavelet trees and RLE, and that breaking the BWT up into fixed-size chunks works almost as well as suffix-tree nodes. I think PPM has long been attacking the same problem from another angle.
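
    As a toy illustration of only the RLE half of that simpler approach (the wavelet-tree part is omitted), a run-length pass over a BWT-style string looks like this:

    Code:
    // Trivial run-length view of a BWT-style string, to illustrate only the RLE
    // half of the simpler approach mentioned above (no wavelet tree here).
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>
    
    static std::vector<std::pair<char, size_t> > rle(const std::string& s) {
        std::vector<std::pair<char, size_t> > runs;
        for (size_t i = 0; i < s.size(); ++i) {
            if (!runs.empty() && runs.back().first == s[i])
                ++runs.back().second;                             // extend the current run
            else
                runs.push_back(std::make_pair(s[i], (size_t)1));  // start a new run
        }
        return runs;
    }
    
    int main() {
        std::vector<std::pair<char, size_t> > runs = rle("BNNSAAA");      // BWT of "BANANAS"
        for (size_t i = 0; i < runs.size(); ++i)
            std::cout << runs[i].first << ":" << runs[i].second << " ";   // B:1 N:2 S:1 A:3
        std::cout << "\n";
        return 0;
    }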

    Quote Originally Posted by Matt Mahoney View Post
    21,010,746 enwik8-s7.3ci1.1.1m.zpaq
    21,012,895 enwik8-s7.3ci1.1.1.1m.zpaq
    21,041,879 enwik8-s7.3ci1.1m.zpaq
    21,055,449 enwik8-s7.3ci1.1.1.zpaq
    21,058,599 enwik8-s7.3ci1.1.1.1.zpaq
    21,094,125 enwik8-s7.3ci1.1.zpaq
    21,160,740 enwik8-s7.3ci1m.zpaq
    21,246,081 enwik8-s7.3ci1.zpaq
    21,965,953 enwik8-s7.3c.zpaq

    21,045,234 enwik8.min-s7.0ci1.1.1.1m.zpaq
    21,045,637 enwik8.min-s7.0ci1.1.1m.zpaq
    21,077,332 enwik8.min-s7.0ci1.1m.zpaq
    21,086,771 enwik8.min-s7.0ci1.1.1.zpaq
    21,090,282 enwik8.min-s7.0ci1.1.1.1.zpaq
    21,122,323 enwik8.min-s7.0ci1.1.zpaq
    21,194,916 enwik8.min-s7.0ci1m.zpaq
    21,267,515 enwik8.min-s7.0ci1.zpaq
    21,978,806 enwik8.min-s7.0c.zpaq

    I doubt there is anything that can be done on enwik8 that will make it outperform the BWT. On various kinds of data it sometimes just plain beats the BWT, but not on enwik8.
    Last edited by nburns; 11th July 2013 at 06:28.

#8 by Matt Mahoney (Expert, Melbourne, Florida, USA)
    Actually I have a Windows and Linux machine, but I tend to do more of my work in Windows.

    Anyway, I did some more experiments with english.dic, which is known to compress poorly with BWT because it is sorted. I compared BWT, minsort, and no transform at all, modeling with ICM-ISSE chains of orders 0 through 5. Minsort does better than BWT at all orders, but using neither transform (for order 1 and higher) works best of all. Method s4.3 selects streaming mode, 16 MB blocks and BWT; s4.0 is the same without BWT. c selects an order 0 ICM, and i1.1... is an ISSE chain with orders 1, 2...

    Code:
    BWT
    4,067,439 english.dic
    1,227,708 english.dic-s4.3c.zpaq
    1,186,312 english.dic-s4.3ci1.zpaq
    1,176,767 english.dic-s4.3ci1.1.zpaq
    1,175,797 english.dic-s4.3ci1.1.1.zpaq
    1,177,630 english.dic-s4.3ci1.1.1.1.zpaq
    1,180,554 english.dic-s4.3ci1.1.1.1.1.zpaq
    
    minsort
    4,067,439 english.min
    1,197,049 english-min-s4.0c.zpaq
    1,135,283 english-min-s4.0ci1.zpaq
    1,121,638 english-min-s4.0ci1.1.zpaq
    1,118,581 english-min-s4.0ci1.1.1.zpaq
    1,120,034 english-min-s4.0ci1.1.1.1.zpaq
    1,123,118 english-min-s4.0ci1.1.1.1.1.zpaq
    
    no transform
    4,067,439 english.dic
    1,938,272 english.dic-s4.0c.zpaq
      991,423 english.dic-s4.0ci1.zpaq
      705,696 english.dic-s4.0ci1.1.zpaq
      599,960 english.dic-s4.0ci1.1.1.zpaq
      538,241 english.dic-s4.0ci1.1.1.1.zpaq
      505,578 english.dic-s4.0ci1.1.1.1.1.zpaq
    Edit: another example, this time on a tar file of the Calgary corpus. Again, minsort outperforms BWT. We know that BWT does poorly on a mix of different data types; you could get better BWT compression by compressing each file separately (or at least the binary files separately). In this case, BWT and minsort compress better than direct modeling at order 3 and below, but direct modeling at orders 4 and higher gives the best results.

    Code:
    BWT
    3,152,896 calgary.tar
      844,364 calgary.tar-s4.3c.zpaq
      800,525 calgary.tar-s4.3ci1.zpaq
      795,718 calgary.tar-s4.3ci1.1.zpaq
      797,000 calgary.tar-s4.3ci1.1.1.zpaq
      800,074 calgary.tar-s4.3ci1.1.1.1.zpaq
      803,725 calgary.tar-s4.3ci1.1.1.1.1.zpaq
    
    minsort
    3,152,896 calgary.min
      843,870 calgary.min-s4.0c.zpaq
      798,148 calgary.min-s4.0ci1.zpaq
      792,957 calgary.min-s4.0ci1.1.zpaq
      794,035 calgary.min-s4.0ci1.1.1.zpaq
      797,014 calgary.min-s4.0ci1.1.1.1.zpaq
      800,685 calgary.min-s4.0ci1.1.1.1.1.zpaq
    
    no transform
    3,152,896 calgary.tar
    1,717,067 calgary.tar-s4.0c.zpaq
    1,274,254 calgary.tar-s4.0ci1.zpaq
      999,891 calgary.tar-s4.0ci1.1.zpaq
      834,849 calgary.tar-s4.0ci1.1.1.zpaq
      765,286 calgary.tar-s4.0ci1.1.1.1.zpaq
      743,027 calgary.tar-s4.0ci1.1.1.1.1.zpaq
    Last edited by Matt Mahoney; 11th July 2013 at 23:23.

#9 by nburns (Member, San Diego)
    Quote Originally Posted by Matt Mahoney View Post
    Edit: another example, this time on a tar file of the Calgary corpus. Again, minsort outperforms BWT. We know that BWT does poorly on a mix of different data types. You could get better BWT compression by compressing each file separately (or at least the binary files separately). In this case, BWT and minsort compresses better than direct modeling at order 3 and below, but direct modeling at orders 4 and higher gives the best results.
    In my testing, using less sophisticated models, I've seen differences from the BWT of up to 1%. This may indicate a hard limit, as keeping all of the nodes of the suffix tree together places constraints on how far the order can deviate.

    I tried changing the order a bit by eliminating all nodes with fewer than 3 suffixes, and the transform remained decodable, but compression was always slightly worse, across a variety of different kinds of data. So, pruning nodes is one possible way to change the order. Another way that should be possible is splitting or prematurely terminating nodes, if they look like they are no longer useful. That could help in the case of mixed data types.
