# Thread: A new, minimally-sorted BWT variant

1. ## A new, minimally-sorted BWT variant

I've been experimenting with a new invertible BWT-like transform. Like the BWT, it groups together all of the maximal contexts (suffix tree nodes), so it's realized by a suffix tree in the sense of [1]. Unlike the BWT, it doesn't use lexicographic order to determine an ordering among contexts. A total order is derived from two constraints:

• Every suffix is next to (at least) one of its longest matches (by trailing context)
• Earlier suffixes come before later suffixes

The output of the transform is a pair of (transformed string, index), like the BWT. The index gives the position of the last symbol in the transformed string. The first symbol in the input string has null context (it doesn't wrap around), and the null context is placed first, so the first symbol of the input and transform are always the same, and decoding starts at the beginning. The first symbol determines the second context (after the null context), so the second symbol of the input and transform are also always the same.

Ex. "BANANAS" => ("BANNSAA", 4)
"QWERTY" => ("QWERTY", 5)

The transform was inspired by PPM, LZ, and other algorithms that compress in the original order. It's also somewhat similar to the Schindler transform (ST), in the sense that the ST falls back on positional order to break ties, so it's a mix of sorted and positional orders. Unlike the ST, this transform uses unbounded contexts and prefers the positional order to the sort order. It would be possible to make a bounded-context variant of this, too.

Overall, this transform compresses very similarly to the BWT, but on some types of data, this beats the BWT by about 1%. That seems to indicate that some data (e.g., source code) benefits slightly from being ordered positionally rather than lexicographically.

One thing that became clear while I was developing this is that the BWT's ordering makes decoding very simple. To decode this transform, it seems necessary to materialize the suffix tree as you go. It doesn't have the property that symbols in the first and last columns appear in the same relative order; in fact, I realized that that property is unique to the BWT, because having it implies that a transform is completely sorted. So, to branch out significantly beyond the BWT, it's necessary to break that invariant and complicate the decoder. But I believe an unlimited number of new transforms are possible, and this experiment provides an example of how you might write encoders and decoders for them.

I've attached code and 64-bit Windows binaries.

1. Paolo Ferragina, Raffaele Giancarlo, Giovanni Manzini, and Marinella Sciortino. 2005. Boosting textual compression in optimal linear time. J. ACM 52, 4 (July 2005), 688-713.

2. ## Thanks (3):

Bulat Ziganshin (10th July 2013), Matt Mahoney (10th July 2013), willvarfar (10th July 2013)

3. Tried compiling in 32-bit Windows.
Code:
```
C:\tmp\minsort-transform>make
make -C minsort_transform
make[1]: Entering directory `/cygdrive/c/tmp/minsort-transform/minsort_transform'
make[1]: Nothing to be done for `all'.
make[1]: Leaving directory `/cygdrive/c/tmp/minsort-transform/minsort_transform'

make -C minsort_rev_transform
make[1]: Entering directory `/cygdrive/c/tmp/minsort-transform/minsort_rev_transform'
gcc -O2 -Wall -DNDEBUG   -c -o minsort_rev_transform.o minsort_rev_transform.c
minsort_rev_transform.c:7:18: fatal error: Judy.h: No such file or directory
#include <Judy.h>
                 ^
compilation terminated.
make[1]: *** [minsort_rev_transform.o] Error 1
make[1]: Leaving directory `/cygdrive/c/tmp/minsort-transform/minsort_rev_transform'
make: *** [minsort_rev_transform] Error 2
```
The forward transform compiled OK, so I was able to compare the transform with BWT using the same context model (order 0-1 ISSE chain).

4,067,439 english.dic
1,186,312 english-s4.3ci1.zpaq
4,067,439 english.min
1,139,925 english.min-s4.0ci1.zpaq

2,988,578 world95.txt
480,423 world95-s4.3ci1.zpaq
2,988,578 world95.min
484,116 world95.min-s4.0ci1.zpaq

Method s4.3 selects streaming mode, a 16 MB block size, and BWT. Method s4.0 is the same without BWT. .min is the output of minsort_transform.exe. english.dic (from the Maximum Compression corpus) is a sorted list of English words known to compress poorly with BWT. world95.txt is more uniform, so BWT does better. I also tried book1 and enwik7 (the first 10 MB of enwik), but both gave output files whose sizes differed from the input, so I didn't test them further.

768,771 BOOK1
177,643 book1.min

10,000,000 enwik7
10,116,439 enwik7.min

The files at least look like I would expect a BWT to look. enwik8 ran for about 30 minutes and crashed with no output.

4. ## Thanks:

nburns (11th July 2013)

5. That version of the reverse transform depends on 64-bit Judy arrays. I've attached one that uses C++ maps and is 32-bit clean. It's slower.

In 32-bit mode, it's likely to run out of memory, and it's generally not very well tested.

I'll try to reproduce the errors you found.

Thanks.

Edit: Here's a link to the output for enwik8: https://www.dropbox.com/s/t46xp9c9yuwaw5h/enwik8min.7z
It will run out of memory in 32-bit mode for certain.

6. Here are some results on enwik8. BWT does a little better than minsort.

100,000,000 enwik8.min
21,246,081 enwik8-s7.3ci1.zpaq
21,267,515 enwik8.min-s7.0ci1.zpaq

I also read the paper on boosting text compression. I'm not sure how it relates to your algorithm. The paper describes theoretical results for an improved compression algorithm for BWT compression, the first that doesn't use MTF+RLE. (I guess that's true because the paper was written in 2005 and I wrote BBB in 2006). The idea was to partition the BWT output into substrings that share common low-order contexts, then compress each substring using Huffman and run length coding. It considers 4 cases for each substring:

1. All characters are the same.
2. All characters except one are the same.
3. Most probable character has p >= 1/2.
4. Most probable character has p < 1/2.

In case 1 it encodes the character. In case 2 it encodes the two characters and the position of the odd one. In case 3 it uses a run length code for the most common character and a Huffman code for the rest. The run length uses an Elias Gamma code. In case 4 it uses a Huffman code. There are also codes to give the length or end marker of each substring and the Huffman tables.

The paper gives only theoretical bounds on the compressed size but no experimental results on real benchmarks. It would be interesting to compare with context models like those used in BBB and zpaq. I use an order 0-1 ISSE chain in zpaq and found during development that MTF and RLE made compression worse. BBB uses a complex model with a CM and 6 component ACM (SSE) chain and gets a little better compression than zpaq.

Anyway, some more BWT results. For zpaq I tried varying-length ISSE chains and a mixer on the end. Times are on a 2.0 GHz T3200. Block size is 100 MB, except for bzip2, which only goes to 0.9 MB.

29,008,758 enwik8.bz2, 22s
21,965,953 enwik8-s7.3c.zpaq, 64s (order 0)
21,246,081 enwik8-s7.3ci1.zpaq, 81s (order 1)
21,094,125 enwik8-s7.3ci1.1.zpaq, 107s (order 2)
21,055,449 enwik8-s7.3ci1.1.1.zpaq, 123s (order 3)
21,058,599 enwik8-s7.3ci1.1.1.1.zpaq, 140s (order 4)
21,012,895 enwik8-s7.3ci1.1.1.1m.zpaq, 172s (order 4 with mixer on the end)
20,847,290 enwik8-cfm100.bbb, 286s
20,825,972 enwik8-b100.bcm, 52s
20,783,324 enwik8-b100.bsc, 31s
20,513,744 enwik8-p1.nz, 17s

7. The problem with files getting bigger is due to line-ending conversions. I corrected that in the Google Code repo and attached a new tarball.

Originally Posted by Matt Mahoney
I also read the paper on boosting text compression. I'm not sure how it relates to your algorithm. The paper describes theoretical results for an improved compression algorithm for BWT compression, the first that doesn't use MTF+RLE. (I guess that's true because the paper was written in 2005 and I wrote BBB in 2006). The idea was to partition the BWT output into substrings that share common low-order contexts, then compress each substring using Huffman and run length coding. It considers 4 cases for each substring:

1. All characters are the same.
2. All characters except one are the same.
3. Most probable character has p >= 1/2.
4. Most probable character has p < 1/2.

In case 1 it encodes the character. In case 2 it encodes the two characters and the position of the odd one. In case 3 it uses a run length code for the most common character and a Huffman code for the rest. The run length uses an Elias Gamma code. In case 4 it uses a Huffman code. There are also codes to give the length or end marker of each substring and the Huffman tables.

The paper gives only theoretical bounds on the compressed size but no experimental results on real benchmarks. It would be interesting to compare with context models like those used in BBB and zpaq. I use an order 0-1 ISSE chain in zpaq and found during development that MTF and RLE made compression worse. BBB uses a complex model with a CM and 6 component ACM (SSE) chain and gets a little better compression than zpaq.
The paper is kind of old, and only semi-relevant, but they say that their analysis applies to any transform that is realized by a suffix tree, as they put it. This transform fits that definition, so that helps to explain why the compression performance is similar to the BWT.

Originally Posted by Matt Mahoney
Anyway, some more BWT results. For zpaq I tried varying-length ISSE chains and a mixer on the end. Times are on a 2.0 GHz T3200. Block size is 100 MB, except for bzip2, which only goes to 0.9 MB.

29,008,758 enwik8.bz2, 22s
21,965,953 enwik8-s7.3c.zpaq, 64s (order 0)
21,246,081 enwik8-s7.3ci1.zpaq, 81s (order 1)
21,094,125 enwik8-s7.3ci1.1.zpaq, 107s (order 2)
21,055,449 enwik8-s7.3ci1.1.1.zpaq, 123s (order 3)
21,058,599 enwik8-s7.3ci1.1.1.1.zpaq, 140s (order 4)
21,012,895 enwik8-s7.3ci1.1.1.1m.zpaq, 172s (order 4 with mixer on the end)
20,847,290 enwik8-cfm100.bbb, 286s
20,783,324 enwik8-b100.bsc, 31s
XML tends to favor the BWT. Sorting must somehow be useful in that case. A lot of less structured data favors this transform.

I'd like to try fiddling with the order to see if bigger gains are possible. I have a few ideas to try.

8. Win32 g++ still can't find Judy.h. But here are some more results comparing minsort and BWT with various ISSE chain lengths with and without mixers. Each group is sorted by size.

21,010,746 enwik8-s7.3ci1.1.1m.zpaq
21,012,895 enwik8-s7.3ci1.1.1.1m.zpaq
21,041,879 enwik8-s7.3ci1.1m.zpaq
21,055,449 enwik8-s7.3ci1.1.1.zpaq
21,058,599 enwik8-s7.3ci1.1.1.1.zpaq
21,094,125 enwik8-s7.3ci1.1.zpaq
21,160,740 enwik8-s7.3ci1m.zpaq
21,246,081 enwik8-s7.3ci1.zpaq
21,965,953 enwik8-s7.3c.zpaq

21,045,234 enwik8.min-s7.0ci1.1.1.1m.zpaq
21,045,637 enwik8.min-s7.0ci1.1.1m.zpaq
21,077,332 enwik8.min-s7.0ci1.1m.zpaq
21,086,771 enwik8.min-s7.0ci1.1.1.zpaq
21,090,282 enwik8.min-s7.0ci1.1.1.1.zpaq
21,122,323 enwik8.min-s7.0ci1.1.zpaq
21,194,916 enwik8.min-s7.0ci1m.zpaq
21,267,515 enwik8.min-s7.0ci1.zpaq
21,978,806 enwik8.min-s7.0c.zpaq

Edit: wrong file, sorry. minsort_rev_transform.exe in minsort_rev_32bit.zip runs. But Win32 g++ complains it can't find sys/mmap.h.

9. Originally Posted by Matt Mahoney
Win32 g++ still can't find Judy.h. But here are some more results comparing minsort and BWT with various ISSE chain lengths with and without mixers. Each group is sorted by size.
It uses Judy arrays. They compiled fine for me under Cygwin; I haven't tried building for Windows without Cygwin. For mmap, I guess it needs to be built with Cygwin or ported. I've found memory mapping to be fast and convenient, and I think Cygwin correctly translates it to the Windows equivalent. I guess you are not a unix guy. I use a Linux VM even on my Windows PC, and if I didn't have that, I'd use Cygwin. I think it's easier to develop in, and the bash shell is much nicer than the Windows cmd.exe prompt.

Normally, I try to limit dependencies on nonstandard libraries like Judy, but it was expedient in this case. At some point I will try to refactor the reverse transform to use an O(1) array lookup instead of a Judy array, because it shouldn't need any kind of associative array. There are other algorithmic improvements, too, and I think it should end up being O(n). So far my priority has been just making it work and not getting lost in the complexity; putting aside things like portability helps with that. I see the code primarily as a way to demonstrate and prove the concepts, not as the end product per se. I'd like to get a sense of how interesting and useful it is before putting more time into refinements. And I invite anyone to contribute if they want, or to implement it themselves.

Ultimately, I'd like to understand how much potential there is in improving these permutations. The BWT uses lexicographic ordering because it's simple and elegant, not because it's inherently good for compression, and not because it's a precondition for being invertible, as this conclusively demonstrates. All of the effort so far has gone into dealing with the permutation that the BWT emits, but what if you have direct control over the permutation itself? Maybe it's more efficient to reduce the error than to model it. The better you can make the permutation, the less touching up it will need.

Originally Posted by Matt Mahoney
The paper gives only theoretical bounds on the compressed size but no experimental results on real benchmarks. It would be interesting to compare with context models like those used in BBB and zpaq. I use an order 0-1 ISSE chain in zpaq and found during development that MTF and RLE made compression worse. BBB uses a complex model with a CM and 6 component ACM (SSE) chain and gets a little better compression than zpaq.

I thought the paper was helpful in understanding the compressibility of the BWT when I was still kind of new to it, but I don't think its ideas are state of the art. Basically, the idea is to break up the BWT by contexts using a suffix tree. The potential problem there is that suffix tree nodes are much too granular, and you'd be splitting runs. They show that you can use a dynamic-programming approach to decide which nodes to combine, so you could use a non-adaptive encoding like Huffman and still get the global optimum. Other papers published later showed that you can achieve the same benefit more simply with wavelet trees and RLE, and that breaking the BWT up into fixed-size chunks works almost as well as suffix-tree nodes. I think PPM has long been attacking the same problem from another angle.

21,010,746 enwik8-s7.3ci1.1.1m.zpaq
21,012,895 enwik8-s7.3ci1.1.1.1m.zpaq
21,041,879 enwik8-s7.3ci1.1m.zpaq
21,055,449 enwik8-s7.3ci1.1.1.zpaq
21,058,599 enwik8-s7.3ci1.1.1.1.zpaq
21,094,125 enwik8-s7.3ci1.1.zpaq
21,160,740 enwik8-s7.3ci1m.zpaq
21,246,081 enwik8-s7.3ci1.zpaq
21,965,953 enwik8-s7.3c.zpaq

21,045,234 enwik8.min-s7.0ci1.1.1.1m.zpaq
21,045,637 enwik8.min-s7.0ci1.1.1m.zpaq
21,077,332 enwik8.min-s7.0ci1.1m.zpaq
21,086,771 enwik8.min-s7.0ci1.1.1.zpaq
21,090,282 enwik8.min-s7.0ci1.1.1.1.zpaq
21,122,323 enwik8.min-s7.0ci1.1.zpaq
21,194,916 enwik8.min-s7.0ci1m.zpaq
21,267,515 enwik8.min-s7.0ci1.zpaq
21,978,806 enwik8.min-s7.0c.zpaq
I doubt there is anything that can be done on enwik8 to make this outperform the BWT. On various kinds of data it just plain beats the BWT, but not on enwik8.

10. Actually I have a Windows and Linux machine, but I tend to do more of my work in Windows.

Anyway, I did some more experiments with english.dic, which is known to compress poorly with BWT because it is sorted. I compared BWT, minsort, and no transform at all, modeling with ICM-ISSE chains of orders 0 through 5. We see that minsort does better than BWT at all orders, but that using no transform (for order 1 and higher) works best of all. Method s4.3 selects streaming mode, 16 MB blocks, and BWT. s4.0 is the same without BWT. c selects an order 0 ICM, and i1.1... is an ISSE chain with orders 1, 2, ...

Code:
```
BWT
4,067,439 english.dic
1,227,708 english.dic-s4.3c.zpaq
1,186,312 english.dic-s4.3ci1.zpaq
1,176,767 english.dic-s4.3ci1.1.zpaq
1,175,797 english.dic-s4.3ci1.1.1.zpaq
1,177,630 english.dic-s4.3ci1.1.1.1.zpaq
1,180,554 english.dic-s4.3ci1.1.1.1.1.zpaq

minsort
4,067,439 english.min
1,197,049 english-min-s4.0c.zpaq
1,135,283 english-min-s4.0ci1.zpaq
1,121,638 english-min-s4.0ci1.1.zpaq
1,118,581 english-min-s4.0ci1.1.1.zpaq
1,120,034 english-min-s4.0ci1.1.1.1.zpaq
1,123,118 english-min-s4.0ci1.1.1.1.1.zpaq

no transform
4,067,439 english.dic
1,938,272 english.dic-s4.0c.zpaq
991,423 english.dic-s4.0ci1.zpaq
705,696 english.dic-s4.0ci1.1.zpaq
599,960 english.dic-s4.0ci1.1.1.zpaq
538,241 english.dic-s4.0ci1.1.1.1.zpaq
505,578 english.dic-s4.0ci1.1.1.1.1.zpaq
```
Edit: another example, this time on a tar file of the Calgary corpus. Again, minsort outperforms BWT. We know that BWT does poorly on a mix of different data types; you could get better BWT compression by compressing each file separately (or at least the binary files separately). In this case, BWT and minsort compress better than direct modeling at order 3 and below, but direct modeling at orders 4 and higher gives the best results.

Code:
```
BWT
3,152,896 calgary.tar
844,364 calgary.tar-s4.3c.zpaq
800,525 calgary.tar-s4.3ci1.zpaq
795,718 calgary.tar-s4.3ci1.1.zpaq
797,000 calgary.tar-s4.3ci1.1.1.zpaq
800,074 calgary.tar-s4.3ci1.1.1.1.zpaq
803,725 calgary.tar-s4.3ci1.1.1.1.1.zpaq

minsort
3,152,896 calgary.min
843,870 calgary.min-s4.0c.zpaq
798,148 calgary.min-s4.0ci1.zpaq
792,957 calgary.min-s4.0ci1.1.zpaq
794,035 calgary.min-s4.0ci1.1.1.zpaq
797,014 calgary.min-s4.0ci1.1.1.1.zpaq
800,685 calgary.min-s4.0ci1.1.1.1.1.zpaq

no transform
3,152,896 calgary.tar
1,717,067 calgary.tar-s4.0c.zpaq
1,274,254 calgary.tar-s4.0ci1.zpaq
999,891 calgary.tar-s4.0ci1.1.zpaq
834,849 calgary.tar-s4.0ci1.1.1.zpaq
765,286 calgary.tar-s4.0ci1.1.1.1.zpaq
743,027 calgary.tar-s4.0ci1.1.1.1.1.zpaq
```

11. Originally Posted by Matt Mahoney
Edit: another example, this time on a tar file of the Calgary corpus. Again, minsort outperforms BWT. We know that BWT does poorly on a mix of different data types. You could get better BWT compression by compressing each file separately (or at least the binary files separately). In this case, BWT and minsort compresses better than direct modeling at order 3 and below, but direct modeling at orders 4 and higher gives the best results.
In my testing with less sophisticated models, I've seen differences from the BWT of up to about 1%. This may indicate a hard limit, since keeping all of the nodes of the suffix tree together constrains how far the order can deviate.

I tried changing the order a bit by eliminating all nodes with fewer than 3 suffixes; the transform remained decodable, but compression was always slightly worse across a variety of data. So pruning nodes is one possible way to change the order. Another way that should be possible is splitting or prematurely terminating nodes when they no longer look useful. That could help in the case of mixed data types.
