I've been experimenting with a new invertible BWT-like transform. Like the BWT, it groups together all of the maximal contexts (suffix tree nodes), so it's realized by a suffix tree in the sense of [1]. Unlike the BWT, it doesn't use lexicographic order to determine an ordering among contexts. A total order is derived from two constraints:
- Every suffix is next to (at least) one of its longest matches (by trailing context)
- Earlier suffixes come before later suffixes
The output of the transform is a pair of (transformed string, index), like the BWT. The index gives the position of the last symbol in the transformed string. The first symbol in the input string has null context (it doesn't wrap around), and the null context is placed first, so the first symbol of the input and transform are always the same, and decoding starts at the beginning. The first symbol determines the second context (after the null context), so the second symbol of the input and transform are also always the same.
Ex. "BANANAS" => ("BANNSAA", 4)
"ABRACADABRA" => ("ABCDBRRAAAA",
"QWERTY" => ("QWERTY", 5)
The transform was inspired by PPM, LZ and other algorithms that compress in the original order. It's also somewhat similar to the Schindler transform, in the sense that that transform falls back on the positional order to break ties, so it's a mix of sort and positional orders. Unlike the ST, this uses unbounded contexts, and prefers the positional order to the sort order. It would be possible to make a bounded-context variant of this, too.
Overall, this transform compresses very similarly to the BWT, but on some types of data, this beats the BWT by about 1%. That seems to indicate that some data (e.g., source code) benefits slightly from being ordered positionally rather than lexicographically.
One thing that became clear when I was developing this is that the order of the BWT makes decoding very simple. To decode this, it seems to be required to materialize the suffix tree as you go. It doesn't have the property that symbols in the first and last columns are in the same relative order, and, in fact, I realized that that property is unique to the BWT. Having that property implies that a transform is completely sorted. So, to branch out significantly beyond the BWT, it's necessary to break that invariant and complicate the decoder. But, I believe that an unlimited number of new transforms are possible, and this experiment provides an example of how you might write encoders and decoders for them.
I've attached code and 64-bit windows binaries.
1. Paolo Ferragina, Raffaele Giancarlo, Giovanni Manzini, and Marinella Sciortino. 2005. Boosting textual compression in optimal linear time. J. ACM 52, 4 (July 2005), 688-713.