Thread: A new algorithm for compression using order-n models

1. A new algorithm for compression using order-n models

Hi everybody,

I want to present a (hopefully^^) new compression scheme I've developed.
For the sake of simplicity I only worked on a version which uses a binary alphabet and
circular data (=> no special ending character).

The Idea
Imagine encoding the file with an order-0 model. We basically use a Markov chain with
only a single state:

Encoding with an order-1 model uses a 2-state Markov chain (where each state represents a context):

The idea is instead of transmitting the Markov model and the way through it, we only transmit the order-0 model
and information on how to build the order-1 model from it. We can repeat this step until order-n where there's
always only 1 edge between the nodes and then rebuild the data without any additional knowledge.

This means we encode the length of the string and the number of zeros in it (so we also know the number of ones).
At each order-x we create all possible strings and create the order-(x+1) model matching this string. Now we count
how many (N) different order-(x+1) models there are and encode the one matching our data using log2(N) bits.

Observations
Of course we cannot actually create all possible strings and count the different order-(x+1) models; that would take exponential time.
But I've made some observations that allow encoding the data this way in O(n) time and O(n) space.

1. Each edge represents a node in the next Markov chain
The number of times each edge is taken equals the number of times the corresponding state is reached in the next chain.

2. Each chain can be broken down into groups of 4 states (some of which can be duplicates)
The basic chunk of these finite state machines is the following:

Where X is a context of length x-1. We can now compute the number of possible next chains in constant time
for each of these chunks: the number of times edge 0X -> X0 is taken plus the number of times edge 1X -> X0 is taken
equals the number of times state X0 is reached. We only need to encode how often one edge of every basic chunk is taken
to encode the next FSM.

If at least one of the four states is never reached, there is nothing to encode for that chunk.

3. Relation to the suffix tree
Each state corresponds to a node in the suffix tree. The information needed to encode each chunk can be gathered
by simply walking the suffix tree using suffix links. The suffix links of 0X and 1X always point to X. From observation 2
we know that we only need to encode states where 0X, 1X, X0 and X1 all exist.

If X0 and X1 both exist there must be a node in the tree for X. Because the suffix tree contains only O(n) nodes,
this limits the number of encodings to O(n).

Implementation
These observations lead to a linear-time encoding of the data. But it still remains to be shown that this is also possible using O(n)
memory. This took me a while ...

When encoding from order-0 to order-n we always encode the order-x chunks before the order-(x+1) chunks. This corresponds to walking the LCP array
from values of 0 to n. But that would still require O(n log n) bits of memory for the LCP array plus constant-time range minimum queries.

Instead I basically use an algorithm by Timo Beller (http://www.uni-ulm.de/in/theo/m/beller.html) which creates the LCP array
in ascending order from the BWT, which itself can be computed in O(n) memory (see "A Linear-Time Burrows-Wheeler Transform Using Induced Sorting",
Daisuke Okanohara, Kunihiko Sadakane). Timo Beller's implementation uses O(n log n) bits to store the w-intervals, but I can show
that this can be done in O(n) time and O(n) memory (if we don't store the LCP array) using distance coding and
Elias-gamma coding (see tools/Queue.h and BCE.h; intervals are created in ascending order for each character).

So the framework looks like this for encoding:
1. Do the Burrows-Wheeler-Transformation of the input data
2. "Build" the LCP array in ascending order and
3. For each pair of w-intervals X0 and X1, encode the basic chunk X using a rank index.

Decoding works pretty much the same way:
1. Rebuild the rank index using the data from the coder.
2. Un-BWT using the rebuilt rank index.

The Code
I still have a lot to share with you (e.g. how to rebuild the rank index from the coder) but I guess you wanna see that it works first.
The source code was developed for cmake/gcc on Fedora using C++11.
So here I release two versions of my code:

1. BCE v0.2
This version is exactly what I described above but it uses SAIS-lite by Yuta Mori instead of the direct approach.
(Yeah, I was a little bit lazy there ...) This means it uses an awful lot of memory (a pointer for each bit).

Source: http://kta-design.de/releases/bce2.tar.gz

2. BCE v0.3
This version does a regular BWT at the char level, then splits the data into 8 chunks (always 1 bit per context) and encodes them
in round-robin fashion (hard to explain; maybe the code says more, see BCE.cpp and the modified tools/SAIS.h).
It uses a lot less memory (about 5N). It should be able to encode enwik9 on 64-bit machines.

Source: http://kta-design.de/releases/bce3.tar.gz

(The code was only tested on a 64-bit machine and needs some rewriting to work on 32-bit machines.)

The Results
My results on a Core i7-4770K, 8GB DDR3, Samsung 840Pro 128 GB machine look like this:

bce2 -c enwik8: 4:11 min
bce2 -d enwik8.bce: 3:16 min
Compressed size: 22,825,468 bytes

bce3 -c enwik8: 1:40 min
bce3 -d enwik8.bce: 3:08 min
Compressed size: 22,729,148 bytes

bce3 -c enwik9: 19:11 min
bce3 -d enwik9.bce: 40:44 min
Compressed size: 180,732,702 bytes

I'm a 24-year-old hobby programmer from Germany. I study process engineering, not computer science, so please let me know
if I used incorrect terms. Maybe someone could help me write a proper paper about this topic? Feel free to ask any questions.

I think there are still a lot of optimizations possible. Especially increasing the alphabet size could help a lot (I will continue working on this).
We also have additional information which hasn't been used yet (e.g. the lcp value).

Thanks for your attention,
Christoph

2. The Following 8 Users Say Thank You to Christoph Diegelmann For This Useful Post:

avitar (19th March 2015),Bulat Ziganshin (16th March 2015),Cyan (17th March 2015),encode (16th March 2015),Gonzalo (16th March 2015),Mike (17th March 2015),Nania Francesco (17th March 2015),willvarfar (3rd May 2016)

3. Welcome Christoph Diegelmann!

Your post sounds very interesting.

Can you please provide a working binary?
This would make it easier to test your program/algorithm.

Maybe for 64-bit Red Hat Linux 6 or for another 64-bit system?

best regards
Joerg

4. Hello Joerg,

you can find binaries which I compiled under Fedora 21 64 bit using gcc 4.9.2 here:
http://kta-design.de/releases/bce2
http://kta-design.de/releases/bce3

Before I forget about it: BCE2 can compress static-Huffman-encoded files without much loss (results in ~23 MB for enwik8);
this speeds up compression and decompression without hurting the compression ratio much.

Christoph

5. Thank you very much for your quick answer.

But these binaries for Fedora 21 don't seem to work under Red Hat Enterprise Linux 6 ...?

Should they work?

best regards

6. Sadly I don't have access to RHEL 6 so I can't try. But you should be able to build it using

cmake CMakeLists.txt
make

if you have cmake, make and gcc.

I just noticed the CMakeLists.txt contains "SET(CMAKE_BUILD_TYPE "-Release")" which should be
"SET(CMAKE_BUILD_TYPE "Release")". This means the times I posted were for the unoptimized code.
I will update the timings soon.

Update 20:21: Updated the source packages and the binaries (compiled with "-march=x86-64"; hopefully they work? :/ )
Update 21:42: Forgot to remove some test code in bce2 which led to incorrect decoding. Fixed it and re-uploaded binaries and source.

7. I tested bce3 on LTCB and Silesia. On LTCB I couldn't test enwik9 with only 4 GB memory but enwik8 verified OK. On Silesia corpus, dickens, osdb, and sao decompressed output did not match the original file. I didn't test bce2.

http://mattmahoney.net/dc/text.html#1807 (ranked 41 of 189).
http://mattmahoney.net/dc/silesia.html (not updated).

It is an interesting algorithm which I haven't seen before. The Markov graph representation reminds me of DMC, but the idea of using a BWT to construct and transmit it rather than having the decoder reconstruct it is novel as far as I can tell.

Thanks for testing!
So far I've only tested on enwik8, enwik9 and some small debugging files, which decompressed correctly.
I will try to fix it and upload a new version (I'm also working on a Windows port).

In my latest tests I also noticed that by using 0X/(0X+1X) and X0/(X0+X1) as context when encoding a chunk I
can reduce the size to ~21.2 MB for enwik8 and 168 MB for enwik9. I used a simple adaptive range coder
for contexts where X0+X1 < 32.
Could you give me an idea how I could implement such an adaptive coding stage for contexts of unlimited size?
Currently I'm simply using an array "uint32_t stats_[32][32][32][32];" where the contexts are X0+X1, X0 and 0X
(the last array dimension is for the symbol being coded).

Update: Fixed the bug in bce3 that led to incorrect decoding of some files (at least Silesia compresses and decompresses
correctly on my machine with this version). Re-uploaded the build and the source.

9. Originally Posted by Christoph Diegelmann
The idea is instead of transmitting the Markov model and the way through it, we only transmit the order-0 model
and information on how to build the order-1 model from it. We can repeat this step until order-n where there's
always only 1 edge between the nodes and then rebuild the data without any additional knowledge. [...]
I haven't got enough time to follow along too deeply at the moment, but this sounds very similar to what M03 does to encode the BWT: encode the order 0 model and then the information needed to derive the order 1 model from the order 0 model. Continue doing so for each order until all states in the model are unique.

- Michael

10. Hey Michael,

yeah, that sounds quite similar. Can you give me additional information about M03? I guess you're working with
a character model rather than a bit model?

Christoph

11. Originally Posted by Christoph Diegelmann
Hey Michael,

yeah, that sounds quite similar. Can you give me additional information about M03? I guess you're working with
a character model rather than a bit model?

Christoph

Yes, my implementation is byte based rather than bit based. Also, I don't bother with computing the LCP array; rather, the LCP is a byproduct of the modelling process. Juergen Abel did a write-up on M03 that covers some of the concepts involved. That paper is at:

http://www.juergen-abel.info/Preprin...BWT_Stages.pdf

Basically, once you have the BWT you have the distribution of the symbols which precede the order zero context. From this you can infer the order one context boundaries and encode how the symbols preceding the order zero context are distributed among the order one contexts. Repeat until each context is unique (or all symbols preceding a given context are the same).

example (this is off the top of my head so I could make a mistake here).

input string is banana\$ (\$ == -1)
BWT is: annb\$aa

M03 works by encoding the distribution of symbols which precede a unique context as that context is divided into the contexts of the next highest order. To make the example clear, first a few definitions. A 'list' of symbols will be written inside square brackets. In 'banana\$' the context 'an' appears twice. The symbols which precede these contexts are 'b' and 'n', therefore the list of symbols which precede the context 'an' would be [bn]. The symbols in a list are given in lexicographical order.

The symbol list which precedes the order zero context for any given string is basically trivial to produce. The encoder encodes the count of each unique symbol type in the input (or the BWT, same thing) and the decoder decodes these counts to produce the same list.

For annb\$aa the encoder encodes 1\$,3a,1b,2n. The list of symbols which precede the order zero context is [\$aaabnn].

From this, the order one context boundaries can be inferred. The single order zero list [\$aaabnn] will be 'split' into four symbol lists representing the symbols which precede each of the four unique order one contexts '\$', 'a', 'b' and 'n'.
This means that [\$aaabnn] will be split into the four lists [a][bnn][\$][aa]. How the encoder encodes the way [\$aaabnn] is distributed into [a][bnn][\$][aa] is implementation specific. The decoder decodes that information and uses it to create the same lists [a][bnn][\$][aa].

What does [a][bnn][\$][aa] indicate? It indicates that there are four unique order one contexts. (The model doesn't really need to know what those contexts are specifically to achieve compression, but that knowledge can be used to increase compression in more advanced implementations of the algorithm.) There are four unique order one contexts in 'banana\$' and they are '\$', 'a', 'b' and 'n'. '\$' appears once and is preceded by 'a'. 'a' appears three times and is preceded by 'b' once and 'n' twice. 'b' appears once and is preceded by '\$'. And 'n' appears twice and is preceded by 'a' both times.

For clarity (I hope) we can display the order 0 and order 1 contexts as:

order 0 -> [\$aaabnn]       <-- one order 0 'parent' context
order 1 -> [a][bnn][\$][aa] <-- four order 1 'child' contexts

This information allows us to infer the order 2 contexts. Take for example the three 'a' symbols which are in the order zero context. At order zero all three are in the same list [\$aaabnn], but they are 'split' into two of the lists at order one: [a][...][.][aa]. Therefore the order one 'a' context (which appears three times) will split at order two into two contexts of sizes 1 and 2. Considering only what this split of 'a' means for the moment:

order 0 -> [\$aaabnn ]
order 1 -> [a][bnn ][\$][aa]
^ ^^ <-- the three 'a' split into two child contexts
^^^ <-- these are the three order one 'a' contexts
order 2 -> [n][bn] <-- therefore [bnn] will split at order two into [n][bn]

Notice that the two 'n' in the order zero list do NOT split into separate lists at order one. Therefore the order one 'n' context will not split at order two. To illustrate:

order 0 -> [\$aaabnn ]
order 1 -> [a][bnn][\$][aa]
^^ <-- the two 'n' do NOT split into different child contexts
^^ <-- these are the two order one 'n' contexts
order 2 -> [aa] <-- therefore [aa] will NOT split at order two

So, using how the order zero symbols are split into order one lists we can infer the context boundaries for order two. The only information needed to encode the transition from order one to order two is that the order one list [bnn] splits into the two order two lists [n][bn]. After this both the encoder and decoder have the same order two context information.

order 0 -> [\$aaabnn ]
order 1 -> [a][bnn ][\$][aa]
order 2 -> [a][n][bn][\$][aa] <-- [bnn] has split into [n][bn]

The next step is to infer the order three context boundaries based on how the symbols in the order one contexts were split into the order two contexts. The only symbols which split were the 'n', which went from [bnn] to [n][bn]. Therefore the order three context boundaries will be:

order 0 -> [\$aaabnn ]
order 1 -> [a][bnn ][\$][aa ]
order 2 -> [a][n][bn][\$][aa ]
^ ^ <-- the two 'n' split into two contexts
^^ <-- this context will then split into two contexts at order 3
order 3 -> [a][n][bn][\$][a][a] <-- order 3 contexts

The encoder encodes nothing, as the list [aa] transitioning into [a][a] can be completely inferred by the decoder.

Finally, because [aa] has split at order three into [a][a] we can split the context which those two symbols represent:

order 0 -> [\$aaabnn ] <-- order 0 context
order 1 -> [a][bnn ][\$][aa ] <-- order 1 contexts
order 2 -> [a][n][bn ][\$][aa ] <-- order 2 contexts
order 3 -> [a][n][bn ][\$][a][a] <-- order 3 contexts
^ ^ <-- the two 'a' split into two contexts
^^ <-- this context will then split into two contexts at order 4
order 4 -> [a][n][n][b][\$][a][a]

The encoder encodes nothing, as the list [bn] transitioning into [n][b] can likewise be inferred by the decoder.
All contexts are now unique and the BWT is known to be 'annb\$aa'.

I hope this is at least somewhat clear but it is probably still not (^:

- Michael


13. In fact it is extremely clear, because bce3 does exactly the same ^^

Is there source for M03 available somewhere?


15. Mh, kind of funny ...
This somehow reminds me of the many things I successfully re-invented xD
(probably many of you here have already experienced the same)

16. Originally Posted by Christoph Diegelmann
Is there source for M03 available somewhere?
I wrote my implementation in the weeks preceding the birth of my first child. I figured that I wouldn't have any time to return to the work once she was born. And I was correct (^:

The source code is somewhere on a hard drive in a laptop that doesn't start anymore, and given how quickly I wrote it, it wouldn't be the best example of my work. But I'll see if I can dig it up. It's probably easier, however, for me to re-write it. And there are a few improvements which I never got to add to the original, so maybe a rewrite is the way to go. If I take that path I'll be sure to make the code available on github and/or my own site.

I'll post back here if I can dig up the original code.

- Michael

17. Originally Posted by Steve
Mh kind of funny...
This somehow reminds me of the many things I successfully re-invented xD
(probably many of you here already experienced the same)
A valid invention just the same!
Certainly our implementations of the idea are different. Plus there are many ideas to add to the original work too.

18. I had some time recently, so I modified bce3 to calculate the number of bits needed to encode each char, in order to see how it assigns probabilities.

I made the data accessible at my domain:
http://www.kta-design.de/releases/enwik8/
http://www.kta-design.de/releases/text7/ // first 10 MB of text8

Currently the colors are calculated like this:
green channel = sqrt(probability);
red channel = 1 - green channel;

I'd love to hear your thoughts about the output


20. "It's starting to look a lot like Christmas."

It seems pretty cool... but be aware that several percent of men are color blind (8 percent of men with Northern European ancestry), most of them specifically red/green colorblind. So if you use color, the red-green axis is the one to avoid. Maybe somebody with a better understanding of color spaces can recommend a better axis.

21. I googled up a web page about color blindness, and found a cross-section of retina layers, with false color to emphasize rods vs. cones.

I guess their "normal human retina" picture is only for people with normal human retinas.

22. I simply used red/green because of the traffic light feeling ("this char is ok, this one is not"). If someone wants an alternate coloring I'll add it

23. Originally Posted by michael maniscalco
It was found at http://www.juergen-abel.info/files/p...bwt_stages.pdf

