Hi everybody,
I want to present a (hopefully^^) new compression scheme I've developed.
For the sake of simplicity I only worked on a version which uses a binary alphabet and
circular data (=> no special ending character).
The Idea
Imagine encoding the file with an order-0 model. We basically use a Markov chain with
only a single state:
Encoding an order-1 model uses a 2 state Markov chain(where each state represents a context):
The idea is instead of transmitting the Markov model and the way through it, we only transmit the order-0 model
and information on how to build the order-1 model from it. We can repeat this step until order-n where there's
always only 1 edge between the nodes and then rebuild the data without any additional knowledge.
This means we encode the length of the string and the numbers of zeros in it (so we also know the number of ones).
At each order-x we create all possible strings and create the order-(x+1) model matching this string. Now we count
how many (N) different order-(x+1) models there are and encode the one matching our data using log2(N) bits.
Observations
Of course we cannot create all possible strings and count the different order-(c+1) models in any meaningful time < O(e^n).
But I've made some observations that allow to encode data in this way using O(n) time and O(n) space.
1. Each edge represents a node in the next Markov chain
The number each edge is taken equals the number the state is reached in the next chain.
2. Each chain can be broken down to groups of 4 states (of which some can be duplicates)
The basic chunk of these finite state machines is the following:
Where X is a context of length x-1. We can now compute the number of possible next chains constant time
for each of these chunks. The number edge 0X -> X0 is taken + the number edge 0X -> X0 is taken =
times state X0 is reached. We only need to encode the amount 1 edge is taken of every basic chunk to encode
the next fsm.
If at least 1 state isn't taken at all there is no need to encode any information.
3. Relation to the suffix tree
Each state can be found in the suffix tree. gathering the information needed to encode each chunk can be found
by simply walking the suffix tree using suffix links. The suffix links of 0X and 1X always point to X. From observation 2
we know that we only need to encode states where 0X, 1X, X0 and X1 exists.
If X0 and X1 exists there must be a node in the tree for X. Because there's only a limited number of nodes in the suffix tree O(n)
this limits the number of codings to O(n).
Implementation
These observations leads to a linear time encoding of the data. But it still remains to be shown that it is also possible to do this using O(n)
memory. This took me a while...
When encoding from order-0 to order-n we always encode the order-x before the order-(x+1) chunks. This equals walking the lcp array
from values of 0 to n. But this would still require O(n*log n) memory for the lcp array and constant time range minimum queries.
Instead I basically use an algorithm from Dipl.-Inf. Timo Beller (http://www.uni-ulm.de/in/theo/m/beller.html) which creates the lcp array
in ascending order from the BWT which can be computed in O(n) memory (See A Linear-Time Burrows-Wheeler Transform Using Induced Sorting
Daisuke Okanohara, Kunihiko Sadakane). Timo Beller uses an implementation which uses O(n * log(n)) bits to store the w-Intervals but I can show
that this can be done using O(n) time and O(n) memory (if we don't store the lcp array) using distance coding and
Elias-gamma coding (see tools/Queue.h and BCE.h / Intervals are created in ascending order for each character).
So the framework looks like this for encoding:
- Do the Burrows-Wheeler-Transformation of the input data
- "Build" the LCP array in ascending order and
- For each w-Interval X1 and X0 encode the basic chunk X using a rank index.
Decoding works pretty much the same way:
- Rebuild the rank index using the data from coder.
- Unbwt using the rebuild rank index
The Code
I still have a lot to share with you (e.g. how to rebuild the rank index from the coder) but I guess you wanna see that it works first.
The source code was developed for cmake/gcc in Fedora using c++11.
So here I release to versions of my code:
1. BCE v0.2
This version is exactly what I described above but it uses SAIS-lite from Yuta Mori instead of the direct approach.
(Yeah I was a little bit lazy with that ...) This means it uses an awful lot of memory (a pointer for each bit).
Source: http://kta-design.de/releases/bce2.tar.gz
2. Bce v0.3
This version does a regular BWT on char level then splits them into 8 chunks (always 1 bit per context) and encodes them
in a round robin fashion (hard to explain maybe the code says more see BCE.cpp and the modified tools/SAIS.h).
It uses a lot less memory (about 5 N). It should be able to encode enwik9 on 64 bit machines.
Source: http://kta-design.de/releases/bce3.tar.gz
(The code was only tested on a 64 bit machine and needs some rewriting to work on a 32 bit machine)
The Results
My results on a Core i7-4770K, 8GB DDR3, Samsung 840Pro 128 GB machine look like this:
bce2 -c enwik8: 4:11 min
bce2 -d enwik8.bce: 3:16 min
Compressed size: 22825468 Byte
bce3 -c enwik8: 1:40 min
bce3 -d enwik8.bce: 3:08 min
Compressed size: 22.729.148 Byte
bce3 -c enwik9: 19:11 min
bce3 -d enwik9.bce: 40:44 min
Compressed size: 180.732.702 Byte
About me
I'm a 24 years old hobby programmer from Germany. I study process engineering not Informatics so please inform me
if I used false terms. Maybe someone could help me write a correct paper about this topic ? Feel free to ask any questions
I think there are still a lot of optimizations possible. Especially increasing the alphabet size could help a lot (I will continue working on this).
We also have additional information which hasn't been used yet (e.g. the lcp value).
Thanks for your attention,
Christoph