
Thread: A new algorithm for compression using order-n models

  1. #1
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts

    A new algorithm for compression using order-n models

    Hi everybody,

    I want to present a (hopefully^^) new compression scheme I've developed.
    For the sake of simplicity I only worked on a version which uses a binary alphabet and
    circular data (=> no special ending character).

    The Idea
    Imagine encoding the file with an order-0 model. We basically use a Markov chain with
    only a single state:
    [Image: bce0.png]
    Encoding with an order-1 model uses a 2-state Markov chain (where each state represents a context):
    [Image: bce1.png]
    The idea: instead of transmitting the Markov model and the path through it, we only transmit the order-0 model
    and the information needed to build the order-1 model from it. We can repeat this step up to order-n, where there is
    always only one edge between the nodes, and then rebuild the data without any additional knowledge.

    This means we encode the length of the string and the number of zeros in it (so we also know the number of ones).
    At each order x we (conceptually) create all possible strings matching the current model and build the order-(x+1)
    model for each of them. Now we count how many (N) different order-(x+1) models there are and encode the one matching
    our data using log2(N) bits.
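
    To make this concrete, here is a tiny brute-force sketch (purely illustrative, nothing like the real coder; the toy string and helper names are made up for this post).
    It enumerates every circular bit string that matches a given order-0 model, builds the order-1 model for each, and counts how many distinct order-1 models N exist,
    i.e. the log2(N) bits mentioned above:

    Code:
    // Brute-force illustration of the counting above (nothing like the real coder):
    // enumerate every circular bit string that matches a given order-0 model
    // (length n, number of zeros z), build the order-1 model for each and count
    // how many distinct order-1 models N exist -> the log2(N) bits from above.
    #include <algorithm>
    #include <array>
    #include <cmath>
    #include <cstdio>
    #include <set>
    #include <string>

    // order-1 model of a circular bit string: counts of the bigrams 00, 01, 10, 11
    static std::array<int, 4> order1Model(const std::string& s) {
        std::array<int, 4> m{{0, 0, 0, 0}};
        for (size_t i = 0; i < s.size(); ++i) {
            int a = s[i] - '0';
            int b = s[(i + 1) % s.size()] - '0';       // circular successor
            ++m[a * 2 + b];
        }
        return m;
    }

    int main() {
        const std::string data = "00101101";           // toy input
        const int n = (int)data.size();
        const int z = (int)std::count(data.begin(), data.end(), '0');

        std::set<std::array<int, 4>> models;           // distinct order-1 models
        for (int mask = 0; mask < (1 << n); ++mask) {  // all bit strings of length n
            std::string s;
            for (int i = 0; i < n; ++i) s += ((mask >> i) & 1) ? '1' : '0';
            if (std::count(s.begin(), s.end(), '0') != z) continue;  // wrong order-0 model
            models.insert(order1Model(s));
        }
        std::printf("order-0 model: n=%d, zeros=%d\n", n, z);
        std::printf("distinct order-1 models N=%zu -> log2(N)=%.2f bits\n",
                    models.size(), std::log2((double)models.size()));
    }

    Obviously this brute force only works for toy inputs; the observations below are what make it feasible for real data.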

    Observations
    Of course we cannot actually create all possible strings and count the different order-(x+1) models in any reasonable time (that would be exponential in n).
    But I've made some observations that allow encoding the data in this way using O(n) time and O(n) space.

    1. Each edge represents a state in the next Markov chain
    The number of times each edge is taken equals the number of times the corresponding state is reached in the next chain.

    2. Each chain can be broken down into groups of 4 states (some of which can be duplicates)
    The basic chunk of these finite state machines is the following:

    [Image: bce2.png]
    Here X is a context of length x-1. We can now compute the number of possible next chains in constant time
    for each of these chunks. The number of times edge 0X -> X0 is taken plus the number of times edge 1X -> X0 is taken
    equals the number of times state X0 is reached. So to encode the next FSM we only need to encode, for every basic chunk,
    how often one of its edges is taken.

    If at least one of the four states is never reached, there is no need to encode any information for that chunk.
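
    As a small sketch of that counting (the helper name and the example counts are made up; this is not the BCE code):
    given the four state counts of a chunk, all four edge counts are fixed by a single free parameter, so the number of
    possibilities and therefore the coding cost follows directly:

    Code:
    // Sketch of the per-chunk cost from observation 2 (names and example counts
    // are made up; this is not the BCE code). The four state counts of a chunk
    // are known from the previous order, so the four edge counts have only one
    // degree of freedom, e.g. a = number of times edge 0X -> X0 is taken.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    // bits needed to encode one basic chunk, given the four state counts
    static double chunkBits(int n0X, int n1X, int nX0, int nX1) {
        (void)n1X;                                // implied by the other three counts
        // with a = count(0X -> X0) the other edges follow:
        //   0X -> X1 : n0X - a        (>= 0  ->  a <= n0X)
        //   1X -> X0 : nX0 - a        (>= 0  ->  a <= nX0)
        //   1X -> X1 : nX1 - n0X + a  (>= 0  ->  a >= n0X - nX1)
        int lo = std::max(0, n0X - nX1);
        int hi = std::min(n0X, nX0);
        return std::log2((double)(hi - lo + 1)); // number of possible chunks N
    }

    int main() {
        std::printf("%.2f bits\n", chunkBits(3, 2, 4, 1));  // 2 possibilities -> 1 bit
        std::printf("%.2f bits\n", chunkBits(0, 5, 2, 3));  // state 0X never reached -> 0 bits
    }

    With consistent counts the range is never empty, and as soon as one of the four states has count 0 it collapses to a single possibility, which is the "nothing to encode" case above.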

    3. Relation to the suffix tree
    Each state can be found in the suffix tree. The information needed to encode each chunk can be gathered
    by simply walking the suffix tree using suffix links. The suffix links of 0X and 1X always point to X. From observation 2
    we know that we only need to encode chunks where 0X, 1X, X0 and X1 all exist.

    If X0 and X1 exist there must be a node in the tree for X. Because the suffix tree has only O(n) nodes,
    this limits the number of codings to O(n).

    Implementation
    These observations lead to a linear-time encoding of the data. But it still remains to be shown that it is also possible to do this using O(n)
    memory. This took me a while ...

    When encoding from order-0 to order-n we always encode the order-x chunks before the order-(x+1) chunks. This corresponds to walking the LCP array
    from value 0 up to n. But that would still require O(n log n) memory for the LCP array plus constant-time range minimum queries.

    Instead I basically use an algorithm by Dipl.-Inf. Timo Beller (http://www.uni-ulm.de/in/theo/m/beller.html) which creates the LCP array
    in ascending order from the BWT, which itself can be computed in O(n) memory (see "A Linear-Time Burrows-Wheeler Transform Using Induced Sorting",
    Daisuke Okanohara and Kunihiko Sadakane). Timo Beller's implementation uses O(n log n) bits to store the w-intervals, but I can show
    that this can be done in O(n) time and O(n) memory (if we don't store the LCP array) using distance coding and
    Elias-gamma coding (see tools/Queue.h and BCE.h; the intervals are created in ascending order for each character).
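
    For illustration, here is a minimal distance-coding + Elias-gamma sketch (this is not the actual tools/Queue.h code;
    the function names and the example positions are made up). Because the interval start positions arrive in ascending
    order per character, only the gaps need to be stored, and gamma coding keeps small gaps small:

    Code:
    // Minimal distance-coding + Elias-gamma sketch (NOT the actual tools/Queue.h
    // code; names and example positions are made up). Because the interval start
    // positions arrive in ascending order per character, only the gaps are stored.
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    static void gammaEncode(uint32_t v, std::vector<bool>& bits) {   // v >= 1
        int len = 0;
        for (uint32_t t = v; t > 1; t >>= 1) ++len;                  // floor(log2(v))
        for (int i = 0; i < len; ++i) bits.push_back(false);         // unary length prefix
        for (int i = len; i >= 0; --i) bits.push_back((v >> i) & 1); // value incl. leading 1
    }

    static uint32_t gammaDecode(const std::vector<bool>& bits, size_t& pos) {
        int len = 0;
        while (!bits[pos]) { ++len; ++pos; }                         // read the zeros
        uint32_t v = 0;
        for (int i = 0; i <= len; ++i) { v = (v << 1) | bits[pos++]; }
        return v;
    }

    int main() {
        std::vector<uint32_t> starts = {3, 7, 8, 15, 40};            // ascending positions
        std::vector<bool> bits;
        uint32_t prev = 0;
        for (uint32_t s : starts) { gammaEncode(s - prev, bits); prev = s; }  // store gaps

        size_t pos = 0; prev = 0;
        while (pos < bits.size()) { prev += gammaDecode(bits, pos); std::printf("%u ", prev); }
        std::printf("\n");                                           // prints: 3 7 8 15 40
    }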

    So the framework looks like this for encoding:
    1. Do the Burrows-Wheeler transform of the input data
    2. "Build" the LCP array in ascending order
    3. For each pair of w-intervals X0 and X1, encode the basic chunk X using a rank index


    Decoding works pretty much the same way:
    1. Rebuild the rank index using the data from the coder
    2. Inverse BWT using the rebuilt rank index


    The Code
    I still have a lot to share with you (e.g. how to rebuild the rank index from the coder) but I guess you wanna see that it works first.
    The source code was developed for cmake/gcc on Fedora using C++11.
    So here I release two versions of my code:

    1. BCE v0.2
    This version is exactly what I described above but it uses SAIS-lite from Yuta Mori instead of the direct approach.
    (Yeah I was a little bit lazy with that ...) This means it uses an awful lot of memory (a pointer for each bit).

    Source: http://kta-design.de/releases/bce2.tar.gz

    2. Bce v0.3
    This version does a regular BWT at the char level, then splits it into 8 chunks (always 1 bit per context) and encodes them
    in a round-robin fashion (hard to explain; maybe the code says more, see BCE.cpp and the modified tools/SAIS.h).
    It uses a lot less memory (about 5N). It should be able to encode enwik9 on 64-bit machines.

    Source: http://kta-design.de/releases/bce3.tar.gz

    (The code was only tested on a 64 bit machine and needs some rewriting to work on a 32 bit machine)

    The Results
    My results on a Core i7-4770K, 8GB DDR3, Samsung 840Pro 128 GB machine look like this:

    bce2 -c enwik8: 4:11 min
    bce2 -d enwik8.bce: 3:16 min
    Compressed size: 22,825,468 bytes

    bce3 -c enwik8: 1:40 min
    bce3 -d enwik8.bce: 3:08 min
    Compressed size: 22,729,148 bytes

    bce3 -c enwik9: 19:11 min
    bce3 -d enwik9.bce: 40:44 min
    Compressed size: 180,732,702 bytes

    About me
    I'm a 24-year-old hobby programmer from Germany. I study process engineering, not computer science, so please tell me
    if I used wrong terms. Maybe someone could help me write a proper paper about this topic? Feel free to ask any questions.

    I think there are still a lot of optimizations possible. Especially increasing the alphabet size could help a lot (I will continue working on this).
    We also have additional information which hasn't been used yet (e.g. the LCP value).

    Thanks for your attention,
    Christoph
    Last edited by Christoph Diegelmann; 17th March 2015 at 00:17. Reason: enwik9 time (optimized)

  2. The Following 8 Users Say Thank You to Christoph Diegelmann For This Useful Post:

    avitar (19th March 2015),Bulat Ziganshin (16th March 2015),Cyan (17th March 2015),encode (16th March 2015),Gonzalo (16th March 2015),Mike (17th March 2015),Nania Francesco (17th March 2015),willvarfar (3rd May 2016)

  3. #2
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts
    Welcome Christoph Diegelmann!

    Your post sounds very interesting.

    Can you please provide a working binary?
    This would make it easier to test your program/algorithm.

    Maybe for 64-bit Red Hat Linux 6 or for another 64-bit system?

    best regards
    Joerg

  4. #3
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    Hello Joerg,

    you can find binaries which I compiled under Fedora 21 64 bit using gcc 4.9.2 here:
    http://kta-design.de/releases/bce2
    http://kta-design.de/releases/bce3

    Before I forget: BCE2 can compress static-Huffman-encoded files without much loss (~23 MB for enwik8).
    This speeds up compression and decompression without hurting the compression ratio much.

    Christoph
    Last edited by Christoph Diegelmann; 16th March 2015 at 21:00.

  5. #4
    Member
    Join Date
    May 2008
    Location
    Germany
    Posts
    410
    Thanks
    37
    Thanked 60 Times in 37 Posts
    Thank you very much for your quick answer.

    But these binaries for Fedora 21 don't seem to work under Red Hat Enterprise Linux 6 ... ?

    Should they work?

    best regards

  6. #5
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    Sadly I don't have access to RHEL 6 so I can't try. But you should be able to build it using

    cmake CMakeLists.txt
    make

    if you have cmake, make and gcc.

    I just noticed the CMakeLists.txt contains "SET(CMAKE_BUILD_TYPE "-Release")" which should be
    "SET(CMAKE_BUILD_TYPE "Release")". This means the times I posted were for the unoptimized code.
    I will update the timings soon.

    Update 20:21: Updated the source packages and the binaries (compiled with "-march=x86-64"; hopefully they work? :/ )
    Update 21:42: Forgot to remove some test code in bce2 which led to a false decode. Fixed it and reuploaded the binaries and source.
    Last edited by Christoph Diegelmann; 17th March 2015 at 14:20.

  7. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    I tested bce3 on LTCB and Silesia. On LTCB I couldn't test enwik9 with only 4 GB memory, but enwik8 verified OK. On the Silesia corpus, the decompressed output for dickens, osdb, and sao did not match the original files. I didn't test bce2.

    http://mattmahoney.net/dc/text.html#1807 (ranked 41 of 189).
    http://mattmahoney.net/dc/silesia.html (not updated).

    It is an interesting algorithm which I haven't seen before. The Markov graph representation reminds me of DMC, but the idea of using a BWT to construct and transmit it rather than having the decoder reconstruct it is novel as far as I can tell.

  8. #7
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    Thanks for testing!
    So far I have only tested on enwik8, enwik9 and some small files for debugging, all of which decompressed correctly.
    I will try to fix it and reupload a new version (I'm also working on a windows port).

    In my latest tests I also noticed that by using 0X/(0X+1X) and X0/(X0+X1) as context when encoding a chunk, I
    can reduce the size to ~21.2 MB for enwik8 and 168 MB for enwik9. I used a simple adaptive range coder
    for contexts where X0+X1 < 32.
    Could you give me an idea how I could implement such an adaptive coding stage for contexts of unlimited size?
    Currently I'm simply using an array "uint32_t stats_[32][32][32][32];" where the contexts are X0+X1, X0 and 0X
    (the last array dimension is for the symbol coded).
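
    To make clearer what I mean by a "simple adaptive stage", here is a stripped-down sketch (not my actual coder; the range coder
    itself is left out and the names are made up). Each (X0+X1, X0, 0X) context gets its own counter pair, which gives the
    probability fed to the range coder and is updated after every coded bit:

    Code:
    // Stripped-down sketch of what I mean by an adaptive stage (not my actual
    // coder; the range coder itself is left out). One counter pair per
    // (X0+X1, X0, 0X) context, mirroring stats_[32][32][32][32] where the last
    // dimension is the coded symbol.
    #include <cstdint>
    #include <cstdio>

    struct BitModel {
        uint32_t c[2] = {1, 1};                        // Laplace-smoothed counts
        double p1() const { return (double)c[1] / (c[0] + c[1]); }
        void update(int bit) { ++c[bit]; }
    };

    int main() {
        static BitModel model[32][32][32];             // contexts: X0+X1 (< 32), X0, 0X

        // hypothetical chunk: X0+X1 = 7, X0 = 3, 0X = 5, next bit to code is 1
        BitModel& m = model[7][3][5];
        std::printf("p(bit=1) = %.3f\n", m.p1());      // fed to the range coder
        m.update(1);                                   // adapt after coding the bit
        std::printf("p(bit=1) = %.3f\n", m.p1());
    }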

    Update: Fixed the bug in bce3 leading to false decodes of some files (at least Silesia compresses and decompresses
    correctly on my machine with this version). Reuploaded the build and the source.
    Last edited by Christoph Diegelmann; 28th August 2015 at 20:11.

  9. #8
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    109
    Thanks
    9
    Thanked 80 Times in 25 Posts
    Quote Originally Posted by Christoph Diegelmann View Post
    Hi everybody,

    The Idea
    Imagine encoding the file with an order-0 model. We basically use a Markov chain with
    only a single state:
    [Image: bce0.png]
    Encoding with an order-1 model uses a 2-state Markov chain (where each state represents a context):
    [Image: bce1.png]
    The idea: instead of transmitting the Markov model and the path through it, we only transmit the order-0 model
    and the information needed to build the order-1 model from it. We can repeat this step up to order-n, where there is
    always only one edge between the nodes, and then rebuild the data without any additional knowledge.

    This means we encode the length of the string and the number of zeros in it (so we also know the number of ones).
    At each order x we (conceptually) create all possible strings matching the current model and build the order-(x+1)
    model for each of them. Now we count how many (N) different order-(x+1) models there are and encode the one matching
    our data using log2(N) bits.

    Christoph
    I haven't got enough time to follow along too deeply at the moment, but this sounds very similar to what M03 does to encode the BWT: encode the order 0 model and then the information needed to define the order 1 model from the order 0 model. Continue doing so for each order until all states in the model are unique.

    - Michael

  10. #9
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    Hey Michael,

    yeah, that sounds quite similar. Can you give me additional information about M03? I guess you're working with
    a character model rather than a bit model?

    Christoph

  11. #10
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    109
    Thanks
    9
    Thanked 80 Times in 25 Posts
    Quote Originally Posted by Christoph Diegelmann View Post
    Hey Michael,

    yeah, that sounds quite similar. Can you give me additional information about M03? I guess you're working with
    a character model rather than a bit model?

    Christoph

    Yes, my implementation is byte-based rather than a bit model. Also, I don't bother with computing the LCP array; rather, the LCP is a byproduct of the modelling process. Juergen Abel did a write-up on M03 that covers some of the concepts involved. That paper is at:

    http://www.juergen-abel.info/Preprin...BWT_Stages.pdf

    Basically, once you have the BWT you have the distribution of the symbols which precede the order zero contexts. From this you can infer the order one context boundaries and encode how the distribution of symbols preceding the order zero contexts is distributed to the order one contexts. Repeat until each context is unique (or all symbols preceding a given context are the same).

    Example (this is off the top of my head, so I could make a mistake here):

    input string is banana$ ($ == -1)
    BWT is: annb$aa

    M03 works by encoding the distribution of symbols which precede a context as that context is divided into the contexts of the next higher order. To make the example clear, first a few definitions. A 'list' of symbols will be written inside square brackets. In 'banana$' the context 'an' appears twice. The symbols which precede these occurrences are 'b' and 'n', therefore the list of symbols which precede the context 'an' is [bn]. The symbols in a list are written in lexicographical order.

    The symbol list which precedes the order zero context for any given string is basically trivial to produce. The encoder encodes the count of each unique symbol type in the input (or the BWT, same thing) and the decoder decodes these counts to produce the same list.

    For annb$aa the encoder encodes 1$,3a,1b,2n. The list of symbols which precede the order zero context is [$aaabnn].

    From this, the order one context boundaries can be inferred. The single order zero context [$aaabnn] will be 'split' into four symbol lists representing the symbols which precede each of the four unique order one contexts '$', 'a', 'b' and 'n'.
    This means that [$aaabnn] will be split into the four lists [a][bnn][$][aa]. How the encoder encodes the way [$aaabnn] is distributed into [a][bnn][$][aa] is implementation specific. The decoder decodes that information and uses it to create the same lists [a][bnn][$][aa].

    What does [a][bnn][$][aa] indicate? It indicates that there are four unique order one contexts. (The model doesn't really need to know what those contexts are specifically to achieve compression. But it can be used to increase compression in more advanced implementations of the algorithm.) There are four unique order one contexts in 'banana$' and they are '$', 'a', 'b' and 'n'. '$' appears once and is preceded by 'a'. 'a' appears three times and is preceded by 'b' once and 'n' twice. 'b' appears once and is preceded by '$'. And 'n' appears twice and is preceded by 'a' both times.

    For clarity (I hope) we can display the order 0 and order 1 contexts as:

    order 0 -> [$aaabnn]         <-- one order 0 'parent' context
    order 1 -> [a][bnn][$][aa]   <-- four order 1 'child' contexts


    This information allows us to infer the order 2 contexts. Take for example the three 'a' symbols which are in the order zero context. At order zero all three are in the same list [$aaabnn], but they are 'split' into two of the lists at order one: [a][...][.][aa]. Therefore the order one 'a' context (which appears three times) will split at order two into two contexts, occurring once and twice respectively. Considering only what this split of the 'a' means for the moment:

    order 0 -> [$aaabnn]
    order 1 -> [a][bnn][$][aa]   <-- the three 'a' from order 0 split into two child lists, [a] and [aa]
                                     ([bnn] holds the symbols preceding the three order one 'a' contexts)
    order 2 -> [n][bn]           <-- therefore [bnn] will split at order two into [n][bn]


    Notice that the two 'n' in the order zero context DO NOT split into separate lists at order one. Therefore the order one 'n' context will not be split at order two. To illustrate:

    order 0 -> [$aaabnn]
    order 1 -> [a][bnn][$][aa]   <-- the two 'n' from order 0 do NOT split; both stay in the list [bnn]
                                     ([aa] holds the symbols preceding the two order one 'n' contexts)
    order 2 -> [aa]              <-- therefore [aa] will NOT split at order two


    So, using how the order zero symbols are split into the order one lists, we can infer the context boundaries for order two. The only information needed to encode the transition from order one to order two is that the order one list [bnn] transitions into the two order two lists [n][bn]. After this both the encoder and decoder have the same order two context information.

    order 0 -> [$aaabnn]
    order 1 -> [a][bnn][$][aa]
    order 2 -> [a][n][bn][$][aa] <-- [bnn] has split into [n][bn]


    The next step is to infer the order three context boundaries based on how the symbols in the order one context were split into the order two contexts. The only symbol which split was the 'n' which went from [bnn] to [n][bn]. Therefore the order three context boundaries will be:

    order 0 -> [$aaabnn]
    order 1 -> [a][bnn][$][aa]
    order 2 -> [a][n][bn][$][aa]   <-- the two 'n' are now split between [n] and [bn],
                                       so the list [aa] (preceding the two 'n' contexts) will split at order 3
    order 3 -> [a][n][bn][$][a][a] <-- order 3 contexts


    The encoder encodes nothing as the list [aa] transitioning into [a][a] can be completely inferred by the decoder.

    Finally, because [aa] has split at order three into [a][a] we can split the context which those two symbols represent:

    order 0 -> [$aaabnn]             <-- order 0 context
    order 1 -> [a][bnn][$][aa]       <-- order 1 contexts
    order 2 -> [a][n][bn][$][aa]     <-- order 2 contexts
    order 3 -> [a][n][bn][$][a][a]   <-- order 3 contexts: the two 'a' are now split into [a][a],
                                         so the list [bn] will split at order 4
    order 4 -> [a][n][n][b][$][a][a]


    The encoder encodes nothing, as the list [bn] transitioning into [n][b] can likewise be inferred by the decoder.
    All contexts are now unique and the BWT is known to be 'annb$aa'
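
    If you want to check the lists mechanically, here is a small brute-force sketch (only a sanity check, not how M03 itself works:
    M03 never sorts suffixes like this, it derives the splits incrementally). It just groups the BWT symbols by the first k characters
    of the suffix they precede and prints the lists for k = 0..4; the code is purely illustrative:

    Code:
    // Brute-force check of the walkthrough above (NOT how M03 works internally;
    // M03 never sorts suffixes like this, it derives the splits incrementally).
    // Group the BWT symbols by the first k characters of the suffix they precede
    // and print the lists for k = 0..4.
    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        const std::string s = "banana$";                   // '$' is the unique terminator
        const size_t n = s.size();

        std::vector<size_t> sa(n);                         // suffix array, brute force
        for (size_t i = 0; i < n; ++i) sa[i] = i;
        std::sort(sa.begin(), sa.end(),
                  [&](size_t a, size_t b) { return s.substr(a) < s.substr(b); });

        for (size_t k = 0; k <= 4; ++k) {
            std::printf("order %zu ->", k);
            std::string ctx, group;
            for (size_t i = 0; i <= n; ++i) {
                std::string c = (i < n) ? s.substr(sa[i], k) : std::string("\x01");
                if (i == 0) ctx = c;
                if (c != ctx) {                            // context boundary: flush the list
                    std::sort(group.begin(), group.end()); // symbols in lexicographical order
                    std::printf(" [%s]", group.c_str());
                    group.clear();
                    ctx = c;
                }
                if (i < n) group += s[(sa[i] + n - 1) % n];  // preceding (BWT) symbol
            }
            std::printf("\n");
        }
    }

    Its output matches the lists above, from [$aaabnn] at order 0 down to [a][n][n][b][$][a][a] at order 4.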

    I hope this is at least somewhat clear but it is probably still not (^:

    - Michael

  12. The Following User Says Thank You to michael maniscalco For This Useful Post:

    Bulat Ziganshin (3rd May 2016)

  13. #11
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    In fact it is extremely clear, because bce3 does exactly the same ^^

    Is there source for M03 available somewhere?
    Last edited by Christoph Diegelmann; 19th March 2015 at 23:27.

  14. The Following User Says Thank You to Christoph Diegelmann For This Useful Post:

    Alexander Rhatushnyak (4th May 2016)

  15. #12
    Member
    Join Date
    Dec 2011
    Location
    Germany / Hessen
    Posts
    18
    Thanks
    0
    Thanked 1 Time in 1 Post
    Mh, kind of funny...
    This somehow reminds me of the many things I successfully re-invented xD
    (probably many of you here have already experienced the same)

  16. #13
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    109
    Thanks
    9
    Thanked 80 Times in 25 Posts
    Quote Originally Posted by Christoph Diegelmann View Post
    Is there source for M03 available somewhere?
    I wrote my implementation in the weeks preceding the birth of my first child. I figured that I wouldn't have any time to return to the work once she was born. And I was correct (^:

    The source code is somewhere on a hard drive in a laptop that doesn't start anymore and, given how quickly I wrote it, it wouldn't be the best example of my work. But I'll see if I can dig it up. It's probably easier, however, for me to re-write it. And there are a few improvements which I never got to add to the original, so maybe a rewrite is the way to go. If I take that path I'll be sure to make the code available on GitHub and/or my own site.

    I'll post back here if I can dig up the original code.

    - Michael

  17. #14
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    109
    Thanks
    9
    Thanked 80 Times in 25 Posts
    Quote Originally Posted by Steve View Post
    Mh, kind of funny...
    This somehow reminds me of the many things I successfully re-invented xD
    (probably many of you here have already experienced the same)
    A valid invention just the same!
    Certainly our implementations of the idea are different. Plus there are many ideas to add to the original work too.

  18. #15
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    I had some time recently, so I modified bce3 to calculate the number of bits needed to encode a given char, to be able to see how it assigns probabilities.

    I made the data accessible at my domain:
    http://www.kta-design.de/releases/enwik8/
    http://www.kta-design.de/releases/text7/ // first 10 MB of text8

    Currently the colors are calculated like this:
    green channel = sqrt(probability);
    red channel = 1 - green channel;
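
    For reference, this is roughly how each pixel gets its color (a simplified sketch, not the exact code that generated the pages above):

    Code:
    // Simplified sketch of the color mapping (not the exact generator code):
    // probability -> green channel, its complement -> red channel, blue unused.
    #include <cmath>
    #include <cstdio>

    static void colorFor(double probability, unsigned char rgb[3]) {
        double g = std::sqrt(probability);               // green channel = sqrt(probability)
        rgb[0] = (unsigned char)(255.0 * (1.0 - g));     // red = 1 - green
        rgb[1] = (unsigned char)(255.0 * g);
        rgb[2] = 0;
    }

    int main() {
        unsigned char rgb[3];
        const double ps[] = {0.1, 0.5, 0.99};
        for (double p : ps) {
            colorFor(p, rgb);
            std::printf("p=%.2f -> rgb(%d, %d, %d)\n", p, rgb[0], rgb[1], rgb[2]);
        }
    }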

    I'd love to hear your thoughts about the output

  19. The Following User Says Thank You to Christoph Diegelmann For This Useful Post:

    willvarfar (4th May 2016)

  20. #16
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    "It's starting to look a lot like Christmas."

    It seems pretty cool... but be aware that several percent of men are color blind (8 percent of men with Northern European ancestry), most of them specifically red/green colorblind. So if you use color, the red-green axis is the one to avoid. Maybe somebody with a better understanding of color spaces can recommend a better axis.

  21. #17
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    I googled up a web page about color blindness, and found a cross-section of retina layers, with false color to emphasize rods vs. cones.

    [Image: Normal_Human_Retina.jpg]

    I guess their "normal human retina" picture is only for people with normal human retinas.

    https://nei.nih.gov/health/color_blindness/facts_about

  22. #18
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    I simply used red/green because of the traffic-light feeling: "this char is OK, this one is not". If someone wants an alternate coloring I'll add it.

  23. #19
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    Quote Originally Posted by michael maniscalco View Post
    it was found at http://www.juergen-abel.info/files/p...bwt_stages.pdf

  24. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Paul W. (4th May 2016)

