
Thread: Preprocessing using a low order model

  1. #1
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    249
    Thanks
    113
    Thanked 123 Times in 72 Posts

    Preprocessing using a low order model

    I have seen several programs use the strategy of first compressing data with a low-order model, followed by a second pass using higher-order models. One advantage seems to be speed: the low-order pass can quickly make the data smaller, so there is less data to process in the more expensive pass. Some example programs:

    m1x2 v0.6: http://mattmahoney.net/dc/text.html#1722
    preprocesses the input by pre-compressing it with an order-1 12 bit length limited Huffman code

    http://encode.su/threads/2150-A-new-...ll=1#post42980
    "BCE2 can compress static Huffman encoded files without much loss (results in ~23 MB for enwik8 )
    this speeds up compression and decompression without hurting compression much."

    mcm 0.8: http://mattmahoney.net/dc/text.html#1449
    uses LZP preprocessing

    I want to experiment with doing this type of preprocessing in cmix (http://www.byronknoll.com/cmix.html), after dictionary preprocessing. I am hoping someone here might be able to point me to some standalone programs (or code) that can do this type of preprocessing. I made one attempt at using LZP from here (http://www.cbloom.com/src/index_lz.html) - this made cmix faster but significantly hurt compression.
    Last edited by byronknoll; 5th May 2016 at 00:38.
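    To make the transform being discussed concrete, here is a minimal byte-oriented LZP precoder sketch. Everything in it is my own illustration: the hash, the escape byte, and the parameters are assumptions, not cbloom's index_lz code and not what mcm does.

    ```python
    def lzp_encode(data, min_match=4, table_bits=16):
        table = [0] * (1 << table_bits)          # last position seen per hash
        mask = len(table) - 1
        out = bytearray()
        i, n = 0, len(data)
        while i < n:
            p = 0
            if i >= 3:
                h = (data[i - 3] * 65521 + data[i - 2] * 257 + data[i - 1]) & mask
                p = table[h]
                table[h] = i
            l = 0
            if p:                                 # predicted position: measure match
                while i + l < n and l < 255 and data[p + l] == data[i + l]:
                    l += 1
            if l >= min_match:
                out += bytes((0xFF, l))           # escape byte + match length
                i += l
            elif data[i] == 0xFF:
                out += b"\xff\x00"                # escaped literal 0xFF (length 0)
                i += 1
            else:
                out.append(data[i])               # plain literal
                i += 1
        return bytes(out)

    def lzp_decode(comp, table_bits=16):
        table = [0] * (1 << table_bits)
        mask = len(table) - 1
        out = bytearray()
        j = 0
        while j < len(comp):
            i = len(out)
            p = 0
            if i >= 3:                            # same table update as the encoder
                h = (out[i - 3] * 65521 + out[i - 2] * 257 + out[i - 1]) & mask
                p = table[h]
                table[h] = i
            if comp[j] == 0xFF:
                l = comp[j + 1]
                j += 2
                if l == 0:
                    out.append(0xFF)
                else:
                    for k in range(l):            # byte-wise copy handles overlap
                        out.append(out[p + k])
            else:
                out.append(comp[j])
                j += 1
        return bytes(out)
    ```

    A real precoder would need a smarter escape scheme (here every literal 0xFF costs two bytes) and a longer hash context, but the shape of the transform is the same: only bytes at predicted positions can be collapsed.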

  2. Thanks:

    Christoph Diegelmann (12th May 2016)

  3. #2
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    I recently did some fast hacking on word-level Huffman coding as a first stage for bce2 (with text8
    for simplicity). Sadly this hurt compression quite heavily: the file even got bigger after encoding
    with bce2. I'd guess that it's not able to "learn" the encoding because of the big dictionary size.

    A context mixing program could probably use a simple order-0 Huffman tree internally to speed up
    without losing any compression. Instead of encoding the 8 bits of each char, it could simply encode the
    path through the Huffman tree. Contexts could remain almost the same: instead of shifting in
    the partially coded byte, one would simply add in the current node of the Huffman tree. At char
    boundaries the context could just revert to the normally used one.

    This would result in the same contexts, just fewer of them.
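    A rough sketch of the node-as-context idea above (my own illustration, with assumed names; not bce2 or cmix code). Each byte is coded as its path through an order-0 Huffman tree, and the per-bit context is the current tree node rather than the partially shifted-in byte; the number of contexts equals the number of internal nodes, at most alphabet_size - 1, versus the 255 partial-byte states of plain bitwise byte coding.

    ```python
    import heapq
    import itertools
    from collections import Counter

    def huffman_tree(freqs):
        """Build an order-0 Huffman tree; leaves are symbols, internal nodes 2-lists."""
        counter = itertools.count()               # tie-breaker so heap tuples compare
        heap = [(f, next(counter), sym) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, a = heapq.heappop(heap)
            f2, _, b = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(counter), [a, b]))
        return heap[0][2]

    def path_contexts(tree):
        """For each symbol, the (node_id, bit) steps used to code it: the context
        for each bit is the current Huffman node instead of the partial byte."""
        ids, paths = {}, {}
        def walk(node, path):
            if isinstance(node, list):            # internal node -> one context id
                nid = ids.setdefault(id(node), len(ids))
                walk(node[0], path + [(nid, 0)])
                walk(node[1], path + [(nid, 1)])
            else:
                paths[node] = path
        walk(tree, [])
        return paths, len(ids)                    # len(ids) = number of contexts
    ```

    On b"abracadabra" this gives 4 contexts (5 symbols minus 1), and frequent symbols get short paths, so the model also codes fewer bits per frequent byte.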

  4. #3
    Expert
    Matt Mahoney
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    LZP is actually a high order model when used as a preprocessor. It improves speed by making the input smaller, but in my experience it makes compression worse except for extremely long matches.

  5. #4
    Member
    Join Date
    Jun 2013
    Location
    Canada
    Posts
    36
    Thanks
    24
    Thanked 47 Times in 14 Posts
    Matt is right; in my experience, unconditionally using LZP will usually improve speed but hurt compression. What I do is conditionally use LZP based on the match model's match length.

    I.e.:
    If match length > ~15, use LZP for the byte.
    Else if match length > 0, use normal models + match model.
    Else use normal models with no match model.

    This approach improved compression ratio and speed for MCM on text (enwik). For binary data, LZP is usually good enough with smaller matches, so I always enable it there. Another thing that helps LZP compression ratio is using the mispredicted byte and o0 as the SSE context.
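    The per-byte selection logic described above can be sketched directly. The threshold and the return labels are illustrative assumptions, not MCM's actual code:

    ```python
    LZP_MIN_LEN = 15   # the "~15" threshold from the post; would be tuned per corpus

    def update_match_len(match_len, predicted_byte, actual_byte):
        """Match-model run length: grows while predictions hit, resets on a miss."""
        return match_len + 1 if predicted_byte == actual_byte else 0

    def select_coder(match_len):
        """Choose how to code the next byte from the current match-model run length."""
        if match_len > LZP_MIN_LEN:
            return "lzp"               # long run: code only a hit/miss flag via LZP
        elif match_len > 0:
            return "normal+match"      # mix the normal models with the match model
        else:
            return "normal"            # no active match: normal models only
    ```

    The point of the middle branch is that short matches still carry information worth mixing in, but are not reliable enough to replace the full model set.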

  6. Thanks:

    byronknoll (5th May 2016)

  7. #5
    Member
    Join Date
    May 2008
    Location
    brazil
    Posts
    163
    Thanks
    0
    Thanked 3 Times in 3 Posts
    In my opinion, preprocessing makes data dumber (compression makes data smarter). Preprocessing may increase the file size, but it creates better patterns for the compressor.
    Preprocessing depends on the compressor type.

  8. #6
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    249
    Thanks
    113
    Thanked 123 Times in 72 Posts
    Thanks for the replies. How about the approach of using an order-0 or order-1 static Huffman encoding? Can anyone recommend a program (or code) I can use to try out this preprocessing? Also, does anyone have a link to M1x2 source code? The download links here are broken: http://mattmahoney.net/dc/text.html#1722
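    In case no ready-made standalone tool turns up, order-0 static Huffman precoding is small enough to prototype directly. A minimal sketch (my own code, not M1x2's length-limited order-1 coder; there is no length limiting here, and a real precoder would also have to serialize the code table into the output):

    ```python
    import heapq
    import itertools
    from collections import Counter

    def huffman_codes(freqs):
        """Order-0 static Huffman: symbol -> bit-string code."""
        counter = itertools.count()               # tie-breaker for equal counts
        heap = [(f, next(counter), [sym]) for sym, f in freqs.items()]
        heapq.heapify(heap)
        codes = {sym: "" for sym in freqs}
        if len(heap) == 1:                        # degenerate one-symbol input
            codes[heap[0][2][0]] = "0"
            return codes
        while len(heap) > 1:
            f1, _, group1 = heapq.heappop(heap)
            f2, _, group2 = heapq.heappop(heap)
            for s in group1:
                codes[s] = "0" + codes[s]         # grow each code leaf-upward
            for s in group2:
                codes[s] = "1" + codes[s]
            heapq.heappush(heap, (f1 + f2, next(counter), group1 + group2))
        return codes

    def precode(data):
        """Replace each byte by its Huffman code and pack the bits into bytes."""
        codes = huffman_codes(Counter(data))
        bits = "".join(codes[b] for b in data)
        packed = bytes(int(bits[i:i + 8].ljust(8, "0"), 2)
                       for i in range(0, len(bits), 8))
        return codes, packed, len(bits)

    def undo_precode(codes, packed, nbits):
        """Inverse transform: walk the bit stream, emitting a byte at each code."""
        decode = {c: s for s, c in codes.items()}
        bits = "".join(f"{byte:08b}" for byte in packed)[:nbits]
        out, cur = bytearray(), ""
        for bit in bits:
            cur += bit
            if cur in decode:                     # codes are prefix-free
                out.append(decode[cur])
                cur = ""
        return bytes(out)
    ```

    For an order-1 variant along the lines of m1x2, one table of codes per previous byte would be built instead of a single table.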

  9. #7
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    279
    Thanks
    117
    Thanked 161 Times in 118 Posts
    Quote Originally Posted by byronknoll View Post
    Also, does anyone have a link to M1x2 source code? The download links here are broken: http://mattmahoney.net/dc/text.html#1722
    Look at Christopher Mattern's (encode.su "toffer") site, on the "M1 - Modeling and optimization in data compression" page: https://sites.google.com/site/toffer86/m1-project
    You'll find source code (at least) in the most recent archive, m1x2_0.6_100206.7z.
    Last edited by Mauro Vezzosi; 5th May 2016 at 10:43. Reason: Text tuning

  10. Thanks (2):

    byronknoll (5th May 2016),Matt Mahoney (6th May 2016)

  11. #8
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,238
    Thanks
    764
    Thanked 505 Times in 391 Posts
    And what about text preprocessors like DRT - could they be treated as a low-order model?
    DRT very often improves compression, and it reduces text files to about 60% of their original size, which means it speeds up compression by about 40%. And it's fast.

  12. #9
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    543
    Thanks
    239
    Thanked 94 Times in 74 Posts

  13. #10
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    249
    Thanks
    113
    Thanked 123 Times in 72 Posts
    @Darek/Gonzalo, cmix already does dictionary preprocessing (using the same transformation as paq8hp12). I am looking for a secondary preprocessing pass, after the dictionary encoding. The Huffman precoding used in M1x2 v0.6 looks promising.

  14. #11
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,238
    Thanks
    764
    Thanked 505 Times in 391 Posts
    Quote Originally Posted by byronknoll View Post
    @Darek/Gonzalo, cmix already does dictionary preprocessing (using the same transformation as paq8hp12). I am looking for a secondary preprocessing pass, after the dictionary encoding. The Huffman precoding used in M1x2 v0.6 looks promising.
    at least we tried...

