Results 1 to 29 of 29

Thread: CM4

  1. #1
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts

    CM4

    Here's one of two new context modellers I'm working on. It uses PAQ counters, a match model and SSE in the following way:

    The counters count up to order-10, but after each bit, the counter pointer will only advance to another position within 512 bytes of the original position, rather than a completely new index for every bit. Basically, index = hash + ctx0 + (ctx0 % 79 + ctx0 % 52). Accessing memory this way almost doubles the overall speed. Nibble-based hashing is a cleaner idea, but I found it had more collisions, so it's in the other (much faster) program.

    The match model is a circular buffer of text and a hash table mapping order-8 hashes to indices, so a match can be tested in O(1) time. If there is no match, the probability is taken from the counters (no SSE). Matches are broken when the expected match bit is wrong. Otherwise, SSE sets the probability to sse[6-bit quantized probability][((log2(match length) + bit index) << 1) | expected match bit] and updates itself with *sse += ((cur_bit << SSE_SCALE) - *sse) >> UPDATE_WEIGHT.

It achieves good compression on enwik, but it's too slow for practical use and needs a lot of memory (although it's just for benchmarking). I have some ideas from ccm and elsewhere in another, simpler CM (low-order probability + SSE; if matched, another SSE + match); I will post that when it's ready.
Attached Files
    Last edited by cade; 23rd January 2014 at 02:52. Reason: Fixed a term in the description

  2. The Following User Says Thank You to cade For This Useful Post:

    Matt Mahoney (23rd January 2014)

  3. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
Very nice. It moves up to 24th on LTCB, up from 38th. http://mattmahoney.net/dc/text.html#1707

  4. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    cade (23rd January 2014)

  5. #3
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Thanks for taking the time to test all of our compressors.

    Here's something small I found interesting with the encoder in PAQ8:
    Code:
    if (mode==DECOMPRESS) y=x<=xmid; else y=i;
    y ? (x2=xmid) : (x1=xmid+1);
Decoding needs two branches; switching it to this (a single branch) made a large difference in decoding speed: 0.41 vs 0.49 seconds in a test case with a simple CM.
    Code:
    int bit;
    x <= xmid ? (x2 = xmid, bit = 1) : (x1 = xmid + 1, bit = 0);
Back to experimenting. The range coder is 21-25% of the total encoding and decoding time for the fast CM4; I wonder if there are faster ways to do it.

  6. #4
    Member
    Join Date
    Jun 2008
    Location
    G
    Posts
    372
    Thanks
    26
    Thanked 22 Times in 15 Posts
Did you try FSE?

  7. #5
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
Not yet. Do you think it's appropriate for a binary range coder? The overhead for FSE seems a lot higher.

    This is what I have right now (encoder):
    Code:
uint32 xmid = (x2 - x1) >> PSCALE_BITS;
xmid *= pr;
xmid += x1;

bit ? (x2 = xmid) : (x1 = xmid + 1);

while (((x1 ^ x2) & FLUSH_THRESHOLD) == 0)
{
    io.ByteWrite(x2 >> 24);
    x1 <<= 8;
    x2 = (x2 << 8) | 0xFF;
}
    The branchless alternative to update x1 and x2 (range's high and low) is this:
    Code:
    const int bitmask = bit - 1;
    x1 = (bitmask & (xmid + 1)) + (~bitmask & x1);
    x2 = (bitmask & x2) + (~bitmask & xmid);
However, the additional instructions turn out to be slower than a branch. Besides this, I try to reduce memory look-ups elsewhere. I will post CM4 and the source when I can't improve the speed any further.

    Current results: 22.6 seconds to compress enwik8 to 23,124 KB, 25.7 seconds to decompress.

  8. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Quote Originally Posted by cade View Post
Decoding needs two branches; switching it to this (a single branch) made a large difference in decoding speed: 0.41 vs 0.49 seconds in a test case with a simple CM.
    Code:
    int bit;
    x <= xmid ? (x2 = xmid, bit = 1) : (x1 = xmid + 1, bit = 0);
Thanks. It seems I made the same mistake in ZPAQ. I just uploaded v6.48 with your optimization (in libzpaq::Decoder). A couple of tests:

    zpaq t enwik8-method-5.zpaq goes from 98 to 96 seconds on a 2.6 GHz M620, 64 bit Ubuntu.
    zpaq t enwik8-method-3.zpaq -threads 1 goes from 24.0 to 23.5 seconds on a 2.0 GHz T3200, Win32.

  9. #7
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
BTW I also tried some variations, including one with no branches:
    (Note: x1=low, x2=high, xmid=mid, bit=y).

    Code:
      // one branch, as used in libzpaq v6.48
      int y;
      if (curr<=mid) y=1, high=mid;
      else y=0, low=mid+1;
      ...
      return y;
    
      // one branch
      int y=(curr<=mid) ? (high=mid, 1) : (low=mid+1, 0);
      ...
  return y;
    
      // no branches
      int y=(curr>mid)-1;
      high=(mid&y)|(high&~y);
      low=((mid+1)&~y)|(low&y);
      ...
      return y&1;
    On a 2.0 GHz T3200, Win32, g++ 4.8.1 -O3 with and without -march=i686, the one-branch versions run at the same speed and faster than the no branch version. The test was "zpaq t enwik8-method-s7.0c0.0.255.255.255.zpaq -q" which ran in 57.0 seconds with 1 branch or 59.0 with none. The model is a simple indirect order 3 context model in streaming mode with a single block (1 thread) with decompressed output discarded after verifying its SHA-1 hash. I guess that the 1 branch version is optimized using CMOV instructions but I didn't check.

  10. #8
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
It's interesting that most range coders I've seen are identical to PAQ without changing this. It's also interesting that MSVC is equally fast with or without another branch; I guess it identifies the second branch as completely dependent on the first.

    Another small idea: I disable ASSERT in release mode, such as:
    Code:
    ASSERT(xmid >= x1 && xmid < x2);
    ASSERT(x1 <= x2);
    In any case, CM4 (not _EXT) doesn't do much per bit, which is why a large portion of its time is spent in the range coder. My next idea is to try 4 separate ranges in 128-bit SIMD registers and handle encoding 4 bits at a time, but I don't know if this will be a good idea. I've attached a test version. Decompression is still slower than compression for me, something I should investigate further.
Attached Files

  11. The Following User Says Thank You to cade For This Useful Post:

    Nania Francesco (25th January 2014)

  12. #9
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    778
    Thanks
    63
    Thanked 273 Times in 191 Posts
CM4_EXT Jan 21, 2014 crash:

    terminate called after throwing an instance of 'std::bad_alloc'
    what(): std::bad_alloc

    This application has requested the Runtime to terminate it in an unusual way.
    Please contact the application's support team for more information.

Jan 23, 2014 version works:

    601.512.592 bytes, IIS log:

    33.821.396 bytes, 17,055 sec. - 11,781 sec., WinZpaq - 3
    15.343.602 bytes, 67.876 sec. - 71.265 sec., cm4_2014_1_23
    15.043.561 bytes, 136,336 sec. - 118,527 sec., WinZpaq - 4

    967.021.395 bytes, html text:

    151.175.608 bytes, 59,966 sec. - 40,657 sec., WinZpaq - 3
    131.396.698 bytes, 245,108 sec. - 183,223 sec., WinZpaq - 4
    119.567.919 bytes, 125.584 sec. - 136.212 sec., cm4_2014_1_23

    574.219.767 bytes, testfile3:

    469.856.513 bytes, 143,413 sec. - 94,508 sec., WinZpaq - 3
    469.839.698 bytes, 282,575 sec. - 98,822 sec., WinZpaq - 4
    466.936.195 bytes, 156.434 sec. - 164.781 sec., cm4_2014_1_23

    1.000.000.000 bytes, enwik9:

    270.598.308 bytes, 105,064 sec. - 65,613 sec., WinZpaq - 3
    205.724.906 bytes, 152.455 sec. - 168.093 sec., cm4_2014_1_23
    189.859.859 bytes, 321,237 sec. - 221,361 sec., WinZpaq - 4

    All RAM-disk
    All 1 thread

    cm4 Jan 23, 2014
    zpaq64 v6.48 Jan 23 2014

  13. The Following 3 Users Say Thank You to Sportman For This Useful Post:

    cade (25th January 2014),Matt Mahoney (27th January 2014),samsat1024 (25th January 2014)

  14. #10
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
CM4_EXT requires a constant 1856 MB for its model, plus some overhead (I/O buffers), bringing it to ~1900 MB. If that much isn't available in main memory (without swapping), it can't run.

I'm surprised that there are faster scripts for zpaq. Do you have any further information about what 3 and 4 represent?

  15. #11
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    778
    Thanks
    63
    Thanked 273 Times in 191 Posts
    Quote Originally Posted by cade View Post
CM4_EXT requires a constant 1856 MB for its model, plus some overhead (I/O buffers), bringing it to ~1900 MB. If that much isn't available in main memory (without swapping), it can't run.

Do you have any further information about what 3 and 4 represent?
There is 15 GB of free memory available; the CM4_EXT crash happens directly after start, when 331,660 KB of memory is allocated.

    I used zpaq default -method 3 and -method 4.

  16. #12
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Some updates:

    CM4_EXT:
Compiled as 64-bit. Note that it won't work with files above 4 GB. I don't plan to continue this right now; it's just more and more piled on top of the normal CM4.

    CM4:
    Compiled as 64-bit. It supports files above 4 GB. Improvements in compression and decompression speeds.
Attached Files

  17. #13
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    778
    Thanks
    63
    Thanked 273 Times in 191 Posts
No more _EXT crash, and it's quicker:

    1,000,000,000 bytes, enwik9:

    205,724,908 bytes, 134.712 sec. - 151.064 sec., cm4_2014_1_25
    169,967,366 bytes, 1635.491 sec. - 1668.092 sec., cm4_ext_2014_1_25

  18. #14
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    zpaq -method 3 uses byte-aligned LZ77 followed by a context model of parse state + order 1 for literals.
    -method 4 usually uses BWT modeled with order 0-1 ICM-ISSE chain, but sometimes LZ77 like 3 based on (not very good) heuristics.
Both methods include a test for compressibility and may store uncompressed or use a faster LZ77 variant. They also use an E8E9 filter if x86 content is detected, deduplication, and multi-threading in 64 MB blocks compressed independently.

    I previously tested cm4_ext in Linux under Wine but I got the same out of memory error with 4 GB available. I will see if the 64 bit compile fixes the problem.

  19. #15
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
Interesting. I only do single-threading right now; I wonder if that's the major difference in speed (it probably is).

    The overhead of loading 4 probabilities and bits makes a SIMD binary range coder negligibly faster than the sequential version. It has a larger impact on arithmetic coding. Back to experimenting.

  20. #16
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Quote Originally Posted by cade View Post
    Some updates:

    CM4_EXT:
Compiled as 64-bit. Note that it won't work with files above 4 GB. I don't plan to continue this right now; it's just more and more piled on top of the normal CM4.

    CM4:
    Compiled as 64-bit. It supports files above 4 GB. Improvements in compression and decompression speeds.
    I get the following error running in Wine in 64 bit Ubuntu:
    Code:
    matt@matt-Latitude-E6510:~/Downloads$ ./CM4_2014_1_25.exe
    wine: Call from 0x7f192a8a0c67 to unimplemented function KERNEL32.dll.LCMapStringEx, aborting
    wine: Unimplemented function KERNEL32.dll.LCMapStringEx called at address 0x7f192a8a0c67 (thread 0022), starting debugger...
    fixme:dbghelp_dwarf:dwarf2_parse_subprogram Unhandled Tag type 0x26 at ctx(0x23a960,L"libc.so.6"), for debug_info(abbrev:0x468640,symt:0x52025c)
    fixme:dbghelp_dwarf:compute_location Only supporting one breg (rbx/329 -> r8/336)
    fixme:dbghelp_dwarf:dwarf2_parse_subprogram Unhandled Tag type 0x26 at ctx(0x23a960,L"libc.so.6"), for debug_info(abbrev:0x197a5e0,symt:0x19f9544)
    fixme:dbghelp_dwarf:dwarf2_parse_subprogram_block Unhandled Tag type 0xf at ctx(0x23a960,L"libc.so.6"), for debug_info(abbrev:0x19ba7d8,symt:(nil))
    fixme:dbghelp_dwarf:dwarf2_parse_subprogram_block Unhandled Tag type 0xf at ctx(0x23a960,L"libc.so.6"), for debug_info(abbrev:0x19ba7d8,symt:(nil))
    matt@matt-Latitude-E6510:~/Downloads$ ./CM4_EXT_2014_1_25.exe
    wine: Call from 0x7fbc54854c67 to unimplemented function KERNEL32.dll.LCMapStringEx, aborting
    wine: Unimplemented function KERNEL32.dll.LCMapStringEx called at address 0x7fbc54854c67 (thread 0024), starting debugger...
    fixme:dbghelp_dwarf:dwarf2_parse_subprogram Unhandled Tag type 0x26 at ctx(0x23a960,L"libc.so.6"), for debug_info(abbrev:0x489640,symt:0x53122c)
    fixme:dbghelp_dwarf:compute_location Only supporting one breg (rbx/329 -> r8/336)
    fixme:dbghelp_dwarf:dwarf2_parse_subprogram Unhandled Tag type 0x26 at ctx(0x23a960,L"libc.so.6"), for debug_info(abbrev:0x199b5e0,symt:0x1a1a544)
    fixme:dbghelp_dwarf:dwarf2_parse_subprogram_block Unhandled Tag type 0xf at ctx(0x23a960,L"libc.so.6"), for debug_info(abbrev:0x19db7d8,symt:(nil))
    fixme:dbghelp_dwarf:dwarf2_parse_subprogram_block Unhandled Tag type 0xf at ctx(0x23a960,L"libc.so.6"), for debug_info(abbrev:0x19db7d8,symt:(nil))

  21. #17
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    I use MSVC 2012 to compile the 64-bit versions. I don't know if it will work with Wine, but according to this, it was fixed:
    http://bugs.winehq.org/show_bug.cgi?id=29730

  22. #18
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    I guess the fix for Ubuntu isn't there. I tried "sudo apt-get install wine" to make sure I had the latest version, but it didn't help.

    To make 64 bit .exe I use g++ 4.8.1 rev 5 -m64 in Win32 from http://sourceforge.net/projects/ming...ds-win32/sjlj/
    The g++ from http://www.mingw.org/ doesn't support -m64 in Win32.

    BTW I didn't know if I should choose threads-win32 or thread-posix or choose dwarf or sjlj, so I guessed.

    I also have MSVC cl Version 16.00.30319.01 for 80x86 but it won't cross compile 64 bit executables.

  23. #19
    Member
    Join Date
    Aug 2013
    Location
    Greece
    Posts
    55
    Thanks
    37
    Thanked 21 Times in 15 Posts
    To compile 64-bit windows exe using mingw from the Ubuntu repositories, you should install package "g++-mingw-w64-x86-64" from the universe repos (I am using Ubuntu 64-bit 13.10).

    example for zpaq64.exe:
    x86_64-w64-mingw32-g++ -O3 -s -m64 -static -DNDEBUG zpaq.cpp libzpaq.cpp divsufsort.c -fopenmp -o zpaq64.exe

    It works fine under wine 1.7.11.

    EDIT: ...but it is an older version (4.6.3)
    Last edited by msmaniac; 27th January 2014 at 20:36.

  24. #20
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    I couldn't get 64-bit to build on Ubuntu, so here's a new 64-bit version for Windows and a 32-bit version for Ubuntu. Building with MSVC gives speeds 10-15% faster than with g++. In this version, I improved the match model (better compression, same speed).
Attached Files

  25. #21
    Member ivan2k2's Avatar
    Join Date
    Nov 2012
    Location
    Russia
    Posts
    35
    Thanks
    13
    Thanked 6 Times in 3 Posts
    CM4_2014_1_27.exe tested on Silesia Corpus files.

    Code:
    dickens      10192446
    dickens.cm4   2467453
    
    mozilla      51220480
    mozilla.cm4  15805586
    
    mr            9970564
    mr.cm4        2285812
    
    nci          33553445
    nci.cm4       1689588
    
    ooffice       6152192
    ooffice.cm4   2623290
    
    osdb         10085684
    osdb.cm4      2589065
    
    reymont       6627202
    reymont.cm4   1182180
    
    samba        21606400
    samba.cm4     3592239
    
    sao           7251944
    sao.cm4       4851965
    
    webster      41458703
    webster.cm4   7075569
    
    x-ray         8474240
    x-ray.cm4     3931779
    
    xml           5345280
    xml.cm4        423068
    Decompression is ok.
    Last edited by ivan2k2; 27th January 2014 at 21:58.

  26. #22
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    778
    Thanks
    63
    Thanked 273 Times in 191 Posts
    1,000,000,000 bytes, enwik9:

    204,056,278 bytes, 134.579 sec. - 156.662 sec., cm4_2014_1_27

  27. #23
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    More information and full source posted:
    http://nauful.com/pages/cm4.php

  28. The Following 4 Users Say Thank You to cade For This Useful Post:

    Bulat Ziganshin (1st February 2014),Mat Chartier (31st January 2014),Mike (1st February 2014),samsat1024 (30th January 2014)

  29. #24
    Member
    Join Date
    Jun 2013
    Location
    Canada
    Posts
    36
    Thanks
    24
    Thanked 47 Times in 14 Posts
Good new compressor! Though the sources are strangely similar to a certain CCM compressor.

  30. #25
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
Thanks! I wrote in the first post that I had some ideas from ccm and other CMs, but they are different ideas: the counter model is PAQ (a similar generator was posted by Matt Mahoney somewhere before), the match model is different, the SSE is completely unrelated (the only similarity is the mixing of o2 + o4), etc.

The only "new" idea I found from ccm was not doing a hash look-up every bit, and even then, the look-up table approach (edit: for advancing) was not so interesting to me. In fact, the most interesting idea I found (new for me, and only for me) was this:
    http://encode.su/threads/1215-SSE%28o2-o4%29-CM-coder
    Last edited by cade; 3rd February 2014 at 19:39.

  31. #26
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    > The only "new" idea I found from ccm was not doing a hash look-up every bit

    PAQ does 3 hash lookups per byte. ZPAQ does 2. There is a space-time tradeoff. A PAQ context hash points to 8 bytes (7 counters and a checksum). A ZPAQ ICM or ISSE context hash points to 16 bytes.

  32. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    cade (3rd February 2014)

  33. #27
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    To be specific, I was referring to doing just one look-up for offset zero, then advancing with nibbles to process the entire byte with o0. It's probably not a new idea for others (it's in encode and decode in sr2 and newer), just something I hadn't seen before.

  34. #28
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    OK, I think you mentioned that. After computing the byte hash, you can do a quick computation to get the second nibble hash index. I did something like this in ZPAQ. The first hash is computed in ZPAQL. The second hash is computed by adding (16 + first nibble) x 16 to the first hash. Both result in one cache miss and a search of 3 buckets to find the matching check byte or lowest priority bucket to replace (at most 5 branch instructions). To stay within a 64 byte cache line boundary, it computes hash indexes of H, H xor 1, H xor 2 for 16 byte elements.

  35. #29
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    131
    Thanks
    31
    Thanked 29 Times in 19 Posts
    Here are enough small updates for a new version: smaller model, new mixer, faster match model, replaced the silly state generator with a decent one, and more cache-friendly structures (for example, advancing is next[bit][cur] instead of state[cur].next[bit]).

Ratio is worse because I use a lot less memory (~15 MB now), but speed is faster with the updates (6.5 MB/sec compressing enwik8).

    enwik8:
    25,117 KB in 15.4 seconds, decompresses in 16.7 seconds. Not so great for speed or ratio, just an experiment.
Attached Files
