
Thread: Unrolling arithmetic coding.

  1. #1
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    493
    Thanks
    177
    Thanked 174 Times in 118 Posts

    Unrolling arithmetic coding.

    I've been exploring the idea of optimising byte-oriented arithmetic coding for a specific task. My data is largely stationary, although with correlations. This means I don't particularly need an adaptive model, so instead I can code blocks using static probabilities. I'm interested in anyone else's explorations into optimising this sort of code. What I have right now doesn't match things like FSE of course, but it's not THAT far off (within half the speed, for an order-1 entropy encoder).

    I started with one of Eugene Shelwien's range coders from the coders6c2.zip archive. I note the main encoder and decoder functions contain:

    Code:
    void Encode (uint cumFreq, uint freq, uint totFreq)
     {
       low  += cumFreq * (range/= totFreq);
       range*= freq;
       ...
     }
    
     uint GetFreq (uint totFreq) {
         return code/(range/=totFreq);
     }
    Here we have a number of divisions, which drastically slow things down. However for a static coder we can cheat by normalising our input frequencies so that they sum to a totFreq value that's a power of 2. I.e. the code becomes:

    Code:
    void Encode (uint cumFreq, uint freq)
     {
       low  += cumFreq * (range >>= TOTFREQ_BITS);
       range*= freq;
       ...
     }
    
     uint GetFreq (void) {
         return code/(range >>= TOTFREQ_BITS);
     }
    This gives a remarkable improvement in encoding speed, while decoding still suffers a bit due to the remaining division. I explored a few methods of replacing this with multiplications and bit-shifts, but by themselves they all had minimal benefit, and sometimes harmed things.
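    For reference, here's a minimal sketch of the kind of frequency normalisation this relies on. The function name, the TOTFREQ_BITS value and the 256-symbol alphabet are my illustration, not the actual arith_static.c code:

    ```c
    #include <stdint.h>

    #define TOTFREQ_BITS 14
    #define TOTFREQ (1u << TOTFREQ_BITS)

    /* Scale raw counts cnt[0..255] (summing to total) so the scaled
     * frequencies F[] sum to exactly TOTFREQ; any symbol that occurred
     * keeps a frequency of at least 1 so it stays codable. */
    static void normalise_freqs(const uint32_t cnt[256], uint32_t F[256],
                                uint64_t total)
    {
        uint32_t sum = 0, max_i = 0;
        for (int i = 0; i < 256; i++) {
            F[i] = cnt[i] ? (uint32_t)((uint64_t)cnt[i] * TOTFREQ / total) : 0;
            if (cnt[i] && F[i] == 0) F[i] = 1;   /* never lose a symbol */
            if (F[i] > F[max_i]) max_i = i;
            sum += F[i];
        }
        /* Dump the rounding error on the most frequent symbol. */
        F[max_i] = (uint32_t)((int32_t)F[max_i] + (int32_t)TOTFREQ - (int32_t)sum);
    }
    ```

    The cumulative table C[] then follows as a running prefix sum of F[].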

    However I then explored manual unrolling using multiple concurrent range coders, and this is where the real win comes from. For a basic order-0 codec with precomputed frequencies, my encoding and decoding loops started like this. (Note this is a C variant rather than the C++ one above.)

    Code:
        // Encoder tight loop
        RC_output(&rc, (char *)cp);
        RC_StartEncode(&rc);
        for (i = 0; i < in_size; i++) {
            unsigned char c = in[i];
            RC_Encode(&rc, C[c], F[c]);
        }
        RC_FinishEncode(&rc);
    
        //...
    
        // Decoder tight loop
        RC_input(&rc, (char *)cp);
        RC_StartDecode(&rc);
        for (i = 0; i < out_sz; i++) {
            uint32_t freq = RC_GetFreq(&rc);
            c = D.R[freq];
            RC_Decode(&rc, D.fc[c].C, D.fc[c].F);
            out_buf[i] = c;
        }
        RC_FinishDecode(&rc);
    With some manual loop unrolling we can get this:

    Code:
        // Tight encoding loop
        i_end = in_size&~3;
        for (i = 0; i < i_end; i+=4) {
            unsigned char c[4];
            c[0] = in[i+0];
            c[1] = in[i+1];
            c[2] = in[i+2];
            c[3] = in[i+3];
    
            RC_Encode(&rc[0], C[c[0]], F[c[0]]);
            RC_Encode(&rc[1], C[c[1]], F[c[1]]);
            RC_Encode(&rc[2], C[c[2]], F[c[2]]);
            RC_Encode(&rc[3], C[c[3]], F[c[3]]);
        }
    
        // Tidy up any remainder that isn't a multiple of 4
        for (; i < in_size; i++) {
            unsigned char c;
            c = in[i];
            RC_Encode(&rc[0], C[c], F[c]);
        }

    Code:
        // Tight decoding loop, unrolled 4-ways
        i_end = out_sz&~3;
        for (i = 0; i < i_end; i+=4) {
            uint32_t freq[4];
            unsigned char c[4];
    
            freq[0] = RC_GetFreq(&rc[0]);
            freq[1] = RC_GetFreq(&rc[1]);
            freq[2] = RC_GetFreq(&rc[2]);
            freq[3] = RC_GetFreq(&rc[3]);
    
            c[0] = D.R[freq[0]];
            c[1] = D.R[freq[1]];
            c[2] = D.R[freq[2]];
            c[3] = D.R[freq[3]];
    
            RC_Decode(&rc[0], D.fc[c[0]].C, D.fc[c[0]].F);
            RC_Decode(&rc[1], D.fc[c[1]].C, D.fc[c[1]].F);
            RC_Decode(&rc[2], D.fc[c[2]].C, D.fc[c[2]].F);
            RC_Decode(&rc[3], D.fc[c[3]].C, D.fc[c[3]].F);
    
            out_buf[i+0] = c[0];
            out_buf[i+1] = c[1];
            out_buf[i+2] = c[2];
            out_buf[i+3] = c[3];
        }
    
        // Tidy up any remainder that isn't a multiple of 4
        for (; i < out_sz; i++) {
            uint32_t freq = RC_GetFreq(&rc[0]);
            unsigned char c = D.R[freq];
            RC_Decode(&rc[0], D.fc[c].C, D.fc[c].F);
            out_buf[i] = c;
        }

    Calling RC_GetFreq 4 times on interleaved range coders, each with non-overlapping inputs and outputs, means the expensive divisions can be pipelined. E.g. see the Sandy Bridge tables at http://www.agner.org/optimize/, which indicate IDIV takes 9 cycles for 32-bit values but has a 20-27 cycle latency before the result is available. During that latency it appears we can be computing other results, including other divisions.
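    As a toy illustration of the effect (not the actual RC code; the TOTFREQ_BITS value and 4-wide layout here are my assumptions), the four divisions below sit on independent dependency chains, so an out-of-order core can have several in flight at once rather than serialising them as in the single-coder loop:

    ```c
    #include <stdint.h>

    #define TOTFREQ_BITS 12

    /* Four independent GetFreq-style steps: each division depends only
     * on its own coder's state, so their latencies can overlap. */
    static void get_freq4(uint32_t code[4], uint32_t range[4], uint32_t freq[4])
    {
        freq[0] = code[0] / (range[0] >>= TOTFREQ_BITS);
        freq[1] = code[1] / (range[1] >>= TOTFREQ_BITS);
        freq[2] = code[2] / (range[2] >>= TOTFREQ_BITS);
        freq[3] = code[3] / (range[3] >>= TOTFREQ_BITS);
    }
    ```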

    The net difference is that the code went from 5.7s encode and 16.4s decode on a 1GB file to 6.8s encode (slower) and 9.4s decode (much faster) on the same file. That's breaking 100MB/s.

    Order-1 modelling is a bit trickier, as we need the last byte output in order to select the frequencies for the next decode cycle. However if we position 8 codecs (I found 8 worked better than 4 for O1) 1/8th of the buffer apart, then they can all use their own last byte value fine, with the small exception that at startup we need 8 fresh "last" values. (I could store 7 of these in the file, but for now I just set all 8 to zero.) The effect of unrolling on order-1 is equally profound, giving around 100% throughput gains in decoding.
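    The offset bookkeeping for that layout is simple enough to sketch. The function name and the fixed 8-way split are mine; the real code would attach a range-coder state and a zeroed "last byte" context to each slice:

    ```c
    #include <stddef.h>

    /* Position 8 interleaved order-1 streams 1/8th of the buffer apart.
     * Stream j handles bytes [start[j], start[j]+len[j]); the final
     * stream absorbs the remainder so every byte is covered exactly
     * once. Each stream's "last byte" context is seeded to 0. */
    static void split8(size_t out_sz, size_t start[8], size_t len[8])
    {
        size_t chunk = out_sz / 8;
        for (int j = 0; j < 8; j++) {
            start[j] = (size_t)j * chunk;
            len[j]   = (j == 7) ? out_sz - start[j] : chunk;
        }
    }
    ```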

    I'm assuming this is something people have observed before, but I've never seen any public code exploiting it. The gains are really quite remarkable given that I haven't got any (explicit) SIMD/SSE calls in there. The code is unapologetically messy right now as it's very much a work in progress, but if anyone is interested I've uploaded a copy to our ftp server at
    ftp://ftp.sanger.ac.uk/pub/users/jkb/arith_static.c

    To compare with the non-unrolled version, comment out the #define UNROLLED at the start. Note the two variants aren't binary compatible with each other.

  2. Thanks (4):

    Bulat Ziganshin (28th January 2014),cade (28th January 2014),Cyan (28th January 2014),Matt Mahoney (29th January 2014)

  3. #2
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    493
    Thanks
    177
    Thanked 174 Times in 118 Posts
    I just noticed that version has a commented-out division in RC_GetFreq, replaced by precomputed multipliers, but it's a marginal change and can be put back. I can't be bothered to upload a new copy given it takes 15 mins for our ftp site to refresh (gah, no forced push method!).

    Anyway, this is very much a work in progress; I just thought someone here might be interested in the methods used, and I'm sure others can teach me more.

  4. #3
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Have you looked at the asm? Has the compiler vectorised it, or is it pure instruction-level parallelism?
    Eugene studied vectorised encoders a long time ago and the results were good; it's somewhere in this forum.

  5. #4
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    883
    Thanks
    478
    Thanked 276 Times in 116 Posts
    I ran into the same problem as you when experimenting with a block-based arithmetic coder (see Range0 - http://fastcompression.blogspot.fr/p...py-coders.html)
    Pipelining divisions is a really good idea to hide the operation's latency.

    It reminds me of some notes from Carmack, who used the same methodology in Quake 1.

  6. #5
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    155
    Thanks
    39
    Thanked 45 Times in 25 Posts
    Do you know why encoding is significantly slower after unrolling? There doesn't seem to be any reason for it. With SSE2, you would probably see worse results caused by the overhead of loading unaligned bytes.

  7. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,553
    Thanks
    767
    Thanked 685 Times in 371 Posts
    Quote Originally Posted by m^2 View Post
    Have you looked at the asm? Has the compiler vectorised it, or is it pure instruction-level parallelism?
    i seriously doubt that modern compilers can vectorize such code, imho they usually only manage it for a single operation such as a[i]=b[i]+c[i]. here we have 1) a division and 2) a normalization loop, so it's definitely beyond SSE's abilities. i wonder how Eugene managed it - probably he vectorized only part of the computations
    Last edited by Bulat Ziganshin; 28th January 2014 at 22:28.

  8. #7
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,553
    Thanks
    767
    Thanked 685 Times in 371 Posts
    Quote Originally Posted by cade View Post
    Do you know why encoding is significantly slower after unrolling? There doesn't seem to be any reason for it.
    the compiler needs to manage more variables, so we have more loads/stores. it may be fixed by encoding into 4 separate buffers sequentially

  9. #8
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    493
    Thanks
    177
    Thanked 174 Times in 118 Posts
    Quote Originally Posted by cade View Post
    Do you know why encoding is significantly slower after unrolling? There doesn't seem to be any reason for it. With SSE2, you would probably see worse results caused by the overhead of loading unaligned bytes.
    I haven't had time to investigate yet; this is all a bit hot off the press

    I can probably check for cache misses etc. using "perf stat" on linux. (A wonderful tool!)

  10. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,553
    Thanks
    767
    Thanked 685 Times in 371 Posts
    cache misses have nothing in common with misaligned load penalties, and for such a linear algorithm they should be near 0. afaik, unaligned SSE loads are free or almost free on modern Intel cpus

  11. Thanks:

    cade (29th January 2014)

  12. #10
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    493
    Thanks
    177
    Thanked 174 Times in 118 Posts
    I don't think there should be any unaligned accesses. The O1 unrolling positions each loop start on a boundary of 8 input bytes, so that seems reasonable. The O0 codec just does bytes 0,1,2,3 then 4,5,6,7, etc., so it is also aligned.

    I'm not sure if it is actually using SSE without explicitly looking at the generated assembly. The primary speed-up comes from avoiding the latency of the division, and from removing a further division in both encoder and decoder by fixing totFreq. Those divisions really stall the pipeline.

    However that still doesn't explain the slowdown in the encoder when unrolling. It's probably simply the additional memcpy required to paste the various output buffers together. I could have a hybrid affair where they share a common output buffer; it would create some small interdependency between codecs, but as this isn't done via multi-threading that's not particularly bad.

  13. #11
    Member
    Join Date
    Dec 2013
    Location
    Italy
    Posts
    342
    Thanks
    12
    Thanked 34 Times in 28 Posts
    If "char" is 8 bits, this could be slow.
    Remember that, typically, on "big" Intel CPUs (Core, back to the PPro) a load from memory fetches a 64-byte cache line.
    Therefore if you load 1 byte, 64-byte aligned, you get the remaining 63 "gratis" (at the same latency).
    And there can be micro-op coalescing (or not).
    So you really need the disassembly to understand what exactly happens.
    Code:
    for (i = 0; i < i_end; i+=4) {
            unsigned char c[4];
            c[0] = in[i+0];
            c[1] = in[i+1];
            c[2] = in[i+2];
            c[3] = in[i+3];

  14. #12
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    767
    Thanks
    236
    Thanked 244 Times in 149 Posts
    All these versions require 2 multiplications per step - you can reduce it to 1 using the range version of ANS instead: see the 8th page of http://arxiv.org/pdf/1311.2540

    For natural f[s] (quantization of probabilities, with sum_s f[s] = 2^n):
    freq[s] = f[s] / 2^n
    cumul[s] = sum_{i<s} f[i]
    mask = 2^n - 1
    symbol[x = 0 ... 2^n - 1] = the s with cumul[s] <= x < cumul[s+1] - which symbol is in the x-th position of the length-2^n range (0..01..1...) having f[s] appearances of symbol s

    encoding of symbol s requires 1 division and 1 modulo by f[s]:
    C(s,x) = (floor(x / f[s]) << n) + mod(x, f[s]) + cumul[s]

    decoding from x has one multiplication and needs the symbol[] table, or a function deciding in which subrange we are:
    s = symbol[x & mask]
    x = f[s] * (x >> n) + (x & mask) - cumul[s]
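    A single-symbol sketch of the two formulas above in C, assuming the state stays in range (renormalisation and the symbol[] lookup are omitted; the names are mine, not from the paper):

    ```c
    #include <stdint.h>

    #define N    12u                   /* total frequency is 2^n = 4096 */
    #define MASK ((1u << N) - 1)

    /* C(s,x): one division and one modulo by f = f[s]. */
    static uint32_t rans_encode_step(uint32_t x, uint32_t f, uint32_t cumul)
    {
        return ((x / f) << N) + (x % f) + cumul;
    }

    /* Inverse step: one multiplication; assumes s was already found
     * via symbol[x & MASK], giving us f = f[s] and cumul = cumul[s]. */
    static uint32_t rans_decode_step(uint32_t x, uint32_t f, uint32_t cumul)
    {
        return f * (x >> N) + (x & MASK) - cumul;
    }
    ```

    The decode step recovers exactly the state the encode step consumed, which is what makes the scheme a drop-in replacement for a range coder's Encode/GetFreq pair.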
    Last edited by Jarek; 29th January 2014 at 00:39.

  15. Thanks:

    JamesB (29th January 2014)

  16. #13
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,553
    Thanks
    767
    Thanked 685 Times in 371 Posts
    Quote Originally Posted by fcorbelli View Post
    If "char" is 8 bits, this could be slow.
    Remember that, typically, on "big" Intel CPUs (Core, back to the PPro) a load from memory fetches a 64-byte cache line.
    Therefore if you load 1 byte, 64-byte aligned, you get the remaining 63 "gratis" (at the same latency).
    that's true for loading from memory into the LLC (last level cache), but this code works with data in the L1 cache, so load delays are 4 cycles

  17. #14
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    493
    Thanks
    177
    Thanked 174 Times in 118 Posts
    Quote Originally Posted by m^2 View Post
    Have you looked at the asm? Has the compiler vectorised it, or is it pure instruction-level parallelism?
    Eugene studied vectorised encoders a long time ago and the results were good; it's somewhere in this forum.
    Thanks, I found http://encode.su/threads/1153-Simple...ll=1#post23197 now and his link to http://nishi.dreamhosters.com/u/vecrc_v0.rar

    It sounds like it is bit-wise, with the vectorisation being done to update 8 bits at a time. Certainly an interesting method. Oddly it crashes on exit (after doing all the encoding/decoding and flushing the results), and it also takes around a minute each way, so it's substantially slower. It does however do a reasonable job at compression, and it's obviously an adaptive model. My own hacked-up coder isn't so hot for general data, as the cost of storing all the symbol frequencies is quite high unless I boost the block size further. (However, it was designed for a specific purpose.)

  18. #15
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    There are reasons why the coders in modern applications like HEVC are all *binary* coders and not multi-symbol arithmetic coders. A division can really hurt performance, and even worse, it is hard to implement in hardware. Most modern approaches are entirely table driven, where the states in the table reflect both the partitioning of the interval and the probability estimate. In practice, nobody uses an arithmetic coder with multiplication and division any more. It's just too slow.

  19. #16
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    767
    Thanks
    236
    Thanked 244 Times in 149 Posts
    Regarding the range version of ANS (rANS) I mentioned - using a single multiplication instead of two for range coding - there is a nice explanation and comparison on Charles Bloom's blog ( http://cbloomrants.blogspot.com/ ), saying that the implementation by Fabian 'ryg' Giesen (I couldn't find details) decodes at 140-167 MB/s, about half the speed of FSE.
    Unfortunately encoding is also about twice as slow, probably because of the required division.

    update: Here is Fabian's implementation: https://github.com/rygorous/ryg_rans
    Last edited by Jarek; 3rd February 2014 at 00:03.

  20. Thanks:

    JamesB (3rd February 2014)

  21. #17
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    493
    Thanks
    177
    Thanked 174 Times in 118 Posts
    Quote Originally Posted by thorfdbg View Post
    There are reasons why the coders in modern applications like HEVC are all *binary* coders and not multi-symbol arithmetic coders. A division can really hurt performance, and even worse, it is hard to implement in hardware. Most modern approaches are entirely table driven, where the states in the table reflect both the partitioning of the interval and the probability estimate. In practice, nobody uses an arithmetic coder with multiplication and division any more. It's just too slow.
    Do you have any examples? I've tried fpaq0pv4b, which seems to be the best-performing binary coder in the fpaq series, and despite claiming to be a simple order-0 test for arithmetic coders, it's simply not able to keep up with a byte-wise arithmetic coder, despite the divisions.

    I'm sure there are ways of optimising the bitwise ones further, but I don't know of any open source code for this. Pointers would be most welcome.

    Meanwhile I'm exploring rANS. Thanks Jarek and Ryg.

  22. #18
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    493
    Thanks
    177
    Thanked 174 Times in 118 Posts
    Quote Originally Posted by Jarek View Post
    Regarding the range version of ANS (rANS) I mentioned - using a single multiplication instead of two for range coding - there is a nice explanation and comparison on Charles Bloom's blog ( http://cbloomrants.blogspot.com/ ), saying that the implementation by Fabian 'ryg' Giesen (I couldn't find details) decodes at 140-167 MB/s, about half the speed of FSE.
    Unfortunately encoding is also about twice as slow, probably because of the required division.

    update: Here is Fabian's implementation: https://github.com/rygorous/ryg_rans
    After a few gotchas and a bug fix (thanks, Fabian) I now have a version of my O0 and O1 coder using rANS as a more-or-less drop-in replacement for the byte-wise range coder.

    The results are indeed an improvement as predicted.

    On my 1GB relatively low-entropy file it is now encoding at 109MB/s and decoding at 167MB/s. This compares with the 139MB/s encode and 93MB/s decode of the range coder version (which wasn't exactly tardy, on a 2.2GHz Xeon E5-2660). I'm definitely happy with it, although it's perhaps a bit bleeding edge! I know a tANS version would be faster still, but I'm less sure about making an order-1 variant of that algorithm.

    Thanks all.

  23. Thanks:

    Jarek (4th February 2014)

  24. #19
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    493
    Thanks
    177
    Thanked 174 Times in 118 Posts
    I decided I'd upload the source and some prebuilt 64-bit linux binaries here.
    I tried cross-compiling for Windows and running under Wine, but it fails for some unknown reason. I don't know if this means it won't work under Windows or whether it is simply my crufty and hacky wrapper that isn't portable.

    Consider these simply as a demonstration, but I expect they'd do very well on the various benchmarks compared to most other O0 and O1 pure entropy encoders.

    One thing I found is that by unrolling/interleaving the rANS encoders they run pretty close to the tANS implementation in FSE. That's good news.
    Attached Files

  25. Thanks (2):

    Bulat Ziganshin (7th February 2014),ne0n (7th February 2014)

  26. #20
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    Quote Originally Posted by JamesB View Post
    Do you have any examples? I've tried fpaq0pv4b which seems to be the best performing binary coder in the fpaq series and despite claiming to be a simple order-0 test for arithmetic coders, it's simply not able to keep up with a byte wise arithmetic coder despite the divisions.
    My typical example would be the MQ-coder, though any of the MPEG coders would also work. I have an implementation I could share, but it is not optimized. In fact, you can get rid of all branches but one, and neither divisions nor multiplications are present - only shifts, additions and subtractions. The trick is of course to binarize the data correctly. Again, a trivial example would be the AC coding variant of 10918-1 (JPEG).

  27. #21
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,553
    Thanks
    767
    Thanked 685 Times in 371 Posts
    thorfdbg, it would be great if you could share code/explanations. i heard about that 10-20 years ago, but still don't understand how it works

  28. #22
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    493
    Thanks
    177
    Thanked 174 Times in 118 Posts
    So that's page 54 onwards of www.w3.org/Graphics/JPEG/itu-t81.pdf

    Code is available in http://www.ijg.org/files/jpegsrc.v9a.tar.gz jcarith.c, linked to from http://www.ijg.org/

    I've no idea how this compares speed-wise. I couldn't find a single function to compress or decompress an entire block of memory, rather just lots and lots of embedded calls to arith_encode and arith_decode to emit specific symbols.
    Last edited by JamesB; 7th February 2014 at 19:50.

  29. #23
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    155
    Thanks
    39
    Thanked 45 Times in 25 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    thorfdbg, it would be great if you could share code/explanations. i heard about that 10-20 years ago, but still don't understand how it works
    Here's a cleaner explanation of the closely related QM coder. For images, the idea is to feed it bit planes (e.g. the MSB of every byte, then the 2nd, 3rd, ... down to the LSB). I haven't tried it yet since arithmetic coding is fast enough for me right now, but I had problems when I had too many symbols, so I will probably try this later (with bit planes).

    http://www.cs.ucf.edu/courses/cap5015/QM_coder.pdf
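    A sketch of the bit-plane split described above (the function name and packing order are my choices): plane p collects bit p of every byte, MSB first, packed 8 to an output byte ready to feed a binary coder.

    ```c
    #include <stdint.h>
    #include <stddef.h>

    /* Extract bit-plane p (0 = MSB ... 7 = LSB) of n input bytes,
     * packing 8 plane bits per output byte. n is assumed to be a
     * multiple of 8 to keep the sketch short. */
    static void bit_plane(const uint8_t *in, size_t n, int p, uint8_t *out)
    {
        for (size_t i = 0; i < n; i += 8) {
            uint8_t b = 0;
            for (int j = 0; j < 8; j++)
                b = (uint8_t)((b << 1) | ((in[i + j] >> (7 - p)) & 1u));
            out[i / 8] = b;
        }
    }
    ```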

  30. #24
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    493
    Thanks
    177
    Thanked 174 Times in 118 Posts

    Updated version

    I updated the demonstration of the unrolling (technically I should call it interleaving or striping, I guess, as it's not unrolling in the usual sense of the word given that it produces different output) and fixed a few bugs found by testing on Matt's simple.zpaq archive of awkward files.

    Speed-wise it's largely unchanged. I also did some basic benchmarking, with manual code changes to test different amounts of interleaving/striping, on an Intel and an AMD system. As can be seen, it's beneficial in all cases. These are for the rANS_static demo:

    Code:
    rANS_static, manual unrolling tests in MB/s (encode/decode).

    Intel   1-way           2-way           4-way           8-way
    QQ-O0   138/109         156/167         159/225         130/184
    qq-O0   141/116         160/180         151/232         135/217

    QQ-O1    99/57          102/91           98/123          87/133
    qq-O1   112/91          114/148         113/187         111/194

    AMD     1-way           2-way           4-way           8-way
    QQ-O0    73/91           80/130          77/148          87/125
    qq-O0    90/99          101/140         102/169         102/149

    QQ-O1    58/34           56/53           54/69           51/66
    qq-O1    82/55           84/83           80/99           81/95

    Conclusion:
    O1 encoding should be 2-way done 4 times (hence 8-way).
    O1 decoding should be 8-way.
    As before, many thanks to Eugene Shelwien, Jarek Duda and Fabian Giesen for their code on which this is largely based.
    Attached Files

  31. Thanks (2):

    Bulat Ziganshin (12th February 2014),Cyan (12th February 2014)

