
Thread: Rans tricks

  1. #1
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    159
    Thanks
    23
    Thanked 68 Times in 39 Posts

    Rans tricks

    I managed to optimize rans close to Finite State Entropy's speeds without using platform specific intrinsics or assembly: https://github.com/loxxous/rans-tricks

    I doubt this is worth a forum post, but hey, there's no point keeping it to myself.

    I've included three exes: rans_std uses Fabian's original rans_byte header; rans_trick_8 uses 8-bit branchless renormalization (identical compression to rans_std, just faster); rans_trick_16 uses 16-bit branchless renormalization, which is a little faster still since it takes a non-iterative approach.

    Usage example: "rans_std.exe D:\downloads\file.dat". It only benchmarks files up to 100 MB because I'm lazy; it's mainly just a proof of concept.

    In my tests, adding more than 4 states does not improve performance, since at that point it's limited by operation dependencies. The next step would be intrinsics, but that's rans_static's territory.
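
    To illustrate the idea (this is a sketch, not the attached rans_trick_16 code): with a lower bound of 1 << 16 and 16-bit probabilities, at most one 16-bit refill is ever needed, so the decoder renorm loop collapses to straight-line, branch-free code. `renorm16` and `RANS_L` here are illustrative names:

    ```c
    #include <stdint.h>

    #define RANS_L (1u << 16)  /* assumed lower bound of the state interval */

    /* Branchless 16-bit decoder renormalization: refills at most once per
       symbol, so no loop is needed. Reads 2 bytes speculatively and only
       advances the pointer when the refill is actually taken. */
    static inline uint32_t renorm16(uint32_t x, const uint8_t **pptr)
    {
        const uint8_t *ptr = *pptr;
        uint32_t s  = x < RANS_L;                            /* 1 if refill needed */
        uint32_t nx = (x << 16) | (ptr[0] | (ptr[1] << 8));  /* speculative refill */
        *pptr = ptr + 2 * s;                                 /* consume only if used */
        return (-s & (nx - x)) + x;                          /* select nx or x */
    }
    ```

    Note the speculative read: like the 8-bit version posted below, it can read up to 2 bytes past the end of the stream, so the input buffer needs a little padding.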

    Let me know what you think.
    Attached Files
    Last edited by Lucas; 27th April 2018 at 08:25. Reason: Updated trick_16 with branchless encoding

  2. Thanks (6):

    Bulat Ziganshin (15th May 2018),Cyan (27th April 2018),JamesB (27th April 2018),Jarek (30th April 2018),jibz (27th April 2018),Stephan Busch (27th April 2018)

  3. #2
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    485
    Thanks
    167
    Thanked 166 Times in 114 Posts
    The code in RansEncPutSymbol is definitely a nice improvement, although a bit compiler-specific in terms of gains. I experimented with trying to get cmov instructions used to avoid the branching, but it's very hard to get compilers to use them!

    With gcc 7.1 I found it improved encoding speed on poorly compressible data and harmed speed on highly compressible data, presumably due to how well the branch predictor is working. With clang 5.0 it was faster on both (although again much better on harder-to-compress data).

    The RansEncRenorm looks to be doing 16-bit normalisation instead of 8-bit, which differs from Ryg's original rans_byte. I guess this is where much of the extra (initial) speed is coming from. Good gains. I ought to try this myself, as encode speed has always been my weakness. Food for thought!

    Edit: that system was pretty old. On a newer machine it's not losing time even on low-entropy data, and gains elsewhere. Nice trickery.
    Last edited by JamesB; 27th April 2018 at 17:49.

  4. #3
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    159
    Thanks
    23
    Thanked 68 Times in 39 Posts
    The decoder re-normalization I did for trick_8 was this:



    Code:
    uint8_t *ptr = *pptr;
    int32_t s = x < RANS_BYTE_L;    /* 1 if renormalization is needed, else 0 */
    uint32_t tx = x;
    do
    {
        tx = (tx << 8) | *ptr;      /* speculatively shift in the next byte */
        ptr += s;                   /* only consume input when s == 1 */
    }
    while (tx < RANS_BYTE_L);
    *pptr = ptr;
    x = (-s & (tx - x)) + x;        /* branchless select: tx if s, else x */



    It avoided a branch but still had to iterate; I only posted the source for the 16-bit version since it's the fastest.

  5. #4
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    136
    Thanks
    20
    Thanked 93 Times in 30 Posts
    Quote Originally Posted by Lucas View Post
    x = (-s & (tx - x)) + x;
    I use the same trick in my M99 coder. I also found it faster than cmov.
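
    The masked-select identity both posts rely on, written out as a standalone helper (`select_u32` is an illustrative name, not from either codebase):

    ```c
    #include <stdint.h>

    /* Branchless select: for s in {0,1}, returns (s ? a : b).
       When s == 1, -s is all-ones, the mask keeps (a - b), and
       b + (a - b) == a; when s == 0, the mask zeroes it, leaving b. */
    static inline uint32_t select_u32(uint32_t s, uint32_t a, uint32_t b)
    {
        return (-s & (a - b)) + b;
    }
    ```

    In the renorm above, a = tx, b = x, and s = (x < RANS_BYTE_L), which is exactly `x = (-s & (tx - x)) + x;`.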

  6. #5
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    485
    Thanks
    167
    Thanked 166 Times in 114 Posts
    Cmov can be faster, but it's hard to get it used by the compiler.

    Benchmarking the above code vs some assembly using cmov on two files that encode to 1.84 bits per byte and 4.54 bits per byte gives me these decode speeds (best of 10):

    asm cmov: 507.7 MB/s, 495.3 MB/s
    no-branch: 414.8 MB/s, 411.8 MB/s
    original ryg: 358.1 MB/s, 365.8 MB/s

    The assembly version is:

    Code:
    static inline void RansDecRenorm(RansState* r, uint8_t** pptr) {
        uint32_t x = *r;
        uint8_t  *ptr = *pptr;
        __asm__ ("movzbl (%0), %%eax\n\t"
                 "mov    %1, %%edx\n\t"
                 "shl    $0x8,%%edx\n\t"
                 "or     %%eax,%%edx\n\t"
                 "cmp    $0x800000,%1\n\t"
                 "cmovb  %%edx,%1\n\t"
                 "adc    $0x0,%0\n\t"
                 : "=r" (ptr), "=r" (x)
                 : "0" (ptr), "1" (x)
                 : "eax", "edx"
                 );
        if (x < 0x800000) x = (x << 8) | *ptr++;
        *pptr = ptr;
        *r = x;
    }
    It's only half-branchless (cmov for the first step only) because the second renorm step is rarely triggered and very predictable by the CPU. The 16-bit renorm version runs at around 660 MB/s instead. Insane 32-way unrolling with AVX2 reaches 1.4 GB/s, but that's a bit extreme and has other issues with small data sets.

    Clang prefers the assembly too, while icc runs tragically slowly on it (about half speed) but does a much better job on Lucas' branchless variant. The branchless code certainly seems to be the optimal case for pure C, but a bigger question is whether we can get compilers to generate code like the assembly above, maybe with hints about the predictability of branching?
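
    One available hint is gcc/clang's `__builtin_expect`, sketched below on the byte-wise renorm. It's only a hint: it mainly influences block layout, and whether it produces anything resembling the cmov sequence is compiler- and version-dependent.

    ```c
    #include <stdint.h>

    #define RANS_BYTE_L (1u << 23)  /* lower bound, as in rans_byte */

    /* Byte-wise decoder renorm with a branch-probability hint: tells
       gcc/clang the refill test is usually true. No guarantee of cmov;
       measure against the branchless-arithmetic variant on real data. */
    static inline void dec_renorm_hinted(uint32_t *r, uint8_t **pptr)
    {
        uint32_t x = *r;
        uint8_t *ptr = *pptr;
        while (__builtin_expect(x < RANS_BYTE_L, 1))
            x = (x << 8) | *ptr++;   /* shift in the next stream byte */
        *pptr = ptr;
        *r = x;
    }
    ```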

  7. Thanks:

    RamiroCruzo (30th April 2018)

  8. #6
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    For what it's worth, the ryg_rans github repo contains a word-based renorm coder in rans_word_sse4. You can use the word-based renorm part without the SSE4.1 4-way interleaved decoder part.

    Also, I wish to emphasize that ryg_rans was written for clarity, not speed!

    I just put up what I consider a "production quality" 2x interleaved rANS implementation (straight from BitKnit) here: https://gist.github.com/rygorous/b43...796b5f96671cbe (I just copy&pasted the code from BitKnit, apologies if I missed some dependency.)

    BitKnit uses 15-bit probabilities, 16-bit (word-granularity) renormalization, and a 2x alternating-stream-interleaved setup (as described way back in https://fgiesen.wordpress.com/2015/1...s-in-practice/). I've found this to be a fairly sweet spot for rANS coding with adaptive probabilities. [If you're purely static, you should be using tANS/FSE instead.] If you have any adaptation in the loop, it's hard to get a win from more than 2 streams, in my tests anyway, and the alternating-stream trick is really nice for 2 streams in particular since it basically turns into nothing.

    It also has a few other features like raw "bypass" coding (sending a few bits pass-through) and implicitly juggles 2 alternate streams without you having to do any manual interleaving - much more convenient to use when you have multiple paths through an encoder/decoder. The idea was to make an encoder that's basically a plug-in replacement for a regular arithmetic encoder, while still getting some of the nice rANS benefits. It assumes you're flushing the rANS encoder every 64k of input data so the encoder output buffer space is nice and bounded. This one's written for adaptive probabilities (because that's what BitKnit uses) so there's no attempt to speed up the divisions or anything like that.
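
    The alternating-stream control structure can be sketched like this (an illustration of the idea, not the gist's code; `Rans2` and `rans2_advance` are made-up names):

    ```c
    #include <stdint.h>

    typedef uint32_t RansState;

    /* Two interleaved rANS streams via state swapping: code each symbol
       with `cur`, then swap so the other stream becomes current. The swap
       typically compiles down to register renaming, i.e. "basically
       nothing", which is why 2x interleaving this way is nearly free. */
    typedef struct { RansState cur, other; } Rans2;

    static inline void rans2_advance(Rans2 *r)
    {
        RansState t = r->cur;   /* after coding a symbol with r->cur ... */
        r->cur = r->other;      /* ... the other stream takes over */
        r->other = t;
    }
    ```

    Because the swap happens every symbol, even-indexed symbols land on one stream and odd-indexed ones on the other, with no explicit index arithmetic anywhere.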

  9. Thanks:

    JamesB (1st May 2018)

  10. #7
    Programmer
    Join Date
    Feb 2007
    Location
    Germany
    Posts
    420
    Thanks
    28
    Thanked 153 Times in 18 Posts
    I can confirm Fabian's results.

    During rz's design I had to decide between adaptive or semi-static modeling. I went fully adaptive using rANS. I did lots of experiments but ended up using two alternating states and 16-bit renormalization, too.

    Alternating states are very convenient and provide good performance. I will probably change back to explicit interleaving once the syntax and everything is frozen. Explicit interleaving can be faster, but it's quite an obstacle if you're experimenting with your codec's syntax.

    With 16-bit renormalization you don't need a loop or anything, if you keep your probability precision narrow enough. And if you're doing pointer arithmetic instead of branches, it doesn't make a difference.
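
    Why the loop disappears on the encoder side, as a sketch (not rz's code; names and constants are assumptions): with probability precision p <= 16 bits and 16-bit output, one flush always restores the state invariant, because the renorm threshold x_max is then at least the lower bound L for every nonzero frequency.

    ```c
    #include <stdint.h>

    #define RANS_L    (1u << 16)  /* assumed lower bound of the state interval */
    #define PROB_BITS 15          /* assumed probability precision, must be <= 16 */

    /* Loop-free encoder renormalization: a single test and at most one
       16-bit flush. rANS emits backwards, hence the pre-decrement. */
    static inline uint32_t enc_renorm16(uint32_t x, uint16_t **pptr, uint32_t freq)
    {
        uint32_t x_max = ((RANS_L >> PROB_BITS) << 16) * freq;
        if (x >= x_max) {
            *--(*pptr) = (uint16_t)x;  /* flush low 16 bits, once at most */
            x >>= 16;
        }
        return x;
    }
    ```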

  11. Thanks (2):

    Bulat Ziganshin (15th May 2018),Jarek (7th May 2018)

  12. #8
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    730
    Thanks
    227
    Thanked 230 Times in 141 Posts
    Hamid found a fresh poster from the ALICE experiment at the LHC: https://indico.cern.ch/event/773049/...584/Layout.pdf
    They have e.g. 25-bit values; I had suggested binning into smaller alphabets, but I see they ended up using these huge 25-bit alphabets directly with rANS, getting ~700 MB/s/core encoding:
    [Attached image: ALICE.png]

  13. Thanks:

    Shelwien (3rd November 2019)

  14. #9
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    560
    Thanks
    65
    Thanked 198 Times in 147 Posts
    The "binning into smaller alphabets" described here is similar to the extra bits used, for example, in zlib, lzma, lzx, ...
    Depending on the data set, 16 bins can be too few.

    There are many possibilities to encode large values without using large alphabets:
    - Extra bits as in lzma, encoding the exponent with a bitwise/bytewise range coder or ANS.
    - Gamma / length-limited Golomb coding: using a bitwise entropy coder, first encode the length, then encode the remaining bits using the length as context.
    - "Group varint" (variable byte, like TurboByte in TurboPFor): encode the lengths (2 bits x 4 values), then encode each byte (1..4) with a bytewise range coder or ANS using the length as context (5 RC/ANS tables).
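
    The group-varint control byte from the last bullet, sketched in C (illustrative helper names; the entropy-coding of the payload bytes with the length as context is left out):

    ```c
    #include <stdint.h>

    /* Byte length (1..4) of a 32-bit value, as used by group varint. */
    static inline uint32_t byte_len(uint32_t v)
    {
        return v < (1u << 8) ? 1 : v < (1u << 16) ? 2 : v < (1u << 24) ? 3 : 4;
    }

    /* Pack four lengths into one control byte: 2 bits per value,
       storing (length - 1) so the four cases fit in 0..3. */
    static inline uint8_t control_byte(const uint32_t v[4])
    {
        return (uint8_t)((byte_len(v[0]) - 1)
                       | ((byte_len(v[1]) - 1) << 2)
                       | ((byte_len(v[2]) - 1) << 4)
                       | ((byte_len(v[3]) - 1) << 6));
    }
    ```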

  15. #10
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    730
    Thanks
    227
    Thanked 230 Times in 141 Posts
    All such direct writing of the lowest bits assumes a uniform distribution there - we pay a Kullback-Leibler penalty of extra bits per symbol if it isn't.

    On the other hand, directly using e.g. 25-bit alphabets and storing such probabilities requires ~100 MB headers for the probability distribution ... hopefully they parametrize it somehow.

  16. Thanks:

    dnd (4th November 2019)

  17. #11
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    560
    Thanks
    65
    Thanked 198 Times in 147 Posts
    In the extra-bits case, 25-bit values can be split into an 8-bit exponent + 21-bit mantissa and encoded like this:

    1 - encode the 8-bit exponent (order-0, or order-1 using some previous context) (ANS: header for 256 symbols)
    2 - encode the first 8 MSBs of the mantissa using the exponent as context (ANS: header for 256x256 symbols)
    3 - encode the few remaining bits with direct bit I/O (0 to 13 bits)

    Assuming a header of ~100 bytes per 256-symbol table, you need only ~25k for the ANS headers and only ~256k = (256+64k)x4 bytes of RAM.
    Storage and memory requirements can be further reduced by using a 6-bit exponent.

    You'll get similar requirements with length-limited Golomb coding.

  18. Thanks:

    Jarek (4th November 2019)

  19. #12
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    730
    Thanks
    227
    Thanked 230 Times in 141 Posts
    Indeed, entropy coding is generally only useful for nonuniform distributions - hence we should use it for the higher bits (the "bin number"), while the lowest bits can be written directly. This assumes a uniform distribution among them, i.e. approximating the empirical CDF (sorted values) with straight segments, as below.
    However, some improvement can usually be obtained by using variable-size bins adapted to the actual distribution; e.g. the value 0 has many occurrences below (from https://arxiv.org/pdf/1511.00856 ), suggesting a separate bin for it:

    [Attached image: bin1.png]

  20. #13
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    560
    Thanks
    65
    Thanked 198 Times in 147 Posts
    Thanks, very good visualization

  21. #14
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    560
    Thanks
    65
    Thanked 198 Times in 147 Posts

    Quote Originally Posted by Jarek View Post
    ...usually some improvement can be obtained if using various size bins - adapted to the actual distribution, e.g. 0 value has many appearances below (from https://arxiv.org/pdf/1511.00856 ) - suggesting to use a separate bin for it:
    Using "extra bits", the first values (4 of them, for a 6-bit exponent) are automatically encoded with full entropy coding (zero direct bits), so there's no need for a separate bin.
    You can use a simple bitscanreverse/clz to calculate the exponent from the value (see lzma or LZ Offset Modeling Rambles).
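
    The bitscanreverse/clz step, sketched with gcc/clang's `__builtin_clz` (an assumption; MSVC would use `_BitScanReverse` instead). The exponent is the bit length of the value; its top bit is then implicit, so values 0 and 1 carry zero extra bits and get full entropy coding, as described above:

    ```c
    #include <stdint.h>

    /* Exponent = bit length of v (0 for v == 0). __builtin_clz is
       undefined for 0, so that case is handled separately. */
    static inline uint32_t exponent(uint32_t v)
    {
        return v ? 32u - (uint32_t)__builtin_clz(v) : 0u;
    }

    /* Number of raw bits below the implicit top bit. */
    static inline uint32_t extra_bits(uint32_t v)
    {
        uint32_t e = exponent(v);
        return e > 1 ? e - 1 : 0;
    }
    ```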

  22. #15
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    730
    Thanks
    227
    Thanked 230 Times in 141 Posts
    For adaptive binning I assumed each bin has some assigned number of bits to read.
    While something similar is done in LZ, such a division into bins can be optimized for the data to minimize the Kullback-Leibler divergence. However, doing it really optimally is not a simple task - I ended up with a simple heuristic in this arxiv.
    Also, such adaptive binning requires a bin search in the encoder, e.g. binary search.
    Probably simple binning is more practical: instead of using 25-bit values directly, use an entropy coder for the highest 16 bits and write the lowest 9 bits directly.
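
    That last suggestion, as code (a trivial sketch with made-up helper names): for 25-bit values, the high 16 bits become the entropy-coded symbol and the low 9 bits are written raw, assumed to be near-uniform.

    ```c
    #include <stdint.h>

    /* Simple binning for 25-bit values: split into a 16-bit symbol for
       the entropy coder and 9 raw low bits; reconstruction is exact. */
    static inline uint32_t bin_hi(uint32_t v)  { return v >> 9; }      /* 16-bit symbol */
    static inline uint32_t bin_lo(uint32_t v)  { return v & 0x1FFu; }  /* 9 raw bits */
    static inline uint32_t bin_join(uint32_t hi, uint32_t lo)
    {
        return (hi << 9) | lo;
    }
    ```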

