
Thread: Range ANS (rANS) - faster direct replacement for range coding

  1. #1
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    767
    Thanks
    236
    Thanked 244 Times in 149 Posts

    Range ANS (rANS) - direct alternative for range coding

    Here is an implementation of the "range variant of ANS" (rANS) by Fabian "ryg" Giesen: https://github.com/rygorous/ryg_rans
    Thanks to being simpler and using a single multiplication per decoded symbol (page 8 of http://arxiv.org/pdf/1311.2540 ), according to tests on Charles Bloom's blog ( http://cbloomrants.blogspot.com/ ), the current version decodes faster than range coding at the cost of slower encoding.

    Algorithm: for natural f[s] (quantized probabilities, summing to 2^n):
    freq[s] = f[s] / 2^n
    cumul[s] = sum_{i<s} f[i]
    mask = 2^n - 1
    symbol[x = 0 ... 2^n - 1] = s such that cumul[s] <= x < cumul[s] + f[s]
    - i.e. which symbol sits at the x-th position of a length-2^n range (0..01..1...), which contains f[s] appearances of each symbol s

    encoding of symbol s requires one division and one modulo by f[s]:
    C(s,x) = (floor(x/f[s]) << n) + mod(x , f[s]) + cumul[s]

    decoding from x uses one multiplication and needs a symbol[] table (or a function deciding which subrange we are in):
    s = symbol [ x & mask ]
    x = f[s] * (x >> n) + (x & mask) - cumul[s]
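
    For concreteness, here is a minimal C sketch of these two steps (no renormalization yet; the f[], cumul[] and symbol[] tables are assumed to be built exactly as above, and the helper names are mine):

    #include <stdint.h>

    #define N    12                     /* probability resolution: the f[s] sum to 1 << N */
    #define MASK ((1u << N) - 1)

    extern uint32_t f[256];             /* quantized frequencies f[s]                         */
    extern uint32_t cumul[256];         /* cumul[s] = sum of f[i] for i < s                   */
    extern uint8_t  symbol[1 << N];     /* symbol[x] = s with cumul[s] <= x < cumul[s] + f[s] */

    /* one encoding step: x -> C(s,x); the state grows by roughly log2(1/freq[s]) bits */
    static inline uint64_t rans_encode_step(uint64_t x, unsigned s)
    {
        return ((x / f[s]) << N) + (x % f[s]) + cumul[s];
    }

    /* one decoding step: recovers s and the previous x (exact inverse of the above);
       note that symbols come out in reverse order of encoding */
    static inline uint64_t rans_decode_step(uint64_t x, unsigned *s)
    {
        *s = symbol[x & MASK];
        return (uint64_t)f[*s] * (x >> N) + (x & MASK) - cumul[*s];
    }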


    Here are benchmarks of the current version from Charles' blog (arith is a range coder):

    huf = order-0 static Huffman
    fse = Yann's implementation of tANS
    rans = ryg's implementation of rANS
    arith = arithmetic coder with static power of 2 cumulative frequency total and decode table

    For fse, rans, and arith I use a 12-bit table (the default in fse.c)
    huf uses a 10-bit table and does not limit code length

    Runs are on x64 code, but the implementations are 32 bit. (no 64-bit math used)
    Some notes on the four implementations will follow. First the raw results:

    inName : book1
    H = 4.527

    arith 12: 768,771 -> 435,378 = 4.531 bpb = 1.766 to 1
    arith encode : 0.006 seconds, 69.44 b/kc, rate= 120.08 mb/s
    arith decode : 0.011 seconds, 40.35 b/kc, rate= 69.77 mb/s

    rans 12: 768,771 -> 435,378 = 4.531 bpb = 1.766 to 1
    rans encode : 0.010 seconds, 44.04 b/kc, rate= 76.15 mb/s
    rans decode : 0.006 seconds, 80.59 b/kc, rate= 139.36 mb/s

    fse : 768,771 -> 435,981 = 4.537 bpb = 1.763 to 1
    fse encode : 0.005 seconds, 94.51 b/kc, rate= 163.44 mb/s
    fse decode : 0.003 seconds, 166.95 b/kc, rate= 288.67 mb/s

    huf : 768,771 -> 438,437 = 4.562 bpb = 1.753 to 1
    huf encode : 0.003 seconds, 147.09 b/kc, rate= 254.34 mb/s
    huf decode : 0.003 seconds, 163.54 b/kc, rate= 282.82 mb/s
    huf decode : 0.003 seconds, 175.21 b/kc, rate= 302.96 mb/s (*1)


    inName : pic
    H = 1.210

    arith 12: 513,216 -> 79,473 = 1.239 bpb = 6.458 to 1
    arith encode : 0.003 seconds, 91.91 b/kc, rate= 158.90 mb/s
    arith decode : 0.007 seconds, 45.07 b/kc, rate= 77.93 mb/s

    rans 12: 513,216 -> 79,474 = 1.239 bpb = 6.458 to 1
    rans encode : 0.007 seconds, 45.52 b/kc, rate= 78.72 mb/s
    rans decode : 0.003 seconds, 96.50 b/kc, rate= 166.85 mb/s

    fse : 513,216 -> 80,112 = 1.249 bpb = 6.406 to 1
    fse encode : 0.003 seconds, 93.86 b/kc, rate= 162.29 mb/s
    fse decode : 0.002 seconds, 164.42 b/kc, rate= 284.33 mb/s

    huf : 513,216 -> 106,691 = 1.663 bpb = 4.810 to 1
    huf encode : 0.002 seconds, 162.57 b/kc, rate= 281.10 mb/s
    huf decode : 0.002 seconds, 189.66 b/kc, rate= 328.02 mb/s
    Update: ryg's blog about rANS: http://fgiesen.wordpress.com/2014/02/02/rans-notes/
    Last edited by Jarek; 4th February 2014 at 01:41.

  2. Thanks:

    Bulat Ziganshin (15th May 2016)

  3. #2
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    767
    Thanks
    236
    Thanked 244 Times in 149 Posts
    While encoding symbol s, there are generally two possibilities for the number of bits to produce, k[s]-1 or k[s], which suggests using a branch.
    However, as Yann has noticed, we can remove this branch for tANS.
    For rANS it is a bit more complicated, but it can also be done.

    With renormalization, rANS requires three main parameters:
    - frequency accuracy: f[s] ~ p[s] * 2^n
    - the beginning of the I range: L = 2^R
    - the number of bits shifted in/out per renormalization step: B (i.e. blocks of size 2^B)
    Finally, the range for the state is
    x \in I = {2^R, ..., 2^{R+B} - 1}
    In practice e.g. R=B=16 for a 32-bit state, R=B=32 for 64-bit, n ~ 12 (a symbol table of size 2^n is required).

    We have
    L_s = f[s] * 2^{R-n}
    I_s = {L_s, ..., L_s * 2^B - 1}
    To encode symbol s, we first need to bring x into the I_s range (renormalization) - the question is how many B-bit blocks we need to shift out first.
    We need k[s]-1 or k[s] such blocks, where
    k[s] = ceil(log2(L/L_s) / B)
    The minimal x that requires k[s] blocks is
    x_s = L_s * 2^{B * k[s]}
    So let us define
    nB[s] = 2^{R+B} - x_s + (k[s] - 1) * 2^{R+B} = (k[s] << (R+B)) - (L_s << (B * k[s]))

    Finally, branchless rANS encoding step with renormalization is
    nbBits = ((x + nB[s]) >> (R+B)) << B;
    writeBits(x, nbBits); x >>= nbBits;
    x = (floor(x/f[s]) << n) + mod(x , f[s]) + cumul[s]


    The renormalization for decoding (assuming n<B) is
    if(x < (1<<R)) x = x << B | readBits(B);

    the branch again can be replaced by
    nbBits = (1 -  (x >> R)) << B;
    x = x << nbBits | readBits(nbBits)
    Last edited by Jarek; 16th August 2015 at 14:08.

  4. Thanks:

    Bulat Ziganshin (16th May 2016)

  5. #3
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    492
    Thanks
    176
    Thanked 174 Times in 118 Posts
    Quote Originally Posted by Jarek View Post
    The renormalization for decoding (assuming n<B) is
    if(x < (1<<R)) x = x << B | readBits(B);

    the branch again can be replaced by
    nbBits = (1 -  (x >> R)) << B;
    x = x << nbBits | readBits(nbBits)
    I assume that was meant to be
    nbBits = (1 - (x >> R)) * B
    ?

    I'm not sure on the 1 - (x>>R) bit either. I think you can probably just do (x < (1<<R)) as a variable without the branch, so using it as a truth for 0/1, and avoid branch hits/misses.

    I tried something similar in my code and sometimes it slightly speeds things up and sometimes it harms it, depending on whether I am using the code in an order-0 or order-1 context and whether I am compiling with icc or gcc/clang.

  6. #4
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    767
    Thanks
    236
    Thanked 244 Times in 149 Posts
    Oh, you are right, also for encoding.

    This multiplication is not required if it is ensured that there is at most one renormalization step; otherwise it can often be replaced by a bit shift.
    For example, for 32 bits we can use e.g. 16+16 or 24+8.
    16+16: x \in {2^16, ..., 2^32 - 1} with 16-bit renormalization has at most one renormalization step, so the encoding/decoding steps can be written as

    nbWords = (x + nB[s]) >> 32;                   
    write16bits(x, nbWords); x >>= (nbWords << 4); // now its ok :)
    x = (floor(x/f[s]) << n) + mod(x , f[s]) + cumul[s]

    nbWords = 1 -  (x >> 16);
    x = x << (nbWords << 4) | read16bits(nbWords);
    s = symbol [ x & mask ]
    x = f[s] * (x >> n) + (x & mask) - cumul[s]


    Here is decoding with branch-less read16bits() - the data is pointed by 2byte ptr:
    nbWords = 1 -  (x >> 16); mask = (nbWords << 16) - 1
    x = x << (nbWords << 4) | (*(ptr += nbWords) & mask) ; // complex but branch-less
    s = symbol [ x & mask ]
    x = f[s] * (x >> n) + (x & mask) - cumul[s]

    I don't know how to make it branchless for encoding (?)

  7. #5
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    492
    Thanks
    176
    Thanked 174 Times in 118 Posts
    Quote Originally Posted by Jarek View Post
    Here is decoding with branch-less read16bits() - the data is pointed by 2byte ptr:
    nbWords = 1 -  (x >> 16); mask = (nbWords << 16) - 1
    x = x << (nbWords << 4) | (*(ptr += nbWords) & mask) ; // complex but branch-less
    s = symbol [ x & mask ]
    x = f[s] * (x >> n) + (x & mask) - cumul[s]

    I don't know how to make it branchless for encoding (?)
    That's more or less what I ended up with.

    In some scenarios it became faster to have a lookup table for the mask, bizarrely! My code wasn't entirely the same as yours though so maybe something else is the culprit.

    int z[2] = {0,((1<<16)-1)};
    x = x << (nbWords << 4) | (*(ptr += nbWords) & z[nbWords]) ;


    Some examples of MB/s with different compilers and methods on my DNA quality test file are shown below. +br is the old branching code, -br is branchless with a table for the mask, and -br2 is branchless with the mask computed inline. Why that makes so much difference I have no idea. Anyway, it appears that more-or-less branchless worked slightly better at order-0 (though not uniformly, e.g. depending on the GCC version), but the extra complexity elsewhere in the order-1 entropy encoder meant that the branchless one was harming throughput.


    gcc-4.6.3
    +br -br -br2
    o0 291 / 301 / 254
    o1 192 / 186 / 187

    gcc-5.1.0
    o0 294 / 284 / 257
    o1 200 / 192 / 188

    icc
    o0 269 / 220 / 277
    o1 197 / 166 / 203

    clang
    o0 279 / 291 / 263
    o1 205 / 200 / 182


    Using Ryg's rans64.h as the backend rANS coder instead, clang gave 379MB/s decode at order-0 and 238MB/s at order-1. For comparison, the latest git checkout of FSE (./test/fse -b) claims 292MB/s decode, but I'm not sure the benchmark is doing the same thing. Anyway, maybe rANS can be pretty comparable to tANS for speed. This is all on a rather aging Xeon E5-2660.

  8. Thanks:

    Jarek (19th August 2015)

  9. #6
    Member
    Join Date
    Sep 2015
    Location
    Brno
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Reminds me of my 3:1 Huffman codes that were meant to better compress skewed distributions. The decoder grabs two bits at a time; codes 0, 1, 2 map to 'a' and code 3 maps to 'b'. The excess entropy in the 'a' case is accumulated, and after 12 rounds I interpret the buffer as 19 binary digits, interleaving it with the input bitstream. (3^12 = 2^19 + 1.5%.) The win is that the decoder is still always whole-bit aligned and fast. But it complicated the encoder to the point that I abandoned the idea. The need for unbounded back-patching of these 19 bits (you can't patch a value that still needs to be patched) killed streaming, and the small gain in compression rate was not worth it.

    This is much better though.. Generic, not just powers of 2 and 3.

  10. #7
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    767
    Thanks
    236
    Thanked 244 Times in 149 Posts
    Thanks Zde, this is a nice idea, but it would require an additional buffer for each (non-power-of-2) symbol (and requires grouping symbols into powers of 2).
    The ANS trick is to integrate all such additional buffers into one natural number carrying the fractional number of bits.

  11. #8
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    492
    Thanks
    176
    Thanked 174 Times in 118 Posts
    We revisited optimising our rANS code again recently, using the Linux "perf" tool to study where the bottlenecks are and to profile things like branch misses. It's particularly good for things like that. E.g. an order-0 decode of enwik9:

    @ deskpro107386[compress.../rans_sta...]; perf stat ./rANS_static4c -d /tmp/enwik9_o0 > /dev/null
    Took 3291943 microseconds, 303.8 MB/s

    Performance counter stats for './rANS_static4c -d /tmp/enwik9_o0':

    3294.589602 task-clock (msec) # 1.000 CPUs utilized
    293 context-switches # 0.089 K/sec
    4 cpu-migrations # 0.001 K/sec
    856 page-faults # 0.260 K/sec
    11775910176 cycles # 3.574 GHz
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    18361710754 instructions # 1.56 insns per cycle
    1912728587 branches # 580.567 M/sec
    319419710 branch-misses # 16.70% of all branches

    3.295474444 seconds time elapsed


    The above code is an example of a 32-bit state retaining 24 bits and shifting out 8 bits in the renormalisation step. Code is from https://github.com/jkbonfield/rans_static and very heavily derived from Ryg's rANS implementation.

    Attempts to remove branches by the techniques outlined above in this thread do work, but only on certain types of data. It's more instructions on average but no branching, so if the data is highly compressible it doesn't help, and if not it probably will.

    My colleague (Rob Davies) mentioned that the CMOV instruction is probably a better alternative to the existing branchless code. He rewrote the entire main rANS decode loop (including renormalisation) in pure assembler and easily beat the existing C code for decoding speed. All well and good, but ideally I'd like C and not assembly. The renorm func I previously had was this:

    static inline void RansDecRenorm(RansState* r, uint8_t** pptr)
    {
        uint32_t x = *r;
        if (x < RANS_BYTE_L) {
            uint8_t* ptr = *pptr;
            x = (x << 8) | *ptr++;
            while (x < RANS_BYTE_L) x = (x << 8) | *ptr++;
            *pptr = ptr;
        }
        *r = x;
    }


    No matter what I did I couldn't get various gcc versions to use cmov, but clang did when given this code:

    static inline void RansDecRenorm(RansState* r, uint8_t** pptr)
    {
        uint32_t x = *r;
        uint8_t* ptr = *pptr;
        uint32_t y = (x << 8) | *ptr;
        uint32_t cond = x < RANS_BYTE_L;
        x = cond ? y : x;
        ptr += cond ? 1 : 0;
        if (x < RANS_BYTE_L) x = (x<<8) | *ptr++;
        *pptr = ptr;
        *r = x;
    }


    It still has conditionals in the source (the x = cond ? y : x), but this idiom is enough to persuade clang to use a cmov instruction in that scenario.
    The second if (x < RANS_BYTE_L) is left in place as Rob discovered it's so rarely triggered that a branching implementation works out faster due to fewer instructions coupled with high branch predictability. The asm clang generated turned out to be faster than the original hand-written asm.

    Analysing the assembly that clang produced showed that it did things better in some places and poorer in others, so we now have a pure assembler implementation which is faster than the original hand-written assembler and also faster than the compiler generated assembler.

    The final(?) ASM variant we ended up with was:

    static inline void RansDecRenorm(RansState* r, uint8_t** pptr) {
        uint32_t x = *r;
        uint8_t *ptr = *pptr;

        __asm__ ("movzbl (%0), %%eax\n\t"
                 "mov %1, %%edx\n\t"
                 "shl $0x8,%%edx\n\t"
                 "or %%eax,%%edx\n\t"
                 "cmp $0x800000,%1\n\t"
                 "cmovb %%edx,%1\n\t"
                 "adc $0x0,%0\n\t"
                 : "=r" (ptr), "=r" (x)
                 : "0" (ptr), "1" (x)
                 : "eax", "edx"
                 );
        if (x < 0x800000) x = (x << 8) | *ptr++;
        *pptr = ptr;
        *r = x;
    }


    This still has a normal C second stage of renormalisation, due to the rare branch making the generated code just as good. We also have a variant that renormalises two rANS states interleaved for slightly better performance; that one has the second stage incorporated. It also has a few other (C) tweaks to the main decoding loop. The perf output (compare to above) on the same file with this new version is:

    @ deskpro107386[compress.../rans_sta...]; perf stat ./rANS_static4k -d /tmp/enwik9_o0 > /dev/null
    !Took 1632432 microseconds, 612.6 MB/s

    Performance counter stats for './rANS_static4k -d /tmp/enwik9_o0':

    1633.859460 task-clock (msec) # 1.000 CPUs utilized
    186 context-switches # 0.114 K/sec
    4 cpu-migrations # 0.002 K/sec
    861 page-faults # 0.527 K/sec
    5801691617 cycles # 3.551 GHz
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
    17864513950 instructions # 3.08 insns per cycle
    1262553856 branches # 772.743 M/sec
    12532018 branch-misses # 0.99% of all branches

    1.634640622 seconds time elapsed



    So for order-0 it's around twice as fast. For order-1 it's more dominated by fetching from large memory tables so comes out at 30-35% faster, but still a respectable win.

    Full benchmarks of all variants follows.

    Code:
    24+8 (all interchangeable):
    4c, o0:            229.1 MB/s enc, 290.6 MB/s dec   1000000000 bytes -> 644057148 bytes
    4j, o0:            227.3 MB/s enc, 286.7 MB/s dec   1000000000 bytes -> 644057148 bytes
    4k, o0:            229.4 MB/s enc, 631.5 MB/s dec   1000000000 bytes -> 644057148 bytes
    
    4c, o1:            150.2 MB/s enc, 206.1 MB/s dec   1000000000 bytes -> 484003410 bytes
    4j, o1:            147.6 MB/s enc, 223.1 MB/s dec   1000000000 bytes -> 484003410 bytes
    4k, o1:            148.5 MB/s enc, 260.7 MB/s dec   1000000000 bytes -> 484003410 bytes
    
    16+16:
    4_16i,o0 (no-asm): 273.5 MB/s enc, 376.2 MB/s dec   1000000000 bytes -> 644044227 bytes
    4_16i,o0 (asm) :   275.6 MB/s enc, 464.2 MB/s dec   1000000000 bytes -> 644044227 bytes
    (fails at order-1; I need to figure out why)
    
    32+32:
    64c,o0:            272.4 MB/s enc, 423.1 MB/s dec   1000000000 bytes -> 644066851 bytes
    64c,o1:            179.7 MB/s enc, 249.0 MB/s dec   1000000000 bytes -> 484012935 bytes
    
    ryg_rans main_simd,o0:  
    SIMD rANS: 652616446 bytes
    5321193344 clocks, 5.3 clocks/symbol (572.2MB/s)
    5313525315 clocks, 5.3 clocks/symbol (573.0MB/s)
    (main_simd from Ryg's github - so a slightly different benchmark harness)
    All of the above were compiled with gcc 5.3.0 -O3 -mtune=native (and -march=native for the simd one) and run on a core-i5 at 3.2GHz. It still needs a lot of testing and I'll be pushing changes to my github soon, but I'm hoping others can make use of the optimised normalisation.

  12. Thanks (5):

    Bulat Ziganshin (12th May 2016),Cyan (12th May 2016),Jarek (12th May 2016),Turtle (13th May 2016),willvarfar (13th May 2016)

  13. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,552
    Thanks
    767
    Thanked 684 Times in 370 Posts
    >No matter what I did I couldn't get various gcc versions to use cmov

    Once I had the idea that profile-guided optimization may help, since the compiler just needs stats to select between (maybe predictable) branches and branch-less code.

  14. #10
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    492
    Thanks
    176
    Thanked 174 Times in 118 Posts
    Code is on github now to experiment with, but I'm still figuring out the crash on 4_16 (it's slower anyway and I don't use it, so it's not really an issue for me).

    Regarding profile guided optimisation, maybe it's doable and it's a good point to try. So, let's try it. Gcc with -fprofile-generate -fprofile-arcs, run test mode on enwik9, then recompile with -fprofile-use -fbranch-probabilities. (This was 5.3.0, although I previously tried 6.1.0 too.)

    Alas, 4c dropped to ~270MB/s decode rate, so it didn't figure out the cmov benefits. Maybe if I tried the ?: variant that clang successfully uses cmov on then gcc would too, but it's all a bit of a stab in the dark. Perhaps we need another alternative to __builtin_expect, e.g. a __builtin_unpredictable to indicate that this conditional is likely to be fairly random and won't be handled well by the CPU branch predictor, making cmov etc a better bet than a pure test/jmp.

    I should probably put a little test case together to log at https://gcc.gnu.org/bugzilla/.

    PS. I tried the latest FSE checkout on enwik9 using 1MB blocks (same as the rans examples) and on my system it's encoding at 390MB/s and decoding at 595MB/s. Considerably faster encode and slightly slower decode than the rans-4k code above. The huff mode (128KB block) though really flies! 688MB/s enc and 942MB/s dec. Great job Cyan.

  15. Thanks (2):

    Bulat Ziganshin (13th June 2016),Cyan (12th May 2016)

  16. #11
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    If you're targeting branchless, it's a better idea to use a 32-bit rANS with 16-bit renormalization (like what the SIMD variant does). It gives you slightly less compression (but only slightly provided your probability resolution is 14 bits or lower) but gives more efficient code. That's what BitKnit uses. Caveat is that this doesn't work with the reciprocal computation code in ryg_rans (which can only divide numbers of 31 bits and smaller); 31-bit rANS with 16-bit renorm and up to 13-bit probability resolution would be an option, but ugh. (As an aside, that might be why your 16+16 test fails: you need to be using 15+16 with the reciprocal computation in 32-bit ryg_rans, if you try to divide 32-bit numbers you will get incorrect results.) And likewise, ryg_rANS64 is 31+32 not 32+32.

    There's basically no point in using rANS with a static probability distribution. ryg_rans uses that strictly because it's a simple consistent baseline that's easy to compare to other coders (and also, that code is written to illustrate the algorithm, not for maximum performance). If your model really is static, you absolutely want to stick with tANS, at least if you can live with the table sizes. (rANS can in principle work from a standard cumulative frequency table, which has better density than a typical tANS table format, although fast decoding is more challenging in that case.)

  17. Thanks (5):

    Bulat Ziganshin (13th June 2016),Cyan (13th May 2016),JamesB (12th May 2016),Turtle (13th May 2016),willvarfar (13th May 2016)

  18. #12
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    492
    Thanks
    176
    Thanked 174 Times in 118 Posts
    Hmm, I'm using 16-bit renorm in one of my variants, but yes it's failing occasionally so I guess that's why. I didn't think about it too much obviously!

    In theory it should be faster, but so far (reciprocal bug aside) it doesn't run faster than the 24+8 code, even after adding an assembly variant of the renorm code:

    __asm__ ("movzwl (%0),  %%eax\n\t"
    "mov %1, %%edx\n\t"
    "shl $0x10, %%edx\n\t"
    "or %%eax, %%edx\n\t"
    "xor %%eax, %%eax\n\t"
    "cmp $0x10000,%1\n\t"
    "cmovb %%edx, %1\n\t"
    "sbb %%eax, %%eax\n\t"
    "andl $2, %%eax\n\t"
    "addq %%rax, %0\n\t"
    : "=r" (ptr), "=r" (x)
    : "0" (ptr), "1" (x)
    : "eax", "edx"
    );


    This is having to increment the ptr register by 2, hence the sbb, andl and addq. It's more instructions than incrementing by 1, which can be done in a single adcq instruction, so it ends up comparable to if not longer than the 24+8 code given the second renormalisation step very rarely happens. Perhaps restructuring the entire thing to use indices instead of an incrementing pointer would solve this, as then the memory fetches can have an inbuilt idx*2 component in their assembly, but when I tried that in the past (pure C) I couldn't get anything to go as fast as the ptr++ version. Perhaps worth trying again though.

    What sort of decompression throughput are you getting? I'm using 12-bit frequencies at the moment. 14-bit is 0.4% smaller for enwik9 but the larger tables make it 35% slower. I should probably use 10 or 11 bits for order-1 to reduce the size and avoid too much cache thrashing. (10-bit is 25% quicker at O1 and only 0.4% larger.)

    Regarding table size, any fast rANS I can think of is going to need a lookup table sized by the masked-off bits of the rANS state, as iterating through the cumulative frequency table is likely to be much slower. How does that compare to tANS table sizes? My rANS is now slightly faster than the tANS in FSE at decoding, so it seems both can be comparable on decode at least. If you're not using static probabilities clearly that isn't possible, and with range/arithmetic coders I've had success using a list of symbols sorted by frequency so that common symbols require very few lookups to work out the cumulative probabilities.

    To be honest, I only started with rANS because your code was clear and easy to understand - much more so to me than the tANS implementations and maths! For that you have my most sincere thanks. Although my more aggressively optimised rANS is probably less understandable. Most likely tANS would be a better fit for what I use rANS for though.

    [Edit: that said, tANS is almost certainly going to be faster on simpler architectures. The above only works if you have a cmov equivalent that doesn't suck for speed; some older intel CPUs had slow cmov.]

  19. #13
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    That's a poor way to do the increment. Do (Intel syntax):

    Code:
    lea temp, [%0+2]
    cmovb %0, temp
    (after you already did the initial compare)

    The BitKnit code does essentially this:


    //const U16 *ptr;
    U32 y = (x << 16) | read16_le(ptr);
    const U16 *ptr_inc = ptr + 1;
    ptr = (x < RANS_L) ? ptr_inc : ptr;
    x = (x < RANS_L) ? y : x;


    which compiles to basically that with both Clang and VC++. Haven't checked GCC in a while.
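
    Wrapped up as a self-contained function it would look roughly like this (my paraphrase, not the actual BitKnit source; the RANS_L value and the U16-pointer convention are assumptions):

    #include <stdint.h>

    #define RANS_L (1u << 16)   /* assumed lower bound of the normalised interval */

    /* branchless 16-bit renormalisation: at most one 16-bit word is ever needed */
    static inline uint32_t rans16_dec_renorm(uint32_t x, const uint16_t **pptr)
    {
        const uint16_t *ptr = *pptr;
        uint32_t y = (x << 16) | *ptr;          /* speculative refill (read16_le in the original) */
        const uint16_t *ptr_inc = ptr + 1;
        *pptr = (x < RANS_L) ? ptr_inc : ptr;   /* advance only if the word was consumed          */
        return (x < RANS_L) ? y : x;
    }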

  20. Thanks (3):

    JamesB (13th May 2016),Turtle (13th May 2016),willvarfar (13th May 2016)

  21. #14
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    492
    Thanks
    176
    Thanked 174 Times in 118 Posts
    Thanks for the tips Fabian. Yes, using lea works faster, and replacing the freq reciprocal with a boring division (for now) fixes the failure.

    I also gave the main order-0 loop the same going-over that Rob did to the 24+8 bit version and that massively sped it up in the same manner (some of those ideas taken from the pre-computation of bias used in your SIMD variant). This makes the 16-bit variant both work (yay!) and also puts it around 5% ahead of the 24+8 variant for decode speed at both order-0 and order-1. Intrinsically the only real difference should be the extra renormalisation step, which isn't taking up much time overall.

    So now we get (vs 631 and 260 above):

    4_16i+asm,o0-12: 150.3 MB/s enc, 651.2 MB/s dec 1000000000 bytes -> 644044227 bytes
    4_16i+asm,o1-12: 117.5 MB/s enc, 275.2 MB/s dec 1000000000 bytes -> 484004039 bytes
    4_16i+asm,o1-10: 117.3 MB/s enc, 294.7 MB/s dec 1000000000 bytes -> 485931915 bytes

  22. #15
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    Quote Originally Posted by JamesB View Post
    Thanks for the tips Fabian. Yes, using lea works faster, and replacing the freq reciprocal with a boring division (for now) fixes the failure.
    If you want the fast reciprocal back, just move the L down a notch (from 1<<16 to 1<<15). This costs you a bit (literally and figuratively) of resolution but if you're using 12-bit probabilities anyway, it should be fine.

    Quote Originally Posted by JamesB View Post
    Intrinsically the only real difference should be the extra renormalisation step, which isn't taking up much time overall.
    The actual extra renormalization work is not a big deal, but the branch mispredictions tend to hurt.

    For something like the 64-bit state rANS with 32-bit-at-a-time renorm, renormalization is rare (and predictable!) enough that branching is usually a win. For 8- and 16-bit renorm with not-too-skewed probabilities, it's usually random enough that branchless is the way to go.

  23. #16
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    492
    Thanks
    176
    Thanked 174 Times in 118 Posts
    Quote Originally Posted by fgiesen View Post
    For something like the 64-bit state rANS with 32-bit-at-a-time renorm, renormalization is rare (and predictable!) enough that branching is usually a win. For 8- and 16-bit renorm with not-too-skewed probabilities, it's usually random enough that branchless is the way to go.
    This is why with 8-bit renormalisation we used branchless for the first renorm and branching for the second (very rare) one. The branch miss rates are very small - under 1% for the entire program.

    (Anyway for now we sort of have to stick with the 8-bit renorm in order to not break our existing files, but it's good to explore alternatives.)

  24. #17
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    767
    Thanks
    236
    Thanked 244 Times in 149 Posts
    Quote Originally Posted by fgiesen View Post
    There's basically no point in using rANS with a static probability distribution. ryg_rans uses that strictly because it's a simple consistent baseline that's easy to compare to other coders (and also, that code is written to illustrate the algorithm, not for maximum performance). If your model really is static, you absolutely want to stick with tANS, at least if you can live with the table sizes. (rANS can in principle work from a standard cumulative frequency table, which has better density than a typical tANS table format, although fast decoding is more challenging in that case.)
    As James' rANS has very close performance to FSE, I think Fabian is suggesting that FSE still has room for a speedup (?) (besides symbol grouping).

    Many processors currently come with hardware video decoding - maybe the manufacturers will finally mature to supporting efficient handling of various types of compressed data, e.g. compressing data storage and transmission by default - for example, a cheap built-in hardware tANS decoder, fed with a decoding table, could independently decode a symbol per cycle for both tANS and Huffman ...

    A hardware zlib decoder can reach 2 bytes/cycle: http://www.cast-inc.com/ip-cores/dat...l-d/index.html

  25. #18
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    Quote Originally Posted by Jarek View Post
    Many processors currently come with hardware video decoding - maybe the manufacturers will finally mature to supporting efficient handling of various types of compressed data, e.g. compressing data storage and transmission by default - for example, a cheap built-in hardware tANS decoder, fed with a decoding table, could independently decode a symbol per cycle for both tANS and Huffman ...
    A tANS decoder seems like a very bad way to build a Huffman decoder in hardware, on several axes: area, power consumption, design complexity, and timing constraints.

    Hardware is not software. The cost landscape is very different. And tANS feels like a very poor fit; at least the implementations we're using. (It's quite possible there's a different, equivalent way to do tANS that's more hardware-friendly, but the current way isn't a great fit.) I would be quite surprised if someone did a tANS decoder in HW that was even close to competitive with a canonical Huffman decoder, and I'd want to know how they did it.

    In HW, small tables work fine, but "small" here means "low hundreds of bits", not "tens of kilobytes", and the larger you make them, the slower they get (this is just physics - more and longer wires with more capacitance, and deeper logic trees required for the addressing). Even a 1024-state tANS has uncomfortably large tables by HW standards, and just performing that table access (not doing anything with the result yet!) in a single cycle might be too much if you're targeting GHz-level clock rates; you'd certainly have to keep an eye on it. Straight tANS, I just don't see happening at competitive clock rates at all, because the critical path for the next symbol involves the updated state, which is only available after essentially the complete tANS decode is finished. You'd have to fit all of it inside a single clock cycle, can't pipeline it. Very bad news.

    Interleaved might be slightly better, because you at least get some more time for the updated state (and might be able to pipeline that), but now your critical path is working out how many bits to consume, which is still only known after the big table lookup. So this doesn't solve the key problem, and with interleaving you already lost the ability to decode a regular Huff stream full-speed. Full multi-state, multi-stream is the way I'd want to go in HW for fast tANS. Say you go for 4 streams, then you can pipeline it across 4 cycles, get 2-3 cycles worth of time for the table access (way more comfortable margins!) and it all feels a lot more plausible to me. But this one really can't read regular non-interleaved Huffman streams now.

    Quote Originally Posted by Jarek View Post
    A hardware zlib decoder can reach 2 bytes/cycle: http://www.cast-inc.com/ip-cores/dat...l-d/index.html
    That's for an "average" DEFLATE stream (whatever that means) and thus includes matches, which produce more than one output byte per decoded Huffman symbol. I wouldn't infer anything about their Huffman decoder one way or the other from that.

    A HW canonical Huffman decoder has a substantial advantage: you don't need to fully decode the current symbol before starting to decode the next one. All you really need to know is the length of the current code word. That's a much simpler problem, especially with canonical Huffman where the codes are sorted by length. For DEFLATE, you end up with something like 14 parallel compares against 15-bit constants ("start" values for every code len) feeding into a priority encoder to figure out the length of a Huffman code word. I don't think you need full digital comparators either. You don't need anything like a full decode at that point - that can happen in later pipeline stages.
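
    In software terms, that length-determination step is roughly the following (an illustrative model only; the limit[] layout and names are mine, not DEFLATE-exact):

    #include <stdint.h>

    /* limit[len] = left-justified 15-bit value just past the last canonical code of
       length len, i.e. (first_code[len] + count[len]) << (15 - len).  window15 holds
       the next 15 bits of the stream, MSB-first. */
    static inline int huff_code_len(uint32_t window15, const uint32_t limit[16])
    {
        int len;
        for (len = 1; len < 15; len++)   /* the ~14 compares done in parallel in HW */
            if (window15 < limit[len])
                break;
        return len;                      /* code length of the next symbol */
    }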

    That said, DEFLATE in particular is a lot more complicated than just a straight canonical Huffman scheme:
    - Literal/len and offset codes in the same stream. Thus, to decode the next code word, you need to know which one it should be; figuring that out requires more than just length decoding.
    - The "extra" bits for lens and offsets are also in the same stream, and again, you need to know how many there are to decode the next code words correctly.

    So a HW DEFLATE decoder probably *would* do a table lookup based on the raw bits too, to get the number of extra bits and (for lit/len codes) an "is this a match?" flag. That would work out to something like 4 bits per table entry (a literal needs 3b extra_len + 1b is_match, an offset needs 4b extra_len), and that table read does need to finish in the first cycle if you want a one-cycle fast path.

    Either way, in HW I'd definitely go for a split between a fast predecoder that only determines code word boundaries (for the critical path), and leave "understanding" the symbols for later pipe stages. tANS makes that a lot harder, so I have my doubts about the HW-friendliness. (But I'd love to be proven wrong.)

  26. Thanks (4):

    Bulat Ziganshin (15th May 2016),Cyan (15th May 2016),DirtyPunk (15th May 2016),Turtle (15th May 2016)

  27. #19
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    767
    Thanks
    236
    Thanked 244 Times in 149 Posts
    Fabian,
    Modern processors have more than a billion transistors.
    They are getting hardware video decoding, but still no help for general data compression - which could mean a several-fold speedup without occupying a physical core: e.g. a separate unit receiving a decoding table and a stream to process, which could have separate memory or partially use the cache (e.g. be directly connected to a part of L3).
    Depending on the memory requirements, the cost would be from ~1000 to a few hundred thousand transistors - still negligible compared to a billion+ - giving extremely fast, nearly free decoding(/encoding) and so even allowing compressed data to be held in RAM: improving the effective amount of memory and the transfer over the motherboard.

    The first step of processor manufacturers supporting general purpose compression should be entropy decoding - and tANS/Huffman is perfect due to universality.
    Sure, later steps may treat them separately ... or even finally add an entire compressor, like zstd in a few years if marketed by Fb.

    So the tANS decoding step is
    t = decodingTable[X]; to_stream(t.symbol); X = t.newX | readBits(t.nbBits);

    the dependency chain is: use the table, then read nbBits bits, then bitwise-or - so one approach to a symbol/cycle is 2-3x interleaving.
    The Huffman decoding step is
    t = decodingTable[X]; to_stream(t.symbol); X = ((X << t.nbBits) & mask) | readBits(t.nbBits);

    the only difference is that newX is calculated instead of tabled - by putting it into decodingTable we can use a tANS decoder to decode Huffman, but a pure Huffman decoder still needs 2-3x interleaving to reach a symbol/cycle.
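
    Spelled out in C, the two decoding steps are nearly identical - a sketch only, with readBits() and the table construction assumed:

    #include <stdint.h>

    extern unsigned readBits(unsigned nb);  /* assumed bit reader, as in the pseudocode above */

    typedef struct {
        uint8_t  symbol;    /* decoded symbol                    */
        uint8_t  nbBits;    /* bits to pull from the stream      */
        uint16_t newX;      /* tANS only: tabled next base state */
    } DecodeEntry;

    /* one tANS decoding step: the next state is tabled */
    static inline unsigned tans_step(unsigned *X, const DecodeEntry *tab)
    {
        DecodeEntry t = tab[*X];
        *X = t.newX | readBits(t.nbBits);
        return t.symbol;
    }

    /* one table-driven Huffman decoding step: same lookup, but the next state
       is computed from the old one instead of being tabled */
    static inline unsigned huff_step(unsigned *X, const DecodeEntry *tab, unsigned mask)
    {
        DecodeEntry t = tab[*X];
        *X = ((*X << t.nbBits) & mask) | readBits(t.nbBits);
        return t.symbol;
    }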

    The only resource difference is that tANS needs more memory for the same number of states (in practice they may differ), but this is still negligible compared to a billion+ transistors (and I believe it could directly use a part of L3 - does hardware video decoding use only its own separate memory?).

    So how do you imagine the first step of processor manufacturers helping general purpose compression?

  28. #20
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    Quote Originally Posted by Jarek View Post
    Modern processors have more than a billion transistors.
    They are getting hardware video decoding, but still no help for general data compression
    Yes, because users are often spending minutes to hours in one sitting watching video on a single device, the majority of it in a single format (MPEG-4 AVC). And video has the nice property that you copy a relatively small input stream to a HW-visible buffer (a couple MB/s for typical streams) and then tell the decoder "go" and forget about it; the media block writes it to GPU memory without having to coordinate with the CPU further, which is good. Ping-ponging data back and forth between the CPU and a dedicated HW coprocessor, especially if you want low latency, is trickier on the system level in various ways. (I'll get to that.)

    Low-level/driver-level programming for special-purpose HW is something I've spent several years doing professionally, so I'm not completely talking out of my ass here.

    From a chip maker's perspective, it is frequently much easier to justify 10 special-function blocks for 10 different standards than it is to justify a single more general block. The specialized blocks can be simply turned off when they're not in use. A general block that can handle multiple formats but consumes 20% more power than each of the 5 specialized blocks it replaces is worse for all the use cases they care about.

    For example, an entropy decoder block for AVC deals with either CABAC or a VLC code with fixed tables specified in the standard (which allows building decoders using state machines/minimized custom logic, ROM, or a combination of both, all of which is significantly more efficient than a RAM-based decoder). They also need to support less than 100MB/s on the input end (the highest specified AVC profile I'm aware of is High10@5.2 at 720Mbit/s, so 90MB/s). A dedicated entropy decoding/encoding block for lossless compression would need very different specs to be useful (much higher throughput for once), but I disagree with you in that no, I don't think it's easy to say what that should be at all.

    Quote Originally Posted by Jarek View Post
    which could mean a several-fold speedup without occupying a physical core: e.g. a separate unit receiving a decoding table and a stream to process, which could have separate memory or partially use the cache (e.g. be directly connected to a part of L3).
    Depending on the memory requirements, the cost would be from ~1000 to a few hundred thousand transistors - still negligible compared to a billion+ - giving extremely fast, nearly free decoding(/encoding) and so even allowing compressed data to be held in RAM: improving the effective amount of memory and the transfer over the motherboard.
    I don't see anywhere near the potential speed-ups you do. Memory access to an 8-16kB SRAM doesn't magically get an order of magnitude faster just because a dedicated hardware unit originates it. L1 cache accesses don't have 3-4 cycle latency on the majority of current CPUs because HW designers are slacking off; you simply can't access 32kB of SRAM (pretty standard L1D size) arbitrarily fast.

    Hardware is great at speeding up tasks with lots of inherent parallelism; entropy decoding is not one of them. If you have an algorithm with lots of such potential, great! But pointing at a completely serial algorithm and saying "that, just in hardware" isn't very convincing. (Offloading a core is potentially interesting, even if it's slower, as long as it's lower-power; that might be an interesting design point, but doesn't sound like what you're thinking about.)

    Quote Originally Posted by Jarek View Post
    The first step of processor manufacturers supporting general purpose compression should be entropy decoding - and tANS/Huffman is perfect due to universality.
    The part that you're describing as "universal" only is because you're ignoring several "implementation details" that make a massive difference, and because you're refusing to consider other implementation options. A "general" entropy coding block that tries to be useful for several formats should support at least some of the more common ones, right? Well, here's some of the differences a "general" entropy decoder needs to deal with:


    • You presume all codes can be decoded in a single table lookup. Not true. E.g. DEFLATE has a maximum code length of 15 bits. Decoding such long codes with a single lookup requires an impractically large table and would be disastrous in terms of latency. If you disagree that 2^15 entry tables are impractical: Ogg Vorbis allows up to 32 bits. Do we agree that's impractical to decode with a single table lookup, at least?
    • So what do you do with over-long codes? With canonical Huffman (e.g. DEFLATE, JPEG), it's not so bad, and quite HW-friendly. Other formats (e.g. Ogg Vorbis) don't use canonical Huffman codes though. What about them? How do you specify what to do?
    • A canonical Huffman decoder can in fact be implemented with very little memory, and designed to be very fast and low-power; but having to support non-canonical codes would mean that you pay the extra cost in power and perf even when you don't need to. (Unless you're going for broke and putting several decoder modules in hardware...)
    • Most formats (e.g. DEFLATE again) interleave other data into a single bit stream, or use multiple Huffman tables. How does the dedicated HW decoder know when to read such extra bits, or when to switch tables?
    • On the subject of multiple tables: how many do you support? 2, good enough for DEFLATE? 4, good enough for baseline JPEG? 5, good enough for RAR? Remember this multiplies your already substantial SRAM requirements by yet another factor!
    • What bit-packing convention do you use? LSB-first, like DEFLATE? MSB-first, like JPEG and Huff0? Support both? (Some formats pack bits into larger words, like 16-bit or 32-bit, not bytes, and also bring endianness into the mix.)
    • What about peculiarities in the framing format? JFIF JPEG files don't have raw bit streams; 0xff bytes are escaped in the input. Does the HW decoder know about such things? What is important enough to include, and what isn't?
    • What if you use some order-1 modeling (e.g. Brotli)? How does your decompressor know what tables to use when?


    You can find various other compressors that have their own unique combination of these characteristics; I was trying to stick to well-known, wide-spread ones. If you decide that any particular format (e.g. JFIF JPEG, or DEFLATE) needs HW support, you do exactly that. For such a block to ever get built in the first place, it needs to accelerate some workloads that customers actually care about. And it might never get used for anything else even if it's actually a lot more generic than that. Especially if its speed or power usage performing the task it's sold for is middling (because it pays an extra cost for being more generic).

    There's other system-level problems too. For example, if it's a device connected to L3, well, we're talking physical memory now. Apps deal in virtual memory. To make sure that its input buffers are actually there, you would need to go through the OS and allocate non-pageable device memory. Apps can't do that by themselves (it's a privileged operation); you need a driver to do it for them. Likewise, the output buffers need to be non-paged as well. But that means that the HW unit can't just read from or write to app buffers; the driver needs to copy the input buffers from app memory to device memory, and the output from device memory to app memory. That's not a show-stopper but still a non-trivial cost factor for a fast decoder (which seems to be your target audience). You can avoid some of that copying if you're willing to complicate app-level memory management a lot more, like GPU APIs do; but now you're making this really hard to use, and unlike 3D rendering (which usually works from a working set in physical memory), I/O actually does frequently involve virtual memory more directly (for example, memory-mapped files being backed by the page cache), which GPU-like interfaces don't; this would force an extra copy (in a different place) anyway.

    Then there's the issue of "what do you actually do while the HW decoder works on the data"? Theoretically, "make the decompress be async and do something else until the data is ready". In practice, not how most people use compression APIs; most use it synchronously and want the decompressed data to be ready when the function returns. If you have other work to do on the same core in the meantime, great! If not (frequently the case), your CPU just ends up waiting anyway.

    Quote Originally Posted by Jarek View Post
    The only resource difference is that tANS needs more memory for the same number of states (in practice they may differ), but this is still negligible compared to a billion+ transistors (and I believe it could directly use a part of L3 - does hardware video decoding use only its own separate memory?).
    Media blocks, when integrated in a SoC, often go through the GPU memory access path (since that's the consumer anyway), which might go to L3, L4 (when present) or straight to memory, and is usually very high-latency because it's bandwidth/throughput-optimized.

    The L3 cache in current Intel CPUs has access latencies on the order of 45-50 clock cycles. That's not gonna cut the mustard if you need a memory access for every single symbol before you can even start decoding the next one.

    Quote Originally Posted by Jarek View Post
    So how do you imagine the first step of processor manufacturers helping general purpose compression?
    Currently, I don't.

    The first, second and third steps are to accept that clock rates haven't gone up significantly in the last 15 years and purely sequential code just isn't going to get faster, certainly not within the next couple of years, and maybe not ever. HW isn't magic. It's merely highly parallel. Modern CPUs offer numerous ways to extract latent parallelism from SW too: instruction-level, memory-level, thread-level, vectorization. Now it's our job to actually use it. Once we have algorithms that have more inherent parallelism (especially if highly structured), they become more suitable for HW implementation also. But they also tend to run much faster (and more power-efficiently!) in SW form, so there's less desire to move them to HW in the first place.

    The real question is, given all the many variations and all the other stuff that people throw into bitstreams (framing, sync markers, extra bits, table switches etc.), how realistic a goal is a "generic" Huffman decoder functional block even? I'm really not sure that the problem is well-scoped enough (and there's enough potential upside) to be worth considering.

    If you have a specific fixed algorithm you want to make faster, sure; consider a custom HW impl, pull out all the stops, go wild, and make that exact thing fast doing whatever you have to do.

    But if your goal is to make new not-yet-existing algorithms fast, err... don't build some "generic" piece of HW that solves an incredibly fuzzy problem and expect it to help. Design the algorithms to be fast using some of the many tools already at your disposal.

  29. Thanks (3):

    Bulat Ziganshin (15th May 2016),schnaader (15th May 2016),Turtle (15th May 2016)

  30. #21
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    882
    Thanks
    478
    Thanked 276 Times in 116 Posts
    I would be quite surprised if someone did a tANS decoder in HW that was even close to competitive with a canonical Huffman decoder, and I'd want to know how they did it.
    For the record, there is at least one hw implementation I know of with published details :

    High throughput hardware architectures for asymmetric numeral systems entropy coding
    Seyyed Mahdi Najmabadi ; Dept. of Parallel Syst., Univ. of Stuttgart, Stuttgart, Germany

    The caveats: this is a university study, not industrial; it certainly doesn't translate into commercial deployment in the foreseeable future.
    The published numbers look pretty good, but there is no comparison with a HW canonical Huffman encoder.
    Last but not least, it's about an encoder, not a decoder...


    Also for the record:
    the problem of requiring a table for tANS decoding is indeed an issue for decoder implementations,
    both hardware and software.

    This was underlined in the early days of FSE :
    http://fastcompression.blogspot.fr/2...42231642830337

    A solution I was looking for was to replace the table lookup by an arithmetic operation (something as simple as a multiplication and a mask, no variable loops). It doesn't need to be competitive with a table lookup, just "good enough", in order to replace the table lookup in situations where the table is too large for the decoding system. Unfortunately, I failed to find one.
    That doesn't mean there is none, there may be a good solution to this problem that simply evaded my attention.
    Though there is a direct relation with the way symbols are spread into the table.

    Jarek's answer to this problem resulted in the invention of rANS.
    rANS can support reduced table-lookup sizes, at the cost of increased complexity, as has already been well covered by fgiesen on his blog.

    Zstandard uses FSE, and in order to limit problems with this issue it uses small tables, in combination with a small enough number of symbols.
    Within v0.6.x, the total amount of memory for all FSE decoding tables combined is a mere 5 KB.
    That sounds small enough for a relatively large number of systems, including embedded ones.
    If required, that amount can be divided by 2, by replacing some table values by more complex indirection rules.
    Going below that will require active encoder assistance (the encoder would need to use smaller-than-max state values, which can also be done).
    Last edited by Cyan; 15th May 2016 at 17:10.

  31. #22
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    767
    Thanks
    236
    Thanked 244 Times in 149 Posts
    Fabian,
    Hardware video decoding is indeed nearly a necessity, but general purpose compression is also an extremely frequent task, and its popularity grows as performance improves - for data storage and transmission; for example, it is currently becoming the default in web browsers.
    Hence, the time for some hardware help in processors should finally come - giving significant benefits at nearly negligible cost (compared to a billion+ transistors).
    The question is: what kind of help should it be?

    One approach is a complete decoder/encoder for some standard - but the situation is too unstable at the moment.
    Another approach is to help with a critical part - like entropy coding, which is much more costly than a software LZ parser - lz4fast reaches 3GB/s software decoding; with a 1 symbol/cycle hardware entropy coder it could maintain this speed at a much better compression ratio.

    Sure, there are huge differences between entropy coders - you definitely won't satisfy all of them with a single HW implementation.
    But let's reverse the point of view: if processors had some extremely fast, relatively general entropy decoder built in, others could write or modify their compressors to make the best use of this unit.

    For example, an order-0 tANS/Huffman decoder with 2048 states and a 256-symbol alphabet:
    - can decode both Huffman and tANS (and include some encryption),
    - if you insist on using longer than 11-bit codes (pr < 2^-11), just split such a symbol into two symbols in preprocessing,
    - if you want a context model, just split your stream into separate streams for the separate contexts.
    Given that a 64-bit multiplication takes a few cycles, I believe it can be done at 1 symbol/cycle. Eventually, entropy coding of the separate streams can be parallelized.

  32. #23
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,552
    Thanks
    767
    Thanked 684 Times in 370 Posts
    Jarek, I'm pretty sure that GPUs can reach 10+ GB/s entropy (de)coding speed - the only requirement is to avoid divisions, which aren't implemented in hardware. I will make a post in a few days. My confidence is based on GPU speeds ranging from 50 GIOPS for a Celeron Ivy Bridge to 4-5 TIOPS for a GeForce 1080, while Huffman/RC encoding, for example, requires about 10-20 operations.

    And big thanks for a very interesting discussion!

    PS: RAR1 switches between 5 Huffman tables, while RAR 2+ is essentially a deflate++ algorithm (or mean(deflate, lzx)). BZIP2 employs 6 Huffman tables, selecting an optimal one for every block of 50 symbols.
    Last edited by Bulat Ziganshin; 16th May 2016 at 15:37.

  33. #24
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    767
    Thanks
    236
    Thanked 244 Times in 149 Posts
    Bulat, you are talking about a heavy GPU ... while important targets are e.g. smartphones - still having a billion+ transistors, they could cheaply get a big gain from some hardware help like a fast HW entropy decoder - for compression of transmission and data storage ...

  34. #25
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,552
    Thanks
    767
    Thanked 684 Times in 370 Posts
    500-5000 GIOPS: discrete desktop GPUs from GF750 to GF1080
    25-500 GIOPS: integrated desktop GPUs from Ivy Bridge Celeron (6 EUs) to Iris Pro 580 (72 EUs)

    I don't have exact data for smartphone SoCs, except that the Apple A9 last year was a little over 100 GFLOPS. So I expect a performance range of about 5-100 GIOPS. High-end SoCs usually dedicate more transistors to their GPU part than to the CPU one. SoCs also have a lot of dedicated codecs as well as a DSP engine for software implementation of the remaining tasks, e.g.: http://anandtech.com/show/9552/qualc...erated-imaging
    Last edited by Bulat Ziganshin; 16th May 2016 at 19:26.

  35. #26
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,552
    Thanks
    767
    Thanked 684 Times in 370 Posts
    Quote Originally Posted by JamesB View Post
    Alas, 4c dropped to ~270MB/s decode rate, so it didn't figure out the cmov benefits. Maybe if I tried the ?: variant that clang successfully uses cmov on then gcc would too, but it's all a bit of a stab in the dark. Perhaps we need another alternative to __builtin_expect, e.g. a __builtin_unpredictable to indicate that this conditional is likely to be fairly random and won't be handled well by the CPU branch predictor, making cmov etc a better bet than a pure test/jmp.
    __builtin_expect is in clang since 3.8 at least. Also you may look at http://comments.gmane.org/gmane.comp...ang.scm/149494 - they talk about an idea I also had: an intrinsic that gives the condition probability, as well as PGO that collects this probability from program runs.
    Last edited by Bulat Ziganshin; 16th May 2016 at 22:58.

  36. Thanks:

    JamesB (16th May 2016)

  37. #27
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    492
    Thanks
    176
    Thanked 174 Times in 118 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    __builtin_expect is in clang since 3.8 at least. Also you may look at http://comments.gmane.org/gmane.comp...ang.scm/149494 - they talk about an idea I also had: an intrinsic that gives the condition probability, as well as PGO that collects this probability from program runs.
    Ah yes - they even have the same name of __builtin_unpredictable.

    In theory PGO may have worked, but clang was getting the right answer anyway in that cmov was faster. Even with profiling demonstrating a highly variable branch I couldn't get gcc to switch to using cmov in that case. A pity as embedding assembly doesn't feel great.
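
    For reference, the hints themselves would be written roughly like this (the clang builtin is real, as noted above; whether either form actually yields a cmov is compiler- and version-dependent):

    /* inside the 24+8 renorm: gcc/clang, mark the rare second step as unlikely */
    if (__builtin_expect(x < RANS_BYTE_L, 0))
        x = (x << 8) | *ptr++;

    /* clang >= 3.8 only: mark the data-dependent first condition as unpredictable,
       nudging the compiler toward cmov rather than a conditional jump */
    if (__builtin_unpredictable(x < RANS_BYTE_L))
        x = (x << 8) | *ptr++;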

  38. #28
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,552
    Thanks
    767
    Thanked 684 Times in 370 Posts
    Highly variable doesn't necessarily mean unpredictable - this is a part of the CPU that's very well optimized. Once I benchmarked bit coding routines and made a loop like this:

    for(;;) for (bits=1..15) out(value,bits)

    out(x,bits) {... total+=bits; if (total>=16) {WriteWord(...)...}}

    It gave me results I did not expect, and later I attributed that to the smartness of the CPU, which was able to find the structure of the WriteWord calls.

  39. #29
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    It gave me results I did not expect, and later I attributed that to the smartness of the CPU, which was able to find the structure of the WriteWord calls.
    State-of-the-art branch predictors do fairly sophisticated modeling themselves. In data compression terms, they basically do context mixing between several models, including pure context models based on varying-size global history windows (the longest of which are quite long indeed, often using over 200 previous branches as the context!), special-purpose predictors for loops, secondary modeling... the works.

    Reasonably recent example (2014): http://www.jilp.org/cbp2014/paper/AndreSeznec.pdf

    TAGE/ITTAGE (the basic predictors here) are old enough to have actually made it into shipping CPU cores by now (ITTAGE-SC-L itself is too new to have made it into a shipping CPU uArch, these things take a while); I doubt anyone uses the algorithms as described in the papers directly, but the basic ideas are bound to crop up. In any case, modeling based on a large global history is nothing new (the Pentium 4 already did this).

    In a reasonably small piece of code (i.e. no instruction cache/branch prediction thrashing), the entropy of the branch direction stream is a fairly good first-order approximation for the actual branch prediction cost. Branches that are either highly biased, or highly correlated with other (previous) branches, are very cheap.

  40. Thanks (2):

    Bulat Ziganshin (17th May 2016),Cyan (17th May 2016)

  41. #30
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    492
    Thanks
    176
    Thanked 174 Times in 118 Posts
    Impressive how advanced things have become.

    This is something I love Google perf-tools for. They're a bit quirky sometimes, but can be wonderful at using the hardware CPU counters to illustrate precisely which branches the CPU is failing to predict and which it is not. For this case, it was very obvious we could get a 20% or more miss rate when refilling the ANS state during decoding (the underflow condition), but you're right that it is highly correlated with the entropy of the data stream.

    The other trick which can make a HUGE difference sometimes (especially on some of the older chips) is to vary the size of the arrays slightly. E.g. a 256x256 lookup table may well be better as a 256x262 one. The reason is that if you regularly access the same column from different rows, the mapping from address to cache set can collide; if more rows collide than the N-way associativity allows, things grind to a halt. I've seen cases as extreme as a 2-3x speed increase from a small tweak to the array dimensions.
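
    As a (hypothetical) illustration of the padding trick:

    #include <stdint.h>

    /* With a power-of-two row stride, same-column entries of different rows sit a
       multiple of 256 bytes apart, so they map onto a small subset of cache sets;
       once more rows collide than the cache's associativity allows, they evict each
       other.  A slightly odd stride spreads the accesses across sets. */
    static uint8_t table_aliased[256][256];  /* column walks may thrash the cache     */
    static uint8_t table_padded [256][262];  /* same data plus 6 bytes of padding/row */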

    I'm sure this is "teaching grandmother to suck eggs" with you two, but others may learn from this. I know I didn't understand this until a year or so ago.


