
Thread: Asymetric Numeral System

  1. #211
    Member
    Join Date
    Nov 2011
    Location
    france
    Posts
    103
    Thanks
    13
    Thanked 54 Times in 35 Posts
    Jarek,

thanks for the hint! I had skipped over this part of the paper too quickly, I guess. I now see how you have to leave a kind of "dead bits pool" in the state x,
between the lower bits that are serialized and the upper ones that keep x close to the Shannon limit. It seems to help the ergodicity and mixing properties
of the system as a whole.

    For experimentation purposes, I've modified the code to go 64-bit for the state x and support any value of L between 16 and 48 (while still doing 16b I/O).
    Hence, I get a reservoir of 'dead bits' of size 0 to 32 bits.
    The C-code is here: http://pastebin.com/371DJHEa
    The C-code is here: http://pastebin.com/371DJHEa

    Experimenting with increasing L values, I've noticed that things improve quite quickly as L increases. For just L=20 bits, you already get most of the benefit
    of the extra precision. Above 20 bits, things stagnate and one doesn't get a noticeable improvement by going up to L=48 bits afterward.
    And one still can't get on par with the precision of arithmetic coding with 16b.
    See the graph of deltaH = (H-H0)/H0 for increasing L values:
    [Attached graph: delta_entropy_variable_bits.png]

    Since my goal is to stay within 32-bit overall precision (because I have ARM processors in mind) and use 16b I/O,
    maybe the right combination is to use 12-bit precision for probabilities (so, M=12) and keep 4 bits of 'dead bits pool'
    in between (so, L=16, basically).

    Still, I'm puzzled about why arithmetic coding keeps being more precise overall, with the same 16b register size.

  2. Thanks:

    Jarek (23rd November 2014)

  3. #212
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    Thanks skal for the graphs - they nicely show the deltaH ~ 1/L^2 behavior (each successive line is ~4 times lower).
    Indeed, the second way to improve is reducing the precision of the probability (M) - e.g. to 12 bits as you have written.

    Why don't you try the uABS variant? It uses a nearly optimal symbol distribution (|epsilon| < 1/x instead of M/x).
    The range variants were intended for large alphabets

    uABS encode step:
    if s = 0 then x = ceil((x+1)/(1-p)) - 1
    if s = 1 then x = floor(x/p)

    Decode step:
    s = ceil((x+1)*p) - ceil(x*p) (0 if fract(x*p) < 1-p, else 1)
    if s = 0 then x = x - ceil(x*p)
    if s = 1 then x = ceil(x*p).

    With uABS you can accurately work on 32bit x, with 16bit precision of probability and 16bits renormalization: you need at most one renormalization step per symbol (single 'if' instead of loop).

    Matt's fpaqc uABS: http://mattmahoney.net/dc/fpaqc.zip

    ps. comparing with arithmetic coding, remember that the buffer containing fractional number of bits is a single number for ANS (x) instead of 2 (Low, Range) ...

    ps2. I see Android has a serious problem with the cost of encryption ( http://www.techspot.com/news/58932-a...rformance.html ) - maybe it would be worth thinking about extremely cheap ANS-based one ...
    Combining with extremely fast compression like zhuff, it could actually improve the storage performance ...
    Last edited by Jarek; 23rd November 2014 at 17:52.

  4. #213
    Member
    Join Date
    Nov 2011
    Location
    france
    Posts
    103
    Thanks
    13
    Thanked 54 Times in 35 Posts
    Jarek,

    Quote Originally Posted by Jarek View Post
    Why don't you try the uABS variant? It uses nearly optimal symbol
    distribution (|epsilon| < 1/x instead of M/x).
    The range variants were intended for large alphabets
    Indeed, uABS works with much better precision (comparable to arithmetic, actually).
    Here's the code for 16b probability precision, with 16b I/O: http://pastebin.com/CTE4zqEn

    The graph for the relative precision looks like:
    [Attached graph: entropy_uABS.png]

    uABS is actually quite a different concept from all the range variants (with table or not, etc.).
    I wonder if it could serve as inspiration for improving rANS's precision overall, so
    that it's ~1/x instead of M/x.

    So far as I can see, decoding for uABS is ~2x slower than the previous rABS ( http://pastebin.com/371DJHEa ),
    and similar in speed to arithmetic coding (http://pastebin.com/FnNK4HDc).

    So, there's definitely a trade-off between speed and precision here.

    /skal

  5. Thanks (2):

    Cyan (24th November 2014),Jarek (24th November 2014)

  6. #214
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    uABS was the original ABS (nearly 9 years ago) - tANS came nearly two years later, and rANS a year ago. I don't see a way for anything practical between them (?) - rABS for M << L becomes practically uniform.

    Let's take the proper decoding step from your implementations (p0 = p*2^16 \in {1, ..., 2^16 - 1}; modified to the same notation, with the remarks below applied).
    rABS:
    uint16_t xfrac = x & 0xffff;
    uint32_t x_new = p0 * (x >> 16);
    if (xfrac < p0) { x = x_new + xfrac; out[i] = 0;
    } else { x -= x_new + p0; out[i] = 1; }

    uABS (very similar to fpaqc):
    uint64_t xp = (uint64_t)x * p0 - 1;
    out[i] = ((xp & 0xffff) + p0) >> 16; // 1 iff (xp & 0xffff) >= 2^16 - p0
    xp = (xp >> 16) + 1;
    x = out[i] ? xp : x - xp;

    plus renormalization (much simpler than in arithmetic coding):
     if (x < (1 << 16)) { x = (x << 16) | *ptr++;  } 


    Some remarks:
    - I see you use a loop for uABS - a single 'if' should be sufficient (we produce ceil(lg(1/p)) or floor(lg(1/p)) bits per step for a symbol of probability p - at most 16 bits for 16-bit precision of p),
    - Maybe rABS would be a bit faster if putting "out[i] = (xfrac >= p0)" into the 'if'? (applied above)
    - 32 bits would be sufficient for the multiplication in the second line of rABS (in contrast to uABS, where it might be the costly part) (applied above).

    This rABS being twice as fast as uABS seems suspicious?
    Anyway, rABS with 12-bit M and 16-bit renormalization seems completely sufficient?

    Does anybody see some optimizations?
    nburns has mentioned that he has found some for fpaqc?

    ps. the rABS above uses M=2^16 (p = p0/M, p0 \in {1, ..., M-1}), so it is quite inaccurate in the above setting.
    Here it is modified for a general M=2^N (e.g. N=12 for up to 256 times lower deltaH):
    uint16_t xfrac = x & mask;      // mask = M-1
    uint32_t x_new = p0 * (x >> N);
    if (xfrac < p0) { x = x_new + xfrac; out[i] = 0;
    } else { x -= x_new + p0; out[i] = 1; }

    The only benefit I see from extreme 16-bit probability precision is for extremely low-probability symbols - they could be handled with a specialized separate entropy coder.

    ps2. For completeness, your decoding step for arithmetic coding:
     uint32_t split = low + ((uint64_t)(hi - low) * p0 >> 16); 
    out[i] = (x > split);
    if (out[i]) {low = split + 1;} else {hi = split;}
    if ((low ^ hi) < 0x10000) { // 16bit renormalization
    x = (x << 16) | *ptr++;
    low <<= 16;
    hi = (hi << 16) | 0xffff;
    }


    The rABS speedup seems to come from staying in 32-bit arithmetic (?)

    update: Pascal has managed to reduce the number of instructions for uABS:
    uint64_t xp = (uint64_t)x * p0;      
    out[i] = ((xp & 0xffff) + p0) >> L_pb;
    xp >>= L_pb;
    x = out[i] ? xp : x - xp;
    Last edited by Jarek; 29th November 2014 at 01:11. Reason: update

  7. #215
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Jarek View Post
    Does anybody see some optimizations?
    nburns has mentioned that he has found some for fpaqc?
    The optimization I posted for fpaqc deals with the bit predictor, so it's not in the ANS logic per se. I spent a fair bit of time playing with the ANS code in fpaq[bc], and there were some tweaks that appeared beneficial, but didn't appear important enough to post. At least, it would have been premature to share them at the stage they were at. Maybe I'll pick it back up someday.

  8. #216
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    this code looks pretty fast by itself. The only potential problem is stupid C compilers that can't use 32*32->64 multiplication and CMOVcc instructions. I suggest trying an assembler version or profile-guided optimization.

  9. #217
    Member
    Join Date
    Nov 2011
    Location
    france
    Posts
    103
    Thanks
    13
    Thanked 54 Times in 35 Posts
    Hi,

    indeed, I could make the code a little bit faster (faster than arithmetic), but it's getting a little tricky
    because it heavily depends on exactly what word-size / precision one is targeting.
    Here's the code: http://pastebin.com/GvrEsrr8
    I inverted the ceil/floor rounding direction, and it indeed takes fewer instructions overall in the end.

    The disassembly shows that the code is pretty tight already. Here's the main
    decoding loop (x86-64, gcc -O3):

    Code:
    // while (x < base_x) x = (x << L_IO) | *ptr++;   // decode forward
     130:   c1 e0 10                shl    $0x10,%eax
     133:   48 83 c7 02             add    $0x2,%rdi 
     137:   89 c2                   mov    %eax,%edx 
     139:   0f b7 47 fe             movzwl -0x2(%rdi),%eax
     13d:   09 d0                   or     %edx,%eax
     13f:   3d ff 00 00 00          cmp    $0xff,%eax
     144:   76 ea                   jbe    130 <uABSDecode+0x40>
    //    uint64_t xp = (uint64_t)x * q0;  
     146:   89 c1                   mov    %eax,%ecx
     148:   49 0f af c9             imul   %r9,%rcx 
     14c:   0f b6 d1                movzbl %cl,%edx 
    //    xp >>= L_pb;
     14f:   48 c1 e9 08             shr    $0x8,%rcx
    //    out[i] = ((xp & (K - 1)) + q0) >> L_pb;   
     153:   4c 01 ca                add    %r9,%rdx 
     156:   29 c8                   sub    %ecx,%eax
     158:   48 c1 ea 08             shr    $0x8,%rdx
     15c:   84 d2                   test   %dl,%dl  
     15e:   42 88 14 16             mov    %dl,(%rsi,%r10,1)
     162:   49 8d 50 01             lea    0x1(%r8),%rdx
    //    x = out[i] ? xp : x - xp;
     166:   0f 45 c1                cmovne %ecx,%eax
     169:   4c 39 da                cmp    %r11,%rdx
     16c:   74 08                   je     176 <uABSDecode+0x86>
     16e:   4d 89 c2                mov    %r8,%r10
     171:   49 89 d0                mov    %rdx,%r8
     174:   eb c9                   jmp    13f <uABSDecode+0x4f>
     176:   f3 c3                   repz retq
    The decoding-speed curve of arithmetic vs uABS shows that in the middle of the probability range,
    uABS is now ~2x faster:

    [Attached graph: uABS_speed.png]

    About the need for the 'while (x < base_x) { ... }' loop instead of just 'if (x < base_x)': yes, it's needed for very low probability
    values (or a probability very close to 1). I guess I could do a special case for these...

    thanks!
    /skal
    Last edited by skal; 25th November 2014 at 19:49.

  10. Thanks:

    Cyan (27th November 2014)

  11. #218
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    Quote Originally Posted by skal View Post
    About the need for the 'while (x < base_x) { ... }' loop instead of just 'if (x < base_x)': yes, it's needed for very low probability
    values (or a probability very close to 1). I guess I could do a special case for these...
    /skal
    The need to read 16 bits twice for a 32-bit x would mean that we got down to x = 0.
    How can you get there starting from x >= 2^16 and p >= 2^-16? e.g. x -> floor(x*p)
    A single 'if' should be sufficient here ...

    And how does rABS speed look here? Do you get better precision of p than e.g. 12 bits?

    Ps. The much more uniform speed of uABS seems a good feature for synchronizing with other parts of a compressor.
    Last edited by Jarek; 25th November 2014 at 22:20.

  12. #219
    Member
    Join Date
    Nov 2011
    Location
    france
    Posts
    103
    Thanks
    13
    Thanked 54 Times in 35 Posts
    Jarek,

    Quote Originally Posted by Jarek View Post
    The need to read 16 bits twice for a 32-bit x would mean that we got down to x = 0.
    How can you get there starting from x >= 2^16 and p >= 2^-16? e.g. x -> floor(x*p)
    A single 'if' should be sufficient here ...
    indeed, I can remove the 'while' in this case. The code for 16b precision is here: http://pastebin.com/8hQijAMu
    The inner loop is now pretty tight:

    Code:
    //  if (x < K)  x = (x << L_IO) | *ptr++;
     130:   4d 89 c1                mov    %r8,%r9
     133:   49 89 c8                mov    %rcx,%r8
     136:   3d ff ff 00 00          cmp    $0xffff,%eax
     13b:   77 10                   ja     14d <uABSDecode+0x5d>
     13d:   c1 e0 10                shl    $0x10,%eax
     140:   49 83 c2 02             add    $0x2,%r10
     144:   89 c1                   mov    %eax,%ecx
     146:   41 0f b7 42 fe          movzwl -0x2(%r10),%eax
     14b:   09 c8                   or     %ecx,%eax
    // uint64_t xp = (uint64_t)x * q0;
     14d:   89 c1                   mov    %eax,%ecx
     14f:   48 0f af cb             imul   %rbx,%rcx
    //    out[i] = ((xp & (K - 1)) >= p0);
     153:   0f b7 f9                movzwl %cx,%edi
     156:   4c 39 df                cmp    %r11,%rdi
     159:   40 0f 93 c7             setae  %dil
    //    xp >>= L_pb;
     15d:   48 c1 e9 10             shr    $0x10,%rcx
    //  out[i] = ((xp & (K - 1)) >= p0);
     161:   29 c8                   sub    %ecx,%eax
     163:   40 84 ff                test   %dil,%dil
     166:   42 88 3c 0e             mov    %dil,(%rsi,%r9,1)
    //    x = out[i] ? xp : x - xp;
     16a:   0f 45 c1                cmovne %ecx,%eax
     16d:   49 8d 48 01             lea    0x1(%r8),%rcx
     171:   48 39 d1                cmp    %rdx,%rcx
     174:   75 ba                   jne    130 <uABSDecode+0x40>
     176:   5b                      pop    %rbx
     177:   c3                      retq
    And how does rABS speed look here? Do you get better precision of p than e.g. 12 bits?

    Speed-wise, this uABS is now faster in the mid-probability range (which is a good property!) and
    gets slower for skewed distributions (less of a concern). See the graph for decoding speed of uABS vs rABS vs arithmetic:

    [Attached graph: uABS_rABS_arith_speed.png]

    /skal

  13. Thanks:

    Jarek (26th November 2014)

  14. #220
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    can you replace the 'if' with unconditional read from memory? i.e.

    ptr += bytes_to_skip
    x = *ptr

  15. Thanks:

    Jarek (26th November 2014)

  16. #221
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    Indeed, maybe making it branch-less could be beneficial, for example
     if (x < (1 << 16)) { x = (x << 16) | *ptr++;  } 

    renormalization can be changed to
     ren = (x < (1 << 16)) << 4;    // 16 if renormalization needed, 0 else
    x = (x << ren) | (*ptr >> (16 - ren));
    ptr += ren >> 4;               // advance ptr by one 16-bit word only when ren == 16


    Or
    x = out[i] ? xp : x - xp;

    can be made branch-free by filling a size-2 table {x - xp, xp} and then taking its entry at index out[i].

  17. #222
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    Jarek, the second one already employs CMOV so there's no need to change it, at least for the current compiler. And emulating the 'if' via a table will be slow anyway: an unpredicted 'if' takes 14+ ticks (so in the worst case of 50%/50% you lose 7+ ticks on average), while store-to-load forwarding with address calculation takes 13 ticks every time.

    renormalization can be changed to
    According to my knowledge, this code should be much faster, at least when the x<64K condition has 50% probability.
    Last edited by Bulat Ziganshin; 26th November 2014 at 02:46.

  18. #223
    Member
    Join Date
    Nov 2011
    Location
    france
    Posts
    103
    Thanks
    13
    Thanked 54 Times in 35 Posts
    Hi,
    Quote Originally Posted by Bulat Ziganshin View Post
    Jarek, the second one already employs CMOV so there's no need to change it, at least for the current compiler. And emulating the 'if' via a table will be slow anyway: an unpredicted 'if' takes 14+ ticks (so in the worst case of 50%/50% you lose 7+ ticks on average), while store-to-load forwarding with address calculation takes 13 ticks every time.


    According to my knowledge, this code should be much faster, at least when the x<64K condition has 50% probability.
    Yes, I tried renormalisation with masks or a table, and it's ~50% slower.

    Not surprisingly, i also tried 2x interleaving. Code is here: http://pastebin.com/utufHJ6J
    I get a 60-70% speed-up with it compared to regular uABS, which is nice.

    /skal
    Last edited by skal; 26th November 2014 at 12:58.

  19. #224
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    Skal, have you maybe also checked the interleaving speedup for arithmetic coding?

    Also, I think tABS would be even faster, at the cost of accuracy - a kind of CABAC M-coder approximation.
    It can also be cheaper for hardware implementations, as a 32-bit multiplier seems quite expensive (still cheaper than 64-bit...).
    Additionally, for larger L, one could slightly perturb the symbol spreads using a PRNG initialized with a cryptokey, for simultaneous encryption.

    Decoding step with renormalization is just:
    t = decodingTable[probability][x];
    out = t.symbol ^ AH; // AH - probability above 1/2
    x = t.newX | readBits(t.nbBits)


    For any case, here are some L=16 state symbol spreads for correspondingly 1/32, 2/32, ..., 16/32 quantized probabilities:
    0000000000000010
    0000100000000000
    0000000001000010
    0010000010000000
    0000010000001010
    0001000010001000
    0001000010001010
    0010010010000010
    0010010010001010
    1000101000101000
    0010101000100011
    1001001010001010
    0010101010001011
    1010100011001100
    1010101010101000
    1111001010010000

    from
    [Attached image: quasi.png]
    As long as prob > 1/32, it would stay near deltaH ~ 0.001 bits/symbol.
    The decoding table needs 1 byte/position: 5 bits for newX, 2 bits for nbBits, 1 bit for the symbol - so the total memory cost of all the tables is just 16x16 = 256 bytes (also extremely cheap for hardware implementations).

    The spreads were chosen by testing all possibilities, so their probability distribution of states can diverge from 1/x (which is why I've called them qABS - quasi - in the preprint); this may worsen behavior when switching between probabilities.
    Safer (maintaining Pr(x) ~ 1/x) would be choosing symbol spreads using e.g. the tuned spread for a larger L.

    For example, we can take L=64 states and 1/128 probability quantization - deltaH should be ~16 times lower (~0.0001 bits/symbol), problems with low-probability symbols would start at p ~ 0.01, and the memory needed for all the decoding tables would be 64x64x2 = 8192 bytes (fits in L1).

    ps. for uABS/rABS you can use the formulas to encode some information in the initial state, to compensate for the cost of storing the final state - just start with x=1 and, until reaching x >= L, perform a few encoding steps without renormalization.
    Last edited by Jarek; 27th November 2014 at 08:03.

  20. #225
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Jarek View Post
    update: Pascal has managed to reduce the number of instructions for uABS:
    uint64_t xp = (uint64_t)x * p0;      
    out[i] = ((xp & 0xffff) + p0) >> L_pb;
    xp >>= L_pb;
    x = out[i] ? xp : x - xp;
    Looks like he got rid of adding and subtracting 1. I think I was puzzled as to exactly what those were for, and I stared at that code forever. I really wanted to get rid of them, but I think I never quite made up my mind as to whether they were necessary. I guess my instinct was that they looked out of place and could possibly go.

  21. #226
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    I have a guess at the optimal symbol spread function for, e.g., tABS: Huffman + reverse binary.

    I would not use Huffman in the traditional way. I would use Huffman to generate the optimal ordering of the ranges, and that's all. The order would come from a depth-first traversal of the Huffman tree, and the ranges themselves would use the actual symbol counts -- not power-of-2 Huffman codes.

    Put the symbol ranges in that order, then apply the reverse binary transform.

  22. #227
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    nburns, I have to admit that I don't see why it should work well? Anyway, if you think so, just test it with the toolkit and report the results.
    https://github.com/JarekDuda/Asymmet...SystemsToolkit

  23. #228
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Jarek View Post
    nburns, I have to admit that I don't see why it should work well? Anyway, if you think so, just test it with the toolkit and report the results.
    https://github.com/JarekDuda/Asymmet...SystemsToolkit
    I will test it when I can, but I haven't had access to a computer to use for programming. In the meantime, though, I've been thinking it over and developing lines of reasoning. The ultimate best outcome, of course, would be a proof.

    I think what matters is how tightly the symbols are packed into the low bits of the state. The upper bits contain the information that's carried over between symbols. If the ranges are not optimally ordered, symbols will tend to alias each other in the low bits, and the symbol information will creep up into the upper bits. Then the process of shifting off the low bits will be inefficient, and there will be less space at the top to carry information between symbols.

    I think Huffman is likely to be the solution to packing symbols in the low bits. I am proposing -- obviously -- that the problem is related to the problem of finding the optimal prefix code. (Actually, Huffman packs the symbols into the high bits -- but you apply reverse binary to flip them over.) The logic appeals to me. Huffman provides an optimal construction, and ANS provides a process for carrying information forward.

    It's not quite a proof, and empirical evidence is pending. But, hopefully, even if there's something incomplete, there is insight here and it helps drive things forward.

  24. #229
    Member
    Join Date
    Oct 2015
    Location
    Portugal
    Posts
    12
    Thanks
    1
    Thanked 1 Time in 1 Post
    Can somebody please explain in really easy terms how this compression works? I tried to read all the sources available, but I still don't get it. I have a rough idea of how encoding and decoding work, but I don't know where the "real" compression is happening. Thanks

  25. #230
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    Hi, the basic concept is in the right-hand side of this diagram:

    [Attached diagram: ans.png]

    You want to encode the "01111" symbol sequence into a natural number x.
    The upper part uses the standard binary system:
    x -> 2x + s
    encoding the new symbol in the youngest position and shifting all the old bits one position up.
    This encoding rule can be seen as
    "x -> the x-th appearance of an even (s=0) or odd (s=1) number"

    To asymmetrize it, we can use the same rule but redefine the "even" and "odd" numbers according to a chosen density - here for Pr(0)=1/4, Pr(1)=3/4.
    Finally, the standard binary system leads to x=47 for the sequence, while the asymmetric one leads to x=18, which is smaller - cheaper to store.

    This x can be seen as a buffer containing lg(x) bits of information (no floor or ceiling!).
    Adding the information of a symbol of probability p (containing lg(1/p) bits of information) means that the amount of information in the new x should be
    lg(x) + lg(1/p) = lg(x/p)
    Hence the new x, containing both the old x and the symbol s, should be ~ x/Pr(s), which is exactly what happens in ANS.
    You also need to add renormalization to prevent x -> infinity, which is done by just sending the youngest bits of x to the bitstream when needed.

    Is it more clear now?

  26. #231
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    This prompted me to finally get around to putting my recent rANS tweaks onto github, bringing myself into the 21st century at last!

    https://github.com/jkbonfield/rans_static

    My rANS code is derived from Ryg's rANS, but more heavily unrolled and with a few tweaks here and there to improve out-of-order execution and memory contention. I also created an order-1 variant. One day I'll explore FSE to see what an order-1 driver would look like; maybe! Speed-wise it doesn't compare well vs FSE, but the order-1 encoding is what I needed for my day job.

    Anyway, my recent round of tweaks have sped it up slightly on my core-i5, sometimes by as much as 10% although usually much less. I also added a variant that shifts out data in 16bit chunks, leaving the smallest state as 16 bit, rather than 8 & 24. This is how Ryg's SIMD version worked too. It's not compatible with my earlier "4c" output so I haven't tested it much as I can't really use it in my work, but I was curious to see how well it could be done using 32-bit instructions.

    My two test files are high and low complexity DNA quality strings (Q40 and Q8 respectively):


    Q40 test file, order 0:
    Code:
    arith_static            251.2 MB/s enc, 143.0 MB/s dec  94602182 bytes -> 53711390 bytes
    rANS_static4c           289.2 MB/s enc, 340.0 MB/s dec  94602182 bytes -> 53690171 bytes
    rANS_static4j           293.7 MB/s enc, 376.1 MB/s dec  94602182 bytes -> 53690159 bytes
    rANS_static4_16i        299.9 MB/s enc, 377.0 MB/s dec  94602182 bytes -> 53690048 bytes
    rANS_static64c          279.0 MB/s enc, 439.4 MB/s dec  94602182 bytes -> 53691108 bytes
    Q8 test file, order 0:
    Code:
    arith_static            239.1 MB/s enc, 145.4 MB/s dec  73124567 bytes -> 16854053 bytes
    rANS_static4c           291.7 MB/s enc, 349.5 MB/s dec  73124567 bytes -> 16847633 bytes
    rANS_static4j           290.5 MB/s enc, 354.2 MB/s dec  73124567 bytes -> 16847597 bytes
    rANS_static4_16i        371.0 MB/s enc, 453.6 MB/s dec  73124567 bytes -> 16847528 bytes
    rANS_static64c          351.0 MB/s enc, 549.4 MB/s dec  73124567 bytes -> 16848348 bytes
    Q40 test file, order 1:
    Code:
    arith_static            128.4 MB/s enc,  94.7 MB/s dec  94602182 bytes -> 43420823 bytes
    rANS_static4c           168.4 MB/s enc, 212.1 MB/s dec  94602182 bytes -> 43167683 bytes
    rANS_static4j           170.0 MB/s enc, 240.2 MB/s dec  94602182 bytes -> 43167683 bytes
    rANS_static4_16i        188.4 MB/s enc, 254.9 MB/s dec  94602182 bytes -> 43229063 bytes
    rANS_static64c          203.9 MB/s enc, 287.1 MB/s dec  94602182 bytes -> 43168614 bytes
    Q8 test file, order 1:
    Code:
    arith_static            189.2 MB/s enc, 130.4 MB/s dec  73124567 bytes -> 15860154 bytes
    rANS_static4c           208.6 MB/s enc, 269.2 MB/s dec  73124567 bytes -> 15849814 bytes
    rANS_static4j           210.5 MB/s enc, 305.1 MB/s dec  73124567 bytes -> 15849814 bytes
    rANS_static4_16i        240.4 MB/s enc, 345.4 MB/s dec  73124567 bytes -> 15869029 bytes
    rANS_static64c          239.2 MB/s enc, 397.6 MB/s dec  73124567 bytes -> 15850522 bytes

    Finally, I have to say, a seriously good job from Cyan on FSE Huff. It's incredibly quick on this data and while it can't compete with order-1 entropy encoding for some data it's sufficient for many cases. Some of it can be mitigated by packing multiple symbols into one byte too.

    Code:
    @ [tmp/rans_sta...]; ~/work/compression/fse_huff2 < /tmp/Q40
     402.7 MB/s enc,  596.0 MB/s dec     94602182 bytes -> 54093961 bytes
     548.3 MB/s enc, 1001.8 MB/s dec     94602182 bytes -> 54093961 bytes
     552.6 MB/s enc, 1029.4 MB/s dec     94602182 bytes -> 54093961 bytes
     552.3 MB/s enc, 1029.7 MB/s dec     94602182 bytes -> 54093961 bytes
     551.2 MB/s enc, 1028.9 MB/s dec     94602182 bytes -> 54093961 bytes
     552.6 MB/s enc, 1030.3 MB/s dec     94602182 bytes -> 54093961 bytes
     553.5 MB/s enc, 1029.6 MB/s dec     94602182 bytes -> 54093961 bytes
     552.2 MB/s enc, 1028.2 MB/s dec     94602182 bytes -> 54093961 bytes
    
    @ [tmp/rans_sta...]; ~/work/compression/fse_huff2 < /tmp/Q8 
     645.4 MB/s enc,  682.6 MB/s dec     73124567 bytes -> 16577764 bytes
     665.0 MB/s enc, 1480.0 MB/s dec     73124567 bytes -> 16577764 bytes
     663.7 MB/s enc, 1528.6 MB/s dec     73124567 bytes -> 16577764 bytes
     667.9 MB/s enc, 1527.9 MB/s dec     73124567 bytes -> 16577764 bytes
     669.8 MB/s enc, 1532.0 MB/s dec     73124567 bytes -> 16577764 bytes
     669.2 MB/s enc, 1533.5 MB/s dec     73124567 bytes -> 16577764 bytes
     667.2 MB/s enc, 1525.8 MB/s dec     73124567 bytes -> 16577764 bytes
     669.7 MB/s enc, 1533.0 MB/s dec     73124567 bytes -> 16577764 bytes
     668.0 MB/s enc, 1532.4 MB/s dec     73124567 bytes -> 16577764 bytes
     668.5 MB/s enc, 1526.1 MB/s dec     73124567 bytes -> 16577764 bytes
    The huff2 there is 2 bytes per symbol, so it analyses the input data and, if it has fewer than 16 symbols, packs them into nibbles. This affects speed a bit (encoding too, even when it can't pack them), but reduces the overheads of Huffman on smaller alphabets. This worked for Q8 but not for Q40. The basic 1- and 2-byte entropy is:

    Code:
    $ ~/work/compression/entropy16 < /tmp/Q40
    len = 94602182 bytes, entropy16 = 48376591.101198, 4.090949 bits per byte
    len = 94602182 bytes, entropy8  = 53753874.085890, 4.545677 bits per byte
    $ ~/work/compression/entropy16 < /tmp/Q8
    len = 73124567 bytes, entropy16 = 16344936.441046, 1.788175 bits per byte
    len = 73124567 bytes, entropy8  = 16852471.743796, 1.843700 bits per byte
    I'm tempted to start using FSE huff in my work, once it's declared ready. (Is it stable yet?)


    Edit: For comparison the FSE on these files benchmarks as so:

    Code:
    FSE : Finite State Entropy, 64-bits demo by Yann Collet (Aug 17 2015)
    /tmp/Q40         :  94602182 ->  53758663 (56.83%),  341.3 MB/s ,  431.0 MB/s  
    
    FSE : Finite State Entropy, 64-bits demo by Yann Collet (Aug 17 2015)
    /tmp/Q8          :  73124567 ->  16852131 (23.05%),  365.6 MB/s ,  402.9 MB/s

  27. Thanks (4):

    Cyan (16th April 2016),Jarek (15th April 2016),schnaader (17th April 2016),Turtle (16th April 2016)

  28. #232
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    Hi James, thanks for the sources.
    The large speed difference of Yann's Huff0 here is mainly due to grouping 2 bytes.
    Comparing it with FSE on a large alphabet gives much closer speeds: 276 vs 429 MB/s encoding, 447 vs 519 MB/s decoding:
    http://encode.su/threads/1920-In-mem...ll=1#post45168
    Hamid claims even smaller differences: https://sites.google.com/site/powturbo/entropy-coder

    As you have noticed, this grouping also improves the compression ratio (the order of prediction) - you can also do it for ANS. An 8-symbol alphabet is rather a degenerate application.
    And generally, the alphabet size is not limited to 256. In particular, alias rANS can be used efficiently for huge alphabets with nearly perfect accuracy.

    The new Huff0 also uses the trick that the decoder's state often determines a few symbols at once:
    http://fastcompression.blogspot.fr/2...lti-bytes.html
    This trick is rather restricted to order-0 models (otherwise you would need to prepare separate tables for all context combinations).
    As I commented there, it can partially be done directly for tANS/FSE as well, especially when renormalizing by more than one bit at a time.

    More practical would be doing it indirectly: the symbol grouping done by the decoder can be shifted into grouping symbols into a larger alphabet for entropy coding, to get an analogous speedup for ANS.
    We can group a varying number of symbols so that the final alphabet has a more uniform probability distribution; then a single decoder step produces several symbols, exactly as in Yann's trick.
    For example, assuming a dominating symbol A, such a (prefix-free) grouping can be:
    B, C, D, ..., AA, AB, AC, AD ...
    nearly doubling the alphabet size, but also increasing the speed and the order of the method (compression ratio).
    The next step could be nearly tripling the alphabet:
    B, C, D, ... AB, AC, AD, ... AAA, AAB, AAC, AAD ...

    An example of a general expansion algorithm: until some alphabet size is exceeded, take the most probable leaf and expand it (in the above examples, first A was expanded, then AA).
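    One step of that greedy expansion could be sketched as follows (hypothetical symbol names and probabilities, assuming i.i.d. order-0 statistics for the base symbols):

```c
#include <assert.h>
#include <math.h>
#include <stdio.h>
#include <string.h>

#define NBASE 4     /* hypothetical base alphabet: A, B, C, D */
#define MAXW  64    /* cap on the expanded alphabet size */

typedef struct { char str[16]; double p; } Entry;

/* One expansion step: remove the most probable leaf w and insert its
   children wA, wB, ... (keeps the grouping prefix-free; probabilities
   follow the i.i.d. assumption P(ws) = P(w) * P(s)). The caller stops
   expanding before *n + NBASE - 1 would exceed MAXW. */
static void expand_once(Entry *tab, int *n,
                        const char *base, const double *pb) {
    int best = 0;
    for (int i = 1; i < *n; i++)
        if (tab[i].p > tab[best].p) best = i;
    Entry leaf = tab[best];
    tab[best] = tab[--*n];              /* remove the expanded leaf */
    for (int s = 0; s < NBASE; s++) {
        Entry e;
        snprintf(e.str, sizeof e.str, "%s%c", leaf.str, base[s]);
        e.p = leaf.p * pb[s];
        tab[(*n)++] = e;                /* append its NBASE children */
    }
}
```

    Starting from {A, B, C, D} with P(A) = 0.7, the first step yields exactly the grouping above: {B, C, D, AA, AB, AC, AD}; since AA (0.49) is now the most probable leaf, the second step produces the near-tripled alphabet.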
    Last edited by Jarek; 16th April 2016 at 14:06.

  29. #233
    Member
    Join Date
    Mar 2016
    Location
    Spain
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts
    There is only one Java implementation of ANS (https://code.google.com/archive/p/kanzi/source), but it is the rANS variant. I am interested in tANS. Is it hard to rewrite it to this variant? Which parts do I have to change? Thanks

  30. #234
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    178
    Thanks
    61
    Thanked 51 Times in 40 Posts
    irect, a more up-to-date version of the Java code is here: https://github.com/flanglet/kanzi. Before answering your question, can you explain why you want tANS vs. rANS? What is your use case?

  31. #235
    Member
    Join Date
    Mar 2016
    Location
    Spain
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Thanks hexagone for the link. I need to compress data which contains only 250 different characters, so the tANS variant is the best, isn't it?

  32. #236
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    178
    Thanks
    61
    Thanked 51 Times in 40 Posts
    ANS is ANS, and different implementations (range based or table based) should give similar compression ratios (the size of the tables used to encode frequencies has an impact, obviously). I see no reason to believe that tANS would better fit your scenario. Before implementing a tANS version, why not give the rANS code a try and see if the results match your expectations? If you want to go ahead and implement tANS, Cyan's code is where I would start.

  33. #237
    Member
    Join Date
    Mar 2016
    Location
    Spain
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I should mention that speed is also important to me. So the tANS variant, where there aren't any multiplications, should give the best results...

  34. #238
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    178
    Thanks
    61
    Thanked 51 Times in 40 Posts
    I get 50 MB/s encoding and 124 MB/s decoding with this version (i7-2600 @3.40GHz, Win7). Speed could be improved by removing the division and replacing it with an inverse multiplication, if that matters to you. Multiplication is cheap, but do not forget that dereferencing arrays in Java is not free (the JVM does a decent job at removing bounds checks, but they are not entirely eliminated). So it remains to be seen what sort of speed advantage tANS would have. If you go ahead and implement it, though, I would welcome looking at the code.
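    For context, the per-symbol division mentioned above sits in the rANS coding step itself. A minimal order-0 rANS round trip in C (loosely following the layout of Ryg's public rans_byte code, but with 16-bit renormalization, a made-up 4-symbol alphabet, and 12-bit frequencies) shows where it occurs:

```c
#include <assert.h>
#include <stdint.h>

#define SCALE_BITS 12
#define TOTAL      (1u << SCALE_BITS)   /* frequencies sum to 4096 */
#define RANS_L     (1u << 16)           /* lower bound of state interval */

/* hypothetical 4-symbol alphabet */
static const uint32_t freq[4] = { 2048, 1024, 512, 512 };
static const uint32_t cum [5] = { 0, 2048, 3072, 3584, 4096 };

/* Encode in reverse symbol order, pushing 16-bit words; returns count. */
static int rans_encode(const uint8_t *sym, int n, uint16_t *out) {
    uint32_t x = RANS_L;
    int w = 0;
    for (int i = n - 1; i >= 0; i--) {
        uint32_t f = freq[sym[i]];
        /* renormalize so the coding step keeps x below RANS_L << 16 */
        uint32_t x_max = ((RANS_L >> SCALE_BITS) << 16) * f;
        while (x >= x_max) { out[w++] = (uint16_t)x; x >>= 16; }
        /* the coding step: one division and one modulo per symbol --
           this is the division an inverse multiplication would replace */
        x = ((x / f) << SCALE_BITS) + (x % f) + cum[sym[i]];
    }
    out[w++] = (uint16_t)x;          /* flush the final 32-bit state */
    out[w++] = (uint16_t)(x >> 16);
    return w;
}

/* Decode by consuming the words from the end of the buffer backwards. */
static void rans_decode(const uint16_t *in, int w, uint8_t *sym, int n) {
    uint32_t hi = in[--w];
    uint32_t lo = in[--w];
    uint32_t x = (hi << 16) | lo;
    for (int i = 0; i < n; i++) {
        uint32_t low = x & (TOTAL - 1);
        uint32_t s = 0;
        while (cum[s + 1] <= low) s++;   /* linear symbol lookup */
        x = freq[s] * (x >> SCALE_BITS) + low - cum[s];
        while (x < RANS_L) x = (x << 16) | in[--w];
        sym[i] = (uint8_t)s;
    }
}
```

    The decoder's multiply-only step is what makes rANS decoding cheap; it is only the encoder that pays for the division, which a precomputed reciprocal can turn into a multiply and shift.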

  35. #239
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    528
    Thanks
    204
    Thanked 187 Times in 128 Posts
    I've no idea how it compares, but there is also the Java rANS from CRAM here:

    https://github.com/samtools/htsjdk/t.../encoding/rans

    This has E04/D04 files for 4-way interleaved order-0 encoding and decoding, and E14/D14 for the order-1 codec. We found for our specific use case (DNA sequencing instrument quality values) that the order-1 encoding was making a sizeable difference in ratio; e.g. the 53.7 MB to 43.2 MB in one example above. Obviously a full and complex model (e.g. as used in fqzcomp or fastqz) does a better job, but this was a very fast intermediate solution.

    That Java code isn't mine, btw. I just made the C variant, based off Ryg's original implementation. And to be honest, I only chose that over FSE to make order-1 because I understood the code better! I'm not sure how much more work it would be to make an order-1 tANS variant instead, and whether it would work any better or not. Certainly my order-0 variant can't quite match FSE on speed, so perhaps order-1 would be similar?

  36. #240
    Member
    Join Date
    Mar 2016
    Location
    Spain
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Jarek View Post
    Hi, the basic concept is in the right hand side part of this diagram:

    [Attachment: ans.png - diagram of the basic ANS construction]
    Ok, I finally understand how this basic example works. The last thing I can't figure out is how the "even" and "odd" numbers are redefined. Why do we choose density Pr(0)=1/4, Pr(1)=3/4 for the sequence 01111? Thank you for the answer.
