
Thread: recip_arith : a division free arithmetic codec by Charles Bloom

  1. #1
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    483
    Thanked 279 Times in 119 Posts

    recip_arith : a division free arithmetic codec by Charles Bloom

    with source code available on github :
    https://github.com/thecbloom/recip_arith

  2. Thanks (6):

    Bulat Ziganshin (22nd October 2018),comp1 (19th October 2018),encode (19th October 2018),JamesB (19th October 2018),Jarek (19th October 2018),Mike (20th October 2018)

  3. #2
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    807
    Thanks
    245
    Thanked 257 Times in 160 Posts
    Oh, I thought it was also multiplication-free like Moffat's/daala, but it is indeed only division-free.

    Its "No Google, you can't patent this" license should be put especially on all machine learning papers ... https://news.ycombinator.com/item?id=17266951

  4. #3
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    506
    Thanks
    187
    Thanked 177 Times in 120 Posts
    OK that's nice timing. I was just doing the same thing.

    Thank you Charles (& RAD Game Tools).

  5. #4
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    506
    Thanks
    187
    Thanked 177 Times in 120 Posts
    Some interesting bits in here. The reciprocals are ((1<<N) + i - 1)/i rather than (1<<N)/i, so they round up rather than down. I think this, coupled with division by only a small value, is sufficient to ensure the number of bits needed in the multiplier always fits within a uint32_t, avoiding the complexity of extra shifts and adds (like the Montgomery solution). I like that.
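    A minimal sketch of that rounding choice (the names here are mine, not from recip_arith):

    Code:
        #include <stdint.h>

        #define RECIP_BITS 24            /* illustrative precision choice */
        static uint32_t recip[256];

        /* recip[i] = ceil(2^RECIP_BITS / i) -- the round-up reciprocal */
        static void recip_init(void) {
            for (uint32_t i = 1; i < 256; i++)
                recip[i] = ((1u << RECIP_BITS) + i - 1) / i;
        }

        /* x/i via one 64-bit multiply and a shift; the floor multiply of a
         * ceil reciprocal never comes out below the exact quotient, which
         * is the one-sided behaviour the coder relies on */
        static inline uint32_t div_small(uint32_t x, uint32_t i) {
            return (uint32_t)(((uint64_t)x * recip[i]) >> RECIP_BITS);
        }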

    The other great bit there is working out how to limit the size of the lookup table independently from limiting the CDF sizes. I hadn't thought about that before.

    I appeared to have some totally bizarre alternative in my old arith_static.c code, which I'd completely forgotten I even did this way:

    Code:
    // init: build rounded-up fixed-point reciprocals for every possible
    // (shifted) range value; entries are kept to 32 bits, which the
    // fix-up in the usage below compensates for
    for (i = 1; i < (1<<(32-TF_SHIFT)); i++)
        RC_recip3[i] = 1+((1LL<<31) * ((1LL<<(i_log2(i)+1)) - i)) / i;

    ...

    // subsequent usage: divide code by range with a multiply, an
    // overflow-avoiding average (((code-t)>>1) + t, the Granlund/Montgomery
    // fix-up) and a final shift by log2(range)
    rc->range >>= TF_SHIFT;
    uint32_t t = (rc->code * (uint64_t)RC_recip3[rc->range]) >> 31;
    return (((rc->code - t)>>1) + t) >> i_log2(rc->range);
    "i_log2" here is a bsr assembly instruction, so it's fast and less work than another lookup table. However, it's still more work than the fixed-size shifts used in Charles' code; I'm not sure by how much, though.

  6. #5
    Member
    Join Date
    Nov 2011
    Location
    france
    Posts
    75
    Thanks
    8
    Thanked 40 Times in 29 Posts
    Quote Originally Posted by JamesB View Post
    Some interesting bits in here. The reciprocals are ((1<<N) + i - 1)/i rather than (1<<N)/i, so they round up rather than down.
    Related: Alverson's paper -> https://pdfs.semanticscholar.org/499...820adc28ac.pdf

  7. #6
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    506
    Thanks
    187
    Thanked 177 Times in 120 Posts
    The Montgomery paper is perhaps the more usual reference: https://gmplib.org/~tege/divcnst-pldi94.pdf

    Or a more recent variation: https://gmplib.org/~tege/division-paper.pdf

    That latter one is somewhat obscure; it relies on efficient branch prediction to speed up the algorithm. It may be faster in the scalar case, but many divisions done via SIMD may not gain as much.

    Edit: libdivide (https://libdivide.com/) and fastdiv (https://github.com/jmtilli/fastdiv) give various implementations of some of these algorithms.

  8. Thanks:

    Bulat Ziganshin (22nd October 2018)

  9. #7
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 114 Times in 36 Posts
    Charles's write-ups on the construction:

    "A Multi-Symbol Division Free Arithmetic Coder with Low Coding Loss using Reciprocal Multiplication" http://cbloomrants.blogspot.com/2018...sion-free.html
    "About arithmetic coders and recip_arith in particular" http://cbloomrants.blogspot.com/2018...eciparith.html
    "Arithmetic coder division-free maps" http://cbloomrants.blogspot.com/2018...free-maps.html
    "Working through recip_arith" http://cbloomrants.blogspot.com/2018...eciparith.html
    "recip_arith without unused range" http://cbloomrants.blogspot.com/2018...sed-range.html

    For reciprocal approximations, you want to use directed roundings (so, either floor or ceil) so the error curve is one-sided. Then you pick your precision to make the error small enough to ensure you never round to the wrong value. Using either floor reciprocals and a ceil multiply-by-reciprocal, or ceil reciprocals and a floor multiply-by-reciprocal, works. The latter is preferred because a floor multiply is usually easier/cheaper to get than a ceil multiply. (The static modes in ryg_rans used the same trick!)
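    A tiny brute-force check of that one-sided property (a sketch, not taken from ryg_rans or recip_arith):

    Code:
        #include <assert.h>
        #include <stdint.h>

        int main(void) {
            const int N = 16;
            for (uint32_t d = 1; d < 256; d++) {
                uint32_t r = ((1u << N) + d - 1) / d;   /* ceil(2^N / d) */
                for (uint32_t x = 0; x < (1u << 15); x++) {
                    /* floor multiply by the ceil reciprocal ... */
                    uint32_t q = (uint32_t)(((uint64_t)x * r) >> N);
                    /* ... never lands below the true quotient */
                    assert(q >= x / d);
                }
            }
            return 0;
        }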

    I like Alverson's paper more than the Montgomery et al one, but that might just be me.

    -

    FWIW there's an alternative way to think about this that might be helpful to understanding the construction.

    Specifically, what this coder really does is round the range down (always rounding down here!) to the nearest smaller floating-point approximation, where we have 8 bits of mantissa (including the explicit most-significant 1 bit) and "enough" exponent bits. (The latter are not tracked explicitly).

    That is, rather than allowing arbitrary values for "range", after shrinking the range we round it down further to be one of the representable values. For the default 8 bits, and assuming we're using 32-bit uints for simplicity, the representable values for range are:

    Code:
      0xff << 24
      0xfe << 24
      ...
      0x80 << 24
      0xff << 23
      0xfe << 23
      ...
      0x80 << 23
      ...
    and so forth. (recip_arith does this shrink just before starting to decode/encode a new symbol rather than just after decoding/encoding the previous one, but since these occur in a loop, that's really the same thing.)

    The point being that we explicitly decide to throw away part of the range (the part we lose by rounding down to the nearest "permitted" value). That's where the coding loss comes from. But it also means that we only divide by a very limited set of numbers (really, only 128 of them up to shifts) and thus we can table the reciprocals for them in advance.

    I like this view of the coder because it shows a very clear connection to "regular" range coding: namely, except for periodically quantizing the range, this is a normal range coder. If you put the same range-rounding step into a regular range coder (using divides), you should get the exact same results.
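    A minimal sketch of that rounding step, assuming a 32-bit nonzero range, 8 kept bits, and GCC/Clang's __builtin_clz (the function name is mine):

    Code:
        #include <stdint.h>

        /* round range down to the nearest 0x80..0xff << k value from the
         * list above by clearing everything below the top 8 bits */
        static inline uint32_t round_range_down(uint32_t range) {
            int msb = 31 - __builtin_clz(range);  /* position of top set bit */
            int shift = msb - 7;                  /* bits below the kept 8 */
            if (shift <= 0) return range;         /* already representable */
            return (range >> shift) << shift;     /* round down */
        }

    The cleared low bits are exactly the discarded code space, and range >> shift (a value in [128, 256)) is all the reciprocal table needs to be indexed by.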

  10. Thanks:

    JamesB (23rd October 2018)

  11. #8
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    506
    Thanks
    187
    Thanked 177 Times in 120 Posts
    That's a very elegant way of tackling the problem. Thanks for the explanation.

    By limiting the number of possible range values, you keep the reciprocal lookup table small and hence cache-friendly. The cache misses are what would otherwise make it little better than a normal division. (I already tried!)

    This also opens up an interesting debate vs ANS (specifically rANS). A division-free arithmetic coder looks to remove the biggest hurdle to efficient implementations. It still does more work than rANS, as it has two state variables to update, but it brings the two much closer together.

  12. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    Btw, for reference, there's this: https://encode.su/threads/1153-Simpl...ll=1#post28539
    (mult-free with no compression loss)
    Maybe it's possible to combine the ideas?

  13. #10
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 114 Times in 36 Posts
    Yes, it's definitely possible; all you do is switch the range to being quantized to a log scale (at some precision) instead of semi-log.

    That should leave you with one multiply for a multi-symbol arithmetic coder. (But that's not worth worrying about; multiplies are not worth replacing with more table lookups.) It does put several dependent table lookups on the critical path, though, which is dubious.

    "No loss" is not a particularly interesting target, since it makes your tables way larger (and the integer holding the log-scale range wider) for a tiny benefit. If you want full precision, you're better off just doing the arithmetic.

  14. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    > it makes your tables way larger for a tiny benefit

    In fact, it's normally combined with FSM counters, so the large/precise table is only required for FSM init.

    Alternatively, it should be possible to apply the same precision-reduction trick, which would also make the tables smaller.

    But I wonder, is it possible with a non-binary alphabet?

  15. #12
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 114 Times in 36 Posts
    Yes, as I already said, I'm pretty sure all you have to do is switch from an (effectively) semi-log-scale range to one that's purely logarithmic, although unlike the binary version this doesn't leave you completely multiply-free.

    In something like recip_arith:
    - You get rid of "range" and replace it with a scaled fixed-point log2_range = FIX_SCALE*log2(MAX_RANGE / range) (conceptually), which counts the number of "finished" bits. That's somewhat the opposite of normal use, but this one starts out at 0 and grows as you encode values, which makes the shifting below easier, so that's what I'll use.
    - The tests and shifts in the "renorm" steps are trivial (log2 is monotonic, so the tests are easy, and the shifts turn into an add/sub on log2_range).
    - You need a table for the dequantized range values; the "r_norm" in recip_arith now comes from a table lookup, "r_norm = range_table[log2_range % FIX_SCALE] >> (log2_range / FIX_SCALE)". (FIX_SCALE is of course a power of two, so these are really an AND and a shift, respectively.)
    - You also have the corresponding tabled reciprocals, indexed the same way.
    - No clz on range; you're already storing the log2, so that part is just log2_range/FIX_SCALE (= a shift).
    - Updates to "code" still have to compute "cdf_low * r_norm"; this part doesn't go away, that's the leftover multiply. (It's unnecessary in binary coders, since you only ever multiply by 0 or prob, so you can "hardwire" that part.)
    - The "range" updates in encoder_put and decoder_remove are the crucial part. You do this in logarithmic space, so it turns into "log2_range -= floor(FIX_SCALE*(log2(freq) - cdf_bits))".

    That last one is really the only substantive change. "log2(freq) - cdf_bits" is nonpositive (it's -num_bits for that symbol); since we're counting negative fractional bits, the floor actually rounds the number of bits up; we then subtract this (negative) value from our log2_range, which grows it.

    The log2(freq) here is, like the rest, amenable to an implementation using either a full LUT or a somewhat smaller LUT + clz.
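    Since this was never implemented, the following is only a guess at what those pieces might look like; every name (FIX_BITS, log2_tab, shrink_range, ...) is made up for illustration, and it assumes GCC/Clang's __builtin_clz:

    Code:
        #include <math.h>
        #include <stdint.h>

        #define FIX_BITS  5                 /* fractional bits of the fixed-point log */
        #define FIX_SCALE (1 << FIX_BITS)

        /* fractional part: floor(FIX_SCALE * log2(m/128)) for m in [128,256) */
        static int32_t log2_tab[128];

        static void log2_tab_init(void) {
            for (int m = 128; m < 256; m++)
                log2_tab[m - 128] = (int32_t)floor(FIX_SCALE * log2(m / 128.0));
        }

        /* approximation of FIX_SCALE * log2(freq) via clz plus the small
         * table; truncating freq to its top 8 bits means it can only ever
         * underestimate, which is the safe direction here */
        static inline int32_t fix_log2(uint32_t freq) {
            int msb = 31 - __builtin_clz(freq);
            uint32_t top = (msb >= 7) ? (freq >> (msb - 7)) : (freq << (7 - msb));
            return msb * FIX_SCALE + log2_tab[top - 128];
        }

        /* the range update from the list: since fix_log2 rounds down, the
         * symbol's bit count is rounded up, so the shrunk range is always
         * <= the exact range for that symbol, as required */
        static inline uint32_t shrink_range(uint32_t log2_range,
                                            uint32_t freq, int cdf_bits) {
            return log2_range + (uint32_t)(cdf_bits * FIX_SCALE - fix_log2(freq));
        }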

    Okay, so what does this do? The original recip_arith rounds range down from the value it would have in a more precise coder. That means that at every step, there's a part of the code space [rounded_down(range), range) that goes unused.

    This one is a similar idea, but now we burn parts of the range in a different spot. Namely, at every step we space the codes apart using the full range value (well, its quantized version anyway), but when it's time to shrink the range after encoding (or decoding) a symbol, we only use part of the full range we could for that symbol. (The floor on the log2 in the freq computation means we shrink the range more than we strictly need to. This part is crucial: the shrunk range _needs_ to be <= the "exact" range value for that symbol, or this won't work.)

    In other words, whereas the regular (semi-log2) recip_arith puts all the wasted range at the end of the interval, this variant would have a sliver of wasted range for every symbol.

    Caveat: this is something we have discussed as a theoretical possibility but not actually implemented (it's interesting theoretically but doesn't actually solve any problem we have), so it's entirely possible that I'm missing something here, but I'm fairly certain it should work.

