with source code available on github :
https://github.com/thecbloom/recip_arith
Oh, I thought it was also multiplication-free like Moffat's/daala, but it is indeed only division-free.
Its "No Google, you can't patent this" license should be put especially on all machine learning papers ... https://news.ycombinator.com/item?id=17266951
OK that's nice timing. I was just doing the same thing.
Thank you Charles (& RAD Game Tools).
Some interesting bits in here. The reciprocals are ((1<<N) + i - 1)/i rather than (1<<N)/i, so it rounds up rather than down. I think this, coupled with division by only a small value, is sufficient to ensure the number of bits needed in the multiplier always fits within a uint32_t, avoiding the complexity of extra shifts and adds (as in Montgomery's solution). I like that.
The other great bit there is working out how to limit the size of the lookup table independently from limiting the CDF sizes. I hadn't thought about that before.
I apparently had some totally bizarre alternative in my old arith_static.c code, which I'd completely forgotten I even did this way:

Code:
// init
for (i = 1; i < (1 << (32 - TF_SHIFT)); i++)
    RC_recip3[i] = 1 + ((1LL << 31) * ((1LL << (i_log2(i) + 1)) - i)) / i;
...
// subsequent usage:
rc->range >>= TF_SHIFT;
uint32_t t = (rc->code * (uint64_t)RC_recip3[rc->range]) >> 31;
return (((rc->code - t) >> 1) + t) >> i_log2(rc->range);

"i_log2" here is a bsr assembly instruction, so fast and less work than another lookup table. However, it's still more work than the fixed-size shifts used in Charles' code. I'm not sure by how much, though.
Related: Alverson's paper > https://pdfs.semanticscholar.org/499...820adc28ac.pdf
The Montgomery paper is the more usual one perhaps: https://gmplib.org/~tege/divcnstpldi94.pdf
Or a more recent variation: https://gmplib.org/~tege/divisionpaper.pdf
That latter one is somewhat obscure; it relies on efficient branch prediction to speed up the algorithm. It may be faster in the single-division case, but many divisions at once via SIMD may not gain as much.
Edit: libdivide (https://libdivide.com/) and fastdiv (https://github.com/jmtilli/fastdiv) give various implementations of some of these algorithms.
Bulat Ziganshin (22nd October 2018)
Charles's writeups on the construction:
"A Multi-Symbol Division-Free Arithmetic Coder with Low Coding Loss using Reciprocal Multiplication" http://cbloomrants.blogspot.com/2018...sionfree.html
"About arithmetic coders and recip_arith in particular" http://cbloomrants.blogspot.com/2018...eciparith.html
"Arithmetic coder division-free maps" http://cbloomrants.blogspot.com/2018...freemaps.html
"Working through recip_arith" http://cbloomrants.blogspot.com/2018...eciparith.html
"recip_arith without unused range" http://cbloomrants.blogspot.com/2018...sedrange.html
For reciprocal approximations, you want to use directed roundings (so, either floor or ceil) so the error curve is one-sided. Then you pick your precision to make the error small enough to ensure you never round to the wrong value. Using either floor reciprocals and a ceil multiply-by-reciprocal, or ceil reciprocals and a floor multiply-by-reciprocal, works. The latter is preferred because a floor multiply is usually easier/cheaper to get than a ceil multiply. (The static modes in ryg_rans used the same trick!)
I like Alverson's paper more than the Montgomery et al one, but that might just be me.

FWIW there's an alternative way to think about this that might be helpful to understanding the construction.
Specifically, what this coder really does is round the range down (always rounding down here!) to the nearest smaller floating-point approximation, where we have 8 bits of mantissa (including the explicit most-significant 1 bit) and "enough" exponent bits. (The latter are not tracked explicitly.)
That is, rather than allowing arbitrary values for "range", after shrinking the range we round it down further to be one of the representable values. For the default 8 bits, and assuming we're using 32-bit uints for simplicity, the representable values for range are:

Code:
0xff << 24
0xfe << 24
...
0x80 << 24
0xff << 23
0xfe << 23
...
0x80 << 23
...

and so forth. (Recip_arith does this shrink just before starting to decode/encode a new symbol rather than just after decoding/encoding the previous one, but since these occur in a loop, that's really the same thing.)
The point being that we explicitly decide to throw away part of the range (the part we lose by rounding down to the nearest "permitted" value). That's where the coding loss comes from. But it also means that we only divide by a very limited set of numbers (really, only 128 of them up to shifts) and thus we can table the reciprocals for them in advance.
I like this view of the coder because it shows a very clear connection to "regular" range coding: namely, except for periodically quantizing the range, this is a normal range coder. If you put the same range-rounding step into a regular range coder (using divides), you should get the exact same results.
JamesB (23rd October 2018)
That's a very elegant way of tackling the problem. Thanks for the explanation.
By limiting the possible number of different ranges, you keep the reciprocal lookup table small and hence cache-friendly. The cache misses are what would otherwise make it little better than a normal division. (I already tried!)
This does also open up an interesting debate vs ANS (specifically rANS). A division-free arithmetic coder removes the biggest hurdle to efficient implementations. It's still more work than ANS, as it has two state variables to update, but it brings the two much closer together.
Btw, for reference, there's that  https://encode.su/threads/1153Simpl...ll=1#post28539
(multfree with no compression loss)
Maybe it's possible to combine the ideas?
Yes, it's definitely possible, all you do is switch the range to be quantized to a log scale (at some precision) instead of semi-log.
Should leave you with one multiply for a multi-symbol arithmetic coder. (But not worth worrying about; multiplies are not worth replacing with more table lookups.) It does put several dependent table lookups on the critical path, though, which is dubious.
"No loss" is not a particularly interesting target, since it makes your tables way larger (and the integer holding the log-scale range wider) for a tiny benefit. If you want full precision, you're better off just doing the arithmetic.
> it makes your tables way larger for a tiny benefit
In fact, it's normally combined with fsm counters, so a large/precise table is only required for fsm init.
Alternatively, it should be possible to apply the same precision reduction trick, it would also make the tables smaller.
But I wonder, is it possible with a non-binary alphabet?
Yes, like I already said, I'm pretty sure that all you have to do is switch from an (effectively) semi-log-scale range to one that's purely logarithmic; unlike the binary version, this doesn't leave you completely multiply-free.
In something like recip_arith:
- You get rid of "range" and replace it with a scaled fixed-point log2_range = FIX_SCALE*log2(MAX_RANGE / range) (conceptually), which counts the number of "finished" bits. Somewhat opposite to normal use, but this one starts out at 0 and grows as you encode values, which makes the shifting below easier, so that's what I'll use.
- The tests and shifts in the "renorm" steps are trivial (log2 is monotonic so the tests are easy, and the shifts turn into an add/sub on log2_range).
- You need a table for the dequantized range values; the "r_norm" in recip_arith now comes from a table lookup, "r_norm = range_table[log2_range % FIX_SCALE] >> (log2_range / FIX_SCALE)". (FIX_SCALE is of course a power of two, so these are really an AND and a shift, respectively.)
- You also have the corresponding tabled reciprocals, indexed the same way.
- No clz on range; you're already storing the log2, that part is just log2_range / FIX_SCALE (= a shift).
- Updates to "code" still have to compute "cdf_low * r_norm"; this part doesn't go away, that's the leftover multiply. (Unnecessary in binary coders, since you only ever multiply by 0 or prob, so you can "hardwire" that part.)
- The "range" updates in encoder_put and decoder_remove are the crucial part. You do this in logarithmic space, so it turns into "log2_range -= floor(FIX_SCALE*(log2(freq) - cdf_bits))".
That last one is really the only substantive change. "log2(freq) - cdf_bits" is non-positive (it's -num_bits for that symbol); since we're counting negative fractional bits, the floor actually rounds the number of bits up; then we subtract this (negative) value from our log2_range, which grows it.
The log2(freq) here is, like the rest, amenable to an implementation using either a full LUT or a somewhat smaller LUT + clz.
Okay, so what does this do? The original recip_arith rounds down range from the value it would have with a more precise coder. That means that at every step, there's a part of the code space [rounded_down(range), range) that is unused.
This one is a similar idea, but now we burn parts of the range in a different spot. Namely, at every step, we space the codes apart using the full range value (well, its quantized version anyway), but then when it's time to shrink the range after encoding (or decoding) a symbol, we only use part of the full range we could for that symbol. (The floor on the log2 for the freq computation means we shrink the range more than we should. This part is crucial - the shrunk range _needs_ to be <= the "exact" range value for that symbol, or this won't work.)
In other words, whereas the regular (semi-log2) recip_arith puts all the wasted range at the end of the interval, this variant would have a sliver of wasted range for every symbol.
Caveat: this is something we have discussed as a theoretical possibility but not actually implemented (it's interesting theoretically but doesn't actually solve any problem we have), so it's entirely possible that I'm missing something here, but I'm fairly certain it should work.