# Thread: Is this how to practically implement Arithmetic Coding?

1. ## Is this how to practically implement Arithmetic Coding?

I have read multiple papers/articles explaining how to practically implement Arithmetic Coding. They say the same things and is really mind-bending. But here is how far I got. See the wall of text below. Can someone please explain in as few words as I did and explain clearly what I got wrong?

Two reference links. If need, see page 8-10 https://www.cc.gatech.edu/~jarek/courses/7491/Arithmetic2.pdf
Another detailed one: https://www.drdobbs.com/cpp/data-com...0169251?pgno=2

"All we do is wait as the float precision elongates and then when we see the first bit at the start (in the 3 here > 0.[3]172537584) won't change anymore again, we chop it off and store the bit. Until we know when to chop, we Follow it, and may chop off 2 or more bits simultaneously. The end of the 0.2646264 gets a (or more) bit added to it, we chop one and add one. - We shift the whole float thing to the left. We look to see if low has increased above 0x7FFFFFFF, and if high has dropped below 0x80000000. If was high, we add a 1 bit to the end, if low, 0 to the end. What we look for (to know when to chop off the 1st bit), is, think about it, we start with 0.0-1.0 with 256 ranges ex. |......|...|..|.|.|, and you have to look at these 256 ranges as a whole and see it split as 4 units (leading bits) > \$|.....S.|..\$.|..|S.|.|\$ where halfway means 1st bit won't change, and the 4ths mean the 2nd bit won't change, and we jump in one range, eventually the decimal gets too long, but before that occurs we have chopped off a front-end bit. The bits we chop off, are the encoded file, we don't directly double each range left/rightwards nor convert to binary from decimal. Each bit we chop is part of ex. 0.23634634 (look at this as a number, not 8-bit chars 0 . 2 3 6....), so after ex. 3 bits the number is shrunken until disappears."

It would appear, all I need do is know when to chop off a bit (shrinks the 'number describing the whole file'). We can see here the 0.2 gets locked in, none of the 3 ranges has a 0.3 etc anymore. https://royvanrijn.com/blog/2010/02/...by-prediction/
So we store the bit, which is 0 if low! 1 if high. And at the end we add a 1 if....why would we do that? The algorithm already makes the float grow more precise. Say we end up with 0.264755546346757543467876878, of '1' (we are at 0.2... of 0.0-1.0) we have 256 ranges and extend the code like the link above does.

2. I think I got it, and wow is it simple if so:

So that's it, the float we build gets longer 0.2463262564, and the first number gets locked in when we see all ranges start with it. We then take our float (0.2463262564, which is ex. in binary 10010010110100100110011101100100) and remove and store the front 1 and get 0.315778916. Then we continue our little thing splitting it into 256 ranges and elongate it.

3. .....However this variant seems easier:

So that's it, the float we build gets longer 0.2463262564, and the first number gets locked in when we see all ranges start with it. We then remove and store the 1st or more locked digits > 0.[2]463262564 and get 0.463262564. Then we continue our little thing splitting it into 256 ranges and elongate it.

4. What we actually have is a high and low, 2 floats. Ex. 0.2025 to 0.2826. If we look inside, say we see locked digits (see the 6 below, taken from https://royvanrijn.com/blog/2010/02/...by-prediction/). This allows us to take these two floats 0.2025 to 0.2826 and make them into 0.025 to 0.826.

• A: 0.202500 to 0.238545
• B: 0.238545 to 0.270585
• C: 0.270585 to 0.282600

• A: 0.02500 to 0.38545
• B: 0.38545 to 0.70585
• C: 0.70585 to 0.82600

5. It worked. it outputs 5 bits per character instead of 8. But it's only order-0. It looks only look at blank/nothing and know each letter probability (occurrence count). So wiki8 100MB would come to 62.5MB. But it's a start.

6. I saw fpaq2 is an order-0 coder and makes wiki8 (100MB) into 25MB. I implemented order-0 and would get 62MB. How can solo letter counts / %s alone do so much compression? What else does fpaq2 do....

7. fpaq2 is not order0.

As to arithmetic coding, you can also see these threads:

Paq is not really a good source for entropy coders: https://encode.su/threads/2515-mod_p...ll=1#post51672

8. But at the top of the attached code it says:
"// fpaq2 - Stationary order 0 file compressor."

And Matt said so too in an email.

So is it order-2?

9. Matt said this: http://mattmahoney.net/dc/text.html#2212
fpaq2 (Oct. 21, 2006) uses a combination context mixing and PPM algorithm.
I think its order16 or something (128/8=16).
fpaq0 is order0.

10. Interesting to see predicting 1 bit at a time doesn't have float precision issue.

As for byte prediction, I see wikipedia says my idea of removing starting digits once are common, and rounding the end which has little effect.

11. I tend to feel predicting 1 bit doesn't reap as much benefit. But even if it does, how do you go through each possible letter? Is it the start of 50% of the 256 possible chars it's making? Like order-0, it just knows the 0/1 frequency and, that's really bad because 50/50 is 1 & 0. Right? So order-1 or 3 only makes 1bit prediction look good.....unless its as long, maybe its as good.....let me give a viz seeing that:

context | prediction

the g
and the 1

Here we predict less but have longer context, same help, no loss of accuracy.

Is it so? The prediction either way is narrowed down lotsssss and needs less arithmetic coding correction tuning (we get more larger ranges/slices yay).

However say we have a % now saying 1 is likely by x amount, it's still similar to 0's % slice. Ex. "and the girl/boy/cat" may all start with 0 and another 4 observations may with 1. I suppose the context will make the 50 50 ratio less even but, is that what makes bit prediction work? It's like 0.7 bits to store 1 bit lol.

12. > doesn't have float precision issue

Real arithmetic coding doesn't use float-point types normally.
Or rather it emulates its own float-point type by writing down the low-bound's bytes that can't be changed anymore.

> I tend to feel predicting 1 bit doesn't reap as much benefit

13. Originally Posted by Shelwien
Real arithmetic coding doesn't use float-point types normally.
Isn't the difference between arithmetic and range coding that the former works on floats and the latter on integer ranges?

14. No, the difference is that rangecoding outputs whole bytes at once, and AC outputs bits.
This is the original source of the name btw: http://www.compressconsult.com/rangecoder/
Basically rangecoder is a speed-optimized version of algorithm with lower precision.

I do have some rangecoder implementations which use hardware floats, but its tricky -
compilers always try to optimize float-point math as randomly as possible,
so calculation results would be frequently different on different platforms etc.
So real float-point is rarely used in compression.

15. ## Thanks:

Jarek (10th January 2020)

16. Rangecoding VS Arithmetic Coding, are you saying I can use integers that are 500 ints long ex. 54757547457488878586865666575686868624152521632523 2385685 ?? I hate the float limit of 16. However sadly, in rangecoding on wikipedia it talks about popping out bytes, it sounds like they have a overflow problem going on. How do you do the math if you don't get %s? Only add/subtract? Such math is faster that division, that's why ints are used and long no?

17. For AC you basically only need long addition for low-bound (carryless implementations don't need even that).
Range width can only become smaller, so normal fixed-size float is enough for it.

18. Originally Posted by Shelwien
fpaq2 is not order0.

Paq is not really a good source for entropy coders: https://encode.su/threads/2515-mod_p...ll=1#post51672
Finaly tested this and it works on 32bit compile. Put it in pxv. On pxd 64bit compile it fails on compression, on 1M Nulls and file is larger. Doing something wrong probably.

19. Does anyone know why my order-1 Arithmetic Encoder (which achieves ~50% compression) is 3 times as slow if fed double the data? It would take me 90,000 hours to run on 100MB. It seems my dictionary is an issue and/or the range picking (deciding which of up to 65,000 ranges to jump into). What do I do?

20. Here's a simple example of an order1 bitwise coder.

21. Originally Posted by Shelwien
No, the difference is that rangecoding outputs whole bytes at once, and AC outputs bits.
This is the original source of the name btw: http://www.compressconsult.com/rangecoder/
Basically rangecoder is a speed-optimized version of algorithm with lower precision.
I think that's a distinction which has become common in recent years but didn't necessarily start off that way. It's an easy way to use the terms now though.

I think range coding was invented by Graham Martin at the University of Warwick (UK) and was an accidental reinvention of arithmetic coding. Arithmetic coding now is done bit-wise, but I don't know if it was that way in 1979 when range coding was invented (given it's only 3 years earlier for arithmetic coding).

The tricky bit with range coders is the divisions. You can maybe separate the two needed for decoding slightly with other instructions inbetween to permit a bit of pipelining, but you'll still be crippled by the speed of div unless you use various lookup tables instead and approximation techniques. I guess this is where the binary one may win out if you can get the 8 bits done one at a time faster than the 1 byte. To my shame I've never really worked much with binary ACs.

22. Here is another discussion: https://en.wikipedia.org/wiki/Talk%3...metic_encoding

"Range encoding, in both theory and practice, involves ranges of integers. Arithmetic coding in theory works with rational numbers between 0 and 1, but since practical considerations cause the precision to be limited to typically 32 bits, it's as if they were using 32-bit ranges of integers."

"With arithmetic coding, the actual arithmetic code, which is a sequence of (coding) symbols (not the sequence of coded symbols), is taken as representing a non-negative rational number in the range [0,1)."

Also https://en.wikipedia.org/wiki/Arithmetic_coding "works" on [0,1).

23. Shelwien, that's a lot of C++ to look over...

Should my dictionary of 2 chars be a tree for fast searching to grab its probabilities? Or just sorted: he 40%, an 10%, .... zz 0.003%

And when I jump into a range, say I have 20,000 slices, how to I pick the one so fast? A table? What?

24. ## Order 1 range coder

Here's my order-1 range coder, based on some of Eugene's range coder work along with my own order-1 frequency modelling and a load of other random bits bundled in (I couldn't be bothered to cut it down, sorry).

Compile with "cc -O3 *.c -o arith" (Yes I know - I call it arithmetic even though it's byte based!)

Run with "-o1" argument. Eg:

Code:
```\$ ./arith -o1 ~/scratch/data/enwik8 enwik8.o1
Took 3018087 microseconds,  33.1 MB/s

\$ /arith -d enwik8.o1 enwik8.dec
Took 3610196 microseconds,  27.7 MB/s

\$ ls -l enwik*
-rw-r--r-- 1 jkb team117 100000000 Jan 13 18:25 enwik8.dec
-rw-r--r-- 1 jkb team117  46884282 Jan 13 18:25 enwik8.o1```
You can do -o0 for order-0. Also o64/o65 for order-0/1 with RLE, o128/129 for bit-packing (useful of limited alphabets with 16 or fewer symbols) and 192/193 for both together. -o65 works OK on enwik8.

25. Oh I get it:

Take the range you are in currently ex. 0.3475-0.3523 (and first remove locked digits, so:) 0.475-0.523)

Then we take the range we want to jump in from dictionary ex. 0.79-0.86

Now we find the number inside 0.475-0.523 which will be 48. We know that in 0.79-0.86 the 0.79 means up to 79% of 48, :-)

So now we get ex. 37.57-41.29 (from inside 48)8)8)8 ) and tells us the range to jump into in 0.475-0.523

​????

26. > Arithmetic coding now is done bit-wise, but I don't know if it was that way
> in 1979 when range coding was invented (given it's only 3 years earlier for
> arithmetic coding).

This distinction is mainly patent-related.
There was a long "ban" on AC use because of IBM's patents.
Then M.Schindler posted a patent-free open-source implementation
(and yes, attributed to G.Martin).

Also coding with bytewise range renormalization only makes sense on platforms
with register size >=24 bits.
So all ACs in 197x-8x were bitwise 16-bit.

> The tricky bit with range coders is the divisions.

1) They're not necessary when we're working with probabilities rather than frequencies.
The coder doesn't even have to be binary.
2) There's this: https://github.com/thecbloom/recip_arith
There're also log-scale implementations which don't even need multiplications.
3) Division is not that costly anymore, especially comparing to branches and RAM.

27. Binary trees below, viz:

1bit prediction

>
....>1
>

........>R

>
....>0
>

2bit prediction

>00
....>R
>01

........>R

>10
....>R
>11

The larger the prediction, the less observations. For 1bit, each has half the observations. For 2bits, each has 1/4th the observations. For 8bits, 1/256th. So what advantage does byte prediction have? It's a x255 loss and only 8 bit gain. It's 8 times more yield. But 8 times less observations? Ah, not 256, but 8. So it's just as good as 1bit prediction or better. Why is it 8 times or 5 times less observations even though I said 255? I'm looking into it. I think I got something here.

28. Me and my partner came up to a small issue.
When we mix context models.
Let's say we follow the common method to mix context models.
50%, 25%, 25%
+
33%, 33%, 33%
=
83%, 58%, 58%
/2=
41.5, 29%, 29%
Great.
But let's now look at our full range with the slices for Arithmetic Encoding. We need to jump/nest into one of these slices.
[........................................][.............................][.............................]
The size of the slices are respective.
Now say we want to jump in the middle slice. But we have 20,000 slices.
Yes, we know the % (the size of the slice) we want to jump into. See how big the middle one is? Yeah. Ok. But where? Is it 0.4-0.7.....or 0.5-0.8.....or 0.6-0.9.....or 0.7-1.0? What position is it at? All we have is a % of '29', we don't have the low-high of the % after mixing models.
An obvious but extremely slow solution: Going through all 20,000 slices subtracting each %.
​What do we do?

29. > An obvious but extremely slow solution: Going through all 20,000 slices subtracting each %.

Its also actually an option - with vector instructions it can be reasonably fast (compared to paq8).
At least when written in C/C++ and with a good recent compiler.

Normally you'd implement this coding hierarchically, with some kind of a tree (usually binary).

30. Is that what everyone does though? Any other options?

31. Large alphabets are used rather rarely - after some point (not much higher than 256) it makes compression worse
and is only useful as speed optimization with a static probability distribution.
Even there using a simple table for the distribution becomes a bad idea after 16k or so symbols (won't fit in L1 cache) -
after that it would be also faster to use hierarchical coding.

https://en.wikipedia.org/wiki/Binary_search_algorithm

> Any other options?

For static model its possible to precalculate a lookup table, then use it to generate/decode a large part of symbol's code.
For adaptive model there're no other options.

Page 1 of 7 123 ... Last