1. ## Unrolling arithmetic coding.

I've been exploring the idea of optimising byte-oriented arithmetic coding for a specific task. My data is largely stationary, although with correlations. This means I don't particularly need an adaptive model, so instead I can code blocks using static probabilities. I'm interested in anyone else's explorations into optimising this sort of code. What I have right now doesn't match things like FSE of course, but it's not THAT far off (within half the speed, for an O(1) entropy encoder).

I started with one of Eugene Shelwien's range coders from the coders6c2.zip archive. I note the main encoder and decoder functions contain:

Code:
```void Encode (uint cumFreq, uint freq, uint totFreq)
{
    low   += cumFreq * (range /= totFreq);
    range *= freq;
    ...
}

uint GetFreq (uint totFreq) {
    return code / (range /= totFreq);
}```
In here we have a number of divisions, which drastically slow things down. However, for a static coder we can cheat by normalising our input frequencies so that they sum to a totFreq value that's a power of 2, i.e. the code becomes:

Code:
```void Encode (uint cumFreq, uint freq)
{
    low   += cumFreq * (range >>= TOTFREQ_BITS);
    range *= freq;
    ...
}

uint GetFreq (void) {
    return code / (range >>= TOTFREQ_BITS);
}```
This gives a remarkable improvement in encoding speed, while decoding still suffers a bit due to the remaining division. I explored a few methods of replacing it with multiplications and bit shifts, but found that by themselves they all have minimal benefit, and sometimes harm things.
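For anyone wanting to try this, here is a minimal sketch of the normalisation step (identifiers are my own, not from the posted source): it rescales raw symbol counts so they sum to exactly 1 << TOTFREQ_BITS, which is what lets totFreq become a shift.

```c
#include <stdint.h>

#define TOTFREQ_BITS 14
#define TOTFREQ (1u << TOTFREQ_BITS)

/* Rescale raw symbol counts so they sum to exactly TOTFREQ.
 * Assumes at least one count is non-zero.  Any symbol that was
 * seen keeps freq >= 1 so it remains codable. */
void normalise_freqs(const uint32_t cnt[256], uint32_t F[256])
{
    uint64_t total = 0;
    uint32_t sum = 0;
    int i, max_i = 0;

    for (i = 0; i < 256; i++)
        total += cnt[i];

    for (i = 0; i < 256; i++) {
        F[i] = (uint32_t)(((uint64_t)cnt[i] * TOTFREQ) / total);
        if (cnt[i] && F[i] == 0)
            F[i] = 1;               /* never squash a live symbol to 0 */
        sum += F[i];
        if (F[i] > F[max_i])
            max_i = i;
    }

    /* Dump the rounding error on the most frequent symbol (relies on
     * unsigned wrap-around if sum overshot TOTFREQ slightly). */
    F[max_i] += TOTFREQ - sum;
}
```

Dumping the residual on the most frequent symbol is just the simplest choice; a smarter redistribution would lose slightly less compression.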

However, I then explored manual unrolling using multiple concurrent range coders, and this is where the real win comes from. For a basic order-0 codec with precomputed frequencies, my encoding and decoding loops started like this. (Note this is a C variant instead of the C++ one above.)

Code:
```    // Encoder tight loop
    RC_output(&rc, (char *)cp);
    RC_StartEncode(&rc);
    for (i = 0; i < in_size; i++) {
        unsigned char c = in[i];
        RC_Encode(&rc, C[c], F[c]);
    }
    RC_FinishEncode(&rc);

    // ...

    // Decoder tight loop
    RC_input(&rc, (char *)cp);
    RC_StartDecode(&rc);
    for (i = 0; i < out_sz; i++) {
        uint32_t freq = RC_GetFreq(&rc);
        c = D.R[freq];
        RC_Decode(&rc, D.fc[c].C, D.fc[c].F);
        out_buf[i] = c;
    }
    RC_FinishDecode(&rc);```
With some manual loop unrolling we can get this:

Code:
```    // Tight encoding loop
    i_end = in_size & ~3;
    for (i = 0; i < i_end; i += 4) {
        unsigned char c[4];
        c[0] = in[i+0];
        c[1] = in[i+1];
        c[2] = in[i+2];
        c[3] = in[i+3];

        RC_Encode(&rc[0], C[c[0]], F[c[0]]);
        RC_Encode(&rc[1], C[c[1]], F[c[1]]);
        RC_Encode(&rc[2], C[c[2]], F[c[2]]);
        RC_Encode(&rc[3], C[c[3]], F[c[3]]);
    }

    // Tidy up any remainder that isn't a multiple of 4
    for (; i < in_size; i++) {
        unsigned char c = in[i];
        RC_Encode(&rc[0], C[c], F[c]);
    }```

Code:
```    // Tight decoding loop, unrolled 4-ways
    i_end = out_sz & ~3;
    for (i = 0; i < i_end; i += 4) {
        uint32_t freq[4];
        unsigned char c[4];

        freq[0] = RC_GetFreq(&rc[0]);
        freq[1] = RC_GetFreq(&rc[1]);
        freq[2] = RC_GetFreq(&rc[2]);
        freq[3] = RC_GetFreq(&rc[3]);

        c[0] = D.R[freq[0]];
        c[1] = D.R[freq[1]];
        c[2] = D.R[freq[2]];
        c[3] = D.R[freq[3]];

        RC_Decode(&rc[0], D.fc[c[0]].C, D.fc[c[0]].F);
        RC_Decode(&rc[1], D.fc[c[1]].C, D.fc[c[1]].F);
        RC_Decode(&rc[2], D.fc[c[2]].C, D.fc[c[2]].F);
        RC_Decode(&rc[3], D.fc[c[3]].C, D.fc[c[3]].F);

        out_buf[i+0] = c[0];
        out_buf[i+1] = c[1];
        out_buf[i+2] = c[2];
        out_buf[i+3] = c[3];
    }

    // Tidy up any remainder that isn't a multiple of 4
    for (; i < out_sz; i++) {
        uint32_t freq = RC_GetFreq(&rc[0]);
        unsigned char c = D.R[freq];
        RC_Decode(&rc[0], D.fc[c].C, D.fc[c].F);
        out_buf[i] = c;
    }```

Calling RC_GetFreq 4 times on interleaved range coders, each with non-overlapping inputs and outputs, means that the expensive divisions can be pipelined. E.g. see Agner Fog's instruction tables at http://www.agner.org/optimize/, which indicate that on Sandy Bridge a 32-bit IDIV can issue roughly every 9 cycles but has a 20-27 cycle latency before the result is available. During that latency it appears we can be computing other results, including other divisions.

The net difference is that the code went from 5.7s encode and 16.4s decode on a 1GB file to 6.8s encode (slower) and 9.4s decode (much faster) on the same file. That's breaking 100MB/s.

Order-1 modelling is a bit trickier, as we need the last byte output in order to select the frequencies for the next decode cycle. However, if we position 8 codecs (I found 8 worked better than 4 for O1) at 1/8th of the buffer apart, then they can each use their own last byte value fine, with the small exception that on startup we need 8 fresh "last" values. (I could store 7 of these in the file, but for now I just set all 8 to zero.) The effect of order-1 unrolling is equally profound, giving around 100% throughput gains in decoding.
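To illustrate the layout (a sketch with made-up names, not the actual arith_static.c code): each of the 8 interleaved coders owns a contiguous 1/8th stripe of the output and carries its own order-1 "last byte" context, seeded to zero as described:

```c
#include <stdint.h>
#include <stddef.h>

#define NWAY 8

/* Lay out NWAY interleaved order-1 decoders across one output buffer:
 * decoder i produces bytes [start[i], start[i+1]) and keeps its own
 * "last byte" context, seeded to 0 at startup. */
void stripe_layout(size_t out_sz, size_t start[NWAY + 1], uint8_t last[NWAY])
{
    size_t stripe = out_sz / NWAY;
    int i;
    for (i = 0; i < NWAY; i++) {
        start[i] = (size_t)i * stripe;
        last[i]  = 0;            /* fresh order-1 context per stripe */
    }
    start[NWAY] = out_sz;        /* final stripe absorbs the remainder */
}
```

In the main loop all 8 decoders then step one byte per iteration within their own stripe, so their 8 divisions overlap in the pipeline.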

I'm assuming that this is something that people have previously observed, but I've never seen any public code exploiting this. The gains are really quite remarkable given I haven't got any (explicit) SIMD/SSE calls in there. The code is unapologetically messy right now as it's very much a work in progress, but if anyone is interested I uploaded a copy to our ftp server at
ftp://ftp.sanger.ac.uk/pub/users/jkb/arith_static.c

To compare with the non-unrolled version, comment out the #define UNROLLED at the start. Note the two variants aren't binary compatible with each other.

2. ## Thanks (4):

Bulat Ziganshin (28th January 2014),cade (28th January 2014),Cyan (28th January 2014),Matt Mahoney (29th January 2014)

3. I just noticed that version has the division in RC_GetFreq commented out and replaced by precomputed multipliers, but it's a marginal change and can be put back. I can't be bothered to try to put a new one there given it takes 15 mins for our ftp site to refresh (gah, no forced push method!).

Anyway this is very much a work in progress, I just thought someone here may be interested in the methods used and I'm sure others can teach me more.

4. Have you looked at the asm? Has the compiler vectorised it, or is it pure instruction-level parallelism?
Eugene studied vectorised encoders a long time ago and the results were good; it's somewhere on this forum.

5. I ran into the same problem as you when experimenting with a block-based arithmetic coder (see Range0 - http://fastcompression.blogspot.fr/p...py-coders.html)
Pipelining divisions is a really good idea to hide the operation's latency.

It reminds me of some notes from Carmack, using the same methodology within Quake1.

6. Do you know why encoding is significantly slower after unrolling? There doesn't seem to be any reason for it. With SSE2, you would probably see worse results caused by the overhead of loading unaligned bytes.

7. Originally Posted by m^2
Have you looked at the asm? Has the compiler vectorised it, or is it pure instruction-level parallelism?
I seriously doubt that modern compilers can vectorise such code; IMHO they usually manage it only for single operations such as a[i]=b[i]+c[i]. Here we have 1) a division and 2) the normalisation loop, so it's definitely beyond SSE's abilities. I wonder how Eugene managed it - probably he vectorised only part of the computations.

Do you know why encoding is significantly slower after unrolling? There doesn't seem to be any reason for it.
The compiler needs to manage more variables, so we get more loads/stores. It might be fixed by encoding into 4 separate buffers sequentially.

Do you know why encoding is significantly slower after unrolling? There doesn't seem to be any reason for it. With SSE2, you would probably see worse results caused by the overhead of loading unaligned bytes.
I haven't had time to investigate yet; this is all a bit hot off the press.

I can probably check for cache misses etc. using "perf stat" on Linux. (A wonderful tool!)

10. Cache misses have nothing in common with the misaligned-load penalty, and for such a linear algorithm they should be near 0. AFAIK, unaligned SSE loads are free or almost free on modern Intel CPUs.

11. ## Thanks:

12. I don't think there should be unaligned accesses. The O(1) unrolling positions the loop starts on a boundary of 8 input bytes, so that seems reasonable. The O(0) codec just does bytes 0,1,2,3 then 4,5,6,7, etc., so it is also aligned.

I'm not sure if it is actually using SSE without explicitly looking at the generated assembly. The primary speed-up comes from hiding the latency of the division, and from the removal of a further division in both encoder and decoder by fixing totFreq. Those divisions really stall the pipeline.

However, that still doesn't explain the slow-down in the encoder when unrolling. It's probably simply the additional memcpy required to paste the various output buffers together. I could have a hybrid affair where they share a common output buffer. It would create some small interdependency between the codecs, but it's not doing this via multi-threading so that's not particularly bad.

13. If "char" is 8 bits, this could be slow.
Remember that, typically, on "big" Intel CPUs (Core, back to the PPro) loads fetch a 64-byte line.
Therefore if you load 1 byte, 64-byte aligned, you get the remaining 63 "gratis" (at the same latency).
And there can be micro-op coalescing (or not).
So you really need the disassembly to understand exactly what happens.
Code:
```for (i = 0; i < i_end; i += 4) {
    unsigned char c[4];
    c[0] = in[i+0];
    c[1] = in[i+1];
    c[2] = in[i+2];
    c[3] = in[i+3];```

14. All these versions require 2 multiplications per step - you can reduce that to 1 using the range version of ANS instead: see the 8th page of http://arxiv.org/pdf/1311.2540

For natural f[s] (the quantisation of probabilities):
freq[s] = f[s] / 2^n
cumul[s] = sum_{i<s} f[i]
symbol[x], for x = 0 ... 2^n - 1, is the s with cumul[s] <= x < cumul[s+1] - i.e. which symbol sits in the x-th position of a length-2^n range (0..01..1...) containing f[s] appearances of symbol s.

Encoding symbol s requires 1 division and a modulo by f[s]:
C(s,x) = (floor(x / f[s]) << n) + mod(x, f[s]) + cumul[s]

Decoding from x has one multiplication and needs a symbol[] table, or a function deciding which subrange we are in (with mask = 2^n - 1):
s = symbol[x & mask]
x = f[s] * (x >> n) + (x & mask) - cumul[s]
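To make the algebra concrete, here is a toy rANS coder in C following the formulas above (my own sketch: a fixed 2-symbol alphabet, n = 4, and no renormalisation, so the state simply grows in a 64-bit integer and only short messages fit). Note decoding pops symbols in reverse order of encoding:

```c
#include <stdint.h>

#define N_BITS 4                        /* total frequency is 2^4 = 16 */
#define MASK   ((1u << N_BITS) - 1)

/* Toy 2-symbol alphabet with f[0] + f[1] == 1 << N_BITS */
static const uint32_t f[2]     = { 12, 4 };
static const uint32_t cumul[2] = { 0, 12 };

/* symbol[x & MASK]: which subrange the bottom n bits of x fall in */
static const uint8_t symbol[16] = { 0,0,0,0,0,0,0,0,0,0,0,0, 1,1,1,1 };

/* Encode: C(s,x) = (floor(x/f[s]) << n) + mod(x,f[s]) + cumul[s] */
static uint64_t rans_encode(uint64_t x, int s)
{
    return ((x / f[s]) << N_BITS) + (x % f[s]) + cumul[s];
}

/* Decode: s = symbol[x & mask]; x = f[s]*(x >> n) + (x & mask) - cumul[s] */
static uint64_t rans_decode(uint64_t x, int *s)
{
    *s = symbol[x & MASK];
    return f[*s] * (x >> N_BITS) + (x & MASK) - cumul[*s];
}
```

One multiplication in the decoder, as claimed; the encoder carries the division/modulo by f[s].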

15. ## Thanks:

JamesB (29th January 2014)

16. Originally Posted by fcorbelli
If "char" is 8 bits, this could be slow.
Remember that, typically, on "big" Intel CPUs (Core, back to the PPro) loads fetch a 64-byte line.
Therefore if you load 1 byte, 64-byte aligned, you get the remaining 63 "gratis" (at the same latency).
That's true for loading from memory into the LLC (last-level cache), but this code works with data in the L1 cache, so load delays are around 4 cycles.

17. Originally Posted by m^2
Have you looked at the asm? Has the compiler vectorised it, or is it pure instruction-level parallelism?
Eugene studied vectorised encoders a long time ago and the results were good; it's somewhere on this forum.

It sounds like it is bit-wise, with the vectorisation being done to update 8 bits at a time. Certainly an interesting method. Oddly it crashes on exit (after doing all the encoding/decoding and flushing the results), and it also takes around a minute each way, so it's substantially slower. It does, however, do a reasonable job at compression, and it's obviously an adaptive model. My own hacked-up coder isn't so hot for general data, as the cost of storing all the symbol frequencies is quite high unless I boost the block size further. (However, it was designed for a specific purpose.)

18. There are reasons why the coders in modern applications like HEVC are all *binary* coders and not multi-symbol arithmetic coders. A division can really hurt performance a lot and, even worse, is hard to implement in hardware. Most of the modern approaches are entirely table-driven, where the states in the table reflect both the partitioning of the interval and the probability estimate. In practice, nobody uses an arithmetic coder with multiplication and division anymore. It's just too slow.

19. Regarding the range version of ANS (rANS) I mentioned - using a single multiplication instead of two for range coding - there is a nice explanation and comparison on Charles Bloom's blog ( http://cbloomrants.blogspot.com/ ), saying that the implementation by Fabian 'ryg' Giesen (I couldn't find details) decodes at 140-167 MB/s, about half the speed of FSE.
Unfortunately the encoding speed is also about twice as slow, probably because of the required division.

update: Here is Fabian's implementation: https://github.com/rygorous/ryg_rans

20. ## Thanks:

JamesB (3rd February 2014)

21. Originally Posted by thorfdbg
There are reasons why the coders in modern applications like HEVC are all *binary* coders and not multi-symbol arithmetic coders. A division can really hurt performance a lot and, even worse, is hard to implement in hardware. Most of the modern approaches are entirely table-driven, where the states in the table reflect both the partitioning of the interval and the probability estimate. In practice, nobody uses an arithmetic coder with multiplication and division anymore. It's just too slow.
Do you have any examples? I've tried fpaq0pv4b, which seems to be the best-performing binary coder in the fpaq series, and despite claiming to be a simple order-0 test for arithmetic coders, it's simply not able to keep up with a byte-wise arithmetic coder, despite the divisions.

I'm sure there are ways of optimising the bitwise ones further, but I don't know of any open source code for this. Pointers would be most welcome.

Meanwhile I'm exploring rANS. Thanks Jarek and Ryg.

22. Originally Posted by Jarek
Regarding the range version of ANS (rANS) I mentioned - using a single multiplication instead of two for range coding - there is a nice explanation and comparison on Charles Bloom's blog ( http://cbloomrants.blogspot.com/ ), saying that the implementation by Fabian 'ryg' Giesen (I couldn't find details) decodes at 140-167 MB/s, about half the speed of FSE.
Unfortunately the encoding speed is also about twice as slow, probably because of the required division.

update: Here is Fabian's implementation: https://github.com/rygorous/ryg_rans
After a few gotchas and a bug fix (thanks Fabian) I now have a version of my O0 and O1 coders using rANS as a more or less drop-in replacement for the byte-wise range coder.

The results are indeed an improvement as predicted.

On my 1GB relatively low-entropy file it is now encoding at 109MB/s and decoding at 167MB/s. This compares well to the 139MB/s encode and 93MB/s decode of the range coder version (which wasn't exactly tardy) on a 2.2GHz Xeon E5-2660. I'm definitely happy with it, although it's perhaps a bit bleeding edge! I know a tANS version would be faster still, but I'm less sure about making an order-1 variant of that algorithm.

Thanks all.

23. ## Thanks:

Jarek (4th February 2014)

24. I decided I'd upload the source and some prebuilt 64-bit Linux binaries here.
I tried cross-compiling for Windows and running under Wine, but it fails for some unknown reason. I don't know if this means it won't work under Windows or whether it is simply my crufty and hacky wrapper that isn't portable.

Consider these simply as a demonstration, but I expect they'd do very well on the various benchmarks compared to most other o0 and o1 pure entropy encoders.

One thing I found is that by unrolling/interleaving the rANS encoders they run pretty close to the tANS implementation in FSE. That's good news.

25. ## Thanks (2):

Bulat Ziganshin (7th February 2014),ne0n (7th February 2014)

26. Originally Posted by JamesB
Do you have any examples? I've tried fpaq0pv4b, which seems to be the best-performing binary coder in the fpaq series, and despite claiming to be a simple order-0 test for arithmetic coders, it's simply not able to keep up with a byte-wise arithmetic coder, despite the divisions.
My typical example would be the MQ-coder, though any of the MPEG coders would also work. I have an implementation I could share, but it is not optimised. In fact, you can get rid of all branches but one, and neither divisions nor multiplications are present - only shifts, additions and subtractions. The trick, of course, is to binarise the data correctly. Again, a trivial example might be the arithmetic-coding variant of 10918-1 (JPEG).

27. thorfdbg, it would be great if you could share code/explanations. I heard about that 10-20 years ago, but still don't understand how it works.

28. So that's page 54 onwards of www.w3.org/Graphics/JPEG/itu-t81.pdf

Code is available as jcarith.c in http://www.ijg.org/files/jpegsrc.v9a.tar.gz, linked to from http://www.ijg.org/

I've no idea how this compares speed-wise. I couldn't find a single function to compress or decompress an entire block of memory, rather just lots and lots of embedded calls to arith_encode and arith_decode to emit specific symbols.

29. Originally Posted by Bulat Ziganshin
thorfdbg, it would be great if you could share code/explanations. I heard about that 10-20 years ago, but still don't understand how it works.
Here's a cleaner explanation of the closely related QM coder. The idea is to feed it images as bit planes (e.g. the MSB of every byte, then the 2nd, 3rd, ... down to the LSB). I haven't tried it yet, since arithmetic coding is fast enough for me right now, but I had problems when I had too many symbols, so I will probably try this later (with bit planes).

http://www.cs.ucf.edu/courses/cap5015/QM_coder.pdf
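A bit-plane split is simple enough to sketch (a hypothetical helper of mine, not from the linked material): plane 7 of every byte first, then plane 6, and so on, each plane becoming a stream of binary decisions for the coder:

```c
#include <stdint.h>
#include <stddef.h>

/* Extract one bit plane (7 = MSB ... 0 = LSB) from a byte buffer,
 * giving the stream of 0/1 decisions a binary coder would model. */
void bit_plane(const uint8_t *in, size_t n, int plane, uint8_t *out)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = (in[i] >> plane) & 1;
}
```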

30. ## Updated version

I updated the demonstration of the unrolling (technically I should call it interleaving or striping, I guess, as it's not unrolling in the usual sense of the word, given it produces different output) and fixed a few bugs found by testing on Matt's simple.zpaq archive of awkward files.

Speed-wise it's largely unchanged. I also did some basic benchmarking with manual code changes to test different amounts of interleaving/striping on an Intel and an AMD system. As can be seen, it's beneficial in all cases. These are for the rANS_static demo:

Code:
```rANS_static, manual unrolling tests in MB/s (encode/decode).

Intel   1-way           2-way           4-way           8-way
QQ-O0   138/109         156/167         159/225         130/184
qq-O0   141/116         160/180         151/232         135/217

QQ-O1    99/57          102/91           98/123          87/133
qq-O1   112/91          114/148         113/187         111/194


AMD     1-way           2-way           4-way           8-way
QQ-O0    73/91           80/130          77/148          87/125
qq-O0    90/99          101/140         102/169         102/149

QQ-O1    58/34           56/53           54/69           51/66
qq-O1    82/55           84/83           80/99           81/95

Conclusion:
O1 encoding should be 2-way done 4 times (hence 8-way).
O1 decoding should be 8-way.```
As before, many thanks to Eugene Shelwien, Jarek Duda and Fabian Giesen, on whose code this is largely based.

31. ## Thanks (2):

Bulat Ziganshin (12th February 2014),Cyan (12th February 2014)
