1. ## Unrolling arithmetic coding.

I've been exploring the idea of optimising byte-oriented arithmetic coding for a specific task. My data is largely stationary, although with correlations. This means I don't particularly need an adaptive model, so instead I can code blocks using static probabilities. I'm interested in anyone else's explorations into optimising this sort of code. What I have right now doesn't match things like FSE of course, but it's not THAT far off (within half the speed, for an O(1) entropy encoder).

I started with one of Eugene Shelwien's range coders from the coders6c2.zip archive. I note the main encoder and decoder functions contain:

Code:
```void Encode (uint cumFreq, uint freq, uint totFreq)
{
    low   += cumFreq * (range /= totFreq);
    range *= freq;
    ...
}

uint GetFreq (uint totFreq) {
    return code / (range /= totFreq);
}```
In here we have a number of divisions, which drastically slow things down. However, for a static coder we can cheat by normalising our input frequencies so that they sum to a totFreq value that's a power of 2, i.e. the code becomes:

Code:
```void Encode (uint cumFreq, uint freq)
{
    low   += cumFreq * (range >>= TOTFREQ_BITS);
    range *= freq;
    ...
}

uint GetFreq (void) {
    return code / (range >>= TOTFREQ_BITS);
}```
This gives a remarkable improvement in encoding speed, while decoding still suffers a bit due to the remaining division. I explored a few methods of replacing it with multiplications and bit shifts, but found that by themselves they all have minimal benefit, and sometimes harm things.
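For anyone wanting to try this, here is a minimal sketch of the normalisation step (identifiers are my own, not from the posted source): it rescales raw symbol counts so they sum to exactly 1 << TOTFREQ_BITS, which is what lets totFreq become a shift.

```c
#include <stdint.h>

#define TOTFREQ_BITS 14
#define TOTFREQ (1u << TOTFREQ_BITS)

/* Rescale raw symbol counts so they sum to exactly TOTFREQ.
 * Assumes at least one count is non-zero.  Any symbol that was
 * seen keeps freq >= 1 so it remains codable. */
void normalise_freqs(const uint32_t cnt[256], uint32_t F[256])
{
    uint64_t total = 0;
    uint32_t sum = 0;
    int i, max_i = 0;

    for (i = 0; i < 256; i++)
        total += cnt[i];

    for (i = 0; i < 256; i++) {
        F[i] = (uint32_t)(((uint64_t)cnt[i] * TOTFREQ) / total);
        if (cnt[i] && F[i] == 0)
            F[i] = 1;               /* never squash a live symbol to 0 */
        sum += F[i];
        if (F[i] > F[max_i])
            max_i = i;
    }

    /* Dump the rounding error on the most frequent symbol (relies on
     * unsigned wrap-around if sum overshot TOTFREQ slightly). */
    F[max_i] += TOTFREQ - sum;
}
```

Dumping the residual on the most frequent symbol is just the simplest choice; a smarter redistribution would lose slightly less compression.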

However, I then explored manual unrolling using multiple concurrent range coders, and this is where the real win comes from. For a basic order-0 codec with precomputed frequencies, my encoding and decoding loops started like this. (Note this is a C variant instead of the C++ one above.)

Code:
```    // Encoder tight loop
    RC_output(&rc, (char *)cp);
    RC_StartEncode(&rc);
    for (i = 0; i < in_size; i++) {
        unsigned char c = in[i];
        RC_Encode(&rc, C[c], F[c]);
    }
    RC_FinishEncode(&rc);

    // ...

    // Decoder tight loop
    RC_input(&rc, (char *)cp);
    RC_StartDecode(&rc);
    for (i = 0; i < out_sz; i++) {
        uint32_t freq = RC_GetFreq(&rc);
        c = D.R[freq];
        RC_Decode(&rc, D.fc[c].C, D.fc[c].F);
        out_buf[i] = c;
    }
    RC_FinishDecode(&rc);```
With some manual loop unrolling we can get this:

Code:
```    // Tight encoding loop
    i_end = in_size & ~3;
    for (i = 0; i < i_end; i += 4) {
        unsigned char c[4];
        c[0] = in[i+0];
        c[1] = in[i+1];
        c[2] = in[i+2];
        c[3] = in[i+3];

        RC_Encode(&rc[0], C[c[0]], F[c[0]]);
        RC_Encode(&rc[1], C[c[1]], F[c[1]]);
        RC_Encode(&rc[2], C[c[2]], F[c[2]]);
        RC_Encode(&rc[3], C[c[3]], F[c[3]]);
    }

    // Tidy up any remainder that isn't a multiple of 4
    for (; i < in_size; i++) {
        unsigned char c = in[i];
        RC_Encode(&rc[0], C[c], F[c]);
    }```

Code:
```    // Tight decoding loop, unrolled 4-ways
    i_end = out_sz & ~3;
    for (i = 0; i < i_end; i += 4) {
        uint32_t freq[4];
        unsigned char c[4];

        freq[0] = RC_GetFreq(&rc[0]);
        freq[1] = RC_GetFreq(&rc[1]);
        freq[2] = RC_GetFreq(&rc[2]);
        freq[3] = RC_GetFreq(&rc[3]);

        c[0] = D.R[freq[0]];
        c[1] = D.R[freq[1]];
        c[2] = D.R[freq[2]];
        c[3] = D.R[freq[3]];

        RC_Decode(&rc[0], D.fc[c[0]].C, D.fc[c[0]].F);
        RC_Decode(&rc[1], D.fc[c[1]].C, D.fc[c[1]].F);
        RC_Decode(&rc[2], D.fc[c[2]].C, D.fc[c[2]].F);
        RC_Decode(&rc[3], D.fc[c[3]].C, D.fc[c[3]].F);

        out_buf[i+0] = c[0];
        out_buf[i+1] = c[1];
        out_buf[i+2] = c[2];
        out_buf[i+3] = c[3];
    }

    // Tidy up any remainder that isn't a multiple of 4
    for (; i < out_sz; i++) {
        uint32_t freq = RC_GetFreq(&rc[0]);
        unsigned char c = D.R[freq];
        RC_Decode(&rc[0], D.fc[c].C, D.fc[c].F);
        out_buf[i] = c;
    }```

Calling RC_GetFreq 4 times on interleaved range coders, each with non-overlapping inputs and outputs, means that the expensive divisions can be pipelined. E.g. see Agner Fog's instruction tables at http://www.agner.org/optimize/, which indicate that on Sandy Bridge a 32-bit IDIV can issue roughly every 9 cycles but has a 20-27 cycle latency before the result is available. During that latency it appears we can be computing other results, including other divisions.

The net difference is that the code went from 5.7s encode and 16.4s decode on a 1GB file to 6.8s encode (slower) and 9.4s decode (much faster) on the same file. That's breaking 100MB/s.

Order-1 modelling is a bit trickier, as we need the last byte output in order to select the frequencies for the next decode cycle. However, if we position 8 codecs (I found 8 worked better than 4 for O1) at 1/8th of the buffer apart, then they can each use their own last byte value fine, with the small exception that on startup we need 8 fresh "last" values. (I could store 7 of these in the file, but for now I just set all 8 to zero.) The effect of order-1 unrolling is equally profound, giving around 100% throughput gains in decoding.
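To illustrate the layout (a sketch with made-up names, not the actual arith_static.c code): each of the 8 interleaved coders owns a contiguous 1/8th stripe of the output and carries its own order-1 "last byte" context, seeded to zero as described:

```c
#include <stdint.h>
#include <stddef.h>

#define NWAY 8

/* Lay out NWAY interleaved order-1 decoders across one output buffer:
 * decoder i produces bytes [start[i], start[i+1]) and keeps its own
 * "last byte" context, seeded to 0 at startup. */
void stripe_layout(size_t out_sz, size_t start[NWAY + 1], uint8_t last[NWAY])
{
    size_t stripe = out_sz / NWAY;
    int i;
    for (i = 0; i < NWAY; i++) {
        start[i] = (size_t)i * stripe;
        last[i]  = 0;            /* fresh order-1 context per stripe */
    }
    start[NWAY] = out_sz;        /* final stripe absorbs the remainder */
}
```

In the main loop all 8 decoders then step one byte per iteration within their own stripe, so their 8 divisions overlap in the pipeline.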

I'm assuming that this is something that people have previously observed, but I've never seen any public code exploiting this. The gains are really quite remarkable given I haven't got any (explicit) SIMD/SSE calls in there. The code is unapologetically messy right now as it's very much a work in progress, but if anyone is interested I uploaded a copy to our ftp server at
ftp://ftp.sanger.ac.uk/pub/users/jkb/arith_static.c

To compare with the non-unrolled version, comment out the #define UNROLLED at the start. Note the two variants aren't binary compatible with each other.

2. ## Thanks (4):

Bulat Ziganshin (28th January 2014),cade (28th January 2014),Cyan (28th January 2014),Matt Mahoney (29th January 2014)

3. I just noticed that version has the division in RC_GetFreq commented out and replaced by precomputed multipliers, but it's a marginal change and can be put back. I can't be bothered to try to put a new one there given it takes 15 mins for our ftp site to refresh (gah, no forced push method!).

Anyway this is very much a work in progress, I just thought someone here may be interested in the methods used and I'm sure others can teach me more.

4. Have you looked at the asm? Has the compiler vectorised it, or is it pure instruction-level parallelism?
Eugene studied vectorised encoders a long time ago and the results were good; it's somewhere on this forum.

5. I ran into the same problem as you when experimenting with a block-based arithmetic coder (see Range0 - http://fastcompression.blogspot.fr/p...py-coders.html)
Pipelining divisions is a really good idea to hide the operation's latency.

It reminds me of some notes from Carmack, using the same methodology within Quake1.

6. Do you know why encoding is significantly slower after unrolling? There doesn't seem to be any reason for it. With SSE2, you would probably see worse results caused by the overhead of loading unaligned bytes.

7. Originally Posted by m^2
Have you looked at the asm? Has the compiler vectorised it, or is it pure instruction-level parallelism?
I seriously doubt that modern compilers can vectorise such code; IMHO they usually manage it only for single operations such as a[i]=b[i]+c[i]. Here we have 1) a division and 2) the normalisation loop, so it's definitely beyond SSE's abilities. I wonder how Eugene managed it - probably he vectorised only part of the computations.

Do you know why encoding is significantly slower after unrolling? There doesn't seem to be any reason for it.
The compiler needs to manage more variables, so we get more loads/stores. It might be fixed by encoding into 4 separate buffers sequentially.

Do you know why encoding is significantly slower after unrolling? There doesn't seem to be any reason for it. With SSE2, you would probably see worse results caused by the overhead of loading unaligned bytes.
I haven't had time to investigate yet; this is all a bit hot off the press.

I can probably check for cache misses etc. using "perf stat" on Linux. (A wonderful tool!)

10. Cache misses have nothing in common with the misaligned-load penalty, and for such a linear algorithm they should be near 0. AFAIK, unaligned SSE loads are free or almost free on modern Intel CPUs.

11. ## Thanks:

12. I don't think there should be unaligned accesses. The O(1) unrolling positions the loop starts on a boundary of 8 input bytes, so that seems reasonable. The O(0) codec just does bytes 0,1,2,3 then 4,5,6,7, etc., so it is also aligned.

I'm not sure if it is actually using SSE without explicitly looking at the generated assembly. The primary speed-up comes from hiding the latency of the division, and from the removal of a further division in both encoder and decoder by fixing totFreq. Those divisions really stall the pipeline.

However, that still doesn't explain the slow-down in the encoder when unrolling. It's probably simply the additional memcpy required to paste the various output buffers together. I could have a hybrid affair where they share a common output buffer. It would create some small interdependency between the codecs, but it's not doing this via multi-threading so that's not particularly bad.

13. If "char" is 8 bits, this could be slow.
Remember that, typically, on "big" Intel CPUs (Core, back to the PPro) loads fetch a 64-byte line.
Therefore if you load 1 byte, 64-byte aligned, you get the remaining 63 "gratis" (at the same latency).
And there can be micro-op coalescing (or not).
So you really need the disassembly to understand exactly what happens.
Code:
```for (i = 0; i < i_end; i += 4) {
    unsigned char c[4];
    c[0] = in[i+0];
    c[1] = in[i+1];
    c[2] = in[i+2];
    c[3] = in[i+3];```

14. All these versions require 2 multiplications per step - you can reduce that to 1 using the range version of ANS instead: see the 8th page of http://arxiv.org/pdf/1311.2540

For natural f[s] (the quantisation of probabilities):
freq[s] = f[s] / 2^n
cumul[s] = sum_{i<s} f[i]
symbol[x], for x = 0 ... 2^n - 1, is the s with cumul[s] <= x < cumul[s+1] - i.e. which symbol sits in the x-th position of a length-2^n range (0..01..1...) containing f[s] appearances of symbol s.

Encoding symbol s requires 1 division and a modulo by f[s]:
C(s,x) = (floor(x / f[s]) << n) + mod(x, f[s]) + cumul[s]

Decoding from x has one multiplication and needs a symbol[] table, or a function deciding which subrange we are in (with mask = 2^n - 1):
s = symbol[x & mask]
x = f[s] * (x >> n) + (x & mask) - cumul[s]
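To make the algebra concrete, here is a toy rANS coder in C following the formulas above (my own sketch: a fixed 2-symbol alphabet, n = 4, and no renormalisation, so the state simply grows in a 64-bit integer and only short messages fit). Note decoding pops symbols in reverse order of encoding:

```c
#include <stdint.h>

#define N_BITS 4                        /* total frequency is 2^4 = 16 */
#define MASK   ((1u << N_BITS) - 1)

/* Toy 2-symbol alphabet with f[0] + f[1] == 1 << N_BITS */
static const uint32_t f[2]     = { 12, 4 };
static const uint32_t cumul[2] = { 0, 12 };

/* symbol[x & MASK]: which subrange the bottom n bits of x fall in */
static const uint8_t symbol[16] = { 0,0,0,0,0,0,0,0,0,0,0,0, 1,1,1,1 };

/* Encode: C(s,x) = (floor(x/f[s]) << n) + mod(x,f[s]) + cumul[s] */
static uint64_t rans_encode(uint64_t x, int s)
{
    return ((x / f[s]) << N_BITS) + (x % f[s]) + cumul[s];
}

/* Decode: s = symbol[x & mask]; x = f[s]*(x >> n) + (x & mask) - cumul[s] */
static uint64_t rans_decode(uint64_t x, int *s)
{
    *s = symbol[x & MASK];
    return f[*s] * (x >> N_BITS) + (x & MASK) - cumul[*s];
}
```

One multiplication in the decoder, as claimed; the encoder carries the division/modulo by f[s].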

15. ## Thanks:

JamesB (29th January 2014)

16. Originally Posted by fcorbelli
If "char" is 8 bits, this could be slow.
Remember that, typically, on "big" Intel CPUs (Core, back to the PPro) loads fetch a 64-byte line.
Therefore if you load 1 byte, 64-byte aligned, you get the remaining 63 "gratis" (at the same latency).
That's true for loading from memory into the LLC (last-level cache), but this code works with data in the L1 cache, so load delays are around 4 cycles.

17. Originally Posted by m^2
Have you looked at the asm? Has the compiler vectorised it, or is it pure instruction-level parallelism?
Eugene studied vectorised encoders a long time ago and the results were good; it's somewhere on this forum.

It sounds like it is bit-wise, with the vectorisation being done to update 8 bits at a time. Certainly an interesting method. Oddly it crashes on exit (after doing all the encoding/decoding and flushing the results), and it also takes around a minute each way, so it's substantially slower. It does, however, do a reasonable job at compression, and it's obviously an adaptive model. My own hacked-up coder isn't so hot for general data, as the cost of storing all the symbol frequencies is quite high unless I boost the block size further. (However, it was designed for a specific purpose.)

18. There are reasons why the coders in modern applications like HEVC are all *binary* coders and not multi-symbol arithmetic coders. A division can really hurt performance a lot and, even worse, is hard to implement in hardware. Most of the modern approaches are entirely table-driven, where the states in the table reflect both the partitioning of the interval and the probability estimate. In practice, nobody uses an arithmetic coder with multiplication and division anymore. It's just too slow.

19. Regarding the range version of ANS (rANS) I mentioned - using a single multiplication instead of two for range coding - there is a nice explanation and comparison on Charles Bloom's blog ( http://cbloomrants.blogspot.com/ ), saying that the implementation by Fabian 'ryg' Giesen (I couldn't find details) decodes at 140-167 MB/s, about half the speed of FSE.
Unfortunately the encoding speed is also about twice as slow, probably because of the required division.

update: Here is Fabian's implementation: https://github.com/rygorous/ryg_rans

20. ## Thanks:

JamesB (3rd February 2014)

21. Originally Posted by thorfdbg
There are reasons why the coders in modern applications like HEVC are all *binary* coders and not multi-symbol arithmetic coders. A division can really hurt performance a lot and, even worse, is hard to implement in hardware. Most of the modern approaches are entirely table-driven, where the states in the table reflect both the partitioning of the interval and the probability estimate. In practice, nobody uses an arithmetic coder with multiplication and division anymore. It's just too slow.
Do you have any examples? I've tried fpaq0pv4b, which seems to be the best-performing binary coder in the fpaq series, and despite claiming to be a simple order-0 test for arithmetic coders, it's simply not able to keep up with a byte-wise arithmetic coder, despite the divisions.

I'm sure there are ways of optimising the bitwise ones further, but I don't know of any open source code for this. Pointers would be most welcome.

Meanwhile I'm exploring rANS. Thanks Jarek and Ryg.

22. Originally Posted by Jarek
Regarding the range version of ANS (rANS) I mentioned - using a single multiplication instead of two for range coding - there is a nice explanation and comparison on Charles Bloom's blog ( http://cbloomrants.blogspot.com/ ), saying that the implementation by Fabian 'ryg' Giesen (I couldn't find details) decodes at 140-167 MB/s, about half the speed of FSE.
Unfortunately the encoding speed is also about twice as slow, probably because of the required division.

update: Here is Fabian's implementation: https://github.com/rygorous/ryg_rans
After a few gotchas and a bug fix (thanks Fabian) I now have a version of my O0 and O1 coders using rANS as a more or less drop-in replacement for the byte-wise range coder.

The results are indeed an improvement as predicted.

On my 1GB relatively low-entropy file it is now encoding at 109MB/s and decoding at 167MB/s. This compares well to the 139MB/s encode and 93MB/s decode of the range coder version (which wasn't exactly tardy) on a 2.2GHz Xeon E5-2660. I'm definitely happy with it, although it's perhaps a bit bleeding edge! I know a tANS version would be faster still, but I'm less sure about making an order-1 variant of that algorithm.

Thanks all.

23. ## Thanks:

Jarek (4th February 2014)

24. I decided I'd upload the source and some prebuilt 64-bit Linux binaries here.
I tried cross-compiling for Windows and running under Wine, but it fails for some unknown reason. I don't know if this means it won't work under Windows or whether it is simply my crufty and hacky wrapper that isn't portable.

Consider these simply as a demonstration, but I expect they'd do very well on the various benchmarks compared to most other o0 and o1 pure entropy encoders.

One thing I found is that by unrolling/interleaving the rANS encoders they run pretty close to the tANS implementation in FSE. That's good news.

25. ## Thanks (2):

Bulat Ziganshin (7th February 2014),ne0n (7th February 2014)

26. Originally Posted by JamesB
Do you have any examples? I've tried fpaq0pv4b, which seems to be the best-performing binary coder in the fpaq series, and despite claiming to be a simple order-0 test for arithmetic coders, it's simply not able to keep up with a byte-wise arithmetic coder, despite the divisions.
My typical example would be the MQ-coder, though any of the MPEG coders would also work. I have an implementation I could share, but it is not optimised. In fact, you can get rid of all branches but one, and neither divisions nor multiplications are present - only shifts, additions and subtractions. The trick, of course, is to binarise the data correctly. Again, a trivial example might be the arithmetic-coding variant of 10918-1 (JPEG).

27. thorfdbg, it would be great if you could share code/explanations. I heard about that 10-20 years ago, but still don't understand how it works.

28. So that's page 54 onwards of www.w3.org/Graphics/JPEG/itu-t81.pdf

Code is available as jcarith.c in http://www.ijg.org/files/jpegsrc.v9a.tar.gz, linked to from http://www.ijg.org/

I've no idea how this compares speed-wise. I couldn't find a single function to compress or decompress an entire block of memory, rather just lots and lots of embedded calls to arith_encode and arith_decode to emit specific symbols.

29. Originally Posted by Bulat Ziganshin
thorfdbg, it would be great if you could share code/explanations. I heard about that 10-20 years ago, but still don't understand how it works.
Here's a cleaner explanation of the closely related QM coder. The idea is to feed it images as bit planes (e.g. the MSB of every byte, then the 2nd, 3rd, ... down to the LSB). I haven't tried it yet, since arithmetic coding is fast enough for me right now, but I had problems when I had too many symbols, so I will probably try this later (with bit planes).

http://www.cs.ucf.edu/courses/cap5015/QM_coder.pdf
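A bit-plane split is simple enough to sketch (a hypothetical helper of mine, not from the linked material): plane 7 of every byte first, then plane 6, and so on, each plane becoming a stream of binary decisions for the coder:

```c
#include <stdint.h>
#include <stddef.h>

/* Extract one bit plane (7 = MSB ... 0 = LSB) from a byte buffer,
 * giving the stream of 0/1 decisions a binary coder would model. */
void bit_plane(const uint8_t *in, size_t n, int plane, uint8_t *out)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = (in[i] >> plane) & 1;
}
```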

30. ## Updated version

I updated the demonstration of the unrolling (technically I should call it interleaving or striping, I guess, as it's not unrolling in the usual sense of the word, given it produces different output) and fixed a few bugs found by testing on Matt's simple.zpaq archive of awkward files.

Speed-wise it's largely unchanged. I also did some basic benchmarking with manual code changes to test different amounts of interleaving/striping on an Intel and an AMD system. As can be seen, it's beneficial in all cases. These are for the rANS_static demo:

Code:
```rANS_static, manual unrolling tests in MB/s (encode/decode).

Intel   1-way           2-way           4-way           8-way
QQ-O0   138/109         156/167         159/225         130/184
qq-O0   141/116         160/180         151/232         135/217

QQ-O1    99/57          102/91           98/123          87/133
qq-O1   112/91          114/148         113/187         111/194


AMD     1-way           2-way           4-way           8-way
QQ-O0    73/91           80/130          77/148          87/125
qq-O0    90/99          101/140         102/169         102/149

QQ-O1    58/34           56/53           54/69           51/66
qq-O1    82/55           84/83           80/99           81/95

Conclusion:
O1 encoding should be 2-way done 4 times (hence 8-way).
O1 decoding should be 8-way.```
As before, many thanks to Eugene Shelwien, Jarek Duda and Fabian Giesen, on whose code this is largely based.

31. ## Thanks (2):

Bulat Ziganshin (12th February 2014),Cyan (12th February 2014)
