Actually, I decided to benchmark it again. The more complex formats, such as adaptive order-1 range coding, are quite competitive:
Code:
$ time node ../javascript/main_arith_gen.js -o 1 /dev/shm/enwik8 > /dev/shm/_.enc; time node ../javascript/main_arith_gen.js -d /dev/shm/_.enc > /dev/shm/_.out; ls -l /dev/shm
Compress order 1, 100000000 => 47945026
real 0m5.820s
user 0m5.734s
sys 0m0.096s
Decompress 47945410 => 100000000
real 0m6.117s
user 0m6.044s
sys 0m0.084s
total 242144
-rw-r--r-- 1 jkb team117 47945410 Jun 17 21:06 _.enc
-rw-r--r-- 1 jkb team117 100000000 Jun 17 21:06 _.out
-rw-r----- 1 jkb team117 100000000 Jun 17 21:06 enwik8
vs C
Code:
$ time ./tests/arith_dynamic -o1 /dev/shm/enwik8 > /dev/shm/_.enc; time ./tests/arith_dynamic -d /dev/shm/_.enc > /dev/shm/_.out; ls -l /dev/shm
Took 2617014 microseconds, 38.2 MB/s
real 0m2.644s
user 0m2.603s
sys 0m0.039s
Took 3203435 microseconds, 31.2 MB/s
real 0m3.219s
user 0m3.191s
sys 0m0.029s
total 242144
-rw-r--r-- 1 jkb team117 47945550 Jun 17 21:07 _.enc
-rw-r--r-- 1 jkb team117 100000000 Jun 17 21:07 _.out
-rw-r----- 1 jkb team117 100000000 Jun 17 21:06 enwik8
So the JavaScript is only around 2x slower.
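(As an aside for anyone unfamiliar: "adaptive order-1" means the symbol probabilities are conditioned on the previous byte and updated as the data streams through, which is why it costs more per symbol than a static table. A toy sketch of such a model, with made-up names and an entropy estimate instead of actual coding, nothing to do with the code benchmarked above:)

```javascript
// Toy adaptive order-1 byte model: one frequency table per previous-byte
// context, updated after every symbol. Illustrative sketch only.
class Order1Model {
    constructor() {
        // 256 contexts x 256 symbols; counts start at 1 so no symbol
        // ever has zero probability.
        this.freq  = Array.from({length: 256}, () => new Uint32Array(256).fill(1));
        this.total = new Uint32Array(256).fill(256);
    }
    // Return P(sym | prev), then adapt the model towards it.
    update(prev, sym) {
        const p = this.freq[prev][sym] / this.total[prev];
        this.freq[prev][sym] += 32;   // bump the observed symbol
        this.total[prev]     += 32;
        return p;
    }
}

// Estimate the compressed size of a buffer under this model
// (sum of -log2(p) over all symbols, i.e. the entropy coder's ideal output).
function estimateBytes(buf) {
    const m = new Order1Model();
    let bits = 0, prev = 0;
    for (const sym of buf) {
        bits += -Math.log2(m.update(prev, sym));
        prev = sym;
    }
    return Math.ceil(bits / 8);
}
```

A range coder then turns those per-symbol probabilities into the actual bitstream; the per-symbol model updates are what make the format slower, but denser, than a static order-0 one.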
The faster order-0 rans, with static frequencies and heavily optimised C (including some inline asm to force it to use cmov, loop unrolling, etc.), shows a bigger gap:
Code:
$ time node ../javascript/main_rans4x16.js -o 0 /dev/shm/enwik8 > /dev/shm/_.enc; time node ../javascript/main_rans4x16.js -d /dev/shm/_.enc > /dev/shm/_.out; ls -l /dev/shm
Compress order 0, 100000000 => 63632618
real 0m1.777s
user 0m1.709s
sys 0m0.084s
Decompress 63632618 => 100000000
real 0m1.018s
user 0m0.946s
sys 0m0.080s
total 257464
-rw-r--r-- 1 jkb team117 63632618 Jun 17 21:11 _.enc
-rw-r--r-- 1 jkb team117 100000000 Jun 17 21:11 _.out
-rw-r----- 1 jkb team117 100000000 Jun 17 21:06 enwik8
vs C
Code:
$ time ./tests/rans4x16pr -o0 /dev/shm/enwik8 > /dev/shm/_.enc; time ./tests/rans4x16pr -d /dev/shm/_.enc > /dev/shm/_.out; ls -l /dev/shm
Took 384988 microseconds, 259.7 MB/s
real 0m0.405s
user 0m0.394s
sys 0m0.014s
Took 184316 microseconds, 542.5 MB/s
real 0m0.200s
user 0m0.164s
sys 0m0.036s
total 257464
-rw-r--r-- 1 jkb team117 63634172 Jun 17 21:10 _.enc
-rw-r--r-- 1 jkb team117 100000000 Jun 17 21:10 _.out
-rw-r----- 1 jkb team117 100000000 Jun 17 21:06 enwik8
So that's a 4-5x speed difference. It widens even more for order-1 rans, but as I say, the JavaScript code has no loop unrolling and only minimal nods to efficiency. It's deliberately written to stay as close to the specification as possible, so it reads as an easy-to-follow description of the algorithm. Unlike the C code, which is a total mess!
E.g. the order-1 rans JavaScript decode inner loop is:
Code:
// Main decode loop
var output = Buffer.allocUnsafe(nbytes);
for (var i = 0; i < nbytes; i++) {
    var i4 = i % 4;
    var f = RansGetCumulativeFreq(R[i4], 12);
    var s = C2S[f]; // Equiv to RansGetSymbolFromFreq(C, f);
    output[i] = s;
    R[i4] = RansAdvanceStep(R[i4], C[s], F[s], 12);
    R[i4] = RansRenorm(src, R[i4]);
}
The C version of the same function is:
Code:
for (i = 0; cp < cp_end-8 && i < (out_sz&~7); i+=8) {
    for (j = 0; j < 8; j+=4) {
        RansState m0 = RansDecGet(&R[0], TF_SHIFT);
        RansState m1 = RansDecGet(&R[1], TF_SHIFT);

        R[0] = sfreq[m0] * (R[0] >> TF_SHIFT) + sbase[m0];
        R[1] = sfreq[m1] * (R[1] >> TF_SHIFT) + sbase[m1];

        RansDecRenorm(&R[0], &cp);
        RansDecRenorm(&R[1], &cp);

        out[i+j+0] = ssym[m0];
        out[i+j+1] = ssym[m1];

        RansState m3 = RansDecGet(&R[2], TF_SHIFT);
        RansState m4 = RansDecGet(&R[3], TF_SHIFT);

        R[2] = sfreq[m3] * (R[2] >> TF_SHIFT) + sbase[m3];
        R[3] = sfreq[m4] * (R[3] >> TF_SHIFT) + sbase[m4];

        out[i+j+2] = ssym[m3];
        out[i+j+3] = ssym[m4];

        RansDecRenorm(&R[2], &cp);
        RansDecRenorm(&R[3], &cp);
    }
}

// remainder
for (; i < out_sz; i++) {
    RansState m = RansDecGet(&R[i%4], TF_SHIFT);
    R[i%4] = sfreq[m] * (R[i%4] >> TF_SHIFT) + sbase[m];
    out[i] = ssym[m];
    RansDecRenormSafe(&R[i%4], &cp, cp_end+8);
}

Credit to the V8 JIT engine for getting even remotely close.
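(Footnote on the "static frequencies" mentioned above: the order-0 coder builds one table up front, normalised so the counts sum to a power of two, 2^12 here, so cumulative frequencies can be masked straight out of the state. A rough sketch of that normalisation, illustrative only and not the benchmarked code:)

```javascript
// Build a static order-0 frequency table normalised so the counts sum to
// exactly 2^bits, as the 12-bit rANS variants need. Sketch only; the real
// code also has to serialise the table into the compressed stream.
function normaliseFreqs(buf, bits = 12) {
    const raw = new Uint32Array(256);
    for (const b of buf) raw[b]++;

    const target = 1 << bits;
    const F = new Uint32Array(256);
    let assigned = 0, maxSym = 0;
    for (let s = 0; s < 256; s++) {
        if (!raw[s]) continue;
        // Scale, but never round a present symbol down to zero frequency.
        F[s] = Math.max(1, Math.round(raw[s] * target / buf.length));
        assigned += F[s];
        if (F[s] > F[maxSym]) maxSym = s;
    }
    // Dump the rounding error on the most common symbol (assumes the
    // error is smaller than that symbol's count, true for typical data).
    F[maxSym] += target - assigned;

    // Cumulative frequencies: C[s] = sum of F[0..s-1], C[256] = total.
    const C = new Uint32Array(257);
    for (let s = 0; s < 256; s++) C[s + 1] = C[s] + F[s];
    return { F, C };
}
```

With the table fixed, the decoder's per-symbol work has no model update at all, which is a large part of why the static order-0 path is so much faster than the adaptive order-1 one.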