Cmov can be faster, but it's hard to get it used by the compiler.
Benchmarking the above code vs some assembly using cmov on two files that encode to 1.84 bits per byte and 4.54 bits per byte gives me these decode speeds (best of 10):
asm cmov: 507.7 MB/s, 495.3 MB/s
no-branch: 414.8 MB/s, 411.8 MB/s
original ryg: 358.1 MB/s, 365.8 MB/s
The assembly version is:
Code:
static inline void RansDecRenorm(RansState* r, uint8_t** pptr) {
uint32_t x = *r;
uint8_t *ptr = *pptr;
__asm__ ("movzbl (%0), %%eax\n\t"
"mov %1, %%edx\n\t"
"shl $0x8,%%edx\n\t"
"or %%eax,%%edx\n\t"
"cmp $0x800000,%1\n\t"
"cmovb %%edx,%1\n\t"
"adc $0x0,%0\n\t"
: "=r" (ptr), "=r" (x)
: "0" (ptr), "1" (x)
: "eax", "edx"
);
if (x < 0x800000) x = (x << 8) | *ptr++;
*pptr = ptr;
*r = x;
}
It's only half branchless cmov because the second renorm half is rarely triggered and very predictable by the CPU. The 16-bit renorm version is around 660MB/s instead. Insane 32-way unrolling with AVX2 is 1.4GB/s, but that's a bit extreme and has other issues with small data sets.
Clang prefers the assembly too, while icc runs tragically slow on it (about half speed) but does a much better job on Lucas' branchless variant. The branchless code certainly seems to be the optimal case for pure C, but a bigger question is whether we can get the compilers to generate code like the assembly above, maybe with hints about predicatability of branching?