Here are a few initial numbers I collected for compressing the Silesia corpus on x86_64, with the caveat that I was unable to include vaporware such as lzturbo, which has no source code available:
Code:
Compressor                    Csize (bytes)  Ctime (ms)  Dtime (ms)
Brotli, -6 -w 19                 60,285,238       12457         786
Brotli, -6 -w 18                 61,100,491       11176         813
ZSTD -7                          61,975,738        6326         468
XPACK -6 (chunk size 524288)     62,216,855        7922         419
ZSTD -6                          62,981,472        4723         476
ZSTD -5                          65,006,998        3315         489
XPACK -5 (chunk size 262144)     65,330,086        4292         443
DEFLATE, libdeflate -7           67,611,241        5076         452
LZFSE                            67,628,688        4708         464
DEFLATE, libdeflate -6           67,929,556        4073         450
DEFLATE, libz -7                 67,940,580       14471         940
DEFLATE, libz -6                 68,229,313       11941         942
Interestingly, LZFSE didn't really compress any better than a properly optimized DEFLATE implementation (libdeflate), but it certainly was much faster than libz, the implementation almost everyone uses. LZFSE also has a larger sliding window than DEFLATE (262144 vs. 32768 bytes), and I believe a significantly stronger LZFSE encoder than the one currently available would be possible.
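For anyone who wants to reproduce numbers of this shape, here is a minimal sketch of how the three columns (Csize, Ctime, Dtime) could be measured. It is not the harness used for the table above: it assumes Python's standard `zlib` module as a stand-in for the libz rows, the `bench` function and its `chunk_size` parameter are hypothetical names for illustration, and the sample input is synthetic rather than the Silesia corpus. Note also that `zlib.compress` emits the zlib wrapper around the DEFLATE stream, so sizes will differ slightly from raw DEFLATE.

```python
import time
import zlib


def bench(data: bytes, level: int, chunk_size: int = 0):
    """Return (compressed size in bytes, compress ms, decompress ms).

    chunk_size > 0 compresses the input in independent chunks of that
    many bytes; chunk_size <= 0 compresses it as a single buffer.
    """
    if chunk_size > 0:
        chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    else:
        chunks = [data]

    t0 = time.perf_counter()
    packed = [zlib.compress(c, level) for c in chunks]
    t1 = time.perf_counter()
    unpacked = [zlib.decompress(p) for p in packed]
    t2 = time.perf_counter()

    # Verify the round trip so a broken run cannot report good numbers.
    assert b"".join(unpacked) == data

    csize = sum(len(p) for p in packed)
    return csize, (t1 - t0) * 1000.0, (t2 - t1) * 1000.0


if __name__ == "__main__":
    # Synthetic input; substitute the real corpus contents here.
    sample = b"the quick brown fox jumps over the lazy dog\n" * 20000
    for level in (6, 7):
        csize, ctime, dtime = bench(sample, level)
        print(f"zlib -{level}: {csize} bytes, {ctime:.1f} ms, {dtime:.1f} ms")
```

Timings from a single run like this are noisy; repeating each measurement several times and taking the minimum or median gives more stable figures.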
I am going to try this benchmark on ARM next.
Edit: recompiled LZFSE with clang; this improved decompression speed by about 10% over gcc.