I've published the first version of the Compression-Research suite
I consider this repository the home for my compression-related experiments. Right now it includes an almost complete stack of BWT-compression-related algorithms, with two applications to benchmark them. There are no real compressors here, nor are any planned for future versions.
So, the contributions of this release:
Two new LZP compression routines
My first algorithm, lzp_cpu_bsc_mod, is just a small low-level optimization of BSC LZP, implementing a multiplicative hash and unaligned memory access. Depending on the compiler, it's 3-25% faster than the original.
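The two tricks can be sketched like this (a minimal illustration only; the multiplier constant and function names are mine, not the actual BSC code):

```cpp
#include <cstdint>
#include <cstring>

// Read 4 bytes at an arbitrary (possibly unaligned) address.
// The memcpy idiom is well-defined C++ and compiles to a single
// unaligned load on x86.
static inline uint32_t load32(const uint8_t* p) {
    uint32_t v;
    std::memcpy(&v, p, sizeof(v));
    return v;
}

// Multiplicative hash of the next 4 bytes, reduced to hashBits bits.
// 2654435761 is the classic 32-bit Fibonacci-hashing multiplier;
// the real implementation may use a different constant.
static inline uint32_t hash_pos(const uint8_t* p, unsigned hashBits) {
    return (load32(p) * 2654435761u) >> (32 - hashBits);
}
```

A multiplicative hash replaces the shift-and-xor mixing of the original with a single multiplication, and the unaligned load replaces four byte reads, which is where the speedup comes from.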
LZP preprocessing for BWT/ST is usually called with a minimum match length (minLen) of 32..128. Any LZP implementation aiming for high minLen values suffers from the "false positive" problem: match candidates shorter than minLen bytes arrive in runs, since the initial check is performed on only 4 hashed bytes. The BSC LZP implementation only partially solves this problem by first comparing the bytes at offset minLen, as well as by employing other heuristics.
My second algorithm, lzp_cpu_rollhash, completely solves the "false positive" problem by storing in each hash-table entry, besides a pointer, a hash of the minLen bytes it points to. This allows us to check, with very high confidence, whether we may have a match. The current implementation stores a 32-bit hash, making false positives almost impossible (probability = 2^-32 ≈ 0.000000023%). Since successful matches are pretty rare (e.g., enwik9 processed by LZP with the -h15 -l32 options has about 1 match per 400 input bytes), the innermost loop runs almost without mispredicted branches, making lzp_cpu_rollhash a new member of the growing family of branch-less compression algorithms.
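The entry layout and check might look like this (a sketch under my own naming; the actual lzp_cpu_rollhash structures may differ):

```cpp
#include <cstdint>

// Hypothetical hash-table slot: the position of an earlier occurrence
// of the current 4-byte context, plus a 32-bit hash of the minLen
// bytes starting there.
struct LzpSlot {
    uint32_t pos;    // earlier occurrence of the current context
    uint32_t check;  // hash of the minLen bytes at pos
};

// A byte-by-byte match attempt is started only when the stored hash
// equals the hash of the next minLen input bytes, so this branch is
// taken spuriously only ~once per 2^32 non-matching positions.
static inline bool candidate(const LzpSlot& s, uint32_t curHash) {
    return s.check == curHash;
}

// On a hash mismatch the slot is simply refreshed with the current
// position and hash; no data bytes are compared at all.
static inline void refresh(LzpSlot& s, uint32_t pos, uint32_t curHash) {
    s.pos = pos;
    s.check = curHash;
}
```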
Even storing an 8-bit hash would make the cost of handling false positives negligible, while also allowing the number of hash-table entries to grow (while still fitting in L2 cache) from 2^15 to 2^17, improving the overall compression ratio with only a tiny speed loss.
And of course, hashing 32..128 bytes (= minLen) on every loop step would be prohibitively slow, so we run a rolling hash over the next minLen bytes; hence the algorithm's name.
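A rolling hash updates the window hash in O(1) per step instead of rehashing minLen bytes. The sketch below uses a polynomial hash in uint32_t arithmetic (i.e. mod 2^32); the base constant and names are illustrative, not the project's actual code:

```cpp
#include <cstdint>
#include <cstddef>

static const uint32_t BASE = 0x9E3779B1u;  // illustrative odd multiplier

// Hash of buf[i .. i+minLen), computed from scratch: O(minLen).
static uint32_t hash_window(const uint8_t* buf, size_t i, size_t minLen) {
    uint32_t h = 0;
    for (size_t k = 0; k < minLen; ++k)
        h = h * BASE + buf[i + k];
    return h;
}

// BASE^(minLen-1) mod 2^32, precomputed once per run.
static uint32_t base_pow(size_t minLen) {
    uint32_t p = 1;
    for (size_t k = 1; k < minLen; ++k) p *= BASE;
    return p;
}

// Slide the window one byte right in O(1): remove the outgoing byte's
// contribution, shift everything by one power of BASE, add the new byte.
static inline uint32_t roll(uint32_t h, uint8_t out, uint8_t in, uint32_t pow) {
    return (h - out * pow) * BASE + in;
}
```

After each step, `roll` yields exactly the same value as recomputing `hash_window` over the shifted window, which is what makes the stored 32-bit check hash cheap to maintain.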
The speed of all three LZP routines depends somewhat on the compiler:
Code:
M:\CR>bslab-icl-x64.exe enwik9 -nobwt -norle -nomtf -lzp1-3
[ 1] lzp_cpu_bsc : 1,000,000,000 => 855,183,966 (85.52%) 352 MiB/s, 2710.463 ms
[ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%) 361 MiB/s, 2639.064 ms
[ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%) 488 MiB/s, 1955.604 ms
M:\CR>bslab-clang-x64.exe enwik9 -nobwt -norle -nomtf -lzp1-3
[ 1] lzp_cpu_bsc : 1,000,000,000 => 855,183,966 (85.52%) 314 MiB/s, 3033.328 ms
[ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%) 324 MiB/s, 2943.981 ms
[ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%) 438 MiB/s, 2177.476 ms
M:\CR>bslab-gcc-x64.exe enwik9 -nobwt -norle -nomtf -lzp1-3
[ 1] lzp_cpu_bsc : 1,000,000,000 => 855,183,966 (85.52%) 352 MiB/s, 2713.003 ms
[ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%) 369 MiB/s, 2584.420 ms
[ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%) 421 MiB/s, 2266.292 ms
M:\CR>bslab-msvc-x64.exe enwik9 -nobwt -norle -nomtf -lzp1-3
[ 1] lzp_cpu_bsc : 1,000,000,000 => 855,183,966 (85.52%) 348 MiB/s, 2736.850 ms
[ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%) 369 MiB/s, 2581.483 ms
[ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%) 434 MiB/s, 2197.373 ms
The current lzp_cpu_rollhash implementation employs 64-bit computations, so it's suboptimal on 32-bit platforms (but this will be easy to fix):
Code:
M:\CR>bslab-icl.exe enwik9 -nobwt -norle -nomtf -lzp1-3
[ 1] lzp_cpu_bsc : 1,000,000,000 => 855,183,966 (85.52%) 315 MiB/s, 3026.841 ms
[ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%) 334 MiB/s, 2856.902 ms
[ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%) 359 MiB/s, 2654.012 ms
M:\CR>bslab-clang.exe enwik9 -nobwt -norle -nomtf -lzp1-3
[ 1] lzp_cpu_bsc : 1,000,000,000 => 855,183,966 (85.52%) 252 MiB/s, 3782.299 ms
[ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%) 315 MiB/s, 3026.308 ms
[ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%) 328 MiB/s, 2905.698 ms
M:\CR>bslab-gcc.exe enwik9 -nobwt -norle -nomtf -lzp1-3
[ 1] lzp_cpu_bsc : 1,000,000,000 => 855,183,966 (85.52%) 327 MiB/s, 2918.068 ms
[ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%) 343 MiB/s, 2777.789 ms
[ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%) 267 MiB/s, 3566.095 ms
M:\CR>bslab-msvc.exe enwik9 -nobwt -norle -nomtf -lzp1-3
[ 1] lzp_cpu_bsc : 1,000,000,000 => 855,183,966 (85.52%) 288 MiB/s, 3308.380 ms
[ 2] lzp_cpu_bsc_mod : 1,000,000,000 => 855,168,159 (85.52%) 315 MiB/s, 3023.044 ms
[ 3] lzp_cpu_rollhash: 1,000,000,000 => 855,369,315 (85.54%) 307 MiB/s, 3111.218 ms
You can change the LZP parameters using the -b, -l, and -h options. Add "-nobwt -norle -nomtf" to run only the LZP compression part. The fourth LZP algo, "lzp_cpu_rollhash (OpenMP)", just splits the input data into 8 MB chunks and runs multiple algorithm instances on multiple cores simultaneously.
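The chunking scheme can be sketched as follows (my own reconstruction, not the actual BSLab code; `process_chunk` stands in for the per-chunk LZP routine, and real use needs thread-safe output handling):

```cpp
#include <cstdint>
#include <cstddef>

// Cut the input into fixed-size pieces and compress each one
// independently; chunks share no match history, so they can run on
// different cores. With OpenMP disabled the pragma is ignored and the
// loop simply runs sequentially.
template <typename F>
void for_each_chunk(const uint8_t* buf, size_t size, size_t chunkSize,
                    F process_chunk) {
    const ptrdiff_t nChunks = (ptrdiff_t)((size + chunkSize - 1) / chunkSize);
    #pragma omp parallel for schedule(dynamic)
    for (ptrdiff_t c = 0; c < nChunks; ++c) {
        size_t begin = (size_t)c * chunkSize;
        size_t len = size - begin < chunkSize ? size - begin : chunkSize;
        process_chunk(begin, buf + begin, len);  // independent instance
    }
}
```

Independent chunks cost a little compression ratio (matches can't cross an 8 MB boundary) in exchange for near-linear scaling across cores.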
The compilers tested:
- Intel C++ Compiler 16.0
- Clang 3.8
- GCC 5.3
- Microsoft Visual C++ 2015 Update 1
Compression-Research-v1.zip includes BSLab executables compiled by all of them, for every combination of x86/x64 and SSE2/AVX2 targets, i.e. 16 executables overall. All speed measurements provided here were performed on a single core of a Haswell i7-4770.
You can find more benchmark results here (boost/enwik8/enwik9 are text files, 100m/1g are executables, and 1g.tor3 is incompressible), and download the test files (742 MB).
I will continue describing the contributions of this release in upcoming posts...