Here's a benchmark from within my framework, run with 12 threads compressing and 4 threads decompressing (chosen to give the best timings).
struct_mixed1 (541,837,056 bytes)
Code:
Selkie ( 1) -T 4 --> size 541,837,056 - compress time 0.5159 s (1002 MB/s)- decompress time 0.2052 s (2518 MB/s) (compress sz 116,957,152) (ratio 21.6%) total cost = 0.7211
LZ4 ( 1) -T 4 --> size 541,837,056 - compress time 0.2496 s (2070 MB/s)- decompress time 0.2078 s (2487 MB/s) (compress sz 114,785,038) (ratio 21.2%) total cost = 0.4574
Lizard (21) -T 4 --> size 541,837,056 - compress time 0.3461 s (1493 MB/s)- decompress time 0.2178 s (2373 MB/s) (compress sz 86,624,627) (ratio 16.0%) total cost = 0.5639
srle64 ( 1) -T 4 --> size 541,837,056 - compress time 0.3085 s (1675 MB/s)- decompress time 0.1841 s (2806 MB/s) (compress sz 238,846,956) (ratio 44.1%) total cost = 0.4926
memcpy ( 1) -T 4 --> size 541,837,056 - compress time 0.6199 s ( 834 MB/s)- decompress time 0.1781 s (2901 MB/s) (compress sz 541,837,382) (ratio 100.0%) total cost = 0.7980
TurboRLE is very interesting. For decompression it's basically at memcpy speed, but compression comes out slower than LZ4 for me; presumably that's because the final malloc + memcpy is twice the size of LZ4's final step, due to the worse compression ratio.
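To make that concrete, here's a simplified sketch of the kind of final step I mean (illustrative names, LZ4's standard API shown just for concreteness, not my actual framework code):
Code:
// Compress into a worst-case scratch buffer, then shrink-to-fit with a
// final malloc + memcpy. That last copy scales with the *compressed* size,
// so srle64's ~239 MB output pays roughly 2x what LZ4's ~115 MB does here.
#include <lz4.h>
#include <cstdlib>
#include <cstring>

char* compress_block(const char* src, int srcSize, int* outSize) {
    int bound = LZ4_compressBound(srcSize);
    char* scratch = (char*)malloc(bound);
    int csize = LZ4_compress_default(src, scratch, srcSize, bound);
    if (csize <= 0) { free(scratch); return nullptr; }
    char* out = (char*)malloc(csize);   // exact-size final buffer
    memcpy(out, scratch, csize);        // cost proportional to csize
    free(scratch);
    *outSize = csize;
    return out;
}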
I was hoping to find something that beat LZ4 on round-trip time, with a total cost somewhere around 0.2 seconds; then it might be feasible to rewrite my matrix class to ALWAYS compress data under the hood (a similar idea to Blosc in numpy), but at the moment it slows things down too much. A rough sketch of what I mean is below.
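Hypothetical sketch only (made-up class, LZ4 for concreteness; Blosc additionally chunks and byte-shuffles, and real code would split large matrices into blocks rather than assume sizes fit in an int):
Code:
// Always-compressed matrix: store only the packed bytes, decompress on
// every access. The round trip has to approach memcpy speed for this to
// be a net win, hence the ~0.2 s total-cost target above.
#include <lz4.h>
#include <cstring>
#include <stdexcept>
#include <vector>

class CompressedMatrix {
    std::vector<char> packed_;  // compressed payload
    size_t rows_, cols_;
    int rawBytes_;              // uncompressed size in bytes
public:
    CompressedMatrix(const double* data, size_t rows, size_t cols)
        : rows_(rows), cols_(cols),
          rawBytes_((int)(rows * cols * sizeof(double))) {
        packed_.resize(LZ4_compressBound(rawBytes_));
        int csize = LZ4_compress_default((const char*)data, packed_.data(),
                                         rawBytes_, (int)packed_.size());
        if (csize <= 0) throw std::runtime_error("compress failed");
        packed_.resize(csize);          // shrink to compressed size
        packed_.shrink_to_fit();
    }
    std::vector<double> unpack() const {   // every read pays decompression
        std::vector<double> out(rows_ * cols_);
        int n = LZ4_decompress_safe(packed_.data(), (char*)out.data(),
                                    (int)packed_.size(), rawBytes_);
        if (n != rawBytes_) throw std::runtime_error("decompress failed");
        return out;
    }
};

The catch is visible right in unpack(): every read pays a full decompress plus the copy into a fresh buffer, so unless that round trip gets close to plain memcpy, the class just adds latency.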