I just implemented an idea I've had for some time now:
http://www.quicklz.com/quicklz_experimental.c
http://www.quicklz.com/quicklz_experimental.exe
It shouldn't be included on public benchmark websites yet, because this is really just a dirty hack/experiment and not an official release at all.
Description:
Normal sliding-window LZ compression requires 3 memory reads per "round": 1) fetch some input data, 2) look up an offset in the hash table, 3) read past input data at that offset. The addresses of accesses 2 and 3 are often distant from each other and thus slow, even when both are in L1 cache.
This compressor saves the first 4 bytes of the fetched input in the hash table together with the offset (consecutive data access), so that it can quickly be determined whether we have a) no match, b) an exactly 3-byte LZ match, or c) a match longer than 3 bytes. In cases a and b, which usually cover more than 50% of the rounds, we avoid the 3rd memory read.
Benchmarks on the same files and same Celeron as on www.quicklz.com:
QuickLZ 1.30 beta 6:
pic.bmp Compressed 18108198 bytes into 16333503 bytes (90.2%) at 78.5 Mbyte/s.
WINWORD.EXE Compressed 10577312 bytes into 7260047 bytes (68.6%) at 94.2 Mbyte/s.
proteins.txt Compressed 7254685 bytes into 2602253 bytes (35.9%) at 144.4 Mbyte/s.
This experimental version:
pic.bmp Compressed 18108198 bytes into 16333503 bytes (90.2%) at 83.6 Mbyte/s.
WINWORD.EXE Compressed 10577312 bytes into 7260047 bytes (68.6%) at 101.2 Mbyte/s.
proteins.txt Compressed 7254685 bytes into 2602253 bytes (35.9%) at 149.1 Mbyte/s.
EXPERIMENTAL CODE, DO NOT USE IN PRODUCTION