Again, this is not about the archive format, and obviously there'd be dedup, MT, etc. on top.
The question is which class of compression algorithms to use as the base for development.
Compression-wise, my PLZMA almost fits, but encoding with parsing optimization is too slow.
Fast CM (like MCM or nzcc) fits by CR and, potentially, by encoding speed
(speed can be significantly improved with context sorting and out-of-order probability evaluation, but that only helps encoding),
while there's no comparable solution for decoding speed.
And BWT fits by both enc and dec speed, and even by CR on text, but BWT's CR on binaries is relatively bad.
Plus there are preprocessors and hybrid options - plenty of choices, which is exactly the problem.
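For reference, the BWT round trip can be sketched as below. This is a naive O(n^2) toy version just to show the transform itself; a real codec would build the suffix array in O(n log n) or O(n) and invert with a single LF-mapping pass, which is why BWT decoding is fast in practice.

```python
# Naive BWT round trip (illustration only, not the fast construction).

def bwt(s: bytes) -> bytes:
    # Append a unique sentinel (assumes the input contains no zero byte).
    s += b"\x00"
    n = len(s)
    # Sort all rotations; the last column of the sorted matrix is the BWT.
    rot = sorted(range(n), key=lambda i: s[i:] + s[:i])
    return bytes(s[(i - 1) % n] for i in rot)

def ibwt(L: bytes) -> bytes:
    # Rebuild the sorted rotation matrix one column per pass.
    n = len(L)
    table = [b""] * n
    for _ in range(n):
        table = sorted(L[i:i + 1] + table[i] for i in range(n))
    # The row ending with the sentinel is the original string.
    for row in table:
        if row.endswith(b"\x00"):
            return row[:-1]
    raise ValueError("no sentinel found")

print(bwt(b"banana"))  # b'annb\x00aa' - grouped symbols help a later MTF/RLE/CM stage
assert ibwt(bwt(b"banana")) == b"banana"
```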
Code:
1,048,576 corpus_VDI_pcf_x3.1M
249,592 corpus_VDI_pcf_x3.1M.lzma 1048576/249592 = 4.20 (276366/249592-1)*100 = 10.73%
243,743 corpus_VDI_pcf_x3.1M.plzma_c1 1048576/243743 = 4.30 (276366/243743-1)*100 = 13.38%
248,687 corpus_VDI_pcf_x3.1M.rz 1048576/248687 = 4.22 (276366/248687-1)*100 = 11.13%
276,366 corpus_VDI_pcf_x3.1M.zst 1048576/276366 = 3.79
276,403 corpus_VDI_pcf_x3.1M.lzma_a0 // lzma -a0 -d20 -fb8 -mc4 -lc0 -lp0
533,864 corpus_VDI_pcf_x3.1M.lz4-1
369,616 corpus_VDI_pcf_x3.1M.lz4-1.c7lit_c2
443,586 corpus_VDI_pcf_x3.1M.lz4-12
355,800 corpus_VDI_pcf_x3.1M.lz4-12.c7lit_c2
707,961 corpus_VDI_pcf_x3.1M.LZP-DS
236,180 corpus_VDI_pcf_x3.1M.LZP-DS.c7lit_c2
391,962 corpus_VDI_pcf_x3.1M.lzpre
306,616 corpus_VDI_pcf_x3.1M.lzpre.c7lit_c2
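The ratio columns above are plain size arithmetic; a quick sketch to reproduce them (sizes taken from the table, .zst as the baseline):

```python
# Recompute the ratio columns of the table above: compression ratio is
# original_size / compressed_size, and the last column is the gain over
# the zstd baseline, (zst_size / compressed_size - 1) * 100.
ORIG = 1_048_576                     # corpus_VDI_pcf_x3.1M
SIZES = {                            # compressed sizes from the table
    "lzma":     249_592,
    "plzma_c1": 243_743,
    "rz":       248_687,
    "zst":      276_366,
}
BASE = SIZES["zst"]
for name, size in SIZES.items():
    ratio = ORIG / size
    gain = (BASE / size - 1) * 100
    print(f"{name:9s} {ratio:5.2f} {gain:+6.2f}%")
```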