@cade
Here are some ideas from Shelwien related to LZMA.
Anyway, imho there're still many ways to improve lzma (compression-wise).
1. lzma's parsing optimization is relatively good, but still far from perfect -
only certain patterns, like match-literal-rep_match, are explicitly checked by the optimizer,
while others may be missed.
I think lzma really needs a matchfinder based on fuzzy matching - and explicit
price estimation for such near-matches, instead of treating rep-matches as something independent.
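To make that concrete, here's a toy sketch of pricing the fuzzy-match path explicitly - the helper names and bit costs are made up, since the real lzma prices are state- and position-dependent range coder estimates:
Code:
#include <stdint.h>
#include <stdio.h>

/* Toy price model in whole bits; real lzma prices depend on
   coder state and position within the stream. */
static uint32_t lit_price(uint8_t sym)              { (void)sym; return 8; }
static uint32_t match_price(uint32_t d, uint32_t l) { (void)d; (void)l; return 30; }
static uint32_t rep_price(int slot, uint32_t l)     { (void)slot; (void)l; return 12; }

int main(void)
{
    /* A 7-byte near-match with one mismatching byte in the middle:
       a fuzzy matchfinder would report it as a whole, letting the
       optimizer price both encodings side by side. */
    uint32_t rep_path  = match_price(4096, 4)   /* match, len 4 */
                       + lit_price('x')         /* the mismatch */
                       + rep_price(0, 2);       /* rep0, len 2  */
    uint32_t full_path = match_price(4096, 4)
                       + lit_price('x')
                       + match_price(4096, 2);  /* distance coded again */
    printf("rep path: %u bits, full-distance path: %u bits\n",
           rep_path, full_path);
    return 0;
}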
2. lzma already has 7 token types. Why not add some more useful ones? (a possible representation is sketched after this list)
- delta matches, like in bsdiff
- dictionary matches, where the dictionary is not part of the common window
- support for multiple data windows (we can have different types of preprocessing in these)
- multi-dimensional matches (for 2D+ and structured data)
- LZ78 matches (by index (potentially multiple sets), implicit length)
- LZP matches (order choice, explicit length)
- ROLZ matches (contextual distance, explicit length)
- PPM matches (no distance, unary coding of length, context tracking)
- long-range matches (like from a rep preprocessor)
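As a sketch, such an extended token set could be represented like this - the names and layout are purely illustrative, not anything from the lzma source:
Code:
#include <stdint.h>

/* Hypothetical extended token set; names are illustrative only. */
typedef enum {
    TOK_LITERAL,
    TOK_MATCH,        /* classic distance/length pair            */
    TOK_REP_MATCH,    /* one of the recent-distance slots        */
    TOK_DELTA_MATCH,  /* bsdiff-style: copy plus byte-wise delta */
    TOK_DICT_MATCH,   /* into a dictionary outside the window    */
    TOK_MULTIDIM,     /* row/column offsets for 2D+ data         */
    TOK_LZ78_MATCH,   /* phrase index, implicit length           */
    TOK_LZP_MATCH,    /* order choice, explicit length           */
    TOK_ROLZ_MATCH,   /* contextual distance, explicit length    */
    TOK_PPM_MATCH,    /* no distance, unary-coded length         */
    TOK_LONG_RANGE    /* rep-preprocessor style far match        */
} TokenType;

typedef struct {
    TokenType type;
    uint8_t   window;  /* which data window the copy comes from */
    uint32_t  dist;    /* distance / index / order, per type    */
    uint32_t  len;     /* unused for implicit-length tokens     */
} Token;

Each new type would of course also need its own coder contexts and price functions in the optimizer.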
3. lzma2 added block support, but there's no proper optimization for it.
For example, it's possible to change the lc/lp/pb options per block, but no compressor
makes use of this (see the sketch below).
More ideas could be taken from zstd (RLE blocks, separation of literals/matches).
Some more can be added after that (rep/MTF/PPM/BWT/huffman blocks, a flag for not adding the block to the window).
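A brute-force per-block property search could look like this sketch; coded_size is a fake stand-in for actually running the lzma2 encoder on the block. The raw prop byte would allow lc up to 8, but liblzma's lc+lp <= 4 limit caps it:
Code:
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Fake stand-in so the sketch runs; a real version would encode
   the block with these properties and return the coded size. */
static size_t coded_size(const uint8_t *blk, size_t n,
                         int lc, int lp, int pb)
{
    (void)blk;
    return n + (size_t)((lc * 31 + lp * 17 + pb * 7) % 13);
}

static void best_props(const uint8_t *blk, size_t n,
                       int *lc, int *lp, int *pb)
{
    size_t best = (size_t)-1;
    for (int c = 0; c <= 4; c++)                /* lc; prop byte allows up to 8 */
        for (int p = 0; c + p <= 4; p++)        /* lp; liblzma: lc + lp <= 4    */
            for (int b = 0; b <= 4; b++) {      /* pb */
                size_t s = coded_size(blk, n, c, p, b);
                if (s < best) { best = s; *lc = c; *lp = p; *pb = b; }
            }
}

int main(void)
{
    uint8_t block[4096] = {0};
    int lc, lp, pb;
    best_props(block, sizeof block, &lc, &lp, &pb);
    printf("best lc=%d lp=%d pb=%d\n", lc, lp, pb);
    return 0;
}

For small blocks an exhaustive search like this is cheap next to the encoding itself; a smarter version could seed the search from the previous block's winner.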
4. Secondary LZ. I mainly noticed it when using bsdiff+lzma in an app update system,
but in some kinds of data there're visible patterns in literals/matches after the first LZ pass.
(Bsdiff used delta-matches, so there were secondary matches within literal blocks too.)
For example, page tags in book1 are encoded by lzma as <match:len=4><literal><rep_match:len=2>,
and only the literal changes there.
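A secondary pass could detect that kind of structure by scanning the first-pass token stream for repeating token shapes where only a literal's payload differs. The Tok layout below is invented for illustration:
Code:
#include <stdint.h>
#include <stdio.h>

/* Simplified first-pass output: a literal carries its byte,
   a match carries (dist,len); rep-matches are modelled here as
   matches that happen to reuse the same distance. */
typedef struct { int is_match; uint32_t dist, len; uint8_t lit; } Tok;

/* Same shape = same token type and, for matches, same dist/len;
   literal payloads are allowed to differ. */
static int same_shape(const Tok *a, const Tok *b)
{
    if (a->is_match != b->is_match) return 0;
    if (a->is_match) return a->dist == b->dist && a->len == b->len;
    return 1;
}

/* Count how often the leading 3-token pattern repeats; a secondary
   LZ could replace each repetition with one cheap token plus the
   varying literal. */
static int count_repeats(const Tok *t, int n)
{
    int reps = 0;
    for (int i = 3; i + 2 < n; i += 3) {
        if (!same_shape(&t[i],     &t[0]) ||
            !same_shape(&t[i + 1], &t[1]) ||
            !same_shape(&t[i + 2], &t[2])) break;
        reps++;
    }
    return reps;
}

int main(void)
{
    /* book1-like stream: <match:len=4><literal><rep_match:len=2> */
    Tok t[] = {
        {1, 100, 4, 0}, {0, 0, 0, '1'}, {1, 100, 2, 0},
        {1, 100, 4, 0}, {0, 0, 0, '2'}, {1, 100, 2, 0},
        {1, 100, 4, 0}, {0, 0, 0, '3'}, {1, 100, 2, 0},
    };
    printf("pattern repeats %d more times\n",
           count_repeats(t, (int)(sizeof t / sizeof t[0])));
    return 0;
}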
btw, I'm just interested - when will optimal parsing be available?