Hi all – How much would it help LZ-Huffman style compression to know in advance what kind of data you're compressing? I don't think I've seen this discussed. What I mean is knowing that your input is always JSON, for example – or, more specifically, minified JSON, where the minification follows known rules.
Or minified HTML, or really anything where you know in advance that the input data has certain characteristics. For example, you might know the longest possible match length.
I've been thinking mostly about how best to compress very small JSON payloads, like the ones used for credit card payments and other financial messages. In raw form they might be 2 KB, and about 1 KB when minified. It's an interesting case because there are no long matches – mostly just 3-9 bytes, if no keys or values repeat within a given payload.
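To make that concrete, here's a rough sketch of the kind of thing I have in mind, in C with zlib (that's just an assumption about the stack, and the key list and payload below are invented for illustration): prime deflate with a preset dictionary of the JSON keys you already know will appear, so the compressor can find matches even though nothing repeats within a 1 KB payload.

    /* Sketch: prime deflate with a preset dictionary of expected JSON keys.
     * zlib finds matches nearest the end of the window first, so put the
     * most likely strings at the END of the dictionary. Keys and payload
     * here are made up for illustration. */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void) {
        static const char dict[] =
            "\"currency\":\"amount\":\"expiry\":\"card_number\":\"merchant_id\":";
        static const char payload[] =
            "{\"merchant_id\":\"M123\",\"card_number\":\"4111111111111111\","
            "\"expiry\":\"12/29\",\"amount\":\"19.99\",\"currency\":\"USD\"}";

        z_stream zs;
        memset(&zs, 0, sizeof(zs));
        if (deflateInit(&zs, Z_BEST_COMPRESSION) != Z_OK)
            return 1;
        deflateSetDictionary(&zs, (const Bytef *)dict, (uInt)(sizeof(dict) - 1));

        unsigned char out[512];
        zs.next_in   = (Bytef *)payload;
        zs.avail_in  = (uInt)(sizeof(payload) - 1);
        zs.next_out  = out;
        zs.avail_out = sizeof(out);
        deflate(&zs, Z_FINISH);    /* input is tiny, one call is enough */

        printf("raw %zu bytes -> deflated %lu bytes (with preset dictionary)\n",
               sizeof(payload) - 1, zs.total_out);
        deflateEnd(&zs);
        return 0;
    }

The decompressor has to supply the exact same dictionary via inflateSetDictionary when inflate reports Z_NEED_DICT, so both sides need to agree on it out of band.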
If you didn't look for matches longer than, say, 16 bytes, could you fly through the data faster? Deflate's maximum match length is 258, for reference.
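As a sketch of the sort of tuning I'm imagining (again assuming zlib – the numbers below are guesses, not measured values), deflateTune lets you tell the matcher to give up early on long matches, even though the format's 258-byte limit itself doesn't change:

    /* Sketch: cut match-search effort when only short matches are expected.
     * deflateTune() does NOT change the 258-byte match limit of the format;
     * it only tells the matcher when to stop looking. Values are guesses. */
    #include <string.h>
    #include <zlib.h>

    int tuned_deflate_init(z_stream *zs)
    {
        memset(zs, 0, sizeof(*zs));
        int rc = deflateInit(zs, Z_DEFAULT_COMPRESSION);
        if (rc != Z_OK)
            return rc;
        /* good_length = 4:  shorten chain searches once a 4-byte match exists
         * max_lazy    = 8:  stop lazy matching beyond 8 bytes
         * nice_length = 16: treat a 16-byte match as "good enough" and stop
         * max_chain   = 64: cap hash-chain traversal */
        return deflateTune(zs, 4, 8, 16, 64);
    }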
What other characteristics or constraints on the input data could help optimize a compressor? I've been incredibly impressed with SLZ lately, and I wonder whether it could be optimized further by knowing in advance what kind of data it was compressing (like small JSON payloads), and whether the ratios could be improved with little overhead penalty. Brotli does very well on these JSON payloads – the best so far – and I wonder if something like brotli at quality 11 could be trimmed to have less overhead if it only ever sees small JSON payloads.
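For reference, this is roughly how I've been invoking brotli on a single payload via its one-shot C API (the payload contents are made up), at quality 11 with the smallest window brotli allows, since the inputs are only about 1 KB:

    /* Sketch: one small JSON payload through brotli's one-shot C API at
     * quality 11 with the smallest permitted window (2^10 = 1 KB).
     * The payload is made up for illustration. */
    #include <stdint.h>
    #include <stdio.h>
    #include <brotli/encode.h>

    int main(void) {
        static const char payload[] =
            "{\"merchant_id\":\"M123\",\"amount\":\"19.99\",\"currency\":\"USD\"}";
        const size_t in_len = sizeof(payload) - 1;

        uint8_t out[1024];
        size_t out_len = sizeof(out);   /* in: capacity, out: actual size */
        if (!BrotliEncoderCompress(BROTLI_MAX_QUALITY,      /* quality 11 */
                                   BROTLI_MIN_WINDOW_BITS,  /* lgwin = 10 */
                                   BROTLI_MODE_TEXT,        /* text heuristics */
                                   in_len, (const uint8_t *)payload,
                                   &out_len, out)) {
            fprintf(stderr, "compress failed\n");
            return 1;
        }
        printf("%zu -> %zu bytes\n", in_len, out_len);
        return 0;
    }

Part of why brotli does so well here is presumably its built-in static dictionary, which already covers a lot of common text and JSON-ish strings.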
Charles Bloom has an interesting article on LZ optimal parsing here.