Hi all – Has anyone worked on using intermediate representations or formats for compression? I'm thinking mostly of text compression of the sort that's needed on the web, which is currently handled by gzip or, much more expensively, by brotli.
Do you think text could be represented in an intermediate form that would make a tailored gzipper significantly faster and/or lighter on CPU? Another way of saying this is: what's easier than what gzip/DEFLATE currently does to a text file? DEFLATE looks pretty light compared to most codecs, but I wouldn't assume it's the most efficient thing we could be doing. (I realize we could think of LZ77 as the IR for DEFLATE, whose output is then Huffman-coded.)
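For a baseline on what "lighter on CPU" trades against, here's a quick way to see DEFLATE's speed/ratio curve for yourself. A minimal sketch, assuming a local sample.html to test with (the filename is just a placeholder):

```python
# Sketch: DEFLATE speed vs. ratio at different zlib levels.
# "sample.html" is a placeholder for any local text file.
import time
import zlib

data = open("sample.html", "rb").read()

for level in (1, 6, 9):  # fast, default, best
    start = time.perf_counter()
    out = zlib.compress(data, level)
    ms = (time.perf_counter() - start) * 1000
    print(f"level {level}: {len(data)} -> {len(out)} bytes in {ms:.2f} ms")
```

Any IR that claims to help would have to beat that curve, either by shrinking the input before DEFLATE sees it or by letting the match-finder do less work.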
One thing that has struck me is that our current compressors are completely naïve about what they're compressing. zlib knows nothing about HTML, CSS, JS, or SVG. This is absurd, but it's been the status quo for 20 years. It knows nothing about the file it's about to compress: which strings to expect, which to look for, which to ignore (e.g. DOCTYPE). So, for example, one light intermediate representation beyond plain HTML would be HTML files that come with metadata to help the compressor: the longest repeated string, small strings worth counting, small repeated strings best ignored, etc.
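zlib does already have one narrow hook for exactly this kind of side knowledge: a preset dictionary shared by compressor and decompressor. A minimal sketch; the dictionary contents below are my untuned guesses at common HTML substrings, where a real one would be derived from a corpus:

```python
# Sketch: teach DEFLATE a little HTML via zlib's preset dictionary.
# HTML_DICT is an untuned, illustrative guess; the decompressor must
# be given the exact same bytes.
import zlib

HTML_DICT = (b'<!DOCTYPE html><html><head><meta charset="utf-8">'
             b'<title></title></head><body><div class="</div>'
             b'<span></span><a href="</a><script src="</script>')

html = open("sample.html", "rb").read()  # placeholder input

comp = zlib.compressobj(level=9, zdict=HTML_DICT)
packed = comp.compress(html) + comp.flush()

decomp = zlib.decompressobj(zdict=HTML_DICT)
assert decomp.decompress(packed) == html  # round-trips

plain = zlib.compress(html, 9)
print(f"plain: {len(plain)} bytes, with dictionary: {len(packed)} bytes")
```

(Brotli goes further in this direction: it ships a built-in static dictionary full of English and HTML fragments, which is a big part of why it beats gzip on web text.)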
A more interesting intermediate representation would be a standardized minified format that allowed for new optimizations in a gzipper or brotli compressor. It might not have a huge payoff – it would take some work to find out – but if the gzipper could assume various things about the strings, the whitespace, the one possible line-ending character, etc., maybe some big wins would open up.
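As a strawman for what "standardized" might guarantee, here's a sketch of a normalizer whose output promises a compressor: one line-ending byte, no tabs, no runs of spaces. The rules are invented for illustration and would mangle <pre>, <textarea>, and whitespace-sensitive inline JS/CSS, so treat it as a shape, not a spec:

```python
# Sketch: normalize HTML so a compressor can assume whitespace invariants.
# Illustrative rules only -- not safe for <pre> or inline scripts.
import re

def normalize(html: str) -> str:
    html = html.replace("\r\n", "\n").replace("\r", "\n")  # one line ending
    html = html.replace("\t", " ")                         # no tabs
    html = re.sub(r" {2,}", " ", html)                     # collapse space runs
    html = re.sub(r">\s+<", ">\n<", html)                  # one break between tags
    return html

doc = '<div>\r\n\t<p>hello   world</p>\r\n</div>'
print(normalize(doc))  # <div>\n<p>hello world</p>\n</div>
```

The point isn't the minifier itself; it's that once the format guarantees the invariants, the compressor can hard-code them instead of rediscovering them in every file.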
The most interesting intermediate representation would be a binary format for text. This could be for all text, or tailored just for web files. Text is already in a binary format – all formats are, in truth, binary – but "text format" usually means an extremely inefficient binary representation, one that in the web's case includes ponderous English-language code words with syntax that varies just enough to weaken the compression. That representation is then converted to a slightly more efficient binary form (gzip), sent, and then strangely reconverted back to its extremely inefficient prior form before a machine parses it (just brilliant).
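To make that concrete, here's a toy of the "tailored for web files" flavor: swap a handful of frequent multi-byte tag tokens for single bytes from an unused range before DEFLATE ever sees the stream. The token table and byte assignments are invented for this sketch, not a proposed format:

```python
# Toy binary pre-encoding for HTML: frequent tag tokens become single
# bytes (0x01..), then DEFLATE runs on the result. Assumes the input
# contains no raw control bytes in that range. Invented table, not a spec.
import zlib

TOKENS = [b"<!DOCTYPE html>", b'<div class="', b"</div>",
          b'<span class="', b"</span>", b'<a href="', b"</a>"]
TABLE = {tok: bytes([0x01 + i]) for i, tok in enumerate(TOKENS)}

def encode(html: bytes) -> bytes:
    for tok, code in TABLE.items():
        html = html.replace(tok, code)
    return html

def decode(blob: bytes) -> bytes:
    for tok, code in TABLE.items():
        blob = blob.replace(code, tok)
    return blob

page = b'<!DOCTYPE html><div class="x"><a href="/">home</a></div>' * 50
binary = encode(page)
assert decode(binary) == page  # lossless round trip

print(f"gzip of text form:   {len(zlib.compress(page, 9))} bytes")
print(f"gzip of binary form: {len(zlib.compress(binary, 9))} bytes")
```

And in the binary-format world, the decompressor could decode straight into the parser's token stream, skipping the silly detour back through text.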