Google has now announced a new compression format, brotli, and it is getting attention on various programmer websites. It's in Chrome and is currently being added to Firefox.
I have heard about it all over the internet, but it's only been mentioned in passing on this forum, so I figured it deserved a thread of its own.
Brotli is roughly as fast as zlib’s Deflate implementation. At the same time, it compresses slightly more densely than LZMA and bzip2 on the Canterbury corpus. The higher data density is achieved by 2nd-order context modeling, re-use of entropy codes, a larger memory window of past data, and joint distribution codes.
My tests on LTCB: brotli made the Pareto frontier for decompression speed even on my slow computer, but you need to compress for a long time to achieve that. http://mattmahoney.net/dc/text.html#2414
I've compiled brotli using their makefiles (32-bit gcc 4.9.2, -O2 -static --large-address-aware). It may have an edge on fast extraction (although lzmh should be as fast as deflate too), but on binary data its maximum compression and its speed/compression ratio are nothing close to the 7-year-old tornado, not even talking about lzma, lzham and so on.
Testing bro.exe in Windows on the Silesia corpus (bro.exe -i file -o file.bro). Decompression says "corrupt input" on all files. Looking at a dump of the .bro files, I see that CR is always followed by LF.
Thanks, I modified the sources to open files as binary in the Windows way:
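Roughly like this; a minimal C sketch of the idea rather than the actual diff (the function names here are illustrative):

/* On Windows, stdio streams opened in text mode translate LF to CR/LF on
 * write, which corrupts compressed output. Opening files with the "b" flag
 * (or switching the std streams with _setmode) avoids the translation. */
#include <stdio.h>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif

/* Open a file in binary mode so no newline translation happens. */
static FILE *open_binary(const char *path, int for_writing) {
    return fopen(path, for_writing ? "wb" : "rb");
}

/* If stdin/stdout are used instead of named files, switch them too. */
static void make_std_streams_binary(void) {
#ifdef _WIN32
    _setmode(_fileno(stdin),  _O_BINARY);
    _setmode(_fileno(stdout), _O_BINARY);
#endif
}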
Windows and Linux versions of brotli compress to different sizes on most files in the Silesia corpus (using default -q 11), but decompress correctly either way. I posted both results to http://mattmahoney.net/dc/silesia.html
The use-case for Brotli is to replace deflate for streaming pre-compressed text over HTTP.
Lots of web servers, particularly the "application servers", store (often in memory) the deflated versions of text/* MIME types to serve from cache (unless a range request is made).
The way an application server (e.g. a webserver with server-side scripts to generate pages) works is typically shockingly suboptimal: they generate the response, then hash it (which they need to do for an ETag anyway), then compare that hash to a dictionary of previously-sent compressed responses...
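A rough sketch of that flow in C with zlib (the ETag derivation and the fixed-size cache here are made up for illustration; a real server keys on its actual response hash and handles eviction):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* One cache slot: the ETag of the uncompressed body plus the deflated copy
 * produced the first time that ETag was seen. */
struct cached_response {
    char etag[16];                 /* here just adler32 of the body, hex-encoded */
    unsigned char *deflated;
    uLong deflated_len;
};

#define CACHE_SLOTS 256
static struct cached_response cache[CACHE_SLOTS];

/* Return a deflated copy of `body`, reusing the cached one when the ETag matches. */
static const struct cached_response *get_deflated(const unsigned char *body, uLong len)
{
    char etag[16];
    snprintf(etag, sizeof etag, "%08lx", adler32(0L, body, (uInt)len));
    unsigned slot = (unsigned)strtoul(etag, NULL, 16) % CACHE_SLOTS;

    if (strcmp(cache[slot].etag, etag) != 0) {      /* miss: compress and remember */
        uLong cap = compressBound(len);
        unsigned char *out = malloc(cap);
        if (!out || compress2(out, &cap, body, len, Z_BEST_COMPRESSION) != Z_OK) {
            free(out);
            return NULL;
        }
        free(cache[slot].deflated);
        cache[slot].deflated = out;
        cache[slot].deflated_len = cap;
        strcpy(cache[slot].etag, etag);
    }
    return &cache[slot];
}

The point being that the deflate step runs only once per distinct response; every later hit serves the cached compressed copy.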
To expand on the use-case: several years ago I added partial pre-compression of response templates to the Tornado webserver. Most of these application servers have a text template response system, e.g. <html><head><title>{% page_title}</title><body><p>Hi {% username}, ...
In Tornado they actually precompile these text-based templates to code. Neat hacker trick.
What I did was pre-compress the template fragments between the computed fields as deflate, using relative offsets for the matches. When I emitted the template, I emitted the computed fields as deflate literals and fixed up the match offsets as I went. At least, that's how I recall doing it.
It was effective. I don't think the same approach would work with a more complicated compressor like Brotli though.
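For the curious, a minimal sketch of that kind of deflate stitching in C with zlib. It is not the original Tornado patch: instead of rewriting match offsets, this variant keeps each fragment's matches fragment-local (Z_SYNC_FLUSH byte-aligns each piece so the pieces can be concatenated) and emits the computed fields as stored blocks rather than deflate literals:

#include <string.h>
#include <zlib.h>

/* Pre-compress one static template fragment as raw deflate (no zlib header),
 * ended with Z_SYNC_FLUSH so the output is byte-aligned, non-final, and safe
 * to concatenate with other blocks. Returns the number of bytes written. */
static size_t precompress_fragment(const unsigned char *in, size_t in_len,
                                   unsigned char *out, size_t out_cap)
{
    z_stream s;
    memset(&s, 0, sizeof s);
    deflateInit2(&s, Z_BEST_COMPRESSION, Z_DEFLATED, -15, 8, Z_DEFAULT_STRATEGY);
    s.next_in   = (unsigned char *)in;
    s.avail_in  = (uInt)in_len;
    s.next_out  = out;
    s.avail_out = (uInt)out_cap;
    deflate(&s, Z_SYNC_FLUSH);          /* matches never reach outside the fragment */
    deflateEnd(&s);
    return out_cap - s.avail_out;
}

/* Emit a computed field verbatim as one stored (uncompressed) deflate block:
 * header byte (BFINAL=0, BTYPE=00), LEN, ~LEN, then the raw bytes.
 * in_len must be < 65536; split longer fields into several blocks. */
static size_t stored_block(const unsigned char *in, size_t in_len, unsigned char *out)
{
    out[0] = 0;
    out[1] = (unsigned char)(in_len & 0xff);
    out[2] = (unsigned char)(in_len >> 8);
    out[3] = (unsigned char)~out[1];
    out[4] = (unsigned char)~out[2];
    memcpy(out + 5, in, in_len);
    return 5 + in_len;
}

/* Terminate the concatenated response: an empty final fixed-Huffman block. */
static size_t final_block(unsigned char *out)
{
    out[0] = 0x03;
    out[1] = 0x00;
    return 2;
}

Concatenating the pre-compressed fragments, the stored blocks and one final block yields a valid raw deflate stream (decode with inflateInit2 and windowBits -15); the gzip/zlib wrapping needed for HTTP is left out here.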
Sorry for the walk down memory lane inside "application servers"
new version:
- added -w/-m options
- improved -v statistics
- extended -h help
- published to github
- note that the default dictionary (window) is only 4 MB and the maximum is 16 MB, so on larger files it has no chance against lzma or even tornado
--quality: controls the compression-speed vs compression-density tradeoff. The higher the quality, the slower the compression. Range is 0 to 11. Defaults to 11.
--window: base 2 logarithm of the sliding window size. Range is 16 to 24. Defaults to 22.
--mode: the compression mode can be 0 for generic input, 1 for UTF-8 encoded text, or 2 for WOFF 2.0 font data. Defaults to 0.
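For reference, these three options map directly onto the one-shot C API that newer brotli releases expose in <brotli/encode.h>; the 2015 tree shipped a different C++ interface, so treat this as an illustrative sketch rather than what bro.exe calls internally:

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <brotli/encode.h>

/* Compress `in` with the given --quality (0..11), --window (lgwin, 16..24)
 * and --mode (BROTLI_MODE_GENERIC / _TEXT / _FONT). Returns a malloc'd
 * buffer and sets *out_len, or NULL on failure. */
static uint8_t *brotli_oneshot(const uint8_t *in, size_t in_len,
                               int quality, int lgwin, BrotliEncoderMode mode,
                               size_t *out_len)
{
    size_t cap = BrotliEncoderMaxCompressedSize(in_len);
    uint8_t *out = cap ? malloc(cap) : NULL;
    if (!out) return NULL;
    *out_len = cap;
    if (!BrotliEncoderCompress(quality, lgwin, mode,
                               in_len, in, out_len, out)) {
        free(out);
        return NULL;
    }
    return out;   /* *out_len now holds the compressed size */
}

With the defaults above, a call would look like brotli_oneshot(buf, n, 11, 22, BROTLI_MODE_GENERIC, &len).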
So lzturbo could give a ~15% bandwidth improvement while still being essentially faster/less costly ... maybe instead of enforcing brotli, they should buy lzturbo ...
Jarek, don't forget that lzturbo is just stolen tornado code. His best own achievement is a Huffman coder. By looking at the lzturbo sources, anyone can immediately understand that it's just a scam.
Bulat, I wouldn't call something offering (in a verifiable way) a 15% improvement of world-wide bandwidth "a scam".
Sure, he might have stolen some of your code, but that was many versions ago; the performance difference from tornado suggests that it might be completely different now.
This guy is a genius of low-level optimization, also in other tasks: https://github.com/powturbo
We are talking about huge world-wide improvements of extremely frequent tasks; it is worth much more than e.g. the $1M Google could easily pay him to make it open-source, maybe shared with other contributors like you.
maybe instead of enforcing brotli, they should buy lzturbo ...
I thought the same, or they could hire Hamid Buzidi. If I remember my early lzturbo research well, Hamid worked for a German search engine at the time, so it must be optimized for HTML compression.
Jarek, the brotli dictionary by default is 4 MB and it's intended for compression of pretty small files. So you need to use the same dictionary and non-solid compression, for a start, and then compare with the original programs, e.g. lzmh+xwrt with a built-in dictionary, or lzga. Hamid can sell anything by telling us that he wrote it, but the only real thing he ever wrote was a tiny integer compressor. Even the Huffman compressor he wrote follows the description of jumpless bit coding I wrote here a year ago. Modern LZ compression like lzmh is absolutely outside his programming skills.
brotli failed on 10gb.tar. Input: 10,065,018,880. Compressed with -q 1: 4,733,283,813 (5580 s). Decompressed: 14,687,141,888 (212 s). I'm testing the Sept. 21 commit compiled in Linux (10GB benchmark machine 4).
Edit: rebooted, installed latest (24 Sep 2015) commit of brotli on Linux again, recompiled, tested with -q 5. Output is still bad (wrong size). Compressed 5,357,707,755 (557s, faster than -q 1?), decompressed 13,689,897,536 (262s).
Edit: after further experimentation, the error occurs at byte 25,165,825 of 10gb.tar with -q 1. If you truncate the input to this length or longer, the decompressed output differs from the input; at exactly this length, only the last byte of output differs. If you truncate to a smaller file, the output is identical.
I assume Google developed brotli so they wouldn't run into any issues from using someone else's work in their enterprise. So lzham or glza, and especially lzturbo, are off the table. I would've liked brotli to perform a bit better for what it is, but it's not like they can change the standard at this point.
For general-purpose data, it seems to be total garbage: compression ratio is terrible, encode speed is terrible, decode speed is mediocre. Even a good LZ-Huff (like LZX) beats it.
For text, it's okay.
But it's not really fair to test a text-specific compressor against a generic compressor. The competition should be something like LZ-Huff with a text preprocessor or a special text mode.