Fascinating compression cost model, where cost = $$$
Hi all – This report by LucidChart is remarkable. They tested brotli, gzip, Zstd, and other compressors, as well as numerous binary serialization formats, and evaluated them based on dollars saved in storage and compute costs. They used AWS Lambdas and S3 storage. (Lambdas are pay-as-you-go functions, as opposed to traditional server instances.)
The most compact binary format, uncompressed, was Amazon's new Ion format. (They didn't test FlatBuffers or Protocol Buffers, or Cap'n Proto, but they tested CBOR, Smile, BSON, etc.) They tested every format in combination with every level of every compressor.
The winner was JSON + Brotli 11. This surprised me. It turns out that the compression codec matters more than the data serialization format: even though JSON is a bloaty text format, Brotli 11 largely makes up for it. I'm not sure why Ion + Brotli 11 didn't win, at least by a little bit. Their use case was focused on long-term storage. When people discover that compression codecs matter more than the underlying data format, they forget that the uncompressed payload still has to sit in RAM, so uncompressed size matters too – but if all you're worried about is long-term compressed storage, then that doesn't matter as much.
It looks like compressed textual Ion did slightly beat JSON on size (not counting cost). My guess would be that the textual formats have a restricted range of common characters, which provides more matches and thus less space once compressed than binary-encoding the data first – similar to how preprocessors like precomp can yield more compression. I am curious what kind of overhead Ion (a JSON superset) carries to fall off the cost-efficiency list, but since the start and end formats were JSON, any conversion is likely to lose.
Far more domain specific, but I did a similar analysis at some point for genome sequencing alignment formats: BAM vs CRAM.
Again it was using S3 storage (vanilla S3, but also things like glacier) and CPU cost (it was a spot-price at the time). I was considering encode + decode, but took a different approach. Instead of asking what is the cheapest format for 1 month, I asked at what time period do you break even. Given CRAM is so much smaller than BAM but less well supported, I was considering the use case of tools that only support BAM, but where we convert BAM to CRAM, store it for a while, and then convert back to (uncompressed) BAM when the tool next needs to reread the file. This method is pretty sound for any file / compression format and any tool.
I don't have the slides readily available, but basically it came out as a staggeringly low 1 day! Guess which format is still dominant. *thud* I think it was 3 days if not using spot pricing and going for on-demand, or closer to a month once you get to the really cheap storage tiers – but if you're using those, you also want your data not to be accessed, since the cost of re-reading is high.
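For anyone who wants to redo that kind of break-even estimate, the arithmetic is easy to sketch – all the numbers below are placeholders, not my original figures:

# break-even for "convert BAM -> CRAM, store, convert back to BAM later":
# one-off CPU cost of the two conversions vs. the monthly storage saving.
$bam_gb    = 100;     # placeholder BAM size (GB)
$cram_gb   = 40;      # placeholder CRAM size (GB)
$cpu_hours = 4;       # placeholder encode + decode CPU time (hours)
$cpu_rate  = 0.03;    # placeholder $/CPU-hour (spot-ish)
$s3_rate   = 0.022;   # $/GB/month, S3 Standard
$convert_cost      = $cpu_hours * $cpu_rate;
$monthly_saving    = ($bam_gb - $cram_gb) * $s3_rate;
$break_even_months = $convert_cost / $monthly_saving;
printf "break even after %.2f months\n", $break_even_months;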
Thanks for posting this, we're actually struggling with similar metrics for the contest.
The actual formula seems to be like this:
$GB = 1024*1024*1024;
$B4 = 0.000002917 * 10; # Lambda Cost per Second (for a full vCPU)
$B5 = 0.022; # S3 Cost (per GB/Month)
$B6 = 0.0125; # S3 IA Cost (per GB/Month)
$B11 = 0.9; # Percent of S3 data in IA (rather than Standard)
$B12 = 90; # Expected Lifespan of data (in months)
$B13 = 2; # Number of times (on average) decompressing a given snapshot
$B14 = 6; # CPU Multiplier
$B19 = $B4*$B14; # Expected Cost per CPU millisecond of measured time
$B20 = $B11*$B6 + (1-$B11)*$B5; # Average Cost per GB per month
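Presumably these pieces combine into a total cost roughly like this (my reading of it, with $csize in bytes and $ctime/$dtime being the measured times in milliseconds):

$storage_cost = ($csize / $GB) * $B20 * $B12;            # storage cost over the data's lifespan
$cpu_cost     = ($ctime + $B13 * $dtime) * $B19 / 1000;  # compress once, decompress $B13 times
$total_cost   = $storage_cost + $cpu_cost;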
But *6*10 multipliers for csize went unexplained afaik, and this metric gives too much priority to compression ratio imho.
(Where brotli wins, because only brotli has an integrated dictionary enabled by default.)
When this metric is applied to enwik10 benchmark results we get this:
That's almost (but not quite) just a sort by ascending file size.
I'd agree it's over-emphasising the ratio. bcm -9 is ~5x quicker than zpaq -m411 and just 0.4% larger, but still ranks poorer. Where is the /1000 coming from in the CPU cost part of the formula? Also, maybe 90 months is a bit excessive for the duration. (I find my large files tend to have shorter lifespans than the small ones, because if I'm running out of space they're the ones I target when tidying up.)
> Where is the /1000 coming from in the cpu cost part of the formula?
Thanks, it was my mistake - the enwik10 benchmark has times in seconds, while the LucidChart formula expects times in ms.
I updated the table in the post above; now it's not sorted by csize.
"It was much bigger" only when you used jQuery itself for the dictionary - that doesn't count.
Hah! I forgot about that. With this kind of JSON use case, I think the dictionary would be pretty tight. I mean there would be standard field/key names that would be repeated in all files or streams, and probably lots of common values. But I guess those matches would be happening without a dictionary. It would be interesting to find out. The use case I was thinking of most was JSON in REST payments APIs. So it would be small data and very repetitive, like "account number": "nnnnnnnnnnn", and so forth.
Also using XZ instead of plain LZMA is dumb. LZMA has options (lc8 lp0 pb0 fb273 mainly) that improve text compression,
while XZ is a much more limited wrapper.
It is a common way to compute a cost function: add an extra free parameter (see the formula). Based on its value, one can balance the cost due to the compression ratio against the cost due to the comp/decomp speed. If you look at the numbers, with a small lambda value the compressors with the lowest costs are the ones that compress the most (mostly ignoring speed). With intermediate values you see more balanced compressors at the top, and finally with higher values the fastest compressors win. So it allows one to pick the "best" compressor for a specific scenario.
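In other words, something along these lines (a generic sketch of the idea, not the exact formula used here; $lambda is the free parameter):

# rate/speed trade-off: small $lambda rewards ratio, large $lambda rewards speed
sub cost {
    my ($csize, $ctime, $dtime, $lambda) = @_;
    return $csize + $lambda * ($ctime + $dtime);
}
# $lambda near 0  -> ranking is essentially "smallest output wins"
# large $lambda   -> ranking is essentially "fastest codec wins"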
Shelwien, I don't understand your dictionary fix. What exactly is the 1.76%? That's brotli with dictionary vs. brotli without dictionary? How did you alter the results?
If brotli wins, brotli wins. It doesn't matter if it uses a dictionary while others don't – that's their problem. Unless the brotli dictionary can be used by other compressors in production environments, which I wasn't aware of. You posted that brotli dictionary file on the other thread as though I could do something with it, so that suggests that maybe Zstd can use it.
I don't understand what you did to the results to make XZ win and dominate. As you said, XZ isn't that great. Did you just add 1.76% to the brotli sizes, or shave the others by that amount? How is that valid? Whatever you did needs to reflect the actual reality of using a given compression codec in a production environment.
> Shelwien, I don't understand your dictionary fix. What exactly is the 1.76%?
> That's brotli with dictionary vs. brotli without dictionary?
That's zstd vs zstd -D using brotli dictionary.
> How did you alter the results?
Like this:
# shrink non-brotli sizes by the (zstd -D with brotli dictionary) / (plain zstd) size ratio
$csize *= 28942/29453 unless $name =~ /brot/i;
> If brotli wins, brotli wins.
> It doesn't matter if it uses a dictionary while others don't – that's their problem.
However, in fact it's usually a problem of false advertising and
ignorance on the part of benchmark organizers.
> Unless the brotli dictionary can be used by other compressors
> in production environments, which I wasn't aware of.
Yes, some compressors (mainly zstd) do have explicit options for external dictionary usage.
But even when they don't, it's not really a unique feature.
It can be easily simulated for most codecs via csize(dict+file) - csize(dict); see the sketch below.
In addition, PPM/CM codecs can use precomputed statistics (e.g. see ppmtrain).
So it's just not fair to only allow brotli to rely on this trick.
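A minimal sketch of that simulation, using the xz command line as a stand-in for a codec without a dictionary option (file names are just placeholders):

# effective compressed size with an "external dictionary" for a codec
# that has no dictionary option: csize(dict+file) - csize(dict)
sub csize {
    my ($path) = @_;
    system("xz -9 -c $path > tmp.xz") == 0 or die "xz failed";
    return -s "tmp.xz";
}
system("cat dictionary.bin input.json > concat.tmp");
$effective = csize("concat.tmp") - csize("dictionary.bin");
print "simulated dictionary csize: $effective bytes\n";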
And in fact the trick goes even further - brotli has transformation rules
for the dictionary (capital conversion etc.), so the effective dictionary
is not really dictionary.bin, but something that would have to be specially generated
to make the comparison with other codecs really fair.
> You posted that brotli dictionary file on the other thread as though
> I could do something with it, so that suggests that maybe Zstd can use it.
Yes, there're even multiple ways.
> I don't understand what you did to the results to make XZ win and dominate.
> As you said, XZ isn't that great. Did you just add 1.76% to the brotli sizes,
> or shave the others by that amount? How is that valid?
Sure, it's an approximation, but to get precise results I'd have to rebuild
their whole benchmark with all the other codecs using the brotli dictionary.
I guess we could just rebuild only the brotli results without the dictionary,
but that would be hard to justify.
Despite the defaults and XZ overhead, LZMA is still a more advanced format than brotli,
since it uses adaptive order-1 arithmetic coding vs brotli's static Huffman coding.
Also, brotli encoding is very slow, and this benchmark takes speed into account too.
> Whatever you did needs to reflect the actual reality
> of using a given compression codec in a production environment.
Actual reality is that most codecs don't use their best parameter profiles by default,
and also don't have an integrated dictionary.
And if we really care about actual performance, it would be a good idea to
compare apples to apples, even if that requires reading the parameter descriptions for each codec.
If 7z is more advanced, why does it double these 107 bytes of data, while brotli shaves them down to 91 bytes? (This is my Positional Data Syntax, where order indicates field/key names.)
123456789012
AB-11-9876
6011000995500000
1218123
0
TOM JONES
123 MAIN STREET
New York
NY
US
55555
Y
E
n
Y
Y
> If 7z is more advanced, why does it double this 107 bytes of data, while brotli shaves it down to 91 bytes?
Mainly because you're comparing apples to oranges again?
I mean, brotli outputs a raw brotli stream, while 7z creates an archive format with a header and index.
Raw optimized deflate output (zopfli or kzip+deflopt + rawdet) is 93 bytes.
Raw lzma (without .7z or .lzma header) is 94 bytes, plzma is 89 bytes, ppmd 81 bytes.
In my back-of-the-envelope calculation, if there are more than 12 users downloading the data, brotli 11 is worth it -- if the computation can be done without being part of the latency chain.
brotli got quite a lot worse after zopflication for very short data -- if you take an early brotli version from 2014, you'd see perhaps 20 bytes less in the 100-byte range.
>And in fact, the trick goes even further - brotli has transformation rules
The transformation code is very, very short and it more than pays back its added complexity -- I'd recommend it for others to use, too. We get about 8% improvement from the dictionary on a web corpus, and about 1% is from the transformations.
Also, it has to be noted that perhaps 2% comes from how the dictionary is coded -- every combo of (length, distance) has a different word associated with it; the static dictionary is not available just as bytes.
I think it would be great if Yann added the exact same mechanism to zstd, as this dictionary is readily available in many environments (Windows OS, Apple iOS, Firefox, Chrome, Safari, Android).
> The transformation code is very very short and it more than pays back its added complexity -- I'd recommend it for others to use, too.
Sure, but it'd be nice to at least have the whole set of brotli's reference data for codec benchmarks -
as concatenated versions of the dictionary with applied transformations, if necessary.
Of course brotli can use its dictionary more efficiently, but that's less of a problem than comparing against no dictionary at all.
> If 7z is more advanced, why does it double this 107 bytes of data, while brotli shaves it down to 91 bytes?
> Mainly because you're comparing apples to oranges again?
> I mean, brotli outputs a raw brotli stream, while 7z creates an archive format with a header and index.
> Raw optimized deflate output (zopfli or kzip+deflopt + rawdet) is 93 bytes.
> Raw lzma (without .7z or .lzma header) is 94 bytes, plzma is 89 bytes, ppmd 81 bytes.
> ...brotli with zeroed-out dictionary = 97 bytes.
How do you generate raw LZMA? Is that an option in the 7-Zip utility?
I don't know why you zero out brotli's dictionary. That's like removing one wheel from a Corvette and saying, "See, it's not so fast with three wheels." Brotli has a dictionary – that's just the way it is. If others don't use dictionaries, that's a flaw of their own choosing. I can only use codecs as they exist in the world, compiled for end-users.
I'm not familiar with kzip. When I look it up, it seems to be related to KDE or Ken Silverman. I'll try it. I thought Pavlov's was the best gzipper. It was amazing on jQuery, but that might have been a fluke. kzip looks old – I'm surprised it hasn't been beaten by now.
> How do you generate raw LZMA? Is that an option in the 7-Zip utility?
No, there's a separately distributed LZMA SDK which includes "lzma" utility.
But that's not quite enough either, since it encodes to ".lzma" format,
which has a 14-byte header.
Here I attached a patched version of the raw LZMA codec with minimized overhead.
Unfortunately it outputs 95 bytes rather than 94, because getting to 94 also requires
a modified rangecoder flush - I did that with my plzma coder.
> I don't know why you zero out brotli's dictionary.
To see how much its result depends on the dictionary.
> If others don't use dictionaries, that's a flaw of their choosing.
No, the problem here is that we can use dictionaries with other codecs too,
but not exactly the same one as brotli's, because the brotli dictionary is
not just a raw block of bytes, but a set of strings sorted by length (afaik).
So if I used a different dictionary with another codec and it improved
compression, that would also be unfair, because the brotli utility
doesn't have an option to use an arbitrary file as an external dictionary.
Currently the best approach would be to use zstd --train on your records,
then zstd -D to compress the records. But even that doesn't guarantee
that zstd is the best codec for this.
(The best one would likely be ppmd_dict with zstd's custom dictionary.)
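For reference, that workflow is roughly the following (wrapped in system() calls just for illustration; the paths and file names are placeholders):

# build a dictionary from a directory of sample records, then compress with it
system("zstd --train samples/*.json -o records.dict");
system("zstd -19 -D records.dict record0001.json -o record0001.json.zst");
# decompression needs the same dictionary
system("zstd -d -D records.dict record0001.json.zst -o record0001.restored.json");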
> I can only use codecs as they exist in the world, compiled for end-users.
Well, your task with random access to compressed small blocks implies
that there'd be a custom solution based on some codec library.
Such things are not implemented with standard console utilities.
> I thought Pavlov's was the best gzipper.
It's the best practical one - it supports stream processing and MT.
Kzip and some other optimizers are not very practical - they
keep iteratively compressing the file until there are no more improvements,
which can take a lot of time.
> It was amazing on jQuery, but that might have been a fluke.
> kzip looks old – I'm surprised that it hasn't been beat by now.
There's zopfli: https://github.com/google/zopfli
Also there're a couple of extra utilities - deflopt and defluff,
which can sometimes improve the results.
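For what it's worth, the zopfli command line is simple (the iteration count below is just an example setting, and the file name is a placeholder):

system("zopfli --i1000 file.json");   # writes file.json.gz next to the input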