The news at Phoronix. I had never even heard of Zchunk. Apparently it's based on Zstd. It's replacing XZ, which I think is LZMA or LZMA2.
Cyan (3rd July 2018)
Its author posted a description here:
https://www.jdieter.net/posts/2018/0...hat-is-zchunk/
So there are (very roughly) about 20,000 packages in the repo, and for each package its metadata record is a few KB in size (starting with about 1KB, I believe). If they compress each record individually, that's gonna hurt the compression ratio tremendously. Because there isn't much redundancy within a record, but there's a lot of similarity between the records, especially if they are sorted by the package name.
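The effect described above is easy to demonstrate. A minimal sketch, using stdlib zlib as a stand-in (zstd and brotli aren't in Python's standard library, but the principle is the same): many small, mutually similar records compress far better as one concatenated stream than one at a time, because a per-record compressor never sees the cross-record redundancy. The record contents here are invented for illustration.

```python
import zlib

# Hypothetical metadata records: little redundancy within one
# record, lots of similarity between records.
records = [
    (f"name=pkg-{i}\nversion=1.0.{i}\narch=x86_64\n"
     f"summary=An example package used for illustration\n").encode()
    for i in range(20_000)
]

# Compressing each record individually: the compressor only ever
# sees one small record at a time.
individual = sum(len(zlib.compress(r, 9)) for r in records)

# Compressing the concatenation (records sorted by name, as in the
# repo metadata): cross-record redundancy is exploited.
combined = len(zlib.compress(b"".join(records), 9))

print(f"per-record total: {individual} bytes")
print(f"concatenated:     {combined} bytes")
```

The gap is dramatic even with zlib's small 32 KB window; with zstd's or large-window brotli's much bigger windows, the concatenated stream benefits even more.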
If your estimate are accurate, I agree with you. Unless there's some catch regarding CPU or memory use, or how important encoding speed is. What Brotli and Zstd levels are you basing your estimate on?
Zchunk is some sort of delta scheme. I haven't dug into the details, but is it trivially easy to use brotli for delta compression cases? Is there any reason Zstd would be preferred for such cases?
This is thoroughly explained in the link Cyan provided. "Delta" here refers to chunk-based diffs, not the delta filter implemented in some LZ compressors. See the parts where the author talks about the relation of zchunk to rsync and zsync. The format itself is pretty straightforward, and I believe the choice of algorithm has little to nothing to do with the functionality of the program. The author probably went with zstd because it is better known, a little more stable, and pretty much state of the art for its niche. Maybe brotli is better for some cases, maybe not, but the author probably doesn't know, or he knows that nobody on his team would volunteer to help him integrate it. Remember that zstd is rapidly gaining adopters in the Linux world. It is even included in the kernel. See this post for some uses of it.
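The chunk-based idea behind such schemes can be sketched in a few lines. This is not zchunk's actual format (zchunk picks chunk boundaries from the content itself, so insertions don't shift every later chunk; the fixed-size split here is a simplification): the client hashes the chunks it already has, compares against the hash list of the new file, and fetches only the chunks it is missing.

```python
import hashlib

CHUNK = 1024  # fixed chunk size for this sketch only

def hashes(data: bytes) -> list:
    """Split data into fixed-size chunks and hash each one."""
    return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

# Four distinct 1024-byte chunks the client already has locally.
old = b"".join(i.to_bytes(2, "big") * 512 for i in range(4))

# The updated file: one byte changed inside chunk 2.
new = bytearray(old)
new[2050] ^= 0xFF
new = bytes(new)

have = set(hashes(old))
# The client downloads only chunks whose hash it doesn't already have.
needed = [i for i, h in enumerate(hashes(new)) if h not in have]
print(needed)  # → [2]
```

In zchunk each chunk is compressed independently (which is exactly where the per-chunk compression-ratio penalty discussed above comes from), so any single chunk can be fetched and decompressed on its own.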
The recent penetration of modern compression technologies has been a fantastic step!
Comparing Brotli and Zstd, Brotli is very likely more widely available: it is in .NET, in iOS (including the TV and phone variants), in Windows, in Android, and in every browser that is still being developed. Brotli is even included in my LG TV.
Sure, it is not in the Linux kernel. I think zstd got its foot in the door by benchmarking brotli 11 with a 4-16 MB window against zstd 22 with a 128 MB window. For quite a while, zstd's Wikipedia page claimed faster compression and decompression for zstd along with a better compression ratio. Even today, the LTCB does not include the large-window brotli results that would make it generally favorable over zstd. Large-window support is slowly fixing that, and a meta-analysis of recent benchmarking efforts shows brotli compressing ~5 % more densely than zstd (when both are run at maximal settings in large-corpus scenarios where the static dictionary doesn't help brotli).
For compressed-filesystem use, zstd may have the better balance of features, since decompression speed can be really critical there, but I consider brotli more balanced for the general use case.
Last edited by Jyrki Alakuijala; 4th July 2018 at 16:19.
That I didn't know. Thanks for the info!
Jyrki, do you really believe that the current situation is the result of a benchmark publication on LTCB ?
Jyrki Alakuijala (6th July 2018)
Yeah. I believe that those people who are basing over-the-wire or over-the-air internet scale solutions on zstd are doing it based on faulty benchmarking scenarios. It seems to me that zchunk is in this category.
The filesystem, or possibly the datacenter network fabric, seems to be a better fit for zstd's properties.
Gonzalo (4th July 2018)
My bad. I stand corrected.
Jyrki Alakuijala (6th July 2018)
I really wish Matt would fix the LTCB. It's not a valid benchmark. In addition to the window size issues, the benchmarks are on different machines, different compiler settings, and often use very old versions of a codec. The gzip he uses is more than 10 years old, and I have no idea how it compares to the latest release of zlib, GNU gzip, or libdeflate. And the brotli and Zstd versions are two years old. There's no report of decomp memory usage, and no report of CPU load for either compression or decomp.
A compression benchmark needs to use the same hardware for all codecs, and ideally the latest version of each codec.
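A minimal sketch of what a fair harness looks like: every codec runs on the same machine, on the same data, in the same process, with both compression and decompression timed and the round-trip verified. Stdlib zlib at two levels stands in for two codecs here; a real benchmark would plug in current zstd and brotli builds compiled with the same flags, and would also record peak memory use.

```python
import time
import zlib

def bench(name, compress, decompress, data):
    """Time one codec: same machine, same data, round-trip checked."""
    t0 = time.perf_counter()
    blob = compress(data)
    t1 = time.perf_counter()
    out = decompress(blob)
    t2 = time.perf_counter()
    assert out == data  # a benchmark that doesn't verify is worthless
    print(f"{name}: ratio={len(data) / len(blob):.2f} "
          f"comp={t1 - t0:.4f}s decomp={t2 - t1:.4f}s")

data = b"example corpus line\n" * 50_000

# Two stand-in "codecs"; swap in zstd/brotli bindings in practice.
bench("zlib-1", lambda d: zlib.compress(d, 1), zlib.decompress, data)
bench("zlib-9", lambda d: zlib.compress(d, 9), zlib.decompress, data)
```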
Jyrki Alakuijala (6th July 2018)
Also offtopic, but I think the Squash library is a good building block for performing such benchmarks, and it needs more support to cover more formats and test data: https://quixdb.github.io/squash-benchmark/
What is a streaming implementation, anyway? All compressors have some lag during compression and decompression. If we pair a compressor and a decompressor (piping the compressor's output directly into the decompressor), then before the decompressor outputs a given original byte, the compressor will have had to process many of the following input bytes.
A streaming implementation is one that can decode all the bytes that can be decoded given the information received so far. In my experience, a streaming implementation of a decoder is 10-35 % slower in CPU than a non-streaming one, but the system built on top of a streaming implementation works faster overall.
A streaming data format is one where you don't need a lot of future data to decide how to decode the current bits.
An encoding-streaming-friendly data format is one where you don't need to have a lot of data available before you can start compressing. For example, ANS is a minute step away from encoding-streaming friendliness, since its data needs to be encoded backwards.
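A stdlib illustration of decoder-side streaming: Python's zlib exposes an incremental decoder (`decompressobj`) that emits every byte it can decode from the input seen so far. The 64-byte pieces simulate data trickling in from the network; note that decoded output becomes available before the full compressed stream has arrived.

```python
import zlib

payload = b"streaming example " * 2000
blob = zlib.compress(payload)

d = zlib.decompressobj()
recovered = b""
first_output_at = None
# Feed the compressed stream in small pieces, as a browser would
# receive it from the network.
for pos in range(0, len(blob), 64):
    recovered += d.decompress(blob[pos:pos + 64])
    if first_output_at is None and recovered:
        first_output_at = pos + 64
recovered += d.flush()

assert recovered == payload
print(f"first decoded bytes after receiving {first_output_at} "
      f"of {len(blob)} compressed bytes")
```

A non-streaming decoder, by contrast, would be handed the complete `blob` in one call and produce nothing until then.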
Yes, just different amounts, and these amounts can affect user experience. In the use case of a browser streaming means that the browser is able to issue fetches for the images or javascript earlier (if the urls are available in the already decoded part), or show some sort of preview from the half-loaded page earlier. Also, some more cpu-heavy processing like dom-building or javascript parsing/running can start while the content is being loaded.