Hi everyone!
I wrote a blog post about the current state of JPEG XL and how it compares to other state-of-the-art image codecs.
https://cloudinary.com/blog/how_jpeg...r_image_codecs
Jon,
your "Original PNG image (2.6 MB)"is actually a jpeg (https://res.cloudinary.com/cloudinar...h_fidelity.png) when downloaded.
did you mean to add 'f_jpg,q_97' in the URL ?
Also:
* you forgot to use the '-sharp_yuv' option for the webp example (53kb). Otherwise, it would have given you this noticeably sharper version:
(and note that this webp was encoded from the jpeg-q97, not the original PNG).
* in the "Computational Complexity", i'm very surprised that JPEG-XL is faster than libjpeg-turbo. Did you forget to mention multi-thread usage?
i guess the original PNG would be this: https://res.cloudinary.com/cloudinar...h_fidelity.png
some trials with close filesize (webp = no meta, png = meta):
Code:
cwebp -q 91 high_fidelity.png -o q91.webp (52.81 KB) -> q91.png
cwebp -q 90 -sharp_yuv high_fidelity.png -o q90-sharp.webp (52.06 KB) -> q90-sharp.png
it would be unrelated to the point of the article itself, but still, since web delivery is mentioned, a few points from an end-user pov on samples/results:
- this file (131.23 KB) could be a good example where automatic compression would be useful. this PNG could be losslessly reduced to 19.85 KB for web (or 16.19 KB in lossless WebP), which would make the (high quality) lossy JPEG XL less relevant for users (edited: reformulation to match the initial point, see this comparison)
- about PNG itself, the encoder used here produces very over-bloated data for a web context, making the initial filesize non-representative of the format (the original PNG is 2542.12 KB, but the expected rendering for web could be losslessly encoded to 227.08 KB with all chunks). as a side note, this PNG encoder also wrote non-standard keys for zTXt/tEXt chunks and a non-standard chunk (caNv)
btw, instead of (or in addition to) the current lossless mode, do you plan to provide a "web lossless" somehow? (edited) i did very quick trials and atm, the encoder can create bloated files for web from some PNGs (16 bits/sample, no alpha optimization, etc.)
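to be concrete about what i mean by "web lossless", here is a rough sketch of that kind of pre-processing, using zopflipng purely as an illustration (it is not the tool used for the numbers above, and the file names are placeholders); the two flags drop 16 bits/sample to 8 and clear fully transparent pixels before the actual encode:
Code:
zopflipng --lossy_8bit --lossy_transparent high_fidelity.png high_fidelity_web.png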
the host (https://i.slow.pics/) did some kind of post-processing on the PNGs (dropping the iCCP chunk and recompressing the image data with worse compression). those files are not what i uploaded (see the edited link in my first post)
Correct.
The article shows only that crop, but the sizes are for the whole image. Also, lossless WebP wouldn't be completely lossless since this is a 16-bit PNG (quantizing to 8-bit introduces very minor color banding).
Good point, yes, better results are probably possible for all codecs with custom encoder options. I used default options for all.
Numbers are for 4 threads, as is mentioned in the blogpost. On a single core, libjpeg-turbo will be faster. Using more than four cores, jxl will be faster by an even larger margin. It's hard to find a CPU with fewer than 4 cores these days.
-sharp_yuv is not the default because it's slower: the defaults are tuned for the common use case, and the image you picked as source is far from that common case.
(all the more so given that these images are better compressed losslessly!)
Just because you have 4 cores doesn't mean you want to use them all at once. Especially if you have several images to compress in parallel (which is often the case).
To make the point with a fair comparison, it would have introduced less noise to force 1 thread for all codecs. As presented, i find the text quite misleading.
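for instance (just a sketch, assuming a Linux box and the encoders named in this thread, with placeholder file names), pinning each process to a single core takes the threading variable out of the measurement:
Code:
taskset -c 0 cjpegxl input.png output.jxl
taskset -c 0 cwebp input.png -o output.webp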
The thing is, a bitstream needs to be suitable for parallel encode/decode. That is not always the case. Comparing using just 1 thread gives an unfair advantage to inherently sequential codecs.
Typical machines have more than 4 cores nowadays. Even in phones, 8 is common. The tendency is towards more cores and not much faster cores. The ability to do parallel encode/decode is important.
And yet, sequential codecs are more efficient than parallel ones: tile-based compression has sync points and contention that make the codec wait for threads to finish. Processing several images separately in parallel doesn't have this inefficiency (provided memory and I/O are not the bottleneck).
Actually, sequential codecs are at an advantage in some quite important cases:
* image bursts on a phone camera (the sensor takes a sequence of photos in a short burst)
* web page rendering (which contains a lot of images, usually. Think YouTube landing page.)
* displaying photo albums (/thumbnails)
* back-end processing of a lot of photos in parallel (cloudinary?)
Actually, I'd say parallel codecs are mostly useful for the Photoshop case (where you're working on a single photo) and screen sharing (/slide decks).
side note: JPEG can be made parallelizable using restart markers. The fact that no one is using them is somewhat telling.
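(for reference, a sketch of what that looks like with libjpeg's cjpeg; the interval of one MCU row is arbitrary and the file names are placeholders:)
Code:
cjpeg -quality 90 -restart 1 -outfile out.jpg in.ppm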
In any case, i would have multiplied the JPEG's MP/s by 4x in your table to get fair numbers.
(edited) i remember what i tried actually: i used this crop as a sample and did lossy JPEG XL on it (still, i misread your table and thought you did the same). the fact is that this sample could be better stored losslessly than as high quality lossy JPEG XL - so automatic compression would be useful in this case:
Code:
cjpegxl -q 99 -s 6 high_fidelity.png_opt.png high_fidelity-q99.jxl (~ 56 KB output)
and more than this, the default lossless JPEG XL would create a bigger file than the PNG:
Code:
cjpegxl -q 100 high_fidelity.png_opt.png high_fidelity-q100.jxl (20.93 KB)
my observations were about web usage context only, and about how 16 bits/sample PNGs are rendered in web browsers anyway
I wouldn't say sequential codecs are more efficient than parallel ones: you can always just use a single thread and avoid the (of course unavoidably imperfect) parallel scaling entirely.
If you have enough images to process at the same time (like Cloudinary, or maybe rendering a website with lots of similar-sized images), the best approach is indeed to parallelize that way and use a single thread per image.
There are still cases where you don't have enough images to process in parallel as single-threaded processes to keep your cores busy, though. For end users, I think the "Photoshop case" is probably rather common.
Restart markers in JPEG only allow you to do parallel encode, not parallel decode. A decoder doesn't know if and where the next restart marker occurs, and what part of the image data it represents. You can also only do stripes with restart markers, not tiles. So even if you'd add some custom metadata to make an index of restart marker bitstream/image offsets, it would only help to do full-image parallel decode, not cropped decode (e.g. decoding just a 1000x1000 region from a gigapixel image).
I don't think the fact that no one is trying to do this is telling. Applications that need efficient parallel/cropped decode (e.g. medical imaging) just don't use JPEG, but rather e.g. JPEG 2000.
Multiplying the JPEG numbers by 4 doesn't make much sense, because you can't actually decode a JPEG 4x faster on 4 cores than on 1 core. Dividing the JPEG XL numbers by 3 (for decode) and by 2 (for encode) is what you need to do to get "fair" numbers: that's the speed you would get on a single core (the factor is not 4 because parallel scalability is never perfect).
There's a reason why all the HEIC files produced by Apple devices are using completely independently encoded 256x256 tiles. Otherwise encode and decode would probably be too slow. The internal grid boundary artifacts are a problem in this approach though.
Yes, that would work. Then again, if you do such non-standard stuff, you can just as well make JPEG support alpha transparency by using 4-component JPEGs with some marker that says that the fourth component is alpha (you could probably encode it in such a way that decoders that don't know about the marker relatively gracefully degrade by interpreting the image as a CMYK image that looks the same as the desired RGBA image except it is blended to a black background). Or you could revive arithmetic coding and 12-bit support, which are in the JPEG spec but just not well supported.
I guess the point is that we're stuck with legacy JPEG decoders, and they can't do parallel decode. And we're stuck with legacy JPEG files, which don't have a jump table. And even if we would re-encode them with restart markers and jump tables, it would only give parallel striped decode, not efficient cropped decode.
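(To be concrete, the re-encoding part itself is easy: jpegtran can insert restart markers as a lossless transcode, with the file names below being placeholders. It's the jump table and cropped decode that have no standard answer.)
Code:
jpegtran -restart 1 -outfile with_markers.jpg legacy.jpg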
i did not check the results, but here is a (late) primitive and automatic trial of JPEG XL 0.0.1-f84edfb2/WebP 1.1.0 on the files used in my benchmark (for web (lossless) usage)
Thanks for doing the tests. The numbers check out for WebP 1.1.0.
I was wondering: is Pingo webp-lossless re-optimizing an already-compressed WebP-lossless? Or starting back from the PNG source?
I'm asking because the timings are pretty good for Pingo-webp-lossless compared to the rather slow WebP cruncher, which is probably doing too much work...
skal/
[what is wjxl, btw?]
in this specific test, it is done from the PNG (-webp-lossless -sN file.png), but it could eventually do both (from WebP: -sN file.webp)
on a larger benchmark using various image types, i guess the WebP cruncher would give smaller results. the point was to make it faster, so it checks the image specs first and selects a good-on-average transform instead of trying them all
on paletted samples, it is possible that an alternative transform could lead to better compression than what the WebP cruncher does atm (but not always). how entries are sorted in the palette would be the critical factor, and could enable the use of its predictors (or not) on the image data. perhaps somehow it could make WebP more competitive vs other codecs (JPEG XL, FLIF, OLI) (<- e.g. on those specific samples, WebP could be 19 210 bytes and 14 484 bytes, or even smaller with a more exhaustive/efficient search)
my bad, i thought i had mentioned it in the results. it is just my ugly, unoptimized quick attempt to do PNG->JXL losslessly for web (basic alpha optimization, etc.) to make the comparison with the other codecs more reliable
nice work!
Note that the github HEAD WebP version produces 18842 bytes and 14168 bytes of output, respectively (in -lossless -q 100 -m 6 cruncher mode).
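(i.e. roughly this invocation, with placeholder file names:)
Code:
cwebp -lossless -q 100 -m 6 input.png -o output.webp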
oh, good to know.
Skal, what's the GitHub "HEAD" version? Is that a nightly or something?
good news, i did not see this! is there any particular reason why, atm, this transformation is not done at a lower level?
btw, it's just been added to libwebp HEAD for method 5 (thanks for the suggestion!).
WebP 1.1.0/cf2f88b (automatic results on my benchmark with recompiled binaries [on very low-performance hardware/32-bit OS, where multithreading/multiple processes could struggle], not checked)
i guess the main reason is that there is no real crunch mode like cwebp's. pingo targets its trials very specifically according to the colortype (so RGBA, paletted, etc. are "tested" differently)
it does estimations for paletted images: one with the default ordering (+ no predictor), a second with a specific ordering (+ predictors). those trials are done in PNG format, with fast but weak compression (this will be improved later). it compares both sizes, picks the smaller one, and sets the WebP encoder accordingly. the level in pingo (-s1, -s2, etc.) sets how strong the compression is for the estimations, how many orderings are tried, and the method/quality for the final WebP encoding (libwebp has been modified for that purpose)
perhaps pingo is faster with this strategy, but if someone tests a large number of samples (paletted or not), i guess it would compress worse vs cwebp's brute force. the only reason it got better compression in this specific case (my benchmark) is its palette sorting, which overall performed better than the default here. however, it is an inexact science, since each approach can be better than another depending on the sample
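(so concretely, a single call like the one below runs those estimations and then the final WebP encode; -s2 is just an example level, and the file name is a placeholder:)
Code:
pingo -webp-lossless -s2 file.png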
perhaps i made it even more competitive on tested profiles
very nice speed (and compression gain). That's a motivation to add better heuristics in cwebp!
Looks like parallel processing of files has better throughput overall (3.8x) compared to multi-threading each file taken sequentially (~ 1.8x). That's not totally unexpected...