http://www.radgametools.com/oodlelimage.htm
Same compression as WebP-lossless, 5x faster decompression, says the page.
That's Jon Olick's brainchild.
Yup, this was designed and implemented by Jon.
Some more tech details for the curious: we had several customers who were interested in a minor variation on PNG with the zlib backend replaced with Oodle. That's what OLI started as, although over time the differences from regular PNG grew.
The speedup is mostly from replacing zlib with the Oodle codecs and some high-level changes to filtering to get more parallelism. The basic architecture is very similar to PNG, i.e. apply lossless filters then hand over to a LZ backend. The filters are the standard PNG filters plus a few new ones, namely 3 different types of gradient predictors (sorely missing in stock PNG) and a parametric linear predictor (fixed-point filter coeffs are determined at encode time and stored). Images are usually chopped into independent tiles that can then be predicted simultaneously (regular PNG filters are very serial). As with Kraken, no major new ideas, just careful engineering.
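To make the filtering stage concrete, here is a rough C sketch of what a per-scanline pass with a gradient predictor could look like. This is a generic illustration under my own assumptions: the actual OLI filter definitions, the parametric coefficients and the tile layout aren't public, and the function names here are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch of PNG-style spatial prediction on one 8-bit plane.
 * Shows the standard "left"/"up" neighborhood plus a clamped gradient
 * predictor (left + up - upleft). OLI's real filter set, parametric
 * coefficients and tiling are not public; this shows the general approach,
 * not its implementation. */

static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Compute prediction residuals for one scanline given the previous one
 * (prev_row may be NULL for the first scanline). Residuals wrap mod 256,
 * as in PNG. */
static void predict_row_gradient(const uint8_t *row, const uint8_t *prev_row,
                                 uint8_t *residual, size_t width)
{
    for (size_t x = 0; x < width; x++) {
        int left   = x > 0 ? row[x - 1] : 0;
        int up     = prev_row ? prev_row[x] : 0;
        int upleft = (x > 0 && prev_row) ? prev_row[x - 1] : 0;
        int pred   = clamp_u8(left + up - upleft);  /* gradient predictor */
        residual[x] = (uint8_t)(row[x] - pred);     /* wraps mod 256 */
    }
}
```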
Yeah, comparing against webp 1.0.0. The measurements are on Jon's machine (one of the new second-generation Ryzen Threadrippers), with each codec benchmarked decoding one image per thread.
OLI is targeting normal PCs, not game consoles (unlike regular Oodle, where we pay a lot of attention to perf on PS4/Xb1); on the consoles virtually nobody uses straight RGB(A) formats in any significant quantity, it's all compressed texture formats. As for GPU-based decoding, we've been shipping BinkGPU since 2014, which decodes regular Bink 2 and does all the data-parallel work on the GPU (leaving the bitstream decoding on the CPU; the Bink 2 format wasn't designed for GPU-side decoding). That one is used extensively on consoles and handhelds, but we don't recommend it on PCs: even after shipping nearly unchanged for 4 years, on GPUs from all but one vendor the decoder kernels keep breaking every couple of driver releases, despite being all-integer.
GPUs are well-suited to lossy transform-and-quantize style codecs; there's plenty of latent parallelism that is easy enough to map to tons of invocations while maintaining fairly coherent memory access patterns. Spatial predictors are a big pain because they induce strict ordering requirements that greatly reduce the available parallelism and lead to tricky kernels if you want things to complete reasonably quickly. This is the part of BinkGPU that most commonly breaks; it's possible our stuff is broken, but if so, I don't know how. Every driver/compiler engineer I've talked to so far (which by now is quite a few) agrees that what we're doing is allowed as per the various 3D/compute API specs, but apparently we're stressing paths that hardly anybody else uses heavily, so we're really good at finding finicky compiler and even hardware bugs in memory barrier implementations, and then turning them into big splotchy areas of garbage. Not a great place to be.
Transform codecs are great for lossy, but they've never really been a winner in the lossless space (or at least I don't know of one that is). Which is a pity, since they're by far the most suitable avenue for a GPU implementation.
I don't think the standard high-ratio lossless image coding paradigm (spatial prediction from a fairly large neighborhood followed by adaptive context coding) translates well to GPUs; it's hard to get more serial than that. I suppose you could bin the contexts into a small, salient subset (easier said than done) and then use something semi-static? Having the predictors force you to work in small piddly diagonal wavefronts of mutually independent pixels is still a real downer, though.
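To make the wavefront idea concrete, here is a rough serial sketch of the anti-diagonal traversal: with a left/up/up-left predictor, every pixel on a diagonal d = x + y depends only on earlier diagonals. Everything here (decode_pixel, the clamped-gradient predictor) is a hypothetical stand-in, not what OLI or BinkGPU actually do.

```c
#include <stdint.h>

/* Hypothetical per-pixel work: clamped-gradient prediction plus residual
 * add. Stand-in for whatever the real codec does per pixel. */
static void decode_pixel(uint8_t *img, const uint8_t *residual,
                         int width, int x, int y)
{
    int left   = x > 0 ? img[y * width + x - 1] : 0;
    int up     = y > 0 ? img[(y - 1) * width + x] : 0;
    int upleft = (x > 0 && y > 0) ? img[(y - 1) * width + x - 1] : 0;
    int pred   = left + up - upleft;
    if (pred < 0)   pred = 0;
    if (pred > 255) pred = 255;
    img[y * width + x] = (uint8_t)(pred + residual[y * width + x]);
}

/* Anti-diagonal wavefront: pixels with the same d = x + y depend only on
 * pixels with smaller d, so every iteration of the inner loop is
 * independent and could run in parallel. Note how few pixels the diagonals
 * near the image corners contain; that's the lost parallelism. */
static void decode_wavefront(uint8_t *img, const uint8_t *residual,
                             int width, int height)
{
    for (int d = 0; d < width + height - 1; d++) {
        int y_lo = d < width  ? 0 : d - width + 1;
        int y_hi = d < height ? d : height - 1;
        for (int y = y_lo; y <= y_hi; y++)   /* independent iterations */
            decode_pixel(img, residual, width, d - y, y);
    }
}
```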
The PNG/OLI-esque approach, i.e. a general lossless coder plus filters (which are just spatial predictors again), has similar problems. Some simple filters are nice (e.g. the PNG "left" filter just turns into a prefix sum, which is easy); the good ones, not so much. Cutting up an image into a handful of tiles so you get enough independent work for CPU-side SIMD is one thing, but to make a GPU implementation happy you would need hundreds (thousands, ideally), and that's a serious compression hit. So small piddly diagonal(-ish) wavefronts are still the best I can think of for GPUs. It's not great.
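For reference, undoing the PNG "left" (Sub) filter on a single 8-bit channel really is just an inclusive prefix sum mod 256 along each scanline, exactly the kind of scan GPUs handle well. A serial C sketch for clarity (multi-byte pixels would use a per-channel stride, omitted here):

```c
#include <stddef.h>
#include <stdint.h>

/* Undo the PNG "left" (Sub) filter for one scanline of an 8-bit channel:
 * each reconstructed byte is the running sum (mod 256) of the residuals,
 * i.e. an inclusive prefix sum. */
static void unfilter_left(uint8_t *row, size_t width)
{
    uint8_t sum = 0;
    for (size_t x = 0; x < width; x++) {
        sum = (uint8_t)(sum + row[x]);  /* wraps mod 256 */
        row[x] = sum;
    }
}
```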