Thread: lightning-fast lossless image compression (heads up)

1. I've been talking to some graphics guys, and it seems that you can losslessly compress or decompress images by about a factor of two at hundreds of megabytes per second, single core, using SIMD instructions (e.g., SSE4). (8-bit grayscale images from the imagecompression.info gray8 benchmark.)

Unfortunately, the people who know how to do this are not publishing how or releasing the code because it's not a standard compression format and their companies are pursuing standardized compression formats like JPEG-LS instead.

On the other hand, it seems that the techniques are not patented and it doesn't appear that they will be, so this might be a good thing for a free software person who's good at images and SIMD to look into, if they're so inclined.

This posting at Jan Wassenberg's blog is particularly intriguing:

http://wassenberg.dreamhosters.com/b...e-compression/

Jan did some very cool work for his dissertation on symmetrical compression of 16-bit images, but that code did not work well for 8-bit color channels; the newer variant he describes does, at a cost in compression ratio of a bit under 20 percent.

On his blog he says:

The most interesting part is fixed-to-variable coding for entire vector registers of bits. For the image residuals above, it compresses to within 5-20% of the entropy at speeds comparable to or exceeding state of the art integer coders (http://arxiv.org/abs/1209.2137). A single core can compress/decompress between 4600 MB/s (sparse bit arrays with 0.05% ones) and 1000 MB/s (relatively dense, 12% ones). Of course this scales to multiple cores, as there is no shared dictionary to update.
Although the instruction set improved with SSE4, writing vector code is still an exercise in improvisation. Finding 1-bits is accomplished by a hash function based on de Bruijn sequences (http://supertech.csail.mit.edu/papers/debruijn.pdf) and using the FPU normalization hardware. Vectorized shifts are emulated with the PSHUFB universal shuffle instruction.
I'm not an expert on either images or SIMDfying algorithms, but that sounded very promising.

2. Thanks:

avitar (15th May 2014)

3. Simple/fast lossless image compression works by scanning line by line, using a simple predictor for each pixel value, such as the average of two neighboring pixels (or the slightly better median predictor from LOCO-I), and then encoding the difference from the actual pixel value with an entropy coder tuned for a two-sided exponential distribution (exp(-|x-y|)).
This way you can get about 50% reduction for a photograph, and much better for simpler pictures.
Using FSE as the entropy coder would give you about 500 MB/s per core here, which could be improved further with SIMD.
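For anyone who wants to play with this, here's a scalar sketch of the predict-and-subtract stage, using the median ("MED") predictor from LOCO-I/JPEG-LS. The function names are mine, and borders are crudely handled by treating missing neighbors as zero:

```c
/* Median ("MED") predictor from LOCO-I / JPEG-LS:
   a = left, b = above, c = above-left. */
int med_predict(int a, int b, int c)
{
    int mx = a > b ? a : b, mn = a < b ? a : b;
    if (c >= mx) return mn;
    if (c <= mn) return mx;
    return a + b - c;
}

/* Scan line by line and replace each pixel with its prediction
   residual, wrapped mod 256 so it stays in one byte.  Neighbors
   outside the image are crudely treated as 0. */
void residuals(const unsigned char *img, unsigned char *res, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int a = x > 0 ? img[y * w + x - 1] : 0;
            int b = y > 0 ? img[(y - 1) * w + x] : 0;
            int c = (x > 0 && y > 0) ? img[(y - 1) * w + x - 1] : 0;
            res[y * w + x] =
                (unsigned char)(img[y * w + x] - med_predict(a, b, c));
        }
}
```

On smooth image regions the residuals cluster tightly around zero, which is exactly the two-sided exponential shape the entropy coder wants.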

4. Thanks (2):

avitar (15th May 2014),Paul W. (15th May 2014)

5. Originally Posted by Jarek
Simple/fast lossless image compression works by scanning line by line, using a simple predictor for each pixel value, such as the average of two neighboring pixels (or the slightly better median predictor from LOCO-I), and then encoding the difference from the actual pixel value with an entropy coder tuned for a two-sided exponential distribution (exp(-|x-y|)).
I've done sort of that myself, long ago, in an experimental compressor for window-system save areas: just averaging the pixel immediately to the left with the one immediately above, and using a cheesy fixed Huffman-ish code for the difference. My cheesy version doesn't work as well as it used to, though, now that people use more colors, gradients, thumbnails, and (especially) photographs in web pages, docs, and desktop backgrounds.

Maybe it can be fixed, though, with a slightly better predictor, plugging in FSE, and reasonable speed optimizations. (Or better yet, finding a well-written free library that already does that.)

This way you can get about 50% reduction for a photograph, and much better for simpler pictures.
Using FSE as the entropy coder would give you about 500 MB/s per core here, which could be improved further with SIMD.
How much potential do you see for speeding up FSE with SIMD?

6. The prediction with subtraction should be SIMD vectorizable, giving roughly an 8x speedup, so the overall speed depends mainly on the entropy coder. A tANS/FSE decoding step is, e.g.:

t = decodingTable[x];
useSymbol(t.symbol);
x = t.newX + readBits(t.nbBits);

where x is below 16 bits, so we could use 8-wide SIMD. The table lookup has to be done one element at a time, but all the rest should get a speedup.
I don't know the exact values that can be achieved; dnd claimed getting about a 1.5x speedup, but without anything to confirm it. I think PNG-level compression can be obtained at about 500 MB/s here.
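For the prediction-with-subtraction part, here's one way it might look with SSE2 intrinsics, computing residuals against floor(avg(left, above)) 16 pixels at a time. This is my own sketch, not anyone's actual code, and it only works on the encoder side, where the left neighbor is an original pixel rather than a decoded one:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Encoder-side residuals for one row against the predictor
   floor((left + above) / 2), 16 pixels per iteration.
   On the encoder all inputs are original pixels, so the left
   neighbor is just an unaligned load shifted back one byte;
   there is no serial dependency as there would be in a decoder.
   Assumes w is a multiple of 16; the first 16 pixels of the row
   are left to scalar code (omitted here). */
void residual_row_sse2(const unsigned char *row,
                       const unsigned char *above,
                       unsigned char *res, int w)
{
    for (int x = 16; x < w; x += 16) {
        __m128i cur  = _mm_loadu_si128((const __m128i *)(row + x));
        __m128i left = _mm_loadu_si128((const __m128i *)(row + x - 1));
        __m128i up   = _mm_loadu_si128((const __m128i *)(above + x));
        /* pavgb rounds up: (a+b+1)>>1.  Subtract the rounding bit
           ((a^b)&1) to get the floor average without overflow. */
        __m128i avg  = _mm_avg_epu8(left, up);
        __m128i fix  = _mm_and_si128(_mm_xor_si128(left, up),
                                     _mm_set1_epi8(1));
        __m128i pred = _mm_sub_epi8(avg, fix);
        _mm_storeu_si128((__m128i *)(res + x), _mm_sub_epi8(cur, pred));
    }
}
```

The pavgb-minus-rounding-bit idiom is the standard way to average unsigned bytes without widening to 16 bits.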

7. Thanks:

Paul W. (16th May 2014)

8. Originally Posted by Jarek

I don't know the exact values that can be achieved; dnd claimed getting about a 1.5x speedup, but without anything to confirm it. I think PNG-level compression can be obtained at about 500 MB/s here.
1.5x seems really low to me, but I don't know much, and I'm not sure what the issues are. As I understood what Wassenberg did (and I may well not understand it at all), he picks one quantization for a whole SIMD register's worth of pixels, with a fixed choice of a 4/4-byte split, a 6/2-byte split, or an exact match of whole registers. Nasty high-frequency, detailed images like mandrill would force you to encode a lot of (aligned) 8-byte sequences the 4/4 way; easier, mostly-gentle-gradient images like lena with minor color noise would mostly allow 6/2; and really easy (non-photo, synthetic) images with a lot of flat color areas would often use exact (8/0).

(Basically, the encoding per aligned 8-byte unit is determined by the worst-matching byte of the 8. If I've got this all wrong, not being an image guy or particularly good at any aspect of this, I'd appreciate being gently straightened out.)

If that's roughly right, it seems there should be a tradeoff where you can get significantly better compression at a significant cost in speed, if you pick a smaller-than-8-byte unit to do this across, like 4 bytes instead, or you have more choices of splits (like 3 and 1 in addition to 4 and 2). Crudeness and assuming stretches of regularity should get you more vectorizability, right?
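To make the "worst-matching byte picks the split" idea concrete, here's a tiny sketch of how a per-8-byte-group classifier might look. The specific classes (0/2/4/8 significant bits) are made up for illustration and are not Wassenberg's actual splits:

```c
#include <stdint.h>

/* Tag an aligned 8-byte group of residuals by its worst-case
   byte: OR-ing the bytes together gives the union of their set
   bits, so the OR's width is the width the whole group needs.
   Hypothetical classes: 0 = all zero (tag only), 1 = every byte
   fits in 2 bits, 2 = every byte fits in 4 bits, 3 = raw bytes. */
int group_class(const uint8_t g[8])
{
    uint8_t m = 0;
    for (int i = 0; i < 8; i++) m |= g[i];
    if (m == 0)           return 0;
    if ((m & ~0x03) == 0) return 1;
    if ((m & ~0x0F) == 0) return 2;
    return 3;
}
```

The point of classifying a whole group at once is exactly the tradeoff above: one worst-case byte drags its seven neighbors up to a wider encoding, but the per-group decision stays branch-light and vectorizable.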

One question I have is about the split vs. unified treatment of color channels. Is there a screamingly fast way to interleave/deinterleave the bits of the bytes for the different colors? Given the typical correlations between color channels, I'd think it'd be beneficial to just interleave 3 x 8 bits into 24, then do the magic, with all the higher bits together and all the lower bits together. I wouldn't have thought that possible, but the (Leiserson et al.) paper that Wassenberg cites, about using magical de Bruijn sequences for indexing high (or low) bits on hardware without direct support for it, makes me a bit more optimistic that maybe there's a fast bit-fiddly trick that fills the bill.
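For what it's worth, the 3-way interleave itself can be done with the shift-and-mask "Morton code" trick from the bit hacks page; whether it's fast enough to be worth doing per pixel is another question:

```c
#include <stdint.h>

/* Spread the 8 bits of x three positions apart (bit i -> bit 3i)
   using the standard Morton-code shift/mask sequence. */
static uint32_t part_by_3(uint32_t x)
{
    x &= 0xFF;
    x = (x | (x << 16)) & 0x030000FF;
    x = (x | (x << 8))  & 0x0300F00F;
    x = (x | (x << 4))  & 0x030C30C3;
    x = (x | (x << 2))  & 0x09249249;
    return x;
}

/* Interleave three 8-bit channels into one 24-bit value, so all
   the high bits of r, g, b end up adjacent, then the next bits,
   and so on down to the low bits. */
uint32_t interleave_rgb(uint8_t r, uint8_t g, uint8_t b)
{
    return part_by_3(r) | (part_by_3(g) << 1) | (part_by_3(b) << 2);
}
```

Deinterleaving is the same masks in reverse. That's maybe a dozen cheap ALU ops per pixel in scalar code, so the real question is whether it pays for itself against just predicting each channel separately.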

9. Originally Posted by Paul W.
One question I have is about the split vs. unified treatment of color channels. Is there a screamingly fast way to interleave/deinterleave the bits of the bytes for the different colors? Given the typical correlations between color channels, I'd think it'd be beneficial to just interleave 3 x 8 bits into 24, then do the magic, with all the higher bits together and all the lower bits together. I wouldn't have thought that possible, but the (Leiserson et al.) paper that Wassenberg cites, about using magical de Bruijn sequences for indexing high (or low) bits on hardware without direct support for it, makes me a bit more optimistic that maybe there's a fast bit-fiddly trick that fills the bill.
http://graphics.stanford.edu/~seander/bithacks.html

The de Bruijn sequence trick is neat. I've been thinking about de Bruijn sequences lately.
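For reference, the de Bruijn lowest-set-bit trick from that page looks like this:

```c
#include <stdint.h>

/* Index of the lowest set bit via a de Bruijn multiply (the
   Leiserson et al. trick, as on the Stanford bit hacks page).
   (v & -v) isolates the lowest 1-bit; multiplying by the
   de Bruijn constant slides a unique 5-bit pattern into the top
   bits, which indexes a 32-entry table.  Undefined for v == 0. */
static const int tab32[32] = {
     0,  1, 28,  2, 29, 14, 24,  3,
    30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7,
    26, 12, 18,  6, 11,  5, 10,  9
};

int lowest_set_bit(uint32_t v)
{
    return tab32[((v & -v) * 0x077CB531u) >> 27];
}
```

The appeal for this thread is that it's pure multiply/shift/lookup, so unlike a hardware bit-scan instruction it can in principle be done lane-wise in a vector register.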

10. Thanks:

Paul W. (16th May 2014)

11. Look at LZ4.
Hardware / compiler support turned out to be much better than de Bruijn.
Overall, LZ4 gives you an implementation that's portable, yet about optimal on most platforms.
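Concretely, for the scalar case you just let the compiler emit the hardware bit-scan (BSF/TZCNT on x86):

```c
#include <stdint.h>

/* GCC/Clang builtin; compiles to a single bit-scan instruction
   on x86.  Undefined for v == 0, same as the de Bruijn version. */
int lowest_set_bit(uint32_t v)
{
    return __builtin_ctz(v);
}
```

That's what makes the portable route attractive: the compiler picks the best instruction per target, and the de Bruijn fallback is only needed where no such builtin exists.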

12. Originally Posted by m^2
Look at LZ4.
Hardware / compiler support turned out to be much better than de Bruijn.
Overall, LZ4 gives you an implementation that's portable, yet about optimal on most platforms.
My take is that the reason for using the de Bruijn method is that it agrees well with SIMD. I'm pretty sure Intel doesn't offer that bit-scan instruction in a vectorized form.
