Has anyone looked into custom silicon for PAQ algorithm?
Or FPGA ?
I think medical image archives, which must use lossless, would
be interested in this.
FPGA/ASIC would have the same problems as hardware miners for scrypt-based cryptocoins.
PAQ requires a lot of memory for good compression, so you need a complex memory controller for something like DDR5,
plus 2-3 levels of caches, to compete with x86.
And then anyway, a custom chip at 10nm/3 GHz would cost millions (per chip), so nobody would make it.
Maybe once/if Google's TPUs and such become accessible, we could design a new codec to make use of that architecture.
But it won't be paq as it is.
Thanks. I doubt Google will ever sell the TPU, but it is available in the cloud.
GPU memory speeds are going up with HBM2, but the local on-chip cache
is still very small, up to 32 KB.
What level of precision is needed for the PAQ algorithm? Standard AC requires 32 bits, I believe.
If 16 bits is enough, the newer GPUs support 16-bit packed math ops, so this would
double the compute throughput of these cards.
Also, how does PAQ perform on image data, rather than text ?
> Thanks. I doubt Google will ever sell the TPU, but it is available in the cloud.
I wouldn't be surprised if something similar is added as part of the next
Intel instruction set extension.
> GPU memory speeds are going up with HBM2, but the local on-chip cache
> is still very small, up to 32 KB.
Unfortunately HBM is high-bandwidth, not low-latency.
While paq encoding can be parallelized by sorting data bits by context,
decoding speed mostly depends on random-access latency.
> What level of precision is needed for the PAQ algorithm? Standard AC requires 32 bits, I believe.
16 bits may be enough actually, with non-linear quantization.
Paq uses 8-bit counter states and 12-bit probabilities.
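For illustration, here's a minimal sketch (plain C++, not the actual paq8px code, which uses lookup tables and slightly different scaling) of the non-linear quantization idea: probabilities live in 12 bits and get mapped through the logistic stretch/squash pair, so 16-bit fixed point is plenty:

    #include <algorithm>
    #include <cmath>

    // stretch(p) = ln(p/(1-p)); p is a 12-bit probability (1..4095).
    // The result fits comfortably in 16-bit fixed point (roughly +/-2100).
    inline int stretch(int p) {
        double x = p / 4096.0;
        return (int)std::lround(256.0 * std::log(x / (1.0 - x)));
    }

    // squash(x) is the inverse: maps a stretched value back to ~0..4095.
    inline int squash(int x) {
        x = std::max(-2047, std::min(2047, x));
        return (int)std::lround(4096.0 / (1.0 + std::exp(-x / 256.0)));
    }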
> If 16 bits is enough, the newer GPUs support 16-bit packed math ops,
> so this would double the compute throughput of these cards.
If you mean the new 16-bit FP type, it doesn't really help, because current C/C++
compilers are unable to work with floats in a consistent way.
It could certainly be useful, but we'd get incompatible formats with each
compiler update or even a change in compile options.
Also, precision is not really much of an issue - hashtable lookups are.
For example, see this: https://github.com/hxim/paq8px/blob/...q8px.cpp#L3160
Sure, mixing can be vectorized - see https://github.com/hxim/paq8px/blob/...q8px.cpp#L1186
And we can actually improve paq speed by optimizing these functions.
But first, we need to load the probabilities for all of the contexts.
And they are kept in a hashtable, so it's not even a direct memory read by hash value -
there's also some scanning involved to resolve hash collisions.
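To make that concrete, here's a simplified illustration (not paq8px's actual HashTable class, which packs things differently) of why one context lookup is more than a single memory read - the hash picks a small bucket, and the bucket is then scanned to match a checksum and resolve collisions:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Slot { uint8_t checksum; uint8_t states[7]; };   // 8 bytes per slot

    struct ContextTable {
        std::vector<Slot> slots;                     // size is a power of two
        explicit ContextTable(size_t n) : slots(n) {}

        uint8_t* find(uint64_t cxt_hash) {
            size_t bucket = (cxt_hash & (slots.size() - 1)) & ~size_t(3); // 4-slot bucket
            uint8_t chk = uint8_t(cxt_hash >> 56);   // checksum to verify the match
            for (size_t i = 0; i < 4; i++) {         // linear scan of the bucket
                Slot& s = slots[bucket + i];
                if (s.checksum == chk || s.checksum == 0) {  // hit, or an empty slot
                    s.checksum = chk;
                    return s.states;
                }
            }
            // all slots taken by other contexts: evict one (grossly simplified)
            slots[bucket + 3] = Slot{chk, {}};
            return slots[bucket + 3].states;
        }
    };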
> Also, how does PAQ perform on image data, rather than text ?
Pretty good, although its image models are relatively simple.
Lossless image compression has to deal with lots of noise, so
more complicated models are not necessarily better.
http://www.squeezechart.com/bitmap.html
http://maximumcompression.com/data/bmp.php
>> Thanks. I doubt Google will ever sell the TPU, but it is available in the cloud.
> I wouldn't be surprised if something similar is added as part of the next
> Intel instruction set extension.
Yes, GPUs and TPUs are a big threat to Intel.
By the way, Google is making a free TPU cloud available for researchers:
https://techcrunch.com/2017/05/17/th...to-scientists/
>> GPU memory speeds are going up with HBM2, but the local on-chip cache
>> is still very small, up to 32 KB.
> Unfortunately HBM is high-bandwidth, not low-latency.
> While paq encoding can be parallelized by sorting data bits by context,
> decoding speed mostly depends on random-access latency.
Yes, latency needs to be hidden with many parallel threads. If the algo is
very serial, this won't work out very well.
>> What level of precision is needed for the PAQ algorithm? Standard AC requires 32 bits, I believe.
> 16 bits may be enough actually, with non-linear quantization.
> Paq uses 8-bit counter states and 12-bit probabilities.
>> If 16 bits is enough, the newer GPUs support 16-bit packed math ops,
>> so this would double the compute throughput of these cards.
> If you mean the new 16-bit FP type, it doesn't really help, because current C/C++
> compilers are unable to work with floats in a consistent way.
> It could certainly be useful, but we'd get incompatible formats with each
> compiler update or even a change in compile options.
Yes, but this is 16-bit FP on the GPU, so the AMD/nVidia compiler will handle it.
> Also, precision is not really much of an issue - hashtable lookups are.
> For example, see this: https://github.com/hxim/paq8px/blob/master/paq8px.cpp#L3160
How large are the hash tables? Small enough to fit in a few KB?
> Sure, mixing can be vectorized - see https://github.com/hxim/paq8px/blob/...q8px.cpp#L1186
> And we can actually improve paq speed by optimizing these functions.
> But first, we need to load the probabilities for all of the contexts.
> And they are kept in a hashtable, so it's not even a direct memory read by hash value -
> there's also some scanning involved to resolve hash collisions.
I see.
>> Also, how does PAQ perform on image data, rather than text ?
> Pretty good, although its image models are relatively simple.
> Lossless image compression has to deal with lots of noise, so
> more complicated models are not necessarily better.
> http://www.squeezechart.com/bitmap.html
> http://maximumcompression.com/data/bmp.php
Very good. Fascinating stuff! Thanks.
> Yes, latency needs to be hidden with many parallel threads.
> If the algo is very serial, this won't work out very well.
It's normal for compression to be very serial, though.
The whole point is that you can use already known data
to improve prediction of still unknown data.
And vice versa, if we reduce the amount of known data
due to parallelism, it hurts prediction (and thus compression ratio).
So it's usually easy enough to parallelize the encoder without
compression loss (because all of the data is known anyway).
But the decoder has to wait for context to become available.
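A toy example of that loop-carried dependency (not paq code; the order-8 bit-history model here is purely illustrative): the probability for bit i is computed from a context built out of bits 0..i-1, so bit i cannot be decoded before bit i-1 is known, no matter how many threads are available.

    #include <cstdint>
    #include <cstdio>

    int main() {
        const char* bits = "1011001110001011";     // stands in for the decoded output
        uint16_t counts[256][2] = {};              // last 8 bits -> bit counts
        uint8_t  context = 0;

        for (int i = 0; bits[i]; i++) {
            // p = P(next bit = 1) given the previous 8 bits - this is what the
            // arithmetic decoder needs *before* it can produce bit i
            double p = (counts[context][1] + 1.0) /
                       (counts[context][0] + counts[context][1] + 2.0);
            int bit = bits[i] - '0';               // in a real decoder: AC output
            printf("bit %2d: p1=%.3f actual=%d\n", i, p, bit);
            counts[context][bit]++;                // the model update needs the bit too
            context = uint8_t((context << 1) | bit);
        }
        return 0;
    }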
> Yes, but this is 16-bit FP on the GPU, so the AMD/nVidia compiler will handle it.
I don't think it matters. FP arithmetic is tricky - (a+b)+c != a+(b+c), etc.
And compiler writers usually try to improve its speed, not its consistency.
It works well enough for games etc., though there can be occasional glitches,
like the NFS timer bug.
But in compression we need algorithms to produce exactly the same
values each time, even if the program is recompiled on a different platform or something.
I mean exactly the same binary values, not approximately the same
plus/minus 1E-10.
And with FP arithmetic that's impossible.
Well, possible, but only if everything is written manually in asm.
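A tiny example of the kind of inconsistency meant here - the value of a float sum depends on the evaluation order, which the compiler is free to change (e.g. with -ffast-math or when vectorizing):

    #include <cstdio>

    int main() {
        float a = 1e8f, b = -1e8f, c = 1.0f;
        // same mathematical sum, different float results depending on grouping
        printf("%g vs %g\n", (a + b) + c, a + (b + c));  // prints "1 vs 0"
        return 0;
    }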
> How large are the hash tables? Small enough to fit in a few KB?
~1 GB in paq, ~32 GB in cmix.
Even a plain order-1 model needs 64 KB already.
Oh well, a 1 GB hash makes it impossible to run on a GPU. I wonder if someone will benchmark it on the new
16-core AMD Threadripper chip?
Modern GPUs have plenty of memory, and in theory it should be possible to port paq there...
just that it would be slow.
As to 16-core, I guess you can test zpaq there - it has MT.
> Modern GPUs have plenty of memory, and in theory it should be possible to port paq there...
> just that it would be slow.
Yes, for a random-access buffer such as a hash table, you really need it in on-chip memory, or performance will be poor.
> As to 16-core, I guess you can test zpaq there - it has MT.
Thanks. zpaq seems to be for incremental backup. Which PAQ variant would be suitable for 2d image compression ?
> Thanks. zpaq seems to be for incremental backup. Which PAQ variant would be suitable for 2d image compression ?
Most of them have some image models... For best compression you can check the recent paq8px releases.
But image models in paq are implemented the same way as everything else - i.e. hashtables and sequential bit-by-bit
coding, which can't be parallelized.
And unlike universal compression, it _is_ possible to parallelize lossless image compression.
Excellent. Thanks.
ZPAQ is a format which stores the needed decompression algorithm in the archive. If hardware is to decompress any ZPAQ block, it has to include a CPU for the ZPAQL bytecode along with the standard components for context mixing. FPGAs would be suited for this, I guess (if the model doesn't use >1 GB of RAM).
So, for image compression, the standard approach is to transform the image (DCT or wavelet) to decorrelate it, then break it into blocks and entropy-encode them (Huffman or arithmetic coding).
Blocks can be around 1KB in size. So how would PAQ fit into this scheme? Would it require such a large hash table for 1KB blocks? The blocks are encoded independently.
For lossless JPEG 2000, for example, you can expect around 2:1 compression. How much better could PAQ get?
DCT is only good for lossy compression.
For audio it does at least make some sense - there really are natural sources of sinusoidal waves and harmonics
(though even for audio, lossless compression of DCT coefficients produces much worse results than direct LPC modelling).
And for images there's really little reason to use DCT specifically - there can be periodic patterns sometimes,
but a spectral representation is no more useful than the original pixels.
> So, how would PAQ fit into this scheme?
PAQ (like other CM compressors; CM = context model) works by estimating a probability distribution for the next data bit,
using a statistical data model (actually multiple models, which are mixed together in an NN-like way).
After the bit is encoded/decoded, the model is updated to include the actual bit value, relevant stats are adjusted, etc.
CM is basically when compression consists of entropy coding only.
For example, the LZMA entropy coder is a fairly standard order-1 CM.
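For illustration, a minimal two-model sketch of that per-bit loop (simplified hypothetical classes, not paq's actual model set or fixed-point math): each context model gives P(bit=1), a logistic mixer combines them with trainable weights, the arithmetic coder codes the bit with the mixed probability, and then every model and the mixer weights are updated with the actual bit.

    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct BitModel {                      // one context model: counts per context
        std::vector<uint16_t> n0, n1;
        explicit BitModel(size_t size) : n0(size), n1(size) {}
        double p1(uint32_t cx) const { return (n1[cx] + 1.0) / (n0[cx] + n1[cx] + 2.0); }
        void update(uint32_t cx, int bit) { (bit ? n1 : n0)[cx]++; }
    };

    inline double stretch(double p) { return std::log(p / (1.0 - p)); }
    inline double squash(double x)  { return 1.0 / (1.0 + std::exp(-x)); }

    struct Mixer {                         // a single logistic "neuron"
        double w[2] = {0.3, 0.3};
        double in[2];
        double mix(double p0, double p1) {
            in[0] = stretch(p0); in[1] = stretch(p1);
            return squash(w[0] * in[0] + w[1] * in[1]);
        }
        void update(double p_mixed, int bit, double lr = 0.02) {
            double err = bit - p_mixed;    // gradient step on the coding cost
            w[0] += lr * err * in[0];
            w[1] += lr * err * in[1];
        }
    };

    // per bit: p = mixer.mix(m1.p1(cx1), m2.p1(cx2));
    //          the arithmetic coder encodes/decodes `bit` with probability p;
    //          m1.update(cx1, bit); m2.update(cx2, bit); mixer.update(p, bit);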
> Would it require such a large hash table for 1KB blocks?
Specifically with PAQ, yes - it's just how it is designed.
In general, though, it can be implemented differently - we can simply quantize the necessary contexts
and read the stats directly from counter arrays, rather than from a single hashtable.
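As a concrete (purely hypothetical) example of that counter-array alternative for image data: build the context from a few quantized neighbour pixels and index a fixed-size array directly, with no hashing or collision scanning. The layout and quantization below are just illustrative.

    #include <cstdint>
    #include <vector>

    struct PixelContextStats {
        // 5 bits per neighbour (W, N, NW) -> 2^15 contexts, one small counter each
        std::vector<uint16_t> stats = std::vector<uint16_t>(1u << 15);

        static uint32_t q5(uint8_t v) { return v >> 3; }    // quantize 8 -> 5 bits

        uint16_t& at(uint8_t w, uint8_t n, uint8_t nw) {
            uint32_t cx = (q5(w) << 10) | (q5(n) << 5) | q5(nw);
            return stats[cx];                                // one direct array read
        }
    };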
> The blocks are encoded independently.
Encoding 1KB blocks of an actual image independently would be a bad idea - even jpeg usually has solid entropy coding for the whole image.
But an image has lots of various components, so it might make sense after some preprocessing, like colorspace and quadtree transformations.
> you can expect around 2:1 compression. How much better could PAQ get?
You can download it (like paq8px or paq8pxd) and actually test with your own samples.
Or you can look here: http://www.squeezechart.com/bitmap.html - an old version of paq8px
seems to be ~15% better than jpeg2k/kakadu.
That's normal for lossless compression, because most of the compressed data size is taken up by noise.
Paq can also compress lossy jpegs, with a similar gain (10-30%).
>> The blocks are encoded independently.
> Encoding 1KB blocks of an actual image independently would be a bad idea - even jpeg usually has solid entropy coding for the whole image.
> But an image has lots of various components, so it might make sense after some preprocessing, like colorspace and quadtree transformations.
Well, the blocks allow the entropy coding to be parallelized in a coarse-grained way.
>> you can expect around 2:1 compression. How much better could PAQ get?
> You can download it (like paq8px or paq8pxd) and actually test with your own samples.
> Or you can look here: http://www.squeezechart.com/bitmap.html - an old version of paq8px
> seems to be ~15% better than jpeg2k/kakadu.
Thanks. I guess for a ~15% improvement, it wouldn't be worth the effort.