Thread: PAQ compression in hardware?

  1. #1
    Member
    Join Date
    May 2014
    Location
    Canada
    Posts
    137
    Thanks
    62
    Thanked 21 Times in 12 Posts

    PAQ compression in hardware?

    Has anyone looked into custom silicon for the PAQ algorithm?
    Or an FPGA?

    I think medical image archives, which must use lossless compression,
    would be interested in this.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,693
    Thanks
    267
    Thanked 1,180 Times in 651 Posts
    FPGA/ASIC would have the same problems as hardware miners for scrypt-based cryptocoins.
    PAQ requires a lot of memory for good compression, so to compete with x86 you'd need a complex memory controller
    for something like DDR5, plus 2-3 levels of caches.
    And anyway, a custom chip at 10nm/3GHz would cost millions (per chip), so nobody would make it.

    Maybe if/once Google's TPUs and similar hardware become accessible, we could design a new codec to make use of that architecture.
    But it wouldn't be PAQ as it is.

  3. #3
    Member
    Join Date
    May 2014
    Location
    Canada
    Posts
    137
    Thanks
    62
    Thanked 21 Times in 12 Posts
    Thanks. I doubt Google will ever sell the TPU, but it is available in the cloud.
    GPU memory speeds are going up with HBM2, but local on-chip memory
    (cache) is still very small, up to 32 KB.

  4. #4
    Member
    Join Date
    May 2014
    Location
    Canada
    Posts
    137
    Thanks
    62
    Thanked 21 Times in 12 Posts
    What level of precision is needed for the PAQ algorithm? Standard arithmetic coding requires 32 bits, I believe.
    If 16 bits is enough, the newer GPUs support 16-bit packed math ops, which would
    double the effective compute throughput of these cards.

  5. #5
    Member
    Join Date
    May 2014
    Location
    Canada
    Posts
    137
    Thanks
    62
    Thanked 21 Times in 12 Posts
    Also, how does PAQ perform on image data, rather than text?

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,693
    Thanks
    267
    Thanked 1,180 Times in 651 Posts
    > Thanks. I doubt Google will ever sell the TPU, but it is available in the cloud.

    I wouldn't be surprised if something similar gets added as part of
    Intel's next instruction set extension.

    > GPU memory speeds are going up with HBM2, but local on-chip memory
    > (cache) is still very small, up to 32 KB.

    Unfortunately HBM is high-bandwidth, not low-latency.
    While paq encoding can be parallelized by sorting the data bits by context,
    decoding speed mostly depends on random-access latency.

    > What level of precision is needed for the PAQ algorithm? Standard arithmetic coding requires 32 bits, I believe.

    16 bits may actually be enough, with non-linear quantization.
    Paq uses 8-bit counter states and 12-bit probabilities.
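
    As a rough illustration of the fixed-point arithmetic involved (a minimal sketch, not paq8px's actual counter code), a bit predictor with a 12-bit probability can be updated with integer shifts only:

    #include <cstdint>

    // Sketch: an adaptive bit predictor with a 12-bit probability (0..4095),
    // updated with integer arithmetic only - no floating point required.
    struct BitPredictor {
        uint16_t p = 2048;                      // p(bit == 1), scaled to 12 bits
        void update(int bit, int rate = 5) {
            if (bit) p += (4095 - p) >> rate;   // move p toward 4095
            else     p -= p >> rate;            // or toward 0
        }
    };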

    > If 16 bits is enough, the newer GPUs support 16-bit packed math ops,
    > which would double the effective compute throughput of these cards.

    If you mean the new 16-bit FP type, it doesn't really help, because current C/C++
    compilers are unable to work with floats in a consistent way.
    It certainly could be useful, but we'd get incompatible formats with every
    compiler update or even a change in compile options.

    Also, precision is not really much of an issue - hashtable lookups are.
    For example, see this: https://github.com/hxim/paq8px/blob/...q8px.cpp#L3160

    Sure, mixing can be vectorized - see https://github.com/hxim/paq8px/blob/...q8px.cpp#L1186
    And we can actually improve paq speed by optimizing these functions.
    But first, we need to load the probabilities for all of the contexts.
    And they are kept in a hashtable, so it's not even a direct memory read by hash value -
    there's also some scanning involved, to resolve hash collisions.
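
    To give an idea of what such a lookup involves (a simplified sketch; the slot layout, bucket size and checksum width here are assumptions, not the actual paq8px HashTable): the hash selects a small bucket, and a few adjacent slots are scanned for a matching checksum before any counters can be read:

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Simplified context hashtable: each 16-byte slot holds a 1-byte checksum
    // followed by counter states. Lookup = hash -> bucket -> scan a few slots.
    class ContextTable {
        static constexpr size_t SLOT = 16;
        std::vector<uint8_t> t;                     // power-of-two number of slots
        size_t mask;
    public:
        explicit ContextTable(size_t bytes) : t(bytes), mask(bytes / SLOT - 1) {}
        uint8_t* operator[](uint32_t h) {
            uint8_t chk = uint8_t(h >> 24);         // checksum to detect collisions
            size_t base = (h & mask) & ~size_t(3);  // bucket of 4 adjacent slots
            for (size_t i = 0; i < 4; ++i) {        // the scan: each probe is
                uint8_t* slot = &t[(base + i) * SLOT];  // a random memory access
                if (slot[0] == chk) return slot + 1;
            }
            uint8_t* slot = &t[base * SLOT];        // no match: recycle a slot
            std::memset(slot, 0, SLOT);
            slot[0] = chk;
            return slot + 1;
        }
    };

    The mixer's dot product over the gathered predictions vectorizes nicely; it's these dependent, cache-missing probes that dominate the time.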

    > Also, how does PAQ perform on image data, rather than text?

    Pretty good, although its image models are relatively simple.
    Lossless image compression has to deal with lots of noise, so
    more complicated models are not necessarily better.

    http://www.squeezechart.com/bitmap.html
    http://maximumcompression.com/data/bmp.php

  7. Thanks:

    boxerab (26th July 2017)

  8. #7
    Member
    Join Date
    May 2014
    Location
    Canada
    Posts
    137
    Thanks
    62
    Thanked 21 Times in 12 Posts
    >> Thanks. I doubt Google will ever sell the TPU, but it is available in the cloud.

    > I wouldn't be surprised if something similar gets added as part of
    > Intel's next instruction set extension.

    Yes, GPUs and TPUs are a big threat to Intel.

    By the way, Google is making a free TPU cloud available for researchers:
    https://techcrunch.com/2017/05/17/th...to-scientists/

    >> GPU memory speeds are going up with HBM2, but local on-chip memory
    >> (cache) is still very small, up to 32 KB.

    > Unfortunately HBM is high-bandwidth, not low-latency.
    > While paq encoding can be parallelized by sorting the data bits by context,
    > decoding speed mostly depends on random-access latency.

    Yes, latency needs to be hidden with many parallel threads. If the algo is
    very serial, this won't work out very well.

    >> What level of precision is needed for the PAQ algorithm? Standard arithmetic coding requires 32 bits, I believe.

    > 16 bits may actually be enough, with non-linear quantization.
    > Paq uses 8-bit counter states and 12-bit probabilities.

    >> If 16 bits is enough, the newer GPUs support 16-bit packed math ops,
    >> which would double the effective compute throughput of these cards.

    > If you mean the new 16-bit FP type, it doesn't really help, because current C/C++
    > compilers are unable to work with floats in a consistent way.
    > It certainly could be useful, but we'd get incompatible formats with every
    > compiler update or even a change in compile options.

    Yes, but this is 16-bit FP on the GPU, so the AMD/NVIDIA compiler will handle it.

    > Also, precision is not really much of an issue - hashtable lookups are.
    > For example, see this: https://github.com/hxim/paq8px/blob/master/paq8px.cpp#L3160

    How large are the hash tables? Small enough to fit in a few KB?

    > Sure, mixing can be vectorized - see https://github.com/hxim/paq8px/blob/...q8px.cpp#L1186
    > And we can actually improve paq speed by optimizing these functions.
    > But first, we need to load the probabilities for all of the contexts.
    > And they are kept in a hashtable, so it's not even a direct memory read by hash value -
    > there's also some scanning involved, to resolve hash collisions.

    I see.

    >> Also, how does PAQ perform on image data, rather than text?

    > Pretty good, although its image models are relatively simple.
    > Lossless image compression has to deal with lots of noise, so
    > more complicated models are not necessarily better.
    > http://www.squeezechart.com/bitmap.html
    > http://maximumcompression.com/data/bmp.php

    Very good. Fascinating stuff! Thanks.

  9. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,693
    Thanks
    267
    Thanked 1,180 Times in 651 Posts
    > Yes, latency needs to be hidden with many parallel threads.
    > If the algo is very serial, this won't work out very well.

    It's normal for compression to be very serial, though.
    The whole point is that you can use already-known data
    to improve prediction of the still-unknown data.
    And vice versa: if we reduce the amount of known data
    due to parallelism, it hurts prediction (and thus compression ratio).

    So it's usually easy enough to parallelize the encoder without
    compression loss (because all of the data is known anyway).
    But the decoder has to wait for the context to become available.

    > Yes, but this is 16-bit FP on the GPU, so the AMD/NVIDIA compiler will handle it.

    I don't think it matters. FP arithmetic is tricky - (a+b)+c != a+(b+c), etc.
    And compiler writers usually try to improve its speed, not its consistency.
    It works well enough for games etc., though there can be occasional glitches
    like the NFS timer bug.

    But in compression we need algorithms to produce exactly the same
    values each time, even if the program is recompiled on a different platform or something.

    I mean exactly the same binary values, not approximately the same
    plus/minus 1E-10.

    And with FP arithmetic that's impossible.
    Well, possible, but only if everything is written manually in asm.
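
    A tiny demonstration of the reordering problem (hypothetical values, just to show why bit-exact FP results across compilers and compile options are hard to guarantee):

    #include <cstdio>

    int main() {
        // Floating-point addition is not associative, so the result depends on
        // evaluation order - which the compiler may change (vectorization,
        // FMA contraction, -ffast-math, different optimization levels, ...).
        float a = 1e8f, b = -1e8f, c = 1.0f;
        std::printf("%g vs %g\n", (a + b) + c, a + (b + c));   // prints "1 vs 0"
        return 0;
    }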

    > How large are the hash tables? Small enough to fit in a few KB?

    ~1 GB in paq, ~32 GB in cmix.

    Even a plain order-1 model needs 64 KB already.
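
    One way that 64 KB figure arises (a sketch, not any specific paq source): a bitwise order-1 model indexed by the previous byte plus the already-seen bits of the current byte needs 256*256 one-byte counters:

    #include <cstdint>

    // Bitwise order-1 model: 256 possible previous bytes, times a 256-node
    // bit tree for the partially decoded current byte = 64 KB of counters.
    struct Order1Model {
        uint8_t counters[256 * 256] = {};
        uint8_t& counter(uint8_t prevByte, unsigned bitTreeNode) {
            // bitTreeNode starts at 1 and collects decoded bits: 1, 2..3, 4..7, ..., 128..255
            return counters[prevByte * 256u + (bitTreeNode & 255u)];
        }
    };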

  10. #9
    Member
    Join Date
    May 2014
    Location
    Canada
    Posts
    137
    Thanks
    62
    Thanked 21 Times in 12 Posts
    Oh well, a 1 GB hash makes it impossible to run on a GPU. I wonder if someone will benchmark it on the new 16-core
    AMD Threadripper chip?

  11. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,693
    Thanks
    267
    Thanked 1,180 Times in 651 Posts
    Modern GPUs have plenty of memory, and in theory it should be possible to port paq there...
    just that it would be slow.

    As to 16-core, I guess you can test zpaq there - it has MT.

  12. #11
    Member
    Join Date
    May 2014
    Location
    Canada
    Posts
    137
    Thanks
    62
    Thanked 21 Times in 12 Posts
    > Modern GPUs have plenty of memory, and in theory it should be possible to port paq there...
    > just that it would be slow.

    Yes, for a randomly accessed buffer such as a hash table, you really need it in on-chip memory, or performance will be poor.

    > As to 16-core, I guess you can test zpaq there - it has MT.

    Thanks. zpaq seems to be aimed at incremental backup. Which PAQ variant would be suitable for 2D image compression?

  13. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,693
    Thanks
    267
    Thanked 1,180 Times in 651 Posts
    > Thanks. zpaq seems to be aimed at incremental backup. Which PAQ variant would be suitable for 2D image compression?

    Most of them have some image models... For best compression you can check the recent paq8px releases.

    But the image models in paq are implemented the same way as everything else - i.e. hashtables and sequential bit-by-bit
    coding, which can't be parallelized.

    And unlike universal compression, it _is_ possible to parallelize lossless image compression.

  14. #13
    Member
    Join Date
    May 2014
    Location
    Canada
    Posts
    137
    Thanks
    62
    Thanked 21 Times in 12 Posts
    Excellent. Thanks.

  15. #14
    Member
    Join Date
    Dec 2014
    Location
    Berlin
    Posts
    29
    Thanks
    36
    Thanked 26 Times in 12 Posts
    ZPAQ is a format which stores the needed decompression algorithm in the archive. If a hardware implementation is to decompress any ZPAQ block, it has to include a CPU for the ZPAQL bytecode along with the standard components for context mixing. FPGAs would be suited for this, I guess (as long as the model doesn't use >1 GB of RAM).

  16. Thanks:

    boxerab (6th August 2017)

  17. #15
    Member
    Join Date
    May 2014
    Location
    Canada
    Posts
    137
    Thanks
    62
    Thanked 21 Times in 12 Posts
    So, for image compression, the standard approach is to transform the image (DCT or wavelet) to decorrelate it, then break it into blocks and entropy-encode them (Huffman or arithmetic coding).
    Blocks can be around 1KB in size. So, how would PAQ fit into this scheme? Would it require such a large hash table for 1KB blocks? The blocks are encoded independently.

    For lossless JPEG 2000, for example, you can expect around 2:1 compression. How much better could PAQ get?

  18. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,693
    Thanks
    267
    Thanked 1,180 Times in 651 Posts
    DCT is only good for lossy compression.
    For audio it does at least make some sense - there really are natural sources of sinusoidal waves and harmonics
    (though even for audio, lossless compression of DCT coefficients produces much worse results than direct LPC modelling).
    And for images there's really little reason to use specifically the DCT - there can be periodic patterns sometimes,
    but a spectral representation is no more useful than the original pixels.

    > So, how would PAQ fit into this scheme?

    PAQ (like other CM compressors; CM = context model) works by estimating the probability distribution of the next data bit,
    using a statistical data model (actually multiple models, which are mixed together in a NN-like way).
    After the bit is encoded/decoded, the model is updated with the actual bit value, the relevant stats are adjusted, etc.

    CM is basically compression that consists of entropy coding only.
    For example, the LZMA entropy coder is a fairly standard order-1 CM.
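
    In code terms, the whole CM loop looks roughly like this (a sketch with placeholder interfaces - Model, Mixer and ArithEncoder here are illustrative stand-ins, not paq8px's classes):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Illustrative stand-ins (NOT paq's logistic mixer or carryless coder),
    // just enough to make the loop below concrete and compilable.
    struct Model {
        virtual int predict() = 0;           // returns p(bit == 1) as a 12-bit value
        virtual void update(int bit) = 0;    // adjust stats with the actual bit
        virtual ~Model() = default;
    };
    struct Mixer {
        std::vector<int> ps;
        void reset() { ps.clear(); }
        void add(int p) { ps.push_back(p); }
        int mix() {                          // plain average here; paq instead
            long s = 0;                      // mixes stretched probabilities
            for (int p : ps) s += p;         // with adaptively trained weights
            return ps.empty() ? 2048 : int(s / (long)ps.size());
        }
        void update(int /*bit*/) {}          // paq would adapt mixing weights here
    };
    struct ArithEncoder {
        void encode(int /*bit*/, int /*p12*/) { /* range coder would go here */ }
    };

    // The generic CM loop: predict each bit, entropy-code it with that
    // probability, then update every model with the actual bit value.
    void compress(const uint8_t* data, size_t n, ArithEncoder& ac,
                  std::vector<Model*>& models, Mixer& mixer) {
        for (size_t i = 0; i < n; ++i)
            for (int b = 7; b >= 0; --b) {          // bit by bit, MSB first
                int bit = (data[i] >> b) & 1;
                mixer.reset();
                for (Model* m : models)             // each model contributes p(1)
                    mixer.add(m->predict());
                int p = mixer.mix();                // combined prediction
                ac.encode(bit, p);                  // code the bit with it
                for (Model* m : models)             // then everything adapts
                    m->update(bit);
                mixer.update(bit);
            }
    }

    The decoder runs exactly the same loop, except the bit comes out of the arithmetic decoder instead of the input buffer - which is why it can't skip ahead.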

    > Would it require such a large hash table for 1KB blocks?

    Specifically with PAQ, yes - it's just how it is designed.

    But in general it can be implemented differently - we can simply quantize the necessary contexts
    and read the stats directly from counter arrays rather than a single hashtable.
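
    For example (a hypothetical layout, not taken from any paq model): quantize a few neighbor pixels to a handful of bits each, and the whole context space becomes small enough for a flat counter array, with no hashing or collision scans:

    #include <cstdint>
    #include <vector>

    // Direct-indexed counters for an image model: the context is built from
    // quantized neighbors, so every context maps to a unique array slot.
    struct DirectImageModel {
        // 3 neighbors at 4 bits each + 8 bits of partial current byte
        // = 20 bits of context -> 1M one-byte counters (1 MB), no hashtable.
        std::vector<uint8_t> counters = std::vector<uint8_t>(1u << 20);

        static uint32_t q4(uint8_t v) { return v >> 4; }   // 0..255 -> 0..15

        uint8_t& counter(uint8_t w, uint8_t n, uint8_t nw, unsigned partialByte) {
            uint32_t ctx = (q4(w) << 16) | (q4(n) << 12)
                         | (q4(nw) << 8) | (partialByte & 255u);
            return counters[ctx];
        }
    };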

    > The blocks are encoded independently.

    Encoding 1KB blocks of an actual image independently would be a bad idea; even JPEG usually uses solid entropy coding for the whole image.

    But an image has lots of various components, so it might make sense after some preprocessing, like colorspace and quadtree transformations.
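
    One standard example of such a colorspace step is the reversible color transform from lossless JPEG 2000 - integer-only and exactly invertible, so it loses no information:

    // RCT (as in lossless JPEG 2000): decorrelates RGB with integer math only;
    // the inverse reconstructs the original pixel values exactly.
    inline void rct_forward(int r, int g, int b, int& y, int& cb, int& cr) {
        y  = (r + 2 * g + b) >> 2;    // floor((R + 2G + B) / 4)
        cb = b - g;
        cr = r - g;
    }
    inline void rct_inverse(int y, int cb, int cr, int& r, int& g, int& b) {
        g = y - ((cb + cr) >> 2);     // >> 2 acts as floor division here
        r = cr + g;
        b = cb + g;
    }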

    > you can expect around 2:1 compression. How much better could PAQ get?

    You can download it (like paq8px or paq8pxd) and actually test with your own samples.
    Or you can look here: http://www.squeezechart.com/bitmap.html - old version of paq8px
    seems to be ~15% better than jpeg2k/kakadu.

    That's normal for lossless compression, because most of the compressed data size is taken up by noise.
    Paq can also compress lossy JPEGs, with a similar gain (10-30%).

  19. #17
    Member
    Join Date
    May 2014
    Location
    Canada
    Posts
    137
    Thanks
    62
    Thanked 21 Times in 12 Posts
    >> The blocks are encoded independently.

    > Encoding 1KB blocks of an actual image independently would be a bad idea; even JPEG usually uses solid entropy coding for the whole image.

    > But an image has lots of various components, so it might make sense after some preprocessing, like colorspace and quadtree transformations.

    Well, the blocks allow the entropy coding to be parallelized in a coarse-grained way.

    >> you can expect around 2:1 compression. How much better could PAQ get?

    > You can download it (like paq8px or paq8pxd) and actually test with your own samples.
    > Or you can look here: http://www.squeezechart.com/bitmap.html - old version of paq8px
    > seems to be ~15% better than jpeg2k/kakadu.

    Thanks. I guess for a ~15% improvement, it wouldn't be worth the effort.
