Thread: PAQ and CMIX as a hardware circuit?

  1. #1
    Member JamesWasil
    Join Date
    Dec 2017
    Location
    Arizona
    Posts
    102
    Thanks
    95
    Thanked 18 Times in 17 Posts

    PAQ and CMIX as a hardware circuit?

    One of the drawbacks of PAQ and CMIX is that they take huge amounts of RAM and CPU power to compress only a little better than LZMA / XZ / 7-Zip, BWT, RAR, and the long list of LZ-variant algorithms out there.

    But what if those resources were moved safely off the host computer onto a dedicated SBC (single-board computer), with the compressor implemented on an EEPROM or an FPGA and attached as either a PCIe or a USB 2.0+ device?

    If the device preallocates the storage needed for a file passed to it, as a temp file on an SLC SSD, and then uses on-board DDR4 RAM (32GB or more) to meet the current RAM requirements of CMIX and PAQ, would it be fast enough to run as a hardware circuit in parallel with the computer, using <1% of the system CPU to watch it?

    (Basically, one can look at this as a hardware implementation of telling a computer with resources on a network: "I want you to grab this file and make a copy of it from the host on your local machine. Use all the RAM and CPU that you have there as needed, but keep me updated on the estimated completion time. Don't use any of the resources on the host machine; contain all of that over there, and when the compressed file is ready, send it back here to the host over the network.")

    The network would be, in this case, the hardware bus for any device or implementation that does this, and the remote machine would be the dedicated SBC.

    As a hardware circuit it wouldn't need any resources from the host machine, and could work even on older 386 or 486 systems with a USB card, yet would still work hundreds to thousands of times faster per cycle than today's best laptops and desktops do when running programs as close to bare metal as possible.

    It would likely make things like LZ4 and LZTurbo seem near-instant. More importantly, time-consuming algorithms whose resource demands are (sometimes) out of reach become attainable and modular for any device with a USB connection: a dedicated computer can be added instantly so that runs of hours, days, or weeks (!) take only seconds, minutes, or days instead.

    The only thing that would run on the host would be a type of driver manager program to monitor things, queue compression requests, and move compressed files to the destination directory after they're completed.

    Anyone without the device but with adequate hardware (32GB of RAM or more, if required) could still compress and decompress, just more slowly. But a device like this would make long compression cycles far more productive and their results easier to test, and a dedicated device would help streamline compression development.

    BUT is it still practical to do this in today's world?

    Or have "tablets" and "mobile devices", as they have been marketed to the masses, ruined the chance of making this mainstream, even as a USB device, after the 2010s?
    Last edited by JamesWasil; 3rd January 2020 at 10:10. Reason: Accidentally put a thumbs down icon rather than a lightbulb

  2. #2
    Member JamesWasil
    (And if it's still possible, practical, and economically viable to do this, would one or two NUC SBCs working in parallel, based on anything from an Intel Celeron to an Intel Core i7, be enough to handle CMIX and PAQ and support the cache sizes and other prerequisites needed? I was even thinking of a self-contained system, either a board or a series of boards with parallel Intel Atom N270s or similar, but they'd probably run way too hot and still not be fast enough. What do you guys think?)

  3. #3
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,974
    Thanks
    296
    Thanked 1,301 Times in 738 Posts
    The only problem with CMs is that they don't fit in caches. Order-1 statistics already won't fit in L1, etc.
    And there are hundreds of submodels, most of them requiring random access to a hashtable, which stalls for 1000 clocks due to cache misses, then TLB misses.
    Meanwhile, even a naive ASIC/FPGA port could run the submodels in parallel (they're independent, but have to sync too frequently to run in MT),
    so with 16 or 32MB of SRAM, a 2GHz clock, and 50 clocks per bit, we'd get ~5MB/s, which is faster than the encoding speed of most good LZs :)
    Unfortunately, it would be very expensive to do on your own: you'd need something like this for the FPGA: https://www.xilinx.com/products/boar...ts/zcu111.html .
    Of course, there are cheap FPGAs, but these have frequencies like 30MHz, and at that speed it would be no better than on a PC.
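
A quick check of the arithmetic above, as a minimal sketch: a bitwise CM coder that spends a fixed number of clocks per coded bit processes clock / clocks_per_bit bits per second. The function name is made up for illustration, and applying the same 50 clocks per bit to a 30MHz part is an assumption, not something stated in the post.

```python
def cm_throughput_mb_per_s(clock_hz: float, clocks_per_bit: float) -> float:
    """Throughput in MB/s (1 MB = 10**6 bytes) of a coder that needs
    `clocks_per_bit` clock cycles for every coded bit."""
    bits_per_second = clock_hz / clocks_per_bit
    return bits_per_second / 8 / 1e6


# 2GHz clock, 50 clocks per bit: the figures quoted in the post.
fast = cm_throughput_mb_per_s(2e9, 50)
# 30MHz cheap FPGA, assuming the same 50 clocks per bit (an assumption).
slow = cm_throughput_mb_per_s(30e6, 50)

print(f"2GHz:  {fast:.3f} MB/s")
print(f"30MHz: {slow:.3f} MB/s")
```

Under those assumptions the 2GHz design gives 5 MB/s and the 30MHz one about 0.075 MB/s.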

    However, there are TPUs now, which imho have a compatible architecture.
    They also have limited precision, so a direct port of PAQ8* is impossible, but I wouldn't be surprised if some TPU-based LSTM compressor appears any day now.

  4. Thanks:

    JamesWasil (3rd January 2020)

  5. #4
    Programmer Bulat Ziganshin
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    what about decompression? do you mean 16 MB SRAM per model, i.e. 1GB per entire chip? The device you mentioned has only 8 MB SRAM (for $9K price )

    afaik, TPUs efficiently perform only a+=x*y operation. Do you think they have efficient random memory access?

  6. #5
    Administrator Shelwien
    > what about decompression?

    Symmetric of course.
    But with platform-specific optimizations and without keeping the format compatibility we can probably reach 100MB/s or so.
    CM models are pretty simple in the end, the only problem is RAM.

    It's also possible to apply some pipeline optimization for the encoder (all the data is available, so submodels
    can run asynchronously for whole blocks) and speculative optimizations for the decoder (predict the next byte value,
    then continue updates etc. as if it were known, without waiting for the entropy decoder).

    Also, we can add some asymmetry: detect useless submodels during encoding and disable them for the decoder.
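
The post does not say how "useless" would be detected. One hypothetical heuristic, sketched below, is to total each submodel's coding cost in bits during encoding and flag any submodel that averages no better than a coin flip (1 bit per coded bit). The function name, the threshold, and the input layout are all assumptions made for illustration; a real encoder would also have to signal the disabled set to the decoder.

```python
import math


def useless_submodels(predictions, bits, threshold=1.0):
    """Return the ids of submodels whose average cost is >= `threshold`.

    predictions[m][t] is submodel m's probability that bit t is 1;
    bits[t] is the bit that was actually coded (0 or 1).
    """
    flagged = []
    for model_id, probs in enumerate(predictions):
        cost = 0.0
        for p, bit in zip(probs, bits):
            p_actual = p if bit else 1.0 - p
            # Ideal code length in bits; clamp to avoid log2(0).
            cost += -math.log2(max(p_actual, 1e-6))
        if cost / len(bits) >= threshold:
            flagged.append(model_id)
    return flagged


bits = [1, 1, 0, 1]
predictions = [
    [0.9, 0.9, 0.1, 0.9],  # confident and right: cheap
    [0.5, 0.5, 0.5, 0.5],  # coin flip: exactly 1 bit per bit
    [0.2, 0.2, 0.8, 0.2],  # confident and wrong: expensive
]
print(useless_submodels(predictions, bits))
```

With this toy input the second and third submodels are flagged and the first is kept.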

    > do you mean 16 MB SRAM per model, i.e. 1GB per entire chip? The device you mentioned has only 8 MB SRAM (for $9K price )

    I meant total; submodels can use the same common hashtable, since the difference is mostly in the contexts.
    8MB should be enough to show some paq-like results on small files... though of course more memory is better.
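
A minimal sketch of the shared-hashtable idea, assuming a single table of 12-bit bit probabilities that every submodel indexes with a hash of its own context. The class name, table size, hash constants, and update rate are illustrative choices, not taken from PAQ or CMIX, and collisions are left unchecked for brevity.

```python
TABLE_BITS = 16  # hypothetical size: 2**16 slots shared by all submodels


class SharedCounterTable:
    """One table of 12-bit probabilities indexed by (submodel, context)."""

    def __init__(self, bits=TABLE_BITS):
        self.mask = (1 << bits) - 1
        # P(next bit = 1) scaled to 0..4095; start at one half.
        self.prob = [2048] * (1 << bits)

    def slot(self, model_id, context):
        # Multiplicative hash; the model id is mixed in so that submodels
        # with equal context values usually land in different slots.
        h = ((context + 1) * 0x9E3779B1 + model_id * 0x85EBCA6B) & 0xFFFFFFFF
        return (h >> 12) & self.mask

    def predict(self, model_id, context):
        return self.prob[self.slot(model_id, context)]

    def update(self, model_id, context, bit, rate=4):
        i = self.slot(model_id, context)
        target = 4095 if bit else 0
        # Move the estimate a fraction of the way toward the coded bit.
        self.prob[i] += (target - self.prob[i]) >> rate


def contexts(history):
    """Order-1 and order-2 contexts from the last bytes seen."""
    c1 = history[-1] if len(history) >= 1 else 0
    c2 = history[-2] if len(history) >= 2 else 0
    return [c1, (c2 << 8) | c1]


table = SharedCounterTable()
ctxs = contexts(b"ab")
for _ in range(20):
    for model_id, ctx in enumerate(ctxs):
        table.update(model_id, ctx, bit=1)
print([table.predict(m, c) for m, c in enumerate(ctxs)])
```

Both submodels read and write the same array; only the context that feeds the hash differs, which is the point made in the reply above.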

    I'm also not sure if there's some associative memory or something which could directly replace the hashtable.

    > afaik, TPUs efficiently perform only a+=x*y operation.

    They also have LUTs, and I think they can be used to approximate logistic mixing.
    https://github.com/hxim/paq8px/blob/...q8px.cpp#L1673
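
For readers unfamiliar with logistic mixing, here is a rough sketch of how lookup tables can stand in for the two nonlinear functions involved: stretch(p) = ln(p / (1 - p)) and its inverse squash(x) = 1 / (1 + e^(-x)). Probabilities are 12-bit integers and the stretch domain is scaled by 256. The table sizes, the floating-point weights, and the learning rate are assumptions for illustration; this is not the paq8px code behind the link, which uses fixed-point arithmetic.

```python
import math

# squash LUT: x in -2047..2047 (scaled by 256) -> 12-bit probability.
SQUASH = [min(4095, int(4096 / (1 + math.exp(-x / 256))))
          for x in range(-2047, 2048)]


def squash(x):
    x = max(-2047, min(2047, x))
    return SQUASH[x + 2047]


def _stretch_value(p):
    p = min(max(p, 1), 4095)  # avoid log(0)
    x = round(256 * math.log(p / (4096 - p)))
    return max(-2047, min(2047, x))


# stretch LUT: 12-bit probability -> scaled logit.
STRETCH = [_stretch_value(p) for p in range(4096)]


def mix(probs, weights):
    """Weighted sum of stretched inputs, squashed back to a probability."""
    st = [STRETCH[p] for p in probs]
    dot = sum(w * s for w, s in zip(weights, st))
    return st, squash(int(round(dot)))


def train(weights, st, p_mix, bit, lr=0.02):
    """Gradient step on coding cost: w_i += lr * (bit - p) * stretch(p_i)."""
    err = (bit * 4095 - p_mix) / 4096
    for i, s in enumerate(st):
        weights[i] += lr * err * s / 256


weights = [0.3, 0.3]
inputs = [3000, 1000]  # two submodel predictions of P(bit = 1)
_, p_before = mix(inputs, weights)
for _ in range(200):
    st, p = mix(inputs, weights)
    train(weights, st, p, bit=1)
_, p_after = mix(inputs, weights)
print(p_before, p_after)
```

The only nonlinear steps are the two table reads; everything else is multiply-accumulate, which is why an a+=x*y engine with LUTs looks like a plausible fit.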

    > Do you think they have efficient random memory access?

    Afaik they have some embedded memory per computing cell, which could be used for mixer weights.
