
Thread: How much would it cost to develop a 7z fork that uses Nvidia CUDA / OpenCL?

  1. #1
    Member
    Join Date
    Jul 2014
    Location
    Mars
    Posts
    200
    Thanks
    136
    Thanked 13 Times in 12 Posts

    How much would it cost to develop a 7z fork that uses Nvidia CUDA / OpenCL?

    For, say, LZMA2 compression acceleration.
    As I understand it, the whole LZMA2 algorithm would have to be adapted?
    Is it practically possible with a certain amount of donations, or would it require an impractical amount of time?

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    1. LZMA2 is open-source; you don't specifically have to ask Igor Pavlov to port it.

    2. Just "using" CUDA/OpenCL is mostly a matter of adding some qualifiers/tags to memory-to-memory versions of the enc/dec functions.
    Running on a GPU doesn't automatically mean being faster than on a CPU.

    3. Do you ever use LZMA2 in a way that loads all CPU cores (via adjustment of the :d, :c and :mt parameters)?
    Because a GPU just has more cores (not even that many more - 2x or 4x compared to a CPU),
    but they have a lower clock and are overall dumber (smaller caches, weaker branch prediction).
    (GPU "threads" are not actually threads, but rather vector elements - only different "warps" have different instruction pointers.)

    4. Matchfinding (which is the slowest part of LZ encoding) doesn't benefit from vectorization or the GPU architecture.
    Entropy coding can in theory be vectorized, but LZMA/2 uses the previous unpacked byte and the byte after the last match as context.

    5. You can test the CUDA mode of libbsc - http://libbsc.com/ - it can be added to 7z and you'd have a "7z with CUDA".
    You can ask fgiesen/RAD about this, since they did some related experiments.
    But my opinion is that an LZMA speed improvement on a GPU can only be achieved at the cost of a significant drop in compression ratio (processing many small blocks in parallel, etc. - see the sketch below).
    LZMA is also mostly incompatible with MT optimizations, so it's better to consider modern LZ codecs (mostly Oodle/brotli/zstd) or corresponding (format-incompatible) modifications to LZMA.
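
    A minimal sketch of the "many small blocks in parallel" idea from point 5, assuming a hypothetical compress_block() wrapper around any memory-to-memory LZMA-style encoder (not an existing API); the ratio loss comes precisely from the fact that matches cannot cross block boundaries:

        #include <algorithm>
        #include <cstddef>
        #include <cstdint>
        #include <thread>
        #include <vector>

        // Hypothetical memory-to-memory wrapper around any LZMA-style encoder;
        // returns the number of compressed bytes written to dst.
        size_t compress_block(const uint8_t* src, size_t src_size,
                              uint8_t* dst, size_t dst_capacity);

        // Compress independent blocks in parallel. Every block is its own
        // stream, so matches can never reference data in another block.
        std::vector<std::vector<uint8_t>> compress_parallel(
            const std::vector<uint8_t>& input, size_t block_size)
        {
            size_t num_blocks = (input.size() + block_size - 1) / block_size;
            std::vector<std::vector<uint8_t>> out(num_blocks);
            std::vector<std::thread> workers;

            for (size_t i = 0; i < num_blocks; ++i)
                workers.emplace_back([&, i] {
                    size_t off = i * block_size;
                    size_t len = std::min(block_size, input.size() - off);
                    out[i].resize(len + len / 2 + 64);   // rough worst-case bound
                    size_t csize = compress_block(input.data() + off, len,
                                                  out[i].data(), out[i].size());
                    out[i].resize(csize);
                });
            for (auto& t : workers) t.join();
            return out;
        }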

  3. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,497
    Thanks
    26
    Thanked 132 Times in 102 Posts
    Quote Originally Posted by Shelwien View Post
    4. Matchfinding (which is the slowest part of LZ encoding) doesn't benefit from vectorization or the GPU architecture.
    Entropy coding can in theory be vectorized, but LZMA/2 uses the previous unpacked byte and the byte after the last match as context.
    Doesn't a match finder like the one in https://github.com/conor42/Radyx speed things up? If it speeds up match finding a lot, then parsing could become the bottleneck and the target for porting to GPGPU.

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    Yes, but we won't necessarily benefit from creating more matchfinder threads.
    We could just as well run full LZMA encoders on individual blocks, since the question didn't say anything about window size.

    Also, you should think about GPU ports in terms of vectorization rather than MT.
    The number of actual cores in GPUs is similar to that of modern CPUs.
    In that sense, the Radyx matchfinder is not really any better than bt4: https://github.com/conor42/fast-lzma...adix_mf.c#L386

    I'm not saying that matchfinder vectorization is impossible, just that there's no existing implementation (Radyx has lots of branches, etc.).
    But then, in any case, matchfinding is mainly about random access to memory, so GPUs, with their relative lack of caches, would perform worse.

    And anyway, the main problem is entropy coding - it _can_ be vectorized, but not without breaking LZMA format compatibility.
    Without that, the decoding speed of any GPU port would be 2x slower than on a CPU.
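
    To illustrate what "vectorizable, but format-incompatible" means in practice, here is a rough sketch (not LZMA's actual scheme) of splitting the symbol stream into N independent lanes that a SIMD or GPU decoder could advance in lockstep; encode_lane() is a hypothetical single-lane entropy coder:

        #include <cstddef>
        #include <cstdint>
        #include <vector>

        // Hypothetical single-lane entropy coder (range coder, rANS, ...).
        std::vector<uint8_t> encode_lane(const std::vector<uint8_t>& symbols);

        struct MultiStream {
            std::vector<uint32_t>             lane_sizes;  // per-lane byte counts (header)
            std::vector<std::vector<uint8_t>> lanes;       // N independent bitstreams
        };

        // Round-robin the symbols over N lanes and code each lane on its own.
        // A decoder can then advance all N lanes in lockstep (SIMD, or one GPU
        // thread per lane) - which the original LZMA format, with its single
        // sequential range coder, cannot do.
        MultiStream encode_multistream(const std::vector<uint8_t>& symbols, int n)
        {
            std::vector<std::vector<uint8_t>> split(n);
            for (size_t i = 0; i < symbols.size(); ++i)
                split[i % n].push_back(symbols[i]);

            MultiStream ms;
            for (int lane = 0; lane < n; ++lane) {
                ms.lanes.push_back(encode_lane(split[lane]));
                ms.lane_sizes.push_back(static_cast<uint32_t>(ms.lanes.back().size()));
            }
            return ms;
        }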

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    Quote Originally Posted by Shelwien View Post
    Because a GPU just has more cores (not even that many more - 2x or 4x compared to a CPU),
    A similarly priced 6-core CPU has 12 threads, while a GF1660 has 1408/16 = 88 cores; each core can run up to 16 threads, and in order to hide the latencies of loads, jumps and other slow operations each core should run at least 3-4 threads. So we have 12 threads vs 300 threads. On top of that, the CPU has an LLC of about 1 MB/thread, while GPU cache is about 10 KB/thread, which makes many CPU-optimized algorithms inefficient.

    BSC employs the GPU only for radix sorting; Radyx seems to have a similar part in its algorithm, so the GPU could be used to offload that part of the work (a sketch below).
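
    For the radix-sort part specifically, offloading can be as small as a Thrust call; a sketch that sorts position indices by a precomputed hash key on the GPU (how the keys are computed and how the match finder consumes the sorted order are left out):

        #include <cstdint>
        #include <vector>
        #include <thrust/copy.h>
        #include <thrust/device_vector.h>
        #include <thrust/sort.h>

        // Sort positions by their hash keys on the GPU. For primitive key
        // types Thrust dispatches a radix sort, so positions that share a
        // key end up adjacent - the grouping a radix-style match finder needs.
        std::vector<uint32_t> sort_positions_by_hash(
            const std::vector<uint32_t>& keys,       // hash of K bytes at each position
            const std::vector<uint32_t>& positions)
        {
            thrust::device_vector<uint32_t> d_keys(keys.begin(), keys.end());
            thrust::device_vector<uint32_t> d_pos(positions.begin(), positions.end());

            thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_pos.begin());

            std::vector<uint32_t> sorted(positions.size());
            thrust::copy(d_pos.begin(), d_pos.end(), sorted.begin());
            return sorted;
        }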

    Regarding match finders and optimal parsing, I have some thoughts; we can discuss them.
    Last edited by Bulat Ziganshin; 15th June 2020 at 18:23.

  6. Thanks:

    Shelwien (1st March 2020)

  7. #6
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    176
    Thanks
    28
    Thanked 74 Times in 44 Posts
    You can build a suffix array in parallel on the GPU and use it for match finding. Bulat has suggested using Nvidia's NvBio library before, since it has a GPU-accelerated suffix array algorithm; I think it's a good idea. As for the cost of integrating it into LZMA: probably a week if you know what you're doing.
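
    To connect the two parts: once a suffix array (built on the GPU or not) and its inverse are available, a very basic match finder can take a position's lexicographic neighbours as match candidates. A minimal, unoptimized sketch; real implementations walk further than the immediate neighbours and use an LCP array:

        #include <cstddef>
        #include <cstdint>
        #include <vector>

        // Match candidate: distance back and length, as an LZ encoder wants it.
        struct Match { uint32_t dist; uint32_t len; };

        static uint32_t match_len(const uint8_t* data, size_t n, size_t a, size_t b)
        {
            uint32_t len = 0;
            while (b + len < n && data[a + len] == data[b + len]) ++len;
            return len;
        }

        // sa  : suffix array of data (sa[r] = start of the r-th smallest suffix)
        // isa : inverse suffix array (isa[p] = rank of the suffix starting at p)
        // Looks only at the two lexicographic neighbours of pos and keeps the
        // better candidate that starts before pos.
        Match find_match(const uint8_t* data, size_t n,
                         const std::vector<uint32_t>& sa,
                         const std::vector<uint32_t>& isa, size_t pos)
        {
            Match best{0, 0};
            uint32_t rank = isa[pos];
            for (int step : {-1, +1}) {
                long r = static_cast<long>(rank) + step;
                if (r < 0 || r >= static_cast<long>(sa.size())) continue;
                size_t cand = sa[r];
                if (cand >= pos) continue;                 // only earlier positions
                uint32_t len = match_len(data, n, cand, pos);
                if (len > best.len)
                    best = {static_cast<uint32_t>(pos - cand), len};
            }
            return best;
        }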

  8. #7
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,497
    Thanks
    26
    Thanked 132 Times in 102 Posts
    Quote Originally Posted by Lucas View Post
    Bulat has suggested using Nvidia's NvBio library before, since it has a GPU-accelerated suffix array algorithm
    Do you mean this: https://nvlabs.github.io/nvbio/nvbwt_page.html https://github.com/NVlabs/nvbio/tree/master/nvBWT ? Is that a regular BWT? I don't have an Nvidia GPU to check.

  9. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    Let's continue. Before dreaming about massively parallel processors, we should finish our homework with multi-threaded, vectorized CPU code. Yeah, GPUs are much more advanced in SIMD operations, but AVX-512 has finally become pretty close to the first CUDA GPU, Tesla.

    So, here is what we can do:

    1. Entropy codecs. Huffman coding is easy on a GPU, and we have even seen parallel decoders (although they are less efficient). For ari/ANS and dynamic Huffman coding we can get more jobs by processing multiple blocks simultaneously. Moreover, decoding can be started from multiple positions inside a block by saving the internal state at those points.

    2. An m/t match finder for optimal parsing, as implemented in RAR5/LZHAM, just splits all positions p into N buckets based on hash(get_K_bytes_at(p)) % N and then processes each bucket in its own OS thread (see the sketch below). Can we go from 12 to 300 buckets? Sure, but we still have to research how the number of buckets, hardware cache sizes and the caching of match bytes in the match table affect performance.
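
    A stripped-down sketch of the bucketing scheme from point 2; insert_into_match_table() stands in for the real per-bucket work and is hypothetical, and the open question above is simply how far N can be pushed before cache behaviour eats the gain:

        #include <cstddef>
        #include <cstdint>
        #include <cstring>
        #include <thread>
        #include <vector>

        // Hypothetical per-bucket step: insert pos into this bucket's match
        // table / hash chains and collect candidate matches for it.
        void insert_into_match_table(size_t bucket, size_t pos);

        static uint32_t hash_k_bytes(const uint8_t* p)       // K = 4 here
        {
            uint32_t x;
            std::memcpy(&x, p, 4);
            return x * 2654435761u;                          // multiplicative hash
        }

        // Split all positions into N buckets by hash(K bytes at p) % N and give
        // each bucket its own OS thread, in the spirit of the RAR5/LZHAM-style
        // parallel match finder described above. Raising N from ~12 to ~300 is
        // the GPU-shaped version of the same idea.
        void build_match_tables(const uint8_t* data, size_t n, size_t num_buckets)
        {
            std::vector<std::vector<size_t>> buckets(num_buckets);
            for (size_t pos = 0; pos + 4 <= n; ++pos)
                buckets[hash_k_bytes(data + pos) % num_buckets].push_back(pos);

            std::vector<std::thread> workers;
            for (size_t b = 0; b < num_buckets; ++b)
                workers.emplace_back([&, b] {
                    for (size_t pos : buckets[b])
                        insert_into_match_table(b, pos);
                });
            for (auto& t : workers) t.join();
        }
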
    Last edited by Bulat Ziganshin; 1st March 2020 at 23:13.

  10. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    There are two GPU BWT implementations; this is one of them. They claimed faster speed than any CPU code, and in any case it allows offloading part of the work to the GPU.

    I'm not proposing to use BWT for LZ match finding, although it is used by ZPAQ. I think that radix sort, which is also very fast on a GPU, may be useful, but overall it's a huge space for research.

  11. #10
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    876
    Thanks
    242
    Thanked 325 Times in 198 Posts
    Quote Originally Posted by necros View Post
    for say LZMA2 compression acceleration
    as I understand whole LZMA2 algorithm should be adapted?
    Is it practically possible with certain amount of donation or will require non practical time period?
    You can get the same density within 0.6% with brotli, or within 5% with zstd. Why bother with LZMA, which is 5x slower to decompress?

  12. #11
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    You can get the same density within 0.6% with brotli, or within 5% with zstd. Why bother with LZMA, which is 5x slower to decompress?
    I've got a feeling that brotli improved over zstd in the department of text compression, but what about compression of non-text files such as executables, databases or game resources, compared against LZMA?

    Brotli allows denser packing than gzip and deflate because of several algorithmic and format-level improvements: the use of context models for literals and copy distances, describing copy distances through past distances, use of a move-to-front queue in entropy code selection, joint-entropy coding of literal and copy lengths, the use of graph algorithms in block splitting, and a larger backward-reference window.
    Context modeling looks like the showstopper for parallel encoding/decoding - how do you solve this problem?

  13. #12
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    876
    Thanks
    242
    Thanked 325 Times in 198 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    I've got a feeling that brotli improved over zstd in the department of text compression,
    This feeling is not historically accurate. Zstd was a single-quality, fast, lz4-like compressor at the time when brotli was already positioned as a general-purpose compressor and a competitor to gzip. IIRC, zstd would not find matches shorter than seven bytes in its incarnation from those days. This made it pretty bad at normal text compression or at compressing short files. I consider that zstd positioned itself into the same box as brotli only after brotli was launched, with somewhat less emphasis on small files and on density, and somewhat more on simplicity and decoding speed.

    Also, zstd was frozen a year or two later than brotli.

    Quote Originally Posted by Bulat Ziganshin View Post
    but what about compression of non-text files such as executables, databases or game resources, compared against LZMA?
    If you have a homogeneous file of 32-bit floats, LZMA does work better than brotli. If you have a more complex structure, then brotli may be better. LZMA's larger context model helps somewhat with floats, in figuring out what those bits are for.

    Brotli the format has some additional considerations for such homogeneous data -- but brotli the open-source encoder has not adopted them yet.

    The first use case and the first customer of brotli was fonts. Fonts are complex geometry, with binary data. In that use case we matched LZMA's compression density with gzip's decoding speed.

    For more complex structured data, I'd recommend riegeli.

    For large homogeneous data, transposing it before compression can help the compression ratio while keeping the context short (sketch below).
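
    A tiny sketch of that byte-wise transposition for 32-bit values; the inverse transform is the same loop with the indices swapped:

        #include <cstddef>
        #include <cstdint>
        #include <vector>

        // "Transpose" an array of 32-bit values into 4 byte planes, so that
        // bytes playing the same role (exponents, low mantissa bytes, ...)
        // become adjacent and a short context is enough for the compressor.
        std::vector<uint8_t> transpose_32bit(const uint8_t* src, size_t n_values)
        {
            std::vector<uint8_t> out(n_values * 4);
            for (size_t i = 0; i < n_values; ++i)
                for (size_t b = 0; b < 4; ++b)
                    out[b * n_values + i] = src[i * 4 + b];
            return out;
        }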

    Quote Originally Posted by Bulat Ziganshin View Post
    Context modeling looks like the showstopper for parallel encoding/decoding, how you solve this problem?
    Parallel encoding is simple: you just encode in 512 kB to 2 MB meta-blocks. In brotli (q10 and q11) practically all the time goes into the sub-block dance that happens after LZ77.
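
    A rough sketch of chunked parallel encoding with the public brotli C API; note that compressing chunks independently like this produces one standalone brotli stream per chunk, so it only illustrates the parallel structure - emitting the pieces as meta-blocks of a single stream, as described above, needs lower-level plumbing:

        #include <algorithm>
        #include <cstddef>
        #include <cstdint>
        #include <thread>
        #include <vector>
        #include <brotli/encode.h>

        // Encode fixed-size chunks in parallel, one thread per chunk.
        std::vector<std::vector<uint8_t>> compress_chunks(
            const std::vector<uint8_t>& input, size_t chunk_size, int quality)
        {
            size_t num_chunks = (input.size() + chunk_size - 1) / chunk_size;
            std::vector<std::vector<uint8_t>> out(num_chunks);
            std::vector<std::thread> workers;

            for (size_t i = 0; i < num_chunks; ++i)
                workers.emplace_back([&, i] {
                    size_t off = i * chunk_size;
                    size_t len = std::min(chunk_size, input.size() - off);
                    size_t encoded = BrotliEncoderMaxCompressedSize(len);
                    out[i].resize(encoded);
                    // encoded is the capacity on input, the actual size on output.
                    BrotliEncoderCompress(quality, BROTLI_DEFAULT_WINDOW,
                                          BROTLI_MODE_GENERIC,
                                          len, input.data() + off,
                                          &encoded, out[i].data());
                    out[i].resize(encoded);
                });
            for (auto& t : workers) t.join();
            return out;
        }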

    As for parallel decoding, I'm not convinced it is something one would even want for algorithms as fast as gzip, brotli and zstd. For actual system performance I consider streaming (in the case of such fast decoders) more important: then the data has already been decompressed by the time it has been received through the NIC/flash/..., and the next stages of processing have started during the transmission.

  14. #13
    Member
    Join Date
    May 2014
    Location
    Canada
    Posts
    141
    Thanks
    72
    Thanked 21 Times in 12 Posts
    I've had some success running j2k *encoding* on the GPU, as the context modelling and arithmetic coding can be put into separate kernels, so there is decent occupancy on the SIMDs. Combined with the right memory access patterns, it is an order of magnitude faster than a similarly priced CPU. This was on an AMD RX 470 - the equivalent RX 580 is quite cheap these days, with lots of compute horsepower. For the decoder it is much harder to take advantage of the GPU architecture, as the naive decoding approach requires a large, very serial, very branchy kernel. One approach is to use the high memory bandwidth of the cards to break the kernel up into a pipeline. 400 GB/s of global memory bandwidth is commonplace.
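
    The original experiment was on an AMD card (so presumably OpenCL), but the kernel split is easiest to show as a CUDA-style skeleton; kernel bodies are omitted and all names are hypothetical - only the two-stage structure with an intermediate buffer in global memory is the point:

        #include <cstdint>
        #include <cuda_runtime.h>

        // One (context, decision) pair emitted by the context-modelling pass.
        struct CxD { uint8_t ctx; uint8_t d; };

        // Stage 1 (body omitted, hypothetical): one thread block models one
        // code-block of coefficients and writes its (context, decision) pairs.
        __global__ void model_contexts(const int16_t* coeffs, CxD* cxd,
                                       uint32_t* counts, int num_codeblocks);

        // Stage 2 (body omitted, hypothetical): one thread block arithmetic-codes
        // the pairs of one code-block into its own output segment.
        __global__ void arith_encode(const CxD* cxd, const uint32_t* counts,
                                     uint8_t* out, uint32_t* out_sizes,
                                     int num_codeblocks);

        void encode_on_gpu(const int16_t* d_coeffs, int num_codeblocks,
                           CxD* d_cxd, uint32_t* d_counts,
                           uint8_t* d_out, uint32_t* d_out_sizes)
        {
            // Many small independent code-blocks keep SIMD occupancy high.
            model_contexts<<<num_codeblocks, 256>>>(d_coeffs, d_cxd, d_counts,
                                                    num_codeblocks);
            // The intermediate buffer lives in global memory, which is where
            // the cards' high memory bandwidth pays off.
            arith_encode<<<num_codeblocks, 256>>>(d_cxd, d_counts, d_out,
                                                  d_out_sizes, num_codeblocks);
            cudaDeviceSynchronize();
        }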
