
Thread: can there be some performance gain from using lizard in highly parallel compute?

  1. #1
    Member
    can there be some performance gain from using lizard in highly parallel compute?

    I have been wondering: since Lizard scales pretty well with multicore CPUs, could there be some noticeable compression-time improvement from using a GPU? If the answer is no, why wouldn't Lizard be suitable? And why does the implementation in Bzip2 take so long? If it can use both CPU and GPU, shouldn't it take less time than CPU only?
    [Attached image: 7-Zip-zstd benchmark chart (decomp-*.png from mcmilk.de)]

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,975
    Thanks
    296
    Thanked 1,302 Times in 739 Posts
    Yes, it can be ported to OpenCL or CUDA.
    And yes, it's probably possible to make a 64 KB-window version work faster on a GPU than on a CPU (with enough threads).

    But GPUs overall are not very suitable for compression algorithms, because there's a speed/compression trade-off -
    the best compression creates a serial dependency, while splitting data for parallel compression improves speed but hurts ratio.

    Another problem is that GPUs lack data caches and branch prediction, and both of these are important for compression
    (like, there'd be data-dependent branches, so branch prediction for compressible data would actually improve processing speed).

    Yet another problem is GPU data errors. Did you ever see random glitches on screen in games?
    Electronics are not actually 100% reliable; there's simply a low enough error rate, like 10^-20 or so.
    Which is why there's ECC memory for servers, for example.
    But with enough load and enough processed data, errors are bound to appear.
    And while in games they would be just harmless rendering glitches, in compression an error would usually break
    all the data after the point where it occurs.

    As to bzip2, it's likely simply an unoptimized implementation.
    Bzip2 is basically used only on Linux, and not that frequently even there.

  3. Thanks:

    TTXX1 (22nd July 2019)

  4. #3
    Member
    Well, the thing is that in the end multithreading on the CPU (like 7-Zip zstd) doesn't seem to improve performance - how much of a difference would there be using the GPU? I think I need to read further into OpenCL to see whether kernels can be executed in parallel (using free/idling compute units), but how does parallelism affect the compression ratio? I see that the latest release of Lizard can use 50% of a 4c/8t CPU, but its compression ratio combined with Huffman (setting -49) is the best I have seen.

    The idea was that since almost every CPU nowadays has an iGPU (0.5-2 TFLOPS?), and some people have dedicated GPUs (1-16 TFLOPS of 32-bit float precision), I assumed this might help; also, GPUs have plenty of bandwidth (200 GB/s-1 TB/s).

    About glitches in games? I don't think I see them often, unless they are so small as to be unnoticeable; at least some modern (AA?/AAA) games don't have them. Wouldn't this apply to a common PC as well? Most consumer PCs lack ECC memory (unless people mostly use Xeons, and maybe FX/Ryzen with ECC), and I don't see problems with compression unless the file has been damaged.

  5. #4
    Administrator Shelwien
    > Well, the thing is that in the end multithreading on the CPU (like 7-Zip zstd) doesn't seem to improve performance -
    > how much of a difference would there be using the GPU?

    Unfortunately it's hard to say without actually attempting that.
    It's not really that simple to turn a serial compression algo into
    an efficient parallel one. We can split the file into chunks and
    compress each chunk in its own thread, but that's a lazy solution.
    It's also possible to implement a specialized parallel algorithm,
    with SoA, branchless code, etc.
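
    For illustration, a minimal sketch of that "lazy" chunk-splitting approach in C with pthreads, assuming Lizard's Lizard_compress()/Lizard_compressBound() entry points from lizard_compress.h (check the header for the exact signatures); each chunk starts with an empty window, which is exactly where the ratio loss comes from:

        #include <pthread.h>
        #include <stdlib.h>
        #include "lizard_compress.h"

        #define CHUNK_SIZE (1 << 20)                 /* 1 MiB per independent chunk */

        typedef struct {
            const char *src; int srcSize;            /* input slice */
            char *dst;       int dstCapacity;        /* output buffer for this slice */
            int level;       int cSize;              /* compression level / result */
        } Job;

        static void *worker(void *p) {
            Job *j = (Job*)p;
            /* no history from previous chunks -> no cross-chunk matches */
            j->cSize = Lizard_compress(j->src, j->dst, j->srcSize, j->dstCapacity, j->level);
            return NULL;
        }

        /* Compress src[0..srcSize) as independent chunks, one thread per chunk. */
        int compress_chunked(const char *src, int srcSize, int level) {
            int n = (srcSize + CHUNK_SIZE - 1) / CHUNK_SIZE, total = 0;
            pthread_t *tid = malloc(n * sizeof(*tid));
            Job *job = malloc(n * sizeof(*job));
            for (int i = 0; i < n; i++) {
                int off = i * CHUNK_SIZE;
                int len = (srcSize - off < CHUNK_SIZE) ? srcSize - off : CHUNK_SIZE;
                int cap = Lizard_compressBound(len);
                job[i] = (Job){ src + off, len, malloc(cap), cap, level, 0 };
                pthread_create(&tid[i], NULL, worker, &job[i]);
            }
            for (int i = 0; i < n; i++) {
                pthread_join(tid[i], NULL);
                total += job[i].cSize;               /* framing/concatenation omitted */
                free(job[i].dst);
            }
            free(job); free(tid);
            return total;
        }

    This scales with core count, but since chunk i can never reference bytes from chunk i-1, the ratio drops - the smaller the chunks, the worse it gets.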

    > I think I need to read further into OpenCL to see whether kernels can be executed in parallel

    You can find some papers on deflate implementations for GPUs.
    Btw, afaik WinZip has a GPU-based deflate version... though it's not very successful.

    > (using free/idling compute units), but how does parallelism affect the compression ratio?

    You can test that by manipulating the :c and :mt parameters in 7-zip.
    It actually depends on the data type.

    > I see that the latest release of Lizard can use 50% of a 4c/8t CPU but
    > the compression ratio is the best I have seen combined with Huffman (setting -49)

    It's also possible to make a parallel encoder without compression loss
    (by processing overlapped data chunks), but that hurts performance.
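
    As a sketch of the overlap idea (only the boundary arithmetic is shown; compress_with_dict() is a hypothetical wrapper around a prefix/dictionary-aware encoder, such as the *_continue-style streaming calls in LZ4/Lizard):

        #define WINDOW_SIZE (64 * 1024)              /* Lizard fast-mode window */
        #define CHUNK_SIZE  (1 << 20)

        typedef struct {
            const char *dict; int dictSize;          /* preceding window, read-only context */
            const char *src;  int srcSize;           /* bytes this worker actually encodes */
        } OverlapJob;

        static OverlapJob make_job(const char *buf, int bufSize, int chunk) {
            int start = chunk * CHUNK_SIZE;
            int end   = start + CHUNK_SIZE < bufSize ? start + CHUNK_SIZE : bufSize;
            int dictStart = start > WINDOW_SIZE ? start - WINDOW_SIZE : 0;
            OverlapJob j = { buf + dictStart, start - dictStart, buf + start, end - start };
            return j;
        }

        /* each worker then does something like
         *     compress_with_dict(j.dict, j.dictSize, j.src, j.srcSize, dst, dstCap);
         * so matches can reach back across the chunk boundary and the ratio matches
         * a single-threaded run; the cost is re-scanning dictSize extra bytes per chunk. */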

    > I assumed this might help; also GPUs have plenty of bandwidth (200 GB/s-1 TB/s)

    Most compression algorithms (especially fast ones) are not GPU-friendly.
    There's little computing, no floating point - just comparing short strings, basically.
    So smart data caches and branch prediction are important, but GPUs don't really have these.
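
    To make that concrete, a toy single-entry hash matchfinder (roughly the kind of inner loop LZ4/Lizard's fast levels run); the hot path is a table load and a data-dependent branch, not arithmetic, which is why caches and branch prediction dominate. Names and constants here are illustrative only:

        #include <stdint.h>
        #include <string.h>

        #define HASH_LOG 16
        #define WINDOW   (64 * 1024)

        static uint32_t hash4(const uint8_t *p) {    /* hash the next 4 bytes */
            uint32_t v; memcpy(&v, p, 4);
            return (v * 2654435761u) >> (32 - HASH_LOG);
        }

        void scan_matches(const uint8_t *src, int srcSize) {
            static uint32_t table[1 << HASH_LOG];    /* last position seen per hash */
            for (int i = 0; i + 4 <= srcSize; i++) {
                uint32_t h = hash4(src + i);
                uint32_t cand = table[h];            /* random-access load: cache-bound */
                table[h] = (uint32_t)i;
                /* data-dependent branch: taken/not-taken pattern depends on the input bytes */
                if (cand < (uint32_t)i && i - cand <= WINDOW
                    && memcmp(src + cand, src + i, 4) == 0) {
                    /* match: extend it and emit an offset/length token (omitted) */
                } else {
                    /* no match: the byte becomes a literal (omitted) */
                }
            }
        }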

    > About glitches in games? I don't think I see them often unless they are so
    > small (hence unnoticeable); at least some modern (AA?/AAA) games don't have them.

    Well, recent GPUs probably have more integrated error correction (via redundant circuits).
    But we actually encountered real bugs like that before:
    https://encode.su/threads/586-bsc-ne...ll=1#post26200

    So I guess modern GPUs are safer in that sense, but I think they're still
    less robust than CPUs, and also more likely to be unsafely overclocked.

    > Wouldn't this apply to a common PC as well? Most consumer PCs lack ECC
    > memory (unless people mostly use Xeons and maybe FX/Ryzen with ECC) and I
    > don't see problems with compression unless the file has been damaged

    Even with CPU processing you might see some errors after processing a few TBs of data.
    For example, I've had such issues myself once, when downloading a set of rainbow tables via http -
    the files on the server were valid, but I ended up downloading them with some flipped bits here and there.

    With GPUs these types of errors are just more likely atm (I think).

  6. #5
    Member
    Quote Originally Posted by Shelwien:
    > > Well, the thing is that in the end multithreading on the CPU (like 7-Zip zstd) doesn't seem to improve performance?
    >
    > Unfortunately it's hard to say without actually attempting that.
    > It's not really that simple to turn a serial compression algo into
    > an efficient parallel one. We can split the file into chunks and
    > compress each chunk in its own thread, but that's a lazy solution.
    > It's also possible to implement a specialized parallel algorithm,
    > with SoA, branchless code, etc.
    Well, looking at some studies: someone failed to parallelize an LZ77-based compressor;
    another study improved 20% using a GTX 580 for LZW (compared to a 2600) while maintaining the same compression ratio;
    parallel RLE can be up to 500% faster than on the CPU;
    LZW finishes faster than LZ77 on the GPU, but LZ77 produces roughly 10x smaller output.

    About SoA: OpenCL doesn't allow pointer-to-pointer, so using such an approach will take more memory.

    > You can find some papers on deflate implementations for GPUs.
    > Btw, afaik WinZip has a GPU-based deflate version... though it's not very successful.
    Link? It doesn't seem that deflate is the correct algorithm for this.



    > You can test that by manipulating the :c and :mt parameters in 7-zip.
    > It actually depends on the data type.
    Block sorting seems to improve the compression ratio compared to Huffman; maybe there is something more to it?

    > It's also possible to make a parallel encoder without compression loss
    > (by processing overlapped data chunks), but that hurts performance.
    Wouldn't it take more memory?

    > Most compression algorithms (especially fast ones) are not GPU-friendly.
    > There's little computing, no floating point - just comparing short strings, basically.
    > So smart data caches and branch prediction are important, but GPUs don't really have these.
    The premise wouldn't be the same as using a CPU; couldn't the stream processors instead be used to sort/find/reallocate/whatever is needed, to speed up the process?

    > Well, recent GPUs probably have more integrated error correction (via redundant circuits).
    > But we actually encountered real bugs like that before:
    > https://encode.su/threads/586-bsc-ne...ll=1#post26200
    Artifacting sounds like he overclocked the memory, or maybe overclocked the GPU to unstable levels.

  7. #6
    Administrator Shelwien
    > It doesn't seem that deflate is the correct algorithm for this

    At encoding there's no real difference between lizard/huffman and deflate.
    Stream interleaving in lizard (FSEhuf) does improve decoding speed, but
    afaik there are only 4 fixed streams - it doesn't help that much with massive parallelism.

    It's possible to make a GPU-friendly version with many more interleaved streams,
    but I think it would break format compatibility, and there's a compression overhead
    from the extra streams too.
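
    Roughly, "more interleaved streams" means something like this (the stream count and names are made up; each stream needs its own entropy tables/headers, which is the overhead mentioned above):

        #include <stddef.h>

        #define NUM_STREAMS 32                       /* vs. the 4 fixed streams in FSEhuf */

        typedef struct { unsigned char *buf; size_t len; } Stream;

        /* Deal bytes round-robin into NUM_STREAMS buffers.  Stream s holds bytes
         * s, s+N, s+2N, ..., so each one can be entropy-coded and later decoded by
         * its own thread or GPU work-item with no dependency on the others. */
        void interleave(const unsigned char *bytes, size_t n, Stream out[NUM_STREAMS]) {
            for (size_t i = 0; i < n; i++) {
                Stream *s = &out[i % NUM_STREAMS];
                s->buf[s->len++] = bytes[i];
            }
        }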

    > Block sorting seems to improve compression ratio compared
    > to Huffman; maybe there is something more to it?

    Yes, it seems much easier to make efficient GPU ports of BWT-based compression methods.
    But it's also possible to use BWT suffix arrays for LZ matchfinding.

    > Wouldn't it take more memory?

    Of course, but it shouldn't matter with a 64 KB window,
    and with a large window it would be much harder to find independent data chunks
    to compress in parallel.

    > The premise wouldn't be the same as using a CPU; couldn't the stream processors
    > instead be used to sort/find/reallocate/whatever is needed, to speed up the process?

    I'm not aware of there being anything helpful for matchfinding on GPUs.
    There may be some special logic for texture lookups with interpolation,
    but there's no associative memory or anything like that.

    So I think that LZ optimizations for GPU should be basically the same
    as what's already done for CPUs - vectorization, branchless processing, stream interleaving.
    With GPUs we just need to aim for more threads.

    > Artifacting sounds like he overclocked the memory, or maybe overclocked the GPU to unstable levels

    As I said, errors happen even during CPU processing;
    there are even exploits based on increasing the probability of that (Rowhammer).

    With GPUs such errors are just more likely, since GPUs are not normally used
    for data compression, so it's more profitable to increase the number of cores (CUs)
    rather than add more ECC/parity circuits for safety.

    Well, there's been a significant reduction in errors with recent GPU generations -
    in 2012 or so it was easy enough to reproduce bsc decoding errors
    even without overclocking.
    But I think that GPU-based compression algorithms intended for practical use
    would still require explicit output verification.
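
    In practice that verification can be as blunt as a round trip before trusting the output; compress_on_gpu() and decompress() below are placeholder names for whatever encoder/decoder pair is actually used:

        #include <string.h>

        int compress_on_gpu(const char *src, int srcSize, char *dst, int dstCap);
        int decompress(const char *src, int cSize, char *dst, int dstCap);

        /* Returns the compressed size, or -1 if the round trip does not reproduce
         * the input (e.g. a bit flipped somewhere along the GPU path). */
        int compress_verified(const char *src, int srcSize,
                              char *dst, int dstCap, char *scratch) {
            int cSize = compress_on_gpu(src, srcSize, dst, dstCap);
            if (cSize <= 0) return -1;
            int dSize = decompress(dst, cSize, scratch, srcSize);
            if (dSize != srcSize || memcmp(scratch, src, (size_t)srcSize) != 0)
                return -1;                           /* reject: retry or fall back to CPU */
            return cSize;
        }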

  8. #7
    Member
    Also, I've been wondering how people manage the RAM usage in these algorithms; somehow they have to know how long the char array has to be in order to allocate the bytes, because in OpenCL there is no allocation (malloc, calloc).

    I notice that on the CPU side, with lizv1 (+ Huffman) or lz4fast (+ Huffman), the RAM usage barely increases (100-200 MB, maybe more depending on file size), but the total load goes to the HDD. With OpenCL you can use either or both of system RAM and the OpenCL device's RAM.

  9. #8
    Administrator Shelwien
    In theory there's gcl_malloc, or you can allocate buffers on the CPU and pass them to the kernel.
    But for LZ with a small window it's quite possible to use fixed-size arrays for everything.
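
    For reference, the host-side route looks roughly like this with the standard OpenCL 1.x API (clCreateBuffer / clSetKernelArg / clEnqueueNDRangeKernel); context, queue, program build and error checks are omitted, and the kernel is only sketched as a comment to show the fixed-size-array idea:

        #include <CL/cl.h>

        void run_kernel(cl_context ctx, cl_command_queue q, cl_kernel k,
                        const unsigned char *src, size_t srcSize,
                        unsigned char *dst, size_t dstCap, size_t workItems)
        {
            cl_int err;
            /* buffers are allocated by the host and handed to the kernel as arguments */
            cl_mem inBuf  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                           srcSize, (void*)src, &err);
            cl_mem outBuf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, dstCap, NULL, &err);

            clSetKernelArg(k, 0, sizeof(cl_mem), &inBuf);
            clSetKernelArg(k, 1, sizeof(cl_mem), &outBuf);

            clEnqueueNDRangeKernel(q, k, 1, NULL, &workItems, NULL, 0, NULL, NULL);
            clEnqueueReadBuffer(q, outBuf, CL_TRUE, 0, dstCap, dst, 0, NULL, NULL);

            clReleaseMemObject(inBuf);
            clReleaseMemObject(outBuf);
        }

        /* Kernel side there is no malloc, but with a 64 KB window everything can be
         * a fixed-size array, e.g.:
         *
         *   __kernel void compress_chunk(__global const uchar *in, __global uchar *out) {
         *       ushort hashTable[1 << 10];   // per-work-item, fixed size
         *       ...
         *   }
         */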

  10. #9
    Member
    @Shelwien I was referring to the 7-Zip ZSTD implementation, which allocates a big part of the memory stream in virtual memory (storage).

    Also, I was reading the Lizard compression code, but I don't understand part of this function: https://github.com/inikep/lizard/blo...ompress.c#L346 . Are hashTableSize and chainTableSize original to LZ4/Lizard, or are they performance optimizations? If so, how does the stream work in LZ5? I can't find anything that explains the use of the stream (using cLevel = 19).

