Thread: LZ77 on GPU

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,569
    Thanks
    777
    Thanked 687 Times in 372 Posts

    LZ77 on GPU

    The string search sample finally lit the bulb in my mind, and I got the GPU idea: you can compute anything as long as you simultaneously process 256 similar items, plus use masked execution to simulate branching.
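    To illustrate the masked-execution point, here is a minimal CUDA sketch (a toy kernel, not part of the proposed compressor): every lane evaluates both alternatives and a mask selects the result, so all threads stay in lockstep instead of diverging.

    Code:
    // Toy masked execution: every lane computes both branch results and a
    // mask picks one, so the threads never diverge on the condition.
    __global__ void masked_select(const int *a, const int *b, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int mask = -(a[i] > b[i]);                // all-ones if a[i] > b[i]
        out[i] = (a[i] & mask) | (b[i] & ~mask);  // branchless select
    }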

    A sketch of an LZ-on-GPU implementation, based on the HST matchfinder idea:

    0. Split input data into blocks of ~64 MB. The next stages are performed on a single block; some stages may be performed on the CPU until an efficient GPU implementation is developed.

    1. Find short matches: split the block into 256 chunks and scan every chunk sequentially, looking for 2/3-byte matches. Each chunk needs hash2[256] and hash3[4096] tables to store the last match positions; with 2-byte entries these tables occupy 8.5 KB, i.e. 2.125 MB for all 256 chunks (a CUDA sketch of this stage follows the list). At this stage we could also look for 4/5-byte matches (and use HST5/6 on the next stage), but that would fall out of the GPU caches and would therefore probably be too slow.

    2. Find long matches: build an HST4 index of the block. For every position in the index, check the previous 1..256 entries and compute the number of matching bytes. My experience tells me that, for optimal parsing, every position provides on average 1.5 matches in the increasing-length sequence. The question remains how to compact those 1..256 potential matches into the "increasing-length sequence" and then sort them by the LZ source position. At least it looks suitable for the "mass parallelism" paradigm.

    3. Build the optimal path: split the block into optimally compressed chunks. The natural split points are large matches (easily found by the multi-threaded sequential scan); otherwise the data may be split after every 768 (a la lzma) .. 32K (a la tornado) positions. So a 64 MB block should provide enough data to process 256 chunks simultaneously. Then the optimal parser enumerates all possible matches at every position and builds the path with the minimum price.

    4. Huffman encoding: the LZ output of the previous stage may be split into 256 equal-sized chunks encoded simultaneously.
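    Here is a hedged CUDA sketch of stage 1, under stated assumptions: one thread per chunk (launched, e.g., as find_short_matches<<<1, 256>>>), 16-bit table entries (so chunk_len must not exceed 64 KB in this toy version), and zero-initialized table and output buffers. All names are hypothetical, and the tables live in a global buffer here; a real implementation would fight to keep the 8.5 KB of per-chunk tables in faster memory.

    Code:
    // Stage 1 sketch: each of 256 threads scans its own chunk sequentially,
    // maintaining per-chunk hash2[256] and hash3[4096] tables of 2-byte
    // entries (8.5 KB per chunk, 2.125 MB in total).
    __global__ void find_short_matches(const unsigned char *block, int chunk_len,
                                       unsigned short *h2_all,  // 256*256 entries
                                       unsigned short *h3_all,  // 256*4096 entries
                                       int *match_pos, int *match_len)
    {
        int chunk = blockIdx.x * blockDim.x + threadIdx.x;
        if (chunk >= 256) return;

        const unsigned char *p = block + chunk * chunk_len;
        unsigned short *h2 = h2_all + chunk * 256;
        unsigned short *h3 = h3_all + chunk * 4096;

        for (int i = 0; i + 3 <= chunk_len; i++) {
            unsigned hv2 = (p[i] * 33u + p[i+1]) & 255;
            unsigned hv3 = (p[i] * 1093u + p[i+1] * 33u + p[i+2]) & 4095;
            int out = chunk * chunk_len + i;
            int prev = h3[hv3];
            // Prefer a verified 3-byte match, fall back to a 2-byte one.
            if (prev < i && p[prev] == p[i] && p[prev+1] == p[i+1]
                         && p[prev+2] == p[i+2]) {
                match_pos[out] = prev;  match_len[out] = 3;
            } else {
                prev = h2[hv2];
                if (prev < i && p[prev] == p[i] && p[prev+1] == p[i+1]) {
                    match_pos[out] = prev;  match_len[out] = 2;
                }
            }
            h2[hv2] = (unsigned short)i;  // remember the latest occurrence
            h3[hv3] = (unsigned short)i;
        }
    }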

    Overall, this builds up the scheme of an LZH/LZAri compressor with an optimal parser and a non-sliding window. A sliding window can be emulated by shifting out only half of the encoding window and encoding only the second half of the window. Actually, even a 4 MB data overlap should be enough to significantly reduce the loss of compression ratio.

    Also, REP codes (repeat previous offset) should be implemented in the optimal parser in order to reach tornado/rar5 compression levels (a sketch of the price update follows below).
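    As a minimal, hypothetical sketch of how a rep-offset candidate could enter the optimal parser's price loop: price[] holds the cheapest encoding cost to reach each position, parent[] the back-links for path recovery, and rep_len/rep_cost come from the match finder and the entropy model respectively (all names are made up for illustration).

    Code:
    // One price-update step for a rep-match candidate: a rep match is cheap
    // because no new offset needs to be transmitted, so it often wins.
    __device__ void try_rep_match(int pos, int rep_len, unsigned rep_cost,
                                  unsigned *price, int *parent)
    {
        if (rep_len < 2) return;                 // too short to encode as a match
        unsigned new_price = price[pos] + rep_cost;
        if (new_price < price[pos + rep_len]) {
            price[pos + rep_len]  = new_price;   // cheaper path to pos+rep_len
            parent[pos + rep_len] = pos;         // remember for backtracking
        }
    }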

  2. #2
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    Quote Originally Posted by Bulat Ziganshin
    1. Find short matches: split the block into 256 chunks and scan every chunk sequentially, looking for 2/3-byte matches. Each chunk needs hash2[256] and hash3[4096] tables to store the last match positions; with 2-byte entries these tables occupy 8.5 KB, i.e. 2.125 MB for all 256 chunks. At this stage we could also look for 4/5-byte matches (and use HST5/6 on the next stage), but that would fall out of the GPU caches and would therefore probably be too slow.
    GPU caches do not work like this. It is more like "explicit caching". GPUs have three types of storage: global, on which you cannot synchronize; shared, visible to compute units within the same group, on which you can synchronize; and local, visible within a work unit only. The local storage is the fastest but, as said, local and not shared between units. There is only a very limited amount of it, typically something like 256 cells, since it is typically realized as registers (GPUs are massive register machines). Shared storage is one magnitude slower and allows exchange of data within a group. Global storage is probably again one magnitude slower and is the on-GPU memory. Access to it can be pipelined if you follow a predictable access pattern, but it is usually slow.


    Thus, the trick is not only to come up with a set of algorithms that allow massively parallel operation. The trick is to design the algorithms in such a way that memory is as local to the compute unit (work item, in the language of OpenCL) as possible, avoiding memory that is "further away", and, if you use any type of global memory, to access it in a linear pattern.
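    To make this concrete, here is a toy CUDA sketch of the "explicit caching" idea (what this post calls shared storage is CUDA's __shared__ memory): a block stages a tile of input into shared memory with one coalesced global read, and every thread then compares neighbouring bytes from the fast copy instead of issuing extra global loads. The kernel and its names are hypothetical; *runs is assumed zero-initialized.

    Code:
    #define TILE 256

    // Count adjacent equal-byte pairs. Each block copies one tile (plus one
    // halo byte) from global to shared memory in a linear, coalesced pattern,
    // then works entirely on the shared copy.
    __global__ void count_runs(const unsigned char *in, int n, int *runs)
    {
        __shared__ unsigned char tile[TILE + 1];
        int i = blockIdx.x * TILE + threadIdx.x;

        if (i < n) tile[threadIdx.x] = in[i];    // one coalesced global read
        if (threadIdx.x == 0 && i + TILE < n)
            tile[TILE] = in[i + TILE];           // halo byte for the tile edge
        __syncthreads();                         // wait until the tile is loaded

        if (i + 1 < n && tile[threadIdx.x] == tile[threadIdx.x + 1])
            atomicAdd(runs, 1);                  // all reads hit shared memory
    }

    // Launch example: count_runs<<<(n + TILE - 1) / TILE, TILE>>>(d_in, n, d_runs);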

    The CudaJ2K implementation here, from our university (not my group), is a GPU JPEG 2000 implementation, but it is only as fast as a dual-core processor; a quad-core can outrun it easily. Thus, do not expect too much: data locality is a major issue.

    GPUs work entirely differently from CPUs, and algorithm design is really much more than having the algorithmic description.

  3. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,495
    Thanks
    26
    Thanked 131 Times in 101 Posts
    GPUs have lots of bandwidth anyway, so with proper pipelining you can extract a lot of performance. But you need thousands of threads, not just 256, unless you're targeting low-end GPUs. Originally, shaders were meant to instance one short-lived thread processing one pixel in one stage of graphics creation. You need many threads per ALU to allow the hardware to mask memory latencies.
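    A minimal sketch of that oversubscription (a toy program with arbitrary sizes, just for illustration): launch far more threads than the GPU has ALUs, so the scheduler can hide global-memory latency by switching between warps.

    Code:
    #include <cstdio>

    // Each thread does one load and one store; with ~1M threads in flight
    // the scheduler always has warps ready while others wait on memory.
    __global__ void touch(const int *in, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] + 1;
    }

    int main()
    {
        const int n = 1 << 20;                        // ~1M elements, ~1M threads
        int *in, *out;
        cudaMalloc(&in,  n * sizeof(int));
        cudaMalloc(&out, n * sizeof(int));
        cudaMemset(in, 0, n * sizeof(int));
        touch<<<(n + 255) / 256, 256>>>(in, out, n);  // 4096 blocks of 256
        cudaDeviceSynchronize();
        printf("launched %d threads\n", n);
        cudaFree(in);  cudaFree(out);
        return 0;
    }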

    As to the orders of magnitude, I think you exaggerated. Looking at http://en.wikipedia.org/wiki/List_of...ocessing_units the Radeon R9 290X has 320 GB/s of global memory bandwidth and 5632 GFLOPS of raw computing power. That amounts to 17.6 FLOPS per 1 B/s. Of course, the memory transfer granularity is definitely higher than 1 byte. I'm not sure, but it also shouldn't be very high; I think it's under 10 bytes.

    The algorithm I've outlined here: http://encode.su/threads/2011-Mass-p...ll=1#post39609 should parallelize well, i.e. the number of threads that can be created should go into many thousands, so the hardware would have a lot of opportunities to mask memory latencies. But the algorithm would still be bottlenecked by memory bandwidth (and by random memory access, if that really matters in this case).

Similar Threads

  1. lz77 visualisation
    By chornobyl in forum Data Compression
    Replies: 3
    Last Post: 7th June 2016, 16:04
  2. Byte oriented LZ77 compressor
    By Matt Mahoney in forum Data Compression
    Replies: 21
    Last Post: 30th December 2011, 17:27
  3. LZ77 Performance Issues
    By david_werecat in forum Data Compression
    Replies: 4
    Last Post: 23rd August 2011, 18:06
  4. balz v1.00 - new LZ77 encoder is here!
    By encode in forum Forum Archive
    Replies: 61
    Last Post: 17th April 2008, 22:57
  5. Fast arithcoder for compression of LZ77 output
    By Bulat Ziganshin in forum Forum Archive
    Replies: 13
    Last Post: 15th April 2007, 17:40
