Results 1 to 2 of 2

Thread: Entropy coding on GPU

  1. #1
    Join Date
    Jun 2009
    Kraków, Poland
    Thanked 136 Times in 104 Posts

    Entropy coding on GPU

    I'm slowly developing entropy coding algorithm for GPUs. Right now I have somewhat optimized histogram computation (that isn't exactly trivial, despite its simplicity when doing single threaded implementation) and I need to compute Huffman codes on GPU. Thus I need some ideas.

    To extract performance from GPU one would need to keep as much computations within the chip, without going to global VRAM too much. Also, because single threaded performance is low (on AMD GCN instructions are issued in order and each has 4 cycle latency) we need many threads (4 per ALU allows to keep them busy, unless there's control flow divergence). Per each compute unit there's some amount of local memory (on AMD GCN it's 64 KiB) which is accessible in max 32 KiB chunks per workgroup. There's 4 times more "private memory" which is a register file, but registers aren't indexed so every instruction in a SIMD has to access the same register during simulteanous execution of an instruction.

    Overall, we need either to work in one single problem in parallel or reduce working area per single problem to minimum so we can execute as much problem instances simultaneously as we can. In all cases, the code divergence has to be low - if we have a while loop which has different iterations in different threads, then the slowest thread in a workgroup (or subgroup?) determines the overall computation speed.

    So, for starters, I need some ideas to either compute prefix codes with small working area (so I can run many independent threads) or with high degree of parallelism (so different threads can work simultaneously on the same problem instance).

    I have some idea on how to compute prefix codes lengths with very low memory - just select them at random from pair (floor(-log2(prob)), ceil(-log2(prob))) and check whether they could form a valid prefix code and how good the ratio would be. Do it in parallel and there will be relatively high chance that the best found codes would be quite good. Some sort of genetic algorithm could also be good for that. Anyway, such algorithms require some source of randomness, so I need to solve that additional problem first.

    Do you have any idea how to compute Huffman codes with lowest memory limits for indexed arrays? Eg 1 KiB for 256 symbols alphabet and 16-bit frequencies? Also keep in mind that cells are 32-bit, so I would need to interleave arrays into one (using some bit fiddling tricks for example).

    (sorry for repetitions, I edited the post for too long)

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Thanked 698 Times in 378 Posts
    building huffman tree usually takes much less time than encoding itself, so you can start with slow algo. how about encoding itself? as well as producing these data since even encoding is so fast that copying these data from cpu will ruin performance

Similar Threads

  1. Benchmarking Entropy Coders
    By dnd in forum Data Compression
    Replies: 194
    Last Post: 22nd February 2020, 01:02
  2. Finite State Entropy
    By Cyan in forum Data Compression
    Replies: 203
    Last Post: 20th February 2015, 21:22
  3. Replies: 1
    Last Post: 17th February 2014, 23:05
  4. ENTROPY 0.5
    By comp1 in forum Download Area
    Replies: 3
    Last Post: 7th February 2013, 00:21
  5. What is best for a pure Entropy Encoder?
    By biject.bwts in forum Data Compression
    Replies: 4
    Last Post: 28th September 2011, 13:45

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts