I'm slowly developing an entropy coding algorithm for GPUs. Right now I have a somewhat optimized histogram computation (which isn't exactly trivial on a GPU, despite the simplicity of a single-threaded implementation), and next I need to compute Huffman codes on the GPU. So I need some ideas.

To extract performance from a GPU, one needs to keep as much computation as possible on the chip, without going to global VRAM too often. Also, because single-threaded performance is low (on AMD GCN, instructions are issued in order and each has a 4-cycle latency), we need many threads (4 per ALU keeps them busy, unless there's control-flow divergence). Each compute unit has some amount of local memory (64 KiB on AMD GCN), accessible in at most 32 KiB chunks per workgroup. There's 4 times more "private memory" in the register file, but registers aren't indexable, so every lane of a SIMD has to access the same register during simultaneous execution of an instruction.

Overall, we either need to work on a single problem in parallel, or reduce the working area per problem instance to a minimum so we can execute as many problem instances simultaneously as possible. In either case, code divergence has to be low: if a while loop runs a different number of iterations in different threads, the slowest thread in a workgroup (or subgroup?) determines the overall computation speed.

So, for starters, I need ideas for computing prefix codes either with a small working area (so I can run many independent threads) or with a high degree of parallelism (so different threads can work simultaneously on the same problem instance).

I have an idea for computing prefix code lengths with very low memory: for each symbol, pick a length at random from the pair (floor(-log2(prob)), ceil(-log2(prob))), then check whether the lengths could form a valid prefix code and how good the compression ratio would be. Do this in parallel, and there's a relatively high chance that the best codes found will be quite good. Some sort of genetic algorithm could also work for this. Either way, such algorithms require a source of randomness, so I'd need to solve that additional problem first.
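To make the validity check concrete, here's a minimal host-side C sketch (the helper names `kraft_valid` and `code_cost` are my own, not from any library): lengths can be realized as a prefix code exactly when the Kraft inequality holds, and the candidate to keep is the valid one with the fewest total encoded bits. Fixed-point arithmetic keeps the test exact and GPU-friendly.

```c
#include <stdint.h>

/* Hypothetical helper: a set of code lengths can be realized as a
   prefix code iff the Kraft inequality holds:
       sum over i of 2^(-len[i]) <= 1.
   Summing in fixed point (units of 2^-32) avoids floating point;
   assumes 1 <= len[i] <= 32. */
static int kraft_valid(const uint8_t *len, int n) {
    uint64_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += 1ull << (32 - len[i]);
    return sum <= (1ull << 32);
}

/* Cost of a candidate length vector: total encoded size in bits,
   sum of freq[i] * len[i]. A parallel search would keep the valid
   candidate with minimal cost. */
static uint64_t code_cost(const uint16_t *freq, const uint8_t *len, int n) {
    uint64_t bits = 0;
    for (int i = 0; i < n; i++)
        bits += (uint64_t)freq[i] * len[i];
    return bits;
}
```

For example, lengths {1,2,3,3} sum to exactly 1 in the Kraft sense and are valid, while {1,1,2} oversubscribes the code space and is rejected.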

Do you have any idea how to compute Huffman codes with the lowest possible memory for indexed arrays? E.g. 1 KiB for a 256-symbol alphabet with 16-bit frequencies? Also keep in mind that local memory cells are 32-bit, so I would need to interleave the arrays into one (using some bit-fiddling tricks, for example).
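One candidate for that budget: Moffat and Katajainen described an in-place algorithm that computes Huffman code lengths using only the frequency array itself, so 256 symbols fit in exactly 256 32-bit cells (1 KiB), given frequencies sorted in ascending order. Below is a sketch written from memory under those assumptions; treat it as something to verify against the published algorithm, not a drop-in kernel.

```c
#include <stdint.h>

/* In-place Huffman code-length computation (Moffat/Katajainen style).
   On entry, A[0..n-1] holds frequencies sorted ascending; on exit it
   holds the corresponding code lengths (largest frequency last, so it
   gets the shortest code). Assumes n >= 2 and all frequencies nonzero.
   Sketched from memory -- verify before relying on it. */
static void huffman_lengths_inplace(uint32_t *A, int n) {
    int leaf = 0, root = 0, next;

    /* Phase 1: build the tree. A[next] becomes the weight of internal
       node `next`; consumed internal-node slots store parent indices. */
    for (next = 0; next < n - 1; next++) {
        /* first child: cheaper of next unused leaf / unused internal node */
        if (leaf >= n || (root < next && A[root] < A[leaf])) {
            A[next] = A[root]; A[root++] = (uint32_t)next;
        } else {
            A[next] = A[leaf++];
        }
        /* second child, same selection rule */
        if (leaf >= n || (root < next && A[root] < A[leaf])) {
            A[next] += A[root]; A[root++] = (uint32_t)next;
        } else {
            A[next] += A[leaf++];
        }
    }

    /* Phase 2: convert internal-node parent links into depths,
       walking from the root (index n-2) downward. */
    A[n - 2] = 0;
    for (next = n - 3; next >= 0; next--)
        A[next] = A[A[next]] + 1;

    /* Phase 3: derive leaf depths (the code lengths) from the internal
       depths, filling A from the back. */
    {
        int avail = 1, used = 0, depth = 0;
        root = n - 2; next = n - 1;
        while (avail > 0) {
            while (root >= 0 && (int)A[root] == depth) { used++; root--; }
            while (avail > used) { A[next--] = (uint32_t)depth; avail--; }
            avail = 2 * used; depth++; used = 0;
        }
    }
}
```

The ascending sort could be done with a bitonic sort in local memory, and since only code lengths come out, you'd finish with a canonical code assignment, which needs just a small per-length counter array on top of this.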

(Sorry for any repetition; I spent too long editing this post.)