Let's discuss which kinds of lossless compression algorithms (or parts of them) can be made faster by employing massively parallel computers, in particular modern GPUs.
I don't have deep knowledge of modern GPUs (still have to read the book), but AFAIK they execute dozens of hardware threads simultaneously (analogous to CPU cores), with each thread processing dozens of numbers at once in SIMD fashion. They also keep dozens of ready-to-run threads per execution thread, much like the HyperThreading CPU technology.
Taking into account execution latencies, and especially memory access latencies (partially hidden by 16-48 KB of on-chip cache), a GPU may be treated as a CPU executing hundreds of threads simultaneously, with a wide SIMD device in each thread. So efficient GPU execution needs algorithms with the following properties:
1) the operation can be split into hundreds of independent tasks, or alternatively into dozens of tasks each accessing only 16-48 KB of data
2) the data can be processed in SIMD fashion, dozens of items in parallel
Computations meeting both criteria are ideal for GPUs. But probably any computation that can be split into 100+ independent jobs will be faster on a GPU, since a CPU can't provide that level of parallelism. And vice versa: a computation that cannot be split into more than ~16 jobs probably cannot benefit from GPU execution, since modern CPUs can execute 8 threads simultaneously, have pretty wide SIMD devices, and have much the same cache hierarchy as GPUs. That said, CPU memory/cache is faster (lower latency) while GPU memory/cache is wider (higher bandwidth), so a bandwidth-hungry algorithm will be a good fit for the GPU too.
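To make this concrete, here is a minimal CUDA sketch of the execution pattern I mean (the kernel, CHUNK, and the other names are my own illustration, not any particular coder): each thread block grabs an independent chunk of the input, stages it in the 16-48 KB of on-chip shared memory, and its threads then work on that chunk in SIMD-style lockstep.

    #define CHUNK   16384   // one independent chunk per thread block; fits in shared memory
    #define THREADS 256     // threads per block, executed in SIMD-style warps

    // Hypothetical per-chunk transform: hundreds of blocks run independently,
    // each touching only its own CHUNK bytes of data (criterion 1), while the
    // threads inside a block work in lockstep on that data (criterion 2).
    __global__ void process_chunks(const unsigned char *in, unsigned char *out, int n)
    {
        __shared__ unsigned char buf[CHUNK];   // on-chip staging, 16 KB
        int base = blockIdx.x * CHUNK;

        // cooperative load: the block's threads stream the chunk into shared memory
        for (int i = threadIdx.x; base + i < n && i < CHUNK; i += THREADS)
            buf[i] = in[base + i];
        __syncthreads();

        // ... per-chunk work goes here (model, encode, transform the chunk) ...

        // cooperative store of the result
        for (int i = threadIdx.x; base + i < n && i < CHUNK; i += THREADS)
            out[base + i] = buf[i];
    }

    // launch with one block per chunk, i.e. hundreds or thousands of independent jobs:
    //   process_chunks<<<(n + CHUNK - 1) / CHUNK, THREADS>>>(d_in, d_out, n);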
So, basically, we just need to select algorithms that can be split into 100+ independent jobs. I will name a few:
1. Static bit-precise (Huffman) encoding, followed by a multi-threaded shift of each chunk's output into its destination bit position (see the first sketch after this list).
2. Any other static-model encoding, such as PPM with a fixed model, with each block's output starting at the next bit/byte boundary; decompression can be made multi-threaded too.
3. Static-dictionary LZ / dictionary replacement.
4. CM encoding with multiple models running simultaneously (see the second sketch after this list).
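To illustrate item 1 (the same trick works for item 2), here's a rough CUDA sketch of the "shift into destination positions". Everything here (CHUNK, count_bits, code_len) is a hypothetical sketch, not working coder code: with a static Huffman table every chunk's encoded bit length can be computed independently; an exclusive prefix sum then turns those lengths into destination bit offsets, and the final write-out is again fully parallel because output regions don't overlap.

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    #define CHUNK 4096                        // input bytes per independent chunk

    __constant__ unsigned char code_len[256]; // static Huffman code lengths, known in advance

    // Phase 1: one thread per chunk counts the bits its chunk will occupy;
    // thousands of chunks are counted at once, completely independently.
    __global__ void count_bits(const unsigned char *in, int n_chunks,
                               unsigned long long *bits)
    {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c >= n_chunks) return;
        unsigned long long b = 0;
        for (int i = 0; i < CHUNK; i++)
            b += code_len[in[c * CHUNK + i]];
        bits[c] = b;
    }

    // Phase 2 (host): an exclusive prefix sum over the per-chunk bit counts
    // yields every chunk's destination bit offset -- the "m/t shift":
    //   thrust::exclusive_scan(bits.begin(), bits.end(), offsets.begin());
    //
    // Phase 3 (not shown): each chunk encodes for real and writes its code
    // stream starting at offsets[c]; regions never overlap, so no locking.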
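And for item 4, a sketch of what "multiple models running simultaneously" could mean: the per-bit coding loop of CM stays serial, but the model predictions for the current bit are independent of each other, so each can be computed by its own thread and then mixed. The predict stub and the naive linear mixer below are placeholders I made up for illustration; a real coder would use its own models and, typically, logistic mixing.

    #define NMODELS 32     // one thread per model

    // stand-in for a real model: each "model" just hashes the context differently
    __device__ float predict(int m, unsigned ctx)
    {
        return ((ctx * 2654435761u >> (m % 16)) & 0xffffu) / 65536.0f;
    }

    // all NMODELS predictions for the current bit are computed in parallel,
    // then reduced into one mixed probability
    __global__ void mix_predictions(unsigned ctx, const float *weight, float *p_out)
    {
        __shared__ float p[NMODELS];
        int m = threadIdx.x;
        if (m < NMODELS)
            p[m] = weight[m] * predict(m, ctx);   // all models run at once
        __syncthreads();
        if (m == 0) {
            float s = 0;
            for (int i = 0; i < NMODELS; i++)
                s += p[i];
            *p_out = s;   // feed this probability to the arithmetic coder
        }
    }

Note that in practice the mixing weights also adapt after every coded bit, which adds a serial dependency, so CM on a GPU usually parallelizes across models and/or across many independent streams rather than inside a single stream.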