I want to try creating a CM compressor and decompressor that runs on GPUs via Nvidia CUDA, so I need some ideas on how to realize the required parallelism. Each instruction needs to be executed in more than 1000 threads in parallel, and ideally without branching, because the instruction scheduler for each package of threads (a warp) can only execute one common instruction at a time, so threads that take different branches get serialized.
So simply running the code we currently have on the GPU (for example ZPAQ, with one data block for each of the ~1000 threads) doesn't work.
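To make the branching concern concrete, here is a minimal sketch in plain C++ (not CUDA yet, and not taken from ZPAQ) of a bit-probability counter update written once with a branch and once without. The 16-bit counter layout and the names `update_branchy`, `update_branchless`, and `rate` are my own assumptions for illustration only.

```cpp
#include <cstdint>

// Hypothetical 16-bit counter: p / 65536 estimates P(next bit == 1).
// rate controls adaptation speed (larger rate = slower adaptation).

// Branchy update: threads that see different bits take different paths.
std::uint16_t update_branchy(std::uint16_t p, int bit, int rate) {
    if (bit) {
        p = static_cast<std::uint16_t>(p + ((65535 - p) >> rate));
    } else {
        p = static_cast<std::uint16_t>(p - (p >> rate));
    }
    return p;
}

// Branchless update: both adjustments are computed, a mask selects one.
// bit must be exactly 0 or 1.
std::uint16_t update_branchless(std::uint16_t p, int bit, int rate) {
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(bit); // all ones if bit == 1, else 0
    const std::uint32_t up   = (65535u - p) >> rate;                 // step toward 1
    const std::uint32_t down = static_cast<std::uint32_t>(p) >> rate; // step toward 0
    return static_cast<std::uint16_t>(p + (up & mask) - (down & ~mask));
}
```

Both functions return the same value for every input, so every thread can execute the identical instruction sequence regardless of the bit it sees. A compiler may well produce similar code from the branchy version on its own, and this only covers the easy part: the serial dependency between successive bits in a CM model is a separate problem.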
I cannot guarantee that I will complete the project, but I will try.