I may have missed something or my knowledge of GPU be lackluster,

but while i do understand that each GPU core can calculate a probability, or to be more precise one prediction stage, based on some hashed-context distribution, i do not understand how the parallelizing can be kept (and therefore taken advantage of) when reaching the final coding step. Assuming an arithmetic or range coder, all final probabilities must be merged into a single stream (or a few if you know how to jump from one to another), and each flow need serialising in the same order as source input.

Now, maybe this final stage can be considered not so costly compared to probability estimation itself. Then, there is still the need to ensure probabilities are encoded in the correct order. Meta-tags, maybe....