Surprisingly, the fastest entropy coder currently available for CPU is rANS, at ~1500 MB/s/core decoding:

https://github.com/jkbonfield/rans_static vs ~500 MB/s for tANS.

This amazing speed is thanks to using SIMD ... but requires 32 independent streams.

Splitting data into independent parts, like H.265 into slices, usually comes at the cost of compression ratio.

For LZ77 we can e.g. prepare a common dictionary, and expand it independently for each "slice".

However, there is practically no penalty for simple models such as i.i.d. or Markov, like those used by the entropy coders in zstd-like compressors.

Zstd uses (4) static probability distributions per frame - we could accumulate multiple such entropy coding tasks and encode/decode them simultaneously exploiting the speedup.

... ok, James' implementation actually uses the same probability distribution for all 32 independent streams - the question is what the speed penalty would be for using separate distributions?
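To make the shared vs separate distribution question concrete, here is a minimal Python sketch of interleaved static rANS where each stream is given its own model. It is not James' code: the names (`build_model`, `encode_streams`, `decode_streams`) are made up for illustration, the state is an unbounded Python integer instead of a renormalized 32-bit word, and there is no SIMD, so it says nothing about the actual speed penalty. It only shows that the per-stream arithmetic is identical and just the table lookup differs.

```python
from collections import Counter

def build_model(sample):
    """Static order-0 model from symbol counts: (freq, cum, total, slot2sym)."""
    freq = Counter(sample)
    cum, slot2sym, running = {}, [], 0
    for s in sorted(freq):
        cum[s] = running                     # start of this symbol's slot range
        running += freq[s]
        slot2sym.extend([s] * freq[s])       # slot -> symbol lookup for decoding
    return freq, cum, running, slot2sym

def rans_encode(symbols, model):
    """Encode into one big-integer state (no renormalization, for clarity)."""
    freq, cum, total, _ = model
    x = 1
    for s in reversed(symbols):              # rANS is LIFO: encode backwards
        f = freq[s]
        x = (x // f) * total + cum[s] + (x % f)
    return x

def rans_decode(x, count, model):
    """Decode `count` symbols from state x."""
    freq, cum, total, slot2sym = model
    out = []
    for _ in range(count):
        slot = x % total
        s = slot2sym[slot]
        x = freq[s] * (x // total) + slot - cum[s]
        out.append(s)
    return out

def encode_streams(data, models):
    """Symbol i goes to stream i % n; stream k is coded with models[k]."""
    n = len(models)
    return [rans_encode(data[k::n], models[k]) for k in range(n)]

def decode_streams(states, length, models):
    n = len(models)
    out = [None] * length
    for k, x in enumerate(states):
        count = len(range(k, length, n))     # symbols that landed in stream k
        out[k::n] = rans_decode(x, count, models[k])
    return out

data = b"abracadabra" * 20
shared = [build_model(data)] * 32                             # one table for all
separate = [build_model(data[k::32]) for k in range(32)]      # one table each
for models in (shared, separate):
    states = encode_streams(data, models)
    assert bytes(decode_streams(states, len(data), models)) == data
```

In the shared case all 32 streams index the same table, which is what keeps a SIMD gather cheap; with separate tables each lane would read from a different base address, and that is where any penalty would come from.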

Also, James has a super-fast o1 (order-1) Markov coder, which might be worth considering for improving the compression ratio of zstd ... ?

The biggest problem here is the size of such models (to be stored in the header) - it would need some work, but should allow for a significant improvement in compression ratio.
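A rough way to look at this trade-off is to compare the ideal coded size under an order-0 and an order-1 static model with the number of table entries each would need. The Python sketch below does that with empirical entropies. The function names are hypothetical, the "entries" count is only a crude proxy for header size (a real format would quantize and compress the tables), and the order-1 figure ignores the cost of the first symbol.

```python
import math
from collections import Counter, defaultdict

def order0_stats(data):
    """Return (ideal payload bits, nonzero table entries) for an i.i.d. model."""
    counts = Counter(data)
    n = len(data)
    bits = -sum(c * math.log2(c / n) for c in counts.values())
    return bits, len(counts)

def order1_stats(data):
    """Same for an order-1 Markov model: one distribution per previous symbol."""
    contexts = defaultdict(Counter)
    for prev, cur in zip(data, data[1:]):
        contexts[prev][cur] += 1
    bits, entries = 0.0, 0
    for counts in contexts.values():
        n = sum(counts.values())
        bits -= sum(c * math.log2(c / n) for c in counts.values())
        entries += len(counts)               # nonzero (context, symbol) pairs
    return bits, entries

sample = b"the quick brown fox jumps over the lazy dog " * 50
for name, stats in (("o0", order0_stats(sample)), ("o1", order1_stats(sample))):
    print(name, "payload bits ~", round(stats[0]), "table entries:", stats[1])
```

The o1 model only pays off if its payload saving exceeds the cost of storing the extra entries, which is why the header representation is the part that would need work.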