nburns, encoding a probability distribution is a general problem, but indeed we should keep concrete applications in mind while designing algorithms - like tANS or rANS (which, combined with the alias method, can fit a huge alphabet (up to ~2^15 symbols) with perfect accuracy into L1 cache).
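To make the alias-method remark concrete, here is a minimal sketch of the standard (Vose) alias table construction and O(1) sampling - the tables are just two arrays of alphabet size, which is what lets even a ~2^15 alphabet stay L1-resident; the function names are mine, not from any particular rANS implementation:

```python
import random

def build_alias_table(probs):
    """Vose's alias method: O(n) preprocessing, then O(1) sampling.
    Returns (prob, alias): two flat arrays of length n."""
    n = len(probs)
    scaled = [p * n for p in probs]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]       # bucket s keeps scaled[s] of its own mass
        alias[s] = l              # the rest of bucket s is donated by symbol l
        scaled[l] = (scaled[l] + scaled[s]) - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in large + small:       # leftovers are full buckets (rounding residue)
        prob[i] = 1.0
    return prob, alias

def sample(prob, alias, rng=random):
    """Draw one symbol: pick a bucket uniformly, then a biased coin flip."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

Each bucket holds at most two symbols, so a lookup is one uniform index plus one comparison - cache-friendly and branch-light.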

I have written ~1/L as the minimal distinguished probability, but indeed for tANS I would distinguish e.g. a singleton in the middle of the table (q ~ 1/L) from a singleton at the end of the table (q ~ 1/(L*sqrt(2))) - and generally such quantization can take basic tuning into consideration (changing the starting bias of symbols), which means going beyond the plain f[s]/L quantization.
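For reference, the plain f[s]/L quantization being discussed can be sketched like this - round probabilities to integer frequencies summing to the table size L, keep every symbol distinguishable (f[s] >= 1), and fix the total greedily by the cheapest Kullback-Leibler adjustment; this is an illustrative sketch, not any specific codec's tuner, and it ignores the positional (singleton placement / starting bias) effects mentioned above:

```python
import math

def quantize_freqs(probs, L):
    """Quantize probabilities to integers f[s] >= 1 with sum(f) == L,
    greedily minimizing the KL distance sum_s p[s]*log(p[s]*L/f[s])."""
    f = [max(1, round(p * L)) for p in probs]
    while sum(f) > L:
        # removing one slot from s raises KL by p[s]*log(f[s]/(f[s]-1))
        s = min((s for s in range(len(f)) if f[s] > 1),
                key=lambda s: probs[s] * math.log(f[s] / (f[s] - 1)))
        f[s] -= 1
    while sum(f) < L:
        # adding one slot to s lowers KL by p[s]*log((f[s]+1)/f[s])
        s = max(range(len(f)),
                key=lambda s: probs[s] * math.log((f[s] + 1) / f[s]))
        f[s] += 1
    return f

def kl_cost(probs, f, L):
    """Rate penalty in bits/symbol of using f[s]/L instead of p[s]."""
    return sum(p * math.log2(p * L / fs)
               for p, fs in zip(probs, f) if p > 0)
```

The max(1, ...) clamp is exactly the "minimal distinguished probability ~1/L" constraint: no symbol with nonzero probability may quantize to zero slots.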

The formulation is as I have written, or exactly as in Yuriy's paper - but for data compression the distortion measure is KL distance, and we can use the fact that some quantized values occur more often and some much more rarely (like at most once for q > 1/2, as you have written) - so we can use an entropy coder to compress the final quantized distribution itself.
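A quick way to see the headroom from that last point: the empirical entropy of the quantized frequency values lower-bounds what an entropy coder could spend on the header. A small sketch (helper name is mine, for illustration only):

```python
import math
from collections import Counter

def header_entropy_bits(f):
    """Empirical-entropy estimate (in bits) for storing the quantized
    frequencies f[] with an entropy coder, exploiting that some values
    (typically small f, e.g. singletons) repeat far more often."""
    counts = Counter(f)
    n = len(f)
    # sum over distinct values v: count(v) * log2(n / count(v))
    return sum(c * math.log2(n / c) for c in counts.values())
```

E.g. for a mostly-flat tail of singletons this is far below len(f) * ceil(log2(L)) bits of naive fixed-width storage, which is the gain being suggested.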