About nonlinear scaling of values before using linear quantization to store them: for example, Huffman coding does something like this - it effectively stores ~round(-lg(p)) as the code length of a symbol of probability p.
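To make the Huffman analogy concrete, here is a quick Python sketch (mine, just for illustration, not from any of the cited posts) - computing Huffman code lengths and comparing them with -lg(p) shows that Huffman in effect quantizes probabilities to powers of 1/2:

```python
import heapq
from math import log2

def huffman_lengths(probs):
    # Build a Huffman tree and return each symbol's code length.
    # Heap items: (subtree probability, unique tiebreak id, symbol indices in subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:            # each merge adds one bit to all symbols below it
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]
for p, L in zip(probs, huffman_lengths(probs)):
    print(f"p={p:.4f}  -lg(p)={-log2(p):.2f}  Huffman length={L}")
```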
I tried a very brief general analysis of this kind of method in some previous post - probabilities from some range are written as the central point of that range.
If these ranges form a family (r_i, r_{i+1}), then to make all of them have the same impact on the deltaH ~ sum_s (p_s - q_s)^2 / p_s approximation, we should choose the r_i according to the recurrence:
r_{i+1} = r_i + C + sqrt(C^2 + 4 C r_i)
for some small constant C (the per-range impact on deltaH). However, the number of such ranges turns out to be relatively large: ~79 for deltaH ~ 0.01, i.e. lg(79) ~ 6.3 bits/symbol to store a quantized probability.
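The recurrence is easy to tabulate numerically. Below is a small Python sketch (my code, not from the earlier post); the starting point r_0 = C and the value C ~ 1.6e-4 are assumptions chosen to roughly reproduce the ~79 ranges quoted above:

```python
from math import sqrt, log2

def quant_ranges(C, r0=None):
    # Boundaries r_0 < r_1 < ... up to ~1 from r_{i+1} = r_i + C + sqrt(C^2 + 4*C*r_i).
    # This recurrence keeps ((r_{i+1} - r_i)/2)^2 / midpoint = C for every range,
    # i.e. the same worst-case per-symbol impact on deltaH.
    r = C if r0 is None else r0        # assumption: start the first range at C
    bounds = [r]
    while r < 1.0:
        r = r + C + sqrt(C * C + 4 * C * r)
        bounds.append(r)
    return bounds

C = 1.6e-4                              # assumed value, tuned to give ~79 ranges
b = quant_ranges(C)
print(f"{len(b) - 1} ranges, lg ~ {log2(len(b) - 1):.2f} bits")
mid = (b[0] + b[1]) / 2                 # sanity check of the constant-impact property
print(((b[1] - b[0]) / 2) ** 2 / mid)   # ~= C
```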
If you instead just took some fixed nonlinear function, it would likely be even worse.
It can be improved by using the fact that larger probabilities are less frequent - so additionally we could use an entropy coder to store these quantized values.
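For example (my sketch, reusing quant_ranges from the snippet above; the Zipf-like source is purely an assumption): mapping each symbol probability to its range index and entropy coding the indices should need far fewer than the flat lg(#ranges) bits, since almost all symbols fall into the low-probability ranges:

```python
from bisect import bisect_right
from math import log2

def zipf(m):
    # Assumed source: normalized 1/k weights over m symbols (Zipf-like).
    w = [1.0 / (k + 1) for k in range(m)]
    Z = sum(w)
    return [x / Z for x in w]

probs = zipf(1000)
bounds = quant_ranges(1.6e-4)           # from the previous sketch
counts = {}
for p in probs:
    i = bisect_right(bounds, p)         # bucket of p (0 = below the first boundary)
    counts[i] = counts.get(i, 0) + 1

m = len(probs)
H = -sum(c / m * log2(c / m) for c in counts.values())
print(f"flat: {log2(len(bounds) - 1):.2f} bits/symbol, "
      f"entropy coded: ~{H:.2f} bits/symbol")
```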
thorfdbg, this is indeed a problem of finding an optimal quantizer (together with a probability distribution over the quantized values), but the solution should also be practical - it cannot be computationally too costly.
Another issue is what type of probability distribution we should expect - can we assume e.g. Zipf's law?
Regarding Yuriy's article: he uses uniform quantization(?), while I am certain that lower probabilities should be stored with better accuracy - do you disagree?
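A quick numeric argument for this (my own, with an assumed 1/256 uniform step): under uniform quantization the worst-case gap |p - q| is constant, so the penalty (p - q)^2/p blows up exactly for the small probabilities:

```python
# Worst case: p sits at a bin edge, q at the bin center, so |p - q| = step/2.
step = 1 / 256                          # assumed uniform quantizer step
for p in (0.5, 0.05, 0.005, 0.0005):
    print(f"p={p:<7} worst-case (p-q)^2/p = {(step / 2) ** 2 / p:.2e}")
```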
Regarding storing the order: indeed, in many cases, like the two-sided exponential distribution in image compression, we can neglect the deviation from perfect order and just store the parameters of the distribution.
However, if we rely on some general law like Zipf's, or on methods like the BWT+MTF mentioned by nburns, the deviation from perfect ordering can be essential (e.g. a single appearance of the EOF symbol in MTF...) and so should somehow be stored.
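To illustrate the MTF point with a minimal sketch (mine; 'E' stands in for a rare EOF-like symbol): a symbol appearing only once emits a large index from wherever it happened to sit in the list - exactly the kind of disturbance a "perfect ordering" model cannot reproduce:

```python
def mtf_encode(data, alphabet):
    # Move-to-front: emit each symbol's current list position, then move it to front.
    order = list(alphabet)
    out = []
    for s in data:
        i = order.index(s)
        out.append(i)
        order.insert(0, order.pop(i))
    return out

print(mtf_encode("aabababE", "abcdE"))   # -> [0, 0, 1, 1, 1, 1, 1, 4]
```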