The choice of tANS L state entropy coding automaton consists of:
- quantization of symbol probability distribution as p[s] ~ q[s]/L fractions
- spreading these symbols is range [0, L-1], such that s appears q[s] times
I have created a C++ toolkit to test different methods here, compare with Huffman: https://github.com/JarekDuda/AsymmetricNumeralSystemsToolkit
A few improvements were made on the way - the most important is finally fast "tuned" symbol spread: using inequalities in q[s] ~ p[s]/L approximations.
So in currently used implementations, the symbol spread usually uses only q, slowly getting some basic "tuning": shifting low probable single appearance symbols to the end of range (recent FSE).
The new spread_tuned() uses all symbol probabilities in addition to q. For example for 4 symbol and L=16 states (binary(0.2,2)):
p: 0.04 0.16 0.16 0.64
q/L: 0.0625 0.1875 0.125 0.625
this quantization itself has dH/H ~ 0.011
spread_fast() gives 0233233133133133 and dH/H ~ 0.020
spread_prec() gives 3313233103332133 and dH/H ~ 0.015 - generally it is close to quantization dH/H
while spread_tuned(): 3233321333313310 and dH/H ~ 0.0046 - better than quantization dH/H due to using also p
Tuning shifts symbols right if q[s]/L > p[s] and left otherwise, getting better agreement and so compression rate.
This compression rate income is the most significant for low L, for example when we need to operate on large number of small distributions.
The idea behind spread_tuned() is pretty simple: we know that probability distribution of states should be approximately P(x) \propto 1/x, so we choose expected symbol appearances to get this rule. It is especially important when switching between different coding tables, but also gives improvements while using single table.
So P(x) \propto 1/x leads to ~1/ln(2) normalization factor (harmonic numbers). Encoding to some appearance of a symbol is made from a range of values: all x such that after k steps "x->floor(x/2)", we get i\in [q[s], 2q[s]-1). Probability of such range for our assumption turns out to be ~ ln(1+1/i)/ln(2). Multiplied by p[s], it is probability of obtained state x: should be Pr(x) ~ 1/x/ln(2), finally getting:
preferred position for i \in [q[s], 2q[s]-1) appearance of symbol s is x = 1/(p[s] * ln(1+1/i))
The spread_tuned() creates lists of preferred symbols for each position, then spreads symbols by reading these lists in order.
It has linear complexity, but might be too slow some applications. For speedup, only low probable symbols can be treated this way (e.g. q=1, maybe 2), and fast spread can be used for the remaining ones.
Single appearance at the end of table corresponds to p ~ 1/(L*2*ln(2)) ~ 0.72/L. it is the lowest probability we can represent well - if you have lots of lower probability symbols, consider grouping them into an escape symbol, decoded in eventual succeeding step (should be rare - nearly does not affect speed).
Another new function is quantize_prec() which finds the best quantization accordingly to sum_s (p[s] - q[s]/L)^2/p[s] approximation of dH. Unfortunately it is n*log(n) complexity.
There is also quantize_fast() which after q[s] = round(p[s]*L), just uses the most probable symbol to repair the balance - sometimes it does not work(!).
There is also new fast method to propagate probability distribution of states (make_step()) - it is low memory and usually quickly converge to stationary probability distribution - required to get hANS (average number of bits/symbol we use). This convergence ("power method") could be replaced by a faster direct method to find the dominant eigenvector, but it would require storing the whole matrix.
Feel free to add more probability distributions, better quantizers and spreads (and let me know if it crashes for some reasonable parameters).
Another important thing to add is quantization/compression of probability distribution, which often has to be stored in the header.