# Thread: tANS - rANS fusion: further simplification at cost of accuracy

1. ## tANS - rANS fusion: further simplification at cost of accuracy

If anybody (like me) thought that ANS would not bring any more surprises ...
While rANS reduced the number of multiplications of range coding from 2 to 1, this time we will reduce it to 0!
The idea comes from Yann's remark that we can see a Huffman decoder as a special case of a tANS decoder - one where symbols are spread in ranges of size 2^k.
So what if we get rid of the q_s = 2^k restriction and instead use q_s ~ L*p_s quantization for the sizes of the symbol ranges - you can test it in the latest version of the toolkit.
Unfortunately dH does not decrease to 0 with growing L as previously - it stops at a rate comparable to Huffman (usually better, but sometimes slightly worse).

In rANS we have used:
p[s] ~ q[s]/M quantization (M divides L, both are powers of 2)
encoding:
C(s,x) = M floor(x / q[s]) + mod(x, q[s]) + B[s]
where B[s] = q[0] + ... + q[s-1]
decoding:
s = s(x mod M)
D(x) = (s, q[s] floor(x / M) + mod(x, M) - B[s])
where s is such that B[s] <= mod(x, M) < B[s+1]
and we need normalization to return to x in the range I = {L, ..., bL-1}.
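
For concreteness, a minimal Python sketch of these formulas (no renormalization; the q values and helper names are mine, using the q[0]=1, q[1]=3 example that appears later in the thread):

```python
q = [1, 3]                                    # q[s], with M = sum(q)
M = sum(q)
B = [sum(q[:s]) for s in range(len(q))]       # B[s] = q[0] + ... + q[s-1]

def C(s, x):                                  # encoding step C(s, x)
    return M * (x // q[s]) + x % q[s] + B[s]

def D(x):                                     # decoding step D(x) = (s, x')
    r = x % M                                 # s is such that B[s] <= r < B[s+1]
    s = max(i for i in range(len(q)) if B[i] <= r)
    return s, q[s] * (x // M) + r - B[s]
```

D is the exact inverse of C, which is easy to check for a few states.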

Let us choose the M=L, b=2 degenerate rANS case. Equivalently, take tANS with b=2. Use quantization with denominator L:
p[s] ~ q[s]/L
This time L=M can be chosen as any number, making quantization trivial: e.g. q[s] = round(2^R * p[s]) and then L = sum_s q[s].
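
For example, a sketch of this trivial quantization (the probabilities are made up, and the max(1, ...) guard is my addition to keep every symbol encodable):

```python
# trivial quantization: q[s] = round(2^R * p[s]), then L = sum(q)
p = [0.03, 0.03, 0.94]        # example probabilities, not from the post
R = 4
q = [max(1, round((1 << R) * ps)) for ps in p]   # keep every q[s] >= 1
L = sum(q)                    # L need not be a power of 2
```

Here rounding gives q = [1, 1, 15], so L = 17 - not a power of 2, and that is fine for this variant.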
Now spread the tANS symbols in ranges: symbol s occupies positions {B[s], ..., B[s+1]-1}
where B[s] = L + q[0] + ... + q[s-1] (larger by L this time)
x is in the range I = {L, ..., 2L-1}; symbol s is encoded from the range {q[s], ..., 2q[s]-1}

decoding step is
```
s = s(x);                               // like for rANS, different approaches possible
x = q[s] + x - B[s];                    // the proper decoding step
while (x < L) x = (x << 1) | next_bit();   // renormalization: read bits from stream
```

encoding s:
```
while (x >= 2*q[s]) { produce(x & 1); x >>= 1; }   // renormalization
x = B[s] + x - q[s];                               // the proper coding step
```
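
The two pseudocode steps above combine into a small self-contained sketch (mine): symbols are encoded in reverse order and bits are read back in reverse, as usual for ANS.

```python
def make_hans(q):
    # symbols spread in contiguous ranges of q[s] positions; L = sum(q)
    L = sum(q)
    B = [L + sum(q[:s]) for s in range(len(q))]   # B[s] = L + q[0] + ... + q[s-1]

    def encode(symbols):
        x, bits = L, []
        for s in reversed(symbols):               # ANS encodes in reverse order
            while x >= 2 * q[s]:                  # renormalization
                bits.append(x & 1)
                x >>= 1
            x = B[s] + x - q[s]                   # the proper coding step
        return x, bits

    def decode(x, bits, n):
        out = []
        for _ in range(n):
            s = max(i for i in range(len(q)) if B[i] <= x)
            x = q[s] + x - B[s]                   # the proper decoding step
            while x < L:                          # renormalization
                x = (x << 1) | bits.pop()         # consume bits in reverse of production
            out.append(s)
        return out

    return encode, decode
```

With q = [1, 3] this reproduces the worked example in the thread: encoding [1, 0, 1, 1] gives x = 6 with bits [0, 1, 0], and decoding reverses it exactly.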

We get a kind of extension of the Huffman concept (it would be Huffman when all q[s] are powers of 2) - with trivial quantization (instead of costly tree building) and a usually slightly better rate (due to using more accurate probabilities).
It should be slightly faster than rANS at the cost of a slightly worse compression rate (comparable to Huffman). SIMD or single-step renormalization using a CountLeadingZeros operation is possible - maybe it could be useful for something ultrafast, e.g. texture compression?
Obviously it can be simplified by storing B[s] - q[s].
dH depends on the order of frequencies, but it is not a simple dependence (still to be worked out).
Any good name for this Huffman-like variant? (hANS?)
Any more interesting variants ... ?

------------------------------------
Example:
For q[0]=1, q[1]=3, L=M=4, I={4,5,6,7}: the first position corresponds to '0', the other three to '1'
B[0]=L=4, B[1]=L+q[0]=5
B[0]-q[0]=3, B[1]-q[1]=2
Now from state x=4 we decode to symbol 0, new x = 4 - 3 = 1, then we read 2 bits to get back to I.
From states 5,6,7 we decode to symbol 1, new x = x - 2 is 3,4,5; the renormalization reads 1 bit (for 3) or 0 bits (for 4,5) to get back to I.
Encoding is the opposite: first renormalize to get to the {1} or {3,4,5} range respectively, then add 3 or 2 respectively.

2. ## Thanks:

Cyan (25th August 2014)

3. This may be a dumb question, but I am not sure of the answer, so here it is.

Wouldn't this technique also work for AC if it had the same restrictions on L and M?
If not, can you explain why not?

4. You mean for a binary alphabet? Sure.
M was for rANS; in this simplified variant M=L. The only restriction here is that L is a power of 2 (it can be weakened).
Some quick tests from the toolkit for a binary alphabet (init_binary(p,1)), L=128 - dH for the smaller of spread_range() and for Huffman (q[0]=q[1]=L/2):
```
p      dH/H spread_range()  dH/H Huffman
0.01   0.0035               11.3
0.02   0.007                 6.1
0.03   0.017                 4.1
0.04   0.017                 3.1
0.05   0.015                 2.5
0.06   0.018                 2.1
0.10   0.02                  1.1
0.15   0.023                 0.64
0.20   0.023                 0.39
0.25   0.013                 0.23
0.30   0.026                 0.13
0.35   0.023                 0.07
0.40   0.030                 0.03
0.45   0.007                 0.007
0.50   0                     0
```

6. I made a "fusion" algorithm quite some time ago.
(I wasn't aware of it at the time - "rANS" did not exist yet.)

The starting point was the realization that simply duplicating the distribution pattern 2^N times
was enough to considerably improve encoding/decoding accuracy.
The intention was to use a smaller decoding table, typically a 4-state table for a 25/75 probability distribution.

I made a demo, and it worked as intended.
Sure, accuracy was improved, but encoding/decoding speed was also largely down.
This is because using the left-side bits required a multiplication on top of the usual LUT for the right-side bits.
It proved a performance killer. I simply stopped the experimentation there, because the small compression gain was not worth the speed cost.

Your comment seems to imply that this methodology could be simplified into a simple addition,
which would give it much better prospects for performance
(the scope is a bit different, but I've yet to determine exactly by how much).

Sounds good. However, since then, we have learned a bit more:
the single-pattern tANS distribution is not "linear" - not all positions in the pattern have the same probability.
This unequal probability distribution can be exploited to gain a bit more compression,
by intentionally positioning low-probability symbols at the extreme end of the table.

This simple trick does improve compression,
to the point of getting even better results than what a theoretical perfectly-flat-probability distribution would provide.

Such a gain would be lost in an algorithm which improves the distribution equality.

At the end of the day, it's not simple to tell which method is best,
since the gains on one side compensate the losses on the other.
Trials and tests will help determine the better result.
Also, note that we are talking about minor differences here, in the range of < 0.1%, so it's not going to change compression efficiency much.

7. ## Thanks:

Jarek (25th August 2014)

8. Originally Posted by Jarek:
encoding s:
```
while (x >= 2*q[s]) { produce(x & 1); x >>= 1; }   // renormalization
```
q[s] doesn't change during encoding, right? It appears that this will output the same number of bits for a particular symbol every time it encodes it. If that is the case, then you should not expect to beat Huffman - you're playing Huffman's own game.

9. Hi Cyan,
Indeed, from the perspective of tANS it is not interesting, but from the perspective of rANS it allows making the ultrafast 8x16-bit SIMD version even faster (or maybe even a 16x8-bit one for a small alphabet) at the cost of compression rate - maybe it would let entropy coders reach applications where they probably did not fit before, like texture compression? (I know, the main problem there is the unpredictable size.)
Also, not needing multiplications means a simpler hardware decoder.

Yes, we can simplify it to just an addition/subtraction:
positions B[s], ..., B[s+1]-1 of I = {L, ..., 2L-1} correspond to appearances q[s], ..., 2q[s]-1 of symbol s,
so the ANS step is just adding/subtracting B[s] - q[s].

The big question here is indeed the optimal order of frequencies, but as its main advantage is speed, the first answer is to just leave the symbols as they are - without any ordering.
And I think we can attack this problem from the other side - the quantizer: give lower q to low-index symbols, higher q to high-index symbols.
nburns,
to encode symbol s, we need to get to the {q[s], ..., 2q[s]-1} range - this loop does it.
In static compression q[s] ~ L*p[s] does not change, but generally we can update it or choose it according to context.
No, it will not use the same number of bits for a specific symbol every time: take q[1]=3, L=4 - there will be 0 bits produced for x=4,5 and 1 bit for x=6,7.
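
This is easy to check with a tiny sketch (names are mine) that counts the bits emitted by the renormalization loop for each starting state:

```python
q1, L = 3, 4          # symbol '1' with q[1]=3, state range I = {4,5,6,7}

def bits_for(x):
    # number of bits emitted while renormalizing state x before encoding '1'
    n = 0
    while x >= 2 * q1:
        x >>= 1
        n += 1
    return n
```

States 4 and 5 emit 0 bits, states 6 and 7 emit 1 bit - the bit count depends on the state, unlike in Huffman.
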
Yes, I'm playing Huffman's game - due to Piotr's post.

10. Originally Posted by Jarek:
No, it will not use the same number of bits for a specific symbol every time: take q[1]=3, L=4 - there will be 0 bits produced for x=4,5 and 1 bit for x=6,7.

You're right... I was about to post. I was wrong about that.

For some reason, I have not been able to understand the rANS formula and how it works. I guess there is enough detail provided to implement it. Maybe what I need to do is code it and see what it does.

11. rANS just takes the spread - q[0] appearances of '0', q[1] of '1', ... - and repeats such length-M blocks to infinity,
for example 0111011101110111 ... for q[0]=1, q[1]=3 (M=4), i.e. for the probability distribution P(0)=1/4.
Then using the ANS coding rule "x becomes the x-th appearance of symbol s", you get the rANS formulas (the floor determines the block number) - just try to derive them yourself.
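
This derivation can be checked mechanically - a small sketch (mine) comparing the rANS coding formula against the explicit repeated spread:

```python
q = [1, 3]                                   # q[0]=1, q[1]=3, so M=4
M, B = sum(q), [0, 1]                        # B[s] = q[0] + ... + q[s-1]
spread = "0111" * 8                          # a finite prefix of the infinite repetition

def C(s, x):                                 # the rANS coding formula
    return M * (x // q[s]) + x % q[s] + B[s]

# "x becomes the x-th appearance of symbol s" in the spread:
for s in (0, 1):
    appearances = [i for i, c in enumerate(spread) if c == str(s)]
    for x in range(5):
        assert C(s, x) == appearances[x]
```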

12. I'm not totally sure I get it.
In the formula:

Code:
```
encoding s:
while (x >= 2*q[s]) { produce(x & 1); x >>= 1; }   // renormalization
x = B[s] + x - q[s];
```
while the renorm is fine, B[s] is a problem for my understanding.
You mention that:
Code:
```
B[s] = L + q[0] + ... + q[s-1]
```
So it's a static value.

While I can get it for L=M, it's not so clear when L/M = 2^k.
Could you provide another example with L/M != 1?
I believe the sum needs some kind of scaling factor, because
Code:
```
p[s] ~ q[s]/M
```

13. This degenerate rANS variant requires L=M and b=2 to replace the multiplication/division with a single addition/subtraction.
Let us look at the example q[0]=1, q[1]=3, L=M=4, I={4,5,6,7}: the first position corresponds to '0', the other three to '1'
B[0]=L=4, B[1]=L+q[0]=5
B[0]-q[0]=3, B[1]-q[1]=2
Now from state x=4 we decode to symbol 0, new x = 4 - 3 = 1, then we read 2 bits to get back to I.
From states 5,6,7 we decode to symbol 1, new x = x - 2 is 3,4,5; the renormalization reads 1 bit (for 3) or 0 bits (for 4,5) to get back to I.
Encoding is the opposite: first renormalize to get to the {1} or {3,4,5} range respectively, then add 3 or 2 respectively.

14. Yes, but what is less obvious is the case L/M = 2^k > 1.

15. In that case we need the floor(x/M) or floor(x/q[s]) to determine which block we are in - the general rANS.
Here we degenerate to a single block of interest - the first one, I={L,...,2L-1} (before there was also the zero-th: {0, ..., L-1}) - allowing us to simplify the proper step to a single addition/subtraction.

16. OK, I feel I'm starting to get it.

My understanding:
It's like tANS encoding, except that symbols are not spread within M=L; they just lie next to each other in their natural order.
Because of this property, there is no longer a need for a spread table, which makes encoding/decoding more straightforward.

Since it's a degenerate case, its ability to grab fractional bits is lessened, so its compression ratio will be worse.
But it will still be better than Huffman.
So it's another trade-off in the complexity/efficiency mix.

17. Yes, it is like tANS with symbols spread in ranges of length q[s] in their natural order (or any other order).
Equivalently, it is the M=L, b=2 degenerate case of rANS.
So it is kind of in the middle, with an extremely simple coding step - an addition/subtraction.

The compression rate is usually better than Huffman's; however, there are exceptions, which probably can be improved by the quantizer (decreasing q for low-index symbols, increasing it for the rest) - I could work on that if there is some interest.

18. > The compression rate is usually better than Huffman, however there are exceptions which probably can be improved by quantizer

Indeed - remember that positions are not equivalent: being at the beginning of the range is worth more than being at the end.
It might be worth considering in the quantizer estimation.

19. Indeed, so at first I thought about using the deformation p[i] -> p[i] (1 + p[0] + ... + p[i-1] + p[i]/2) while calculating the quantization, to relate it to the Pr(x) ~ 1/x probability distribution of states, but ...
... for a Huffman decoder this stationary probability distribution of states is very different: it is uniform, Pr(x) ~ 1!
And in this case we also get closer to this Huffman-like stationary probability of states (hANS?).

Specifically, encoding a symbol in Huffman rescales the whole range of states into a 2^k-size subrange corresponding to this symbol.
This time, when the size of the subrange is no longer a power of 2, encoding a symbol divides the range into two parts (one corresponding to k produced bits and one to k+1), which go into the subrange corresponding to this symbol in switched order and with different scaling factors (the k+1-bit part is squeezed twice as much).
So the stationary probability distribution would become an IFS fractal for L -> infinity; e.g. here is a histogram for q[0]=100, q[1]=60 (Mathematica document). Additionally, it turns out that the graph of transitions is not always connected for such a coder (switching the symbol order can repair it), which means staying in a subset of the possible states.
These two factors are probably the reason for the disappointing Huffman-like rates.

Improving its compression rate seems a difficult task; I think I will leave it for now.
It could be used as an improvement of the Huffman concept - usually a slightly better rate, no need to build the tree ... but in the ANS era it no longer seems attractive (?)

20. I started implementing this yesterday. I got the encoder apparently working. I'll probably post code at some point, regardless.

21. You can just take Fabian's SIMD implementation and replace the arithmetic with the simplified one ...

22. Originally Posted by Jarek: You can just take Fabian's SIMD implementation and replace the arithmetic with the simplified one ...
That would have been a different exercise, and wouldn't necessarily have saved time.

23. Originally Posted by Jarek: Additionally, it turns out that the graph of transitions is not always connected for such a coder (switching the symbol order can repair it), which means staying in a subset of the possible states. These two factors are probably the reason for the disappointing Huffman-like rates.
My grasp of the math behind this is still kind of weak, but maybe what is needed to get a connected graph is a prime M, or prime numbers somewhere (I don't know where else besides M).

24. L=M can be any number here - making quantization trivial (e.g. start with q[s] = round(2^R * p[s]), and then just take L = sum q[s] instead of 2^R).
The only issue with L not being a power of 2 is that the number of bits to read during renormalization depends on their sequence.
For example for L=5, starting with x=2: if the first bit to read is 1, we get to 5 and stop reading; if it is 0, we read one more bit.
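
A tiny sketch of this effect (decoder-side renormalization; the helper is mine):

```python
L = 5                                    # not a power of 2; I = {5, ..., 9}

def renorm(x, bits):
    # read bits until x is back in I; the count depends on the bit values themselves
    bits, n = list(bits), 0
    while x < L:
        x = (x << 1) | bits.pop(0)
        n += 1
    return x, n
```

Starting from x=2: a first bit of 1 reaches 5 immediately, while a first bit of 0 gives 4 < 5 and forces one more bit to be read.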

However, connectedness is a matter of how the ranges cover each other (it can be deduced from the way I={L,...,2L-1} transforms into subranges during renormalization), and changing L=M only chooses the overall scale - it shouldn't change much here.
And I think the inevitable fractal pattern is more essential in crippling the rate to Huffman levels.

25. Here's the code I promised. It still has bugs: it works on some files (enwik7) and fails on others (enwik7+bwts+mtf). It's fairly rudimentary - it uses a binary alphabet and treats all bits equally, no matter where they appear in a byte. One of the goals, though, was to see just how simple it could be.

I'm not sure if I'll work on it any more, so anyone who wants to debug it is welcome to do so.

Edit:
I fixed it and a fixed version is attached. I haven't done extensive testing, but it works on those two files.

26. There is a reason why a SIMD version of the above variant (hANS?) will be rather slower than one of rANS: b=2 means bit-by-bit renormalization, which cannot really be sped up with SIMD.
This bit-by-bit renormalization is also the reason why using this formulation directly for Huffman (q[s] = 2^k[s]) will be rather slower than SIMD rANS decoding.

So the question is whether we could speed up SIMD Huffman by switching to e.g. byte-by-byte renormalization, as for b=256 rANS.
We can do something similar ... by using rANS - if all q are powers of 2, q[s] = 2^k[s], then again we no longer need multiplications:
L is a multiple of M = 2^R, where 2^R = sum_s q[s] (R is the maximal Huffman code length)
I = {L, ..., bL-1}, e.g. for b=256
symbol[0 .. M-1] - table for single-step Huffman decoding
msk[s] = (1 << k[s]) - 1,
B[s] = q[0] + ... + q[s-1].

Now the rANS encoding step becomes (e.g. M floor(x / q[s]) = (x >> k[s]) << R):
x = ((x >> k[s]) << R) + (x & msk[s]) + B[s]
decoding:
s = symbol[x & (M-1)]
x = ((x >> R) << k[s]) + (x & (M-1)) - B[s]

So if someone really loves building Huffman trees (to get the q[s] = 2^k[s] quantization) and wants fast SIMD decoding (a bit faster due to the lack of multiplication), he can use this Huffman-like rANS (I am out of new names).
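
A toy sketch of these shift-only steps (single step, no renormalization; the 3-symbol quantization and names are made up for illustration):

```python
k = [1, 0, 0]                 # q[s] = 2^k[s] = [2, 1, 1], a Kraft-complete set
R = 2                         # M = 2^R = sum of q[s] = 4
M = 1 << R
msk = [(1 << ks) - 1 for ks in k]
B = [0, 2, 3]                 # B[s] = q[0] + ... + q[s-1]
symbol = [0, 0, 1, 2]         # single-step Huffman decoding table over x & (M-1)

def encode_step(s, x):        # no multiplication - only shifts, masks and adds
    return ((x >> k[s]) << R) + (x & msk[s]) + B[s]

def decode_step(x):
    s = symbol[x & (M - 1)]
    return s, ((x >> R) << k[s]) + (x & (M - 1)) - B[s]
```

Each decode step inverts the corresponding encode step, as in the general rANS formulas.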

27. This algorithm doesn't remind me of Huffman at all. What is the relationship between M and L, exactly? L is the lower limit; b is the size of a chunk to read/write at a time (and the scale factor between the upper and lower limits); but I'm confused about what essential role M plays.

Perhaps the concept could be reduced to its essentials and expressed more simply somehow. In fpaqc, the system can be conceived of as a number in base 16: 8 bits are shifted on or off at a time, L is 2^12, and the upper limit is 2^20, so the greatest common divisor is 2^4.

Edit:
Sorry - that's not the GCD, but the GCD of all of the bit widths is 4. So that is a digit.

28. There are common parts with Huffman here: the q[s] = 2^k[s] quantization (e.g. found by the Huffman algorithm) and the symbol[] table, exactly as in one-step Huffman decoding.
However, as I have tried to emphasize, there are also differences - it is just rANS for a q[s] = 2^k[s] quantization.
Compared to them:
- rANS: thanks to using bit shifts instead of multiplication/division, it should allow for faster SIMD rANS - at the cost of Huffman's rate,
- Huffman: it allows byte-by-byte renormalization, allowing a faster SIMD implementation.

29. How much faster are shifts and adds than multiplies? I thought most are 1 cycle these days anyway - or does it differ greatly in SIMD modes? It may depend on whether they can be interleaved with other instructions to achieve the desired throughput.

30. Indeed, the difference would probably be negligible(?) - Fabian's 8-way 16-bit SIMD takes ~6 cycles/symbol; if we could save, say, 2 cycles on the multiplication, it would be barely 2/8 = 0.25 cycles/symbol of speedup.

31. I have had the impression that multiplies are still relatively slow, and divides are much worse. Even if the latency is negligible, I think multiplications may tie up more resources in the CPU, which decreases throughput.

If I'm wrong about that and multiplies are now basically free, that will come as news and change my thinking about optimization.

32. This may be more interesting for 64-bit (also 32-bit) scalar rANS, especially on processors with slow multiplication. For SIMD you need variable bit shifts, which are available only on AVX2 (_mm256_srlv_epi32).

Another advantage is the reduced size of the header, as you are storing only k[s].

33. A brief intro to the Sandy Bridge..Haswell microarchitecture:

minimum execution time is 1 tick (cpu cycle); a 4 GHz cpu has a cycle of 0.25 ns. I will talk about the resources of one cpu core. Haswell has 4 arithmetic units named p0156 (previous cpus had 3), plus 2 load and 1 store units (previous had a little less). each cycle, up to 4 operations may be "retired", i.e. their results saved into real registers/memory. there is a reservation station that holds about 100 commands that are already prepared but waiting for operand readiness, execution or retirement. there are about 100 integer and about 100 SIMD registers that are dynamically assigned to the architectural registers to support parallel execution

the simplest operations are register moves and zeroing. they are handled at the register renaming stage (prior to placement in the reservation station), so they are effectively executed in 0 cycles. this makes the cpu as efficient as ones having 3-operand arithmetic and a constant zero register

the next level is simple scalar arithmetic - add/sub/and/or/xor can be executed on any of the 4 (3 on previous cpus) ALUs, and the result is ready on the next cycle. note that if you have a lot of independent calculations, you may perform 4 operations per cycle, but if you have a chain of dependent computations like ((x>>5)+7)<<9, the speed will be only 1 op/cycle. fortunately, the cpu reorders operations in that 100-command reservation station and executes them as their operands become ready. but a usual program can't fill even those 4 execution units anyway, due to the dependencies between its operations

the same simple arithmetic on SIMD registers also has a delay of 1 (i.e. results are ready on the next cycle), but only 2 EUs - p15 - are capable of performing these operations

scalar shifts by a fixed number of bits (shl/shr/rol/ror) are performed in 1 cycle on one of 2 EUs - p06 - while a shift by a variable number of bits requires 3 micro-ops on the same 2 EUs and therefore has a 2-cycle delay

SIMD shifts (shl/shr/sar) have a 1-cycle delay, performed only on p0 for fixed-size shifts; variable-size shifts (with the same shift amount for every word in the SIMD register) are 1 cycle, 2 mops, performed on p0 and p5

scalar MUL has a delay of 3 and can be executed only on p1. note that almost all operations are pipelined; this means you can start an independent MUL operation every cycle, i.e. a MUL started on cycle 1 will return its result by cycle 4, the next MUL started on cycle 2 will return its result on cycle 5, and so on

SIMD MUL performing 32*32->32 multiplication has a delay of 10 and occupies the only capable engine, p0, for 2 cycles. this means you can start an independent operation of this type only once per 2 cycles. SIMD MUL of the 32*32->64 type, as well as the other types, has a delay of 5 and occupies p0 for only 1 cycle

LOAD operations have a delay of 4/12/30 cycles if the data comes from the L1/L2/L3 cache, and 40-50 ns for DRAM (maybe even more on -E cpus or when addressing multi-gigabyte tables)

DIV is the only non-pipelined operation; its execution time is ~10-100 cycles

add/or/cmp/test operations are fused with the following jxx conditional jump and executed as a single operation on one of 2 EUs (1 in previous cpus). a jump (fused or unfused) is executed in 1 cycle just like an arithmetic op if it was successfully predicted, but needs at least 14 cycles to refill the decoder pipeline otherwise. while the decoder pipeline is being refilled, the remaining ops in the reservation station continue to execute, but no new ops are added

once again: http://www.agner.org/optimize/microarchitecture.pdf for details of the microarchitecture and http://www.agner.org/optimize/instruction_tables.pdf for operation timings

34. ## Thanks (3):

JamesB (31st October 2016),Jarek (29th August 2014),nburns (29th August 2014)
