1. ## Asymmetric Numeral System

I've been interested lately in Jarek Duda's ANS, out of curiosity.

The intention is to experiment with an entropy decoder suitable for low-performance CPU.
By low-performance, I mean a CPU without built-in multiplication capability (multiplication can be emulated, but it's much slower).
I understand that binary arithmetic decoders can work properly with just multiplications and comparisons, making them quite fast and efficient.
(General arithmetic decoders, on the other hand, require divisions.)

My understanding is that ANS exchanges these requirements for the following ones:
1) Exclusive use of look-up tables (with the risk of them becoming very large)
2) Decoding in the reverse direction of encoding
3) Designed for fixed probability rates (as a consequence of 1 & 2), but Matt found a way around it by saving both the symbol and its probability into an intermediate table and encoding them in reverse order. (I still have to understand how the lookup table can be modified on-the-fly though...)

So it's not an "all win".

So far, I've read Jarek's paper (well, tried to...), Matt's fpaqa source code and DCE comments, and Andrew Polar's blog entry. I've also played with the Wolfram demo.

And I still don't get it.
I feel I can make it work on some simple cases, but I don't understand why it works, and therefore how to use it in more generic circumstances.

My current understanding is as follows :
As the encoder receives new symbols to encode, the "state" value grows, according to a translation table.
The higher the probability, the less the state value increases.
The intuition is that a large number in the translation table results from
either a few low-probability symbols, or many high-probability symbols. Or of course a mix of them.

Decoding is simply a reverse operation, traversing the translation table in the other direction.
So any number in the table represents a deterministic sequence of symbols.

Obviously, the translation table is not infinite, so at some point, there is a need to "resize" the state value, so that it remains within the table.
That's where I get lost.

fpaq recommends keeping the "state value" between 2^N and 2^(N+1).
The Wolfram demo keeps it between 2^N and 2^(N+2), updating 2 bits at a time.
Andrew Polar keeps it below 2^N, resizing by any number of bits, to remain within [2^(N-1)..2^N[.
This shows there are different possibilities.
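
As an illustration of the first convention, here is a small sketch (mine, not fpaq's actual code, and details of when renormalization happens vary between implementations): whenever the state grows past 2^(N+1), low bits are shifted out until it falls back into [2^N, 2^(N+1)).

```python
# Illustrative sketch (not fpaq's actual code): keep the encoder state
# within [2^N, 2^(N+1)) by shifting out low bits once it grows too large.
N = 16

def renormalize(x, out_bits):
    # emit low bits until the state is back under 2^(N+1)
    while x >= 1 << (N + 1):
        out_bits.append(x & 1)
        x >>= 1
    return x

bits = []
x = renormalize((1 << 20) + 5, bits)
print(x, bits)   # state is back in [2^16, 2^17); low bits were spilled
```

The other two conventions differ only in the window and in how many bits move at a time.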

The decoding side is well described on Wolfram :
The "state" value decreases, outputting one symbol at a time.
When the "state" reaches a value within [2^(N-2)..2^N[, 2 bits are read from the bitstream, ensuring the new "state" value is once again within [2^N..2^(N+2)[.
The translation table below 2^N is not shown. (does it exist ?)
The corresponding encoding side, on the other hand, is mysterious.

To better study it, I've been playing with a simple/trivial example, using a binary distribution as follows:
0: 75% ; 1: 25%
The good thing is that it allows playing with N=2, which is easier to manage by hand on paper.

It's easy to get the Wolfram demo to generate a working solution,
but I wish I could understand how it is generated, and how it works, in both directions (encoding & decoding).
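
For what it's worth, here is a toy sketch of how such a construction can work (my own reading of the Wolfram demo's convention, so treat it as a hedged sketch, not a reference implementation): build a decode table from a symbol layout over states [L, 2L), invert it for encoding, and verify a roundtrip. Note the decoder outputs symbols in reverse order, as expected of ANS.

```python
L = 4                      # number of states; states live in [L, 2*L)
layout = "0100"            # symbol assigned to states 4,5,6,7

freq = {s: layout.count(s) for s in "01"}   # '0': 3, '1': 1

# decode table: state -> (symbol, bits to read, raw next state)
decode_table = {}
rank = {"0": 0, "1": 0}
for i, s in enumerate(layout):
    x_raw = freq[s] + rank[s]          # lands in [freq[s], 2*freq[s])
    rank[s] += 1
    nbits = 0
    while (x_raw << nbits) < L:        # read bits until back in [L, 2L)
        nbits += 1
    decode_table[L + i] = (s, nbits, x_raw)

# encode table: inverse mapping (symbol, raw state) -> state
encode_table = {(s, x_raw): st for st, (s, _, x_raw) in decode_table.items()}

def encode(symbols, x=L):
    bits = []                          # ANS is LIFO: bits form a stack
    for s in symbols:
        while x >= 2 * freq[s]:        # renormalize into [freq[s], 2*freq[s])
            bits.append(x & 1)
            x >>= 1
        x = encode_table[(s, x)]
    return x, bits

def decode(x, bits, n):
    out = []
    for _ in range(n):
        s, nbits, x = decode_table[x]
        out.append(s)
        for _ in range(nbits):         # pop back the most recently saved bits
            x = (x << 1) | bits.pop()
    return x, out

msg = "0010010001"
state, bits = encode(msg)
state, decoded = decode(state, bits, len(msg))
print("".join(reversed(decoded)) == msg, state == L)
```

Each decode step exactly inverts the matching encode step, so the final decoded state returns to the initial one.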

Has anyone tried this before?

2. I looked at it a while ago and it seemed to make no sense. Your post made me take another look at it. I haven't quite fully wrapped my head around it, but it seems like almost the same thing as arithmetic coding approached from a different angle. It keeps multiplying by 1/p and shifting off the low bits. The symbols aren't in ranges, they're sprinkled all throughout in proportions equal to their probability. I'm thinking that, unlike arithmetic coding, what it's doing might not have any simple mathematical explanation other than coding something.

3. I've been playing a bit with a few 75/25 layouts lately.
According to Jarek's theory, they are all equivalent.

They nonetheless result in different state transition rules.
Here are a few decoding state table examples:

1) Using distribution 0 1 0 0
States are numbered 4-7
4 : output 0; read 1 bit => next 6 or 7
5 : output 1; read 2 bits => next 4 to 7
6 : output 00; read 1 bit => next 6 or 7
7 : output 01; read 2 bits => next 4 to 7

2) Using distribution 1 0 0 0
States are numbered 4-7
4 : output 1; read 2 bits => next 4 to 7
5 : output 0; read 1 bit => next 6 or 7
6 : output 00; read 1 bit => next 6 or 7
7 : output 000; read 1 bit => next 6 or 7

Both are supposed to compress a 75/25 input equally well.
Is that true?
It's difficult to judge just by looking at the transition table.
I guess coding an analysis program is required to get to the result.
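
One quick way to check, without a full encoder (a sketch of my own, not from the thread): drive decode table 1 above with uniformly random bits and count the output symbols. The zero fraction settles close to 10/14 ≈ 71.4%, not exactly 75%.

```python
# Monte Carlo check (my sketch): feed decode table 1 uniformly random
# bits and measure the 0/1 ratio of its output.
import random

random.seed(42)

# state -> (symbols output, number of bits to read, base of next state)
table1 = {4: ("0", 1, 6), 5: ("1", 2, 4), 6: ("00", 1, 6), 7: ("01", 2, 4)}

counts = {"0": 0, "1": 0}
state = 4
for _ in range(200000):
    out, nbits, base = table1[state]
    for sym in out:
        counts[sym] += 1
    state = base + random.getrandbits(nbits)

frac0 = counts["0"] / (counts["0"] + counts["1"])
print(round(frac0, 3))   # close to 10/14 ~ 0.714, not 0.75
```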

4. Originally Posted by Cyan I've been playing a bit with a few 75/25 layouts lately.
According to Jarek's theory, they are all equivalent.
...

Both are supposed to compress a 75/25 input equally well.
Is that true ?
It's difficult to judge it just by looking at the transition table.
I guess coding an analysis program is required to get to the result.
Jarek's paper is hard to follow, I think partly because of English difficulties and partly because of the math terms he uses, which I'm not fully up to speed on. That said, the number of bits that you output after a given symbol determines the compression. I looked at the tables in Andrew Polar's post and checked through doing a few examples in my head that, for instance, symbol A in the very first table averages between 1 and 2 bits, as expected.

5. Originally Posted by Cyan I've been playing a bit with a few 75/25 layouts lately.
According to Jarek's theory, they are all equivalent.

They nonetheless result in different state transition rules.
Here are a few decoding state tables examples :

1) Using distribution 0 1 0 0
States are numbered 4-7
4 : output 0; read 1 bit => next 6 or 7
5 : output 1; read 2 bits => next 4 to 7
6 : output 00; read 1 bit => next 6 or 7
7 : output 01; read 2 bits => next 4 to 7

2) Using distribution 1 0 0 0
States are numbered 4-7
4 : output 1; read 2 bits => next 4 to 7
5 : output 0; read 1 bit => next 6 or 7
6 : output 00; read 1 bit => next 6 or 7
7 : output 000; read 1 bit => next 6 or 7

Both are supposed to compress a 75/25 input equally well.
Is that true ?
It's difficult to judge it just by looking at the transition table.
I guess coding an analysis program is required to get to the result.
Looking at the first table, if decoding a long random stream I would get a ratio of 7 zeros to 3 ones.
With the second table, if I was decoding a long random stream, as soon as you get to state 5, 6 or 7 you
never get to state 4 again, so no more ones are possible in the output stream.

Of course, I never really looked at the decompressor you're talking about, so maybe you're leaving out
some information in your example of decompression. And I am assuming that the decompression of
a random stream should go to the 75/25 you seek.

6. I noticed that the lookup table is the state transition table for a finite state machine, and in that formulation, it's a lot easier for me to picture. I can't tell from your posts if you hit on that yet or not. It makes a perfectly good FSM, except for the dangling transitions at the end. However, you could easily fill in those states with transitions that go backward and simultaneously write one or more bits to the output, completing the machine. It's entirely deterministic.

The complete machine has n*2^(from+width) transitions (using the same variable names as Wolfram, e.g. n is the number of symbols in the alphabet). That table would get pretty large, so it might not be a good idea to literally program it as a DFA. Notwithstanding that, I think it's the right abstraction to use to describe the algorithm: the algorithm computes an FSM that matches the distribution, which is then used for encoding and decoding.

7. Originally Posted by biject.bwts With the second table, if I was decoding a long random stream, as soon as you get to state 5, 6 or 7 you
never get to state 4 again, so no more ones are possible in the output stream.
Correct, this transition table is bugged.
Here is a corrected one :

Using distribution 1 0 0 0
States are numbered 4-7
4 : output 1; read 2 bits => next 4 to 7
5 : output 0; read 1 bit => next 6 or 7
6 : output 01; read 2 bits => next 4 to 7
7 : output 00; read 1 bit => next 6 or 7

=> It's basically the same as the first table, with 4-5 and 6-7 swapped.

Originally Posted by biject.bwts Looking at the first table, if decoding a long random stream I would get a ratio of 7 zeros to 3 ones.
Interesting. Does it point to a table better adapted to 70/30, rather than the expected 75/25?

It also raises the question of whether a transition table of size 4 is large enough. Maybe it needs twice that size to reach the desired 75/25.

Originally Posted by nburns Notwithstanding that, I think it's the right abstraction to use to describe the algorithm: the algorithm computes an FSM that matches the distribution, which is then used for encoding and decoding.
That's my understanding.
Nevertheless, the algorithm to build the FSM should be "straightforward" enough to be implemented with a minimum amount of energy.
As a point of reference, it reminds me of an earlier attempt at creating a Huffman entropy compressor: the time required to build the tree was greater than the encoding/decoding time itself. It took quite some time and transformations to make the table creation stage efficient.

Originally Posted by nburns The complete machine has n*2^(from+width) transitions (using the same variable names as Wolfram, e.g. n is the number of symbols in the alphabet). That table would get pretty large, so it might not be a good idea to literally program it as a DFA.
Indeed. I'm especially worried about the transition table for the compressor. This one looks really huge.
The decoding side, on the other hand, seems more manageable.
There is also a trade-off to find between pre-calculated and dynamically calculated:
by calculating some of the required information on the fly (the number of bits to output, for example), the table becomes smaller, which in turn can improve cache locality.
It's not easy to find the right compromise without coding.

I guess I need to solve that point...

8. ABC uses rounding to encode bits. Given an integer state x and a symbol with probability p, you get new larger state x/p. Since x/p is not necessarily an integer, a rounding convention is used. A bit is encoded by following different conventions for 0 and 1 such that the two sets of possible new states do not overlap. The decoder figures out which way the rounding was done, then undoes the rounding and goes back to state x*p.

The rule is to encode 1 with probability p as floor(x/p). The decoder, given p can then detect a 1 because p*floor(x/p) will be in the range (x - p, x], which can be detected from the fractional part of the result. Any other value encodes a 0. Note that the number of possible encodings of 0 and 1 are proportional to their probabilities, just as in range coding.

For example, suppose p = p(1) = 0.4 and x > 0 is the state. Then to encode a 1, the possible next states x/0.4 are multiples of 2.5, which after rounding down are 2, 5, 7, 10, 12, 15, 17... Any other value encodes a 0. Notice that the states encoding 1 make up p = 40% of the set.

To decode, we multiply by p. If the fractional part is either 0 or greater than 1 - p = 0.6, then we decode a 1, else 0. In this example, x*p = 0.8, 2.0, 2.8, 4.0, 4.8, 6.0, 6.8...

To encode a 0 with probability 1-p = 0.6, the next state is ceil((x+1)/(1-p)) - 1, which is the same as floor((x+1)/(1-p)) except that you subtract 1 if the result is an integer. Thus, before rounding you have 10/3, 15/3, 20/3, 25/3, 30/3, 35/3, 40/3..., which rounds to 3, 4, 6, 8, 9, 11, 13... Decoding, x*p = 1.2, 1.6, 2.4, 3.2, 3.6, 4.4, 5.2, with fractional parts always in (0, 0.6]. (In my book, the comment incorrectly suggests [0, 0.6) but the code is right).
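
These rounding rules can be checked end-to-end with exact rational arithmetic. Here's a sketch of my own (p = 0.4 as in the example above, no renormalization, so the state simply grows):

```python
# Sketch of the ABC rounding rules described above, using exact fractions.
# Encoding grows the state; decoding classifies by the fractional part of
# x*p and inverts the rounding. No renormalization here, so keep it short.
from fractions import Fraction
from math import floor, ceil

p = Fraction(2, 5)        # P(bit = 1) = 0.4
q = 1 - p                 # P(bit = 0) = 0.6

def encode_bit(x, bit):
    if bit:
        return floor(x / p)             # floor convention for 1
    return ceil((x + 1) / q) - 1        # ceil convention for 0

def decode_bit(y):
    r = y * p - floor(y * p)            # fractional part of y*p
    if r == 0 or r > q:                 # in {0} or (1-p, 1): it was a 1
        return 1, ceil(y * p)
    return 0, floor((y + 1) * q) - 1    # in (0, 1-p]: it was a 0

x = 100
msg = [1, 0, 0, 1, 0]
for b in msg:
    x = encode_bit(x, b)

decoded = []
for _ in msg:                           # decoding runs in reverse (LIFO)
    b, x = decode_bit(x)
    decoded.append(b)

print(decoded[::-1] == msg, x == 100)   # → True True
```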

Also, fpaqc and lpaq1a keep the state between 2^N and 2^(N+8) - 1 and output a byte at a time.

9. Originally Posted by Cyan That's my understanding.
Nevertheless, the algorithm to build the FSM should be "straightforward" enough to be implemented in a minimum amount of energy.
As a point of reference, it reminds me of an earlier attempt at creating an Huffman entropy compressor : the time required to build the tree was more important than the encoding/decoding time itself. It required quite some time and transformations to make it an efficient table creation stage.
When you say entropy compressor, does that imply a tighter bound on matching the true entropy than you would get from plain, static Huffman?

ANS sounds pretty straightforward in how you build the machine. The forward (or downward) transitions for a given symbol are spaced out at intervals of about 1/p. No two forward transitions can go to the same state, so how you interleave them would be where details enter in. But there would be only so many ways to do it.

I'm not sure what the official guidance is on how to allocate the space at the end, which is where you can no longer go forward and the bits come out. It doesn't seem like it would be hard to come up with a strategy that just works, setting aside the question of how to be optimal.

Edit: But since no forward jump can ever be bigger than the maximum 1/p, that seems to set an upper bound. It seems like the allocation might fall out directly from the number of states, which you choose as a power of 2. So it might not even be a choice.

Edit2: Let me clarify (details, details...). The transition start points are bunched up at the beginning of the machine (top of the table). So it's not the distance between transition start points and transition end points that's about 1/p, it's the distance between successive end points. I think this makes sense if you think about it, because the backward moves are always scaling down by powers of two, which comes from the number of bits to represent the symbol: the famous -lg(p). So for the rare symbols, you're scaling down by a big power of two and it takes you back to a state near the start state/top of the table. (I think I'd like to flip the diagram over, because scaling down to go to the top just doesn't make sense.)

I'm going to try to shut up now, because I'm getting tired and making mistakes. I feel like whatever details are still unexplained are not big, and I'll probably think about it some more tonight.

Originally Posted by Cyan Indeed. I'm especially worried about the transition table for the compressor. This one looks really huge.
The decoding side, on the other hand, seems more manageable.
There is also a trade-off to find between pre-calculated and dynamically calculated:
by calculating some of the required information on the fly (the number of bits to output, for example), the table becomes smaller, which in turn can improve cache locality.
It's not easy to find the right compromise without coding.

I guess I need to solve that point...
I think the problem is equivalent to the problem encountered when searching text with regexes: you naturally get an NFA, and the NFA->DFA transformation tends to cause a state explosion. So you can simulate the NFA directly, generate the full DFA, or anything in between. It's a problem that's been dealt with since the dawn of time, and there's plenty of literature on it that might give some insight into how to manage this problem. There are also the solutions Jarek, Matt, etc already used.  Reply With Quote

10. Originally Posted by Matt Mahoney ABC uses rounding to encode bits. Given an integer state x and a symbol with probability p, you get new larger state x/p. Since x/p is not necessarily an integer, a rounding convention is used. A bit is encoded by following different conventions for 0 and 1 such that the two sets of possible new states do not overlap. The decoder figures out which way the rounding was done, then undoes the rounding and goes back to state x*p.

The rule is to encode 1 with probability p as floor(x/p). The decoder, given p can then detect a 1 because p*floor(x/p) will be in the range (x - p, x], which can be detected from the fractional part of the result. Any other value encodes a 0. Note that the number of possible encodings of 0 and 1 are proportional to their probabilities, just as in range coding.

For example, suppose p = p(1) = 0.4 and x > 0 is the state. Then to encode a 1, the possible next states x/0.4 are multiples of 2.5, which after rounding down are 2, 5, 7, 10, 12, 15, 17... Any other value encodes a 0. Notice that the states encoding 1 make up p = 40% of the set.

To decode, we multiply by p. If the fractional part is either 0 or greater than 1 - p = 0.6, then we decode a 1, else 0. In this example, x*p = 0.8, 2.0, 2.8, 4.0, 4.8, 6.0, 6.8...

To encode a 0 with probability 1-p = 0.6, the next state is ceil((x+1)/(1-p)) - 1, which is the same as floor((x+1)/(1-p)) except that you subtract 1 if the result is an integer. Thus, before rounding you have 10/3, 15/3, 20/3, 25/3, 30/3, 35/3, 40/3..., which rounds to 3, 4, 6, 8, 9, 11, 13... Decoding, x*p = 1.2, 1.6, 2.4, 3.2, 3.6, 4.4, 5.2, with fractional parts always in (0, 0.6]. (In my book, the comment incorrectly suggests [0, 0.6) but the code is right).

Also, fpaqc and lpaq1a keep the state between 2^N and 2^(N+8) - 1 and output a byte at a time.
Nicely stated. I think the ability to write clearly must be one of the primary skills one gets as a PhD student that is hard to get anywhere else. You can study the literature without being in grad school, work on problems without being in grad school, but the art of writing papers and giving talks is hard to self-teach. But I digress...

It sounds like you've already solved how to dispense with the lookup table entirely and calculate on the fly. At least for the case where symbols are binary, which simplifies things a bit. Also, it sounds like the "width" factor on Wolfram is basically how many bits you want to save up at a time. So you pick one that's convenient for your implementation.

11. I use a table to avoid division. It multiplies by the reciprocal instead. In fpaqa I used tables entirely: input the state and probability and output the new state, but this turned out to be slower than multiplication. It also loses a little compression to keep the table size reasonable.

It's too bad nobody discovered ABC when arithmetic coding was still patented, or else it would probably be the dominant coder today.

12. I was curious to know whether the ANS FSM construction method would produce the same FSM table, with only a few cell swaps,
so I tried another layout.
It turns out that it generated a different transition table:

Using distribution 0 0 0 1
States are numbered 4-7
4 : output 0; read 1 bit => next 6 or 7
5 : output 00; read 1 bit => next 6 or 7
6 : output 000; read 1 bit => next 6 or 7
7 : output 1; read 2 bits => next 4 to 7

Under the rough assumption of random input provided to the decoder, it can be calculated that it generates, on average, 12 zeroes for every 3 ones.
That's an 80/20 distribution. So it's once again different from the intended 75/25.

So apparently, both FSMs are "close to 75/25", but not exactly, and each seems tuned for a different distribution (70/30 vs 80/20).
In a bid to better study these differences, I built an emulator, using the above FSMs to compress a random 0/1 input.
I ran several tests with different distributions, verifying whether the 0100 and 0001 FSMs are equivalent.

They are not. Here are some detailed results :

Distribution 70/30 : Theoretical : 88.13% / 0100 : 88.82% / 0001 : 91.96%
Distribution 75/25 : Theoretical : 81.13% / 0100 : 82.14% / 0001 : 82.43%
Distribution 80/20 : Theoretical : 72.18% / 0100 : 75.55% / 0001 : 72.78%

Conclusion :
The 0100 and 0001 FSMs are not equivalent. As suspected, 0100 is better tuned for 70/30, and 0001 is better tuned for 80/20.

The ANS FSM construction method leads to various FSMs which are "roughly okay" for the target distribution, but not optimal, and not equivalent.

It's also interesting to notice that, even when using an FSM on its optimal distribution, the "theoretical" ratio is approached but not quite reached, with a noticeable difference of more than 0.5%.

Maybe all these effects are related to the size of the FSM: 4 states might be too small after all (even for a simple 75/25 distribution).
Maybe using a larger FSM would produce a better result.

13. I have a quick thought:
If we restrict ourselves to logistic mixing as a final modelling step and LUTs for coding, then we can skip converting stretched (or squashed? I forgot) frequencies to normal frequencies before doing encoding and instead do that once when building LUTs. That could offset the performance and efficiency losses. What do you think: would that work? Maybe someone already did that?

14. Using more states and more bits to represent probabilities will give you better compression (closer to Shannon limit) just like with arithmetic coding.

If you use a full lookup table implementation, then it makes sense to use greater precision for probabilities near 0 or 1 by stretching it (log(p/(1-p))) before quantization. I did something like this in fpaqa.

15. Originally Posted by Cyan I was curious to know whether the ANS FSM construction method would produce the same FSM table, with only a few cell swaps,
so I tried another layout.
It turns out that it generated a different transition table:

Using distribution 0 0 0 1
States are numbered 4-7
4 : output 0; read 1 bit => next 6 or 7
5 : output 00; read 1 bit => next 6 or 7
6 : output 000; read 1 bit => next 6 or 7
7 : output 1; read 2 bits => next 4 to 7
I'm going to have to figure out how you're describing the machine. Why have you omitted states 0-3, because they're redundant? You've either reduced it down to the minimal FSM already, or else I don't know what's going on.

When you say the distribution is 0 0 0 1, do you mean that it's 75% zeroes and 25% ones?

Under the rough assumption of random input provided to the decoder, it can be calculated that it generates, on average, 12 zeroes for every 3 ones.
That's an 80/20 distribution. So it's once again different from the intended 75/25.
As I said, I can't follow your machine, so I can't help debug it at this point. But, intuitively, it seems to me that the more states there are, the closer it will match a distribution. Consider that it has to store the accumulated error in the state, so the more states, the more error information it can hold.

16. For an encoder in state x, the next state is approximately x/p or x/(1-p), depending on the encoded bit and the rounding convention. The encoding is optimal only if the next state is exact. You want x and p to be at least about 12 or 16 bits to make the effects of rounding on p negligible. So the states would start with e.g. 2^12 and grow from there.

17. Originally Posted by Matt Mahoney For an encoder in state x, the next state is approximately x/p or x/(1-p), depending on the encoded bit and the rounding convention. The encoding is optimal only if the next state is exact. You want x and p to be at least about 12 or 16 bits to make the effects of rounding on p negligible. So the states would start with e.g. 2^12 and grow from there.
Could you explain why you don't need the states from 0-2^12?

18. If you can't compute the transitions on the fly, another way to deal with them, besides an explicit table, is using rank and select on a vector of symbols. For a bit vector:

rank(0,n) -> the number of 0s in the range [0,n]
rank(1,n) -> the number of 1s in the range [0,n]
select(0,m) -> the index of the m'th 0
select(1,m) -> the index of the m'th 1

For 75% 0 and 25% 1, the vector would look like this, where each element is a state labeled by a bit:

0001000100010001...

To follow the transitions for a sequence of bits:

s = 0;
while (state_exists(s)) {
    b = get_bit();
    s = select(b, s);
}
// You need to add 1 to s sometimes to avoid an infinite loop. That fiddly piece is left out.

rank and select can be done efficiently with the right data structure. select(b,m) gives the same value as the table cell (m,b).
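
To make the idea concrete, here is a deliberately naive version (illustration only; a real implementation would use a succinct bit-vector structure for O(1) queries, and the state_exists/+1 fiddliness is still left out):

```python
# Naive rank/select over the 75/25 state-label vector (illustration only).
v = "0001" * 8        # each position is a state labeled by its symbol

def rank(bit, n):
    # number of occurrences of `bit` in v[0..n]
    return v[: n + 1].count(bit)

def select(bit, m):
    # index of the m'th occurrence of `bit` (0-based here)
    count = -1
    for i, c in enumerate(v):
        if c == bit:
            count += 1
            if count == m:
                return i
    raise IndexError("not enough occurrences")

print(rank("0", 7), select("1", 2))   # → 6 11
```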

19. Originally Posted by nburns Why have you omitted states 0-3, because they're redundant?
I'm using Wolfram ANS demo for guidance on how to build a state machine.

The main rules that can be derived from it:
1) A state machine with N=2^n states starts at state N
2) The first state delivering a 0 refers to state nb(0), then grows by one at each subsequent 0
3) The first state delivering a 1 refers to state nb(1), then grows by one at each subsequent 1

I don't know what's supposed to be in states 0-3, and I'm not sure it matters.
Reaching 0-3 is merely a signal that we need to output/input more bits to stay within range.

Originally Posted by nburns When you say the distribution is 0 0 0 1, do you mean that it's 75% zeroes and 25% ones?
It's more precise than that: it means the distribution of values is exactly 0, then 0, then 0, then 1.
In other words :
State 4 : output a 0, then go to next state
State 5 : output a 0, then go to next state
State 6 : output a 0, then go to next state
State 7 : output a 1, then go to next state

Originally Posted by Matt Mahoney Using more states and more bits to represent probabilities will give you better compression (closer to Shannon limit)
Yes, I agree, it's expected.

However, what I'm stressing here is that not all state machines are equal, even though they contain the same proportions of 0 and 1.

Coming back to the little 4-state machine studied so far, I've made a program which provides a precise estimate of the optimal probability for a given state machine.
Here are the results :

layout 1000 => 0:71.43% 1:28.57%
layout 0100 => 0:71.43% 1:28.57%
layout 0010 => 0:78.95% 1:21.05%
layout 0001 => 0:78.95% 1:21.05%

So it wasn't exactly the 70/30 and 80/20 roughly evaluated, but close enough.

This is matched by experience: the amount of waste of each state machine is minimized when using it on a distribution which matches its optimal values:

P(1)=30.00% : waste(0100) = 0.79% -- waste(0010) = 4.35%
P(1)=28.57% : waste(0100) = 0.69% -- waste(0010) = 3.40%
P(1)=25.00% : waste(0100) = 1.25% -- waste(0010) = 1.61%
P(1)=21.05% : waste(0100) = 3.63% -- waste(0010) = 0.78%
P(1)=20.00% : waste(0100) = 4.66% -- waste(0010) = 0.82%

What matters here is that the construction method of the FSM doesn't provide a guaranteed way to reach 75/25 (in contrast with arithmetic and Huffman, which are very straightforward). Adding more states to the FSM is likely to improve its accuracy, but the divergence between the different possible FSMs is likely to remain, and I guess it's going to be difficult to select the right (optimal) FSM for a given probability without checking them all to find the least wasteful one.
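
Here is my guess at what such an analysis program can look like (a sketch under my own assumptions about the construction, following the Wolfram-demo rules given above): build the one-step decode table from a layout, treat it as a Markov chain driven by uniformly random input bits, and take the stationary symbol frequencies. It reproduces the 71.43% and 78.95% figures quoted above.

```python
# Sketch: estimate the "optimal" P(0) of a layout's decode FSM via the
# stationary distribution of its states under uniformly random input bits
# (power iteration). Construction assumption: symbol s's k'th occurrence
# maps to raw state freq[s]+k, then bits are read until back in [L, 2L).
def optimal_p0(layout):
    L = len(layout)
    freq = {s: layout.count(s) for s in "01"}
    table, rank = {}, {"0": 0, "1": 0}
    for i, s in enumerate(layout):
        x = freq[s] + rank[s]
        rank[s] += 1
        nbits = 0
        while (x << nbits) < L:
            nbits += 1
        table[L + i] = (s, nbits, x << nbits)   # next = base + read bits
    pi = {st: 1.0 / L for st in table}
    for _ in range(500):                        # power iteration
        nxt = dict.fromkeys(table, 0.0)
        for st, (s, nb, base) in table.items():
            for b in range(1 << nb):            # each bit pattern is equally likely
                nxt[base + b] += pi[st] / (1 << nb)
        pi = nxt
    # each state emits exactly one symbol per step
    return sum(pi[st] for st, (s, _, _) in table.items() if s == "0")

print(round(optimal_p0("0100"), 4))      # → 0.7143 (= 5/7)
print(round(optimal_p0("0010"), 4))      # → 0.7895 (= 15/19)
print(round(optimal_p0("00100100"), 4))  # near 0.75
```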

20. Getting one step further, I've tested all possible 8-state combinations for a 75/25 distribution.

As can be expected, they lead to different results.
The list is long, so I will provide here what I believe are the most important results :

layout 11000000 => 0:71.43% 1:28.57%
(...)
layout 10001000 => 0:71.43% 1:28.57%
layout 01010000 => 0:71.43% 1:28.57%
layout 01000100 => 0:73.58% 1:26.42%
layout 00100100 => 0:75.55% 1:24.45%
layout 00100010 => 0:75.90% 1:24.10%
layout 00010001 => 0:77.78% 1:22.22%
layout 00001100 => 0:78.95% 1:21.05%
(...)
layout 00000011 => 0:78.95% 1:21.05%

Minor detail: many layouts result in exactly the same optimal distribution, notably at the edges:
9/28 FSMs get exactly 0:71.43% 1:28.57%
6/28 FSMs get exactly 0:78.95% 1:21.05%

The optimal distribution disparity is the same as for the 4-state FSMs: it ranges from 21.05% to 28.57%.
But, in contrast with the 4-state FSMs, we now have several intermediate values, which allow selecting a better one.

The best result is 75.55/24.45. Once again, it's not exactly 75/25, but fair enough.
Putting it to the test shows that it is indeed better at compressing a 75/25 input than any 4-state FSM so far:

P(1)=25.00% : waste(0100) = 1.25% -- waste(0010) = 1.61% -- waste(00100100) = 0.31%

And once again, this particular layout is better at compressing a bitstream with distribution closer to its optimal values :

P(1)=24.45% : waste(00100100) = 0.30%

OK, fair enough, 25% is almost as good, so let's say we have our optimal layout (for an 8-state FSM).
It's important to note that this best result is achieved using the layout 00100100.
Why is that important? Because the 1s are not evenly spaced at a distance of 4!

I've also put the "evenly spaced" layouts in the list, and none of them reach the winning position.
2 of them are even close to pathological corner cases!

layout 10001000 => 0:71.43% 1:28.57%
layout 01000100 => 0:73.58% 1:26.42%
layout 00100010 => 0:75.90% 1:24.10% <== best of the evenly spaced
layout 00010001 => 0:77.78% 1:22.22%

What it means is that finding the "best" FSM is going to be a difficult exercise. There's no "obvious" pattern so far.

21. Originally Posted by nburns Could you explain why you don't need the states from 0-2^12?
Because there would be a large roundoff error. Look at how the coding works mathematically.

22. Continuing this journey into FSM entropy compression,
I was curious to see whether the layouts producing the same distribution of 0 & 1 were indeed fully equivalent, or just "close variations".
So I created the reverse (encoding) operation, and tried to compress the same input with all of them.

It turns out they are *perfectly* equivalent.
Here are some detailed results :

Generating 100 Mb with P(1)=28.57%
Encoding with layout 0100 : result : 86.91% (wasted 0.69%)
Encoding with layout 11000000 : result : 86.91% (wasted 0.69%)
Encoding with layout 10100000 : result : 86.91% (wasted 0.69%)
Encoding with layout 10010000 : result : 86.91% (wasted 0.69%)
Encoding with layout 10001000 : result : 86.91% (wasted 0.69%)
Encoding with layout 10000100 : result : 86.91% (wasted 0.69%)
Encoding with layout 10000010 : result : 86.91% (wasted 0.69%)
Encoding with layout 10000001 : result : 86.91% (wasted 0.69%)
Encoding with layout 01100000 : result : 86.91% (wasted 0.69%)
Encoding with layout 01010000 : result : 86.91% (wasted 0.69%)

As you can see, they all produce the exact same result.
Perhaps even more interestingly, the "waste" level is exactly the same for the 8-state FSMs and the single 4-state FSM sharing the same characteristics.
It seems to point out that it's not always worthwhile to increase the number of states, at least not with these "edge" layouts.

The same holds true for the other end of the spectrum :
Generating 100 Mb with P(1)=21.05%
Encoding with layout 0001 : result : 74.80% (wasted 0.78%)
Encoding with layout 00001100 : result : 74.80% (wasted 0.78%)
Encoding with layout 00001010 : result : 74.80% (wasted 0.78%)
Encoding with layout 00001001 : result : 74.80% (wasted 0.78%)
Encoding with layout 00000110 : result : 74.80% (wasted 0.78%)
Encoding with layout 00000101 : result : 74.80% (wasted 0.78%)
Encoding with layout 00000011 : result : 74.80% (wasted 0.78%)

Same effect, same consequences.

I also had a deeper look into the "optimal" 8-state FSM for the target 75/25.
It turns out the best layout is not alone.
3 other layouts display almost the same characteristics.

layout 00100100 => 0:75.55% 1:24.45%
layout 00101000 => 0:75.56% 1:24.44%
layout 00011000 => 0:75.56% 1:24.44%
layout 00010100 => 0:75.56% 1:24.44%

On closer inspection, they are perfectly equivalent, producing the same compression performance :

Generating 100 Mb with P(1)=25.00%
Encoding with layout 00100100 : result : 81.38% (wasted 0.31%)
Encoding with layout 00101000 : result : 81.38% (wasted 0.31%)
Encoding with layout 00011000 : result : 81.38% (wasted 0.31%)
Encoding with layout 00010100 : result : 81.38% (wasted 0.31%)

For these ones, at least, I have an explanation.
When looking at how the FSM works, it's obvious in retrospect that positions within the same 2-seat pair are perfectly equivalent.
It means positions 0&1, 2&3, 4&5, etc. can be swapped without changing the properties of the FSM.
This is exactly what happens for the above 4 FSMs : they each place one 1 into each of the two middle 2-seat pairs.
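These pair-swaps are easy to enumerate mechanically. The helper below is my own illustration (not code from the thread) : it moves each 1 within its 2-seat pair and collects every resulting layout.

```python
from itertools import product

def pair_variants(layout):
    # move every '1' freely within its 2-seat pair (seats 2k and 2k+1)
    ones = [i for i, c in enumerate(layout) if c == '1']
    variants = set()
    for choice in product(*[(2 * (i // 2), 2 * (i // 2) + 1) for i in ones]):
        if len(set(choice)) == len(ones):   # skip two 1s landing on the same seat
            seats = ['0'] * len(layout)
            for i in choice:
                seats[i] = '1'
            variants.add(''.join(seats))
    return variants
```

Applied to 00011000, it yields exactly the 4 equivalent layouts listed above.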

Another interesting point to notice is that the waste level for these latest FSMs is much lower (0.31%) than the waste level at the "edges" of the disparity range (0.69% - 0.78%). This was also expected : the layouts at the edges are not regular, and were expected to produce side-effects detrimental to compression. Now it's also clearly measured.

23. Originally Posted by Matt Mahoney
> Because there would be a large roundoff error. Look at how the coding works mathematically.

I guess the assumption is that only states that emit bits should count.

If you're viewing it as an FSM, it doesn't make sense to omit states. I haven't worked the implementation out in detail, but if you view the forward transitions as multiplications, then you might as well not draw them. Andrew Polar's blog post draws them in, and that seems pretty easy to follow.

24. Matt: are you using floating point numbers to do the state changes?

If so, it might be worth figuring out how to do it with integer arithmetic.

25. It is using integer arithmetic. Look at the algorithm description in my book or fpaqc comments. Emitting bits is separate from coding. As an FSM you are free to relabel the states, but from a mathematical perspective it makes sense to think of states within a range like 2^12..2^32-1 with 12 bit probabilities, or something. Then whenever you have more than 20 bits, you emit the low 8 bits of the state.
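A minimal sketch of that windowing idea (my own illustration of the principle, not fpaqc's actual code) : keep the state below 2^20 by pushing out its low 8 bits.

```python
def renormalize(state, out):
    # while the state occupies more than 20 bits, emit its low 8 bits and shrink it
    while state >= (1 << 20):
        out.append(state & 0xFF)
        state >>= 8
    return state
```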

26. Originally Posted by Matt Mahoney
> It is using integer arithmetic. Look at the algorithm description in my book or fpaqc comments. Emitting bits is separate from coding. As an FSM you are free to relabel the states, but from a mathematical perspective it makes sense to think of states within a range like 2^12..2^32-1 with 12 bit probabilities, or something. Then whenever you have more than 20 bits, you emit the low 8 bits of the state.

I looked up the section in DCE and tracked down the code in fpaqc that implements ABC. I haven't really gone over fpaqc yet, but I think what it says in DCE is making sense now. I can see how you could implement it with only integers. You could use the remainder the same way you'd use the fractional part -- or perhaps there's an even better way I haven't thought of*. The code in fpaqc looks very easy to read. I appreciate your efforts to create decent documentation, especially for esoteric stuff like this.

The way it distinguishes 0s and 1s when encoding bits -- does that scale naturally to accommodate larger alphabets? I gave it some thought, and it seems like the most natural way to handle a larger alphabet would be essentially to break it into binary and use the algorithm the way it is.
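The "break it into binary" route could be as simple as coding each symbol MSB-first, letting the already-coded bit prefix select the probability model for the next bit. A sketch (my own, nothing from fpaqc) :

```python
def to_binary_decisions(symbol, nbits=8):
    # split a symbol into nbits binary coding decisions, MSB first;
    # the prefix of earlier bits would act as context for each bit's model
    return [(symbol >> (nbits - 1 - i)) & 1 for i in range(nbits)]
```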

Edit: * I happened to come across a source on this and here's what it says, just in case it's any different from what's in fpaqc:

Divide Rounding

Joe Ibershoff, one of my students, pointed out that integer divide normally yields the floor, but both ceiling and round-to-nearest are easy and useful. I thought these were fairly well-known tricks closely related to the Alignment of Pointers magic, but perhaps they aren't so obvious...? He points out that Ceiling(a/b) is (a+b-1)/b and RoundToNearest(a/b) is (a+(b/2))/b. Of course, these tricks also work if divide is implemented in less obvious ways, such as shifts or shift-and-subtract sequences.
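In code, the two tricks are a direct transcription (valid for positive integers) :

```python
def ceil_div(a, b):
    # Ceiling(a/b) = (a + b - 1) / b
    return (a + b - 1) // b

def round_div(a, b):
    # RoundToNearest(a/b) = (a + b/2) / b  (exact halves round up)
    return (a + b // 2) // b
```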

Edit: Actually, that does sound obvious. The non-obvious part would be that you're dividing by a fraction. I imagine that you could store the fraction as a numerator and denominator, first multiply by the denominator, and then divide by the numerator as above. You'd run the risk of overflow during the multiplication, though.
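The multiply-then-divide idea could look like this hypothetical helper (Python integers don't overflow, but in C the intermediate product a*den could) :

```python
def round_div_by_fraction(a, num, den):
    # round a / (num/den) to nearest : multiply by the denominator first,
    # then round-divide by the numerator (a * den may overflow fixed-width ints)
    t = a * den
    return (t + num // 2) // num
```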

27. ## Tunstall coding

Originally Posted by Cyan
> The intention is to experiment with an entropy decoder suitable for low-performance CPU.

Did you investigate Tunstall coding? It's a variable to fixed length coding (means that decoding will go the fast fixed to variable way).

The alphabet grows by successively adding new sequences, each made of the most frequent sequence (which is removed from the alphabet right after) followed by each of the original letters.

For instance, take a two-letter alphabet (A and B) with 75% and 25% probabilities of occurrence.

Code:
```A  0.75 <-
B  0.25

B   0.2500
AA  0.5625 <-
AB  0.1875

B     0.2500
AB    0.1875
AAA   0.4219 <-
AAB   0.1406

B         0.2500
AB        0.1875
AAB       0.1406
AAAA      0.3164 <-
AAAB      0.1054

B         0.2500 <-
AB        0.1875
AAB       0.1406
AAAB      0.1054
AAAAA     0.2373
AAAAB     0.0791

AB        0.1875
AAB       0.1406
AAAB      0.1054
AAAAA     0.2373 <-
AAAAB     0.0791
BA        0.1875
BB        0.0625

AB        0.1875 <-
AAB       0.1406
AAAB      0.1054
AAAAB     0.0791
BA        0.1875
BB        0.0625
AAAAAA    0.1780
AAAAAB    0.0593

AAB       0.1406
AAAB      0.1054
AAAAB     0.0791
BA        0.1875 <-
BB        0.0625
AAAAAA    0.1780
AAAAAB    0.0593
ABA       0.1406
ABB       0.0469

AAB       0.1406
AAAB      0.1054
AAAAB     0.0791
BB        0.0625
AAAAAA    0.1780 <-
AAAAAB    0.0593
ABA       0.1406
ABB       0.0469
BAA       0.1406
BAB       0.0469

AAB       0.1406 <-
AAAB      0.1054
AAAAB     0.0791
BB        0.0625
AAAAAB    0.0593
ABA       0.1406
ABB       0.0469
BAA       0.1406
BAB       0.0469
AAAAAAA   0.1335
AAAAAAB   0.0445

AAAB      0.1054
AAAAB     0.0791
BB        0.0625
AAAAAB    0.0593
ABA       0.1406 <-
ABB       0.0469
BAA       0.1406
BAB       0.0469
AAAAAAA   0.1335
AAAAAAB   0.0445
AABA      0.1054
AABB      0.0351

AAAB      0.1054
AAAAB     0.0791
BB        0.0625
AAAAAB    0.0593
ABB       0.0469
BAA       0.1406 <-
BAB       0.0469
AAAAAAA   0.1335
AAAAAAB   0.0445
AABA      0.1054
AABB      0.0351
ABAA      0.1054
ABAB      0.0351

AAAB      0.1054
AAAAB     0.0791
BB        0.0625
AAAAAB    0.0593
ABB       0.0469
BAB       0.0469
AAAAAAA   0.1335 <-
AAAAAAB   0.0445
AABA      0.1054
AABB      0.0351
ABAA      0.1054
ABAB      0.0351
BAAA      0.1054
BAAB      0.0351

AAAB      0.1054 <-
AAAAB     0.0791
BB        0.0625
AAAAAB    0.0593
ABB       0.0469
BAB       0.0469
AAAAAAB   0.0445
AABA      0.1054
AABB      0.0351
ABAA      0.1054
ABAB      0.0351
BAAA      0.1054
BAAB      0.0351
AAAAAAAA  0.1001
AAAAAAAB  0.0334

AAAAB     0.0791
BB        0.0625
AAAAAB    0.0593
ABB       0.0469
BAB       0.0469
AAAAAAB   0.0445
AABA      0.1054 <-
AABB      0.0351
ABAA      0.1054
ABAB      0.0351
BAAA      0.1054
BAAB      0.0351
AAAAAAAA  0.1001
AAAAAAAB  0.0334
AAABA     0.0791
AAABB     0.0263

and so on```
At this point each sequence can be written as a 4-bit value.
Code:
```0000  -> AAAAB
0001  -> BB
0010  -> AAAAAB
0011  -> ABB
0100  -> BAB
0101  -> AAAAAAB
0110  -> AABA
0111  -> AABB
1000  -> ABAA
1001  -> ABAB
1010  -> BAAA
1011  -> BAAB
1100  -> AAAAAAAA
1101  -> AAAAAAAB
1110  -> AAABA
1111  -> AAABB```
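The construction above is mechanical enough to script. This sketch (my own, with first-inserted-wins tie-breaking, which matches the trace) reproduces the 16-entry dictionary exactly :

```python
def tunstall(probs, target_size):
    # dictionary entries kept in insertion order as (sequence, probability)
    entries = list(probs.items())
    while len(entries) < target_size:
        # expand the most probable sequence (the earliest one on ties)
        i = max(range(len(entries)), key=lambda k: entries[k][1])
        seq, p = entries.pop(i)
        for sym, ps in probs.items():
            entries.append((seq + sym, p * ps))
    return [seq for seq, _ in entries]

table = tunstall({'A': 0.75, 'B': 0.25}, 16)
```

The 16 sequences come out in the same order as the 4-bit table above.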

28. ## Thanks (2):

Matt Mahoney (13th November 2013), Piotr Tarsa (13th November 2013)

29. I have not so far, but it's an interesting suggestion.

30. Tunstall coding seems to be compatible with compressed BWT (edit: hmm, only at first sight). At least it's some food for thought.

Actually, with Tunstall coding, you can use higher-order probabilities, not just order-0. More specifically, instead of computing the probability of a sequence S (a dictionary entry) as p_0(s_0) * p_0(s_1) * p_0(s_2) * ... * p_0(s_n-1), you can compute it as p_0(s_0) * p_1(s_1) * p_2(s_2) * ... * p_n-1(s_n-1), where p_n(X) is the probability of a symbol in order n, with the context implied (I didn't want to obscure the expression).
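As a sketch of what this means (my own illustration; the context-passing convention and the order-1 table are hypothetical) : every symbol past the first is weighted by a model conditioned on the preceding symbols.

```python
def seq_prob(seq, p0, p_ctx):
    # probability of a dictionary entry : order-0 model for the first symbol,
    # then a context-conditioned model p_ctx(prefix, symbol) for the rest
    p = p0[seq[0]]
    for i in range(1, len(seq)):
        p *= p_ctx(seq[:i], seq[i])
    return p

# example with a hypothetical order-1 model keyed on the previous symbol
order1 = {('A', 'A'): 0.8, ('A', 'B'): 0.2, ('B', 'A'): 0.6, ('B', 'B'): 0.4}
p = seq_prob('AAB', {'A': 0.75, 'B': 0.25}, lambda ctx, s: order1[(ctx[-1], s)])
```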

31. Hi all,
If there is something difficult to understand about ANS, I would gladly explain. Here is a picture showing the difference between arithmetic coding and ANS. I have just written a new paper about ANS (to finally send to a journal) - completely rewritten, with lots of examples, pictures, simulation results, and improved precise initialization for ANS. It is also written from the perspective of probably the most important advantage of ANS : constructing low-state entropy coding automata extremely close to Shannon capacity (perfect for hardware implementation). Fewer than 20 states are usually enough to get below 0.001 bits/symbol loss (practically unreachable for Huffman; arithmetic coding would need many more states):

http://arxiv.org/abs/1311.2540
Here is a poster gathering basic information.