
Thread: Asymmetric Numeral System

  1. #1
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    483
    Thanked 279 Times in 119 Posts

    Asymmetric Numeral System

    I've been interested lately in Jarek Duda's ANS, out of curiosity.

    The intention is to experiment with an entropy decoder suitable for low-performance CPU.
    By low-performance, I mean a CPU without built-in multiplication capability (multiplication can be emulated, but it's much slower).
    I understand that binary arith decoders can work properly with just multiplications and comparisons, making them quite fast and efficient.
    (General arith decoders, on the other hand, require divisions.)

    My understanding is that ANS trades these requirements for the following ones :
    1) Exclusive use of look-up tables (with the risk of them becoming very large)
    2) Decoding in the reverse direction of encoding
    3) Designed for fixed probabilities (as a consequence of 1 & 2), but Matt found a way around this by saving both the symbol and its probability into an intermediate table and encoding them in reverse order. (I still have to understand how the lookup table can be modified on the fly, though...)

    So it's not an "all win".

    So far, I've read Jarek's paper (well, tried to...), Matt's fpaqa source code and DCE comments, and Andrew Polar's blog entry. I've also played with the Wolfram demo.

    And I still don't get it.
    I feel I can make it work in some simple cases, but I don't understand why it works, and therefore how to use it in more general circumstances.


    My current understanding is as follows :
    As the encoder receives new symbols to encode, the "state" value grows, according to a translation table.
    The higher the probability, the less the state value increases.
    The intuition is that a large number in the translation table results from
    either a few low-probability symbols, or many high-probability symbols, or of course a mix of them.

    Decoding is simply a reverse operation, traversing the translation table in the other direction.
    So any number in the table represents a deterministic sequence of symbols.

    Obviously, the translation table is not infinite, so at some point, there is a need to "resize" the state value, so that it remains within the table.
    That's where I get lost.


    fpaq recommends keeping the "state value" between 2^N and 2^(N+1).
    The Wolfram demo keeps it between 2^N and 2^(N+2), updating 2 bits at a time.
    Andrew Polar keeps it below 2^N, resizing by any number of bits, to remain within [2^(N-1)..2^N[
    This shows there are different possibilities.

    The decoding side is well described on Wolfram :
    The "state" value decreases, outputting one symbol at a time.
    When the "state" reaches a value within [2^N-2..2^N[, the bitstream is resized by 2 bits, ensuring the new "state" value is once again between [2^N..2^N+2[
    The translation table below 2^N is not shown. (does it exist ?)
    The corresponding encoding side, on the other hand, is mysterious.


    To better study it, I've been playing with a simple/trivial example, using a binary distribution as follows :
    0: 75% ; 1: 25%
    The good thing is that it allows playing with N=2, which is easier to manage by hand on paper.

    It's easy to get Wolfram to generate a working solution,
    but I wish I could understand how it is generated and how it works, in both directions (encoding & decoding).

    Has anyone tried this before ?
    Last edited by Cyan; 20th February 2014 at 15:22.

  2. #2
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    I looked at it a while ago and it seemed to make no sense. Your post made me take another look at it. I haven't quite fully wrapped my head around it, but it seems like almost the same thing as arithmetic coding approached from a different angle. It keeps multiplying by 1/p and shifting off the low bits. The symbols aren't in ranges, they're sprinkled all throughout in proportions equal to their probability. I'm thinking that, unlike arithmetic coding, what it's doing might not have any simple mathematical explanation other than coding something.
    Last edited by nburns; 30th October 2013 at 11:03.

  3. #3
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    483
    Thanked 279 Times in 119 Posts
    I've been playing a bit with a few 75/25 layouts lately.
    According to Jarek's theory, they are all equivalent.

    They nonetheless result in different state transition rules.
    Here are a few decoding state tables examples :

    1) Using distribution 0 1 0 0
    States are numbered 4-7
    4 : output 0; read 1 bit => next 6 or 7
    5 : output 1; read 2 bits => next 4 to 7
    6 : output 00; read 1 bit => next 6 or 7
    7 : output 01; read 2 bits => next 4 to 7

    2) Using distribution 1 0 0 0
    States are numbered 4-7
    4 : output 1; read 2 bits => next 4 to 7
    5 : output 0; read 1 bit => next 6 or 7
    6 : output 00; read 1 bit => next 6 or 7
    7 : output 000; read 1 bit => next 6 or 7

    Both are supposed to compress a 75/25 input equally well.
    Is that true ?
    It's difficult to judge it just by looking at the transition table.
    I guess coding an analysis program is required to get to the result.
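
    For reference, here is how I picture driving such a table in code (a toy sketch of my own, not taken from the Wolfram demo; the hard-coded entries are table 1 above, and next_bit() with its demo bitstream is just a placeholder bit source) :

    Code:
    #include <stdio.h>

    /* Decode table for layout 0 1 0 0 (states 4-7), copied from table 1 above.
       Each state emits 'out', then reads 'nbits' bits from the stream to pick
       the next state among base .. base + 2^nbits - 1. */
    typedef struct { const char *out; int nbits; int base; } Entry;

    static const Entry table[4] = {
        /* state 4 */ { "0",  1, 6 },
        /* state 5 */ { "1",  2, 4 },
        /* state 6 */ { "00", 1, 6 },
        /* state 7 */ { "01", 2, 4 },
    };

    /* Placeholder bit source: a fixed demo bitstream. */
    static const int demo_bits[8] = { 1, 0, 1, 1, 0, 0, 1, 0 };
    static int pos = 0;
    static int next_bit(void) { return demo_bits[pos++ % 8]; }

    int main(void) {
        int state = 4;                               /* initial state */
        for (int step = 0; step < 6; step++) {
            const Entry *e = &table[state - 4];
            printf("state %d -> emit %s\n", state, e->out);
            int v = 0;
            for (int i = 0; i < e->nbits; i++)       /* read nbits, MSB first */
                v = (v << 1) | next_bit();
            state = e->base + v;                     /* renormalized next state */
        }
        return 0;
    }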

  4. #4
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Cyan View Post
    I've been playing a bit with a few 75/25 layouts lately.
    According to Jarek's theory, they are all equivalent.
    ...

    Both are supposed to compress a 75/25 input equally well.
    Is that true ?
    It's difficult to judge it just by looking at the transition table.
    I guess coding an analysis program is required to get to the result.
    Jarek's paper is hard to follow, I think partly because of English difficulties and partly because of the math terms he uses, which I'm not fully up to speed on. That said, the number of bits that you output after a given symbol determines the compression. I looked at the tables in Andrew Polar's post and checked through doing a few examples in my head that, for instance, symbol A in the very first table averages between 1 and 2 bits, as expected.
    Last edited by nburns; 30th October 2013 at 18:45.

  5. #5
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by Cyan View Post
    I've been playing a bit with a few 75/25 layouts lately.
    According to Jarek's theory, they are all equivalent.

    They nonetheless result in different state transition rules.
    Here are a few decoding state tables examples :

    1) Using distribution 0 1 0 0
    States are numbered 4-7
    4 : output 0; read 1 bit => next 6 or 7
    5 : output 1; read 2 bits => next 4 to 7
    6 : output 00; read 1 bit => next 6 or 7
    7 : output 01; read 2 bits => next 4 to 7

    2) Using distribution 1 0 0 0
    States are numbered 4-7
    4 : output 1; read 2 bits => next 4 to 7
    5 : output 0; read 1 bit => next 6 or 7
    6 : output 00; read 1 bit => next 6 or 7
    7 : output 000; read 1 bit => next 6 or 7

    Both are supposed to compress a 75/25 input equally well.
    Is that true ?
    It's difficult to judge it just by looking at the transition table.
    I guess coding an analysis program is required to get to the result.
    Looking at the first table, if decoding a long random stream I would get a ratio of 7 zeros to 3 ones.
    With the second table, if I was decoding a long random stream, as soon as you get to state 5, 6 or 7 you
    never get to state 4 again, so no more ones are possible in the output stream.

    Of course I never really looked at the decompressor you're talking about, so maybe you're leaving out
    some information in your example of decompression. And I am assuming that the decompression of
    a random stream should approach the 75/25 you seek.

  6. #6
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    I noticed that the lookup table is the state transition table for a finite state machine, and in that formulation, it's a lot easier for me to picture. I can't tell from your posts if you hit on that yet or not. It makes a perfectly good FSM, except for the dangling transitions at the end. However, you could easily fill in those states with transitions that go backward and simultaneously write one or more bits to the output, completing the machine. It's entirely deterministic.

    The complete machine has n*2^(from+width) transitions (using the same variable names as Wolfram, e.g. n is the number of symbols in the alphabet). That table would get pretty large, so it might not be a good idea to literally program it as a DFA. Notwithstanding that, I think it's the right abstraction to use to describe the algorithm: the algorithm computes an FSM that matches the distribution, which is then used for encoding and decoding.

  7. #7
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    483
    Thanked 279 Times in 119 Posts
    Quote Originally Posted by biject.bwts View Post
    With the second table, if I was decoding a long random stream, as soon as you get to state 5, 6 or 7 you
    never get to state 4 again, so no more ones are possible in the output stream.
    Correct, this transition table is bugged.
    Here is a corrected one :

    Using distribution 1 0 0 0
    States are numbered 4-7
    4 : output 1; read 2 bits => next 4 to 7
    5 : output 0; read 1 bit => next 6 or 7
    6 : output 01; read 2 bits => next 4 to 7
    7 : output 00; read 1 bit => next 6 or 7

    => It's basically the same as the first table, with 4-5 and 6-7 swapped.

    Quote Originally Posted by biject.bwts View Post
    Looking at the first table, if decoding a long random stream I would get a ratio of 7 zeros to 3 ones.
    Interesting. Does it point to a table more adapted to 70/30, rather than the expected 75/25 ?

    It also raises the question of whether a transition table of size 4 is large enough. Maybe it needs twice that size to get to the desired 75/25.

    Notwithstanding that, I think it's the right abstraction to use to describe the algorithm: the algorithm computes an FSM that matches the distribution, which is then used for encoding and decoding.
    That's my understanding.
    Nevertheless, the algorithm to build the FSM should be "straightforward" enough to be implemented with a minimal amount of effort.
    As a point of reference, it reminds me of an earlier attempt at creating a Huffman entropy compressor : the time required to build the tree was greater than the encoding/decoding time itself. It took quite some time and several transformations to turn it into an efficient table creation stage.


    The complete machine has n*2^(from+width) transitions (using the same variable names as Wolfram, e.g. n is the number of symbols in the alphabet). That table would get pretty large, so it might not be a good idea to literally program it as a DFA.
    Indeed. I'm especially worried by the transition table for the compressor. This one looks really huge.
    The decoding side, on the other hand, seems more manageable.
    There is also a trade-off to find between pre-calculated and dynamically calculated :
    by calculating some of the required information on the fly (the number of bits to output, for example), the table becomes smaller, which in turn can improve cache locality.
    It's not easy to find out the right compromise without coding.

    I guess I need to solve that point...

  8. #8
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    ABC uses rounding to encode bits. Given an integer state x and a symbol with probability p, you get new larger state x/p. Since x/p is not necessarily an integer, a rounding convention is used. A bit is encoded by following different conventions for 0 and 1 such that the two sets of possible new states do not overlap. The decoder figures out which way the rounding was done, then undoes the rounding and goes back to state x*p.

    The rule is to encode 1 with probability p as floor(x/p). The decoder, given p can then detect a 1 because p*floor(x/p) will be in the range (x - p, x], which can be detected from the fractional part of the result. Any other value encodes a 0. Note that the number of possible encodings of 0 and 1 are proportional to their probabilities, just as in range coding.

    For example, suppose p = p(1) = 0.4 and x > 0 is the state. Then to encode a 1, the possible next states x/0.4 are multiples of 2.5, which after rounding down are 2, 5, 7, 10, 12, 15, 17... Any other value encodes a 0. Notice that the states encoding 1 make up p = 40% of the set.

    To decode, we multiply by p. If the fractional part is either 0 or greater than 1 - p = 0.6, then we decode a 1, else 0. In this example, x*p = 0.8, 2.0, 2.8, 4.0, 4.8, 6.0, 6.8...

    To encode a 0 with probability 1-p = 0.6, the next state is ceil((x+1)/(1-p)) - 1, which is the same as floor((x+1)/(1-p)) except that you subtract 1 if the result is an integer. Thus, before rounding you have 10/3, 15/3, 20/3, 25/3, 30/3, 35/3, 40/3..., which rounds to 3, 4, 6, 8, 9, 11, 13... Decoding, x*p = 1.2, 1.6, 2.4, 3.2, 3.6, 4.4, 5.2, with fractional parts always in (0, 0.6]. (In my book, the comment incorrectly suggests [0, 0.6) but the code is right).

    Also, fpaqc and lpaq1a keep the state between 2^N and 2^(N+8) - 1 and output a byte at a time.
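
    For what it's worth, here is a self-contained toy illustration of that rule with pure integer arithmetic (my own sketch, not the actual fpaqc code; p is represented as the fraction q/Q, here 2/5 = 0.4 as in the example above, and renormalization is left out so the state simply grows) :

    Code:
    #include <assert.h>
    #include <stdio.h>

    /* Toy asymmetric binary coder step, p(1) = q/Q (here 2/5 = 0.4).
       Encoding grows the state; decoding undoes one step and recovers the bit. */
    static const unsigned q = 2, Q = 5;

    static unsigned encode(unsigned x, int bit) {
        if (bit) return (x * Q) / q;                      /* floor(x/p)            */
        return ((x + 1) * Q + (Q - q) - 1) / (Q - q) - 1; /* ceil((x+1)/(1-p)) - 1 */
    }

    static unsigned decode(unsigned xq, int *bit) {
        unsigned r = (xq * q) % Q;           /* fractional part of x*p, scaled by Q */
        unsigned c = (xq * q + Q - 1) / Q;   /* ceil(x*p)                           */
        if (r == 0 || r > Q - q) { *bit = 1; return c; }  /* frac in {0} U (1-p, 1) */
        *bit = 0; return xq - c;                          /* frac in (0, 1-p]       */
    }

    int main(void) {
        /* Encode the bit sequence 1,0,0,1 starting from state 1, then decode it
           back in reverse order (LIFO), as ANS requires. */
        int bits[4] = { 1, 0, 0, 1 };
        unsigned x = 1;
        for (int i = 0; i < 4; i++) { x = encode(x, bits[i]); printf("x = %u\n", x); }
        for (int i = 3; i >= 0; i--) {
            int b; x = decode(x, &b);
            assert(b == bits[i]);
            printf("decoded %d, x back to %u\n", b, x);
        }
        return 0;
    }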

  9. #9
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Cyan View Post
    That's my understanding.
    Nevertheless, the algorithm to build the FSM should be "straightforward" enough to be implemented with a minimal amount of effort.
    As a point of reference, it reminds me of an earlier attempt at creating a Huffman entropy compressor : the time required to build the tree was greater than the encoding/decoding time itself. It took quite some time and several transformations to turn it into an efficient table creation stage.
    When you say entropy compressor, does that imply a tighter bound on matching the true entropy than you would get from plain, static Huffman?

    ANS sounds pretty straightforward in how you build the machine. The forward (or downward) transitions for a given symbol are spaced out at intervals of about 1/p. No two forward transitions can go to the same state, so how you interleave them would be where details enter in. But there would be only so many ways to do it.

    I'm not sure what the official guidance is on how to allocate the space at the end, which is where you can no longer go forward and the bits come out. It doesn't seem like it would be hard to come up with a strategy that just works, setting aside the question of how to be optimal.

    Edit: But since no forward jump can ever be bigger than the maximum 1/p, that seems to set an upper bound. It seems like the allocation might fall out directly from the number of states, which you choose as a power of 2. So it might not even be a choice.

    Edit2: Let me clarify (details, details...). The transition start points are bunched up at the beginning of the machine (top of the table). So it's not the distance between transition start points and transition end points that's about 1/p, it's the distance between successive end points. I think this makes sense if you think about it, because the backward moves are always scaling down by powers of two, which comes from the number of bits to represent the symbol: the famous -lg(p). So for the rare symbols, you're scaling down by a big power of two and it takes you back to a state near the start state/top of the table. (I think I'd like to flip the diagram over, because scaling down to go to the top just doesn't make sense.)

    I'm going to try to shut up now, because I'm getting tired and making mistakes. I feel like whatever details are still unexplained are not big, and I'll probably think about it some more tonight.

    Indeed. I'm especially worried by the transition table for the compressor. This one looks really huge.
    The decoding side, on the other hand, seems more manageable.
    There is also a trade-off to find between pre-calculated and dynamically calculated :
    by calculating some of the required information on the fly (the number of bits to output, for example), the table becomes smaller, which in turn can improve cache locality.
    It's not easy to find out the right compromise without coding.

    I guess I need to solve that point...
    I think the problem is equivalent to the problem encountered when searching text with regexes: you naturally get an NFA, and the NFA->DFA transformation tends to cause a state explosion. So you can simulate the NFA directly, generate the full DFA, or anything in between. It's a problem that's been dealt with since the dawn of time, and there's plenty of literature on it that might give some insight into how to manage this problem. There are also the solutions Jarek, Matt, etc already used.
    Last edited by nburns; 31st October 2013 at 05:13.

  10. #10
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Matt Mahoney View Post
    ABC uses rounding to encode bits. Given an integer state x and a symbol with probability p, you get new larger state x/p. Since x/p is not necessarily an integer, a rounding convention is used. A bit is encoded by following different conventions for 0 and 1 such that the two sets of possible new states do not overlap. The decoder figures out which way the rounding was done, then undoes the rounding and goes back to state x*p.

    The rule is to encode 1 with probability p as floor(x/p). The decoder, given p can then detect a 1 because p*floor(x/p) will be in the range (x - p, x], which can be detected from the fractional part of the result. Any other value encodes a 0. Note that the number of possible encodings of 0 and 1 are proportional to their probabilities, just as in range coding.

    For example, suppose p = p(1) = 0.4 and x > 0 is the state. Then to encode a 1, the possible next states x/0.4 are multiples of 2.5, which after rounding down are 2, 5, 7, 10, 12, 15, 17... Any other value encodes a 0. Notice that the states encoding 1 make up p = 40% of the set.

    To decode, we multiply by p. If the fractional part is either 0 or greater than 1 - p = 0.6, then we decode a 1, else 0. In this example, x*p = 0.8, 2.0, 2.8, 4.0, 4.8, 6.0, 6.8...

    To encode a 0 with probability 1-p = 0.6, the next state is ceil((x+1)/(1-p)) - 1, which is the same as floor((x+1)/(1-p)) except that you subtract 1 if the result is an integer. Thus, before rounding you have 10/3, 15/3, 20/3, 25/3, 30/3, 35/3, 40/3..., which rounds to 3, 4, 6, 8, 9, 11, 13... Decoding, x*p = 1.2, 1.6, 2.4, 3.2, 3.6, 4.4, 5.2, with fractional parts always in (0, 0.6]. (In my book, the comment incorrectly suggests [0, 0.6) but the code is right).

    Also, fpaqc and lpaq1a keep the state between 2^N and 2^(N+8) - 1 and output a byte at a time.
    Nicely stated. I think the ability to write clearly must be one of the primary skills one gets as a PhD student that is hard to get anywhere else. You can study the literature without being in grad school, work on problems without being in grad school, but the art of writing papers and giving talks is hard to self-teach. But I digress...

    It sounds like you've already solved how to dispense with the lookup table entirely and calculate on the fly. At least for the case where symbols are binary, which simplifies things a bit. Also, it sounds like the "width" factor on Wolfram is basically how many bits you want to save up at a time. So you pick one that's convenient for your implementation.

  11. #11
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    I use a table to avoid division. It multiplies by the reciprocal instead. In fpaqa I used tables entirely: input the state and probability and output the new state, but this turned out to be slower than multiplication. It also loses a little compression to keep the table size reasonable.

    It's too bad nobody discovered ABC when arithmetic coding was still patented, or else it would probably be the dominant coder today.

  12. #12
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    483
    Thanked 279 Times in 119 Posts
    I was curious to know if the ANS FSM construction method would produce the same FSM table, with only a few cell-swaps to notice,
    so I tried another layout.
    It turns out that it generated a different transition table :

    Using distribution 0 0 0 1
    States are numbered 4-7
    4 : output 0; read 1 bit => next 6 to 7
    5 : output 00; read 1 bit => next 6 or 7
    6 : output 000; read 1 bit => next 6 to 7
    7 : output 1; read 2 bits => next 4 to 7

    Under the rough assumption of a random input provided to the decoder, it can be calculated that it generates, on average, 12 zeroes for 3 ones.
    That's an 80/20 distribution. So it's once again different from the intended 75/25.

    So apparently, both FSMs are "close to 75/25", but not exactly, and they seem tuned for somewhat different distributions (70/30 vs 80/20).
    In a bid to better study these differences, I built an emulator, using the above FSMs to compress a random 0/1 input.
    I ran several tests with different distributions, verifying whether the 0100 and 0001 FSMs are equivalent.

    They are not. Here are some detailed results :

    Distribution 70/30 : Theoretical : 88.13% / 0100 : 88.82% / 0001 : 91.96%
    Distribution 75/25 : Theoretical : 81.13% / 0100 : 82.14% / 0001 : 82.43%
    Distribution 80/20 : Theoretical : 72.18% / 0100 : 75.55% / 0001 : 72.78%

    Conclusion :
    0100 and 0001 FSMs are not equivalent. As suspected, 0100 is better tuned for 70/30, and 0001 is better tuned for 80/20.

    The ANS FSM construction method leads to various FSMs which are "roughly okay" for the target distribution, but not optimal, and not equivalent.

    It's also interesting to notice that, even when using the FSMs on their optimal distributions, the "theoretical" ratio is approached but not quite reached, with a noticeable difference of more than 0.5%.


    Maybe all these effects are related to the size of the FSM : 4 states might be too small after all (even for a simple 75/25 distribution).
    Maybe using a larger FSM would produce a better result.
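
    For reference, this is the kind of quick simulation I have in mind (a toy sketch of my own; the table is the 0001 decoder above and rand() stands in for the random input). Feeding random bits to this decoder and counting the emitted symbols settles just below 79/21, consistent with the rough 80/20 estimate above :

    Code:
    #include <stdio.h>
    #include <stdlib.h>

    /* Decode table for layout 0 0 0 1 (states 4-7), as listed above:
       each state emits 'zeros' 0s and 'ones' 1s, then reads 'nbits' random bits. */
    typedef struct { int zeros, ones, nbits, base; } Entry;

    static const Entry table[4] = {
        /* state 4: output 0;   read 1 bit  */ { 1, 0, 1, 6 },
        /* state 5: output 00;  read 1 bit  */ { 2, 0, 1, 6 },
        /* state 6: output 000; read 1 bit  */ { 3, 0, 1, 6 },
        /* state 7: output 1;   read 2 bits */ { 0, 1, 2, 4 },
    };

    int main(void) {
        long zeros = 0, ones = 0;
        int state = 4;
        srand(1);
        for (long step = 0; step < 10000000; step++) {
            const Entry *e = &table[state - 4];
            zeros += e->zeros;
            ones  += e->ones;
            int v = 0;
            for (int i = 0; i < e->nbits; i++)       /* feed random bits to the decoder */
                v = (v << 1) | ((rand() >> 3) & 1);
            state = e->base + v;
        }
        printf("0: %.2f%%  1: %.2f%%\n",
               100.0 * zeros / (zeros + ones), 100.0 * ones / (zeros + ones));
        return 0;
    }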

  13. #13
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,496
    Thanks
    26
    Thanked 132 Times in 102 Posts
    I have a quick thought:
    If we restrict ourselves to logistic mixing as a final modelling step and LUTs for coding, then we can skip converting stretched (or squashed? I forgot) frequencies to normal frequencies before doing encoding and instead do that once when building LUTs. That could offset the performance and efficiency losses. What do you think: would that work? Maybe someone already did that?

  14. #14
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    Using more states and more bits to represent probabilities will give you better compression (closer to Shannon limit) just like with arithmetic coding.

    If you use a full lookup table implementation, then it makes sense to use greater precision for probabilities near 0 or 1 by stretching it (log(p/(1-p))) before quantization. I did something like this in fpaqa.
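
    For reference, a minimal floating-point sketch of the stretch/squash pair being described (my own toy code; fpaqa itself works with fixed-point tables rather than doubles) :

    Code:
    #include <math.h>
    #include <stdio.h>

    /* stretch maps a probability to the logistic domain, where quantization error
       matters roughly equally everywhere; squash is its inverse. */
    static double stretch(double p) { return log(p / (1.0 - p)); }
    static double squash(double x)  { return 1.0 / (1.0 + exp(-x)); }

    int main(void) {
        /* Equal steps in the stretched domain correspond to finer probability
           steps near 0 and 1, which is where precision matters most. */
        for (int i = -8; i <= 8; i += 2)
            printf("squash(%+d) = %.6f\n", i, squash((double)i));
        printf("round trip: %.6f\n", squash(stretch(0.9)));  /* ~0.9 */
        return 0;
    }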

  15. #15
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Cyan View Post
    I was curious to know if the ANS FSM construction method would produce the same FSM table, with only a few cell-swaps to notice,
    so I tried another layout.
    It turns out that it generated a different transition table :

    Using distribution 0 0 0 1
    States are numbered 4-7
    4 : output 0; read 1 bit => next 6 to 7
    5 : output 00; read 1 bit => next 6 or 7
    6 : output 000; read 1 bit => next 6 to 7
    7 : output 1; read 2 bits => next 4 to 7
    I'm going to have to figure out how you're describing the machine. Why have you omitted states 0-3, because they're redundant? You've either reduced it down to the minimal FSM already, or else I don't know what's going on.

    When you say the distribution is 0 0 0 1, do you mean that it's 75% zeroes and 25% ones?

    Using the rough assumption of a random input provided to the decoder, it can be calculated that it generates, on average, 12 zeroes for 3 ones.
    That's a 80/20 distribution. So it's once again different from the intended 75/25.
    As I said, I can't follow your machine, so I can't help debug it at this point. But, intuitively, it seems to me that the more states there are, the closer it will match a distribution. Consider that it has to store the accumulated error in the state, so the more states, the more error information it can hold.

  16. #16
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    For an encoder in state x, the next state is approximately x/p or x/(1-p), depending on the encoded bit and the rounding convention. The encoding is optimal only if the next state is exact. You want x and p to be at least about 12 or 16 bits to make the effects of rounding on p negligible. So the states would start with e.g. 2^12 and grow from there.

  17. #17
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Matt Mahoney View Post
    For an encoder in state x, the next state is approximately x/p or x/(1-p), depending on the encoded bit and the rounding convention. The encoding is optimal only if the next state is exact. You want x and p to be at least about 12 or 16 bits to make the effects of rounding on p negligible. So the states would start with e.g. 2^12 and grow from there.
    Could you explain why you don't need the states from 0-2^12?

  18. #18
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    If you can't compute the transitions on the fly, another way to deal with them, besides an explicit table, is using rank and select on a vector of symbols. For a bit vector:

    rank(0,n) -> the number of 0s in the range [0,n]
    rank(1,n) -> the number of 1s in the range [0,n]
    select(0,m) -> the index of the m'th 0
    select(1,m) -> the index of the m'th 1

    For 75% 0 and 25% 1, the vector would look like this, where each element is a state labeled by a bit:

    0001000100010001...

    To follow the transitions for a sequence of bits:

    s = 0;
    while (state_exists(s)) {
        b = get_bit();
        s = select(b, s);
    }
    // You need to add 1 to s sometimes to avoid an infinite loop. That fiddly piece is left out.


    rank and select can be done efficiently with the right data structure. select(b,m) gives the same value as in the table cell (m,b)
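
    In case it helps, here is a naive O(n) sketch of rank/select on a plain bit array, just to pin down the operations used above (my own code; m is 0-based here, and real implementations add precomputed summaries to answer both queries in roughly constant time) :

    Code:
    #include <stdio.h>

    /* rank_bit(v, b, n, i): number of positions j in [0, i] with v[j] == b.
       select_bit(v, b, n, m): index of the m'th occurrence of b (m counts from 0),
       or -1 if there is no such occurrence. Naive linear scans. */
    static int rank_bit(const int *v, int b, int n, int i) {
        int count = 0;
        for (int j = 0; j <= i && j < n; j++)
            if (v[j] == b) count++;
        return count;
    }

    static int select_bit(const int *v, int b, int n, int m) {
        for (int j = 0; j < n; j++)
            if (v[j] == b && m-- == 0) return j;
        return -1;
    }

    int main(void) {
        /* The 75%/25% layout sketched above: 0001 repeated. */
        int v[16] = { 0,0,0,1, 0,0,0,1, 0,0,0,1, 0,0,0,1 };
        printf("rank(1, 7)   = %d\n", rank_bit(v, 1, 16, 7));    /* 2  */
        printf("select(0, 5) = %d\n", select_bit(v, 0, 16, 5));  /* 6  */
        printf("select(1, 2) = %d\n", select_bit(v, 1, 16, 2));  /* 11 */
        return 0;
    }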
    Last edited by nburns; 1st November 2013 at 01:43.

  19. #19
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    483
    Thanked 279 Times in 119 Posts
    Why have you omitted states 0-3, because they're redundant?
    I'm using Wolfram ANS demo for guidance on how to build a state machine.

    The main rules that can be derived from it :
    1) A state machine with N=2^n states starts at state N
    2) The first state delivering a 0 refers to state nb(0), then the destination grows by one at each subsequent 0
    3) The first state delivering a 1 refers to state nb(1), then the destination grows by one at each subsequent 1

    I don't know what's supposed to be in states 0-3, and I'm not sure it matters.
    Reaching 0-3 is merely a signal that we need to output/input more bits to stay within range.


    When you say the distribution is 0 0 0 1, do you mean that it's 75% zeroes and 25% ones?
    It's more precise than that: it means the assignment of symbols to states is exactly 0, then 0, then 0, then 1.
    In other words :
    State 4 : output a 0, then go to next state
    State 5 : output a 0, then go to next state
    State 6 : output a 0, then go to next state
    State 7 : output a 1, then go to next state
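
    Here is a small sketch of how I read those rules as code (my own reconstruction, not the Wolfram internals): state L+i emits layout[i], its destination starts at nb(symbol) for the first occurrence of that symbol and grows by one per occurrence, and the number of bits to read is whatever brings the destination back into [L..2L[. When zero bits are read, the decoder chains straight into the destination state, which is why the hand-written tables above show several symbols per state.

    Code:
    #include <stdio.h>
    #include <string.h>

    /* Build a 4-state decode table from a binary layout such as "0001".
       State L+i emits layout[i]; its destination x starts at nb(symbol) and
       increases by one per occurrence of that symbol; then read k bits so
       that x * 2^k lands back in [L, 2L). */
    int main(void) {
        const char *layout = "0001";
        int L = (int)strlen(layout);             /* here L = 4, states 4..7 */
        int count[2] = { 0, 0 };
        for (int i = 0; i < L; i++) count[layout[i] - '0']++;

        int seen[2] = { 0, 0 };
        for (int i = 0; i < L; i++) {
            int s = layout[i] - '0';
            int x = count[s] + seen[s]++;        /* destination before reading bits */
            int k = 0;
            while ((x << k) < L) k++;            /* bits needed to get back to [L, 2L) */
            printf("state %d : symbol %d ; read %d bit(s) => next %d to %d\n",
                   L + i, s, k, x << k, (x << k) + (1 << k) - 1);
        }
        return 0;
    }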


    Using more states and more bits to represent probabilities will give you better compression (closer to Shannon limit)
    Yes, I agree, it's expected.

    However, what I'm stressing here is that not all state machines are equal, even though they contain the same proportions of 0 and 1.

    Coming back to the little 4-states machine studied so far, I've made a program which provides a precise estimate of the optimal probability for a given state machine.
    Here are the results :

    layout 1000 => 0:71.43% 1:28.57%
    layout 0100 => 0:71.43% 1:28.57%
    layout 0010 => 0:78.95% 1:21.05%
    layout 0001 => 0:78.95% 1:21.05%

    So it wasn't exactly the 70/30 and 80/20 roughly evaluated, but close enough.

    This is borne out by experiment : the amount of waste of each state machine is minimized when it is used on a distribution which matches its optimal values :

    P(1)=30.00% : waste(0100) = 0.79% -- waste(0010) = 4.35%
    P(1)=28.57% : waste(0100) = 0.69% -- waste(0010) = 3.40%
    P(1)=25.00% : waste(0100) = 1.25% -- waste(0010) = 1.61%
    P(1)=21.05% : waste(0100) = 3.63% -- waste(0010) = 0.78%
    P(1)=20.00% : waste(0100) = 4.66% -- waste(0010) = 0.82%

    What matters here is that the construction method of the FSM doesn't provide a reliable way to reach 75/25 (in contrast with arithmetic and Huffman coding, which are very straightforward). Adding more states to the FSM is likely to improve its accuracy, but the divergence between the different possible FSMs is likely to remain, and I guess it's going to be difficult to select the right (optimal) FSM for a given probability without checking them all to find the least wasteful one.
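
    For the record, here is roughly how such a precise estimate can be computed without simulation (a sketch of my own, under the assumption that input bits are random, so every successor of a state is equally likely): iterate the state-occupancy probabilities to a fixed point, then weight each state's output by its occupancy. For the 0100 table this converges to 0:71.43% / 1:28.57%, matching the figure above.

    Code:
    #include <stdio.h>

    /* Steady-state analysis of the 4-state decode table for layout 0 1 0 0,
       assuming random input bits (each of the 2^nbits successors of a state is
       equally likely). Power iteration on the occupancy probabilities, then the
       implied symbol distribution is the probability-weighted output. */
    typedef struct { int zeros, ones, nbits, base; } Entry;

    static const Entry table[4] = {
        /* state 4: output 0;  read 1 bit  */ { 1, 0, 1, 6 },
        /* state 5: output 1;  read 2 bits */ { 0, 1, 2, 4 },
        /* state 6: output 00; read 1 bit  */ { 2, 0, 1, 6 },
        /* state 7: output 01; read 2 bits */ { 1, 1, 2, 4 },
    };

    int main(void) {
        double pi[4] = { 0.25, 0.25, 0.25, 0.25 };
        for (int iter = 0; iter < 1000; iter++) {
            double next[4] = { 0, 0, 0, 0 };
            for (int i = 0; i < 4; i++) {
                const Entry *e = &table[i];
                int fanout = 1 << e->nbits;
                for (int v = 0; v < fanout; v++)
                    next[e->base + v - 4] += pi[i] / fanout;
            }
            for (int i = 0; i < 4; i++) pi[i] = next[i];
        }
        double z = 0, o = 0;
        for (int i = 0; i < 4; i++) { z += pi[i] * table[i].zeros; o += pi[i] * table[i].ones; }
        printf("0: %.2f%%  1: %.2f%%\n", 100 * z / (z + o), 100 * o / (z + o));
        return 0;
    }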

  20. #20
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    483
    Thanked 279 Times in 119 Posts
    Getting one step further, I've tested all possible 8-states combinations for a 75/25 distribution.

    As can be expected, they lead to different results.
    The list is long, so I will provide here what I believe are the most important results :

    layout 11000000 => 0:71.43% 1:28.57%
    (...)
    layout 10001000 => 0:71.43% 1:28.57%
    layout 01010000 => 0:71.43% 1:28.57%
    layout 01000100 => 0:73.58% 1:26.42%
    layout 00100100 => 0:75.55% 1:24.45%
    layout 00100010 => 0:75.90% 1:24.10%
    layout 00010001 => 0:77.78% 1:22.22%
    layout 00001100 => 0:78.95% 1:21.05%
    (...)
    layout 00000011 => 0:78.95% 1:21.05%

    Minor detail : many layouts result in exactly the same optimal distribution, notably at the edges :
    9/28 FSM get exactly 0:71.43% 1:28.57%
    6/28 FSM get exactly 0:78.95% 1:21.05%

    Optimal distribution disparity is the same as for the 4-state FSM : it ranges from 21.05% to 28.57%.
    But, in contrast with the 4-state FSM, we now have several intermediate values, which allow selecting a better one.

    The best result is 75.55/24.45. Once again, it's not exactly 75/25, but fair enough.
    Putting it to the test shows that it is indeed better at compressing a 75/25 input than any 4-state FSM so far :

    P(1)=25.00% : waste(0100) = 1.25% -- waste(0010) = 1.61% -- waste(00100100) = 0.31%

    And once again, this particular layout is better at compressing a bitstream with distribution closer to its optimal values :

    P(1)=24.45% : waste(00100100) = 0.30%

    OK, fair enough, 25% is almost as good, so let's say we have our optimal layout (for an 8-state FSM).
    It's important to note that this best result is achieved using the layout 00100100.
    Why is that important ? Because the 1s are not evenly spaced at a distance of 4 !

    I've also put the "evenly spaced" layouts in the list, and none of them reaches the winning position.
    Two of them are even close to pathological corner cases !

    layout 10001000 => 0:71.43% 1:28.57%
    layout 01000100 => 0:73.58% 1:26.42%
    layout 00100010 => 0:75.90% 1:24.10% <== best one
    layout 00010001 => 0:77.78% 1:22.22%

    What it means is that finding the "best" FSM is going to be a difficult exercise. There's no "obvious" pattern so far.
    Last edited by Cyan; 1st November 2013 at 17:03.

  21. #21
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    Quote Originally Posted by nburns View Post
    Could you explain why you don't need the states from 0-2^12?
    Because there would be a large roundoff error. Look at how the coding works mathematically.

  22. #22
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    483
    Thanked 279 Times in 119 Posts
    Continuing this journey into FSM entropy compression,
    I was curious to see if the layouts producing the same distribution of 0 & 1 were indeed fully equivalent, or just "close variations".
    So I created the reverse operation, and tried to compress the same input with all of them.

    It turns out they are *perfectly* equivalent.
    Here are some detailed results :

    Generating 100 Mb with P(1)=28.57%
    Encoding with layout 0100 : result : 86.91% (wasted 0.69%)
    Encoding with layout 11000000 : result : 86.91% (wasted 0.69%)
    Encoding with layout 10100000 : result : 86.91% (wasted 0.69%)
    Encoding with layout 10010000 : result : 86.91% (wasted 0.69%)
    Encoding with layout 10001000 : result : 86.91% (wasted 0.69%)
    Encoding with layout 10000100 : result : 86.91% (wasted 0.69%)
    Encoding with layout 10000010 : result : 86.91% (wasted 0.69%)
    Encoding with layout 10000001 : result : 86.91% (wasted 0.69%)
    Encoding with layout 01100000 : result : 86.91% (wasted 0.69%)
    Encoding with layout 01010000 : result : 86.91% (wasted 0.69%)

    As you can see, they all produce the exact same result.
    Perhaps even more interestingly, the "waste" level is exactly the same for the 8-state FSMs and the single 4-state FSM sharing the same characteristics.
    It seems to indicate that it's not always worthwhile to increase the number of states, at least not with these "edge" layouts.

    The same holds true for the other end of the spectrum :
    Generating 100 Mb with P=21.05%
    Encoding with layout 0001 : result : 74.80% (wasted 0.78%)
    Encoding with layout 00001100 : result : 74.80% (wasted 0.78%)
    Encoding with layout 00001010 : result : 74.80% (wasted 0.78%)
    Encoding with layout 00001001 : result : 74.80% (wasted 0.78%)
    Encoding with layout 00000110 : result : 74.80% (wasted 0.78%)
    Encoding with layout 00000101 : result : 74.80% (wasted 0.78%)
    Encoding with layout 00000011 : result : 74.80% (wasted 0.78%)

    Same effect, same consequences.


    I also had a deeper look into the "optimal" 8-state FSM for the target 75/25.
    It turns out the best layout is not alone.
    Three other layouts display almost the same characteristics.

    layout 00100100 => 0:75.55% 1:24.45%
    layout 00101000 => 0:75.56% 1:24.44%
    layout 00011000 => 0:75.56% 1:24.44%
    layout 00010100 => 0:75.56% 1:24.44%

    On closer inspection, they are perfectly equivalent, producing the same compression performance :

    Generating 100 Mb with P(1)=25.00%
    Encoding with layout 00100100 : result : 81.38% (wasted 0.31%)
    Encoding with layout 00101000 : result : 81.38% (wasted 0.31%)
    Encoding with layout 00011000 : result : 81.38% (wasted 0.31%)
    Encoding with layout 00010100 : result : 81.38% (wasted 0.31%)

    For these ones, at least, I have an explanation.
    When looking at how the FSM works, it's obvious in retrospect that positions differing only in their last bit are perfectly equivalent.
    It means positions 0&1, 2&3, 4&5, etc. can be swapped without changing the properties of the FSM.
    This is exactly what happens for the above 4 FSMs : they all have one 1 in each of the middle 2-position slots.

    Another interesting point to notice is that the waste level for the latest FSM is much lower (0.31%) than the waste level at the "edges" of the disparity (0.69% - 0.78%). This was also expected : the layouts at the edges are not regular, and are expected to produce side-effects detrimental to compression. Now it's also clearly measured.

  23. #23
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Because there would be a large roundoff error. Look at how the coding works mathematically.
    I guess the assumption is that only states that emit bits should count.

    If you're viewing it as an FSM, it doesn't make sense to omit states. I haven't worked the implementation out in detail, but if you view the forward transitions as multiplications, then I'm sure you might as well not draw them. Andrew Polar's blog post draws them in, and that seems pretty easy to follow.

  24. #24
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Matt: are you using floating point numbers to do the state changes?

    If so, it might be worth figuring out how to do it with integer arithmetic.

  25. #25
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 797 Times in 489 Posts
    It is using integer arithmetic. Look at the algorithm description in my book or fpaqc comments. Emitting bits is separate from coding. As an FSM you are free to relabel the states, but from a mathematical perspective it makes sense to think of states within a range like 2^12..2^32-1 with 12 bit probabilities, or something. Then whenever you have more than 20 bits, you emit the low 8 bits of the state.

  26. #26
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Matt Mahoney View Post
    It is using integer arithmetic. Look at the algorithm description in my book or fpaqc comments. Emitting bits is separate from coding. As an FSM you are free to relabel the states, but from a mathematical perspective it makes sense to think of states within a range like 2^12..2^32-1 with 12 bit probabilities, or something. Then whenever you have more than 20 bits, you emit the low 8 bits of the state.
    I looked up the section in DCE and tracked down the code in fpaqc that implements ABC. I haven't really gone over fpaqc yet, but I think what it says in DCE is making sense now. I can see how you could implement it with only integers. You could use the remainder the same way you'd use the fractional part -- or perhaps there's an even better way I haven't thought of*. The code in fpaqc looks very easy to read. I appreciate your efforts to create decent documentation, for esoteric stuff like this especially.

    The way it distinguishes 0s and 1s when encoding bits -- does that scale naturally to accommodate larger alphabets? I gave it some thought, and it seems like the most natural way to handle a larger alphabet would be essentially to break it into binary and use the algorithm the way it is.

    Edit: * I happened to come across a source on this and here's what it says, just in case it's any different from what's in fpaqc:

    Divide Rounding

    Joe Ibershoff, one of my students, pointed out that integer divide normally yields the floor, but both ceiling and round-to-nearest are easy and useful. I thought these were fairly well-known tricks closely related to the Alignment of Pointers magic, but perhaps they aren't so obvious...? He points out that Ceiling(a/b) is (a+b-1)/b and RoundToNearest(a/b) is (a+(b/2))/b. Of course, these tricks also work if divide is implemented in less obvious ways, such as shifts or shift-and-subtract sequences.

    Edit: Actually, that does sound obvious. The non-obvious part would be that you're dividing by a fraction. I imagine that you could store the fraction as a numerator and denominator and first multiply by the denominator and then divide by the numerator as above. You'd run the risk of overflow during the multiplication, though.
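
    A tiny sketch of those identities, plus dividing by a fraction stored as numerator/denominator with a 64-bit intermediate to dodge the overflow concern above (the helper names are mine) :

    Code:
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t floor_div(uint32_t a, uint32_t b) { return a / b; }
    static uint32_t ceil_div (uint32_t a, uint32_t b) { return (a + b - 1) / b; }
    static uint32_t round_div(uint32_t a, uint32_t b) { return (a + b / 2) / b; }

    /* floor(x / (num/den)) = floor(x*den / num), widening to 64 bits first so the
       intermediate product x*den cannot overflow 32 bits. */
    static uint32_t div_by_fraction(uint32_t x, uint32_t num, uint32_t den) {
        return (uint32_t)(((uint64_t)x * den) / num);
    }

    int main(void) {
        printf("floor(7/2)=%u ceil(7/2)=%u round(7/2)=%u\n",
               floor_div(7, 2), ceil_div(7, 2), round_div(7, 2));   /* 3 4 4 */
        printf("floor(5 / 0.4) = %u\n", div_by_fraction(5, 2, 5));  /* 12 */
        return 0;
    }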
    Last edited by nburns; 10th November 2013 at 17:56.

  27. #27
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 64 Times in 33 Posts

    Tunstall coding

    Quote Originally Posted by Cyan View Post
    The intention is to experiment with an entropy decoder suitable for low-performance CPU.
    Did you investigate Tunstall coding? It's a variable-to-fixed-length coding (meaning that decoding will go the fast fixed-to-variable way).

    The alphabet grows by successively adding new sequences made of the most frequent sequence (which is removed from the alphabet right after) followed by each original letter.

    For instance, on a two-letter alphabet (A and B) having 75% and 25% probabilities of occurrence:

    Code:
    A  0.75 <-
    B  0.25
    
    B   0.2500
    AA  0.5625 <-
    AB  0.1875
    
    B     0.2500
    AB    0.1875
    AAA   0.4219 <-
    AAB   0.1406
    
    B         0.2500
    AB        0.1875
    AAB       0.1406
    AAAA      0.3164 <-
    AAAB      0.1054
    
    B         0.2500 <-
    AB        0.1875
    AAB       0.1406
    AAAB      0.1054
    AAAAA     0.2373
    AAAAB     0.0791
    
    AB        0.1875
    AAB       0.1406
    AAAB      0.1054
    AAAAA     0.2373 <-
    AAAAB     0.0791
    BA        0.1875
    BB        0.0625
    
    AB        0.1875 <-
    AAB       0.1406
    AAAB      0.1054
    AAAAB     0.0791
    BA        0.1875
    BB        0.0625
    AAAAAA    0.1780
    AAAAAB    0.0593
    
    AAB       0.1406
    AAAB      0.1054
    AAAAB     0.0791
    BA        0.1875 <-
    BB        0.0625
    AAAAAA    0.1780
    AAAAAB    0.0593
    ABA       0.1406
    ABB       0.0469
    
    AAB       0.1406
    AAAB      0.1054
    AAAAB     0.0791
    BB        0.0625
    AAAAAA    0.1780 <-
    AAAAAB    0.0593
    ABA       0.1406
    ABB       0.0469
    BAA       0.1406
    BAB       0.0469
    
    AAB       0.1406 <-
    AAAB      0.1054
    AAAAB     0.0791
    BB        0.0625
    AAAAAB    0.0593
    ABA       0.1406
    ABB       0.0469
    BAA       0.1406
    BAB       0.0469
    AAAAAAA   0.1335
    AAAAAAB   0.0445
    
    AAAB      0.1054
    AAAAB     0.0791
    BB        0.0625
    AAAAAB    0.0593
    ABA       0.1406 <-
    ABB       0.0469
    BAA       0.1406
    BAB       0.0469
    AAAAAAA   0.1335
    AAAAAAB   0.0445
    AABA      0.1054
    AABB      0.0351
    
    AAAB      0.1054
    AAAAB     0.0791
    BB        0.0625
    AAAAAB    0.0593
    ABB       0.0469
    BAA       0.1406 <-
    BAB       0.0469
    AAAAAAA   0.1335
    AAAAAAB   0.0445
    AABA      0.1054
    AABB      0.0351
    ABAA      0.1054
    ABAB      0.0351
    
    AAAB      0.1054
    AAAAB     0.0791
    BB        0.0625
    AAAAAB    0.0593
    ABB       0.0469
    BAB       0.0469
    AAAAAAA   0.1335 <-
    AAAAAAB   0.0445
    AABA      0.1054
    AABB      0.0351
    ABAA      0.1054
    ABAB      0.0351
    BAAA      0.1054
    BAAB      0.0351
    
    AAAB      0.1054 <-
    AAAAB     0.0791
    BB        0.0625
    AAAAAB    0.0593
    ABB       0.0469
    BAB       0.0469
    AAAAAAB   0.0445
    AABA      0.1054
    AABB      0.0351
    ABAA      0.1054
    ABAB      0.0351
    BAAA      0.1054
    BAAB      0.0351
    AAAAAAAA  0.1001
    AAAAAAAB  0.0334
    
    AAAAB     0.0791
    BB        0.0625
    AAAAAB    0.0593
    ABB       0.0469
    BAB       0.0469
    AAAAAAB   0.0445
    AABA      0.1054 <-
    AABB      0.0351
    ABAA      0.1054
    ABAB      0.0351
    BAAA      0.1054
    BAAB      0.0351
    AAAAAAAA  0.1001
    AAAAAAAB  0.0334
    AAABA     0.0791
    AAABB     0.0263
    
    and so on
    At this point each sequence could be written as a 4 bit value.
    Code:
    0000  -> AAAAB
    0001  -> BB
    0010  -> AAAAAB
    0011  -> ABB
    0100  -> BAB
    0101  -> AAAAAAB
    0110  -> AABA
    0111  -> AABB
    1000  -> ABAA
    1001  -> ABAB
    1010  -> BAAA
    1011  -> BAAB
    1100  -> AAAAAAAA
    1101  -> AAAAAAAB
    1110  -> AAABA
    1111  -> AAABB
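
    As a companion to the worked example, here is a small sketch of the construction in code (my own, assuming a memoryless binary source): repeatedly replace the most probable dictionary entry with its one-letter extensions until the dictionary holds 2^k entries. For k=4 it should reproduce the 16 sequences above, possibly in a different order.
    Code:
    #include <stdio.h>
    #include <string.h>

    /* Build a Tunstall dictionary of 16 entries for a memoryless source with
       P(A)=0.75, P(B)=0.25, by repeatedly splitting the most probable entry. */
    #define MAXLEN 32
    #define TARGET 16

    typedef struct { char seq[MAXLEN]; double p; } Entry;

    int main(void) {
        const char letters[2] = { 'A', 'B' };
        const double prob[2]  = { 0.75, 0.25 };
        Entry dict[TARGET];
        int n = 2;
        strcpy(dict[0].seq, "A"); dict[0].p = prob[0];
        strcpy(dict[1].seq, "B"); dict[1].p = prob[1];

        while (n < TARGET) {
            int best = 0;                            /* most probable entry */
            for (int i = 1; i < n; i++)
                if (dict[i].p > dict[best].p) best = i;
            Entry parent = dict[best];
            for (int c = 0; c < 2; c++) {            /* extend by each letter */
                Entry child;
                snprintf(child.seq, MAXLEN, "%s%c", parent.seq, letters[c]);
                child.p = parent.p * prob[c];
                if (c == 0) dict[best] = child;      /* reuse the parent's slot */
                else        dict[n++]  = child;      /* append                  */
            }
        }
        for (int i = 0; i < n; i++)
            printf("%2d -> %-10s %.4f\n", i, dict[i].seq, dict[i].p);
        return 0;
    }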
    Last edited by caveman; 13th November 2013 at 12:21. Reason: corrected sample

  28. Thanks (2):

    Matt Mahoney (13th November 2013),Piotr Tarsa (13th November 2013)

  29. #28
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    483
    Thanked 279 Times in 119 Posts
    I did not so far, but it's an interesting suggestion.

  30. #29
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,496
    Thanks
    26
    Thanked 132 Times in 102 Posts
    Tunstall coding seems to be compatible with compressed BWT (edit: hmm, it was at first sight). At least it's some food for thought.

    Actually, with Tunstall coding, you can use higher-order probabilities than just order-0. More specifically - instead of computing the probability of a sequence S (a dictionary entry) as p_0(s_0) * p_0(s_1) * p_0(s_2) * ... * p_0(s_n-1), you can compute it as: p_0(s_0) * p_1(s_1) * p_2(s_2) * ... * p_n-1(s_n-1), where p_n(X) is the probability of symbol X in an order-n context (the context is implied; I didn't want to obscure the expression).
    Last edited by Piotr Tarsa; 13th November 2013 at 13:09.

  31. #30
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    803
    Thanks
    244
    Thanked 255 Times in 159 Posts
    Hi all,
    If there is something difficult to understand about ANS, I would gladly explain. Here is a picture showing the difference between arithmetic coding and ANS:



    I have just written a new paper about ANS (to finally send to a journal) - completely rewritten, lots of examples, pictures, simulation results, improved precise initialization for ANS. It is also written from the perspective of probably the most important advantage of ANS - constructing entropy coding automata with a very small number of states, extremely close to Shannon capacity (perfect for hardware implementation): fewer than 20 states are usually enough to get below 0.001 bits/symbol loss (practically unreachable for Huffman; arithmetic coding would need many more states):

    http://arxiv.org/abs/1311.2540
    Here is a poster gathering basic information.

    Please ask if there are any questions and I would be grateful for comments.

  32. Thanks (6):

    biject.bwts (13th November 2013),encode (13th November 2013),kedart (19th September 2016),Matt Mahoney (13th November 2013),Nania Francesco (9th July 2014),nburns (14th November 2013)
