
Thread: Reducing the header - optimal quantization/compression of probability distribution?

  1. #61 - JamesB (Member, Cambridge, UK)
    One key thing about blocking is that it permits random access within a compressed file if you couple it with some virtual offset mapping; e.g. see the bgzf deflate variant. Attempts to use the previous header as context for shrinking the current header prevent such things, and if you give that up then I agree there aren't so many benefits to using a block-based system. I agree with nburns: where Huffman is being used sensibly (mainly simple systems without complex modelling), ANS can also be used sensibly and is often a better alternative. Sometimes using a model with context will give higher compression ratios on our data, but that doesn't always make it *better*. If it did, the world would use paq8 everywhere, but it doesn't. It depends what we want to achieve.
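    For illustration, here is a rough sketch of that random-access idea (not bgzf itself; the framing and function names are made up, and zlib is used per block just to have something concrete): compress fixed-size blocks independently, record an (uncompressed offset, compressed offset) index, and then a random read only has to decompress the blocks it overlaps.

    Code:
import zlib

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block

def compress_blocks(data):
    """Compress data in independent blocks and build an offset index.
    index[i] = (uncompressed_start, compressed_start) of block i."""
    out, index, cpos = bytearray(), [], 0
    for upos in range(0, len(data), BLOCK_SIZE):
        block = zlib.compress(data[upos:upos + BLOCK_SIZE])
        index.append((upos, cpos))
        out += len(block).to_bytes(4, "little") + block
        cpos += 4 + len(block)
    return bytes(out), index

def read_at(compressed, index, offset, length):
    """Random access: decompress only the blocks covering [offset, offset+length)."""
    result = bytearray()
    for ustart, cstart in index:
        uend = ustart + BLOCK_SIZE
        if uend <= offset or ustart >= offset + length:
            continue  # block does not overlap the requested range
        blen = int.from_bytes(compressed[cstart:cstart + 4], "little")
        block = zlib.decompress(compressed[cstart + 4:cstart + 4 + blen])
        lo = max(offset - ustart, 0)
        hi = min(offset + length - ustart, len(block))
        result += block[lo:hi]
    return bytes(result)

data = bytes(range(256)) * 2000
comp, idx = compress_blocks(data)
assert read_at(comp, idx, 100000, 50) == data[100000:100050]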

    I also still stand by my view that practical implementations of block-based entropy encoders are faster than adaptive streaming entropy encoders. Agreed, block-based coding needs multiple passes, so in theory a streaming system has an advantage; maybe there are commercial tools, or hardware-based systems, that prove otherwise. However, I've yet to see a free command-line tool that does nothing other than entropy encoding demonstrate lightning-quick arithmetic coding.

  2. #62 - nburns (Member, San Diego)
    Quote Originally Posted by thorfdbg View Post
    Does it? Sorry, I don't understand. The *only* thing you get "on top" is that you have a unique mapping between symbols and bit-sequences, something that does not hold for AC, but you pay a price for it.
    It's a nifty algorithm that builds a kind of optimal weighted tree. It's not just a code. And it's hard to say that the code would never have a good use anywhere.

    Does it beat AC at compression? Absolutely not.
    Last edited by nburns; 6th March 2014 at 14:35.

  3. #63 - nburns (Member, San Diego)
    Quote Originally Posted by JamesB View Post
    I also still stand by my view that practical implementations of block-based entropy encoders are faster than adaptive streaming entropy encoders. Agreed, block-based coding needs multiple passes, so in theory a streaming system has an advantage; maybe there are commercial tools, or hardware-based systems, that prove otherwise. However, I've yet to see a free command-line tool that does nothing other than entropy encoding demonstrate lightning-quick arithmetic coding.
    Yeah. AC implementations always seem to be inseparable from their model.

    Scanning memory is fast in my experience, and small blocks will be entirely in cache. The extra scan is probably not even the performance limiter for something like FSE. The performance limiter probably has to do with superscalar and branch misprediction stalls in the computation.
    Last edited by nburns; 6th March 2014 at 14:44.

  4. #64 - thorfdbg (Member, Stuttgart)
    Quote Originally Posted by nburns View Post
    Scanning memory is fast in my experience, and small blocks will be entirely in cache. The extra scan is probably not even the performance limiter for something like FSE. The performance limiter probably has to do with superscalar and branch misprediction stalls in the computation.
    The problem here is that you seem to limit your view to software implementations only. However, many compression applications are actually hardware implementations. Memory footprint is pretty important there because memory is pricey, while adding a couple of gates to support algorithms not suitable for software is actually easy. Thus, for example, adding the JPEG-LS prediction or contexts is a piece of cake in hardware. Scanning an image twice is not an option if you need an online compression algorithm, for example (online = compress "as the data is coming", without much buffering).

  5. #65 - nburns (Member, San Diego)
    Quote Originally Posted by thorfdbg View Post
    The problem here is that you seem to limit your view to software implementations only. However, many compression applications are actually hardware implementations. Memory footprint is pretty important there because memory is pricey, while adding a couple of gates to support algorithms not suitable for software is actually easy. Thus, for example, adding the JPEG-LS prediction or contexts is a piece of cake in hardware. Scanning an image twice is not an option if you need an online compression algorithm, for example (online = compress "as the data is coming", without much buffering).
    I take your word for it, because that's not my area of expertise. I did not mean my post to be taken as applying to anything other than contemporary PC hardware.

  6. #66 - nburns (Member, San Diego)
    This paper has useful background and helps put ANS in context:

    http://www.1stworks.com/ref/cleary84enum.pdf

    You can say ANS is just a different approach to AC, but you could just as well say that AC is just a different approach to enumerative coding. All of them can be seen as computing an (approximate) rank among all equivalent inputs in lexicographic order.

    I think you could say the difference between AC and ANS is analogous to the difference between radix sort most-significant first and radix sort least-significant first. MSB radix sort, like AC, needs more state. It's also like searching forward vs backward through the BWT.
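    To make the "rank among all equivalent inputs" idea concrete, here is a toy sketch of the binary enumerative code from that setting (my own example, not from the paper's text): the lexicographic rank of a bit string among all strings of the same length and weight comes out as a sum of binomial coefficients.

    Code:
from math import comb

def enum_rank(bits):
    """Lexicographic rank of `bits` among all bit strings with the same
    length and the same number of 1s (binary enumerative coding)."""
    ones = bits.count(1)
    rank = 0
    for i, b in enumerate(bits):
        remaining = len(bits) - i - 1
        if b == 1:
            # every string with the same prefix but a 0 here comes first:
            # it must place all `ones` remaining 1s in the `remaining` slots
            rank += comb(remaining, ones)
            ones -= 1
    return rank

# 6-bit strings with three 1s: comb(6, 3) = 20 of them, ranks 0..19
assert enum_rank([0, 0, 0, 1, 1, 1]) == 0    # lexicographically smallest
assert enum_rank([1, 1, 1, 0, 0, 0]) == 19   # lexicographically largest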

  7. #67 - nburns (Member, San Diego)
    I thought of how you might get AC-like performance using Huffman:

    Compute a Huffman code using the observed symbol frequencies. Call this Huffman-1.

    Then, for all possible bytes, work backward and compute the expected frequency of seeing each byte using the Huffman-1 code you generated and the original symbol statistics. Use these new statistics to compute a new Huffman code. Call this code Huffman-2.

    You can repeat as many times as needed, and it should converge to some kind of optimal code.
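    For concreteness, a minimal sketch of the first step, building Huffman-1 from the observed statistics (just textbook Huffman construction; the function name is made up):

    Code:
import heapq

def huffman_code(freqs):
    """Textbook Huffman construction. freqs maps symbol -> weight (count or
    probability); returns symbol -> bit string."""
    heap = [(w, i, s) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                        # degenerate one-symbol alphabet
        return {heap[0][2]: "0"}
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)       # merge the two lightest subtrees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tiebreak, (t1, t2)))
        tiebreak += 1
    code = {}
    def walk(tree, prefix):                   # assign 0/1 down the tree
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix
    walk(heap[0][2], "")
    return code

print(huffman_code({"A": 0.88, "B": 0.10, "C": 0.02}))
# code lengths come out as A: 1 bit, B: 2 bits, C: 2 bits
# (the exact 0/1 labelling is arbitrary; A->0, B->10, C->11 is one valid choice)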

  8. #68 - Matt Mahoney (Expert, Melbourne, Florida, USA)
    Isn't that PPM + Huffman?

  9. #69 - nburns (Member, San Diego)
    Quote Originally Posted by Matt Mahoney View Post
    Isn't that PPM + Huffman?
    I'm not sure how literally you mean that, but I think the net result would be PPM-like.

    It did occur to me that these bit-splitting entropy coders are, in effect, doing something like partial matching.

    Edit:
    Note, though, that the successive iterations of Huffman are based only on the information in the header, not the actual stream of symbols. If you based the new codes on the actual stream of symbols, you would need to add information to the header.
    Last edited by nburns; 11th March 2014 at 03:17.

  10. #70 - Matt Mahoney (Expert, Melbourne, Florida, USA)
    It is theoretically possible for the decompressor to dynamically update the Huffman table after every character based on the same high-order model used by the compressor. But it is not used in practice because recomputing the Huffman table would take longer than arithmetic coding and would give worse compression.

  11. #71 - nburns (Member, San Diego)
    Quote Originally Posted by Matt Mahoney View Post
    It is theoretically possible for the decompressor to dynamically update the Huffman table after every character based on the same high-order model used by the compressor. But it is not used in practice because recomputing the Huffman table would take longer than arithmetic coding and would give worse compression.
    But that's not what I'm talking about.

    for all possible bytes, work backward and compute the expected frequency of seeing each byte using the Huffman-1 code you generated and the original symbol statistics


    I.e. for every possible sequence of 8 bits, compute the probability that this sequence will come out of the Huffman-1 encoder. This probability would ideally be 1/256, but it will be different by some amount of error. The Huffman-2 code is generated with these statistics and gives a closer approximation.

    To encode the sequence, you'd encode with Huffman-1, then re-encode with Huffman-2, and so on. But in practice, I think you could find shortcuts.

    Edit:
    Suppose that the symbol probabilities are:

    A -> 0.88
    B -> 0.10
    C -> 0.02

    Huffman-1 would give these codes:

    A -> 0
    B -> 10
    C -> 11

    Now, find the probability that you will get a 0 byte from the Huffman-1 code. These sequences generate eight zero-bits in a row, with probabilities given:

    AAAAAAAA => 00000000 (p = 0.88^8 = 0.360)
    BAAAAAAA => 100000000 (p = 0.1 * 0.88^7 = 0.041)

    The resulting probability of a 0 byte:

    Pr(0) = 0.360 + 0.041 = 0.401

    The ideal probability of a 0 byte (or any byte) is:

    Pr(0) = 1/256 = 0.004

    So you can do better by generating a new Huffman code from these computed probabilities, without any additional information.


    Edit2:
    I think there's an error in my byte probability calculation; to get the right p, you may have to adjust Pr(BAAAAAAA) by the probability that you're starting from the second bit of the code (multiply by 0.5?).
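    As a quick sanity check of the example, here is a toy sketch that enumerates symbol prefixes and tallies the distribution of the first output byte under the Huffman-1 code (same A/B/C model as above). It deliberately ignores the alignment subtlety from Edit2, i.e. it only looks at the very first byte of the stream, so it gets Pr(00000000) = 0.88^8, about 0.360, rather than 0.401. A Huffman-2 code over the 256 byte values could then be built from a distribution like this.

    Code:
# First-byte distribution of the Huffman-1 output for the source above
# (A -> 0.88, B -> 0.10, C -> 0.02; codes A=0, B=10, C=11).
probs = {"A": 0.88, "B": 0.10, "C": 0.02}
code  = {"A": "0",  "B": "10", "C": "11"}

def first_byte_dist():
    dist = {}
    def extend(bits, p):
        if len(bits) >= 8:                    # first byte is now determined
            dist[bits[:8]] = dist.get(bits[:8], 0.0) + p
            return
        for s, ps in probs.items():           # append one more symbol's code
            extend(bits + code[s], p * ps)
    extend("", 1.0)
    return dist

dist = first_byte_dist()
print(dist["00000000"])     # 0.88**8 ~= 0.360 (only AAAAAAAA... produces it)
print(sum(dist.values()))   # ~1.0, sanity check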
    Last edited by nburns; 11th March 2014 at 08:18.

  12. #72 - Jarek (Member, Kraków, Poland)
    Quote Originally Posted by Matt Mahoney View Post
    It is theoretically possible for the decompressor to dynamically update the Huffman table after every character based on the same high-order model used by the compressor. But it is not used in practice because recomputing the Huffman table would take longer than arithmetic coding and would give worse compression.
    rANS with the alias method is cheap for dynamic updating (and accurate).
    For an alphabet of size m you have m buckets - each of them contains at most two symbols: the corresponding one and, additionally, a part of the probability of some other symbol (the "alias").
    So a simple update of the probability of a given symbol can be done e.g. by just shifting the divider in its bucket - it comes at the cost of the probability of the alias symbol, but that is usually a much more probable symbol, so small changes to it have a tiny impact.
    Shifting the divider requires updating the starts of the further appearances of the symbol and the alias, but that is cheap (a few values in a list), and we can try to update buckets as far along as possible.
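    I don't know the exact layout intended here, but below is a rough sketch of the bucket/divider structure being described (standard alias-table construction, in floating point for clarity; a real rANS table would use integer frequencies over a power-of-two total and would also maintain the symbol-start lists mentioned above). All names are made up.

    Code:
def build_alias(probs):
    """Vose-style alias table: one bucket per symbol, each bucket holds the
    symbol itself up to divider[i] and an alias symbol above it.
    probs: list of probabilities summing to 1."""
    n = len(probs)
    divider = [p * n for p in probs]          # each bucket has total mass 1
    alias = list(range(n))
    small = [i for i, d in enumerate(divider) if d < 1.0]
    large = [i for i, d in enumerate(divider) if d >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                          # top of bucket s is filled by symbol l
        divider[l] -= 1.0 - divider[s]        # l gave away (1 - divider[s]) of its mass
        (small if divider[l] < 1.0 else large).append(l)
    return divider, alias

def effective_probs(divider, alias):
    """Recover each symbol's probability from the table."""
    n = len(divider)
    p = [0.0] * n
    for i in range(n):
        p[i] += divider[i] / n                # own part of bucket i
        p[alias[i]] += (1.0 - divider[i]) / n # alias part of bucket i
    return p

probs = [0.65, 0.20, 0.10, 0.05]
divider, alias = build_alias(probs)
print(effective_probs(divider, alias))        # ~ [0.65, 0.20, 0.10, 0.05]

# "Shifting the divider": give symbol 2 a bit more probability at the expense
# of its (more probable) alias symbol, without rebuilding the whole table.
divider[2] = min(1.0, divider[2] + 0.05)
print(effective_probs(divider, alias))        # symbol 2 up, its alias down, others unchanged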
    Last edited by Jarek; 11th March 2014 at 04:58.

  13. #73 - nburns (Member, San Diego)
    Quote Originally Posted by cbloom View Post
    I believe the correct problem to solve is to minimize the total size of encoded probabilities + entropy coded data.
    This is a good point.

    One way that an encoding with a header wastes space in comparison to an optimal adaptive encoding is that the header and the encoded stream tend to duplicate information. Given a multiset (the frequencies), the encoder should encode a multiset permutation instead of a general sequence. The entropy of the multiset permutation is the base 2 log of the corresponding multinomial coefficient. The difference is similar to encoding a general permutation of n as n lg(n)-bit numbers versus the optimal lg(n!)-bit rank.

    To approach optimal, the encoder would effectively have to be dynamic anyway, in spite of having a header: because it couldn't pretend the sequence is stationary and infinite, it would have to do the same amount of computation as the headerless encoder.

    I don't think there can actually be an optimal header in a general sense, because the header is just a part of the data that you send first. You could arbitrarily shuffle data between the header and sequence; the header is not a well defined thing. The only reason for segregating some data to send first is to use a simpler or faster algorithm. So it's an implementation detail and a choice of trade-off.
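    A small toy calculation of that gap (my own numbers, not from the thread): log2 of the multinomial coefficient, i.e. the exact cost of ranking the multiset permutation once the counts are known, versus n times the empirical entropy, which is what a static order-0 coder spends on the stream.

    Code:
from math import lgamma, log
from collections import Counter

LN2 = log(2)

def log2_factorial(k):
    return lgamma(k + 1) / LN2

def multiset_permutation_bits(counts):
    """log2( n! / (c_1! c_2! ...) ): bits to rank the sequence given its counts."""
    n = sum(counts)
    return log2_factorial(n) - sum(log2_factorial(c) for c in counts)

def order0_entropy_bits(counts):
    """n * H(empirical): what a static order-0 coder spends on the stream."""
    n = sum(counts)
    return sum(c * log(n / c) / LN2 for c in counts if c)

counts = list(Counter(b"abracadabra" * 100).values())
print(multiset_permutation_bits(counts))   # slightly less than...
print(order0_entropy_bits(counts))         # ...the i.i.d. stream cost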
    Last edited by nburns; 23rd March 2014 at 02:16.

  14. #74 - nburns (Member, San Diego)
    It might be worth measuring the cost of approximated tANS header frequencies with respect to total output size, to see if it helps or hurts (i.e. it might be better to send exact values). Using fewer bits saves space in the header, that's clear. But considering that the header is just part of the data moved to the front, and that any data that isn't in the header must be included in the stream (and vice versa), approximating the header just redirects that information into the stream, where it might be coded less efficiently because it's spread over multiple instances of identical symbols.
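    A crude way to measure that (a sketch only, not a real tANS header format; quantize() and the "table_log bits per frequency" header estimate are made-up placeholders): compare the header-plus-stream cost with quantized frequencies against the stream cost with exact probabilities.

    Code:
from math import log2
from collections import Counter

def quantize(counts, table_log):
    """Scale counts to integers summing to 2**table_log, keeping every used
    symbol >= 1 (crude rounding; real coders are smarter about this)."""
    total, size = sum(counts.values()), 1 << table_log
    q = {s: max(1, round(c * size / total)) for s, c in counts.items()}
    top = max(q, key=q.get)
    q[top] += size - sum(q.values())     # dump the rounding error on the top symbol
    return q

def stream_bits(counts, freqs, denom):
    """Cost of coding the symbol stream with probabilities freqs[s] / denom."""
    return sum(c * -log2(freqs[s] / denom) for s, c in counts.items())

counts = Counter(b"the quick brown fox jumps over the lazy dog " * 200)
exact = stream_bits(counts, counts, sum(counts.values()))
for table_log in (8, 10, 12, 14):
    q = quantize(counts, table_log)
    header = len(q) * table_log          # rough: table_log bits per quantized freq
    body = stream_bits(counts, q, 1 << table_log)
    print(f"L={table_log}: header~{header} + stream~{body:.0f} bits "
          f"(exact-prob stream: {exact:.0f} bits, plus its own header)")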
    Last edited by nburns; 24th March 2014 at 09:21.

  15. #75 - JamesB (Member, Cambridge, UK)
    Quote Originally Posted by nburns View Post
    To approach optimal, the encoder would effectively have to be dynamic anyway, in spite of having a header: because it couldn't pretend the sequence is stationary and infinite, it would have to do the same amount of computation as the headerless encoder.
    It's solving a different problem though. I think we all agree that adaptive/dynamic encoding gives the smallest results, but despite claims to the contrary we still haven't seen any adaptive systems that run as fast as static ones. Many applications have a need for speed over size, so this discussion is about how to have a minimal size while not sacrificing the speed.

    The point about having to spend as long as a dynamic system in order to compute an optimal header is interesting. If true then it means we can have fast decoding, but not necessarily fast encoding (when optimal). This is probably Yak-shaving though. The difference between optimal and "good enough" approximations will be very small.

  16. #76 - nburns (Member, San Diego)
    Quote Originally Posted by JamesB View Post
    It's solving a different problem though. I think we all agree that adaptive/dynamic encoding gives the smallest results, but despite claims to the contrary we still haven't seen any adaptive systems that run as fast as static ones. Many applications have a need for speed over size, so this discussion is about how to have a minimal size while not sacrificing the speed.

    The point about having to spend as long as a dynamic system in order to compute an optimal header is interesting. If true then it means we can have fast decoding, but not necessarily fast encoding (when optimal). This is probably Yak-shaving though. The difference between optimal and "good enough" approximations will be very small.
    I didn't mean it would spend all the time on the header. It would have to effectively change the probabilities on every symbol, because it would have to compute the (approximate) multinomial coefficient. It would be like doing the adaptive version backwards.

    Edit:
    To be somewhat less vague: it would compute a rank based on multinomial coefficients, i.e. a sum of multinomial coefficients. If you stitched together the sum you'd have to compute for the adaptive version of ANS (like fpaqc, but symbol-wise, I guess), I think you would find that you can roll all the step-wise products up into factorials, and you'd get a sum of multinomial coefficients. You'd need some equivalent of this to be optimal, and you could probably get there in a header version, but it wouldn't make much sense.
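    Here's a quick numerical check of that claim (a sketch only): coding each symbol with the "faithful" probability (remaining occurrences of that symbol over remaining length) multiplies out to exactly one over the multinomial coefficient, so the adaptive total matches log2 of the multinomial coefficient.

    Code:
from math import comb, log2
from collections import Counter

def adaptive_bits(seq):
    """Total -log2 cost if each symbol is coded with probability
    (remaining occurrences of that symbol) / (remaining length)."""
    remaining = Counter(seq)
    left, bits = len(seq), 0.0
    for s in seq:
        bits += -log2(remaining[s] / left)
        remaining[s] -= 1
        left -= 1
    return bits

def multinomial_bits(seq):
    """log2 of n! / (c_1! c_2! ...), computed as a product of binomials."""
    counts = Counter(seq)
    n, total = len(seq), 1
    for c in counts.values():
        total *= comb(n, c)
        n -= c
    return log2(total)

seq = list("mississippi") * 20
print(adaptive_bits(seq))     # equals multinomial_bits(seq) up to float rounding
print(multinomial_bits(seq))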

    Edit2:
    The paper I linked to above (http://www.1stworks.com/ref/cleary84enum.pdf) goes through all the details. I probably glossed over something.

    [Attachment: cleary84enum_quote2.gif - excerpt from the paper linked above (cleary84enum.pdf) on parameterized enumeration]

    So parameterized enumeration uses a header, and it's at least as efficient as adaptive coding, but the computation cost is at least as great, also.

    Adaptive coding gives the same formula as enumerative coding, so long as you use faithful probabilities in the adaptive code, which are the ones you'd normally use, I think.

    This quote deals with the binary case, where the enumerative code becomes a sum of binomial coefficients.

    [Attachment: cleary84enum_quote.gif - excerpt from the same paper on the binary case]
    Last edited by nburns; 25th March 2014 at 04:20.
