
Thread: Reducing the header - optimal quantization/compression of probability distribution?

  1. #31
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    Quote Originally Posted by nburns View Post
    The question is how much of the random noise can you leave out and still get a good fit?
    This is a fundamental question for any kind of adaptive model as well - e.g. whether it is better to rely on statistical properties of previous data to predict future data, or whether, since among other things additional noise will be involved, it would be better to encode counts (or just deviations) in the header ...
    Quote Originally Posted by nburns View Post
    Good observation that Huffman transmits frequencies as lengths, which are logarithms. That's probably the right function to use. Huffman only suffers because the range of logarithms is too narrow.
    Huffman behaves differently from the recurrence I gave - different quantization ranges have different maximal impact on deltaH ((p_s-q_s)^2/p_s) - so it is not an optimal choice.
    To see it, imagine you have to split the [0,1] range into [0,a] and [a,1] subranges and place an (x-x_i)^2 parabola in each of them, such that the maximal or expected value is as small as possible - the best choice is obviously a=1/2 with x_i = 1/4, 3/4, giving both parabolas the same impact.
    Intervals don't have the same impact in the Huffman choice of quantization - while the denominator in (p_s-q_s)^2/p_s doubles, the numerator increases ~4 times. So the largest quantization intervals have the largest impact on deltaH. You can improve it by shifting the boundaries of these ranges.
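    For illustration, a tiny numeric check of this claim (a sketch in Python; the midpoint boundaries and the range of levels are arbitrary choices, not from the thread): quantize p to Huffman-like dyadic values q = 2^-k and look at the worst-case (p-q)^2/p inside each interval - it roughly doubles per level, so the largest intervals indeed dominate deltaH.

        # worst-case (p-q)^2/p per interval for dyadic (Huffman-like) quantizers,
        # with midpoint boundaries between neighbouring levels (illustration only)
        levels = [2.0 ** -k for k in range(1, 8)]            # 1/2, 1/4, ..., 1/128
        for hi, q, lo in zip(levels, levels[1:], levels[2:]):
            a, b = (q + lo) / 2, (q + hi) / 2                # interval of p mapped to q
            worst = max((a - q) ** 2 / a, (b - q) ** 2 / b)  # the max is attained at the endpoints
            print(f"q = {q:.5f}   interval = [{a:.5f}, {b:.5f}]   max (p-q)^2/p = {worst:.2e}")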
    Last edited by Jarek; 27th February 2014 at 19:20.

  2. #32
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    Quote Originally Posted by Jarek View Post
    Thomas, I have also graduated in theoretical mathematics - I understand your point, and gave you a simpler equivalent formulation: e.g. find the minimal description (rate) for a given maximal or expected KL distance.
    Look, I'm both a physicist and a mathematician, so I would believe I know both sides. But before I can solve a problem, I need to understand it and get some feeling for it; that is, I need both a good model and some analysis of where to go. This is currently not the case. I'm not saying that I will necessarily use the mathematical solution, but at least it will feed intuition and allow me to develop a suitable algorithm.

    Quote Originally Posted by Jarek View Post
    I have returned to Yuriy's paper - as it looks like he wants to solve our problem.
    No, he does not. Look at eqn. (3) in his paper. He is trying to find a solution for the rate-distortion problem where the distortion is given as the l^\infty distance. This is not the case here. As far as I understand you, you would like to solve the problem under a distortion given by the K-L distance, which is a different type of "metric". l^\infty is not behaving very well, which makes this problem a bit tricky to discuss (it is not additive in symbols, to give one example), but K-L is, which makes the whole story partially easier (or maybe not - I did not check).

    Quote Originally Posted by Jarek View Post
    It is one of the problems with abstract theoretical models - they usually simplify aspects which turn out to be essential ...
    You may call that an unsuitable simplification. Maybe K-L is an unsuitable simplification as well, see my comments above. You typically want to minimize the overall rate, and *not* the K-L distance. In that sense, any type of overhead due to the header is insignificant, and thus the problem does not appear... Ok, if that's not the answer you want to hear, because your problem is more constrained, we need a good formulation for the problem. It is not Yuriy's problem (minimize l^\infty); maybe it is minimizing the K-L distance plus a weight (call it a Lagrangian multiplier) times the header overhead. This allows you to optimize the rate in terms of the block size. If this is the right model, let's see where the analysis leads. The solution will likely be hard to optimize for - then starts the phase of simplification, approximation and so on. If you allow that, I already have a solution for you: adaptive coding.

    If adaptive coding is not a solution, I would want to know why it is not - from a practical perspective, the problem then goes away, and there is no need to discuss it in the first place. So *what is* the problem you want to solve, if not an abstract mathematical problem? Practical solutions I already presented, but it seems you don't like them. If not, why not?

    Quote Originally Posted by Jarek View Post
    Ok, let's try to formalize it, but I don't see how it helps(?)
    Like in Yuriy's paper, we need the simplex of probabilities:
    W = {(w_1,..,w_m) >= 0: \sum_i w_i = 1}
    for which we need to find a discrete subset: the quantizers Q. It defines a quantization function as the closest quantizer for a given probability distribution: q(p) \in Q minimizes D(p||q). In 1D, values from an interval go to its center. Generally we get a Voronoi diagram.
    Yes. But that's for the wrong metric.

    Quote Originally Posted by Jarek View Post
    We also have a probability distribution among frequencies (rho(p)) - a density function on the simplex W. It induces a probability distribution on Q - its entropy is our rate (not just lg(|Q|) as in Yuriy's paper): Pr(q) = \int_{p: q(p)=q} rho(p) dp, R = -\sum_q Pr(q) lg(Pr(q)).
    Now the problem is e.g.: for a given maximal (max_p D(p||q(p))) or expected (\int_p rho(p) D(p||q(p)) dp) KL distance, find the quantization (Q) with the lowest rate. Or the opposite: for a given rate, find the quantizer with minimal KL distance.
    Exactly. Now insert the formulae for the probabilities and K-L, and then what do we get? Certainly not a simple, easy formula as for fixed-rate Lloyd-Max quantization (centroids). I don't have a result here because I did not go through the computation (other things to do), but maybe the result is enlightening. Without having done that, how can I say that it is worth nothing?
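    To make that concrete, here is a brute-force sketch of such a computation (my own illustration, with assumptions not fixed in the thread: m = 3 symbols, training distributions drawn uniformly from the simplex, a fixed number of quantizers, and a generalized-Lloyd iteration with D(p||q) as distortion; for this distortion the per-cell minimizer is simply the cell mean):

        import numpy as np

        rng = np.random.default_rng(0)
        m, n_points, n_centers = 3, 20000, 16

        # training distributions drawn uniformly from the simplex (Dirichlet(1,...,1)),
        # clipped away from exact zero so the KL terms stay finite
        P = rng.dirichlet(np.ones(m), size=n_points)
        P = np.clip(P, 1e-12, None)
        P /= P.sum(axis=1, keepdims=True)

        def kl(P, q):
            # D(p||q) in bits, for every row p of P against a single center q
            return np.sum(P * np.log2(P / q), axis=1)

        # initialize the quantizers Q with random training points
        Q = P[rng.choice(n_points, size=n_centers, replace=False)].copy()

        for _ in range(50):
            dist = np.stack([kl(P, q) for q in Q], axis=1)   # assignment: nearest center in KL
            idx = np.argmin(dist, axis=1)
            for j in range(n_centers):
                cell = P[idx == j]
                if len(cell):
                    Q[j] = cell.mean(axis=0)                 # for D(p||q) the best center is the cell mean

        dist = np.stack([kl(P, q) for q in Q], axis=1)       # final assignment
        idx = np.argmin(dist, axis=1)
        freq = np.bincount(idx, minlength=n_centers) / n_points
        rate = -np.sum(freq[freq > 0] * np.log2(freq[freq > 0]))
        print("lg|Q|                :", np.log2(n_centers))
        print("entropy-coded rate R :", rate)
        print("expected KL (bits)   :", dist[np.arange(n_points), idx].mean())

    Whether those centers have any nice closed form is exactly the open question; the sketch just measures the expected KL and the rate with and without entropy-coding the chosen quantizer.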

  3. #33
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Jarek View Post
    I value a purely mathematical abstract description, but studying physics has taught me to keep some distance from it - solving real-life problems is rather the art of finding good approximations, and formalization sometimes kills the essence of the problem; sometimes it's more productive to remain longer in the intuition world. And so I don't like working on a purely abstract information-theory description without imagining concrete constructions behind it.
    The risk in over-relying on theory is that you may end up solving the wrong problem, rendering the analysis irrelevant.

    Formal analysis rests on assumptions, and the results are only as good as the assumptions.

    I've found that it's very useful to play around with real data. At the beginning of my interest in compression, I spent many hours flipping through dumps of text data under different transforms and trying out various ideas. It wasn't a very useful way to invent new algorithms, but it helped me build a mental model of what text data is like so I can guess what something will do without having to try it. You can only invent new things by working through a theory, but experimentation is an extremely useful adjunct to this, because it's often very fast to do an experiment that tells you if you're on the right track. Then, if it fails, you can move on to trying to figure out why you were wrong and update the theory, rather than being stuck on something that won't work.
    Last edited by nburns; 28th February 2014 at 05:46.

  4. #34
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    Quote Originally Posted by nburns View Post
    The risk in over-relying on theory is that you may end up solving the wrong problem, rendering the analysis irrelevant.

    Formal analysis rests on assumptions, and the results are only as good as the assumptions.
    I don't disagree, but the problem here is that I don't quite understand the problem. There are practical solutions (do not use a header in the first place and be adaptive) and there are theoretical solutions (study the R-D problem for the K-L distance). It seems none of them is applicable. What I need to understand is why that is. It does not make much sense to attack an academic problem with non-academic means, or to refuse a practical solution for a practical problem.

  5. #35
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    Thomas, Yuriy's paper gives and analyses a natural basic method for this practical problem: quantize probabilities as k_i/n, see them as a division of the [0,1] interval, and use an entropy coder like enumerative coding to encode the dividers between them.
    However, it neglects two opportunities for improvements:
    1) large probabilities can be represented with worse accuracy, as D(p||q) ~ \sum_i (p_i-q_i)^2/q_i
    2) it assumes a uniform distribution on Q: R = lg(|Q|)
    The second one needs some knowledge of what probability distribution to expect - we could use a Zipf or power law here, maybe with a transmitted parameter.
    The first one says that we can save some bits while storing large probabilities.

    I think the main question of this thread is how to effectively exploit these two opportunities.
    They are extremely difficult for theoretical analysis (the balls with which we cover the simplex get nasty), so the only reasonable approach might be to guess some algorithms and test their performance - this kind of approach seems ugly to a theoretical mathematician, but it is necessary in most real-life applications, as in physics or data compression.
    To summarize, for now I see two general approaches:
    - encode the dividers, but adaptively - reducing precision for large probabilities, as in my first two posts in this thread. This approach has a built-in power-law distribution of probabilities (it could be modified ...),
    - encode the probabilities with quantization using a growing family of intervals, such that their impact on deltaH remains the same ((p_s-q_s)^2/q_s). Finally, using an entropy coder for such a quantizer, we can assume some density among the probability distributions. Unfortunately such quantized probabilities do not have to sum to 1 - we need some additional normalization in this approach (a rough sketch of this second approach follows below).
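    A rough sketch of the second approach (my own illustration; the constant C, the minimal probability and the example distribution are arbitrary choices): build levels whose worst-case (p_s-q_s)^2/q_s impact equals a constant C, quantize every p_s to the nearest level, then renormalize so the result sums to 1 again.

        import math

        def equal_impact_levels(C, p_min=1e-3, p_max=1.0):
            # boundaries r_0 < r_1 < ... with centers c_i = (r_i + r_{i+1})/2 chosen
            # so that (r_i - c_i)^2 / c_i == C for every interval
            r, bounds = p_min, [p_min]
            while r < p_max:
                d = C + math.sqrt(C * C + 4 * C * r)   # width d solving (d/2)^2 / (r + d/2) = C
                r += d
                bounds.append(r)
            return [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]

        centers = equal_impact_levels(C=1e-4)
        p = [0.51, 0.22, 0.13, 0.08, 0.04, 0.02]                        # example distribution
        q = [min(centers, key=lambda c: (ps - c) ** 2 / c) for ps in p]
        total = sum(q)
        q = [qs / total for qs in q]                                    # the extra normalization step
        dH = sum((ps - qs) ** 2 / qs for ps, qs in zip(p, q)) / (2 * math.log(2))
        print(len(centers), "levels;  deltaH ~", dH, "bits/symbol (chi-square approximation)")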

    ps. Figure 3 in Yuriy's paper shows that Huffman is really terrible at quantizing probabilities - we could use much smaller headers for the same distance using a basic quantization for an accurate entropy coder. Much better is the Gilbert-Moore choice of prefix codes (almost touching the curve for accurate quantization) - in which sequences maintain the order of symbols, making them cheaper to encode.

  6. #36
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    I didn't follow along with all of the analysis in that paper, but I think for the most part it doesn't help to solve this problem. For one thing, they seem to be assuming that the set of distributions Q is fixed and it's treated as an input parameter. Given that, the problem they seem to be addressing is how to find the closest element of Q for a given distribution p that's known to infinite precision. (Jarek -- it assumes a uniform distribution on Q, yes, but I think that's not a problem in this analysis, because you are free to choose whatever subset Q you want from the set of distributions of m symbols. So you can choose a subset Q to guarantee a uniform distribution over Q -- i.e., I think they're deliberately ignoring that problem.) Under those assumptions, the optimal header size is trivially lg(|Q|) bits. The optimal header size is not derived but assumed.

    There could be some useful bits you can take from this paper, but I think the main thrust of the analysis is not relevant to this problem.

    Jarek --
    I think you may find that any scheme to give extra precision only to rare symbols ends up being equivalent to simply quantizing to higher precision, because there will be lots of rare symbols that need more precision and far too few common symbols with extra precision to make up the difference.
    Last edited by nburns; 1st March 2014 at 03:23.

  7. #37
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    nburns, the paper's quantization algorithm is a slightly different way of doing what Charles did here (I wonder which way is better?) - a natural approach; the only question is how to handle the issue that trivial quantization is usually not normalized.

    The issues of choosing Q and then probability on Q are separate.
    First we would like to choose Q, e.g. to minimize the maximal distance - so basically it is a problem of covering the probability simplex with balls (Q is the set of their centers) ... the problem is that for the KL distance these balls become deformed - "swelling" toward large probabilities (the center of the simplex).
    Second - having Q and some probability density on the probability simplex (like a Zipf or power law), we can find the probability of using a given q \in Q. Generally, the larger the probability, the less frequently it is used - this probability distribution on Q should not be uniform. So we could use an entropy coder (e.g. prefix codes) to get the rate down from lg(|Q|) to -\sum_q Pr(q) lg(Pr(q)).

    And the problem here is not to give better precision to rare symbols, which is limited e.g. by ~1/L in ANS, but rather to save the unnecessary bits spent on the accuracy of more probable symbols.

  8. #38
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Jarek -- I didn't read the complex parts very closely, and they could well be using clever methods that give insight into other problems. That said, I think it's unlikely that the overall conclusions of the paper are applicable to this situation, because they defined their problem at a very high level of generality.
    Last edited by nburns; 1st March 2014 at 23:05. Reason: brevity

  9. #39
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by thorfdbg View Post
    but the problem here is that I don't quite understand the problem. There are practical solutions (do not use a header in the first place and be adaptive) and there are theoretical solutions (study the R-D problem for the K-L distance). It seems none of them is applicable. What I need to understand is why that is. It does not make much sense to attack an academic problem with non-academic means, or to refuse a practical solution for a practical problem.
    I don't think the problem is well enough defined at this point to actually start work on a solution. I think Jarek's interest in this is inspired by FSE and table ANS in general, but in FSE the header is designed to be fast, and so it is consciously sub-optimal in its arrangement to gain speed. I agree that adaptive is probably inherently better for compression than static, or, at least it has the potential to be. One simple reason is that it has more information available and many more degrees of freedom, since it can regenerate the model on each symbol. To prove that adaptive is strictly no weaker than static seems like an approachable problem. You'd only have to show that adaptive can match what static can achieve, and since both transmit identical amounts of total information in the end, you'd just be parceling out that same information in a different order and at different rates, but the total message probability, and thus bits, should come out the same.

    If I was trying to improve FSE, I'd want to do a lot of experimentation and try to learn as much as I could about how the system works before I chose something to fix. For instance, what is the optimal block size? Smaller blocks may improve compression by allowing it to adapt to changing local conditions. Rather than deal with so many unknowns, you might as well choose one specific use case, such as BWT data and MTF. There's a paper from a few years ago that deals with compressing the BWT in fixed blocks, which would help with choosing parameters and understanding the results. If you learned everything you could about that one use case, you could try to generalize to other use cases. I'd expect that to be more productive than working the other way around, from general to specific, and having more context would help guide you in figuring out what problems to work on...
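    As a toy illustration of the adaptive-vs-static point (entirely my own numbers and assumptions: an iid source, an idealized entropy coder, a naive 16-bits-per-count header for the static case, and a Laplace/add-one estimator for the adaptive case):

        import math, random

        random.seed(1)
        alphabet, probs = "abcd", [0.6, 0.2, 0.15, 0.05]
        n = 4096
        data = random.choices(alphabet, probs, k=n)

        counts = {s: data.count(s) for s in alphabet}

        # static: ideal entropy-coded payload plus, say, 16 bits per transmitted count
        payload = -sum(c * math.log2(c / n) for c in counts.values() if c)
        header = 16 * len(alphabet)
        static_bits = payload + header

        # adaptive: Laplace (add-one) estimator, no header at all
        adaptive_bits, seen = 0.0, {s: 1 for s in alphabet}
        total = len(alphabet)
        for s in data:
            adaptive_bits += -math.log2(seen[s] / total)
            seen[s] += 1
            total += 1

        print(f"static  : {static_bits:.0f} bits (payload {payload:.0f} + header {header})")
        print(f"adaptive: {adaptive_bits:.0f} bits")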
    Last edited by nburns; 1st March 2014 at 23:13.

  10. #40
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Jarek View Post
    And the problem here is not to give better precision to rare symbols, which is limited e.g. by ~1/L in ANS, but rather to save the unnecessary bits spent on the accuracy of more probable symbols.
    Ok, so you'd mainly like to trim wasted bits from the header. Although I thought you also wanted to give an extra bit to singletons to indicate 1/2 as the frequency. Whether you shortened common symbols or lengthened rare symbols, the result would look about the same, wouldn't it?

    I didn't realize it at first, but -- if I understand your proposal correctly -- you won't be able to find many wasted bits in common symbols, because there can't be many common symbols. There can be at most one symbol with p > 1/2, and you wouldn't have to represent its probability explicitly anyway, if you already know the frequencies of all the other symbols and the table size is fixed.

    In the header, most of the uncompressible entropy will go toward rare symbols, at least under a Zipf distribution, because there will be a lot of them.
    f choices that actually make sense. I tried various universal codes for RLE on suffix array data a while back, and Elias gamma always won. The reason was simply that it's the most compact on the first few codes, which dominate the result. Now I simply assume that the extremes will dominate the error, and are all that matter.
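    For reference, a minimal Elias gamma encoder (my own sketch): gamma spends 2*floor(log2(v)) + 1 bits on v >= 1, so it is very compact on the small values that dominate run lengths.

        def elias_gamma(v: int) -> str:
            # unary length prefix followed by the binary representation of v
            assert v >= 1
            binary = bin(v)[2:]                     # e.g. 5 -> "101"
            return "0" * (len(binary) - 1) + binary

        for v in (1, 2, 3, 4, 5, 17):
            code = elias_gamma(v)
            print(v, "->", code, f"({len(code)} bits)")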

    Edit: deleted
    Last edited by nburns; 1st March 2014 at 23:18.

  11. #41
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    Quote Originally Posted by thorfdbg View Post
    Despite that, any type of block coding has the drawback that it cannot decorrelate data across blocks, and it can only adapt to non-stationary sources at block boundaries. It is, for practical applications, a very limited model.
    in rar2, huffman block headers were encoded using len-prevlen (code bit length in the current block MINUS code bit length in the previous block). afair, it improved compression ratio by ~10%

  12. Thanks:

    cade (1st March 2014)

  13. #42
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    819
    Thanks
    253
    Thanked 264 Times in 163 Posts
    nburns, encoding a probability distribution is a general problem, but indeed we should have concrete applications in mind while designing algorithms, like for tANS or rANS (which with the alias method can fit a huge alphabet (up to ~ 2^15) with perfect accuracy into L1 cache).
    I have written ~1/L as the minimal distinguished probability, but indeed for tANS I would distinguish e.g. a singleton in the middle (q~1/L) from a singleton at the end of the table (q~1/L/sqrt(2)) - and generally such quantization can take basic tuning into consideration (changing the starting bias of symbols), which means getting away from f[s]/L quantization.
    The formulation is as I have written, or exactly as in Yuriy's paper - but for data compression it is the KL distance, and we can use the fact that some quantizers are used more often and some much more rarely (e.g. at most once for q>1/2, as you have written) - we can use an entropy coder to compress the final quantizer(s).
    Regarding choosing the quantizer, it is simpler to think from the perspective of the pessimistic case - we want to minimize C = max_p D(p||q(p)). This makes it exactly the problem of covering the simplex with balls: the union of balls with centers in Q and radius C should cover all probabilities (W \subset \bigcup_q B(q,C)). We want to find Q for which C is the smallest (at a given rate). For the KL distance these balls are deformed.
    Looking from the perspective of the probability of a single symbol (a useful approximation), these balls become ranges: all probabilities in (r_i, r_{i+1}) are quantized to c_i = (r_i + r_{i+1})/2. The recurrence I wrote for choosing the succeeding r's was to make them have approximately the same radius:
    C = (r_i - c_i)^2 / c_i = (r_{i+1}-c_i)^2 / c_i is the same for all i
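    Spelling the recurrence out (just algebra on the condition above): with c_i = (r_i + r_{i+1})/2 both sides coincide automatically, and writing d = r_{i+1} - r_i the condition becomes (d/2)^2 = C (r_i + d/2), i.e. d^2 - 2 C d - 4 C r_i = 0, so the boundaries can be generated explicitly as r_{i+1} = r_i + C + sqrt(C^2 + 4 C r_i).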

    Bulat, thanks - I thought it might be a good idea, but a 10% improvement for Huffman is much more than I expected. For an accurate entropy coder it should be even more (or not?).
    The differences should have a simple probability distribution - usually some two-sided exponential distribution.
    ... and the question of how to quantize/compress them well in such headers - these corrections improving adaptive compression - is perfect for this thread ... but it also further complicates the problem, as the accuracy of the corrections should depend on the values they correct - it should be better for small probabilities.
    Part of this issue can be solved by encoding not the difference, but e.g. log(p / previous p).

  14. #43
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    it is 10% of header. but i'm not sure - it was back in 1996. instead of 6..9 values, you are encoding mostly -1..+1

  15. #44
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    166
    Thanks
    39
    Thanked 54 Times in 30 Posts
    Thanks for information about rar. I use a simple similar idea for storing Huffman right now, maybe others might find it useful:

    1) Dense/sparse vector bits for zeroes. Example, 1011 with n = 3 means that the first 3 symbols have lengths specified, next 3 are skipped, next 3 are specified, next 3 are specified.

    2) Encode non-zero as new - old using only positive numbers (2 * abs(value) + sign(value)). There's a minimum bitlength (especially for the first block), such as 3 or 5, so that's the base for this step. I find the max number of bits to encode (bitlength - base), and use that number of bits to write.

    Eg, bitlengths = 5, 8, 8, 8, 10
    bitlengths - base = 0, 3, 3, 3, 5
    So 3 bits per bitlength.

    Probably a waste after switching to log2 bucket symbols + extra normal bits.
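    A rough sketch of how I read this scheme (the zigzag mapping and the fixed-width packing are my interpretation, not the actual code): zigzag the new-minus-old deltas into non-negative values and pack them with just enough bits for the largest one, but never fewer than the base width.

        def zigzag(v: int) -> int:
            # 0, -1, +1, -2, +2, ... -> 0, 1, 2, 3, 4, ...  (a variant of the 2*abs + sign trick)
            return 2 * v if v >= 0 else -2 * v - 1

        def pack_lengths(new, old, base_bits=3):
            # returns (bits per value, zigzagged deltas); zero-length symbols would be
            # skipped beforehand via the dense/sparse vector bits
            deltas = [zigzag(n - o) for n, o in zip(new, old)]
            width = max(base_bits, max(deltas).bit_length())
            return width, deltas

        old = [5, 8, 8, 8, 10]
        new = [5, 7, 9, 8, 12]
        width, vals = pack_lengths(new, old)
        print(f"{width} bits each -> {vals}  (~{width * len(vals)} bits + a small width field)")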

  16. #45
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    I believe the correct problem to solve is to minimize the total size of encoded probabilities + entropy coded data.

    From my practical studies I find that the ideal block size for things like Huffman or ANS is around 8k-32k symbols.

    The size of the encoded probabilities *is* very important. It lets you have more entropy sets and lets you reset statistics more often.

    With ANS you have a large number of degrees of freedom. One option is to vary M (the sum of normalized counts; M=L typically in tANS) and send the counts scaled to M precisely. Obviously on smaller blocks, a smaller M will be better. Furthermore, for any fixed M you could also send approximations of the counts (using a log-ish scheme or other).

    One problem I have with any log-ish scheme is that the decoder then has to correct the sum to M, which adds further loss that would have to be accounted for. It's hard to imagine that the savings in header size would make up for those penalties except on very small blocks. It also adds time to the decoder, which is undesirable, we'd like to put all the slow work on the encode side.

    In the real world, exact analysis of this will be impossible because the scheme to encode the probabilities will be some complex heuristic that can't be measured except by running it.

    In an artificial setting, you could try to solve just the simple problem of picking the optimal M, assuming that the counts are sent using log2(M) bits each. It seems to me that even that is impossible to solve. The function total_length(M) is not well behaved: it's not smooth and doesn't have simple minima, so you have to just try all M.
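    A toy version of that last experiment (my own sketch of the simplified setting: ideal entropy-coded payload, counts sent as ceil(log2(M)) bits each, crude rounding when scaling to M):

        import math
        from collections import Counter

        def total_length(data, M):
            n, counts = len(data), Counter(data)
            # scale exact counts to integers summing to roughly M (crude: round, keep >= 1)
            scaled = {s: max(1, round(c * M / n)) for s, c in counts.items()}
            T = sum(scaled.values())
            header = len(scaled) * math.ceil(math.log2(M))
            payload = -sum(c * math.log2(scaled[s] / T) for s, c in counts.items())
            return header + payload

        data = "abracadabra alakazam " * 400
        for M in (64, 256, 1024, 4096, 16384):
            print(f"M = {M:5d}   total ~ {total_length(data, M):.0f} bits")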
    Last edited by cbloom; 3rd August 2016 at 21:42.

  17. Thanks (3):

    Cyan (1st March 2014),encode (1st March 2014),Jarek (1st March 2014)

  18. #46
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    Quote Originally Posted by thorfdbg View Post
    Adaptivity has no problems with speed. Rather the opposite: if you have to go over the data twice to collect information on the relative frequencies to build up a model for the data, you will be slow. Adaptive methods are one-pass. JPEG LS is certainly not slow, to give one example, and its adaptive Golomb coder does not require a header. The MQ coder is certainly not slow, but adaptive. Adaptive Huffman is slow, and computing a Huffman table from measured frequencies is also needlessly slow.
    Your ideas about speed are very strange. In the real world, JPEG-LS is ridiculously slow. Two pass Huffman is extremely fast.
    Last edited by cbloom; 3rd August 2016 at 21:42.

  19. #47
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    Also a general note about the issue of sending counts -

    The best algorithms for sending Huffman codelens use domain-specific heuristics. That is, they are designed to compress typical codelens seen on real world data well. They use heuristic prior knowledge, like codelens tend to occur in blocks, etc. The optimal codelen packer for post-blocksort data would not be the same as the optimal codelen packer for literals, or LZ matches, etc.

    Any heuristic on the codelens like RLE or Elias-Gamma or delta from previous is implicitly based on a prior model of what kind of codelens you expect.

    In my RAD Huffman code I try four different codelen schemes and pick the best one.
    Last edited by cbloom; 3rd August 2016 at 21:42.

  20. #48
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by cbloom View Post
    Also a general note about the issue of sending counts -

    The best algorithms for sending Huffman codelens use domain-specific heuristics. That is, they are designed to compress typical codelens seen on real world data well. They use heuristic prior knowledge, like codelens tend to occur in blocks, etc. The optimal codelen packer for post-blocksort data would not be the same as the optimal codelen packer for literals, or LZ matches, etc.

    Any heuristic on the codelens like RLE or Elias-Gamma or delta from previous is implicitly based on a prior model of what kind of codelens you expect.

    In my RAD Huffman code I try four different codelen schemes and pick the best one.
    So there would potentially be a pre-header for the header itself? It makes sense: each header fills in details using the model from the last header, bootstrapping the information up from zero. The pre-header becomes so general that it might as well be hardcoded, though. Maybe three's a crowd, as with Elias gamma - delta - omega. Omega seems to have no practical application.

  21. #49
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    nburns, even deflate has a "preheader" - the tree of bit lengths for the lit/dist trees

  22. #50
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Jarek View Post
    nburns, encoding a probability distribution is a general problem, but indeed we should have concrete applications in mind while designing algorithms, like for tANS or rANS (which with the alias method can fit a huge alphabet (up to ~ 2^15) with perfect accuracy into L1 cache).
    I have written ~1/L as the minimal distinguished probability, but indeed for tANS I would distinguish e.g. a singleton in the middle (q~1/L) from a singleton at the end of the table (q~1/L/sqrt(2)) - and generally such quantization can take basic tuning into consideration (changing the starting bias of symbols), which means getting away from f[s]/L quantization.
    The formulation is as I have written, or exactly as in Yuriy's paper - but for data compression it is the KL distance, and we can use the fact that some quantizers are used more often and some much more rarely (e.g. at most once for q>1/2, as you have written) - we can use an entropy coder to compress the final quantizer(s).
    The Zipf distribution is defined by the relative frequency of its quantizers, isn't it?

    I recently realized that the top state is, on average, about a factor of sqrt(2) smaller in probability than expected. You seem to have got there first. The effective probability or code length also depends on which larger states are accessible from a state; e.g., if only singletons appear after the halfway point, it becomes no different from the top state.

  23. #51
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    nburns, even deflate has "preheader" - it's tree of bit lengths in lit/dist trees
    It looks like in all cases each header layer encodes statistics about the logarithms of its successor's values (let's consider bit- and byte-lengths as logarithms). Only a few such logarithmic shifts can be useful, independent of context.
    Last edited by nburns; 2nd March 2014 at 08:56.

  24. #52
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    Quote Originally Posted by cbloom View Post
    Your ideas about speed are very strange. In the real world, JPEG-LS is ridiculously slow. Two pass Huffman is extremely fast.
    Your ideas about speed are very strange indeed. You compare a full image compression pipeline with an entropy coder. No wonder the result is pretty weird.

    Besides, I don't think LS is slow. Implementations may be slow - you seem to judge from a software implementation, but that's completely the wrong start. LS was built for simple and fast hardware. Consider what you have to do for a two-pass Huffman in hardware: store a full picture (lots of gates for RAM, or no online compression), go through all the data twice (which breaks any kind of cache locality), store the code, then do everything again from the start. Sorry, but this is a quite stupid design for any type of realistic implementation *except pure software*.

  25. #53
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    Quote Originally Posted by nburns View Post
    The Zipf distribution is defined by the relative frequency of its quantizers, isn't it?
    Zipf's law is a rather general statement about distributions of "natural" sources - long tailed probability distributions all the way. One way to arrive at it is to assume distributions that are invariant under scaling.

  26. #54
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    Quote Originally Posted by nburns View Post
    To prove that adaptive is strictly no weaker than static seems like an approachable problem.
    But this is the easy part. Any type of non-adaptive scheme breaks the source into one or multiple blocks. Breaking it up into one block is obviously non-ideal for non-stationary sources (sources that change their statistics "slow enough"). Breaking up the source into blocks is non-ideal because it ignores the co-information between the blocks. In fact, it is an easy exercise to show that the ideal entropy rate of a block-coder is larger than the ideal rate of a context-encoder without blocking by exactly the amount of co-information between blocks.
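    (Concretely, for two blocks X_1, X_2 this is just the chain rule: H(X_1) + H(X_2) = H(X_1, X_2) + I(X_1; X_2), so coding the blocks independently costs exactly the mutual information - the co-information - between them on top of the joint entropy.)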

    This being said, in practical applications it of course depends on how well the adaptation scheme works, which is again source dependent.

  27. #55
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    One should read

    Analysis of Arithmetic Coding for Data Compression (1992)

    http://www.ittc.ku.edu/~jsv/Papers/H...chOfficial.pdf


    This is one of the greatest papers on arithmetic compression ever written.

    It talks about the file length of an arithmetically compressed file with a header versus that of an adaptively arithmetically compressed file. Hint: the header is a waste of time and space.

    When one looks at the real difference between Huffman and arithmetic compression, it boils down to the fact that Huffman cannot do as well as arithmetic, since it is only a small subset of arithmetic, so it is limited in how well it can work.

    A long file and any permutation of it will, with proper arithmetic coding, compress to one of two possible fixed lengths that differ by at most one byte.

    A long file and any permutation of it will, with a proper Huffman coder, compress to many different lengths - sometimes shorter and sometimes longer than what the equivalent arithmetic compressor would produce. However, the arithmetic compressor's average length, taken over all permutations of the file, will be shorter than the Huffman average.

    The problem with headers is that most want to use either a count or a length per symbol in the header; neither of these leads to optimal headers for arithmetic or Huffman.

  28. #56
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    Quote Originally Posted by biject.bwts View Post
    When one looks at the real difference between Huffman and arithmetic compression, it boils down to the fact that Huffman cannot do as well as arithmetic, since it is only a small subset of arithmetic, so it is limited in how well it can work.
    This is actually pretty much how I teach Huffman: Start with arithmetic, then check the easy special cases when you don't have the carry-over problem, then say that this is just Huffman.

  29. #57
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by thorfdbg View Post
    This is actually pretty much how I teach Huffman: Start with arithmetic, then check the easy special cases when you don't have the carry-over problem, then say that this is just Huffman.
    I never had the chance to teach compression in school. Got to teach baby math to community college students until I was replaced with teachers who had the proper race and social views for the coming socialist states of America, where our future leaders will know even less about any proper use of logic or science. Like at the EPA today - they are destroying America and have no clue about reality.

    But I have written static bijective Huffman compressors using the standard arithmetic high/low state machinery, where after each symbol the high and low return to the largest range possible. In other words, the fixed weights were nice powers of 2. When you change any of the weights, the output is that of the static bijective arithmetic coder; but when the static weights are nice powers of 2, you could have used a simple bijective static Huffman compressor to get the exact same output file.

    Do you get to teach at a college that actually teaches math and science? If so, you're lucky, because that is quickly becoming a thing of the past. Where I live they pretend to teach and the students pretend to learn, as I was told by some other math teachers when I started teaching math at the college level. The country is in such bad shape that we had better stay friends with Russia and China, since there will be too few real engineers left in the country to even maintain the infrastructure we currently enjoy.

  30. #58
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    Quote Originally Posted by biject.bwts View Post
    I never had the chance to teach compression in school.
    This is a university level course (for master's, mostly) in image compression. There is a bit of source coding, but also image transforms, color spaces, rate-distortion theory...

    Quote Originally Posted by biject.bwts View Post
    Got to teach baby math to community college students until I was replaced with teachers who had the proper race and social views for the coming socialist states of America, where our future leaders will know even less about any proper use of logic or science. Like at the EPA today - they are destroying America and have no clue about reality.
    This is getting OT, but I am quite irritated on my travels that many Americans misunderstand what "socialism" is. I'm from Berlin; I had it next door. Or next wall, if you prefer. Social engagement and social responsibility, and a social system organized by the state such as health care and free education, are *not* socialism. That is a general responsibility taken by almost all states of the world, excluding the US. A socially responsible system is what we have, and had. Socialism next door was a dictatorship under a different name that shot its own people when they tried to express their thoughts or tried to travel.

    That is *quite* a difference. Socialism, by the definition of the word, is a "market system controlled by the state". That also existed next door, but apparently this did not work - it required a dictatorship. A completely different thing.

    Quote Originally Posted by biject.bwts View Post
    Do you get to teach at a college that actually teaches math and science? If so, you're lucky, because that is quickly becoming a thing of the past.
    While I'm usually in Stuttgart, this is a course in Berlin. Yes, they offer math and science, too. It is not a college. As a mathematician, I'm of course sometimes bewildered by the low level of general knowledge in math, but that is probably not so exceptional. After all, this is why the students come to the university: to learn.

    Quote Originally Posted by biject.bwts View Post
    Where I live they pretend to teach and the students pretend to learn, as I was told by some other math teachers when I started teaching math at the college level. The country is in such bad shape that we had better stay friends with Russia and China, since there will be too few real engineers left in the country to even maintain the infrastructure we currently enjoy.
    Part of the problem here in Germany is that neither the state nor the industry is investing sufficient money into the university or school system. Universities are free, which I consider a good idea - accessibility to education for everyone - but somehow it needs to be funded. OT here, thus EOT.

  31. #59
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by thorfdbg View Post
    Breaking up the source into blocks is non-ideal because it ignores the co-information between the blocks. In fact, it is an easy exercise to show that the ideal entropy rate of a block-coder is larger than the ideal rate of a context-encoder without blocking by exactly the amount of co-information between blocks.
    I don't think that matters here, because there's no context between symbols.

    People are a little bit unkind to Huffman. It solves a different problem than AC, and it's the optimal solution for that problem. The problem it solves can be useful in other contexts.
    Last edited by nburns; 6th March 2014 at 03:59.

  32. #60
    Member
    Join Date
    Apr 2012
    Location
    Stuttgart
    Posts
    448
    Thanks
    1
    Thanked 101 Times in 61 Posts
    Quote Originally Posted by nburns View Post
    I don't think that matters here, because there's no context between symbols.
    Why? I mean, whether there is a context or not is entirely a matter of the model you use, and models *with* context are simply better than models without them. IOW, it is your choice to model a source as iid or higher order Markov.

    Quote Originally Posted by nburns View Post
    People are a little bit unkind to Huffman. It solves a different problem than AC, and it's the optimal solution for that problem. The problem it solves can be useful in other contexts.
    Does it? Sorry, I don't understand. The *only* thing you get "on top" is that you have a unique mapping between symbols and bit-sequences, something that does not hold for AC, but you pay a price for it.
