
Thread: Context Mixing

  1. #1
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts

    Context Mixing

    This is a question about your point of view on the best way to "mix" correlation information.

    I was considering the idea of evaluating the probability of a token according to different kinds of context information. I guess this is nothing new for many of you.

    I will take an example to make it clearer.
    Let's say I'm trying to guess the 6th bit of position P, which we'll call P6.
    Let's imagine I have two pieces of correlation information, which are completely different:
    1) According to the first correlation, which is based on the previous byte's value, (P6==1) = 80%
    2) According to a second correlation, based on the bit position within the byte, (P6==1) = 65%

    Now, let's say we've got both of these estimates, from two different correlation rules.
    How should we mix them? Is that even possible?

    I don't expect that blindly "taking the average" would prove very useful, much like mixing too many colors always ends up with brown.

    My first impression was "there must be a mathematical formula for this", potentially using more information not present in this example, such as a "reliability counter", or whatever.
    Then I simply wondered: is it even necessary to merge anything? Could the best (most squeezed) probability simply be the better one to select?

    Note: I'm deliberately not considering "combining" both rules into many smaller and more specialized buckets. This would obviously work well in this example, since it is very small, but the more combinations there are, the larger the number of buckets, eventually hitting memory limits. Even more importantly, each bucket's sample would eventually become too sparse.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    > Now, let's say we've got both of these estimates, from two different correlation rules.
    > How should we mix them? Is that even possible?

    Yes, there are two common mixing methods now - linear and logistic.
    Both are based on probability theory, but they make different assumptions about the data:
    linear mixing implies that old data doesn't matter, while logistic mixing
    takes the context history into account (and thus usually shows better results,
    but is harder to compute).

    > I don't expect that blindly "taking the average" would prove very useful,
    > much like mixing too many colors always ends up with brown.

    Surprisingly, it would... that's what linear mixing is, anyway.
    But it's not just a random average - it's a probability estimate that happens to look like one.
    I mean, suppose you have text mixed with numbers, but don't really
    know how to predict what appears next. And you have models for text
    and numbers, which can give you bit value predictions at any given point,
    but the text model is better for text etc.
    So we have N bits of data, and N1 of these are better compressed using
    p1 (=prediction of text model) and N2=N-N1 with p2.
    Sure, you could encode the model selection, but it would be redundant
    (i.e. you'd have to store information unrelated to the actual data).
    Still, how would we encode the model selection?
    Obviously, using the N1,N2 statistics, e.g. with a probability of w=(N1+1)/(N+2)
    that it is the first model.
    But then the result would be like this:
    model1 (chosen with probability w):   p(0) = p1,  p(1) = 1-p1
    model2 (chosen with probability 1-w): p(0) = p2,  p(1) = 1-p2
    or, if we look at codelengths:
    model1: 0: -log2(w*p1),     1: -log2(w*(1-p1))
    model2: 0: -log2((1-w)*p2), 1: -log2((1-w)*(1-p2))
    But then, we don't really care which model it is, as long as we can decode the bit,
    so we can discard the model selection information and merge the corresponding intervals:
    0: -log2(w*p1 + (1-w)*p2),  1: -log2(w*(1-p1) + (1-w)*(1-p2))
    And that's basically all there is to linear mixing... there are also various methods
    of weight update (i.e. estimating the probability of each model being optimal for the next bit),
    but it's fairly easy to invent 10 new ones - there's no perfect one, because there are
    no perfect models in general.

    Still, even static linear mixing can be unexpectedly helpful.
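
    For illustration, here is a minimal sketch of that in code (hypothetical names; floating-point estimates of p(bit==1) for clarity, where a real coder would use fixed-point):
    Code:
    #include <cstdio>

    // Linear mixing sketch: p1, p2 are each model's estimate of p(bit==1),
    // w is the weight of model 1, estimated from how often each model
    // was the better predictor so far (the N1/N counting described above).
    struct LinearMixer {
      double n1 = 1, n2 = 1;            // +1/+2 style init, as in w=(N1+1)/(N+2)

      double mix(double p1, double p2) const {
        double w = n1 / (n1 + n2);      // weight of model 1
        return w * p1 + (1 - w) * p2;   // merged interval: p(1) = w*p1 + (1-w)*p2
      }

      // after coding a bit, credit the model that gave it more probability
      void update(int bit, double p1, double p2) {
        double q1 = bit ? p1 : 1 - p1;
        double q2 = bit ? p2 : 1 - p2;
        (q1 >= q2 ? n1 : n2) += 1;
      }
    };

    int main() {
      LinearMixer m;
      printf("%.3f\n", m.mix(0.80, 0.65));  // 0.725 with equal initial weights
      m.update(1, 0.80, 0.65);              // adapt after seeing the actual bit
    }
    With equal weights this is exactly the plain average; the weights only drift away from 1/2 once one model starts winning more often.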

    > My first impression was "there must be a mathematical formula for
    > this", potentially using more information not present in this
    > example, such as a "reliability counter", or whatever.

    Sure, there is.

    > Then I simply wondered: is it even necessary to merge anything?
    > Could the best (most squeezed) probability simply be the better one
    > to select?

    Yes - if you can predict precisely enough which is the best prediction.
    Otherwise it would be worse than mixing, and coding your choice would
    be even worse.

    > Note: I'm deliberately not considering "combining" both rules
    > into many smaller and more specialized buckets.

    There's actually another approach (aside from mixing) called
    "secondary estimation" or "probabily mapping".
    You can just collect bit statistics for your pairs of {p1;p2}
    and use them for actual coding.
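
    As a rough illustration of that idea (the table size, quantization and names here are my own assumptions, not taken from any particular coder):
    Code:
    #include <cstdint>

    // Secondary estimation sketch: quantize the pair {p1;p2} into a small
    // 2D table of bit counters and code with the table's own statistics.
    const int Q = 16;                    // buckets per input probability

    struct Sse2D {
      uint32_t n[Q][Q][2];               // n[i][j][b] = count of bit b in cell (i,j)
      Sse2D() { for (auto &row : n) for (auto &cell : row) cell[0] = cell[1] = 1; }

      static int idx(double p) {         // map p in [0,1) to a bucket index
        int i = int(p * Q);
        return i < 0 ? 0 : (i >= Q ? Q - 1 : i);
      }

      double p(double p1, double p2) const {        // p(bit==1) used for coding
        const uint32_t *c = n[idx(p1)][idx(p2)];
        return double(c[1]) / (c[0] + c[1]);
      }

      void update(double p1, double p2, int bit) {  // collect the bit statistics
        n[idx(p1)][idx(p2)][bit]++;
      }
    };
    Real implementations usually also interpolate between neighbouring cells, as in the interpolated SSE mentioned further down the thread.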

  3. #3
    Member
    Join Date
    Oct 2010
    Location
    Germany
    Posts
    287
    Thanks
    9
    Thanked 33 Times in 21 Posts
    Quote Originally Posted by Shelwien View Post
    linear mixing implies that old data doesn't matter, while logistic mixing
    takes the context history into account
    logistic mixing just transforms probabilities from [0,1] to [-inf,+inf]

  4. #4
    Member
    Join Date
    Apr 2010
    Location
    El Salvador
    Posts
    43
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Cyan View Post
    My first impression was "there must be a mathematical formula for this", potentially using more information not present in this example, such as a "reliability counter", or whatever.
    You don't need a reliability counter. Think about which of the two statistical distributions you would trust more: the more skewed one, or the more balanced one? Remember how the statistics are built: the counters build up skew only slowly (well, it depends on your counter-update scheme), so present skew means a _lot_ of bias has occurred. You should trust a _lot_ of bias (if you know you have a biased coin, are you going to win with the balanced coin or the biased one?). So much for a non-math argument on how to do it.
    Actually, the argument stems from minimax theory. Charles did the first PPM-SSE based on that argument with log-scaling over three probabilities (like your two) back then.
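
    If I read that right, the non-math version of the heuristic amounts to something like this (my own sketch, not anyone's actual code):
    Code:
    #include <cmath>

    // "Trust the more skewed estimate": pick whichever of p1, p2
    // (both estimates of p(bit==1)) lies further from 0.5.
    double pick_more_skewed(double p1, double p2) {
      return std::fabs(p1 - 0.5) >= std::fabs(p2 - 0.5) ? p1 : p2;
    }
    // With the example from the first post this selects 0.80 over 0.65.
    (Shelwien's reply below explains why pure selection like this is not optimal in terms of codelength.)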

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    > logistic mixing just transforms probabilities from [0,1] to [-inf,+inf]

    Actual implementations map [0..SCALE] to [-SCALE/2..SCALE/2].
    Also, it's really not the point at all.

    > You don't need a reliability counter.

    It's a matter of terminology.
    We can use something like this:
    Code:
    // probability estimates (out of SCALE) that each model gave
    // to the bit value that actually occurred
    q1 = bit ? p1 : SCALE-p1;
    q2 = bit ? p2 : SCALE-p2;
    // credit the model that was closer to the truth
    if( q1>q2 ) w1++; else w2++;
    // w = weight of model 1 in the mix p = w*p1 + (1-w)*p2
    w = 1.0*w1/(w1+w2);  // 1.0* avoids integer division if w1,w2 are integer counters
    And it would be a valid update function for both a linear and a logistic mixer,
    i.e. it would improve compression over static mixing.
    Sure, it's not the best one, but you can call w1,w2 "reliability counters" if you like.

    > Think about which of the two statistical distributions you would
    > trust more: the more skewed one, or the more balanced one? Remember
    > how the statistics are built: the counters build up skew only slowly
    > (well, it depends on your counter-update scheme), so present skew
    > means a _lot_ of bias has occurred.

    That's only right with very skewed symbol distributions, like in
    high-order contexts, and counters initialized with 0.5.
    But it doesn't apply to occurrence counters (i.e. p=n0/(n0+n1)),
    or to schemes like unary coding, or to data types different
    from high-order context histories.

    Also, always predicting bit=0 when you have p1=1 (i.e. nearly zero on the SCALE)
    and p2=SCALE/2 is not optimal in either case (I mean always using the most
    skewed estimate, as you suggested).

    > You should trust a _lot_ of bias (if you know you have a biased
    > coin, are you going to win with the balanced coin or the biased one?).

    You're right, but it doesn't apply to compression at all.
    Here you have to consider codelength rather than the plain number of correct guesses.

    > Charles did the first PPM-SSE based on that argument with
    > log-scaling over three probabilities (like your two) back then.

    Logistic mixing is unrelated to log scaling.
    Also it was only SEE, SSE was invented by another person.

  6. #6
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Thanks for your comments,
    and the very detailed explanations.

    At this point, it seems the best way to improve my understanding is to test with real examples.
    Quite some work ahead...

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    1. http://nishi.dreamhosters.com/u/o01_v1.rar
    \mixtest_v2 -- linear mix (both static and adaptive)
    \mixtest_v3 -- logistic mix
    \o01_src -- interpolated SSE2 for mixing

    2. http://nishi.dreamhosters.com/u/bsc240_qlfc_v2.rar
    It's an entropy coder from a real compressor (bsc 2.4.0),
    with speed optimizations removed for readability.
    In particular, see bsc_mixer.inc - it implements a base submodel, which is applied for all kinds of contexts.
    Logistic mixing, interpolated SSE and static mixing are used there.

    3. Also these can be used for your experiments - they're based
    on a really simple implementation:

    bitwise o0 rangecoder demo / rc version comparison
    http://nishi.dreamhosters.com/u/rc_v0.rar

    static linear mix of o0 and o1 linear counters
    http://nishi.dreamhosters.com/u/rc_v3.rar

    linear mixing demo with BFA0-3 updates:
    http://nishi.dreamhosters.com/u/rc_v4.rar
    (see http://encode.su/threads/1159-Updati...mixer-with-BFA)

  8. #8
    Member
    Join Date
    Apr 2010
    Location
    El Salvador
    Posts
    43
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Shelwien View Post
    Also it was only SEE, SSE was invented by another person.
    Yes, you're right - the acronyms are sometimes confusing.

  9. #9
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Actually there is no basis in probability theory for combining predictions. If you have two models predict that the next bit will be 1 with, say, p1 = .65 and p2 = .8 in separate contexts, it would not be hard to construct scenarios where the next bit is always 0 when both contexts are present.

    The basis for combining predictions is algorithmic information theory. You ask "what is the simplest (shortest) explanation for one model predicting .65 and the other predicting .8?" and use that explanation to get p. Or, more precisely, you consider all possible explanations M and combine them by weighted averaging with weights 2^-|M|, where |M| is the length of the explanation in bits. (This is similar to using just the shortest explanation, because its weight dominates the others.) However, this is not computable in general, and it is highly dependent on the choice of language used to write M.

    But we can say something. In most languages, expressing probabilities near 0 or 1 takes more bits than probabilities near 1/2. In fact, it takes -log(p) bits when p is near 0 or -log(1-p) bits when p is near 1. Let |p| denote the number of bits needed to write p. Then when you combine probabilities, you have |p| = (|p1| + |p2|)/2, which tends to favor probabilities near 0 or 1. For example if p1 = .5, p2 = .9999999999, then p = .99999 would have the right number of bits.

    Both logistic mixing (PAQ8, ZPAQ) and linear mixing (PAQ6) have this form, although logistic mixing is more precise.

    PAQ6 mixing is linear in bit counts, not probabilities. Although you will get compression improvement using

    p = (p1 + p2)/2 = .725

    That is not really what PAQ6 does. It represents probabilities as ratios of 0's to 1's. For example, p1 = .65 might be a ratio like (2 : 4) and p2 = .8 might be (2 : 8). Then you would add the counts to get a ratio like ((2+2) : (4+8)) = (4 : 12) = .75. Probabilities near 0 or 1 have one large count and one small count, so they get greater weight than probabilities near 1/2, which have two small counts. The state table in PAQ6 maintains the counts this way by discarding half of the excess count over 2 of the opposite bit. For example, a 0 bit in state (1 : 10) would go to (2 : 6) instead of (2 : 10).
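
    A toy version of that count combination and the halving update might look like this (my own simplification of the description above, not actual PAQ6 code):
    Code:
    #include <cstdio>

    struct Counts { double n0, n1; };              // 0/1 counts; p(1) = n1/(n0+n1)

    // combine two models by adding their counts, as described above
    double combine(Counts a, Counts b) {
      return (a.n1 + b.n1) / (a.n0 + a.n1 + b.n0 + b.n1);
    }

    // bump the observed bit's count, then discard half of the opposite
    // count's excess over 2
    void update(Counts &c, int bit) {
      double &hit  = bit ? c.n1 : c.n0;
      double &miss = bit ? c.n0 : c.n1;
      hit += 1;
      if (miss > 2) miss = 2 + (miss - 2) / 2;
    }

    int main() {
      Counts a{2, 4}, b{2, 8};                     // the (2 : 4) and (2 : 8) ratios above
      printf("combined p = %.2f\n", combine(a, b));   // (4 : 12) -> 0.75
      Counts s{1, 10};
      update(s, 0);                                // (1 : 10) plus a 0 bit -> (2 : 6)
      printf("(%g : %g)\n", s.n0, s.n1);
    }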

    Discarding counts also makes the model adaptive, which may be good or bad depending on the data. The problem with PAQ6 is that the state table has to compromise between the right amount of adaptation (keeping counts small or large) and mixing (should (2 : 8) get twice the weight of (1 : 4)?). Logistic mixing separates the two so both can be tuned separately. The formula is:

    p = squash((stretch(p1) + stretch(p2)) / 2)

    where

    stretch(x) = ln(x/(1-x))

    squash(x) = 1/(1 + e^-x) (the inverse of stretch).

    For example

    stretch(p1) = ln(.65/.35) = .619
    stretch(p2) = ln(.8/.2) = 1.386
    (.619 + 1.386) / 2 = 1.0025
    squash(1.0025) = .732

    Again this is closer to .8 than .65, but not as much as .75. But this is the correct way to do it. PAQ6 weights the larger probability too much. For example, it would combine .5 with .9999999999 to get .9999999998 rather than .99999.
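
    For reference, the same worked example as a minimal code sketch (equal weights; the names are just for illustration):
    Code:
    #include <cmath>
    #include <cstdio>

    // logistic mixing as described above: average in the stretched domain
    double stretch(double p) { return std::log(p / (1 - p)); }
    double squash(double x)  { return 1 / (1 + std::exp(-x)); }

    // equal-weight logistic mix of two probabilities; an adaptive mixer
    // would learn per-model weights instead of the fixed 1/2
    double mix2(double p1, double p2) {
      return squash((stretch(p1) + stretch(p2)) / 2);
    }

    int main() {
      printf("%.3f\n", mix2(0.65, 0.8));          // ~0.732, as computed above
      printf("%.5f\n", mix2(0.5, 0.9999999999));  // ~0.99999, not 0.9999999998
    }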

  10. #10
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Thanks very much for the detailed explanation.
    It makes for very pleasant reading.

    All in all, it seems we always try to lean more towards the more squeezed (skewed) probability, albeit in a different manner depending on the selected mixing formula.
    I very much like the bit-weight average formula.

