# Thread: Context Mixing

1. This is a question about your point of view on the best way to "mix" correlation information.

I was considering the idea of evaluating the probability of a token according to several different pieces of context information. I guess this is nothing new for many of you.

I will take an example to make it clearer.
Let's say I'm trying to guess the 6th bit of position P, which we'll call P6.
Let's imagine I have two pieces of correlation information, which are completely different:
1) According to the first correlation, which is based on the previous byte value, P(P6==1) = 80%
2) According to a second correlation, based on bit position within the byte, P(P6==1) = 65%

Now, let's say we've got both of these estimations, from two different correlation rules.
How should we mix them? Is that even possible?

I don't expect that blindly "taking the average" would prove useful. Like mixing too many colors always ends up with brown.

My first impression was "there must be a mathematical formula for this", potentially using more information not present in this example, such as a "reliability counter", or whatever.
Then I simply wondered: is it even necessary to merge anything? Could the best (most squeezed) probability simply be the better one to select?

Note: I'm deliberately not considering "combining" both rules, creating many smaller and more specialized buckets. This would obviously work well in this example, since it is very small, but the more combinations, the larger the number of buckets, eventually reaching the memory limit. Even more importantly, the sample per bucket would eventually become too sparse.

2. > Now, let's say we've got both these estimations, from 2 different correlation rules.
> How should we mix them? Is that even possible?

Yes, there are two common mixing methods nowadays - linear and logistic.
Both are based on probability theory, but they make different assumptions about the data:
linear mixing implies that old data doesn't matter, while logistic mixing
takes the context history into account (and thus usually shows better results,
but is harder to compute).

> I don't expect that blindly "taking the average" would prove useful.
> Like mixing too many colors always ends up with brown.

Surprisingly, it would... that's what linear mixing is, anyway.
But it's not just a random average - it's a probability estimation that happens to look like one.
I mean, suppose you have text mixed with numbers, but don't really
know how to predict what appears next. And you have models for text
and numbers, which can give you bit value predictions at any given point,
but the text model is better for text etc.
So we have N bits of data, and N1 of these are better compressed using
p1 (=prediction of text model) and N2=N-N1 with p2.
Sure, you could encode the model selection explicitly, but it would be redundant
(ie you'd have to store information unrelated to the actual data).
Still, how would we encode the model selection?
Obviously, using the N1,N2 statistics, eg with a probability of w=(N1+1)/(N+2)
that it would be the first model.
But then the result would be like this:
model1, selected with probability w: p(0)=p1, p(1)=1-p1
model2, selected with probability 1-w: p(0)=p2, p(1)=1-p2
or, if we look at codelengths:
model1: 0: -log2(w*p1), 1: -log2(w*(1-p1))
model2: 0: -log2((1-w)*p2), 1: -log2((1-w)*(1-p2))
but then we don't really care which model it was, as long as we can decode the bit,
so we can discard the model selection information and merge the corresponding intervals:
0: -log2(w*p1+(1-w)*p2), 1: -log2(w*(1-p1)+(1-w)*(1-p2))
And this is basically all there is to linear mixing... there are also various methods
of weight update (ie estimating the probability that each model is optimal for the next bit),
but it's fairly easy to invent 10 new ones - there's no perfect one, because there are
no perfect models in general.

Still, even static linear mixing can be unexpectedly helpful.
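To make the above concrete, here is a minimal sketch of a static linear mix in Python (an illustration of the formula derived above, not code from any actual compressor; the names and numbers are made up):

```python
def linear_mix(p1, p2, w):
    """Static linear mix: estimated probability of bit==0,
    given each model's p(0) and the weight w of model 1."""
    return w * p1 + (1 - w) * p2

# Suppose model 1 compressed 6 of the last 8 bits better, so its
# weight is estimated as w = (N1+1)/(N+2), as in the post above.
N, N1 = 8, 6
w = (N1 + 1) / (N + 2)            # 0.7
p0 = linear_mix(0.9, 0.5, w)      # 0.7*0.9 + 0.3*0.5 = 0.78
```

An adaptive mixer would then re-estimate w after every coded bit instead of keeping it fixed.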

> My first impression was "there must be a mathematical formula for
> this", potentially using more information not present in this
> example, such as a "reliability counter", or whatever.

Sure, there is.

> Then I simply wondered: is it even necessary to merge anything?
> Could the best (most squeezed) probability simply be the better one
> to select?

Yes - if you can predict precisely enough which is the best prediction.
Otherwise it would be worse than mixing, and coding your choice would
be even worse.

> Note: I'm deliberately not considering "combining" both rules,
> creating many smaller and more specialized buckets.

There's actually another approach (aside from mixing) called
"secondary estimation" or "probability mapping".
You can just collect bit statistics for your pairs of {p1;p2}
and use them for actual coding.
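A minimal sketch of that idea in Python (the bucket count and the Laplace smoothing here are my assumptions, purely for illustration):

```python
BUCKETS = 16  # quantization resolution for each input probability (assumed)

class SSE2D:
    """Secondary estimation: map a (p1, p2) pair to a table cell and
    use the bit statistics collected there as the coding probability."""
    def __init__(self):
        # counts[i][j] = [n0, n1] for the quantized (p1, p2) pair
        self.counts = [[[0, 0] for _ in range(BUCKETS)]
                       for _ in range(BUCKETS)]

    def _cell(self, p1, p2):
        i = min(int(p1 * BUCKETS), BUCKETS - 1)
        j = min(int(p2 * BUCKETS), BUCKETS - 1)
        return self.counts[i][j]

    def p(self, p1, p2):
        n0, n1 = self._cell(p1, p2)
        return (n1 + 1) / (n0 + n1 + 2)   # Laplace-smoothed p(bit==1)

    def update(self, p1, p2, bit):
        self._cell(p1, p2)[bit] += 1
```

Real implementations usually also interpolate between adjacent cells, like the "interpolated SSE" mentioned later in the thread.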

3. Originally Posted by Shelwien:
> ie linear mixing implies that old data doesn't matter, and logistic mixing
> takes into account the context history

Logistic mixing just transforms probabilities from [0,1] to [-inf,+inf].

4. Originally Posted by Cyan:
> My first impression was "there must be a mathematical formula for this", potentially using more information not present in this example, such as a "reliability counter", or whatever.

You don't need a reliability counter. Think about which of the two statistical distributions you would trust more: the more skewed one, or the more balanced one? Remember how the statistics are built: the counter builds up skew, if any, very slowly (well, it depends on your counter-update scheme), so present skew means a _lot_ of bias has occurred. You should trust a _lot_ of bias (if you know you have a biased coin, are you going to bet on the balanced coin or the biased one?). So much for a non-math argument for how to do it. Actually the argument stems from min-max theory. Charles did the first PPM-SSE based on that argument with log-scaling over three probabilities (like your two) back then.

5. > logistic mixing just transforms probabilities from [0,1] to [-inf,+inf]

Actual implementations map [0..SCALE] to [-SCALE/2..SCALE/2].
Also, that's really not the point at all.

> You don't need a reliability counter.

It's a matter of terminology.
We can use something like this:
Code:
```// probability estimations of the current bit value
q1 = bit ? p1 : SCALE-p1;
q2 = bit ? p2 : SCALE-p2;
// credit whichever model assigned the higher probability to the observed bit
if( q1>q2 ) w1++; else w2++;
w = w1/(w1+w2);```
And it would be a valid update function for both a linear and a logistic mixer,
ie it would improve compression over static mixing.
Sure, it's not the best one, but you could call w1,w2 "reliability counters".

> Think about which of the two statistical distributions you would
> trust more, the more skewed one, or the more balanced? Remember how
> the statistics are built, the counter very slowly (well, depends on
> your counter-update scheme) builds up skew if any; present skew means
> a _lot_ of bias has occurred.

That's only right with very skewed symbol distributions, like in
high-order contexts, and counters initialized with 0.5.
But it doesn't apply to occurrence counters (ie p=n0/(n0+n1)),
nor to schemes like unary coding, nor to data types different
from high-order context histories.

Also always predicting bit=0 if you have p1=1 and p2=SCALE/2
is not optimal in either case (I mean using the most skewed
estimation, like you suggested).

> You should trust a _lot_ bias (like if you know you have a biased
> coin, you're going to win with the balanced or the biased coin?).

You're right, but it doesn't apply to compression at all.
Here you have to consider codelength instead of plain number of guesses.

> Charles did the first PPM-SSE based on that argument with
> log-scaling over three probabilities (like your two) back then.

Logistic mixing is unrelated to log scaling.
Also, it was only SEE; SSE was invented by another person.

6. Thanks for the very detailed explanations.

At this point, it seems the best way to improve my understanding is to test with real examples.
Quite some work ahead...

7. 1. http://nishi.dreamhosters.com/u/o01_v1.rar
\mixtest_v2 -- linear mix (both static and adaptive)
\mixtest_v3 -- logistic mix
\o01_src -- interpolated SSE2 for mixing

2. http://nishi.dreamhosters.com/u/bsc240_qlfc_v2.rar
It's an entropy coder from a real compressor (bsc 2.4.0),
with speed optimizations removed for readability.
In particular, see bsc_mixer.inc - it implements a base submodel, which is applied for all kinds of contexts.
Logistic mixing, interpolated SSE and static mixing are used there.

3. Also these can be used for your experiments - they're based
on a really simple implementation:

bitwise o0 rangecoder demo / rc version comparison
http://nishi.dreamhosters.com/u/rc_v0.rar

static linear mix of o0 and o1 linear counters
http://nishi.dreamhosters.com/u/rc_v3.rar

linear mixing demo with BFA0-3 updates:
http://nishi.dreamhosters.com/u/rc_v4.rar

8. Originally Posted by Shelwien:
> Also it was only SEE, SSE was invented by another person.

Yes, you're right - the acronyms are sometimes confusing.

9. Actually there is no basis in probability theory for combining predictions. If you have two models predict that the next bit will be 1 with, say, p1 = .65 and p2 = .8 in separate contexts, it would not be hard to construct scenarios where the next bit is always 0 when both contexts are present.

The basis for combining predictions is algorithmic information theory. It answers the question "what is the simplest (shortest) explanation for one model predicting .65 and the other predicting .8?" and uses that explanation to get p. Or more precisely, consider all possible explanations M and combine by weighted averaging with weights 2^-|M|, where |M| is the length of the explanation in bits. (This is similar to using just the shortest explanation, because its weight dominates the others.) However, this is not computable in general, and is highly dependent on the choice of language used to write M.

But we can say something. In most languages, expressing probabilities near 0 or 1 takes more bits than probabilities near 1/2. In fact, it takes -log(p) bits when p is near 0 or -log(1-p) bits when p is near 1. Let |p| denote the number of bits needed to write p. Then when you combine probabilities, you have |p| = (|p1| + |p2|)/2, which tends to favor probabilities near 0 or 1. For example if p1 = .5, p2 = .9999999999, then p = .99999 would have the right number of bits.
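That claim is easy to check numerically (a rough sketch, using the -log(1-p) cost stated above for probabilities near 1):

```python
import math

def bits_near_one(p):
    # approximate cost in bits of writing down a probability near 1
    return -math.log2(1 - p)

# Average the description lengths of p1 = .5 and p2 = .9999999999 ...
avg = (bits_near_one(0.5) + bits_near_one(0.9999999999)) / 2
# ... and recover the probability whose description length is that average.
p = 1 - 2 ** -avg   # about .999993, ie roughly the .99999 from the text
```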

Both logistic mixing (PAQ8, ZPAQ) and linear mixing (PAQ6) have this form, although logistic mixing is more precise.

PAQ6 mixing is linear in bit counts, not probabilities. Although you will get a compression improvement using

p = (p1 + p2)/2 = .725

That is not really what PAQ6 does. It represents probabilities as ratios of 0's to 1's. For example, p1 = .65 might be a ratio like (2 : 4) and p2 = .8 might be (2 : 8). Then you would add the counts to get a ratio like ((2+2) : (4+8)) = (4 : 12) = .75. Probabilities near 0 or 1 have one large count and one small count, so they get greater weight than probabilities near 1/2, which have two small counts. The state table in PAQ6 maintains the counts this way by discarding half of the excess count over 2 of the opposite bit. For example, a 0 bit in state (1 : 10) would go to (2 : 6) instead of (2 : 10).
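A tiny sketch of that count-based combination (a simplified illustration of the example above, not actual PAQ6 code):

```python
def mix_counts(c1, c2):
    """Combine two (n0, n1) count pairs by adding counts;
    returns the resulting p(bit==1)."""
    n0 = c1[0] + c2[0]
    n1 = c1[1] + c2[1]
    return n1 / (n0 + n1)

# (2 : 4) ~ .65 combined with (2 : 8) = .8 gives (4 : 12) = .75
p = mix_counts((2, 4), (2, 8))
```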

Discarding counts also makes the model adaptive, which may be good or bad depending on the data. The problem with PAQ6 is that the state table has to compromise between the right amount of adaptation (keeping counts small or large) and mixing (should (2 : 8) get twice the weight of (1 : 4)?). Logistic mixing separates the two so both can be tuned separately. The formula is:

p = squash((stretch(p1) + stretch(p2)) / 2)

where

stretch(x) = ln(x/(1-x))

squash(x) = 1/(1 + e^-x) (inverse of stretch).

For example

stretch(p1) = ln(.65/.35) = .619
stretch(p2) = ln(.8/.2) = 1.386
(.619 + 1.386) / 2 = 1.0025
squash(1.0025) = .732

Again this is closer to .8 than to .65, but not as much as .75. But this is the correct way to do it. PAQ6 weights the larger probability too much. For example, it would combine .5 with .9999999999 to get .9999999998 rather than .99999.
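The stretch/squash formula is easy to verify numerically; this sketch reproduces both worked examples from this post:

```python
import math

def stretch(p):
    return math.log(p / (1 - p))          # maps [0,1] -> (-inf, +inf)

def squash(x):
    return 1 / (1 + math.exp(-x))         # inverse of stretch

def logistic_mix(p1, p2):
    return squash((stretch(p1) + stretch(p2)) / 2)

p = logistic_mix(0.65, 0.8)               # about .732, as computed above
q = logistic_mix(0.5, 0.9999999999)       # about .99999, not .9999999998
```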

10. Thanks very much for the detailed explanation.
It is very pleasant reading.

All in all, it seems we always try to lean more towards the squeezed probability, albeit in a different manner depending on the selected mixing formula.
I very much like the bit-weight average formula.
