1. ## Probability estimation

> how to update the probability that the next value will be a 0 or a 1 ?

Suppose you've encountered n0 zeroes and n1 ones;
then the log-likelihood of the processed string (the negated code length, in bits) is
l[p0_,n0_,n1_]:=n0*Log[2,p0]+n1*Log[2,1-p0]
and by solving l'(p0)==0 we can find the p0 value
which maximizes the likelihood (ie minimizes the code length).
http://www.wolframalpha.com/input/?i...%3D%3D0%2Cp%5D
which is p=n0/(n0+n1).
But it's a posterior estimation.
(ie it's the best approximation of a static generator's parameter, given the known bits,
but it's not necessarily optimal for prediction of the next bit.)
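This is easy to confirm numerically; a quick Python sketch (counts chosen arbitrarily):

```python
from math import log2

def loglik(p, n0, n1):
    # log2-probability of a particular string with n0 zeroes (each with prob p) and n1 ones
    return n0 * log2(p) + n1 * log2(1 - p)

n0, n1 = 7, 3
pml = n0 / (n0 + n1)  # the claimed maximizer
grid = [i / 1000 for i in range(1, 1000)]
assert all(loglik(pml, n0, n1) >= loglik(q, n0, n1) for q in grid)
```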

And for a prior estimation there's a useful method based on likelihoods:
Code:
```(* the probability of a string with n0 zeroes and n1 ones: *)
c[n0_,n1_]:=1/Binomial[n0+n1,n0];

(* the probability of the next string being the one with an extra zero: *)
p=c[n0+1,n1]/(c[n0+1,n1]+c[n0,n1+1])```
http://www.wolframalpha.com/input/?i...%2Bn0%5D%29%5D
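The same likelihood ratio can be evaluated in plain Python; notably, with c[n0,n1]=1/Binomial[n0+n1,n0] the ratio simplifies exactly to the Laplace estimator (n0+1)/(n0+n1+2):

```python
from fractions import Fraction
from math import comb

def c(n0, n1):
    # probability of one particular string among all strings with n0 zeroes, n1 ones
    return Fraction(1, comb(n0 + n1, n0))

def next_zero_prob(n0, n1):
    # probability that the extended string is the one with an extra zero
    return c(n0 + 1, n1) / (c(n0 + 1, n1) + c(n0, n1 + 1))

# the ratio collapses to the Laplace estimator
assert all(next_zero_prob(a, b) == Fraction(a + 1, a + b + 2)
           for a in range(10) for b in range(10))
```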

Same method actually can also be used with string probabilities based on the likelihood formula
with the substituted posterior p=n0/(n0+n1), ie
c[n0_,n1_]:=2^l[n0/(n0+n1),n0,n1]=(n0/(n0+n1))^n0*(n1/(n0+n1))^n1
but this formula is an approximation (it assumes p=n0/(n0+n1), while the actual information
is that there're n0 zeroes and n1 ones; these are not the same at all),
so we'd end up with a pretty weird estimation:
p=1/(1+(n0^n0*(n1+1)^(n1+1)/(n0+1)^(n0+1)/n1^n1))
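A quick numerical sanity check of this estimator (note Python evaluates 0**0 as 1, which is the convention needed for the n0=0 or n1=0 cases):

```python
def mp_estimate(n0, n1):
    # p = 1/(1 + n0^n0 (n1+1)^(n1+1) / ((n0+1)^(n0+1) n1^n1))
    return 1 / (1 + n0**n0 * (n1 + 1)**(n1 + 1) / ((n0 + 1)**(n0 + 1) * n1**n1))

assert mp_estimate(1, 1) == 0.5                       # symmetric counts -> 1/2
assert mp_estimate(5, 1) > mp_estimate(1, 5)          # more zeroes -> higher p(0)
assert 0 < mp_estimate(0, 3) < mp_estimate(3, 0) < 1  # no zero-frequency blowup
```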

Although Shkarin was actually advertising it (as the "MP-estimator") -
and it's supposedly better than the plain (n0+d)/(n0+n1+2*d) in practical use.
(see http://nishi.dreamhosters.com/u/mp2a.rar)

Actually this approach can be improved even further by taking into
account the probability distribution of p values.
For example,
Code:
```c[n0_,n1_]:=Integrate[ PDF[BetaDistribution[1,1],p]*(1-p)^n1*p^n0, {p,0,1},
Assumptions->{Element[p,Reals],Element[n0,Integers],Element[n1,Integers],n0>=0,n1>=0}]
p=FullSimplify[c[n0+1,n1]/(c[n0+1,n1]+c[n0,n1+1])]```
gives p=(n0+1)/(n0+n1+2)
and
Code:
```c[n0_,n1_]:=Integrate[ PDF[BetaDistribution[1/2,1/2],p]*(1-p)^n1*p^n0, {p,0,1},
Assumptions->{Element[p,Reals],Element[n0,Integers],Element[n1,Integers],n0>=0,n1>=0}]
p=FullSimplify[c[n0+1,n1]/(c[n0+1,n1]+c[n0,n1+1])]```
gives p=(n0+.5)/(n0+n1+1) (the Krichevski-Trofimov estimator)
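Both closed forms can be cross-checked with Beta-function identities instead of symbolic integration; a Python sketch (constant prior normalizers are dropped since they cancel in the ratio):

```python
from math import gamma

def c_beta(n0, n1, a, b):
    # B(n0+a, n1+b) = integral of p^(n0+a-1) (1-p)^(n1+b-1) dp over [0,1];
    # this is c[n0,n1] under a Beta(a,b) prior, up to the prior's normalizer
    return gamma(n0 + a) * gamma(n1 + b) / gamma(n0 + n1 + a + b)

def predict(n0, n1, a, b):
    num = c_beta(n0 + 1, n1, a, b)
    return num / (num + c_beta(n0, n1 + 1, a, b))

n0, n1 = 6, 2
assert abs(predict(n0, n1, 1, 1) - (n0 + 1) / (n0 + n1 + 2)) < 1e-12        # Laplace
assert abs(predict(n0, n1, 0.5, 0.5) - (n0 + 0.5) / (n0 + n1 + 1)) < 1e-12  # KT
```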

But it's commonly more interesting to use secondary statistics and numerical integration there.
It's actually completely practical if we apply these formulas to the initialization of a
state-to-probability map or some such.

Btw, the plot of BetaDistribution[1,1] is this: http://www.wolframalpha.com/input/?i...2C0%2C1%7D+%5D
(ie all p values have the same probability),
while the BetaDistribution[1/2,1/2] density is U-shaped
(ie probabilities near 0 and 1 are more likely to appear).
Of course, when we know that the data is not random, the latter is a more reasonable assumption.
But we'd only know the actual distribution after collecting the statistics from various contexts in a real CM.

--------------
1. This was originally a comment on Cyan's blog, but then I tried to fix some typos...
2. "Mathematica" syntax is used.

2. As a side-note

The maximum-likelihood estimate is the same as the maximum a-posteriori (MAP) estimate with a Beta(1,1) prior.

Calculating the posterior mean with Beta(1,1) gives the Laplace estimator, which is what you did.

3. In this paper, Cleary and Teahan found that the best estimator is p = (n1 + r)/(n0 + n1 + 2r) with r = 0.5 to 1 when n0, n1 > 0 and r = 0.05 to 0.1 when either n0 = 0 or n1 = 0. r = 1 is just the Laplace estimator and r = 0.5 is the Krichevski-Trofimov estimator. http://www.cs.waikato.ac.nz/ml/publi...zerofreq95.pdf

Their experiments were done on the Calgary corpus files, which tend to be stationary. For nonstationary sources, you cannot consider the observed bits to be independent. For example, 00001 would predict a higher probability of 1 than 10000 even though the counts are the same. This is where indirect context models (context -> bit history -> prediction) outperform direct models (context -> prediction).
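As a toy illustration of the indirect scheme (context -> bit history -> prediction), here's a hedged Python sketch; the history length and update rate are arbitrary choices, not from any particular codec:

```python
from collections import defaultdict

H = 4  # bits of history kept per context (an arbitrary choice)

history = defaultdict(str)             # context -> last H bits seen
state_prob = defaultdict(lambda: 0.5)  # bit history -> adaptive p(next bit == 1)

def predict(ctx):
    return state_prob[history[ctx]]

def update(ctx, bit, rate=0.2):
    h = history[ctx]
    # the history state, not the raw context, owns the probability
    state_prob[h] += rate * (bit - state_prob[h])
    history[ctx] = (h + str(bit))[-H:]

# after training on a run-of-zeroes pattern, the model learns that the
# history "0000" is followed by a 1, regardless of the overall 0/1 counts
for bit in [0, 0, 0, 0, 1] * 20:
    update("ctx", bit)
assert state_prob["0000"] > 0.9
```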

4. Originally Posted by Matt Mahoney
For nonstationary sources, you cannot consider the observed bits to be independent. For example, 00001 would predict a higher probability of 1 than 10000 even though the counts are the same. This is where indirect context models (context -> bit history -> prediction) outperform direct models (context -> prediction).
First, one can imagine a nonstationary source with independent bits, e.g. p(X_n == 1) = 1/n.

Second, indirect model is pretty damn cool, but it's not the end of discussion on nonstationarity. For example, concatenating two different files together would also make the result "more nonstationary", no? And indirect context model doesn't capture that.

I think the biggest difference between practical counters and theoretical estimators is that practical counters don't converge to some ideal binomial model, they keep adapting. Either because of limited precision or due to conscious choice of the author.
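The difference is easy to demonstrate; a Python sketch comparing a count-based estimator with an exponential-decay counter on a source that switches halfway (rates and lengths are illustrative):

```python
def counts_estimate(bits, d=0.5):
    # count-based estimator: converges for stationary sources
    n1 = sum(bits)
    return (n1 + d) / (len(bits) + 2 * d)

def decay_counter(bits, rate=0.05, p=0.5):
    # practical adaptive counter: old bits fade geometrically, it keeps adapting
    for b in bits:
        p += rate * (b - p)
    return p

stream = [0] * 200 + [1] * 200  # the source switches halfway
assert abs(counts_estimate(stream) - 0.5) < 0.01  # counts average both halves
assert decay_counter(stream) > 0.99               # decay tracks the recent half
```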

5. > Cleary and Teahan found that the best estimator is p=(n1+r)/(n0+n1+2r) with r=0.5 to 1

1. Yeah, and I tried to explain what it means, and how to improve it using secondary statistics.
CMs have to handle multiple contexts, so there's an option to collect statistics of p values,
then compute the likelihoods for strings with (n0,n1) bits in context of current p value distribution.

2. The common p distribution for a context is skewed, so any linear probability estimations
(from linear counters or freqs) are likely to be posterior estimations, while we need prior ones for coding.
Thus an extra p'=squash(C*stretch(p)) transformation, with a tuned C value, usually improves the compression
(that can be merged with logistic mixing).
Although it's not a precise formula, just a convenient approximation.
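A minimal sketch of that transformation, assuming an illustrative C value of 0.9:

```python
from math import log, exp

def stretch(p):
    return log(p / (1 - p))

def squash(x):
    return 1 / (1 + exp(-x))

def calibrate(p, C=0.9):
    # pull the raw (posterior-flavored) estimate toward 0.5 in the logistic
    # domain; C = 0.9 is an illustrative value, not a tuned one
    return squash(C * stretch(p))

assert calibrate(0.5) == 0.5       # the midpoint is a fixed point
assert 0.5 < calibrate(0.9) < 0.9  # confident estimates get tempered
```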

> This is where indirect context models outperform direct models

Yes, but in the end we still need memoryless estimations for the histories of these indirect contexts.
Also, fast coders can't afford the extra indirection, but can be improved by fixing the probability estimations
in FSM LUTs and such.

> For example, concatenating two different files together would also make the result "more nonstationary", no?
> And indirect context model doesn't capture that.

> I think the biggest difference between practical counters and
> theoretical estimators is that practical counters don't converge to
> some ideal binomial model, they keep adapting

Yes, but for nonstationary counters we can apply the same likelihood-based
estimation method too, we'd just have to use a modified entropy formula
with weighted entropy of bits and/or limited history size.

6. Originally Posted by Shelwien
First, how is it the mixer's problem? A context mixer uses context models as submodels, not a 'model for the first half of the file and a model for the second half of the file' type of thing.

Second, I don't understand that link at all... Static mixture of static probabilities is equivalent to a single probability, P(bit==0) = p1*w + p2*(1-w). Probability

(p1^w * p2^(1-w))^N0 * ((1-p1)^w * (1-p2)^(1-w))^N1

is wrong. It should be

(p1*w + p2*(1-w))^N0 * ((1-p1)*w + (1-p2)*(1-w))^N1

Where does this come from?

7. Originally Posted by Matt Mahoney
In this paper, Cleary and Teahan found that the best estimator is p = (n1 + r)/(n0 + n1 + 2r) with r = 0.5 to 1 when n0, n1 > 0 and r = 0.05 to 0.1 when either n0 = 0 or n1 = 0. r = 1 is just the Laplace estimator and r = 0.5 is the Krichevski-Trofimov estimator. http://www.cs.waikato.ac.nz/ml/publi...zerofreq95.pdf
I generally use the Laplace estimator (r = 1), which is considered optimal for uniform distributions. I also use the r = .5 KT one, though I'm really not sure when it's optimal. I am talking about stationary files. There was a paper, I can't remember which, that showed the absolute error for a simple long-file source compared to the entropy. The error for Laplace was one at P = .5, while the error for KT is two. That paper had graphs of the various r values versus probability. If P is very low or very high, the lower values of r are best, sometimes much lower than .5. Likewise, if you're compressing noise, a low value of r leads to larger expansion than using r much greater than 1.

In short, it depends on what you're compressing. However, we hope our compressors make files much smaller, so that's why we tend to use r = .5.

If you do the model right, you really don't have to worry about n0 or n1 = 0. If you have the so-called zero-frequency problem, you sometimes have to play games, since if you don't, it can add to the output size of the file. And yes, my arb255 often does a poor job of handling the zero-frequency problem; I plan to clean that up in a future release, since I could place all used symbols in a header. But for most short files arb255 would be the better choice.

I'm too tired to check, but have a look at equation (5) in http://eprints.pascal-network.org/ar...1/witmse08.pdf. Could Shkarin's estimator be the sequential Normalized Maximum Likelihood distribution?

Good night!

9. > First, how is it the mixer's problem? A context mixer uses context
> models as submodels not 'model for first half of file and model for
> second half of file' type of thing.

1. Technically you can also make a "model for second half of file",
by creating a new initialized instance of the model after reaching that
second half of file, and mixing it with the first instance.
2. What I meant is that the mixer would detect the fact that a submodel
started to provide wrong predictions, and give higher weights to lower
order models, down to order-(-1). But it's basically the same as what
happens at the start of the file - after a while, a nonstationary model
would adapt to new data and the mixer would increase its weight again.

> Second, I don't understand that link at all... Static mixture of
> static probabilities is equivalent to a single probability,
> P(bit==0) = p1*w + p2*(1-w).

Yes, but that's a posterior estimation.
It's correct for already processed bits, but doesn't include the next
unknown bit, which can be anything.

> Probability
> (p1^w * p2^(1-w))^N0 * ((1-p1)^w * (1-p2)^(1-w))^N1
> is wrong. It should be
> (p1*w + p2*(1-w))^N0 * ((1-p1)*w + (1-p2)*(1-w))^N1
> Where does this come from?

The initial assumption says that we have two bit generators
(it's a thing like "bit=(random(0..1)>=p)") and a switcher,
so the complete generator looks like this:
Code:
```bit1=(random(0..1)>=p1); // G1
bit2=(random(0..1)>=p2); // G2
mode=(random(0..1)<w);
bit = mode ? bit1 : bit2;```
And we only know the total numbers of 0s and 1s, so can only
approximate the numbers of bits from each generator as N0*w
for 0s from G1 etc.
Now, the probability of the string of N bits appearing is
the product of the probabilities of the bits in the string.
There're N0*w 0s from G1 and N0*(1-w) 0s from G2, with the
total probability p1^(N0*w)*p2^(N0*(1-w))=(p1^w*p2^(1-w))^N0.
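For reference, the two string-probability formulas being debated here can be put side by side in Python; by the AM-GM inequality the geometric form always assigns the string at most the linear form's probability:

```python
def geometric_string_prob(p1, p2, w, N0, N1):
    # bits split between generators: N0*w zeroes from G1, N0*(1-w) from G2, etc.
    return (p1**w * p2**(1 - w))**N0 * ((1 - p1)**w * (1 - p2)**(1 - w))**N1

def linear_string_prob(p1, p2, w, N0, N1):
    # single effective Bernoulli parameter p = p1*w + p2*(1-w)
    p = p1 * w + p2 * (1 - w)
    return p**N0 * (1 - p)**N1

# the two agree when the generators agree...
assert abs(geometric_string_prob(0.3, 0.3, 0.5, 4, 2)
           - linear_string_prob(0.3, 0.3, 0.5, 4, 2)) < 1e-12
# ...and the geometric form is strictly smaller when they differ (AM-GM)
assert geometric_string_prob(0.1, 0.9, 0.5, 4, 2) < linear_string_prob(0.1, 0.9, 0.5, 4, 2)
```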

10. Originally Posted by Matt Mahoney
For nonstationary sources, you cannot consider the observed bits to be independent.
The observed bits aren't considered to be independent - then a context model would be useless. What is considered independent are the observed bits under a particular context,
because under the assumption of a Markovian source every context gets a finite number of memoryless subsources. If you assume something different, you have to do different math.

Originally Posted by biject.bwts
I generally use the Laplace which is considered optimal when r = 1 and uniform distributions.
The Laplace estimator assumes a uniform distribution on p. It is optimal under a 0-1 loss function.

11. Originally Posted by Shelwien
1. Technically you can also make a "model for second half of file",
by creating a new initialized instance of the model after reaching that
second half of file, and mixing it with the first instance.
You can, but you won't. It's impractical.

Originally Posted by Shelwien
2. What I meant is that the mixer would detect the fact that a submodel
started to provide wrong predictions, and give higher weights to lower
order models, down to order-(-1). But it's basically the same as what
happens at the start of the file - after a while, a nonstationary model
would adapt to new data and the mixer would increase its weight again.
If the mixer detects that higher-order submodels are deteriorating, then perhaps it should also increase/reset their adaptation speed. IMO the mixer helps, but ultimately the problem can only be dealt with in the counters. Well, that's if it is actually a problem, which it kind of isn't in practice, AFAICS.

Originally Posted by Shelwien
Yes, but that's a posterior estimation.
It's correct for already processed bits, but doesn't include the next
unknown bit, which can be anything.
1) Of course it doesn't include the unknown bit. Cause it's unknown...

2) It's not an estimate, but true probability. Estimate would be e.g. P(bit==0) = N0/(N0+N1).

3) Explain to me the term 'posterior'. From what I've read: 'a posteriori' -- knowledge derived from given sample; 'a priori' -- knowledge based not on given sample, but something else.

Originally Posted by Shelwien
The initial assumption says that we have two bit generators
(it's a thing like "bit=(random(0..1)>=p)") and a switcher,
so the complete generator looks like this:
Code:
```bit1=(random(0..1)>=p1); // G1
bit2=(random(0..1)>=p2); // G2
mode=(random(0..1)<w);
bit = mode ? bit1 : bit2;```
And we only know the total numbers of 0s and 1s, so can only
approximate the numbers of bits from each generator as N0*w
for 0s from G1 etc.
Now, the probability of the string of N bits appearing is
the product of the probabilities of the bits in the string.
There're N0*w 0s from G1 and N0*(1-w) 0s from G2, with the
total probability p1^(N0*w)*p2^(N0*(1-w))=(p1^w*p2^(1-w))^N0.
1) Intuitively, this is how I understand the problem setting. To generate the next bit, I first choose a source: I choose the first source with probability w, the second with 1-w. Then, if I chose the first source, then I will send the 0 bit with probability p1 and the 1 bit with 1-p1. Etc.

In terms of probability theory: Let X, Y be binary random variables. Let X=0 denote the event 'the next bit is zero', let Y=0 denote the event 'the first submodel is chosen'.

P(Y=0) = w,
P(X=0|Y=0) = p1,
P(X=0|Y=1) = p2,

From the law of total probability:
P(X=0) = P(X=0|Y=0)P(Y=0) + P(X=0|Y=1)P(Y=1) = p1w + p2(1-w).

Where did I go wrong?

2) Explain some more things to me. Which probabilites are unknown to us? What are we trying to estimate? p1,p2? w? all of them? If all are known then there is nothing to estimate. If none are known then it is a normal Bernoulli distribution and p1,p2,w are unnecessary.

12. > You can, but you won't. It's impractical.

But segmentation does improve compression sometimes,
so a periodically flushed submodel would, too.

> If mixer detects higher order submodels are deteriorating, then
> perhaps it should also increase/reset their adaptation speed.

It's possible, but not very useful, mainly because of secondary statistics.
I mean, SSE is much more efficient when its input is a static function
of the context history (because that way it's able to adjust predictions
for specific bit patterns, which is impossible if the relation between
recent bits and probability values changes with time).
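For what it's worth, here's a rough SSE sketch in Python (bucket count, spacing and update rate are illustrative choices):

```python
from math import log, exp

BUCKETS = 33

def stretch(p):
    return log(p / (1 - p))

def squash(x):
    return 1 / (1 + exp(-x))

# initialize each bucket to the probability it represents (identity map)
table = [squash((i - BUCKETS // 2) / 2) for i in range(BUCKETS)]

def bucket(p):
    # quantize the primary prediction in the stretched domain
    i = round(stretch(p) * 2) + BUCKETS // 2
    return max(0, min(BUCKETS - 1, i))

def sse_predict(p):
    return table[bucket(p)]

def sse_update(p, bit, rate=0.02):
    i = bucket(p)
    table[i] += rate * (bit - table[i])

# if a submodel is systematically overconfident, SSE learns the correction
for _ in range(2000):
    sse_update(0.95, 1)  # the model says 0.95...
    sse_update(0.95, 0)  # ...but the bit is 1 only half the time
assert abs(sse_predict(0.95) - 0.5) < 0.05
```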

> 1) Of course it doesn't include the unknown bit. Cause it's unknown...

But my version with likelihoods does include it.
And because of that doesn't have the common "zero freq" problem.

> 2) It's not an estimate, but true probability.
> Estimate would be e.g. P(bit==0) = N0/(N0+N1).

Bits won't appear precisely with that probability
(like if we'd repeat the experiment multiple times with random inputs and take the stats),
so it's still just an estimation.

> 3) Explain to me the term 'posterior'. From what I've read: 'a
> posteriori' -- knowledge derived from given sample; 'a priori' --
> knowledge based not on given sample, but something else.

http://en.wikipedia.org/wiki/Posterior_probability
http://en.wikipedia.org/wiki/Prior_probability

> Where did I go wrong?

It's not wrong, it's just a different kind of probability estimation,
one which doesn't take the history into account.
Simplified, in a way.

> If all are known then there is nothing to estimate

w,p1,p2 are all known, but we don't know whether p1 or p2 was used
to generate the next bit, so we need some mixing function.
(Also, in fact, the model's w,p1,p2 are the model's estimations
of the generator's w,p1,p2, so they're not "true probabilities".)

p1*w+p2*(1-w) is also such a function, but logistic mixing
(the one used in paq) works better.

Then, there's also an issue of w's update.
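For concreteness, a minimal logistic mixer in the PAQ style, including a gradient weight update (initial weights and learning rate are illustrative choices):

```python
import random
from math import log, exp

def stretch(p):
    return log(p / (1 - p))

def squash(x):
    return 1 / (1 + exp(-x))

class LogisticMixer:
    def __init__(self, n, lr=0.02):
        self.w = [0.3] * n
        self.lr = lr

    def mix(self, probs):
        # mix in the logistic (stretched) domain
        self.st = [stretch(p) for p in probs]
        self.p = squash(sum(w * s for w, s in zip(self.w, self.st)))
        return self.p

    def update(self, bit):
        # gradient step on the code length -log p(bit) w.r.t. the weights
        err = bit - self.p
        self.w = [w + self.lr * err * s for w, s in zip(self.w, self.st)]

random.seed(1)
m = LogisticMixer(2)
for _ in range(3000):
    bit = 1 if random.random() < 0.9 else 0  # bits are ~90% ones
    m.mix([0.9, 0.1])  # submodel 0 is right, submodel 1 is misleading
    m.update(bit)
assert m.w[0] > m.w[1]  # the mixer learned to trust the better submodel
```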

13. Originally Posted by Shelwien
It's possible, but not very useful, mainly because of secondary statistics.
I mean, SSE is much more efficient when its input is a static function
of the context history (because that way it's able to adjust predictions
for specific bit patterns, which is impossible if the relation between
recent bits and probability values changes with time).
One could temporarily increase the adaptation speed of SSE on file boundary.

Originally Posted by Shelwien
> 1) Of course it doesn't include the unknown bit. Cause it's unknown...

But my version with likelihoods does include it.
And because of that doesn't have the common "zero freq" problem.
But this is nonsense! You cannot know the unknown.

Originally Posted by Shelwien
> 2) It's not an estimate, but true probability.
> Estimate would be e.g. P(bit==0) = N0/(N0+N1).

Bits won't appear precisely with that probability
(like if we'd repeat the experiment multiple times with random inputs and take the stats),
so it's still just an estimation.
I disagree. I say:

1) The true probability is p1w + p2(1-w).

2) The estimate N0/(N0+N1) will converge to true probability.

3) If all parameters are known, then there is no "likelihood".

4) The source is equivalent to a Bernoulli source.

5) We know exactly the probability of the next bit and don't need to estimate anything.

6) This has nothing to do with PAQ.

Originally Posted by Shelwien
> 3) Explain to me the term 'posterior'. From what I've read: 'a
> posteriori' -- knowledge derived from given sample; 'a priori' --
> knowledge based not on given sample, but something else.

http://en.wikipedia.org/wiki/Posterior_probability
http://en.wikipedia.org/wiki/Prior_probability
This confirms my position:

1) "a prior probability distribution [...] is the probability distribution [...] before the "data" is taken into account."

2) "the posterior probability of a random event [...] is the conditional probability that is assigned after the relevant evidence is taken into account"

So any parameter estimation leads to a posterior probability.

14. > One could temporarily increase the adaptation speed of SSE on file boundary.

Well, there are counters with dynamic update rate.
For example, the SEE counters for masked contexts in ppmd.

But with SSE... well, if you'd change the rate once or twice
per file, it won't really matter.

Also SSE update speed is always relatively slow, with any parameters,
just because SSE node includes multiple counters and we need a chance
to update all of them.
There's no way to correctly retrain SSE with just 10 bits of data.

> But this is nonsense! You cannot know the unknown.

I cannot know, but I can take it into account.

In this case, I don't know the probability of the next bit being 0,
but I can estimate the likelihood of a string with N0+1 zeroes and
N1 ones appearing (as a weighted sum of likelihoods of strings with
all p values).
We also know that the next string would either have stats N0+1,N1,
or N0,N1+1, so using that information, we can compute the conditional
probability of next string being the one with N0+1,N1 bits, which is
also the probability of the next bit being 0.

> 1) The true probability is p1*w + p2*(1-w).

Yes, with generator's p1,p2,w which we don't know.
(and taking into account that usually the data is not based on assumed
generation method at all).

Again, _model's_ p1,p2,w are model's estimations of generator's p1,p2,w.
They're also estimated using only a part of available information about
generator's behavior (ie data bits).

So there's no wonder that different probability estimations may be applicable at once,
and any of them may be optimal for specific types of data.

It's just an experimental fact that logistic mixing works better in bitwise CMs.

But its also a fact that the likelihood-based estimation takes into account
more of available information, so its predictions are commonly better, or
at least not worse than a linear mix.

> 2) The estimate N0/(N0+N1) will converge to true probability.

It won't because the coder using it would crash before encoding the first bit
(due to division by zero).

And something like (N0+d)/(N0+N1+2*d) would converge, but only for stationary data.

> 3) If all parameters are known, then there is no "likelihood".

String probabilities do exist in any case.
If you have N bits of data, then there's a probability of encountering these N bits.

Also variables are known (they're not parameters).
We can potentially encounter any values of these variables.

> 5) We know exactly the probability of the next bit and don't need to estimate anything.

No, we don't know the probability, you misunderstood.
We assume a specific generator, but we don't know anything about its parameters.
We can only estimate which parameter set is most likely, given the known data bits.

> This confirms my position

1) "The posterior probability of a random event or an uncertain
proposition is the conditional probability that is assigned after
the relevant evidence is taken into account."

This is what p=N0/(N0+N1) is, after encountering N0 zeroes and N1 ones -
it's true for the already known bits, but only for these.
As a proof, it won't accept 1s if the previous bits were all 0s, despite
the fact that we should expect a 1 with non-zero probability.

2) prior probability [...] uncertainty about p before the "data" is taken into account

Which is the case with our next bit.

15. Okay, so do you agree with the following statement:

1) If p1,p2,w are exactly the generator's true parameters, then the generator emits a zero bit with probability p = p1*w + p2*(1-w).

If you agree with that statement, do you then disagree with this statement:

2) If p1,p2,w are exactly the generator's true parameters, then the generator emits a string with N0 zeroes and N1 ones with probability p^N0 * (1-p)^N1, where p = p1*w + p2*(1-w).

I assume you disagree, but I don't see why.

> 1) If p1,p2,w are exactly the generator's true parameters, then
> the generator emits a zero bit with probability p = p1*w + p2*(1-w).

Yes, but we can't precisely compute the parameter values just from the generated data.
Also with real data we have to assume that there're no globally optimal parameter values at all.

> 2) If p1,p2,w are exactly the generator's true parameters, then
> the generator emits a string with N0 zeroes and N1 ones with
> probability p^N0 * (1-p)^N1, where p = p1*w + p2*(1-w).
>
> I assume you disagree, but I don't see why.

This is one of the possible estimations too, mine is simply based on different assumptions.
To be specific, I assume that N0*w zero bits are generated with probability p1, while in your
case all the bits can be generated with probability p2.

The main difference is that your interpretation gives the same linear-mix estimation
for the probability of the next bit, even using likelihoods.
But we have experimental evidence that logistic mixing is frequently better than linear.

Anyway, there's no single correct answer; they're all based on some hidden assumptions,
like a uniform distribution of the generator's p values, or data stationarity
(you assume that the generator's w is the same for all bits, while I don't).

But still, I think it's better to have some explanation for logistic mixing, because
it was originally borrowed by Matt Mahoney from NN theory without any explanation at all.

17. Unfortunately I am not sure why logistic mixing works. I discovered it experimentally.

Probability theory is insufficient. The question is: given pa = P(X|A) and pb = P(X|B), what is p = P(X|A,B)? It is possible to construct scenarios where the answer could be anything you want, like pa = pb = 0.999, p = 0. For example, it is easy to construct a stream where contexts A and B are each always followed by bit X = 1, except for the 1 case in 1000 where both contexts occur together.

The best I can do is argue from Solomonoff induction. Solomonoff induction is a formal statement of Occam's Razor. It says (approximately) that the best predictor of a sequence is to find the shortest program that generates that sequence, then see what it outputs next. Or more precisely, it says find all programs that generate the sequence and average them with weight 1/2^l where l is the length of the program in bits. It's almost the same because the mix is dominated by the shortest program. Unfortunately, the algorithm is not computable either way. (If it were, compression would be an easy problem). Also, the length of the program depends on the choice of programming language, but not by too much because any language can be translated into any other language by a program whose length does not depend on the sequence being predicted.

Now lets say you have two models that predict that the next bit will be 1 with probabilities pa and pb. What this means in the Solomonoff model is that you really have two sets of programs whose averages (weighted by 1/2^l) of the next output bit are pa and pb respectively. Then the best prediction will be the weighted average of the union of the two sets. That will depend on the distribution of lengths in the two sets, which we don't know in general. But one thing we do know is that it takes more bits (and therefore longer programs) to express probabilities very close to 0 or 1 than near 1/2. In fact, it takes roughly |log(p/(1-p))| = |stretch(p)| bits more in most languages.

For example, suppose pa = 0.5, pb = 0.999. Then pa is equivalent to two programs of the same length, one that outputs 0 and one that outputs 1, and maybe some longer programs whose output is irrelevant because they would be weighted less. pb would also require at least two programs, but the shortest program that outputs 0 would have to be 10 bits longer than the shortest program that outputs 1 in order to make 1 1000 times more likely.

Now we consider the weighted average of both sets. Unfortunately we don't know the relative lengths of the shortest programs in each of the two sets. The best we can do is say stretch(p) = w*stretch(pa) + (1-w)*stretch(pb). The case of w = 0.5 would be where the two programs in pb are 5 bits shorter and 5 bits longer than either of the two programs in pa, yielding p ~ 0.97.
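That final figure checks out numerically; a quick Python verification of stretch-domain averaging with pa = 0.5, pb = 0.999, w = 0.5:

```python
from math import log, exp

def stretch(p):
    return log(p / (1 - p))

def squash(x):
    return 1 / (1 + exp(-x))

pa, pb, w = 0.5, 0.999, 0.5
p = squash(w * stretch(pa) + (1 - w) * stretch(pb))
assert abs(p - 0.969) < 0.001  # ~0.97, as claimed
```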

18. Unfortunately I am not sure why logistic mixing works. I discovered it experimentally.
General Outline. Statistical compression is a sequential process. Fix an arbitrary step k (the entities in the following discussion may vary from step to step), you have m models, which predict m model distributions P_1, P_2, ..., P_m on the source alphabet A.

Assumption 1. The k-th character is drawn according to an unknown (source) distribution Q on A.
Observation. Typically some models work better than others.
Assumption 2. For each model there is a positive weight w_i, which quantifies how well model i "fits" the unknown distribution Q.

In this situation logistic mixing (which I call geometric weighting) finds Q, which is "close to" (measured by expected redundancy, the KL-divergence) good models and "far away" from bad models, i.e., the unique minimizer of

min_Q w_1 D(Q||P_1) + ... + w_m D(Q||P_m).

The minimizer is given by:

Q*(x) := [prod_i=1..m P_i(x)^w_i] / [sum_{a in A} prod_i=1..m P_i(a)^w_i].

For |A|=2 (a binary alphabet) this is equivalent to logistic mixing in PAQ.

We assumed that the weights are given; however, that is generally not the case. Thus we may start with an estimation w(0) = (1/m ... 1/m)^T and incrementally minimize the actual code length (as in PAQ). This is called incremental gradient descent and is an approximate attempt to minimize the code length of a given sequence x^n over A:

min_w l(w), where l(w) := sum_k=1..n -log(Q*(x_k | x^{k-1})).

Fact. l(w) is convex when w is drawn from the m-unit simplex, i.e., there exists a single unique minimizer w*.
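The claimed equivalence for |A|=2 is easy to check numerically; a Python sketch (probabilities and weights chosen arbitrarily):

```python
from math import log, exp

def stretch(p):
    return log(p / (1 - p))

def squash(x):
    return 1 / (1 + exp(-x))

def geometric_mix(probs, weights):
    # Q*(1) = prod p_i^w_i / (prod p_i^w_i + prod (1-p_i)^w_i)
    num, den = 1.0, 1.0
    for p, w in zip(probs, weights):
        num *= p ** w
        den *= (1 - p) ** w
    return num / (num + den)

def logistic_mix(probs, weights):
    # sq( sum_i w_i st(p_i) ), as in PAQ
    return squash(sum(w * stretch(p) for p, w in zip(probs, weights)))

probs, weights = [0.9, 0.3, 0.6], [0.5, 0.2, 0.3]
assert abs(geometric_mix(probs, weights) - logistic_mix(probs, weights)) < 1e-9
```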

19. Most excellent work, sir toffer!

I once tried to derive a mixer from minimizing KL, but I'm a bit dumb, so things didn't go that well for me...

Say, isn't "iterative" GD usually called "online" GD?

20. Now a question: how do we implement a better update function for logistic mixer?
(its possible)

21. OK, I just had to try this out and it works. The following mixes predictions .5 and .01 with equal weights using toffer's formula by finding x that minimizes D(x||.5) + D(x||.01). It gives x = -1/98 + 3*sqrt(11)/98 ~ 0.0913

http://www.wolframalpha.com/input/?i...in+%5B0%2C1%5D

And then I get the same number using averaging in the logistic domain. http://www.wolframalpha.com/input/?i...2F.99%29%29%29
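The same check can be reproduced offline with a small Python sketch (grid minimization instead of symbolic solving):

```python
from math import log, exp

def D(x, q):
    # KL divergence between Bernoulli(x) and Bernoulli(q)
    return x * log(x / q) + (1 - x) * log((1 - x) / (1 - q))

# minimize D(x||0.5) + D(x||0.01) on a fine grid
grid = [i / 100000 for i in range(1, 100000)]
x = min(grid, key=lambda t: D(t, 0.5) + D(t, 0.01))
assert abs(x - 0.0913) < 0.001

# the same number from averaging in the logistic domain
def stretch(p):
    return log(p / (1 - p))

def squash(v):
    return 1 / (1 + exp(-v))

p = squash(0.5 * (stretch(0.5) + stretch(0.01)))
assert abs(p - x) < 0.001
```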

22. @valdman: the term incremental gradient descent (not iterative, which is a typo) comes from "Nonlinear Programming", D. Bertsekas; sure there might be other names.

@Shelwien: dunno. So what do you propose?

@Matt, we can directly show this (this actually is in my paper):

st(x) = ln(x/(1-x))
sq(x) = 1/(1+exp(-x))

p(1)
= sq( sum_i=1..m w_i st(p_i(1)) )
= 1/(1 + exp( -sum_i=1..m w_i ln( p_i(1)/p_i(0) ) ))
= 1/(1 + exp( ln prod_i=1..m (p_i(0)/p_i(1))^w_i ))
= 1/(1 + prod_i=1..m (p_i(0)/p_i(1))^w_i )
= [ prod_i=1..m p_i(1)^w_i ] / [ (prod_i=1..m p_i(1)^w_i) + (prod_i=1..m p_i(0)^w_i) ].
= the stuff from my previous post with A = {0, 1}.

I think I've already posted this somewhere.

EDIT: in the paper...
Section 3.1 proves that the mixture distribution in the previous post minimizes the mentioned problem.
Section 3.2 shows that the code length is a convex function of the weight vector and proposes a weight update method based on incremental gradient descent.
Section 3.3 shows that the mixture distribution from 3.1 is equivalent to logistic mixing in PAQ when A={0,1}.

23. > how do we implement a better update function for logistic mixer?

The switching scheme generates a (virtual) string of submodel choices.
The mixer's weight is an estimation of the probability of the next symbol in that string.
But in real data the submodel choices are much less random - they usually follow
the data structure. Like, if we're compressing a .csv table, text fields there
would be better compressed with a stationary submodel, and number fields would
require a submodel with aggressive update.
The mixer would surely detect that, with some delay, and adjust the weight toward
the more efficient submodel. But this is basically the same as compressing everything
with an order-0 model - the submodel choices do have a context dependency.
Then, some improvements for weight estimation become obvious:
1) Approximate the model-switch sequence
(by comparing the rolling entropies of each submodel and choosing the minimum)
and build a better model for it (with context and whatever would be necessary)
and use the estimated choice probability distribution for mixing of primary submodels.
2) Use parsing optimization.
Normally (for CM) all submodels receive all statistics from every data type,
but we can detect "errors" in the model choice string (with some delay, using
further information) and then implement update exclusion (in a second instance
of each submodel, or just do the parsing optimization and submodel fixup after
processing a block).
In a way, this is what Shkarin's seg_file does, but an adaptive sequential
implementation would be more interesting.

24. Finally got around to updating DCE with a reference to toffer's paper. http://mattmahoney.net/dc/dce.html#Section_43
