>> Well, you can skip some bits but how'd that help?

>

> I meant you could be applying some function(s) of the delayed bits.

And I already said that it should be done after correlation analysis

of optimized coefficients, not based on correlations that you _think_

would be there.

> > Another point is that estimation mapping is different

> > from update mapping...

>

> I never took this into account. One could interpret the PAQ implementation like

> this: the automata itself represents a static update mapping, while the APMs

> try to dynamically model the estimation mapping.

Yes, but paq uses this just to fit counters into bytes.

> > So for k bits "delay" there's at least 2*2^k+2*2^(k+1)+2^(k+1)-1 parameters.

>

> How do you get so many parameters?

As I said... there are N*2^k estimation parameters, N*2^(k+1) update mapping

parameters (the update context includes the just-processed bit) and 2^(k+1)-1

initial probability values, one for each context of length 0..k.

Well, also the estimation and update mappings can use different contexts, but

you'd need enough initial values for the update mapping to work.
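For instance, the quoted count for N=2 can be tallied with a tiny Python sketch (the helper name is mine, just for illustration):

```python
def param_count(k, N=2):
    """Parameter count for a k-bit "delay" model; with N=2 this matches
    the quoted formula 2*2^k + 2*2^(k+1) + 2^(k+1) - 1."""
    estimation = N * 2**k          # one set per k-bit delayed history
    update     = N * 2**(k + 1)    # history plus the just-processed bit
    initial    = 2**(k + 1) - 1    # initial p for contexts of length 0..k
    return estimation + update + initial

for k in range(1, 5):
    print(k, param_count(k))
```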

Also it's important that all these things are linear, so it's easy to

calculate a set of coefficients for a perfect simulation of a simple

linear-update counter.

> I would interpret it as a clue that such changes of statistics happen frequently

> (especially in lower-order models, since the contextual separation isn't as

> strong as with higher orders) and the jump in compression gain is caused by the

> ability of bit histories longer than two to distinguish such changes (..000111..)

Whatever... It's the same as any other primary context model.

Actually fpaq0f2 has very similar performance to a traditional order-1 model

and uses the same number of counters:

http://shelwien.googlepages.com/fpaq0f2.htm

And there's no reason to think that previous data bytes are less significant

than context history bits... in fact, it's kinda obvious that source models

don't use history bits - there are even data types with known generators, like

executables.

It's another case where the primary model grows huge and lacks statistics.

Then we just have to merge some contexts, and context history similarity

is a good indicator of what to merge.

And that's what SSE is, basically, though it uses a quantized version of

context history aka probability.
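For illustration, here's a minimal SSE sketch in Python - the class name, the 33-level quantization and the update rate are my assumptions, not any particular codec's API. It maps (context, quantized input probability) to a refined probability by linear interpolation, and nudges the two touched cells toward the observed bit:

```python
class SSE:
    """Minimal SSE sketch: t[ctx][level] refines an input probability.
    Rows start as the identity mapping, so an untrained table is transparent."""
    def __init__(self, nctx, levels=33, rate=0.02):
        self.levels, self.rate = levels, rate
        self.t = [[i / (levels - 1) for i in range(levels)] for _ in range(nctx)]

    def p(self, ctx, p_in):
        # quantize p_in to a position between two adjacent table cells
        x = p_in * (self.levels - 1)
        i = min(int(x), self.levels - 2)
        f = x - i
        self.ctx, self.i, self.f = ctx, i, f   # remembered for update()
        row = self.t[ctx]
        return row[i] * (1 - f) + row[i + 1] * f

    def update(self, y):
        # move the two interpolated cells toward the observed bit y,
        # weighted by how much each contributed to the prediction
        row = self.t[self.ctx]
        for j, wgt in ((self.i, 1 - self.f), (self.i + 1, self.f)):
            row[j] += (y - row[j]) * self.rate * wgt
```

Since each row starts as the identity mapping, training only bends the curve where the quantized input probability - playing the role of the merged context history here - turns out to be miscalibrated.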

> Rewriting p_i(k) dependent on the last bit history updates and ordering by w

> yields a polynomial in w, the thing you posted. This polynomial only depends on:

> (1) the bit history b_1 .. b_k under the current context,

> (2) the initial estimate p_0 (no influence for k->infinity ~ k>>1)

> (3) the parameter w

>

> p_i(k+1) ~ sum i=1..L a_i*w^i (your polynomial approximation)

>

> So the polynomial's coefficients a_i only depend on (1) and (2), since you ordered

> by the powers of w.

>

> My question is, how can we compute the coefficients a_i? I tried to find patterns

> substituting p_k into p_k+1 and ordering by w, ... but I was hardly able to find

> any. There must be a way to compute a_i(k+1) = g(a_0(k) ... a_i-1(k), b_k).

> Otherwise one can't implement it.

Guess I was wrong to leave p0 as a parameter in that example...

Look, imagine that we want to optimize just a single parameter w,

with the update rate written as wr = 0.5+w:

p' = p*(1-wr)+y*wr = p*(0.5-w)+y*(0.5+w) = w*(y-p)+(y+p)/2

Now let's assume that p=Sum[ a[i]*w^i, i=0..k ]. Then

p' = Sum[ (a[i]/2-a[i-1])*w^i, i=2..k ] + (a[1]/2+y-a[0])*w + (y+a[0])/2

a'[0] = (y+a[0])/2

a'[1] = (a[1]/2+y-a[0])

a'[i] = (a[i]/2-a[i-1])

Is it clear now?
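To double-check that recurrence numerically, here's a small Python sketch (the function names are mine) that runs the direct counter update and the coefficient update side by side, keeping the full polynomial so the match is exact:

```python
def update_direct(p, y, w):
    # plain linear counter: p' = w*(y-p) + (y+p)/2
    return w * (y - p) + (y + p) / 2

def update_coeffs(a, y):
    # same update applied to p = Sum a[i]*w^i; degree grows by one per step,
    # the new top term being a'[k+1] = -a[k]
    k = len(a) - 1
    b = [0.0] * (k + 2)
    b[0] = (y + a[0]) / 2
    b[1] = (a[1] / 2 if k >= 1 else 0.0) + y - a[0]
    for i in range(2, k + 2):
        b[i] = (a[i] / 2 if i <= k else 0.0) - a[i - 1]
    return b

def evalpoly(a, w):
    return sum(c * w**i for i, c in enumerate(a))

w, p = 0.1, 0.5
a = [0.5]                       # start from the constant polynomial p0
for y in (1, 0, 1, 1, 0, 1):
    p = update_direct(p, y, w)
    a = update_coeffs(a, y)
assert abs(evalpoly(a, w) - p) < 1e-12
```

With all coefficients kept, the simulation is exact; a fixed-degree implementation would presumably truncate the high-order terms, which is where the approximation error would come from.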

Then, it's possible to work with more parameters in the same way,

e.g. we can use the same polynomial class based on p0 for the a[i] coefficients.

But I suspect that it'd take too much memory to get a tolerable precision

even with two variables.