For statistical compression, the most common approach is to code the next symbol with the probability distribution which minimizes the codelength of the already known data.

More advanced implementations may actually take the next symbol into account too, and minimize the codelength of the already known data plus the next symbol.

But either way, it seems to be common sense to think that the model in compression predicts the next symbol, i.e. computes the probability distribution of the next symbol, which is then immediately used for coding.
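For example, here's a minimal Python sketch of that usual scheme (the Laplace-smoothed order-0 bit model is my own illustrative choice, not from any specific coder): the model predicts P(next bit) from the counts seen so far, and -log2(p) is the ideal arithmetic-coding cost of the actual bit.

```python
import math

def adaptive_order0_codelength(bits):
    # Adaptive order-0 bit model with Laplace smoothing (illustrative choice):
    # predict P(next bit = b) from the counts of bits seen so far, then charge
    # the ideal arithmetic-coding cost -log2(p) for the bit actually coded.
    c = [1, 1]  # counts start at 1 (Laplace smoothing)
    total = 0.0
    for b in bits:
        p = c[b] / (c[0] + c[1])  # predicted probability of the coded bit
        total += -math.log2(p)    # ideal codelength contribution, in bits
        c[b] += 1                 # update the model after coding
    return total
```

So e.g. eight zero bits cost exactly log2(9) bits here, while an alternating sequence costs more, because the model keeps predicting against it.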

Now, is that actually correct?

Why, actually we have a perfect example of arithmetic coding with completely precise modelling and coding - enumeration for a memoryless source with known symbol counts.

It can be implemented in a way which would look exactly like a standard statistical coder with model + AC, but it would still assign codes of precisely the same length to all data permutations, without any redundancy at all.

The actual coding is the same, so the model is what makes it so perfect.

Now, what's so different about the model in this case?

Well, it outputs the number of occurrences of each symbol until the end of input, which lets us enumerate the possible future strings and precisely subdivide the code space among strings starting with each symbol.

In short, in the example where the model is known to be correct and precise, that model actually predicts the number of occurrences of each symbol until the end of input, instead of the chance that the next symbol would be a specific one.
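To make that concrete, here's a Python sketch of the enumerative model (my own illustration): the "prediction" is just the remaining count of each symbol, and the coding probability of the next bit b is remaining[b]/remaining_total.

```python
import math

def enum_codelength(bits):
    # Enumerative model for a memoryless source with KNOWN symbol counts:
    # the model tracks the remaining count of each symbol, and the coding
    # probability of the next bit b is remaining[b] / remaining_total.
    remaining = [bits.count(0), bits.count(1)]
    total = 0.0
    for b in bits:
        p = remaining[b] / (remaining[0] + remaining[1])
        total += -math.log2(p)  # ideal arithmetic-coding cost of this bit
        remaining[b] -= 1
    return total
```

Every permutation of a string with n0 zeroes and n1 ones then costs exactly log2(Binomial[n0+n1,n0]) bits - the same length for all permutations, with no redundancy.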

So, why don't we try doing the same in a "universal" model?

Actually, knowing the number of remaining symbols in the data is quite common (mainly because of performance considerations - checking for EOF is much slower than storing the length and counting symbols).

Suppose that we have p=P(bit==0) and n bits left.

The probability of the next n bits containing k zeroes is Binomial[n,k]*p^k*(1-p)^(n-k) (it's the sum of likelihoods of strings with k zeroes, divided by the sum of likelihoods of all strings, which is 1).
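A quick numeric check of that formula in Python (binom_pmf is just my name for it): the probabilities sum to 1 over k = 0..n.

```python
import math

def binom_pmf(n, k, p):
    # P(exactly k zeroes among the next n bits), with p = P(bit == 0)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# sanity check: the distribution over k = 0..n is normalized
total = sum(binom_pmf(10, k, 0.3) for k in range(11))
```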

We have to choose the best k value, so it has to be the most probable one.

Then let's differentiate it by k and find the root. It's a little complicated, but in the end Mathematica shows it as
Log[(1-p)/p] = HarmonicNumber[n-k]-HarmonicNumber[k] // HarmonicNumber[n] = Sum[1/i,{i,1,n}]

Now, let's compute the actual estimations: http://nishi.dreamhosters.com/u/target1.png

f[p0_,n0_]:=k/.FindRoot[(HarmonicNumber[k]-HarmonicNumber[-k+n]+Log[(1-p)/p])/.p->p0/.n->n0,{k,p0*n0}]

Plot[f[p,100]/100-p,{p,10^-3,1-(10^-3)}]

f[10/100,100] = 9.59634 // optimal k for n=100,p=10/100

f[90/100,100] = 90.4037 // optimal k for n=100,p=90/100
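These numbers can also be cross-checked without Mathematica; here's a Python sketch (my own code: harmonic() extends HarmonicNumber to real arguments via the digamma asymptotic series, and optimal_k solves the root condition by bisection, since HarmonicNumber[n-k]-HarmonicNumber[k] is decreasing in k).

```python
import math

EULER_GAMMA = 0.5772156649015329

def harmonic(x):
    # HarmonicNumber[x] for real x > 0, via H(x) = digamma(x+1) + EulerGamma.
    # digamma is computed by the standard recurrence + asymptotic series.
    z = x + 1.0
    psi = 0.0
    while z < 10.0:        # shift the argument up: psi(z) = psi(z+1) - 1/z
        psi -= 1.0 / z
        z += 1.0
    inv2 = 1.0 / (z * z)
    # asymptotic series: psi(z) ~ ln z - 1/(2z) - 1/(12z^2) + 1/(120z^4) - ...
    psi += math.log(z) - 0.5 / z - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))
    return psi + EULER_GAMMA

def optimal_k(p, n, eps=1e-9):
    # Solve HarmonicNumber[n-k] - HarmonicNumber[k] == Log[(1-p)/p] for k
    # by bisection (the left-hand side is decreasing in k).
    target = math.log((1 - p) / p)
    a, b = eps, n - eps
    for _ in range(200):
        m = 0.5 * (a + b)
        if harmonic(n - m) - harmonic(m) > target:
            a = m
        else:
            b = m
    return 0.5 * (a + b)
```

This reproduces the values above: optimal_k(0.1, 100) comes out near 9.596 and optimal_k(0.9, 100) near 90.404.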

And as we can see, it's somewhat different from the probability.

Am I doing something wrong?