Probability estimation for near-empty contexts

http://cbloomrants.blogspot.com/2010...ession_12.html

> A "sparse" context is one with very little information in it.

A "sparse" context is what you call a "skip context" etc. in your blog, like AxBx.

So please don't start renaming things now; even the PAQ

description calls it that.

> In the most extreme case, a sparse context might only have 1

> previous occurrence. Let's take the binary case for clarity,

> so we have {n0=1,n1=0} or {n0=0,n1=1} in a single context.

> 1. What should our actual estimate for P0 (the probability

> of a zero) be?

It's not really a problem with secondary statistics.

> There's a lot of theory about ideal estimators and you can

> go read papers on the unbiased estimator or the KT estimator

> or whatever that will give you formulas like

Unfortunately, the actually good ones are rarely

described in much detail.
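For reference, the KT estimator that the quote mentions is simple to state; a minimal sketch (the function name is mine):

```python
def kt_estimate(n0, n1):
    """Krichevsky-Trofimov estimator for P(next bit = 0),
    given n0 zeroes and n1 ones seen so far in the context."""
    return (n0 + 0.5) / (n0 + n1 + 1.0)

# The extreme sparse cases from the post:
#   {n0=1, n1=0} -> P0 = 1.5/2 = 0.75
#   {n0=0, n1=1} -> P0 = 0.5/2 = 0.25
```

Note how even after a single observation the estimate is already pulled well away from 1/2, which is exactly the behavior being questioned in the post.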

Here's the result of an estimator discussion with Shkarin:

http://nishi.dreamhosters.com/u/mp2a.rar

It started with him saying that Shtarkov's "MP-estimator"

is the best, which didn't seem quite true in the end,

but it reminded me of the possibility of maximum-likelihood

estimation.

http://encode.su/threads/997-Counter...ll=1#post20144

I.e., you consider all the permutations of strings with n0

zeroes and n1 ones, and their probabilities (string probabilities!)

computed with secondary statistics (in that mp2a demo it's

known that p0 is selected from (0..1/4)U(3/4..1),

and the string probability is p0^n0*(1-p0)^n1),

and estimate the next-bit probability from that - ideally

by integrating over all the string intervals.
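A minimal numerical sketch of that integration idea (the function name and the midpoint-grid integration are mine; the actual mp2a method differs, but the restricted support (0..1/4)U(3/4..1) and the weight p0^n0*(1-p0)^n1 are as described above):

```python
def next_bit_p0(n0, n1, steps=100000):
    """Estimate P(next bit = 0) given n0 zeroes and n1 ones,
    integrating the string probability p0^n0 * (1-p0)^n1
    over p0 in (0,1/4) U (3/4,1), the range assumed known
    from the secondary statistics."""
    num = 0.0  # integral of p0 * p0^n0 * (1-p0)^n1
    den = 0.0  # integral of      p0^n0 * (1-p0)^n1
    for i in range(steps):
        p0 = (i + 0.5) / steps            # midpoint grid over (0,1)
        if 0.25 <= p0 <= 0.75:            # outside the known support
            continue
        w = p0 ** n0 * (1.0 - p0) ** n1   # string probability
        num += p0 * w
        den += w
    return num / den
```

With a single observed zero, {n0=1, n1=0}, most of the weight sits in the (3/4..1) interval, so the estimate lands noticeably above 3/4 - much more aggressive than the KT value of 0.75 for the same counts.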

> For example, in some context you've seen 50 A's and 20 B's.

> Now you see two C's. Suddenly you are in a sparse situation

> again. What you have is an insufficient sample situation.

> Should I still use those 50 A's I saw, or am I now 99%

> likely to make more C's ? I don't have enough statistics in

> this context to tell, so I'm in a sparse situation again.

>

> There are various attempts at addressing this, but like all

> the problems in this series there's been no definitive

> attack on it.

You're right about this being an important point, but SSE

and logistic mixing and offline parameter optimization seem

to handle it well enough, for now.
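For completeness, here's a minimal sketch of the logistic mixing mentioned above, in the PAQ style (class and parameter names are mine; the update is the standard gradient step on coding cost, not any particular coder's exact rule):

```python
import math

def stretch(p):
    """Logit: map a probability in (0,1) to the real line."""
    return math.log(p / (1.0 - p))

def squash(x):
    """Inverse logit (sigmoid)."""
    return 1.0 / (1.0 + math.exp(-x))

class LogisticMixer:
    """Mix model probabilities in the logistic domain:
    p = squash(sum_i w_i * stretch(p_i)), with weights
    adapted by gradient descent on the coding cost."""
    def __init__(self, n, lr=0.02):
        self.w = [0.0] * n
        self.lr = lr
        self.st = [0.0] * n

    def mix(self, probs):
        # probs[i] = model i's estimate of P(bit = 1)
        self.st = [stretch(p) for p in probs]
        return squash(sum(w * s for w, s in zip(self.w, self.st)))

    def update(self, p_mixed, bit):
        # d(-log-likelihood)/dw_i = (p - bit) * stretch(p_i),
        # so step each weight against that gradient
        for i in range(len(self.w)):
            self.w[i] += self.lr * (bit - p_mixed) * self.st[i]
```

The point relevant to the sparse case: a model whose stretched output carries no information (p = 1/2, stretch = 0) contributes nothing and accumulates no weight, while a model that keeps being right in this context quickly dominates the mix.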