1. Of course there'd be some noise either way, when calculating with limited precision.
But the noise with small divisors is "more random" or something like that.

Or there's another way too.
Namely, you can use interval arithmetic for all
calculations and select some specific value from
the output interval, instead of the random one you
get without interval arithmetic.

2. Ok, it is obvious that masking out more bits reduces randomness, since higher bits contribute more significantly to the variable's value. This could be an explanation.
How do you select the term for "noise replacement" (you know what I mean)? Do you give more weight to more probable symbols? Or which kind of distribution do you use to mix?

3. > How do you select the term for "noise replacement" (you know what I mean)?

By trying all and finding the best value. Though rough estimations can be derived from the "look" of data -
basically, you set a higher weight of static part
for less predictable data.

> Do you give more weight to more probable symbols?

Usually not, though sometimes it might make sense.

> Or which kind of distribution do you use to mix?

Not sure if I got it right, but for the
counter-mixed-with-static-distribution model from the first post
I normally use a flat distribution, so it works like a measure
of the amount of random patches in the data, or something like that.

Btw, random values (remainders) have kinda the same effect too
(due to non-zero average),
so the compression would probably get worse if you just removed
the "noise component".

4. Originally Posted by Shelwien
> compared to standard speed = 40:
> - after hit: prob = prob + (1 - prob) / 40 = 0.5125
> - after miss: prob = prob - prob / 40 = 0.4875
Simpler code for power of 2 speed:
Code:
```Prob -= Prob >> n;
if ( hit ) Prob += ONE >> n;  // ONE = 1.0 in the fixed-point probability scale```

See PPMd.

> For binary sequence with only added information that there're n0 zeroes and n1 ones,
> there're (n0+n1)!/n0!/n1! equiprobable cases, and it's easily provable that a counter
> like this would be optimal:
> p1 = n1/(n0+n1); // static

It is an approximate solution for large n0 & n1. For a precise solution, see MP-codes.

5. > Simpler code for power of 2 speed:
> Prob -= Prob >> n;
> if ( hit ) Prob += ONE >> n;
> See PPMd .

Really? I thought you only use frequencies there.

>> For binary sequence with only added information that there're n0 zeroes and n1
>> ones,
>> there're (n0+n1)!/n0!/n1! equiprobable cases, and it's easily provable that a counter
>> like this would be optimal:
>> p1 = n1/(n0+n1); // static
>
> It is an approximate solution for large n0 & n1.

Well, it's globally optimal for static coding.
And also the dynamic formula is optimal for adaptive coding, though
it's simpler to prove, as the product of probabilities with properly
decremented counters equals that binomial coefficient.

> For a precise solution, see MP-codes.

Is that supposed to be some binomial-based coding?
Like this: http://compression.ru/sh/mark1.rar ?

6. Originally Posted by Shelwien
Really? I thought you only use frequencies there.
See SSE for binary contexts.

Originally Posted by Shelwien
Well, its globally optimal for static coding.
Oh, sorry, wrong quote. Right quote:

Originally Posted by Shelwien
if( y ) c1++; else c0++; p = c1/(c1+c0); // dynamic

Originally Posted by Shelwien
Is that supposed to be some binomial-based coding?
Like this: http://compression.ru/sh/mark1.rar ?
Yes, something similar, but IMHO your formula is incorrect; the correct formula does not include factorials and C(n,k).

7. Originally Posted by Shelwien
Is that supposed to be some binomial-based coding?
Like this: http://compression.ru/sh/mark1.rar ?
Ok. It is simpler to write the correct formula:
n - length of bitstring s;
k - number of 1s in string s;
+ - concatenation of strings;
p(1|s) =
P(s+1)/(P(s+1)+P(s+0)) =
(k+1)^(k+1)/k^k / ( (k+1)^(k+1)/k^k + (n-k+1)^(n-k+1)/(n-k)^(n-k) )

8. >> Really? I thought you only use frequencies there.
>
> See SSE for binary contexts.

I see only SEE there. But if that's what you meant, I'd just believe you, as the code there
isn't quite readable and I'm not that interested.

And what I am interested in is "stateless" counters, i.e., ideally,
a way to dynamically compute a good enough estimation directly from
the context history. Alas, it's not as simple as it seems, because
with short histories (e.g. 16 bits is short) the prediction of a normal
counter is usually much better, and non-lookup implementations are damn slow.

Btw, the reason is that such an approach really improves the speed
of bruteforce parameter optimization.

>> Well, its globally optimal for static coding.
>
> Oh, sorry wrong quote. Right quote:
>
>> if( y ) c1++; else c0++; p = c1/(c1+c0); // dynamic

Well, for small c0,c1 there's no single "precise solution" anyway,
as it all depends on which sequences you consider more frequent.

>> Is that supposed to be some binomial-based coding?
>> Like this: http://compression.ru/sh/mark1.rar ?
>
> Yes, something similar, but IMHO your formula is incorrect,
> correct formula does not include factorials and C(n,k).

Well, there's a working coder without much redundancy, so I don't
care if you think it's incorrect.
Also there're obviously different ways to do the same... with
additional proof in the form of the recent ABS coder... who'd have thought
it's still possible to create a really different interval coding
alternative.

9. > Ok. It is simpler to write correct formula:
> n - length of bitstring s;
> k - number of 1s in string s;
> '+' - concatenation of strings;
> p(1|s) = P(s+1)/(P(s+1)+P(s+0)) =
> (k+1)^(k+1)/k^k/((k+1)^(k+1)/k^k+(n-k+1)^(n-k+1)/(n-k)^(n-k))

Ah. Quite as I suspected.
So it's not "doesn't include factorials and binomial coefficients",
it just uses an (in)famous factorial approximation...
http://en.wikipedia.org/wiki/Stirling%27s_approximation
Btw, it's also only good for large numbers, so I don't think there's
actually any difference... but thanks anyway, as I haven't heard about
this particular implementation. Guess I'd appreciate any links or articles
if you have any.

10. Originally Posted by Shelwien
I see only SEE there
For binary contexts p(SSE)=1-p(SEE).

Originally Posted by Shelwien
And what I am interested in is "stateless" counters, i.e., ideally,
a way to dynamically compute a good enough estimation directly from
the context history. Alas, it's not as simple as it seems, because
with short histories (e.g. 16 bits is short) the prediction of a normal
counter is usually much better, and non-lookup implementations are damn slow.
The formula must be tabulated, of course.

Originally Posted by Shelwien
Well, there's a working coder without much redundancy, so I don't
care if you think it's incorrect.
Redundancy can be reduced with a proper formula. I heard, You are concerned for precise models, are not You?

11. Originally Posted by Shelwien
Btw, it's also only good for large numbers, so I don't think there's
actually any difference... but thanks anyway, as I haven't heard about
this particular implementation. Guess I'd appreciate any links or articles
if you have any.
AFAIR, it is by Shtarkov and it is not an approximation but a precise formula.

12. > > I see only SEE there
>
> For binary contexts p(SSE)=1-p(SEE)

I know, but my text editor's search function doesn't.

> > And what I am interested in is "stateless" counters, i.e., ideally,
> > a way to dynamically compute a good enough estimation directly from
> > the context history. Alas, it's not as simple as it seems, because
> > with short histories (e.g. 16 bits is short) the prediction of a normal
> > counter is usually much better, and non-lookup implementations are damn slow.
>
> Formula must be tabulated, of course.

Well, it's not that it "must be", actually... it's all just a matter
of the speed impact on the optimizer.

But still I'd like to find some counter decomposition/approximation
which would be precise enough and much faster than recurrent update.

> > Well, there's a working coder without much redundancy, so I don't
> > care if you think it's incorrect.
>
> Redundancy can be reduced with a proper formula. I heard, You are concerned for
> precise models, are not You?

Certainly not "You".

Also that's not a proper formula for nonstationary data, so I wouldn't
really use it even if it were ideal for stationary data, which it isn't.

> > Btw, its also only good for large numbers, so I don't think there's
> > actually any difference... but thanks anyway, as I haven't heard about
> > this particular implementation. Guess I'd appreciate any links or articles
> > if you have any.
>
> AFAIR, it is by Shtarkov and it is not approximation, but precise formula.

Also afair Shtarkov really liked the Stirling formula even back then, when
I'd seen some articles of his. It allows one to turn a simple binomial formula
into a scary half-page monster, so I kinda understand why, though.

Code:
```
n! ~= Sqrt(2*Pi*n)*n^n/Exp(n)

C(x+y,x) = (x+y)!/x!/y!

= Sqrt(2*Pi*(x+y))*(x+y)^(x+y)/Exp(x+y) / ( Sqrt(2*Pi*x)*x^x/Exp(x) * Sqrt(2*Pi*y)*y^y/Exp(y) )

= (x+y)^(x+y+.5) / x^(x+.5) / y^(y+.5) / Sqrt(2*Pi)```

Also about that formula for probability:

Code:
```
C(x+y,x) - number of bit strings with x zeros and y ones.

p1 = C(x+1+y,x+1) / (C(x+1+y,x) + C(x+1+y,x+1))

= (x+1+y)^(x+1+y+.5) / (x+1)^(x+1.5) / y^(y+.5) /
( (x+1+y)^(x+1+y+.5) / (x+1)^(x+1.5) / y^(y+.5) +
(x+1+y)^(x+1+y+.5) / x^(x+.5) / (y+1)^(y+1.5) )

= x^(x+.5) * (y+1)^(y+1.5) / ( x^(x+.5) * (y+1)^(y+1.5) + (x+1)^(x+1.5) * y^(y+.5) )```

Doesn't that look similar now? Guess that .5 was removed for ugliness, though.

And finally:

Code:
```
p1 = C(x+1+y,x+1) / (C(x+1+y,x) + C(x+1+y,x+1))

= (x+1+y)!/(x+1)!/y! / ( (x+1+y)!/x!/(y+1)! + (x+1+y)!/(x+1)!/y! )

= x!*(y+1)! / ( (x+1)!*y! + x!*(y+1)! ) =

= x!*y!*(y+1) / ( x!*y!*(x+1) + x!*y!*(y+1) ) =

= (y+1) / (x+y+2)```

13. Originally Posted by Shelwien
Also afair Shtarkov really liked the Stirling formula even back then, when
I'd seen some articles of his. It allows one to turn a simple binomial formula
into a scary half-page monster, so I kinda understand why, though.

The maximum likelihood probability of a bitstring is:
P(s)=const(n)*(k/n)^k*((n-k)/n)^(n-k)
Substitute it into
p(1|s) = P(s+1)/(P(s+1)+P(s+0))
and You will get the Shtarkov formula for MP-codes. As You can see, there are no approximations and no Stirling formula, but You can use it for further simplifications at large n0 & n1 and You will get the Dirichlet estimator:
p(1|s)=(n1+1/2)/(n0+n1+1)
MP-codes are useful for small n0 & n1; for larger ones, the Dirichlet estimator is enough.

Originally Posted by Shelwien
I know, but my text editor's search function doesn't.
Code:
```#define GET_MEAN(SUMM,SHIFT) ((SUMM+ROUND) >> SHIFT)
inline void PPM_CONTEXT::encodeBinSymbol(int symbol)
{
    STATE& rs=oneState();
    WORD& bs=BinSumm[QTable[rs.Freq-1]][NS2BSIndx[suff()->NumStats]+PrevSuccess+
                     Flags+((RunLength >> 26) & 0x20)];
    UINT tmp=rcBinStart(BSumm=bs,TOT_BITS); bs -= GET_MEAN(BSumm,PERIOD_BITS);
    if (rs.Symbol == symbol) {
        bs += INTERVAL; rcBinCorrect0(tmp);
        FoundState=&rs; rs.Freq += (rs.Freq < 196);
        RunLength++; PrevSuccess=1;
    } else {
        // (miss branch elided in the original post)
    }
}```

It is also described in my paper on PPMII.

14. >> Also afair Shtarkov really liked the Stirling formula even back then, when
>> I'd seen some articles of his. It allows to turn a simple binomial formula
>> into a scary half-page monster, so I kinda understand why, though
>
> Maximum likehood probability of bitstring is:
> P(s)=const(n)*(k/n)^k*((n-k)/n)^(n-k)

Do I really need to repeat that it's an approximation?
Because there're exactly n!/k!/(n-k)! possible bitstrings of length n with k 1s,
so P(s) = k!*(n-k)!/n!

> Substitute it to
> p(1|s )= P(s+1)/(P(s+1)+P(s+0))
> and You will get the Shtarkov formula for MP-codes. As You can see, there is no
> approximations and no Stirling formula, but You can use it for further
> simplifications at large n0 & n1 and You will get Dirichlet estimator:
> p(1|s)=(n1+1/2)/(n0+n1+1)

Whatever, the method of derivation doesn't change anything, and I can do this:
Code:
```
number of strings with x zeroes and y+1 ones, ending with 1 is: C(x+1+y,x)/(y+1)
number of strings with x+1 zeroes and y ones, ending with 0 is: C(x+1+y,x+1)/(x+1)

p1 = C(x+1+y,x)/(y+1) / ( C(x+1+y,x)/(y+1) + C(x+1+y,x+1)/(x+1) )

= C(x+1+y,x)*(x+1) / ( C(x+1+y,x) + C(x+1+y,x+1)*(y+1) )

= C(x+2+y,x+1) / (C(x+1+y,x) + C(x+2+y,x+1))```

and then replace C(n,k) with a Stirling formula and get more or less the same
result (what I more or less already did in previous post).

> MP-codes are useful for small n0 & n1, for larger ones, Dirichlet estimator is
> enough.

Your observations don't change anything either
(that is, if you have any and don't just quote some article),
as I'm certain that you can get even better empirical results
than "MP-codes" by adjusting the d value in the p1=(y+d)/(x+y+2*d) formula.

Also I don't see how n! is worse than n^n, computation-wise.

>> I know, but my texteditors search function doesn't .
>
> #define GET_MEAN(SUMM,SHIFT) ((SUMM+ROUND) >> SHIFT)

Yeah, I've seen this, and it's not getting any more readable by posting it here.

Also I wasn't arguing with this from the start, it's just that it's the same
update function as the one that paq and derivatives use,
and I consider it noisy and imprecise due to the fixed set of available "update speeds".

15. Originally Posted by Shelwien
Do I really need to repeat that it's an approximation?

It is not an approximation, it is the binomial distribution.

Originally Posted by Shelwien
Because there're exactly n!/k!/(n-k)! possible bitstrings of length n with k 1s,
so P(s) = k!*(n-k)!/n!

Are You sure? You select one string with k 1s and length n, but the same source can generate strings with the same length n, but with other k.

Originally Posted by Shelwien
Your observations don't change anything either
(that is, if you have any and don't just quote some article),
as I'm certain that you can get even better empirical results
than "MP-codes" by adjusting the d value in the p1=(y+d)/(x+y+2*d) formula.

It is the KT-estimator for a binary alphabet. Asymptotically, redundancy is minimized for d=1/2.

Originally Posted by Shelwien
Also I don't see how n! is worse than n^n, computation-wise.

I don't talk about simplicity of computations, but about precise estimation.

Originally Posted by Shelwien
Also I wasn't arguing with this from the start, it's just that it's the same
update function as the one that paq and derivatives use,
and I consider it noisy and imprecise due to the fixed set of available "update speeds".

I am talking about the same. For example, We can use MP-codes for small n0 & n1 and switch to the Dirichlet estimator after gathering enough statistics. All things connected with nonstationarity must be applied to the Dirichlet estimator, because at the beginning of coding We do not have statistics even for a stationary source.

16. btw:

dmitry, can you find some spare time to reformat the ppmd sources? currently they are hard to read (you write two instructions per row, which can fool the reader).

also can you write some new complete article on ppmdii in english? currently on your site i see two articles: one in russian (my russian is very poor) and one in english which is a little outdated (it refers to ppmdh while ppmdi or ppmdj are noticeably better) and requires the reader to understand the previous article written in english. also your english has probably improved. i would be very grateful to you if you do that.

17. Originally Posted by donkey7
dmitry, can you find some spare time to reformat the ppmd sources? currently they are hard to read (you write two instructions per row, which can fool the reader).
Such formatting is convenient for me; You can reformat as You want, of course.

Originally Posted by donkey7
also can you write some new complete article on ppmdii in english? currently on your site i see two articles: one in russian (my russian is very poor) and one in english which is a little outdated (it refers to ppmdh while ppmdi or ppmdj are noticeably better) and requires the reader to understand the previous article written in english.
PPMdi & PPMdj are simply tuning of old ideas; there is nothing principally new in modeling since var.G and nothing principally new in tree building since var.H.

Originally Posted by donkey7
also your english probably improved
It is degraded.

18. >> Because there're exactly n!/k!/(n-k)! possible bitstrings of length n with k 1s,
>> so P(s) = k!*(n-k)!/n!
>
> Are You sure? You select one string with k 1s and length n, but the same source
> can generate strings with the same length n, but with other k.

Whatever k there is, the number of binary strings of length n with k ones is
C(n,k)=n!/k!/(n-k)!
So given that we know the current k and n, there're C(n+1,k)/(n+1-k) different
strings of length n+1 with zero at last position, and C(n+1,k+1)/(k+1) strings
with one at last position, so we can calculate a relative frequency aka
probability.

> It is not approximation, it is binomial distribution.

The formula can be originally conjured by any magic, but when I can reach
the same by replacing some functions with approximations, I'd call it an approximation.

Also there's nothing bad in it, and depending on the context, a given approximation
might be precise.

Actually, if you forgot already, I just tried to explain the relation of your
formula to binomial coefficients.

>> Your observations don't change anything either
>> (that is, if you have any and don't just quote some article),
>> as I'm certain that you can get even better empirical results
>> than "MP-codes" by adjusting the d value in the p1=(y+d)/(x+y+2*d) formula.
>
> It is the KT-estimator for a binary alphabet. Asymptotically, redundancy is minimized
> for d=1/2.

Dunno how you can get that asymptotically.
Though instead I know that ( D[f(x),x] = f'(x) ):
Code:
```
In:= Solve[ D[x*Log[((x+d)/(x+y+2*d))]+y*Log[((y+d)/(x+y+2*d))],d]==0,d]
Out:= {{d->0}}```
And that's what I'd call an asymptotic estimation.
Also we can exclude d=0 (as it can't be used with x=y=0)
by the substitution d->1/(1+e^2) and taking a derivative by e.
Then it says that the minimum is at e=0, so d=1.
Pity that I don't have any good grass, so I can't immediately do something
to make it d=.5

...Well, with all that I only wanted to show that there's no single ideal solution.

>> Also I don't see how n! is worse than n^n, computation-wise.
>
> I don't talk about simplicity of computations, but about precise estimation.

And I'm talking about the reason for using an approximation when it's as slow
as the precise formula, but imprecise.

>> Also I wasn't arguing with this from the start, its just that its the same
>> update function as the one that paq and derivatives use,
>> and I consider it noisy and imprecise due to the fixed set of available "update speeds".
>
> I am talking about the same. For example, We can use MP-codes for small n0 & n1
> and switch to Dirichlet estimator after gathering of enough statistics. All
> things connected with nonstationarity must be applied to Dirichlet estimator
> because at begining of coding We do not have statistics even for stationary source.

But we do have it. Instead, it's really hard to find a case where a single or
really few counters would be used.
Normally there're quite enough counters in a context model so that we
can collect our statistics. So, e.g., you can test the model with different
values of d and find the one that gives the smallest output. Of course it's
also possible to use d1,d2,... for different ranges and optimize them all.

Btw, I kinda mix a static distribution into all my counters, guess it has
the same effect. (I don't actually mix it, but keep already mixed value
in the counter and update it accordingly).

Though anyway, no pure mathematical solution would have any practical application.
I mean, if you care about initial predictions of counters, you just have to use
something like:

Code:
```
t = Counter.Occurrences;
if( t<Limit ) {
  Counter.P = p = table[Counter.History];
  (Counter.History<<=1) += bit;
} else {
  p = Counter.P;
  Counter.Update( bit );
}
```

and tune the values in the table[] to your data.
Also you can simulate any theoretical formula with it, so it's superior.

19. Originally Posted by Shelwien
Whatever k there is, the number of binary strings of length n with k ones is
C(n,k)=n!/k!/(n-k)!
So given that we know the current k and n, there're C(n+1,k)/(n+1-k) different
strings of length n+1 with zero at last position, and C(n+1,k+1)/(k+1) strings
with one at last position, so we can calculate a relative frequency aka
probability.
C(n,k) - number of strings with the given n & k. I agree.
These strings are equiprobable and You can simply count them to get a probability estimation. I agree again (except that the estimation will be different from what You wrote).
But after concatenation of the old string with 0 & 1, the strings will not become equiprobable, because 0 & 1 are not equiprobable and We can not simply count them. We disagree here.

Originally Posted by Shelwien
Instead, it's really hard to find a case where a single or really few counters would be used.

Yes, it is trouble. It is fully useless for PPM, but I hoped it could get some usage for weighting of predictions.

Originally Posted by Shelwien
Also you can simulate any theoretical formula with it, so it's superior.

You can generate any text using a million monkeys with typewriters. Is it a superior way of writing texts?

20. > But after concatenation of old string with '0' & '1', strings will not become
> equiprobable because '0' & '1' are not equiprobable and We can not simply count
> them. We disagree here.

Look, it's totally similar to what you did with likelihoods, but likelihood is
an approximation, and the real P(s) is 1/C(n,k) because there're C(n,k) possible
strings.

>> Instead, its really hard to find a case where a single or really few counters
>> would be used.
>
> Yes, it is trouble . It is fully useless for PPM, but I hoped it can get some
> usage for weightning of predictions.

1. Even if you don't have multiple similar counters within a model, it's still
possible to collect the necessary statistics with different datasets.

2. Have you seen that thread?
http://www.encode.su/forums/index.ph...ic=586#msg7087

>> Also you can simulate any theoretical formula with it, so its superior.
>
> You can generate any text using million of monkeys with typewriters.
> Is it superior way for writing texts?

Guess it could be, if there was a known measure of text quality.
Luckily, we do have such a measure for compression quality.

21. Originally Posted by Shelwien
Look, it's totally similar to what you did with likelihoods, but likelihood is
an approximation
Why?

Originally Posted by Shelwien
and the real P(s) is 1/C(n,k) because there're C(n,k) possible
strings.
Grrrr... It is true only if You select one string out of a set of strings with equal n & k, but there can be other k...

22. >> Look, it's totally similar to what you did with likelihoods, but likelihood is
>> an approximation
>
> Why?

It's P(n,k) = O(n) * (k/n)^k * ((n-k)/n)^(n-k), right?

1. There's no corresponding 1/P(n,k) different cases, so it's obviously an
approximation, as there's no underlying "physical objects".

2. By applying Stirling's approximation to C(n,k), you'd get

n! ~= Sqrt(2*Pi*n)*n^n/Exp(n) = C*n^(n+.5)/Exp(n)
C(n,k) = n!/k!/(n-k)!
= C*n^(n+.5)/Exp(n) / C*k^(k+.5)/Exp(k) / C*(n-k)^(n-k+.5)/Exp(n-k)
= n^(n+.5) / k^(k+.5) / (n-k)^(n-k+.5) / C =
= Sqrt(n/k/(n-k))/C * n^n / k^k / (n-k)^(n-k) =
= Sqrt(n/k/(n-k))/C * (n/k)^k * (n/(n-k))^(n-k)
and 1/C(n,k) = C*Sqrt(k*(n-k)/n) * (k/n)^k * ((n-k)/n)^(n-k) =
= O(n) * (k/n)^k * ((n-k)/n)^(n-k) = likelihood

That's why I said that if you get something by approximation, then
it's an approximation.

>> and the real P(s) is 1/C(n,k) because there're C(n,k) possible
>> strings.
>
> Grrrr...

That's my line.

> It is true only if You select one string out of a set of strings with
> equal n & k, but there can be other k...

Yeah, and we're talking about probability estimation based only on
distributions, not on contexts. So if current state is (n,k) then
we can have (n+1,k) or (n+1,k+1) on next step, C(n+1,k)+C(n+1,k+1)
combinations in total as they don't overlap.

Edit: Guess I'd better add that to get a probability we have to count
the combinations ending with 1 - that'd be C(n,k-1)+C(n,k)=C(n+1,k)

So p1 = C(n+1,k) / (C(n+1,k)+C(n+1,k+1)) = (k+1)/(n+2)

23. Originally Posted by Shelwien
1. There's no corresponding 1/P(n,k) different cases, so it's obviously an
approximation, as there's no underlying "physical objects".
There are 2^n different cases, because the source can generate any string of length n, and our selection p=k/n maximizes the probability of our string s.

Originally Posted by Shelwien
Yeah, and we're talking about probability estimation based only on
distributions, not on contexts. So if current state is (n,k) then
we can have (n+1,k) or (n+1,k+1) on next step, C(n+1,k)+C(n+1,k+1)
combinations in total as they don't overlap.
Grrr.... even GRRRRR.... these cases do not overlap, but they have different probabilities and You can not simply count them.

ES> So p1 = C(n+1,k) / (C(n+1,k)+C(n+1,k+1)) = (k+1)/(n+2)

Great result!

24. > > 1. There's no corresponding 1/P(n,k) different cases, so it's obviously an
> > approximation, as there's no underlying "physical objects".
>
> There is 2^n different cases because source can generate
> any string with length n and our selection p=k/n maximizes
> probability of our string s.

Well, ok, that's your source, so it can do what you want.
But then at least explain what k!=0 means, e.g. for the string
of all zeros, which is one of the 2^n?

> > Yeah, and we're talking about probability estimation based only on
> > distributions, not on contexts. So if current state is (n,k) then
> > we can have (n+1,k) or (n+1,k+1) on next step, C(n+1,k)+C(n+1,k+1)
> > combinations in total as they don't overlap.
>
> Grrr.... even GRRRRR....

Please don't be hasty and read my previous post as many times
as you'd need to understand it. I really paid attention when
writing it, so I'm pretty sure that there are no mistakes.

> these cases do not overlap, but
> they have different probability and You can not simply count
> them.

Why do they have different probabilities?
We supposedly only know that there're k ones in the current
string of length n, and nothing more specific. Also we don't
know the probability of next bit. So we can enumerate
all the possible cases on the next step, then select the
cases ending with 1, and get our probability by dividing these
numbers.

> > So p1 = C(n+1,k) / (C(n+1,k)+C(n+1,k+1)) = (k+1)/(n+2)
>
> Great result!

2. Your formula with k^k etc. can be derived from it by
replacing factorials with a Stirling approximation.

25. It becomes tiresome.
"Мамой клианус, да!"
Code:
```
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* FRAND() was not defined in the original post; a uniform [0,1) source is assumed */
#define FRAND() (rand()/(RAND_MAX+1.0))

enum { I=1000, N=10 };
int main()
{
    int i, k, n, bit;
    double p, IdealCodeLen, CodeLen[3][N][N], Summ[3], Err[3], ttt[N+1];
    for (ttt[i=0]=1.0;i < N;i++)
        ttt[i]=(ttt[i+1]=pow(i+1.0,i+1.0))/ttt[i];
    for (n=0;n < N;n++)
        for (k=0;k <= n;k++) {
            CodeLen[0][n][k]=-log(ttt[k]/(ttt[k]+ttt[n-k]))/log(2.0); /* MP        */
            CodeLen[1][n][k]=-log((k+0.5)/(n+1.0))/log(2.0);          /* Dirichlet */
            CodeLen[2][n][k]=-log((k+1.0)/(n+2.0))/log(2.0);          /* Laplas    */
        }
    for (Err[0]=Err[1]=Err[2]=i=0;i < I;i++) {
        p=FRAND();
        for (IdealCodeLen=Summ[0]=Summ[1]=Summ[2]=k=n=0;n < N;n++, k += bit) {
            bit=(FRAND() <= p);
            Summ[0] += (bit)?(CodeLen[0][n][k]):(CodeLen[0][n][n-k]);
            Summ[1] += (bit)?(CodeLen[1][n][k]):(CodeLen[1][n][n-k]);
            Summ[2] += (bit)?(CodeLen[2][n][k]):(CodeLen[2][n][n-k]);
            IdealCodeLen += -((bit)?(log(p)):(log(1.0-p)))/log(2.0);
        }
        Err[0] += fabs((Summ[0]-IdealCodeLen)/IdealCodeLen/N);
        Err[1] += fabs((Summ[1]-IdealCodeLen)/IdealCodeLen/N);
        Err[2] += fabs((Summ[2]-IdealCodeLen)/IdealCodeLen/N);
    }
    printf("Average relative error for        MP-estimator: %8.6f\n",Err[0]/I);
    printf("Average relative error for Dirichlet estimator: %8.6f\n",Err[1]/I);
    printf("Average relative error for    Laplas estimator: %8.6f\n",Err[2]/I);
    return 0;
}```
Program output:
Code:
```
Average relative error for        MP-estimator: 0.109592
Average relative error for Dirichlet estimator: 0.124685
Average relative error for    Laplas estimator: 0.168420```

26. > It becomes tiresome.

Well, that certainly explains things.
I really meant a different data source/model.
The only question is if there are any applications where the source really generates data using fixed probabilities.

27. The program contains a bug: there is no need to divide by N.
Output of corrected program for N=10:
Code:
```Average relative error for        MP-estimator: 1.095918
Average relative error for Dirichlet estimator: 1.246853
Average relative error for    Laplas estimator: 1.684201```
Output of corrected program for N=100:
Code:
```Average relative error for        MP-estimator: 0.112389
Average relative error for Dirichlet estimator: 0.126469
Average relative error for    Laplas estimator: 0.180794```
Thus, We can conclude that N=10 is too small a string length for convergence.

> The only question is if there are any applications where the source really generates data using fixed probabilities.

As a rule, We do not have enough statistics even for estimation of stationary sources; for nonstationary ones, the size of typical data must be much larger.

28. Well, I'd rewritten your program to show the accumulated code lengths instead of some random averages.

And my result somehow is like this:
Code:
```
compressed bitlength of 1000 random binary strings of length 100
Ideal: 71945.546875
MP-estimator: 75128.820313
Dirichlet estimator: 74939.500000
Laplas estimator: 74749.976563```

29. And what?
Rougher methods give better results for nearly equiprobable sources (p ~ 0.5), and your accumulated sums consist mainly of the encoded output of these sources, because for p ~ 0.0 or p ~ 1.0 the encoded strings are too short to have a visible impact on your accumulated sums.

Ok, here I added a couple of lines:
Code:
```
p = FRAND();
float r = 0.25;
p = (p<0.5) ? p*r : 1-(1-p)*r;
```

Code:
```compressed bitlength of 1000 random binary strings of length 100
Ideal: 31728.228516
MP-estimator: 34714.800781
Dirichlet estimator: 34734.050781
Laplas estimator: 35370.746094
(k+0.40)/(n+0.80) estimator: 34685.050781``` 