People seem to ignore an important concept in mixer implementations,

so here I finally made a demo of my results with a method called BFA

(bruteforce adaptation), also known as Likelihood Maximization.

http://nishi.dreamhosters.com/u/rc_v4.rar

Now, let's look at the results first. "Static" is that fixed mix of order0 and order1 again,

"counter_backprop" demonstrates the already mentioned mixer which updates the mixed probability

as a counter, and BFA* are the versions of the method which I'd like to describe here.

Note that

(1) adaptive linear mixing does help here, even without mixer contexts or chains/trees

(2) BFA1-2 show better results than my usual linear update (despite having fewer parameters)

Code:
216774 static, http://nishi.dreamhosters.com/u/rc_v3.rar
216089 counter_backprop
216377 BFA0
215907 BFA1/NUM=65
215846 BFA2/NUM=65
220946 BFA3/DEPTH=64
218978 BFA3/DEPTH=128
217358 BFA3/DEPTH=256
216437 BFA3/DEPTH=512
216040 BFA3/DEPTH=1024

1. BFA0

We already know that the weight w in the mix (p1*(1-w)+p2*w) represents a

probability of model-2 compressing the bit better than model-1.
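
(In fixed point that mix would be computed roughly as below; the [0..SCALE] range

of w and the SCALElog shift follow the snippets further down, but this exact helper is just a sketch, not code from the demo.)

Code:
// A sketch of the mix itself (not taken from the demo):
// w in [0..SCALE] is the weight of p2, SCALElog = log2(SCALE)
int Mix( int p1, int p2, int w ) {
  return ( p1*SCALE + (p2-p1)*w ) >> SCALElog;  // = p1*(1-w) + p2*w
}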

And well, this information can be directly translated into a mixer update formula:

Code:
// (SCALE, word, logtbl[] and LinearUpdate() are defined elsewhere in the demo source)
word w;      // mixer weight, an estimate of P( model-2 codes the next bit better )
uint l1,l2;  // exponentially decayed average codelengths of model-1 and model-2

void Update( int bit, int p1, int p2, int wr, int cwr, int cmw ) {
  if( bit ) p1=SCALE-p1, p2=SCALE-p2;   // mirror, so p1/p2 are probabilities of the coded bit
  l1 -= (int(l1-logtbl[p1])*wr) >> 12;  // l1 += (logtbl[p1]-l1)*wr/4096
  l2 -= (int(l2-logtbl[p2])*wr) >> 12;  // l2 += (logtbl[p2]-l2)*wr/4096
  uint y = (l1<l2);                     // 1 when model-1 currently has the shorter code
  LinearUpdate( w, y, cwr, cmw );       // usual probability-counter update of w toward y
}

Here we compute the codelengths independently for model-1 and model-2

with an optimized version of the formula l' = l*(1-wr) - log2(p1)*wr.

When (l1<l2) is true, the probability w decreases

(because it's the probability of a 0 here), so the weight of p1 in the

mix increases.

Thus, w is an estimate of the probability of ((l1<l2)==0),

ie of model-2 compressing the next bit better.

Sure, this is fairly imprecise, but, as we can see, it is already an

improvement over static mixing, while it's known that with these inputs

it's not quite easy to achieve that.
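
By the way, LinearUpdate() itself isn't listed here; a guess at its shape

(the demo's actual rates and bounds may well differ) could be:

Code:
// A guess at the LinearUpdate() used by BFA0 - not necessarily the demo's version.
// w is treated as the probability of y==0, cwr is an update rate in 1/4096 units,
// and cmw is assumed to be a margin keeping w away from the 0/SCALE extremes.
void LinearUpdate( word& w, uint y, int cwr, int cmw ) {
  int target = y ? cmw : SCALE-cmw;    // y==1 -> model-1 coded better -> pull w down
  w += ((target - int(w))*cwr) >> 12;
}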

2. BFA1

We can go further (and slower) and implement an actual minimum entropy

(=maximum likelihood) search for some set of weight values:

Code:
enum{ NUM=64+1, iNUM=SCALE/(NUM-1) };
int w;        // selected weight
uint l[NUM];  // decayed codelength accumulator for each candidate weight

void Update( int bit, int p1, int p2, int wr, uint Limit ) {
  // only update when |p2-p1|>Limit, ie skip when the submodels mostly agree
  if( uint(p2-p1+Limit) > uint(2*Limit) ) {
    int W,i,j;
    int ml=l[0]; j=0;
    if( bit ) p1=SCALE-p1, p2=SCALE-p2;
    for( i=0,W=0; i<NUM; i++,W+=iNUM ) {
      int p = p1*SCALE + (p2-p1)*W;                      // mix with candidate weight W
      l[i] -= (int(l[i]-logtbl[p>>SCALElog])*wr) >> 12;  // accumulate its codelength
      if( l[i]<ml ) ml=l[i], j=W;                        // remember the best weight so far
    }
    w = j;
  }
}

It's very straightforward and, as we can see from the results, is pretty

good actually. Note the update condition though - it skips updates when

|p2-p1| <= Limit, ie when the submodel estimations are

too similar - it's a speed optimization, but it actually improves compression too,

because of limited precision.
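
For completeness, the logtbl[] referenced by BFA0/BFA1 isn't listed here either;

it could be built roughly like this (the fixed-point scale of the stored codelengths is an assumption, not taken from the demo):

Code:
#include <math.h>
// A sketch of the log2 lookup table used by BFA0/BFA1;
// codelengths are stored in 1/256-bit units here (an assumed scale).
uint logtbl[SCALE+1];
void InitLogTbl( void ) {
  for( int p=1; p<=SCALE; p++ )
    logtbl[p] = uint( -log2( double(p)/SCALE )*256 + 0.5 );
  logtbl[0] = logtbl[1];  // p=0 shouldn't occur, but keep the table safe
}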

3. BFA2

So far I used a lookup table for log2 values, but a different mathematical

solution is also possible - instead of searching for the function's minimum,

we can find the root of its derivative.

So, the function is (y = next bit value)

l[w] = (1-y)*log2(p1+(p2-p1)*w)+y*log2(1-(p1+(p2-p1)*w))

and its derivative at y=0 is

l'[w] = (p2-p1)/pmix = (p2-p1)/(p1+(p2-p1)*w) = 1/(p1/(p2-p1)+w)

(the y=1 case takes the same form after mirroring p1 and p2 - which is exactly what the p=SCALE-p flip in the code does - so this one expression is enough).

Code:
int w;
float l[NUM];  // decayed derivative accumulator for each candidate weight

void Update( int s, int p1, int p2, int wr, uint Limit ) {
  int W,i,j;
  float ml=fabs(l[0]); j=0;
  if( s ) p1=SCALE-p1, p2=SCALE-p2;
  // the same |p2-p1|>Limit guard as in BFA1 is assumed here, which also
  // avoids a division by zero when the submodels give the same estimate
  if( uint(p2-p1+Limit) <= uint(2*Limit) ) return;
  float qp = float(p1)*SCALE / (p2-p1);       // p1/(p2-p1), scaled like W
  for( i=0,W=0; i<NUM; i++,W+=iNUM ) {
    float dp = float(SCALE) / (qp+W);         // derivative 1/(p1/(p2-p1)+w) at weight W
    l[i] = (l[i]*(4096-wr) + dp*wr)/4096;     // exponentially decayed accumulation
    if( fabs(l[i])<ml ) ml=fabs(l[i]), j=W;   // derivative closest to 0 wins
  }
  w = j;
}

So here we're accumulating the derivatives of the BFA1 codelengths, and

taking the weight value where the accumulated derivative is closest to 0.

It's also possible to use linear interpolation for l[i] and output

weight values with higher precision, but this trick is not included

for simplicity.
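
Just to illustrate the idea, it could look roughly like this (a sketch only,

not part of the demo; RefineWeight is a hypothetical helper working on BFA2's l[] array after its update loop):

Code:
#include <math.h>
// Hypothetical helper: locate the sign change of the accumulated derivative l[]
// and interpolate the zero crossing between the two neighbouring grid points.
int RefineWeight( const float* l ) {
  for( int i=1; i<NUM; i++ ) {
    if( (l[i-1]<0) != (l[i]<0) ) {
      float t = l[i-1] / (l[i-1]-l[i]);    // fractional position of the root
      return int( ((i-1)+t)*iNUM + 0.5 );  // weight with sub-grid precision
    }
  }
  int j=0;  // no sign change - fall back to the grid point closest to 0
  for( int i=1; i<NUM; i++ ) if( fabs(l[i])<fabs(l[j]) ) j=i;
  return j*iNUM;
}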

Note that the result is better here (maybe because of the higher precision

of the division compared to BFA1's log2 lookup table), but it's also slower -

a lookup table for dp and integer arithmetic weren't used

(because I'm lazy).

4. BFA3

And then we have another alternative. Instead of accumulating the

cost function of inputs for fixed parameter values, we can store

a limited history of inputs and become able to evaluate the cost

approximation for any weight value. So gradient descent and similar

methods can be used, and iteration over w becomes unnecessary.

An interesting detail is that in our case it's possible to store only

p1/(p2-p1) instead of two inputs, since the derivative 1/(p1/(p2-p1)+w) depends only on that ratio.

Code:
int w;
float l[DEPTH];    // history of p1/(p2-p1) values
float win[DEPTH];  // window coefs

void Update( int s, int p1, int p2, int wr, uint Limit ) {
  int W,i,j;
  float ml=1E38; j=0;
  // the same |p2-p1|>Limit skip as in BFA1/BFA2 is assumed here
  if( uint(p2-p1+Limit) <= uint(2*Limit) ) return;
  if( s ) p1=SCALE-p1, p2=SCALE-p2;
  float qp = float(p1)*SCALE / (p2-p1);
  // shift the history, append the next element
  for( i=1; i<DEPTH; i++ ) l[i-1]=l[i];
  l[DEPTH-1]=qp;
  // estimate the function for all weight values, like in BFA2
  for( W=0; W<=SCALE; W+=iNUM ) {
    float a = 0;
    for( i=0; i<DEPTH; i++ ) a += win[i] / (l[i]+W);  // windowed sum of derivatives
    if( fabs(a)<ml ) ml=fabs(a), j=W;
  }
  w = j;
}

Well, as can be seen from the results, the precision increases along with

history depth, but I wasn't patient enough to run it with enough history

to actually produce a better result than BFA2.

Anyway, this version is included mainly for completeness, as I wasn't ever

able to observe noticeably better results with high weight precision.

But there could be other cases where a similar approach would be helpful;

also note that it's possible to use any window function here, unlike BFA1-2,

where it's always exponential.
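
For example, win[] could be filled like this (an assumption - the window actually used in the demo isn't shown here):

Code:
// A sketch of possible win[] setups for BFA3. An exponential window reproduces
// the BFA1/BFA2 behaviour; a rectangular or triangular one plugs in just as easily.
void InitWindow( float rate ) {
  float c = 1;
  for( int i=DEPTH-1; i>=0; i--, c*=(1-rate) )
    win[i] = c;   // the newest element (index DEPTH-1) gets the largest coef
  // a rectangular window instead would be: for( int i=0; i<DEPTH; i++ ) win[i]=1;
}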

5. Conclusion

Maximum likelihood methods are actively used in telecommunication algorithms

(like speech codecs and advanced error correction), but for some reason

remain unknown to many lossless data compression developers.

The main reasons may be speed and memory usage, but at least the method can be used

for estimating the potential compression - obviously, the same approach is

applicable to the logistic mixer update too (and does improve the compression)

and in fact many other model parameters can be made adaptive

(eg I successfully applied it to ppmy weight function parameters).