# Thread: Updating the linear mixer with BFA

1. ## Updating the linear mixer with BFA

People seem to ignore an important concept in mixer implementations,
so here I finally made a demo of my results with a method called BFA
(brute-force adaptation), also known as likelihood maximization.

http://nishi.dreamhosters.com/u/rc_v4.rar

Now, let's look at the results first. "static" is that fixed mix of order0 and order1 again,
"counter_backprop" is my usual counter-based linear update, and BFA* are the versions of the method which I'd like to describe here.
Note that
(1) adaptive linear mixing does help here, even without mixer contexts or chains/trees;
(2) BFA1-2 show better results than my usual linear update (despite having fewer parameters).

Code:
```
216774  static, http://nishi.dreamhosters.com/u/rc_v3.rar
216089  counter_backprop
216377  BFA0
215907  BFA1/NUM=65
215846  BFA2/NUM=65
220946  BFA3/DEPTH=64
218978  BFA3/DEPTH=128
217358  BFA3/DEPTH=256
216437  BFA3/DEPTH=512
216040  BFA3/DEPTH=1024
```
1. BFA0

We already know that the weight w in the mix (p1*(1-w)+p2*w) represents a
probability of model-2 compressing the bit better than model-1.
And well, this information can be directly translated into a mixer update formula:
Code:
```
word w;      // mixing weight = P( (l1<l2)==0 )
uint l1,l2;  // EMA codelengths of model-1 and model-2
void Update( int bit, int p1, int p2, int wr, int cwr, int cmw ) {
  if( bit ) p1=SCALE-p1, p2=SCALE-p2;
  // l' = l*(1-wr) - log2(p)*wr, in fixed point via logtbl
  l1 -= (int(l1-logtbl[p1])*wr) >> 12;
  l2 -= (int(l2-logtbl[p2])*wr) >> 12;
  uint y = (l1<l2);               // 1 if model-1's codelength is smaller
  LinearUpdate( w, y, cwr, cmw ); // usual probability-counter update of w
}
```
Here we compute the codelengths independently for model-1 and model-2,
using an optimized version of the formula l' = l*(1-wr) - log2(p)*wr.
When (l1<l2) is true, the probability w decreases
(because here it is a probability of 0), so the weight of p1 in the
mix increases.
Thus, w is an estimation of the probability of ((l1<l2)==0),
i.e. of model-2 compressing the next bit better.
Sure, this is fairly imprecise, but, as we can see, it is already an
improvement over static mixing, while it's known that with these inputs
even that is not easy to reach.

2. BFA1

We can go further (and slower) and implement an actual minimum entropy
(=maximum likelihood) search for some set of weight values:

Code:
```
enum{ NUM=64+1, iNUM=SCALE/(NUM-1) };
int w;       // current weight of p2 in the mix
uint l[NUM]; // EMA codelength for each weight grid point
void Update( int bit, int p1, int p2, int wr, uint Limit ) {
  if( uint(p2-p1+Limit) > uint(2*Limit) ) { // skip when |p2-p1|<=Limit
    int W,i,j;
    if( bit ) p1=SCALE-p1, p2=SCALE-p2;
    uint ml=~0u; j=0;
    for( i=0,W=0; i<NUM; i++,W+=iNUM ) {
      int p = p1*SCALE + (p2-p1)*W; // mixed probability, scaled by SCALE
      l[i] -= (int(l[i]-logtbl[p>>SCALElog])*wr) >> 12;
      if( l[i]<ml ) ml=l[i],j=W;
    }
    w = j; // weight with the minimum estimated codelength
  }
}
```
It's very straightforward and, as we can see from the results, actually
pretty good. Note the update condition though: it skips the update when
|p2-p1| <= Limit, i.e. when the submodel estimations are too similar.
That is a speed optimization, but it actually improves compression too,
because of the limited precision.
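To see the search behave, here is a toy floating-point rewrite (my own simplification, not the rc_v4 code): when model-1 consistently assigns the coded bit a higher probability, the minimum-codelength grid point moves to W=0, i.e. full weight on p1:

```cpp
#include <cmath>

enum { NUM = 64+1 }; // weight grid points over [0,1]

// toy BFA1: EMA codelength per grid point, argmin picks the weight
struct BFA1Toy {
  double l[NUM] = {0};
  double w = 0.5;
  // p1, p2: both models' probabilities of the coded bit, wr: EMA rate
  void update( double p1, double p2, double wr ) {
    int j = 0; double ml = 1e30;
    for( int i = 0; i < NUM; i++ ) {
      double W = double(i)/(NUM-1);
      double pmix = p1 + (p2-p1)*W;
      l[i] += wr*( -std::log2(pmix) - l[i] ); // EMA of -log2(pmix)
      if( l[i] < ml ) ml = l[i], j = i;
    }
    w = double(j)/(NUM-1);
  }
};
```

Here the argmin lands at W=0 after the very first update, since -log2(pmix) grows monotonically as the mix shifts toward the worse model.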

3. BFA2

Here I used a lookup table for log2 values, but a different mathematical
solution is also possible: instead of searching for the function's minimum,
we can find the root of its derivative.
So, the cost function is (y = next bit value)
l[w] = -(1-y)*log2(p1+(p2-p1)*w) - y*log2(1-(p1+(p2-p1)*w))
and its derivative at y=0 is (dropping the sign and the 1/ln2 factor, which don't affect the root)
l'[w] = (p2-p1)/pmix = (p2-p1)/(p1+(p2-p1)*w) = 1/(p1/(p2-p1)+w)

Code:
```
int w;
float l[NUM]; // EMA of the cost derivative for each weight grid point
void Update( int s, int p1, int p2, int wr, uint Limit ) {
  int W,i,j;
  if( s ) p1=SCALE-p1, p2=SCALE-p2;
  float qp = float(p1)*SCALE / (p2-p1); // scaled p1/(p2-p1)
  float ml=1E38f; j=0;
  for( i=0,W=0; i<NUM; i++,W+=iNUM ) {
    float dp = float(SCALE) / (qp+W); // derivative 1/(p1/(p2-p1)+w)
    l[i] = (l[i]*(4096-wr) + dp*wr)/4096;
    if( fabs(l[i])<ml ) ml=fabs(l[i]),j=W;
  }
  w = j; // weight where the accumulated derivative is closest to 0
}
```
So here we're accumulating the derivatives of BFA1 codelengths and
choosing the weight value where the accumulated derivative is closest to 0.
It's also possible to use linear interpolation on l[i] and output
weight values with higher precision, but this trick is not included,
for simplicity.
Note that the result is better here (maybe because the division has
higher precision than BFA1's log2 lookup table), but it's also slower:
a lookup table for dp and integer arithmetic weren't used
(because I'm lazy).
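As a sanity check on the equivalence of the two searches (a toy float computation with example probabilities of my own choosing): since -log2 of a linear function of w is convex, the grid point where the accumulated derivative is closest to zero coincides with the brute-force minimum of the accumulated codelength:

```cpp
#include <cmath>

// two example steps: each model's probability of the actually coded bit
const int T = 2;
const double P1[T] = { 0.9, 0.2 };
const double P2[T] = { 0.3, 0.8 };

// accumulated codelength over the two steps, as a function of the weight
double cost( double w ) {
  double s = 0;
  for( int t = 0; t < T; t++ ) s += -std::log2( P1[t] + (P2[t]-P1[t])*w );
  return s;
}
// its derivative, up to the 1/ln2 factor (the BFA2 formula)
double dcost( double w ) {
  double s = 0;
  for( int t = 0; t < T; t++ ) s += -(P2[t]-P1[t]) / ( P1[t] + (P2[t]-P1[t])*w );
  return s;
}
// BFA1 style: grid point with the minimum cost
int argmin_grid( int n ) {
  int j = 0;
  for( int i = 1; i <= n; i++ ) if( cost(double(i)/n) < cost(double(j)/n) ) j = i;
  return j;
}
// BFA2 style: grid point where |derivative| is closest to zero
int root_grid( int n ) {
  int j = 0;
  for( int i = 1; i <= n; i++ )
    if( std::fabs(dcost(double(i)/n)) < std::fabs(dcost(double(j)/n)) ) j = i;
  return j;
}
```

For these inputs the exact optimum is w = 7/12, and both grid searches land on the same grid point.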

4. BFA3

And then we have another alternative. Instead of accumulating the
cost function of the inputs for fixed parameter values, we can store
a limited history of inputs and thus become able to evaluate the cost
approximation for any weight value. So gradient descent and similar
methods can be used, and iteration over w becomes unnecessary.
An interesting detail is that in our case it's possible to store
only p1/(p2-p1) instead of the two inputs.

Code:
```
int w;
float l[DEPTH];   // history of scaled p1/(p2-p1) values
float win[DEPTH]; // window coefs
void Update( int s, int p1, int p2, int wr, uint Limit ) {
  int W,i,j;
  if( s ) p1=SCALE-p1, p2=SCALE-p2;
  float qp = float(p1)*SCALE / (p2-p1);
  // shift the history, append the next element
  for( i=1; i<DEPTH; i++ ) l[i-1]=l[i];
  l[DEPTH-1]=qp;
  // estimate the derivative for all weight values, like in BFA2
  float ml=1E38f; j=0;
  for( W=0; W<=SCALE; W+=iNUM ) {
    float a = 0;
    for( i=0; i<DEPTH; i++ ) a += win[i] / (l[i]+W);
    if( fabs(a)<ml ) ml=fabs(a),j=W;
  }
  w = j;
}
```
Well, as can be seen in the results, the precision increases along with
the history depth, but I wasn't patient enough to run it with enough
history to actually beat BFA2.
Anyway, this version is included mainly for completeness, as I was never
able to observe noticeably better results from high weight precision.
But there could be other cases where a similar approach would be helpful;
also note that it's possible to use any window function here, unlike
BFA1-2, where it's always exponential.
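To illustrate the stored-history idea, here is a toy sketch (my own float version with a rectangular window, not the rc_v4 fixed-point code): only q_t = p1/(p2-p1) is stored per step, and the windowed derivative can then be evaluated at any weight:

```cpp
#include <cmath>
#include <deque>

// toy BFA3: history of q_t = p1/(p2-p1), rectangular window (win[i]=1)
struct BFA3Toy {
  std::deque<double> hist;
  size_t depth;
  explicit BFA3Toy( size_t d ) : depth(d) {}
  // p1, p2: both models' probabilities of the coded bit
  void push( double p1, double p2 ) {
    hist.push_back( p1/(p2-p1) );
    if( hist.size() > depth ) hist.pop_front();
  }
  // windowed cost derivative at weight w: sum of 1/(q_t + w)
  double dcost( double w ) const {
    double s = 0;
    for( double q : hist ) s += 1.0/(q + w);
    return s;
  }
  // grid search for |dcost| closest to zero, like the BFA3 inner loop
  double best_w( int n ) const {
    int j = 0;
    for( int i = 1; i <= n; i++ )
      if( std::fabs(dcost(double(i)/n)) < std::fabs(dcost(double(j)/n)) ) j = i;
    return double(j)/n;
  }
};
```

Because the raw q_t inputs are kept, any window function could be substituted for the rectangular one, as noted above.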

5. Conclusion

Maximum-likelihood methods are actively used in telecommunication algorithms
(like speech codecs and advanced error correction), but for some reason
remain unknown to many lossless data compression developers.
The main obstacles may be speed and memory usage, but at the least they can
be used to estimate the potential compression. Obviously, the same approach
is applicable to a logistic mixer's update too (and does improve compression);
e.g. I successfully applied it to ppmy's weight function parameters.

2. Nice ideas. As my experiments also showed, there is little to no improvement
over classical methods. Without any context, that is, with one mixer (and no fiddling with the update rate), I got:

215.792 with least-squares
215.767 with least-cost
