
Thread: Updating the linear mixer with BFA

  1. #1
    Shelwien (Administrator)

    Updating the linear mixer with BFA

    People seem to ignore an important concept in mixer implementations,
    so here I finally made a demo of my results with a method called BFA
    (bruteforce adaptation), also known as Likelihood Maximization.

    http://nishi.dreamhosters.com/u/rc_v4.rar

    Now, let's look at the results first. "static" is the fixed mix of order0 and order1 again,
    "counter_backprop" demonstrates the already mentioned mixer which updates the mixed probability
    as a counter, and BFA* are versions of the method which I'd like to describe here.
    Note that
    (1) adaptive linear mixing does help here, even without mixer contexts or chains/trees
    (2) BFA1-2 show better results than my usual linear update (despite having fewer parameters)

    Code:
    216774  static, http://nishi.dreamhosters.com/u/rc_v3.rar
    216089  counter_backprop
    216377  BFA0
    215907  BFA1/NUM=65
    215846  BFA2/NUM=65
    220946  BFA3/DEPTH=64
    218978  BFA3/DEPTH=128
    217358  BFA3/DEPTH=256
    216437  BFA3/DEPTH=512
    216040  BFA3/DEPTH=1024
    1. BFA0

    We already know that the weight w in the mix (p1*(1-w)+p2*w) represents the
    probability of model-2 compressing the bit better than model-1.
    This information can be directly translated into a mixer update formula:
    Code:
      word w;       // mixing weight, an estimate of P( (l1<l2)==0 )
      uint l1,l2;   // exponentially decayed codelengths of model-1 and model-2
      void Update( int bit, int p1,int p2, int wr, int cwr, int cmw ) {
        if( bit ) p1=SCALE-p1, p2=SCALE-p2;      // switch both estimates to P(bit)
        l1 -= (int(l1-logtbl[p1])*wr) >> 12;     // l1 = l1*(1-wr/4096) + logtbl[p1]*wr/4096
        l2 -= (int(l2-logtbl[p2])*wr) >> 12;
        uint y = (l1<l2);                        // 1 if model-1 currently codes better
        LinearUpdate( w, y, cwr, cmw );          // usual linear counter update of w
      }
    Here we compute the codelengths independently for model-1 and model-2
    with an optimized version of the formula l' = l*(1-wr) - log2(p)*wr.
    When (l1<l2) is true, the probability w decreases
    (because it's the probability of 0 here), so the weight of p1 in the
    mix increases.
    Thus, w is an estimate of the probability of ((l1<l2)==0),
    ie of model-2 compressing the next bit better.
    Sure, this is fairly imprecise, but, as we can see, it's already an
    improvement over static mixing, while it's known that beating static
    mixing with these inputs isn't easy.
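
    The snippet above relies on logtbl[] and LinearUpdate(), which aren't shown.
    Here's a minimal sketch of how they could look - note that the fixed-point
    scales (SCALElog=15, codelengths in 1/256-bit units) and the exact meaning
    of cwr/cmw are my assumptions, not the actual rc_v4 definitions:
    Code:
      #include <math.h>
      typedef unsigned int   uint;
      typedef unsigned short word;

      enum { SCALElog=15, SCALE=1<<SCALElog };
      static uint logtbl[SCALE+1];

      void InitLogTbl( void ) {
        // logtbl[p] ~ -log2(p/SCALE) in 1/256-bit units; p=0 is clamped
        logtbl[0] = 256*SCALElog;
        for( int p=1; p<=SCALE; p++ ) logtbl[p] = uint( -log2( double(p)/SCALE )*256 + 0.5 );
      }

      // plain linear counter update of w toward its target; cwr is assumed to be a
      // rate in 1/4096 units and cmw a margin keeping w away from 0 and SCALE
      void LinearUpdate( word& w, uint y, int cwr, int cmw ) {
        int target = y ? cmw : SCALE-cmw;   // w estimates P(y==0), so y=1 pushes it down
        w += ((target - int(w))*cwr) >> 12;
      }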

    2. BFA1

    We can go further (and slower) and implement an actual minimum entropy
    (=maximum likelihood) search for some set of weight values:

    Code:
      enum{ NUM=64+1, iNUM=SCALE/(NUM-1) };
      int w;         // mixing weight, one of the NUM grid values 0..SCALE
      uint l[NUM];   // decayed codelength accumulated for each candidate weight
      void Update( int bit, int p1,int p2, int wr, uint Limit ) {
        if( uint(p2-p1+Limit) > uint(2*Limit) ) {    // update only when |p2-p1|>Limit
          int W,i,j;
          int ml=l[0]; j=0;
          if( bit ) p1=SCALE-p1, p2=SCALE-p2;
          for( i=0,W=0; i<NUM; i++,W+=iNUM ) {
            int p = p1*SCALE + (p2-p1)*W;            // mixed probability at weight W, scaled by SCALE
            l[i] -= (int(l[i]-logtbl[p>>SCALElog])*wr) >> 12;
            if( l[i]<ml ) ml=l[i],j=W;               // track the weight with minimum codelength
          }
          w = j;
        }
      }
    It's very straightforward and, as we can see from the results, actually pretty
    good. Note the update condition though - the update is only performed when
    |p2-p1|>Limit, ie it's skipped when the submodel estimates are too similar.
    It's a speed optimization, but it actually improves compression too,
    because of the limited precision.
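
    For reference, here's how such a weight would be turned into the coded
    probability - not code from the archive, just the p1*(1-w)+p2*w formula
    rewritten for an integer w scaled by SCALE (SCALElog=15 is my assumption):
    Code:
      typedef unsigned int uint;
      enum { SCALElog=15, SCALE=1<<SCALElog };

      // linear mix: equals (p1*(SCALE-w)+p2*w)/SCALE up to rounding
      inline uint Mix( int p1, int p2, int w ) {
        return uint( p1 + (((p2-p1)*w) >> SCALElog) );
      }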

    3. BFA2

    In BFA0/BFA1 I used a lookup table for log2 values, but a different mathematical
    solution is also possible - instead of searching for the function's minimum,
    we can find the root of its derivative.
    So, the function is (with y = the next bit value)
    l[w] = (1-y)*log2(p1+(p2-p1)*w) + y*log2(1-(p1+(p2-p1)*w))
    and its derivative at y=0 is (dropping the constant 1/ln2 factor, which doesn't affect the root)
    l'[w] = (p2-p1)/pmix = (p2-p1)/(p1+(p2-p1)*w) = 1/(p1/(p2-p1)+w)

    Code:
      int w;
      float l[NUM];   // decayed derivative accumulated for each candidate weight
      void Update( int s, int p1,int p2, int wr, uint Limit ) {
        int W,i,j;
        if( uint(p2-p1+Limit) <= uint(2*Limit) ) return; // skip near-equal inputs, as in BFA1 (also avoids division by zero)
        float ml=fabs(l[0]); j=0;
        if( s ) p1=SCALE-p1, p2=SCALE-p2;
        float qp = float(p1)*SCALE / (p2-p1);
        for( i=0,W=0; i<NUM; i++,W+=iNUM ) {
          float dp = float(SCALE) / (qp+W);         // derivative of the codelength at weight W
          l[i] = (l[i]*(4096-wr) + dp*wr)/4096;     // exponential decay, wr in 1/4096 units
          if( fabs(l[i])<ml ) ml=fabs(l[i]),j=W;    // track the weight whose derivative sum is closest to 0
        }
        w = j;
      }
    So here we accumulate the derivatives of the BFA1 codelengths and pick
    the weight value whose accumulated derivative is closest to 0.
    It's also possible to use linear interpolation for l[i] and output
    weight values with higher precision, but this trick is not included
    for simplicity (a possible version is sketched below).
    Note that the result is better here (maybe because the division has higher
    precision than the BFA1 log2 lookup table), but it's also slower - a lookup
    table for dp and integer arithmetic weren't used (because I'm lazy).
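
    The interpolation trick mentioned above could look like this - a sketch of my
    own, not part of rc_v4, assuming the same l[], NUM and iNUM as in the BFA2
    snippet:
    Code:
      #include <math.h>
      enum { SCALElog=15, SCALE=1<<SCALElog, NUM=64+1, iNUM=SCALE/(NUM-1) };

      // refine the BFA2 output: locate the sign change of the accumulated derivative
      // and linearly interpolate the root between the two neighbouring grid points
      int InterpolatedWeight( const float* l ) {
        for( int i=0; i+1<NUM; i++ )
          if( (l[i]<=0) != (l[i+1]<=0) ) {        // root lies between W=i*iNUM and (i+1)*iNUM
            float t = l[i] / (l[i]-l[i+1]);       // fraction of the step to the zero crossing
            return int( (i+t)*iNUM + 0.5 );
          }
        // no sign change: fall back to the BFA2 behaviour (argmin of |l[i]|)
        int j=0;
        for( int i=1; i<NUM; i++ ) if( fabs(l[i])<fabs(l[j]) ) j=i;
        return j*iNUM;
      }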

    4. BFA3

    And then we have another alternative. Instead of accumulating the
    cost function of the inputs for fixed parameter values, we can store
    a limited history of inputs, which lets us evaluate the cost
    approximation for any weight value. So gradient descent and similar
    methods can be used, and iteration over w becomes unnecessary.
    An interesting detail is that in our case it's possible to store
    only p1/(p2-p1) instead of the two inputs.

    Code:
      int w;
      float l[DEPTH];    // history of p1*SCALE/(p2-p1) values
      float win[DEPTH];  // window coefs
      void Update( int s, int p1,int p2, int wr, uint Limit ) {
        int W,i,j;
        if( uint(p2-p1+Limit) <= uint(2*Limit) ) return; // skip near-equal inputs, as in BFA1-2
        float ml=1E38; j=0;
        if( s ) p1=SCALE-p1, p2=SCALE-p2;
        float qp = float(p1)*SCALE / (p2-p1);
        // shift the history, append the next element
        for( i=1; i<DEPTH; i++ ) l[i-1]=l[i]; l[DEPTH-1]=qp;
        // estimate the cost derivative for all weight values, like in BFA2
        for( W=0; W<=SCALE; W+=iNUM ) {
          float a = 0;
          for( i=0; i<DEPTH; i++ ) a += win[i] / (l[i]+W);  // windowed sum of derivatives
          if( fabs(a)<ml ) ml=fabs(a),j=W;
        }
        w = j;
      }
    Well, as can be seen from the results, the precision increases along with
    the history depth, but I wasn't patient enough to run it with enough history
    to actually produce a better result than BFA2.
    Anyway, this version is included mainly for completeness, as I was never
    able to observe noticeably better results with high weight precision.
    But there could be other cases where a similar approach would be helpful;
    also note that it's possible to use any window function here (see the sketch
    below), unlike BFA1-2 where it's always exponential.
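
    For example, the win[] coefficients could be filled like this (my own sketch,
    not code from the archive) - an exponential window roughly reproduces the
    BFA1/BFA2 decay, while a rectangular one weighs the whole history equally:
    Code:
      // win[depth-1] corresponds to the newest history element, matching the shift above
      void InitWindow( float* win, int depth, int wr ) {
        float k = 1 - float(wr)/4096;                    // same decay factor as in the BFA2 update
        float c = 1;
        for( int i=depth-1; i>=0; i-- ) win[i]=c, c*=k;  // exponential window
        // rectangular window instead: for( int i=0; i<depth; i++ ) win[i]=1.0f/depth;
      }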

    5. Conclusion

    Maximum likelihood methods are actively used in telecommunication algorithms
    (like speech codecs and advanced error correction), but for some reason
    remain unknown to many lossless data compression developers.
    The main reasons may be speed and memory usage, but at the very least the
    method can be used to estimate the potential compression - obviously, the
    same approach is applicable to the logistic mixer update too (and does
    improve compression), and in fact many other model parameters can be made
    adaptive (eg I successfully applied it to ppmy weight function parameters).
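
    As an illustration of that last point, here's a rough sketch (mine, not code
    from rc_v4) of the same bruteforce search applied to a single-weight logistic
    mix pmix = squash(st(p1) + w*(st(p2)-st(p1))), with probabilities as floats
    and p meaning P(bit==1):
    Code:
      #include <math.h>

      enum { LNUM=65 };
      struct LogisticBFA {
        double cost[LNUM];  // decayed codelength for each candidate weight
        double w;           // current weight estimate in [0,1]
        LogisticBFA() : w(0.5) { for( int i=0; i<LNUM; i++ ) cost[i]=0; }

        static double st( double p )     { return log( p/(1-p) ); }  // stretch
        static double squash( double x ) { return 1/(1+exp(-x)); }   // inverse of st

        double Mix( double p1, double p2 ) const {
          return squash( st(p1) + w*(st(p2)-st(p1)) );
        }

        void Update( int bit, double p1, double p2, double rate=1./4096 ) {
          int best=0;
          for( int i=0; i<LNUM; i++ ) {
            double W  = double(i)/(LNUM-1);
            double pm = squash( st(p1) + W*(st(p2)-st(p1)) );
            double cl = -log2( bit ? pm : 1-pm );   // codelength of this bit at weight W
            cost[i] += rate*(cl - cost[i]);         // exponential decay, like in BFA1
            if( cost[i]<cost[best] ) best=i;
          }
          w = double(best)/(LNUM-1);
        }
      };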

  2. #2
    Member
    Nice ideas - as my experiments also showed, there is little to no improvement
    over classical methods. Without any context, that is a single mixer (and no fiddling with the update rate), I got

    215.792 with least-squares
    215.767 with least-cost

