
Thread: Logistic mixing

  1. #1 by Sebastian (Member)

    Quote Originally Posted by Shelwien View Post
    The mixer is a 3-way logistic paq mixer.
    The minimization is done with respect to prediction error in the logistic domain.
    Maybe it would be better to minimize with respect to coding cost, i.e. -log_2(p), in the logistic domain too.
    There's also an obvious improvement to all these gradient descent approaches, but I think it would just cost too much speed.

  2. #2 by toffer (Programmer)
    Wrong. It minimizes coding cost - just have a look at my recent post

    https://sites.google.com/site/toffer...logisticmixing
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  3. #3 by Shelwien (Administrator)
    Thanks, I guess I missed a constant in http://ctxmodel.net/files/mix_test/stems5.png
    Not that it matters though

  4. #4 by Sebastian (Member)
    Quote Originally Posted by toffer View Post
    Wrong. It minimizes coding cost - just have a look at my recent post
    Ah ok, I really wondered why the "back-propagation" term is missing from the update rule.
    Your derivation is very clear and leads to a nice simple update formulation, thanks for that.

    I currently use something like that, with 2-input mixing for p:
    w += r*d, where d=(p2-p1)/p when bit=1, or d=(p1-p2)/(1-p) when bit=0,
    which is derived from minimizing coding cost in the linear domain (see the sketch below).

    Did you try finding the minimum on the line f(w+r*d)=min! instead of using a fixed step size?
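
    For concreteness, a minimal floating-point sketch of the fixed-step update just described (not the line-search variant asked about); the struct name and the clamping of w to [0,1] are illustrative assumptions, not code from any coder in this thread.
    Code:
    // 2-input linear mix with the coding-cost gradient update above
    struct LinMix2 {
      double w = 0.5;                                  // weight of p2

      double mix(double p1, double p2) const {         // p = (1-w)*p1 + w*p2
        return (1.0 - w) * p1 + w * p2;
      }

      // gradient step on the coding cost -log(p(bit)) in the linear domain:
      // d = (p2-p1)/p for bit=1, d = (p1-p2)/(1-p) for bit=0, then w += r*d
      void update(int bit, double p1, double p2, double r) {
        double p = mix(p1, p2);
        double d = bit ? (p2 - p1) / p : (p1 - p2) / (1.0 - p);
        w += r * d;
        if (w < 0.0) w = 0.0;                          // keep the mix convex (assumption)
        if (w > 1.0) w = 1.0;
      }
    };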

  5. #5 by toffer (Programmer)
    Ok, here's a comprehensive answer.

  6. #6 by Matt Mahoney (Expert)
    Yeah, I did a similar derivation when I introduced logistic mixing to PAQ7 but only commented the result. I noticed that adjusting the weights along the gradient of coding cost in weight space led to a simpler update rule than for back propagation, which minimizes root mean square prediction error. So if the actual output is p (a prediction between 0 and 1) and the desired output is y (the update bit 0 or 1), then back propagation says:

    w[i] += L*x[i]*(y-p)*p*(1-p);

    where L is the learning rate (around .001 to .01) and w[i] is the weight for input x[i]. Minimizing coding cost gives you

    w[i] += L*x[i]*(y-p);

    I thought it was neat that the extra terms went away. I had done the same analysis for linear mixing in PAQ6, where I took the partial derivative of coding cost with respect to weights. The equations were a little more complex but it did eliminate L as a parameter. For linear mixing you have to maintain w[i] > 0 but it's not required for logistic mixing.
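
    To make that concrete, here is a small self-contained sketch of a mixer using exactly this last rule; it is written in floating point for clarity (PAQ itself works in fixed point), and the class and parameter names are illustrative only.
    Code:
    #include <cmath>
    #include <vector>

    struct LogisticMixer {
      std::vector<double> w;                       // one weight per input model
      std::vector<double> x;                       // stretched inputs of the last mix
      double p = 0.5;                              // last mixed prediction

      explicit LogisticMixer(int n) : w(n, 0.0), x(n, 0.0) {}

      static double stretch(double q) { return std::log(q / (1.0 - q)); }
      static double squash(double s)  { return 1.0 / (1.0 + std::exp(-s)); }

      // probs must hold one probability in (0,1) per weight
      double mix(const std::vector<double>& probs) {
        double s = 0.0;
        for (size_t i = 0; i < w.size(); ++i) {
          x[i] = stretch(probs[i]);
          s += w[i] * x[i];
        }
        return p = squash(s);
      }

      void update(int y, double L = 0.005) {       // gradient of -log p(y) w.r.t. w[i]
        for (size_t i = 0; i < w.size(); ++i)
          w[i] += L * x[i] * (y - p);              // the p*(1-p) backprop factor cancels
      }
    };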

  7. #7 by Sebastian (Member)
    The actual difference between cost minimization in the logistic and the linear domain is very small in my example, but that could all be because I've done something wrong.

    I made different 2-input mixers. In my first version the linear mixing was actually better, but after literally copying the mixer from PAQ9a the logistic one came out a little ahead.

    This is the result on the Calgary corpus, with literals coded by an O2-O1-O0 mix driven by an LZ engine.
    The mixer for Mix(p2,Mix(p0,p1)) is selected via the currently coded byte and parts of the last byte.

    The difference could be because we lack parameter optimization, but it's funny how close they are (only one line in the code has to change):

    Code:
    original (Calgary corpus): 3.141.622 bytes
    linear mix:                  845.585 bytes
    logistic mix:                844.903 bytes
    Quote Originally Posted by Matt Mahoney View Post
    but it did eliminate L as a parameter
    But L is the step size on the gradient line, and the linear mix should have the same speed as the logistic variant,
    because we can replace the division in d=(p2-p1)/p (bit=1) or d=(p1-p2)/(1-p) (bit=0) with a table lookup and a multiplication.
    Then we have w += L*d, which needs 2 multiplications and 1 addition per bit, just like the logistic variant.
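
    A sketch of that table-lookup replacement (the fixed-point layout, PBITS and all names are assumptions for illustration):
    Code:
    #include <cstdint>

    enum { PBITS = 12, PSCALE = 1 << PBITS };     // probabilities stored as p*PSCALE
    static int32_t recip[PSCALE];                 // recip[x] ~ 2^24/x

    void init_recip() {
      recip[0] = 0;                               // p is kept in [1,PSCALE-1] anyway
      for (int x = 1; x < PSCALE; ++x)
        recip[x] = (PSCALE << PBITS) / x;         // computed once, outside the coding loop
    }

    // one-weight linear mix p=(1-w)*p1+w*p2, everything in PBITS fixed point
    inline void update(int& w, int p1, int p2, int p, int bit, int L) {
      int num = bit ? (p2 - p1) : (p1 - p2);
      int den = bit ? p : (PSCALE - p);
      int d = (int)(((int64_t)num * recip[den]) >> PBITS);   // ~ num/den, fixed point
      w += (int)(((int64_t)L * d) >> PBITS);                 // 2 muls + 1 add per bit
    }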

  8. #8 by Matt Mahoney (Expert)
    I describe linear and logistic mixing here: http://mattmahoney.net/dc/dce.html#Section_431

    In a linear mixer, S0 and S1 (evidence for 0 and 1) start small and grow over time. This causes the weight updates to start large and get smaller over time. The reason you don't have a learning rate parameter is that you are mixing evidence (bit counts) to get probabilities. In a logistic mixer, the inputs are already probabilities. The rapid adaptation at the start comes from the models rather than from the mixer.
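
    For reference, a small sketch of such an evidence mixer; it follows the prediction p = S1/(S0+S1) and the weight update quoted from dce.html later in this thread, with the epsilon and all names being assumptions of the sketch.
    Code:
    #include <algorithm>
    #include <vector>

    struct EvidenceMixer {
      std::vector<double> w;
      explicit EvidenceMixer(int n) : w(n, 1.0) {}

      double predict(const std::vector<double>& n0, const std::vector<double>& n1,
                     double& S0, double& S1) const {
        const double eps = 0.5;                        // avoids zero probabilities
        S0 = eps; S1 = eps;
        for (size_t i = 0; i < w.size(); ++i) {
          S0 += w[i] * n0[i];                          // weighted evidence for 0
          S1 += w[i] * n1[i];                          // weighted evidence for 1
        }
        return S1 / (S0 + S1);                         // p(bit=1)
      }

      void update(int y, const std::vector<double>& n0, const std::vector<double>& n1,
                  double S0, double S1) {
        const double S = S0 + S1, p = S1 / S;
        for (size_t i = 0; i < w.size(); ++i) {
          const double ni = n0[i] + n1[i];
          // wi := max[0, wi + (y-p)(S*n1i - S1*ni)/(S0*S1)], no learning rate needed
          w[i] = std::max(0.0, w[i] + (y - p) * (S * n1[i] - S1 * ni) / (S0 * S1));
        }
      }
    };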

  9. #9 by Sebastian (Member)
    Ok, I understand linear mixing as p=p1*w1+...+pn*wn avoiding the squash and stretch.

    But if your update formula for linear evidence mixing is wi := max[0, wi + (y - p)(S*n1i - S1*ni) / (S0*S1)],
    (y - p)(S*n1i - S1*ni) / (S0*S1) should be the negative gradient. Thus you could add a step size to the weight update.
    It doesn't matter that you derive the formula from bit counts, because you interpret them as a probability; but maybe I have to start Mathematica again.

  10. #10 by Shelwien (Administrator)
    > Ok, I understand linear mixing as p=p1*w1+...+pn*wn avoiding the
    > squash and stretch.

    It's not really about avoiding anything.
    In fact, both logistic and linear mixing are based on the same source model,
    where each symbol is generated by a single submodel, and the weights are the probabilities
    of submodel usage.
    It's just that you get linear mixing by summing symbol intervals
    (i.e. there's a total probability of w1*p1 that the symbol would be encoded
    with model1, and w2*p2 for model2; we don't care which model it is,
    so the symbol probability is w1*p1+w2*p2)
    and logistic mixing by summing likelihood intervals (string probabilities):
    http://encode.su/threads/496-Paq-mix...ixer+theory%5C

    > But if your update formula for linear evidence mixing is wi := max[0, wi + (y - p)(S*n1i - S1*ni) / (S0*S1)],
    > (y - p)(S*n1i - S1*ni) / (S0*S1) should be the negative gradient.

    Not sure what you're talking about. Anyway, the update in a binary logistic mixer looks
    like this:
    Code:
      // snippet from the binary logistic mixer: pm is the mixed prediction,
      // s[0],s[1] are the stretched inputs from the mix, wr is the update rate
      void Update( int y, int wr,int mw, int Limit, int p1,int p2 ) {
        const int e = (((y<<MIX_SCALE_LOG)-pm)*wr)>>SCALElog;
        w[0] += e*s[0];
        w[1] += e*s[1];
      }
    and it matches with http://encode.su/threads/1154-Logist...ll=1#post22889
    and http://www.ctxmodel.net/files/mix_test/stems5.png (though there's some unnecessary stuff)

    > because you interpret them as a probability, but maybe I have to start Mathematica again

    Yeah, you can start with http://www.wolframalpha.com/input/?i...p)],+{p,0,1}+]

  11. #11 by Sebastian (Member)
    Yeah, wr is your update rate; that possibility was all I wanted to point out.
    There's only one question left: why do you allow negative weights?
    If I look at your derivation, the weights sum up to 1, but allowing negative weights (and using 2 weights for the binary input) improves compression.

  12. #12 by Shelwien (Administrator)
    Well, the version I'm using (like in mix_test) looks like this:
    Code:
    struct iMixer {
      word w;  // weights
    
      void Init( int w0 ) { 
        w = w0 + hSCALE;
      }
    
      int rdiv( int x, int a, int d ) {  
    //    return x>=0 ? (x+a)>>d : -((-x+a)>>d);
        return (x+a)>>d;
      }
    
      static int st( int p ) { return U.Stretch(p); }
      static int sq( int p ) { return W.Squash(p);  }
    
      int Mixup( int s0, int s1, int wr ) {
        int x = s1 + rdiv((w-hSCALE)*(s0-s1),1<<(SCALElog-1),SCALElog);
        x = rdiv( (x*wr), 1<<(SCALElog-1), SCALElog );
        return x;
      }
    
      void Update( int y, int p0,int p1, int wq, int pm ) {
        int py = SCALE - (y<<SCALElog);
        int e = (py-pm);
        int d = rdiv( e*(p0-p1), 1<<(SCALElog-1), SCALElog );
        d = rdiv( d*wq, 1<<(SCALElog-1), SCALElog );
        w += d;
      }
    };
    So I don't use negative weights (even if it helps in some cases, it would be overtuning imho),
    and instead of n-ary mixers I have trees of binary mixers.

  13. #13 by Sebastian (Member)
    If I translate correctly:

    mix:
    x = s1 + (s0-s1)*(w-0.5) = (1.5-w)*s1 + (w-0.5)*s0 (why the 0.5?)
    x = x*wr (I tested that and it gained some bytes, but this seems rather shamanic)

    update:
    d=(1-bit-pm)*(p0-p1)*wq (is this derived from the "likelihood-mixer"?)
    w+=d

  14. #14 by Shelwien (Administrator)
    0.5 is for rounding; you don't need it with floating-point arithmetic.
    wr has an explanation: you can get it from the p=p*(1-wr)+y*wr probability update and likelihood extrapolation of the posterior probability.
    For wq I don't have much of an explanation, but it seems reasonable to control the update rate anyway.
    And actually I've got further improvements from tuning the stretch/squash mappings, which is even harder to explain.
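
    For reference, the common fixed-point form of that p = p*(1-wr) + y*wr update with wr = 1/2^rate (SCALElog and the names are assumptions, not code from any coder here):
    Code:
    enum { SCALElog = 15, SCALE = 1 << SCALElog };

    // p is a probability of 1 in [0,SCALE); p' = p*(1-wr) + y*wr, wr = 1/2^rate
    inline int counter_update(int p, int y, int rate = 5) {
      return p + (((y << SCALElog) - p) >> rate);   // p += wr*(y - p)
    }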

  15. #15 by Matt Mahoney (Expert)
    Quote Originally Posted by Sebastian View Post
    Ok, I understand linear mixing as p=p1*w1+...+pn*wn avoiding the squash and stretch.

    But if your update formula for linear evidence mixing is wi := max[0, wi + (y - p)(S*n1i - S1*ni) / (S0*S1)],
    (y - p)(S*n1i - S1*ni) / (S0*S1) should be the negative gradient. Thus you could add a step size to the weight update.
    It doesn't matter that you derive the formula from bit counts, because you interpret them as a probability; but maybe I have to start Mathematica again.
    The trick here is that you can interpret bit counts as probability p = n1/n, n = n0+n1, but this implies giving greater weights to high counts (in addition to weighting by the mixer). For example, should n0=100, n1=100 (p = 1/2) have 100 times more weight than n0=1, n1=1?

    To avoid this problem, n0*n1 is bounded so that both counts cannot get large. This has the effect that probabilities near 0 or 1 have large n and so get more weight. Logistic mixing also has this property because of the squash and stretch functions. The reason I think logistic mixing does better is because you can separate the modeling of bit counts from the mixer, which you can't do with linear evidence mixing. If you have a bit history like 0000001, then what is the probability of the next bit? Linear mixing uses a fixed model, like n0=4, n1=1 because of bit discounting rules that favor new data over old. This might not be the best rule because the data could be stationary or not. Logistic mixing lets the model decide by looking at the outcomes of the same sequence in other contexts to estimate the probability, then just mixes the probability.
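
    The exact bound isn't spelled out here, but to illustrate the kind of discounting meant, here is a toy counter where observing a bit shrinks the opposite count; the specific halving rule and the probability estimate are assumptions of this sketch, not the actual PAQ rule.
    Code:
    struct BitCounter {
      int n0 = 0, n1 = 0;

      void update(int y) {
        if (y) {
          ++n1;
          if (n0 > 2) n0 = n0 / 2 + 1;   // discount the opposite count: old 0s matter less
        } else {
          ++n0;
          if (n1 > 2) n1 = n1 / 2 + 1;
        }
      }

      // p(1) from the counts; a near-deterministic history keeps one count large,
      // so (as noted above) it gets more weight in an evidence mixer
      double p() const { return (n1 + 0.5) / (n0 + n1 + 1.0); }
    };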

  16. #16 by Sebastian (Member)
    Quote Originally Posted by Shelwien View Post
    0.5 is for rounding
    I don't get why this should help. You're shifting w to [0.5,1.5]; I tried that, and it did not change the output by a single bit.

    Also, I don't understand where the term (p0-p1) in your update formula comes from.
    Your Mathematica derivation is clear to me.

  17. #17 by toffer (Programmer)
    1. Consider the integer divisions

    int(2/2) = int(1.0) = 1
    int(3/2) = int(1.5) = 1
    int(4/2) = int(2.0) = 2

    and now with proper "integer"-rounding

    int((2+1)/2) = int(1.5) = 1
    int((3+1)/2) = int(2.0) = 2
    int((4+1)/2) = int(2.5) = 2

    This simply improves accuracy and is an implementation issue. Don't mix it up with the derivation of formulas, etc.

    2. p' = (1-w) p0 + w p1 = p0 + (p1-p0) w


    @Matt: I never understood why you call mixing "evidences" linear. It's neither linear in weights nor linear in counts or probability.

  18. #18 by Sebastian (Member)
    Quote Originally Posted by toffer View Post
    int(2/2) = int(1.0) = 1
    int(3/2) = int(1.5) = 1
    int(4/2) = int(2.0) = 2

    and now with proper "integer"-rounding

    int((2+1)/2) = int(1.5) = 1
    int((3+1)/2) = int(2.0) = 2
    int((4+1)/2) = int(2.5) = 2
    This is done in the function rdiv.

  19. #19 by Matt Mahoney (Expert)
    Quote Originally Posted by toffer View Post
    @Matt: I never understood why you call mixing "evidences" linear. It's neither linear in weights nor linear in counts or probability.
    So 2 models A and B have predictions Pa = Na1/Na, Pb = Nb1/Nb, where Na, Nb are total counts by models A and B, and Na1, Nb1 are counts of 1 bits. Then a linear mixer with weights Wa, Wb = 1-Wa has prediction p = (Wa*Na*Pa + Wb*Nb*Pb)/(Wa*Na + Wb*Nb). It's linear in weight*count, maybe not in weights or counts by themselves.
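
    As a quick numeric check of that formula (made-up counts, purely for illustration): with Na1=3, Na=4 (Pa=0.75), Nb1=1, Nb=10 (Pb=0.1) and Wa=Wb=0.5, p = (0.5*4*0.75 + 0.5*10*0.1)/(0.5*4 + 0.5*10) = (1.5+0.5)/7 = 2/7 ≈ 0.286, well below the plain average (Pa+Pb)/2 = 0.425 - the high-count model dominates, which is exactly the weighting-by-evidence effect discussed above.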

  20. #20 by Sebastian (Member)
    It's clearly linear if you abstract modeling from mixing, although I prefer the word linear regression.
    The models' probabilities are formed by occurrence counters, but there could be many other solutions to this problem, like linear counters or whatever else.
    I know the following is clear to everyone.

    Let p=(1-w)*p1+w*p2 be the convex linear combination of two models.
    The cost when retrieving a one-bit is -log(p).
    A simple solution to this minimization problem is gradient descent.
    Taking the partial derivative of the cost with respect to w gives -(p2-p1)/p.
    A step for the parameter would then be w := w + L*(p2-p1)/p when retrieving a one-bit, where L is the step size on the gradient line.

    But for numerous reasons linear regression is bad for this problem, mainly because our dependent variables p1 and p2 are binary.
    So logistic regression is a better answer.
    Here p=1/(1+exp(-(s1*w1+s2*w2))). Taking the partial derivatives again gives
    w1 := w1+L*s1*(1-p) and w2 := w2+L*s2*(1-p) when retrieving a one-bit, or simply w := w+L*(s2-s1)*(1-p) if we use only one weight as in linear regression.
    The only problem here is the interpretation of the weights.
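    Written out for a general bit y (just a restatement of the two cases above): the cost is -[y*ln(p) + (1-y)*ln(1-p)] with p=1/(1+exp(-(s1*w1+s2*w2))), its partial derivative with respect to wi is -(y-p)*si, so a gradient step is wi := wi + L*si*(y-p). For y=1 this gives the updates above, for y=0 it gives wi := wi - L*si*p, and it is the same rule Matt states in #6 as w[i] += L*x[i]*(y-p).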

    But you can come to this solution from different initial situations.

    So my mixer looks like this
    Code:
    #define MIXW2 // use two weights?
    
    class Mixer2f {
      const int wbits;
      const int wscale;
      int w1,w2,s1,s2,pm;
      public:
         Mixer2f():wbits(18),wscale(1<<wbits),w1(wscale>>1),w2(wscale>>1) { };
         int idiv(const int v,const int s)
         {
           return (v+(1<<(s-1)))>>s;
         }
         int mix(const int _p1,const int _p2)
         {
            s1=domain.Stretch(_p1);
            s2=domain.Stretch(_p2);
            #ifdef MIXW2
             pm=domain.Squash(idiv(s1*w1+s2*w2,wbits));
            #else
             pm=domain.Squash(s1+idiv(s1+(s2-s1)*w1,wbits));
            #endif
            pm=cGlobal::max(1,cGlobal::min(PSCALE-1,pm));
            return pm;
         }
         void update(const int bit,float rate)
         {
           int e=(bit<<PBITS)-pm;
           int r=wscale*rate;
           #ifdef MIXW2
             int d1=idiv(e*s1,PBITS);
             int d2=idiv(e*s2,PBITS);
             w1+=idiv(r*d1,PBITS);
             w2+=idiv(r*d2,PBITS);
           #else
             int d1=idiv(e*(s1-s2),PBITS);
             w1-=idiv(r*d1,PBITS);
           #endif
    
           w1=cGlobal::max(-wscale,cGlobal::min(wscale,w1));
           w2=cGlobal::max(-wscale,cGlobal::min(wscale,w2));
         }
    };
    Results for the above mixer on the Calgary corpus with my previously mentioned LZ codec (18-bit weights).
    Code:
    two weights, unrestricted:        843.302
    two weights in [-wscale,+wscale]: 843.187
    two weights in [0,+wscale]:       843.230
    
    one weight in [-wscale,+wscale]:  844.308 (mathematically incorrect)
    one weight in [0,+wscale]:        844.254

  21. #21 by toffer (Programmer)
    @Matt
    A function of the form f(x) = a*x is linear in x. A mixing function f outputs a probability p. Your expression is

    p = f(w, n, N) = (Wa*Na*Pa + Wb*Nb*Pb) / (Wa*Na + Wb*Nb) = (Wa*Na1 + Wb*Nb1) / (Wa*Na + Wb*Nb).

    So, clearly, this function f is not linear in w, n_*, N_*, or any product of them. The numerator and denominator are linear in these variables; I guess this is what you mean. However, the mixing function is not.

    But for numerous reasons linear regression is bad for this problem, mainly because our dependent variables p1 and p2 are binary.
    Logistic regression is typically used when one wants to map "explanatory" variables (not in the range [0,1]) onto probabilities, since the sigmoid function ("squash") is asymptotically bounded by 0 and 1. Common examples are predicting the probability of a stroke as a function of age, weight, etc.


    The only problem here is the interpretation of the weights.
    Somehow I feel that people tend to ignore this or don't understand it.

  22. #22 by Shelwien (Administrator)
    @Sebastian:
    here's a simple coder with static linear mixing of order-0 and order-1 statistics.
    http://nishi.dreamhosters.com/u/rc_v3.rar
    Can you post your mixer results using it and the included sample file (book1bwt)?

  23. #23 by Matt Mahoney (Expert)
    @toffer, I see what you mean. It is not linear when you include the normalization in the denominator.

  24. #24 by Sebastian (Member)
    Quote Originally Posted by Shelwien View Post
    Can you post your mixer results using it and the included sample file (book1bwt)?
    I really don't get you here. My mixer with static mixing on book1bwt? If I use your mixer, the results won't be too different from yours.

    What I did now was to disable matching in my LZ codec and use the O2-O1-O0 literal model (2 mixers) directly, without changing any constants.
    This gets 215.589 bytes with logistic mixing and two weights, and 223.168 bytes using my linear mixer with one weight, though using the same update rate (0.25) for both is a bad idea.

  25. #25 by Shelwien (Administrator)
    > I really don't get you here. My mixer with static mixing on book1bwt? If I use your mixer, results won't be too different from yours

    I meant that you can plug your adaptive mixer into it*, and see if it would improve the results over static mixing.
    Also I have a lot of other results for that file with the mix_test and o01 coders, so it would be easy to see how good your mixer is.
    I don't think there's a better method of mixer quality estimation - when you compare completely different coders, the results
    can be affected by any of the other implementation details, not only the mixers.
    For example, in mix_test I had ~214xxx for book1 with a logistic mix2(o0,o1); does that mean that your mixer is bad?

    * using the same probability inputs: they're 15-bit probabilities of zero.

  26. #26 by Sebastian (Member)
    Quote Originally Posted by Shelwien View Post
    For example, in mix_test I had ~214xxx for book1 with logistic mix2(o0,o1), does that mean that your mixer is bad?
    Dunno, your O2-mix (with linear mixer) does about 218.xxx, so I think it's not too bad.
    If I add two more mixers for orders 0 and 1 I get about 213.xxx.

    I have not played with BWT data, and you're using a hell of a lot of constants. I have about 4.
    I'll have to rewrite my squash/stretch functions to be compatible with different probability accuracies, then I'll post results. And an optimizer backend wouldn't be too bad either.

    Another thing which bothers me is the whole concept of linear mixing with 2 inputs, p=(1-w)*p1+w*p2.
    As mentioned above, I suggested something like a line search to find the optimal update rate.
    But that would be stupid in this case, because the optimal step would push w to either 0 (p1>p2) or 1 (p2>p1) if we retrieve a '1' bit.
    So gradient descent is just a compromise here to "average" the time-dependent functions p1 and p2.
    There must be a better solution to this. You mentioned something about updating p like a counter and back-propagating the update to w. How does this work?

  27. #27 by Shelwien (Administrator)
    > > For example, in mix_test I had ~214xxx for book1 with logistic
    > > mix2(o0,o1), does that mean that your mixer is bad?

    > Dunno, your O2-mix (with linear mixer) does about 218.xxx, so I think it's not too bad.

    http://nishi.dreamhosters.com/u/rc_v3.rar
    Implements static linear mixing of order0 and order1 contexts, and its result is 216,774.
    Which o2 are you talking about?
    I meant http://encode.su/threads/387-Logisti...ull=1#post7926

    > If I add two more mixers for orders 0 and 1 I get about 213.xxx.

    That's another matter. We have specific probability inputs here, defined by rc_v3 above.
    And I'm suggesting to compare mixers by attaching them to that coder and looking at the results.
    For now we have the default static mix and your mixer to compare; I'd post the results
    with my logistic mixer later if necessary.

    > I have not played with BWT-data, and you're using a hell of constants. I have about 4.

    You're not supposed to tweak these anyway (except for M_mW0 = static mixer weight).
    Also it's best not to have any constants, but in reality all coders have them, just
    in implicit form. For example, p=squash(s) implies p=squash(1*s) if there's no
    such multiplier already in the formula for s.
    Thus, coders with more tuned parameters win
    (e.g. bsc uses the same linear counters with 4 update parameters).

    > I'll have to rewrite my squash/stretch functions to be compatible
    > with different probability accuracies, then i'll post results.

    You don't really have to. Just shift the inputs to your desired precision,
    we're not looking at the speed here anyway, and rounding likely won't matter either -
    in fact it can improve compression there.

    > And an optimizer-backend wouldn't be too bad either.

    A demo of my optimizer is available on ctxmodel, but it's perl-based, so you probably won't use it.
    Then there's also m1, but its optimizer would be hard to attach afaik.
    And then I remember osman talking about using matlab for optimization (I suppose you can
    define an external function there, which would estimate your entropy for given arguments).

    > Another thing which bothers me is the whole concept of linear mixing with 2 inputs p=(1-w)*p1+w*p2.

    I think it's fairly obvious.
    You have your current probability-of-1 estimation p and a new estimation y (= the bit value).
    Again, it's a switching model: you can either use p to predict the next bit z (after y), or y.
    Let's suppose that the probability of y producing a shorter code for z is w.
    So, with probability (1-w) p might be better, and we have 4 cases:
    z=0 using p, probability is (1-w)*(1-p)
    z=1 using p, probability is (1-w)*p
    z=0 using y, probability is w*(1-y)
    z=1 using y, probability is w*y
    But we only care about z value, so which prediction is used doesn't matter,
    so we can merge the intervals.
    z=0 -> (1-w)*(1-p)+w*(1-y)
    z=1 -> (1-w)*p+w*y
    And here we have linear mixing, where w is not just some parameter, but a probability
    of y being better for coding.

    > As mentioned above I suggested something as linesearch to find the optimal update-rate.

    I'm not aware of any usable CM components with adaptive update parameters.
    Each update affects all the subsequent estimations, so you can't optimize
    the update parameters based only on the single next bit.
    And it's hard to optimize for bit strings here, because the function
    becomes very complex.
    Well, there's that approximated polynomial approach which we discuss sometimes
    with toffer, but afaik it's still theory.

    > There must be a better solution to this. You mentioned something as
    > updating p like a counter and backpropagating the update to w. How
    > does this works?

    Yes, my linear mixer and SSE components are based on that.
    Suppose you have an update formula for a counter: p'=f(p,y).
    And then you have a linear mix: p=(1-w)*p1+w*p2
    The idea is that if p'=f(p,y) is a good update function, then
    why not use it here - p1 and p2 may be constants for all we know.
    So we need such a w' that the next mix would produce p':
    (1-w')*p1 + w'*p2 = p'
    w' = (p'-p1)/(p2-p1)
    To that, I usually also add a |p2-p1|>threshold check, to avoid
    errors due to limited integer precision (also it's common sense
    that with nearly equal p1 and p2 there's no reason to change the weight).
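
    A floating-point sketch of that scheme (the names, the clamp of w' to [0,1] and the threshold value are assumptions of the sketch; the threshold check is the one mentioned above):
    Code:
    #include <cmath>

    struct CounterBackpropMix2 {
      double w = 0.5;

      static double counter(double p, int y, double wr = 0.03) {
        return p * (1.0 - wr) + y * wr;                 // p' = f(p,y)
      }

      double mix(double p1, double p2) const { return (1.0 - w) * p1 + w * p2; }

      void update(int y, double p1, double p2, double threshold = 1e-3) {
        double p  = mix(p1, p2);
        double pn = counter(p, y);                      // the p' the next mix should produce
        if (std::fabs(p2 - p1) > threshold) {           // skip near-equal inputs
          w = (pn - p1) / (p2 - p1);                    // w' = (p'-p1)/(p2-p1)
          if (w < 0.0) w = 0.0;                         // clamp (assumption of the sketch)
          if (w > 1.0) w = 1.0;
        }
      }
    };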

    This is only applicable to linear components though, and the current
    update formula in the logistic mixer is not bad anyway - brute-force update
    doesn't gain much there (= collecting codelengths for all weight values
    and choosing the next weight by the best codelength).

  28. #28 by Sebastian (Member)
    Thanks for your suggestions, I'll post my results in the next days.

    Quote Originally Posted by Shelwien View Post
    > Another thing which bothers me is the whole concept of linear mixing with 2 inputs p=(1-w)*p1+w*p2.

    I think its fairly obvious.
    You have your current probability-of-1 estimation p and new estimation y (=bit value).
    Again, its a switching model - you can either use p to predict the next bit z (after y), or y.
    Let's suppose that the probability of y producing a shorter code for z is w.
    So, with probability (1-w) p might be better, and we have 4 cases:
    z=0 using p, probability is (1-w)*(1-p)
    z=1 using p, probability is (1-w)*p
    z=0 using y, probability is w*(1-y)
    z=1 using y, probability is w*y
    But we only care about z value, so which prediction is used doesn't matter,
    so we can merge the intervals.
    z=0 -> (1-w)*(1-p)+w*(1-y)
    z=1 -> (1-w)*p+w*y
    And here we have linear mixing, where w is not just some parameter, but a probability
    of y being better for coding.
    I think you misunderstood me here. With "whole concept" I don't mean the linear combination of two predictions. But maybe this is just a language problem on my side.

    I don't understand what the switching between y and p has to do with the switching between p1 and p2.
    I'm thinking about something like delaying the mixer update in some form, but maybe this is the same as your suggested "counter update".

    Ok, simple and clear for a dumbass like me:
    p=(1-w)*p1+w*p2 describes a switching source.
    Whether you view w as a convex weight or as a probability doesn't really matter to me.
    The problem arises when we adjust w on encountering the bit y. We do it in a way that the model which predicted y better is favored in the next step. Simple as that.

    But we discard some information in the update step, where the models p1, p2 and the weight w all get updated, because p1 and p2 both get adjusted in the direction of y too.

    Example:
    p1=0.6, p2=0.2, w=0.5
    y=1: p1 was better, so make w smaller, say w=0.3; update p1, p2, say p1=0.8, p2=0.9 (p2 is a fast counter)
    y=1: p2 was better, so make w larger, w=0.5

    But it would have been better if we had adjusted w beforehand in the direction of p2.

    See what I mean? Using the new p1 and p2 directly makes things much worse, so I have to take a deeper look.

  29. #29 by Shelwien (Administrator)
    > I don't understand what the switching between y and p has to do with
    > the switching between p1 and p2?

    It's the same thing. We have two predictions and a probability of choosing one of them.
    Btw, my counters mix 3 predictions in fact: p, y and a static one.

    > I think about something like delaying the mixer update in some form,
    > but maybe this is the same as your suggested "counter-update".

    That method is only applicable to linear components.
    It would be impractical for a logistic mixer, as it would involve exp/log.
    And anyway, precision-wise, there's nothing better than bruteforce -
    see mixer in http://nishi.dreamhosters.com/u/wav_00.rar

    > If you view w as a convex weight or a probability doesn't really matters to me.

    And I think that knowing what it is really helps.
    Otherwise the idea of n-ary logistic mixer redundancy due to Sum w_i != 1
    would never appear.
    Also, knowing what kind of a probability it is allows implementing some
    alternate solutions - see the "likelihood" update in http://www.ctxmodel.net/files/mix_test/mix_test_v4.rar
    It's based directly on estimating the probability of one of the inputs
    producing shorter codes, which introduced the weird concept of different
    contexts for mixing and for the mixer update, unthinkable otherwise.

    > The problem arises if we adjust w when we encounter the bit y. We do
    > it in a way that the model that predicted y better, is favored in
    > the next step. Simple as that.

    Sure, but details matter. They only don't in asymptotic results for stationary data.
    While in reality choosing a wrong update rate for a good adaptive mixer can easily lead
    to worse results than with static mixing.

    > But we discard some information in the update step where the models
    > p1,p2 and the weight w get updated, because p1 and p2 get both
    > adjusted in the direction of y too.

    Note that there are usually contexts, in which case the updated weight likely
    won't be applied to previously updated inputs.
    Although you're right, and doing a few iterations of mixer updates sometimes
    helps too. But overall it's much slower and not very stable, so it isn't used
    in actual coders.

  30. #30 by toffer (Programmer)
    If anyone is interested I can rip the optimizer out and make a simple example. It is a fairly general-purpose genetic optimizer; in fact I developed and tested it on the minimization of common test functions like Rosenbrock or Rastrigin.
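
    Not toffer's optimizer, but for anyone curious, a minimal sketch of a genetic optimizer of the kind described, using the Rosenbrock function he mentions as the objective (population size, mutation rate and so on are arbitrary choices of the sketch):
    Code:
    #include <algorithm>
    #include <cstdio>
    #include <random>
    #include <vector>

    static double rosenbrock(const std::vector<double>& x) {   // minimum 0 at (1,...,1)
      double s = 0.0;
      for (std::size_t i = 0; i + 1 < x.size(); ++i) {
        double a = x[i + 1] - x[i] * x[i], b = 1.0 - x[i];
        s += 100.0 * a * a + b * b;
      }
      return s;
    }

    int main() {
      const int dim = 4, pop = 64, gens = 2000;
      std::mt19937 rng(1);
      std::uniform_real_distribution<double> init(-2.0, 2.0), uni(0.0, 1.0);
      std::normal_distribution<double> mut(0.0, 0.05);
      std::uniform_int_distribution<int> pick(0, pop - 1);

      std::vector<std::vector<double>> P(pop, std::vector<double>(dim));
      for (auto& ind : P) for (auto& g : ind) g = init(rng);

      auto costs = [&](const std::vector<std::vector<double>>& pp) {
        std::vector<double> f(pp.size());
        for (std::size_t i = 0; i < pp.size(); ++i) f[i] = rosenbrock(pp[i]);
        return f;
      };
      auto tournament = [&](const std::vector<double>& f) {    // lower cost wins
        int i = pick(rng), j = pick(rng);
        return f[i] < f[j] ? i : j;
      };

      for (int g = 0; g < gens; ++g) {
        std::vector<double> f = costs(P);
        int best = (int)(std::min_element(f.begin(), f.end()) - f.begin());
        std::vector<std::vector<double>> next;
        next.push_back(P[best]);                               // elitism: keep the best
        while ((int)next.size() < pop) {
          const std::vector<double>& a = P[tournament(f)];
          const std::vector<double>& b = P[tournament(f)];
          std::vector<double> child(dim);
          for (int k = 0; k < dim; ++k) {
            child[k] = (uni(rng) < 0.5 ? a[k] : b[k]);         // uniform crossover
            if (uni(rng) < 0.1) child[k] += mut(rng);          // gaussian mutation
          }
          next.push_back(child);
        }
        P.swap(next);
      }

      std::vector<double> f = costs(P);
      int best = (int)(std::min_element(f.begin(), f.end()) - f.begin());
      std::printf("best cost %g\n", f[best]);
      return 0;
    }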
