
Thread: Logistic mixing & modeling

  1. #1
    toffer (Programmer, Erfurt, Germany)

    Logistic mixing & modeling

    After thinking a bit about PAQ's mixer - which uses gradient descent to compute its weights iteratively - I thought about applying the same idea to bit models. Instead of a counter holding a state p (a probability) in linear space, one could have a logistic counter s = St(p) and update its state in logistic space. When using logistic mixing, the input probabilities then no longer need to be converted into logistic space - so no stretch function has to be implemented. Altogether, a logistic counter with a gradient-based update could look like:

    h = -ln(a+b*p)
    p = Sq(s)
    dh/ds = -(y-p)
    (derivation and terms like in this thread http://encode.su/forum/showthread.ph...ative#post7816)

    s' = s + L*(y-p) = s + L*(y-Sq(s))
    Sq(x) = 1/(1+e^-x) (the squash function)

    That update can be derived just as I did in the Mixing strategies thread - and it actually equals a fixed input ("one") with a "weight" s. Hence it is just a weighted bias neuron. BTW, that update already appears in the bias neuron update when using logistic mixing - it is exactly the same thing. Nothing new, actually.
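
    For illustration, a minimal floating-point sketch of such a counter (the struct name and the learning-rate value are illustrative, not part of the scheme):

    #include <cmath>

    // Logistic counter sketch: the state s lives in logistic space,
    // p = Sq(s) is the predicted probability of bit y = 1.
    struct LogisticCounter {
        double s = 0.0;  // St(0.5) = 0, i.e. start at p = 0.5
        static double Sq(double x) { return 1.0 / (1.0 + std::exp(-x)); }
        double p() const { return Sq(s); }
        // Gradient step on the coding cost h = -ln(a+b*p):
        // s' = s + L*(y - Sq(s)), with learning rate 0 < L < 1.
        void update(int y, double L = 0.02) { s += L * (y - Sq(s)); }
    };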

    Now I noticed that we could actually "reinvent" the wheel: when using logistic mixing along with such a model - why not take the mixing function's influence into account and adjust the bit models in proportion to their contribution to the mixed output? Integrating that into logistic mixing gives a combined, purely logistic mixing and modeling approach:

    (1) each model i has a state s_i in logistic space

    (2) the mixer assigns weights w_i to the models s_i

    (3) a mixed prediction p in linear space is

    p = Sq( sum w_i s_i )

    (4) the coding error e is

    e = y-p

    (5) the weights are updated

    w_i' = w_i + L*e*s_i

    (6) the predictions are updated

    s_i' = s_i + M*e*w_i

    0 < L, M < 1 are learning rates. Note that -e*w_i is simply the partial derivative of h (the coding cost) w.r.t. an input s_i. Altogether this can be interpreted as a neural network: the input layer is made of bias neurons whose weights are selected by a context; the output layer is a single neuron with linear activation and squash as the output function; the hidden layer just forwards the inputs (linear neurons). It learns via error backpropagation, which minimizes coding cost rather than RMSE.
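
    A minimal C++ sketch of steps (1)-(6), in floating point for clarity (the class name, the initial weight and the learning-rate values are illustrative assumptions; in a real coder the s_i would be counter states selected by context and written back after the update):

    #include <cmath>
    #include <vector>

    // Combined logistic mixing & modeling: submodel states s_i and mixer
    // weights w_i both live in logistic space and are trained from the
    // same backpropagated error e = y - p.
    struct LogisticMixModel {
        std::vector<double> s, w;     // per-model states and weights
        double L = 0.002, M = 0.02;   // learning rates, 0 < L, M < 1
        double p = 0.5;               // last mixed prediction

        explicit LogisticMixModel(int n) : s(n, 0.0), w(n, 0.3) {}

        static double Sq(double x) { return 1.0 / (1.0 + std::exp(-x)); }

        // (3) mixed prediction p = Sq( sum_i w_i*s_i )
        double predict() {
            double x = 0.0;
            for (size_t i = 0; i < s.size(); ++i) x += w[i] * s[i];
            return p = Sq(x);
        }

        // (4)-(6) after coding bit y: update weights and model states
        void update(int y) {
            double e = y - p;                 // (4) coding error
            for (size_t i = 0; i < s.size(); ++i) {
                double wi = w[i];             // use the old weight for (6)
                w[i] += L * e * s[i];         // (5) mixer weight update
                s[i] += M * e * wi;           // (6) model state update
            }
        }
    };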

    I have no idea how well this will perform in practice - but I hope it might work as well as logistic mixing. Most of this is already covered by logistic mixing anyway; this is just an extension that integrates the modeling.

    I'll implement this when I've got some spare time.

    It might even be possible to get fairly good results with *static* weights (no weight update), since the logistic model updates already take the actual weight values into account.

    Comments?

  2. #2
    encode (The Founder, Moscow, Russia)
    It would be great if you released an fpaq0-like compressor as a demo - just a simple model with no templates and none of those crazy constants that some authors like to embed into their programs to make the source code completely unreadable and half encrypted, so nobody can understand what's going on...

  3. #3
    osmanturan (Programmer, Mersin, Turkiye)
    You were faster than me. I was thinking about this as a way to remove the stretch/squash functions completely and reduce cache misses (I think the tables are big enough to pollute the cache). I thought about implementing a new counter model based on PAQ's mixer weight-update/probability-mixing functions, but I got stuck on how to remove even the final squash function. How can we integrate the final logistic probability into the RC? I mean, a multiplication-free RC?

  4. #4
    toffer (Programmer, Erfurt, Germany)
    @Osman
    Actually the lookup tables are rather cache friendly, since probabilities cluster near 0/1 in linear space - just remember the model output histograms. I ran some tests there and found no difference in speed.

    @Encode
    Well, if you want different parameters for different models, there is no other flexible way of passing them to the models.

    I'll implement that when I've got some spare time - don't forget that I'm rather busy with my final thesis. Actually, implementing it is pretty straightforward. If you know how to update a PAQ mixer:

    e = y-p
    w_i' = w_i + L*e*s_i

    then you know how to update the models, too:

    s_i' = s_i + M*e*w_i
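
    For completeness, a small self-contained sketch of that per-bit training step (function and parameter names and the learning rates are illustrative; in a real coder s[] would be context-selected counter states written back after the update):

    // Per-bit training step: the familiar PAQ-style weight update plus the
    // symmetric model update, driven by the same error e = y - p.
    void update_bit(int y, double p, double* w, double* s, int n,
                    double L = 0.002, double M = 0.02) {
        double e = y - p;
        for (int i = 0; i < n; ++i) {
            double wi = w[i];
            w[i] += L * e * s[i];   // mixer weight update
            s[i] += M * e * wi;     // model state update
        }
    }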

  5. #5
    osmanturan (Programmer, Mersin, Turkiye)
    @Toffer:
    Yes, I remember. I meant something slightly different: the tables are big enough (2x8 KiB in total) to take up cache space (especially on AMD architectures), and there is other data that should be cached as well (e.g. SSE entries). So removing stretch/squash from the whole implementation should leave more room for better caching. Also, I'm sure a multiplication-free RC would be super fast.

    Edit: BTW, IIRC I intentionally removed the stretch/squash tables at some point prior to BIT 0.5 (maybe even earlier) to see whether they were a bottleneck, and I observed that it was faster. Yes, I know the computation becomes completely wrong that way (the output is garbage), but it did show that the stretch/squash functions are a slow part of the PAQ mixer.

  6. #6
    toffer (Programmer, Erfurt, Germany)
    I don't think a multiplication-free RC is possible (with enough precision) - but I have never investigated that.

    The cache is adaptive - and I doubt that more than a few (<10) 64-byte cache lines are actually used, since - as I said - the functions are evaluated near 0/1 in almost all cases. Those locations are used frequently and stay cached; the others aren't and get pushed out. As I said, I tested this effect and there is no speed difference. At first glance I was pretty surprised, but looking at a model output histogram reveals why this happens.
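
    To illustrate the point, a small sketch (the 4096-entry, 16-bit table matches the 2x8 KiB figure above; the skewed input distribution is an assumption) that counts how many 64-byte cache lines a stream of near-0/near-1 lookups actually touches:

    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <set>

    int main() {
        // 4096-entry, 16-bit stretch table (~8 KiB), PAQ-style.
        static int16_t st[4096];
        for (int i = 0; i < 4096; ++i) {
            double p = (i + 0.5) / 4096.0;
            st[i] = (int16_t)std::lround(std::log(p / (1.0 - p)) * 256.0);
        }
        // Model outputs cluster near 0/1 in linear space; count the
        // distinct cache lines such lookups touch.
        std::set<uintptr_t> lines;
        for (int k = 0; k < 100000; ++k) {
            int idx = (k & 1) ? 4095 - (k % 40) : (k % 40);  // skewed to the ends
            lines.insert((uintptr_t)&st[idx] >> 6);
        }
        std::printf("cache lines touched: %zu of %zu\n",
                    lines.size(), (sizeof(st) + 63) / 64);
        return 0;
    }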

  7. #7
    osmanturan (Programmer, Mersin, Turkiye)
    BTW, correct me if I'm wrong:
    Initially we have a model which holds w and s, and they are both zero at the beginning: we need to assign p=0.5, and if we derive s from p we get s = St(0.5) = 0. So here is the question: how can we update anything if the coefficients are zero!? In the PAQ mixer the s_i part is in linear space, so there is no problem (p=0.5 at the beginning there). But here it is logistic, and it is zero at the beginning.

    e = y-p
    w_i' = w_i + L*e*s_i
    s_i' = s_i + M*e*w_i

    Edit: I miswrote - the s_i part is not in linear space in the PAQ mixer, but it comes from a linear-space source, so it can be a non-zero value.

  8. #8
    toffer (Programmer, Erfurt, Germany)
    > 1. Log[p/(1-p)] logistic mapping comes off the ceiling.
    > I doubt you can say anything about its choice except
    > that it fits your image - which is obviously not enough.
    > And with similar but different mapping being better for
    > specific application than Log[p/(1-p)] it becomes even
    > harder to explain.

    My interpretation of Matt's design here is:

    Matt's idea was to use a NN for mixing - a design constraint. A sigmoidal function (a differentiable threshold element) as the output neuron provides weight adaptation that minimizes coding cost without a *division*. Thus it is just a modification of his previous (< PAQ6) mixers, which also minimized coding cost using gradient descent. Now, if the output p = Sq(s) of that *nonlinear* output neuron has the meaning of a "probability", then s is certainly something different. Thus one can view Sq(s) as the inverse of some function - St(p) in particular - which transformed the probabilities into a different domain (#).

    I know that this lacks a mathematical interpretation, but to my understanding that intuition makes sense. On the other hand, I think the actual design process was reversed: usually it would be better to analyse the inputs and build the model based on that, not the other way around.

    And I still don't see what would be wrong with knowing the reasons why it works well - even if those reasons luckily coincided with a more or less random design process.

    > 2. You didn't have any suggestions as to avoiding weight overflows,
    > or changing the mixing formula to improve the results.
    > But using a single weight there improves both compression
    > and memory usage (a single 16-bit weight is currently used),
    > and it doesn't seem final either.

    That was already done in ZPAQ. We already talked about this - I didn't reject any of these ideas. And why should I tell you how to implement it if you can do it on your own? I'm sure both of us know how this can be implemented.

    > 3. Using numeric optimization techniques doesn't guarantee
    > that the optimization target and iteration functions are
    > chosen consciously.

    Coding cost (entropy) should be a good target function. And I never said that gradient descent is a perfect minimization technique - I just said that its application here *makes sense*, since it is a technique for minimizing a certain cost function. In addition, it is easy to implement here and seems to be efficient.

    > For example, having a rate coefficient
    > in the logistic mixer update is a really bad idea - that is,
    > same effect can be acquired in a better way.

    Well, as you know, mathematically it doesn't matter much whether you locate it in the mixing or in the update. And I already explained why it appears in the update. The normal procedure in iterative numeric minimization of a function H(x) is to estimate a search direction d(k) and a scalar step length L(k) in each step k, starting from an initial estimate x(0), so that x(k+1) = x(k) + L(k)*d(k). Having obtained a direction d(k), the dim(x)-dimensional optimization problem is reduced in each step k to the one-dimensional problem min H( x(k) + L(k)*d(k) ) w.r.t. L(k). Thus the step lengths L(k) generally differ from step to step; a simple solution is to assume L(k) = L for all k. That's where that L comes from.

    > while about using paq mixer with alphabet distributions
    > I hear just "it won't work".

    I mentioned the reason several times: http://encode.su/forum/attachment.ph...2&d=1220619017
    That's different for bytewise models as you know.

    > I can just add the p'=Squash(wr*Stretch(p)) mappings (wr=constant)
    > into my linear coder and it would help too. So what?
    > That's not all what a paq mixer does.

    I never said that, and I don't know what you actually mean there.

    > And then we have a choice of multiple possible iteration methods,
    > and which one is better can be only determined by manual testing,
    > and which is the best is unknown.
    > Its unknown whether its good to use more information there.

    Mathematically, using more information makes sense, since you differentiate the whole system dh/ds_i = dh/dp * dp/dx * dx/ds_i with
    h = -ln(a+b*p)
    p = Sq(x), x = sum_i w_i*s_i
    and don't just bypass the effect of each individual component. That's the foundation of error backpropagation, too.
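
    Spelled out for this system, the chain rule gives (matching the -e*w_i term in the first post):

    dh/dp   = -b/(a+b*p)    (a=0, b=1 for y=1; a=1, b=-1 for y=0)
    dp/dx   = p*(1-p)
    dx/ds_i = w_i
    => dh/ds_i = dh/dp * dp/dx * dx/ds_i = -(y-p)*w_i = -e*w_i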

    And finally, citing my first post: "...I have no idea how well this will perform in practice..."

    > 1. Afair its optimal for weighted code length
    > Sum -c^(N-i)*Log[ p_i ]

    Right - I didn't know that. But in this case the minimizer of -sum_k c_k*ln(a_k+b_k*p) equals the minimizer of sum_k c_k*(y_k-p)^2, which I found pretty interesting.

    > 2. Its true that this exponential weighting is arbitrary.
    > Even more can be said - different weighting functions
    > were tested and there were better ones.
    > But here its a matter of computation cost, these better
    > functions are much more expensive, but not much better
    > compression-wise, so this one is used.

    Well, the same applies to simple gradient descent, which is used here as one possible numeric minimization technique...

    ... and the Hessian is singular

    >
    > ****************
    >
    > Btw, I actually have a draft of logistic mixing interpretation.
    >
    > 1. The idea is that coding cost is a linear combination
    > of Log[p_i], so linear mixing of submodel inputs in
    > log space might be more fitting... and adaptive.
    >

    But where does Squash come from? If your answer is that it is applied to get probabilities back from logistic space, you would be confirming (#).

  9. #9
    MontySpear (Member)
    Great work by all. Thanks for sharing the important stuff.


  10. #10
    Matt Mahoney (Expert, Melbourne, Florida, USA)
    Somehow this escaped my attention. http://mattmahoney.net/dc/text.html#1678

