
Thread: Model orders

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,362
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
    Hi!

    Somehow I got into an argument about model orders and such.
    That is, my opponent was saying that a single context is enough
    for "most cases" - he made an exception for text -
    at least when proper initialization is used
    (inheritance from lower orders, actually), and that even PPM is
    a waste of time (although it's a trade-off for speed compared
    to normal context mixing). So I tried to prove that there's
    a good enough gain from using multiple models simultaneously.
    And, after some messing around with copy/paste, here's my
    comparison:

    http://shelwien.googlepages.com/order_test1.rar
    ------------------------------------------------------------
    (compressed size, (de)coding time)

                  mcpcom.exe               wcc386.exe

    11223040                     536624               uncompressed
     8175655     5.469s          398842     0.312s    plain order0
     5714465     5.391s          318126     0.328s    plain order1
     5658022     9.063s          310300     0.469s    mix(order0,order1)
     4980971     8.062s          314199     0.625s    plain order2
     4913790     8.422s          299592     0.640s    order0->order1->order2 switch
     4755565    20.187s          291268     1.234s    mix(mix(order0,order1),order2)

     4976344     3.703s          295342     0.281s    PPMd vJ -m256 -o2

    Well, in the end it looks like all this only confirmed his opinion,
    but I'd already made these coders anyway, so maybe somebody
    will find something of interest there.

    Btw, I also tried using the mixer ripped out of paq8 (included),
    but wasn't able to make it do what I wanted. Wonder if anybody
    could explain whether I made a mistake somewhere, or whether that mixer
    really is only as efficient as it looks here.

  2. #2
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    I think I should make some experiments with an order-1-0 model.

    Using two predictors a la FPAQ0P we can mix two probabilities:

    For example:
    p1 - order-1 probability
    p0 - order-0 probability

    p = ((p1 * 2) + p0) / 3

    or

    p = ((p1 * 3) + p0) / 4

    here we can eliminate the division, since division by four can be replaced with a shift by 2.

    i.e.

    p = ((p1 * 3) + p0) >> 2

    etc.
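
    To make the static mix concrete, here's a minimal C++ sketch along these lines, assuming 12-bit probabilities and an fpaq0p-style shift update; the structure and the names (BitModel, Order10Mix) are only for illustration, not taken from any existing coder:

    #include <cstdint>

    struct BitModel {
        uint16_t p = 2048;                 // P(bit==1), 12-bit fixed point
        void update(int y) {               // fpaq0p-style shift update
            if (y) p += (4096 - p) >> 5;
            else   p -= p >> 5;
        }
    };

    struct Order10Mix {
        BitModel o0;                       // order-0 model
        BitModel o1[256];                  // order-1 model, indexed by previous byte
        int predict(uint8_t prev) const {  // static mix: p = (3*p1 + p0) / 4
            return (3 * o1[prev].p + o0.p) >> 2;
        }
        void update(uint8_t prev, int y) { // both submodels see every coded bit
            o0.update(y);
            o1[prev].update(y);
        }
    };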


  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,362
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
    That's static mixing, it's not cool.
    Also, it doesn't improve compression much anyway.

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,362
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
    Btw, fpaq0p results for the same files:

    mcpcom wcc386
    8199543 401051 // fpaq0p result
    2.203s 0.140s // times by included binary
    1.969s 0.125s // compiled with IC10 /ML /O3 /QxK /Qipo
    4.703s 0.265s // compiled with IC10 /MD /O3 /QxK /Qipo (like executables in order_test1.rar)

  5. #5
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    I'm just thinking about my LZPM and its decompression speed.

    Adaptive weighting as done in PAQ6 may seriously slow down the process.

    Fixed-weight mixing as done in PAQ1 is also slow.

    I think I should look carefully at yours. Can you please post a formula for your mixing?

  6. #6
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by Shelwien
    Btw, fpaq0p results for the same files:

    mcpcom wcc386
    8199543 401051 // fpaq0p result
    2.203s 0.140s // times by included binary
    1.969s 0.125s // compiled with IC10 /ML /O3 /QxK /Qipo
    4.703s 0.265s // compiled with IC10 /MD /O3 /QxK /Qipo (like executables in order_test1.rar)
    Looking at this I think I eventually must get into IC10. By the way, do you use Visual C++ 2005 or something else? As far as I know, with the latest VC all libraries are MT (multi-threaded), so for single-threaded use _CRT_DISABLE_PERFCRIT_LOCKS should be defined.

    Anyway, FPAQ0P is faster. LZPM uses an even more optimized version of this type of coder...

  7. #7
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by encode
    By the way, do you use Visual C++ 2005 or something else? As far as I know, with the latest VC all libraries are MT (multi-threaded), so for single-threaded use _CRT_DISABLE_PERFCRIT_LOCKS should be defined.
    Just because:
    icl: warning: option /Qvc8 or higher used with /ML[d] is not supported

    (ICL 9.1)

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,362
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
    1. Mixing itself can't be slow, it's just a single multiplication.
    But additionally updating another counter _is_ slow.

    2. Look at Update() in sh_mixer.inc, I think it's fairly obvious.
    Also I think that paq's mixer is really dumb, and it's not my mistake that it
    doesn't work as well as mine.

    3. I don't use Studio, but I ripped the console compilers out of all of them.
    So I tend to use recent compilers with VC6 libs - it really helps with size,
    and with using msvcrt.dll instead of msvcr9.dll etc.

    4. I didn't optimize anything - not for speed, and in fact not for size either.
    It's what I've got by gathering some libs and copy-pasting from other projects.
    So I think that making a faster coder is no problem.
    Actually, some old coders like those from http://compression.ru/sh/coders6c2.rar might do.

    5. IC10 looks like it has a completely rewritten (again) code generation engine.
    So it's better to use it together with IC9.1 - sometimes 9.1 is better, sometimes not.
    Though IC10 might do much better for newer CPUs.

  9. #9
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    OK, thanks for reply!

    Will start diggin'...

  10. #10
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hi!

    My experience shows that mixing itself isn't slow, but the weight adaptation is time consuming (cmm2 dropped from ~700kb/s to 420 on my pc). Mixing and higher order models are definitely worth the effort, just have a look at lpaq (strip off the match model). Most compression comes from orders 3, 4 and higher ones, where you have to adapt quicker; the lower orders can adapt faster and give a decent baseline of compression.

    Your mixing seems to be hierarchical, like mix( p0, mix( p1, p2 ) ), which gives two averages instead of three. Implicitly this gives higher weights to the lower orders (the orders in the "outer mix", here p0). Try to average over more than two probabilities at once, or swap the mixes. Has anybody tried this hierarchical stuff before? Maybe this can be made efficient?

    I got a nice compression improvement from selecting the mixer by a small context (a few msbs from the last byte, coding order and some other synthetic stuff, like steady state information). I only had a brief look at your paq rip-off. You only use two mixers, independent of context? You should at least select a mixer by the coding order; I noticed that this is the most important mixing context (which should be obvious, after a second thought).

    I hope this can help you.

    Looking forward to further discussions.

    n8
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,362
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
    > My expirience shows that mixing itself isn't slow, but the weight adaption
    > is time consuming (cmm2 dropped from ~700kb/s to 420 on my pc).

    Not exactly sure what you're talking about, but in my implementation
    the mixer uses the same update as a counter, so when mixing 2 models you
    need to update 3 counters, though there are trade-offs, like PPM with
    partial updates.

    Anyway, I was initially talking about custom models for custom data,
    like audio/video codecs and such. And it still remains a question
    whether these kinds of bruteforce approaches, like combining the results of
    multiple models etc, are worth their speed impact.
    E.g. is a 2% smaller file worth 2x slower decompression or not.
    (It seems that even 100x slower compression is considered acceptable
    for almost any visible size reduction, if decoding speed remains the same.)

    > Mixing and higher order models are definitely worth the effort, just have a look at
    > lpaq (strip off the match model).

    I won't.
    If you don't know me, I'm not anyone new to the scene, so you don't need
    to explain stuff like that.

    For example, here's my compressor from 2003:
    http://compression.ru/sh/ash04.rar (04 with sources, remains obscure for some reason)
    http://compression.ru/sh/ash04a.rar (04 binary with optimized parameters)

    And also there's this "compressors.txt" file in my archive from the first post,
    with some statistics on that, and it shows that e.g. for executables using more
    than order2 "isn't worth the speed impact".
    That is, I do personally agree with you, but am unable to persuade those who don't.

    > Most compression comes from orders 3,4
    > and higher ones, where you have to adapt quicker, the lower orders can
    > adapt faster and give a decent baseline of compression.

    > Your mixing seems to be hierarchical, like mix( p0, mix( p1, p2 ) ),
    > which gives two averages instead of three. Implicitly this gives higher
    > weights to lower orders (the orders in the "outer mix", here p0),

    It doesn't, because it gives a higher weight to the model which is more
    efficient entropy-wise. Also the mixers are initialized with a weight
    which doesn't include the first probability at all.

    > try to average over more than two probabilities at once or swap the mixes.

    I did it like that 5 years ago, thanks.
    And now I think that it's better to use a mathematically valid mixer
    instead of shamanic ones.

    > Has anybody tried this hierarchical stuff before?
    > Maybe this can be made efficient?

    Well, it _is_ efficient.
    I think you'd see it if you tried to reach the same compression level
    with "dumb" contexts like the ones used in my examples (whole previous byte and such).

    I did try to prove it your way, though, by stripping lpaq, but it seems
    that it has too many "unsportsmanlike" tricks in it. I did manage
    to disable all models except order1 and 2 (despite the #defines),
    and the text/exe preprocessing, but there's too much stuff still left,
    as even the hash function and counters are somewhat tuned to the "real data".
    So I hope that for an objective mixer comparison some "paq-way" representative
    can quickly compose an fpaq0-or-something based direct analog of my coders,
    because it'd take me several hours either to tweak my coders that way
    or to break down lpaq even further.

    > I got a nice compression improvement in selecting the mixer by a small context
    > (few msbs from the last byte, coding order and some other synthetic stuff,
    > like steady state information).

    Well, that's normal. In my examples the mixers have some context too, btw.

    > I only had a brief look at your paq-rip-off.
    > You only use two mixers, independent of context?

    I use it the same way as my mixer.
    That is, I wrote a wrapper that allows commenting out
    "sh_mixer.inc" and uncommenting "paq8_mixer.inc", and it'll work and
    even give some improvement over unmixed contexts, but
    much worse than my mixer.

    > You should at least select a mixer by the coding order,
    > I noticed that this is the most important mixing context
    > (which should be obvious, after a second thought).

    Well, that's only if you use the same mixer array for mixing different orders.
    Which isn't obvious. Instead, it's more obvious imho that you'd need different
    contexts for mixing different orders - at least different quantization.
    Anyway, I guess you didn't look at my sources.

    > I hope this can help you.

    Thanks, though I haven't actually used this kind of approach for a long
    time now. But I needed some examples with a clear distinction between using
    and not using different orders, and with orders in their historical sense.

    > Looking forward to further discussions.

    Okay.

  12. #12
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I had no look at your source, sorry, and maybe I didn't read your posts "exactly" enough - I apologize (and I didn't know you either). I can't look at your sources right now, since I'm at work (blocking .zips, etc); btw, I'm new to the scene.

    You do mixing like "counter" updates? What is a "counter" in your terminology? Do you refer to a probability with adjustment by the error, or a counter like counting the occurrences of 0/1 bits?

    I tried to make a mixer based on this idea, with poor results, nice speed though. For mixing I use an optimization approach (minimizing the entropy); I can't see how this can do better?! (concerning the update rule - simply like a counter/bit model, this isn't optimal). The optimization (weight adjustment) after every bit can suffer from (I don't know how to say it clearly, language barriers) "local noise", which aggravates convergence. PAQ's mixer is a neural network, but not that different from weighted averaging (except the stretch/squash transformation preceding and succeeding mixing) with adjustment based on the gradient. Matt modified the gradient descent to minimize ld( pbit ), rather than standard back propagation (minimizing RMSE). One problem you always face is getting stuck in a local minimum.

    Could you please explain "efficient entropy-wise"? Better predictions lower the entropy - that is clear. My first thought was: the "inner" mixer joins the higher orders into a probability like (a*p1+b*p2) / (a+b) = p', and now you mix again with p0 the same way. The inner "mix" completely ignores the relation to the lower order - this looked "suspicious" to me.
    Something which comes to my mind now was an experiment to improve mixing speed by only using 3 or 4 active models out of 6 to 8 in mixing, e.g. orders 654 when coding at order 6, and completely ignoring lower ones - which improved speed AND compression.
    Maybe this is exactly what you meant!?

    Mixer initialization isn't that critical in my experience.

    I agree with you that a slight compression gain compared to a huge amount of extra time isn't worth the effort. The general order-n models are somehow universal (since data is stored in bytes), but special models will definitely do better - like a bi- or trigram word model for text, surrounding pixels for images and so on. I think a specialized model along with maybe order210 will give nice results; and higher orders are definitely not worth the effort here.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  13. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,362
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
    > and I didn't know you either

    Well, it's no wonder, as I've got a compression-related job
    which blocks me from advertising my inventions, and nothing
    interesting was happening in the field besides Matt's works,
    which are great as programming examples, but imho a complete
    dead end as far as further development of compression theory goes
    (maybe not for Matt himself, but surely for people trying to follow).
    And just recently I accidentally found this forum and discovered
    that maybe there's some interesting development again.

    > You do mixing like "counter" updates?
    > What is a "counter" in your terminology?
    > Do you refer to a probability with adjustment by the error,
    > or a counter like counting the occurrences of 0/1 bits?

    Well, by counter I mean a statistic counter like these "states".
    Actually my counter is derived from the same origin as the states -
    an integer count of occurrences, that is.
    Btw, here's my first compressor, with more straightforward
    counters than the ash-derived ones which I used in the
    "order_test1.rar" examples:
    http://compression.ru/sh/ppmy_9a8.rar

    So, it evolved like this:
    1. P0 = bitcount[0]/(bitcount[0]+bitcount[1]);
       bitcount[bit]++;
    2. P0 = bitcount[0]/(bitcount[0]+bitcount[1]);
       (bitcount[bit]*=(1-1/InvSpeed))++;
    3. P0 = bitcount[0];
       (bitcount[0]*=(1-1/bitcount[1]))+=(bit?0:1/bitcount[1]);
       (bitcount[1]*=(1-1/InvSpeed))++;
    4. P0 = bitcount[0];
       wr = Tbl[InvSpeed][bitcount[1]];
       (bitcount[0]*=(1-wr))+=(bit?0:wr);
       bitcount[1]++;

    So [2] here is a nonstationarity countermeasure, and the other steps
    are mostly speed optimizations. And Matt made another step here,
    by quantizing such a counter into a single byte - which takes
    away the freedom to control the adaptation speed and even adds
    some redundancy. So of course his approach has its place in
    industrial compression schemes (like h264), but I think it really
    complicates the development of new models.
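
    To make step [4] concrete, here's a minimal C++ sketch of such a counter; the rate table is a tuning parameter in the real coders, so the ~1/(n+4) hint in the comment is just an illustrative choice:

    struct Counter4 {
        double P = 0.5;                         // probability of bit==0
        int    n = 0;                           // occurrence count
        // wrTbl[n] is the adaptation rate for the n-th occurrence, e.g. ~1/(n+4)
        void update(int bit, const double* wrTbl, int tblSize) {
            double wr = wrTbl[n];
            P = P * (1 - wr) + (bit ? 0 : wr);  // move toward 1 on a 0-bit, toward 0 on a 1-bit
            if (n < tblSize - 1) n++;           // saturate so the rate stays defined
        }
    };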

    > I tried to make a mixer based on this idea with poor results, nice speed
    > though. For mixing I use an optimization approach (minimizing the entropy),

    There was this trick which I called "bruteforce adaptation" (BFA) - true entropy
    minimization, which accumulates all the possible codelengths for a given
    set of weight values and uses the weight with the minimal length for actual
    encoding:

    for( w=0,i=0; w<=1; w+=1./63,i++ ) {
      p = p1*(1-w) + p2*w;
      (len[i]*=(1-1/InvSpeed)) += -log2(bit?1-p:p);
    }

    Here's a working example from that time:
    http://compression.ru/sh/o1bfa.rar
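
    For reference, a self-contained C++ sketch of the BFA idea described above; the 64 weight steps, the default InvSpeed and the plain argmin are illustrative choices, not the parameters of o1bfa:

    #include <cmath>

    struct BFAMixer {
        static const int N = 64;
        double len[N] = {};                       // decayed codelength per candidate weight
        double mix(double p1, double p2) const {  // p1,p2 = P(bit==0) from the two models
            int best = 0;
            for (int i = 1; i < N; i++)
                if (len[i] < len[best]) best = i;
            double w = best / double(N - 1);
            return p1 * (1 - w) + p2 * w;         // mixed P(bit==0) used for coding
        }
        void update(int bit, double p1, double p2, double InvSpeed = 32) {
            for (int i = 0; i < N; i++) {
                double w = i / double(N - 1);
                double p = p1 * (1 - w) + p2 * w;
                len[i] = len[i] * (1 - 1 / InvSpeed) - std::log2(bit ? 1 - p : p);
            }
        }
    };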

    Well, this is what I usually call "entropy-wise"... but my mixer from
    the first post is based on a newer idea... and has even better performance,
    though I was surprised when I compared it with a BFA mixer and found that out.

    And that new idea is really simple: we update the mixed probability as
    if it came from a normal counter, and then "back-propagate" the change
    into the reachable parameters... the weight, in this case.
    Well, it looks like this (here's a "translated" version from my sh_mixer.inc):

    int w;
    iMixer::Process( bit, p1, p2, InvSpeed, Limit ) {
      Counter tmp;
      tmp.P = p1+(p2-p1)*w;        // mixed probability of 0
      tmp.Encode( bit );           // encode the bit with mixing result
      tmp.Update( bit, InvSpeed ); // update tmp counter like usual
      // speed opt - no sense to change weight when p1 and p2 are too close
      if( ABS(p2-p1)>Limit ) {
        // change weight so that we'll get an "updated" value from repeated mixing
        w = (tmp.P-p1)/(p2-p1);
        w = max(0,min(1,w));
      }
    }
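
    And here's a runnable C++ sketch of the same thing with the counter-style update written out explicitly; the linear update rule and the default InvSpeed/Limit values are assumptions for illustration (the real sh_mixer.inc counters are more elaborate):

    #include <algorithm>
    #include <cmath>

    struct Mixer2 {
        double w = 0.5;                                // weight of p2 in [0..1]
        // p1, p2 = P(bit==0) from two models; returns the mixed P(bit==0)
        double mix(double p1, double p2) const { return p1 + (p2 - p1) * w; }
        void Process(int bit, double p1, double p2,
                     double InvSpeed = 20, double Limit = 0.003) {
            double p  = mix(p1, p2);                   // probability used for coding
            double wr = 1.0 / InvSpeed;
            double pu = p * (1 - wr) + (bit ? 0 : wr); // update the mix like a counter
            if (std::fabs(p2 - p1) > Limit) {          // skip when the models agree
                w = (pu - p1) / (p2 - p1);             // back-propagate into the weight
                w = std::max(0.0, std::min(1.0, w));
            }
        }
    };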

    > PAQ's mixer is a neural network, but not that different from
    > weighted averaging (except the stretch/squash transformation preceding and
    > succeeding mixing) with adjustment based on the gradient. Matt modified
    > the gradient descent to minimize ld( pbit ), rather than standard back
    > propagation (minimizing RMSE). One problem you always face is getting
    > stuck in a local minimum.

    Well, I think that the neural network origin is what makes Matt's mixer so
    inefficient in my example. That is, maybe he did somehow manage to make
    it output something useful (it's always possible when you build a model
    from a given set of functions, compared to trying to insert a function
    into a working model), but it surely doesn't adapt well to some abstract
    probabilities out of paq's context.

    > Could you please explain "efficient entropy-wise"? Better predictions lower
    > the entropy - that is clear. My first thought was: the "inner" mixer joins
    > the higher orders into a probability like (a*p1+b*p2) / (a+b) = p', and now
    > you mix again with p0 the same way. The inner "mix" completely ignores the
    > relation to the lower order - this looked "suspicious" to me.

    Well, my 2-1-0 mix looks like this:

    // (de)coding a bit from "flag" variable.
    // TTable_NN,MMM are counter update mode settings (kinda random here)
    // also "100" in mixer calls is a limit value = 0.003 in 0..1 interval
    p1 = order0[z].P;
    p2 = order1[p][z].P;
    p3 = order2[pp][p][z].P;
    p4 = p1 + (p2-p1)*w1[p][z].w;
    w2[p][z].Process( flag, TTable_5,1,100, p4,p3 );
    order0[z].Update( flag, TTable_20, 700 );
    order1[p][z].Update( flag, TTable_10, 600 );
    order2[pp][p][z].Update( flag, TTable_8, 400 );
    w1[p][z].Process( flag, TTable_5,1,100, p1,p2 );

    > Something which comes to my mind now was an experiment to improve mixing
    > speed by only using 3 or 4 active models out of 6 to 8 in mixing, e.g.
    > orders 654 when coding at order 6, and completely ignoring lower ones -
    > which improved speed AND compression.
    > Maybe this is exactly what you meant!?

    I didn't, though I'm pretty sure that you'd get a valid result from
    any set of estimations you run through my mixer.

    > Mixer initialization isn't that critical in my experience.

    It is quite significant with a wide enough mixer context.

    > I agree with you, that a slight compression gain compared to a huge amount
    > of extra time isn't worth the effort.

    Well, actually you agree with my opponent here.
    What I think is that it's always useful to have the best result you're able to
    reach within common-sense restrictions (e.g. no supercomputers etc).
    Because only when you have values to compare are you able to make a
    correct decision on where to stop.
    And the current problem is that paq8 isn't quite that. It's obviously more
    useful than some widespread archivers for data redundancy analysis, but it's
    too easy to make a very simple model with better results for custom data -
    and no wonder, as paq8 uses lots of tricks for common file compression, but
    its compression classes, which aren't particularly good even in theory, are
    made even less efficient by all these speed and memory trade-offs.

    > The general order-n models are
    > somehow universal (since data is stored in bytes), but special models will
    > definitely do better - like a bi- or trigram word model for text, surrounding
    > pixels for images and so on. I think a specialized model along with maybe
    > order210 will give nice results; and higher orders are definitely not worth
    > the effort here.

    Well, it depends on the target.
    That is, yes, if you aim for best places in composite comparisons (size+speed),
    but this approach completely fails in data analysis.

  14. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,362
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
    @encode: wonder if there's a way of having my indents back and no smilies.

  15. #15
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I have to state that I never saw this weighting approach before. Looks interesting to me, maybe I'll give it a try. To summarize your "binary" (two inputs) weighting (correct me if I'm wrong):

    w0 = some initial weight

    update - binary averaging:
    p' = p1 + (p2-p1)*wk = (1-wk)*p1 + wk*p2
    (saves one multiplication, nice)
    note: (p' - p1)/(p2-p1) = wk
    adjust p' in the direction of the prediction, maybe like Ilia said:
    p'' = p' + (p* - p') / S
    p* = optimal prediction, probability = 0 or 1 depending on the bit
    and extract the new weight:
    w(k+1) = (p''-p1)/(p2-p1)

    I think the main problem is the division - a speed impact; this could be improved by using a (nonlinearly quantized) lookup table for 1/(p2-p1), scaled by some factor.
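
    A small sketch of that reciprocal-table idea, assuming 12-bit probabilities and a plain linear table (the text above suggests a nonlinearly quantized one); the scaling and clamping are illustrative choices:

    #include <cstdint>
    #include <cstdlib>
    #include <algorithm>

    struct RecipTable {
        int32_t tbl[4096];                            // tbl[d] ~= (1<<16) / d
        RecipTable() {
            tbl[0] = 0;
            for (int d = 1; d < 4096; d++) tbl[d] = (1 << 16) / d;
        }
        // w(k+1) = (p'' - p1) / (p2 - p1) without a division, as a 16-bit weight
        int32_t weight(int32_t ppp, int32_t p1, int32_t p2) const {
            int32_t d = p2 - p1;
            if (d == 0) return 1 << 15;               // models agree: keep w = 0.5
            int32_t w = (ppp - p1) * tbl[std::abs(d)];
            if (d < 0) w = -w;
            return std::max(0, std::min(w, 1 << 16)); // clamp to the [0..1] range
        }
    };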

    But I can't see a "clear" mathematical motivation behind this - like minimization of coding cost. It is clear that this (somehow - but how, mathematically?) pushes wk in the direction of the more accurate prediction.

    Your BFA is locally optimal (per bit coding step) but loses the relation to the other orders. This reminds me of lz77 parsing: optimal match length (viewed as entropy) for the current position (greedy; the "innermost mix()"), or optimal length for the current (innermost mix()) + the next (the next outer to innermost mix()), etc... (a few bytes of lookahead, or even optimal parsing of a whole file/block). Ilia is the expert here.

    I don't think that a NN mixer is inefficient - it does what it is forced to do mathematically - minimizing coding cost by gradient descent.

    Btw, Matt doesn't lose the ability to change adaptation speed. The bit probabilities aren't directly calculated from counter values. Instead the counter state is mapped to a probability, which is adjusted. The adjustment speed can be modified; in fact it is decreased (somehow) proportionally to the frequency of the corresponding state's occurrence.

    To correct my previous statement: I agree with your conclusion. Using specialized modelling will do better than general modelling. I do general modelling, since I'm implementing a general-purpose compressor (as a hobbyist) - I don't do data analysis.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  16. #16
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by Shelwien
    @encode: wonder if there's a way of having my indents back and no smilies.
    I don't quite understand the question. If it's about the forum engine - it's just as it is.
    The forum should remember you via cookies.
    To insert a quote, press the "Quote" link on the message, or use the quote tags.
    Smilies are always enabled.


  17. #17
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    2Shelwien
    By the way, do you know more efficient model update schemes for pure order-N coders? Can you recommend something for my LZPM?
    The currently well-known schemes are:
    1.
    if (y) pr += ((4096 - pr) >> 5);
    else pr -= (pr >> 5);

    2.
    pr += (((y << 12) - pr) >> 5);
    // or
    pr += ((((y << 12) - pr) >> 5) + (pr < 204));

    3.
    if (y) pr += ((65536 - pr) >> 5);
    else pr -= (pr >> 5);
    // (P() = pr >> 4)
    Uwe proposed to use different shift values depending on the current probability value - i.e. a more aggressive update when the probability is more skewed.
    I just tested LZPM with an order-1-0 coder and a simple mixer. Well, it helps mainly with binary files; at the same time the code would have to be fully rewritten and wouldn't be as simple.
    Just interesting to ask an expert like you.
    For those who didn't know: Eugene Shelwien is the world's biggest expert in data compression and math. It's a pride and a pleasure to see such people on this forum!

  18. #18
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Taking a few minutes of my break, I derived the following (using an update rule like Ilia posted):

    p' = pa * w(k) + pb * (1-w(k))
    p* = optimal choice 0 or 1
    p'' = p' + (p*-p') * alpha
    w(k+1) = (p'' - pb) / (pa - pb)
    adjusted by alpha*100% of prediction error

    ...calculations, simplify:

    w(k+1) = (1-alpha)*w(k) + alpha * (p* - pb)/(pa - pb)

    (maybe there's a mistake, i only wrote it down quickly)

    Maybe the second term (p*-pb)/(pa-pb) can be viewed as some approximation of the gradient regarding ld( p(bit) )?! I would interpret it as model b's prediction error, normalized by the compliance of both predictions.

    Any ideas?

    Greets
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  19. #19
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,362
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
    >> wonder if there's a way of having my indents back and no smilies.

    {
    test
    }
    "test" is supposed to have 2 spaces at the line start.

    > I don't quite understand the question. If it's about the forum engine - it's just as it is.
    > The forum should remember you via cookies.
    > To insert a quote, press the "Quote" link on the message, or use the quote tags.
    > Smilies are always enabled.

    Well, it's not my problem if it messes up my explanations, then.

  20. #20
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Will look at the miniBB FAQ; however, AFAIK, there is no solution for indents...


  21. #21
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,362
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
    @encode:
    > By the way, do you know more efficient model update schemes for pure
    > order-N coders? Can you recommend something for my LZPM?
    > The currently well-known schemes are:
    >
    > 1. if (y) pr += ((4096 - pr) >> 5); else pr -= (pr >> 5);

    So, let's rewrite this for pr in [0..1]:
    wr = 1/32
    pr = pr*(1-wr) + y*wr

    I think that with this you'll suddenly be able to
    tune the counter much better than the form quoted above.
    For instance, the number in 1/32 can even be dynamic,
    e.g. the number of context occurrences.
    And it can all be implemented with one multiplication,
    which is no worse than shifts nowadays.
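
    A minimal sketch of that rewritten update with a dynamic rate driven by the occurrence count; the specific rate formula and the cap are illustrative assumptions, not taken from any of the coders above:

    struct AdaptiveBit {
        double p = 0.5;                  // P(bit==1) in [0..1]
        int    n = 0;                    // number of context occurrences
        void update(int y) {
            double wr = 1.0 / (n + 2);   // dynamic rate instead of a fixed 1/32
            p += (y - p) * wr;           // same as p = p*(1-wr) + y*wr, one multiplication
            if (n < 30) n++;             // cap the count so adaptation never fully stops
        }
    };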

    > 2. pr += (((y << 12) - pr) >> 5);

    This one is a speed optimization of (1).
    With a small number of such updates per iteration, the if()
    version might be faster for skewed pr (far from 0.5)
    due to the CPU's branch prediction, but I suppose that
    (2) is better for common use anyway.

    > 2a. pr += ((((y << 12) - pr) >> 5) + (pr < 204));

    wr = 1/32
    pr = pr*(1-wr) + y*wr + pr*wr/64
    pr = pr*(1-wr*63/64) + y*wr

    So I guess this is basically the usual update, but with slightly
    different speed and an extra possibility of going out of the
    [0..1] interval, so an additional check is needed.

    > 3. if (y) pr += ((65536 - pr) >> 5); else pr -= (pr >> 5);
    > // (P() = pr >> 4)

    Well, this one is (1) with 16-bit precision instead of 12-bit.
    I can say that my tests showed that with probability
    precision anywhere from 12 to 16 bits you can "get lucky",
    depending on the optimization target.

    To be specific, I have results like this:
    370648 370607 // 15 bits
    370652 370607 // 12 bits
    370646 370628 // 11 bits
    371075 370708 // 8 bits
    And it's not about a simple change of the shift value - a complete
    model reoptimization (contexts and other parameters) was
    performed for an objective comparison (the numbers in the first column
    are the results before optimization).
    So I think that you need 12-bit precision, but anything more
    is masked by the "model noise".

    Here's a list of "error weight" values for different precisions
    (bits lost due to the given order of redundancy in 1Mbit),
    which kinda supports my statement too.

    In[1]:= N[Table[{i, 1000000*Log[2, 1/(1 - 2^-i)]}, {i, 8, 15}]]

    Out[1]= {{8, 5646.56}, {9, 2820.52}, {10, 1409.57}, {11, 704.613},
    {12, 352.263}, {13, 176.121}, {14, 88.0578}, {15, 44.0282}}

    > Uwe proposed to use different shift values depending on the current
    > probability value - i.e. more aggressive update when probability is more
    > skewed.

    Well, it's basically what SSE/APM does...
    So obviously you can use a static map and include other contexts too.
    If only there was a good way to optimize such things,
    other than bruteforce...

    > I just tested LZPM with order-1-0 coder and simple mixer. Well, it helps
    > mainly with binary files, at the same time the code should be fully
    > rewritten and not as simple.

    Why, surely where you see the difference depends on your model...
    Also some more things to mention:
    1. PPM compresses better than LZP, and nobody has proved that LZ is necessarily faster.
    2. Bit tweaking and table lookups are good for embedded solutions,
    but can cause an even worse slowdown than a bunch of divisions on the current x86 arch.
    3. You don't need to do everything the same way Matt does. For example,
    unary symbol coding (a symbol can just as well be an offset in LZ) is much faster
    than binary coding with modeling of each bit.

    So, say, what do you think about ppmonstr?..

  22. #22
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,362
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
    Ah, forgot one more thing.
    My counters actually have one more parameter,
    tuning of which helps too. It's kinda like this:

    wr = 1/Speed
    w0 = 1/Weight // order(-1) weight
    pr = pr*(1-wr) + (y*(1-w0)+(1-y)*w0)*wr

    dp = w0*wr
    pr = pr*(1-wr) + (y ? wr-dp : dp)

    So your (2a) is imho a glitchy version of this.
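
    A sketch of such a counter with the extra order(-1) weight, in floating point for clarity; the Speed/Weight defaults are illustrative:

    struct PriorCounter {
        double p = 0.5;                          // P(bit==1)
        void update(int y, double Speed = 32, double Weight = 64) {
            double wr = 1.0 / Speed;
            double w0 = 1.0 / Weight;            // order(-1) weight
            double target = y ? 1.0 - w0 : w0;   // never push all the way to 0 or 1
            p = p * (1 - wr) + target * wr;
        }
    };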

  23. #23
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    @Shelwien:

    Hi,
    Even though I'm not an expert on data compression, your explanations are very clear from my point of view. Thanks a lot. I'm designing an archive format for game/simulation software. Its name is Bit Archive Format. I'm trying to use a ROLZ-based approach on the compression side. I've found that ROLZ is very fast and efficient. I don't like PPM or pure CM. But I'll try to use your mixing ideas. I'm sure this will help a lot for encoding literal/match-length symbols instead of bare order-0 range coding. As "encode" said before, order 1-0 mixing seems very good.

    The main design goals are effective compression (it can be very slow and resource hungry) and fast decompression. Game files are generally in a binary format aligned to 32-bit data (lots of ints and floats). Do you have any advice for me?

    Thanks a lot
    Osman Turan
    BIT Archiver homepage: www.osmanturan.com

  24. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,362
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
    Btw, about these table lookups you all seem to like so much.
    I've made a test simulation:
    http://shelwien.googlepages.com/divtest.cpp
    http://shelwien.googlepages.com/divtest.exe
    And it doesn't look very good for lookup-only compressor
    implementations, as they would obviously have the L1 stuffed
    with counter values and such - even more so when using
    hashtables for context lookups.
    Wonder if somebody could run this on some Core Duo;
    lookups should be even slower (relatively) there.

    Test result for P4 3.1:

    16k: DivConst(imul)=0.015s, Div=0.078s, Lookup=0.015s
    32k: DivConst(imul)=0.031s, Div=0.157s, Lookup=0.015s
    64k: DivConst(imul)=0.094s, Div=0.312s, Lookup=0.032s
    128k: DivConst(imul)=0.172s, Div=0.625s, Lookup=0.078s
    256k: DivConst(imul)=0.390s, Div=1.282s, Lookup=0.531s
    512k: DivConst(imul)=0.812s, Div=2.532s, Lookup=1.562s
    1024k: DivConst(imul)=1.609s, Div=5.875s, Lookup=4.219s
    2048k: DivConst(imul)=3.187s, Div=10.969s, Lookup=15.547s

    So
    L1Lookup = 1/2*Mul = 1/10*Div
    L2Lookup = 2*Mul = 3/5*Div
    DDR2Lookup = 5*Mul = 3/2*Div

  25. #25
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I've just run it on my Asus-based laptop. Here are the details about the computer and the test results:

    Core 2 Duo T7500 @ 2.20 GHz
    2 GB RAM
    Vista Business x64

    16k: DivConst(imul)=0.016s, Div=0.046s, Lookup=0.016s
    32k: DivConst(imul)=0.031s, Div=0.094s, Lookup=0.015s
    64k: DivConst(imul)=0.047s, Div=0.187s, Lookup=0.032s
    128k: DivConst(imul)=0.109s, Div=0.343s, Lookup=0.078s
    256k: DivConst(imul)=0.219s, Div=0.717s, Lookup=0.156s
    512k: DivConst(imul)=0.453s, Div=1.388s, Lookup=0.328s
    1024k: DivConst(imul)=0.889s, Div=2.793s, Lookup=0.920s
    2048k: DivConst(imul)=1.825s, Div=5.600s, Lookup=2.262s
    BIT Archiver homepage: www.osmanturan.com

  26. #26
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,362
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
    @osmanturan:

    1. Thanks for testing, though 2M is too small to go beyond the
    Core Duo's L2, I guess. But even here multiplications look
    comparable to lookups... and they promised to make divisions
    2x faster in Penryn...

    2.
    > I'm trying to use ROLZ based approach on compressing side. I've found that
    > ROLZ is very fast and efficient. I don't like PPM or pure CM.

    Well, CM surely is not the best method if you aim for fast (de)compression.
    But PPM is worth considering, I think. Actually, PPM with unary encoding isn't
    that much different from LZ; the only difference is that in PPM you kinda use
    unary encoding for the match length. Guess somebody should try optimizing
    PPM for speed instead of improving LZ's compression...

    > But, I'll try
    > to use your mixing ideas. I'm sure this will help a lot for encoding
    > literal/matching length symbols instead of bare order-0 range coding. As
    > "encode" said before, order 1-0 mixing seems very good.

    There's a good trick even within a single order.
    Like, you can have two counters - one with a "fast" update and one with a "slow" one -
    so after mixing the result would be close to optimal both for
    stationary and nonstationary data.
    Actually, Matt has explained this in the paq source.
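
    A minimal sketch of that fast/slow pair; the two rates and the plain average are illustrative (a real coder would mix the two adaptively):

    struct DualRateCounter {
        double fast = 0.5, slow = 0.5;   // both estimate P(bit==1) for the same context
        double p() const { return 0.5 * (fast + slow); }
        void update(int y) {
            fast += (y - fast) / 8;      // adapts quickly: good for nonstationary data
            slow += (y - slow) / 128;    // adapts slowly: good for stationary data
        }
    };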

    > Main design goal is effective compression (can be very slow and resource
    > hungry), fast decompression.

    Well, your work would be more original/promising if you tried to take some
    good CM model and make it asymmetric. For example, it could be a static model
    (Bulat did that already, though), or realtime tuning of a fast simple model
    to the data.

    > Game files are generally in binary format which aligned with 32 bits data
    > (lots of int and floats). Do you have any benefit for me?

    Well, for numbers you surely should consider at least something like
    http://en.wikipedia.org/wiki/Regression_analysis


    Also unary coding is much better than binary unless the data is really random.

    And for floats the best thing seems to be to transform them to fixed-point form
    and use interval encoding.
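
    As an illustration of the float-to-fixed-point idea, here's a minimal sketch that pulls apart the IEEE-754 fields and rebuilds the magnitude as a wide fixed-point integer; the 16.32 split is an arbitrary choice, values are assumed to be normal and below 2^16 in magnitude, and too-small values simply underflow to zero:

    #include <cstdint>
    #include <cstring>

    // returns the sign in bit 63 and |x| as 16.32 fixed point in the lower bits
    uint64_t float_to_fixed(float x) {
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof bits);            // raw IEEE-754 bits
        uint64_t sign = uint64_t(bits >> 31) << 63;
        int      exp  = int((bits >> 23) & 0xFF) - 127; // unbiased exponent
        uint64_t mant = (bits & 0x7FFFFF) | 0x800000;   // implicit leading 1
        int      shift = exp - 23 + 32;                 // value = mant * 2^(exp-23), in a 16.32 frame
        if (shift < -63) return sign;                   // zero / too small: magnitude underflows to 0
        uint64_t fixed = shift >= 0 ? mant << shift : mant >> -shift;
        return sign | fixed;                            // the fields can now be coded separately
    }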

    Actually, I did something similar for CalgaryCorpus/geo compression
    (it mostly contains floats in some ancient format):
    http://compression.ru/sh/fltgeo3b.rar
    The model there is wrong in many ways, but it still shows how you're supposed to
    compress floats. Also it appears that it still compresses the geo file better
    than paq8... quite surprisingly:

    geodump.ha 40047
    GEO.paq8o6 43911
    GEO.pmm 49638

  27. #27
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,362
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
    Btw, to the Russians here: I wonder if everybody has seen this:
    http://compression.ru/sh/2003.htm
    (as lots of the stuff I linked in this thread is mentioned there).

  28. #28
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    16k: DivConst(imul)=0.016s, Div=0.016s, Lookup=0.000s
    32k: DivConst(imul)=0.015s, Div=0.047s, Lookup=0.000s
    64k: DivConst(imul)=0.031s, Div=0.079s, Lookup=0.015s
    128k: DivConst(imul)=0.063s, Div=0.172s, Lookup=0.031s
    256k: DivConst(imul)=0.109s, Div=0.344s, Lookup=0.062s
    512k: DivConst(imul)=0.219s, Div=0.672s, Lookup=0.172s
    1024k: DivConst(imul)=0.437s, Div=1.375s, Lookup=0.453s
    2048k: DivConst(imul)=0.969s, Div=2.781s, Lookup=1.344s

    The Q6600 @2400 on my side.

  29. #29
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    2.2 GHz Athlon-64 3500+, 2 GB, WinXP (32 bit), compiled with g++ -O2 -Os -march=pentiumpro -fomit-frame-pointer -s

    16k: DivConst(imul)=0.219s, Div=0.125s, Lookup=0.016s
    32k: DivConst(imul)=0.187s, Div=0.188s, Lookup=0.031s
    64k: DivConst(imul)=0.359s, Div=0.375s, Lookup=0.063s
    128k: DivConst(imul)=0.750s, Div=0.734s, Lookup=0.141s
    256k: DivConst(imul)=1.562s, Div=1.531s, Lookup=0.422s
    512k: DivConst(imul)=3.125s, Div=3.047s, Lookup=0.968s
    1024k: DivConst(imul)=6.265s, Div=6.110s, Lookup=2.484s
    2048k: DivConst(imul)=12.562s, Div=12.266s, Lookup=6.328s

    Edit: my bad. I looked at the assembler source (-S); -Os (optimize for size) prevents g++ from optimizing a division by a constant into a multiplication. Repeating the test,

    C: mp>g++ -O2 divtest.cpp

    C: mp>a
    16k: DivConst(imul)=0.031s, Div=0.203s, Lookup=0.047s
    32k: DivConst(imul)=0.063s, Div=0.219s, Lookup=0.046s
    64k: DivConst(imul)=0.047s, Div=0.360s, Lookup=0.078s
    128k: DivConst(imul)=0.109s, Div=0.719s, Lookup=0.187s
    256k: DivConst(imul)=0.375s, Div=1.485s, Lookup=0.453s
    512k: DivConst(imul)=0.750s, Div=2.984s, Lookup=1.031s
    1024k: DivConst(imul)=1.500s, Div=5.938s, Lookup=2.687s
    2048k: DivConst(imul)=3.016s, Div=11.953s, Lookup=6.547s

    C: mp>g++ -O2 -Os divtest.cpp

    C: mp>a
    16k: DivConst(imul)=0.218s, Div=0.141s, Lookup=0.016s
    32k: DivConst(imul)=0.187s, Div=0.172s, Lookup=0.047s
    64k: DivConst(imul)=0.359s, Div=0.360s, Lookup=0.078s
    128k: DivConst(imul)=0.750s, Div=0.719s, Lookup=0.156s
    256k: DivConst(imul)=1.562s, Div=1.516s, Lookup=0.422s
    512k: DivConst(imul)=3.110s, Div=3.031s, Lookup=1.000s
    1024k: DivConst(imul)=6.234s, Div=6.078s, Lookup=2.422s
    2048k: DivConst(imul)=12.516s, Div=12.203s, Lookup=6.531s

    C: mp>g++ divtest.cpp

    C: mp>a
    16k: DivConst(imul)=0.063s, Div=0.265s, Lookup=0.032s
    32k: DivConst(imul)=0.062s, Div=0.266s, Lookup=0.078s
    64k: DivConst(imul)=0.109s, Div=0.516s, Lookup=0.156s
    128k: DivConst(imul)=0.234s, Div=1.047s, Lookup=0.328s
    256k: DivConst(imul)=0.547s, Div=2.125s, Lookup=0.703s
    512k: DivConst(imul)=1.078s, Div=4.265s, Lookup=1.516s
    1024k: DivConst(imul)=2.157s, Div=8.531s, Lookup=3.578s
    2048k: DivConst(imul)=4.313s, Div=17.156s, Lookup=8.500s

    I used -Os because it makes paq8 faster for some reason. But paq8 doesn't use integer division.

    BTW, this is with 128K L1 cache, 512K L2 cache, 2 GB DDR2 3-3-4-8.

  30. #30
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,362
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
    @toffer:

    > w(k+1) = (p''-p1)/(p2-p1)
    >
    > I think the main problem is the division - a speed impact, this could be
    > improved by using a (nonlinear quantized) lookup table for 1/(p2-p1) scaled
    > by some factor.

    Well, that'd be a minor problem if you managed to gain any worthy results
    out of this.

    > But i can't see a "clear" mathematical motivation behind this - like
    > minimization of coding cost. It is clear, that this (somehow - but how,
    > mathematically) pushes wk in the direction of the more accurate prediction.

    Okay, we don't have any objective metric for update optimization -
    because the only real metric, the size of the compressed data, can't be
    completely recomputed on every iteration.
    So actually I was talking about:
    1. An update like this: pr = pr*(1-wr) + y*wr
    is mathematically optimal w.r.t. the codesize sum with exponential weights.
    (That is, if we count it like len = len*(1-wr) - log2(y*pr+(1-y)*(1-pr)).)
    2. My mixer works exactly the same as a counter if both given
    probabilities are constant.

    But of course, I can't call it an ideal mixer or anything like that...
    It'd be plain wrong, considering that I don't actually use it in my
    models now. Just that imho it's better than paq's mixer.

    > Your BFA is locally optimal (per bit coding step) but loses the relation to
    > the other orders.

    Yes, but it's better than nothing anyway.
    And you can always calculate estimates for all weight combinations etc.
    Also, the same approach applies to any parameters of non-recurrent
    functions, so I once made a BFA-driven ppmy version.

    Btw, did you see the bfa2 includes with log() replaced by a division?
    The funny thing is, it's not some random empiric optimization, it just
    uses the derivative of the length function with respect to the w variable.

    > I don't think that a NN mixer is inefficient - it does what it is forced
    > to do mathematically - minimizing coding cost by gradient descent.

    Well, I can believe that it might reach some optimum (a local one, probably)
    on stationary data, but it just isn't adapting fast enough for a real environment.

    > Btw, Matt doesn't lose the ability to change adaptation speed. The bit
    > probabilities aren't directly calculated from counter values. Instead the
    > counter state is mapped to a probability, which is adjusted. The adjustment
    > speed can be modified; in fact it is decreased (somehow) proportionally to
    > the frequency of the corresponding state's occurrence.

    Well, SSE isn't actually used like that... it does partial compensation,
    but it's kinda the same as saying that you can map any hash value back
    to the source string. So if you're able to change both the counter behavior
    and the SSE parameters, it works much better than with only the SSE parameters.

    > ==========
    >
    > Taking a few minutes of my break, I derived the following (using an update
    > rule like Ilia posted):
    >
    > p' = pa * w(k) + pb * (1-w(k))
    > p* = optimal choice 0 or 1
    > p'' = p' + (p*-p') * alpha
    > w(k+1) = (p'' - pb) / (pa - pb)
    > adjusted by alpha*100% of prediction error
    >
    > ...calculations, simplify:
    >
    > w(k+1) = (1-alpha)*w(k) + alpha * (p* - pb)/(pa - pb)
    >
    > (maybe there's a mistake, i only wrote it down quickly)

    No, it's the same thing. Actually, I had initially written it like that
    too, but the current version appears slightly better due to different
    rounding or something like that.

    > Maybe the second term (p*-pb)/(pa-pb) can be viewed as some approximation
    > of the gradient regarding ld( p(bit) )?! I would interpret it as model b's
    > prediction error, normalized by the compliance of both predictions.
    > Any ideas?

    Well, afair it was like this:

    In[1]:= Solve[ D[y*Log[p1*(1-w)+p2*w] + (1-y)*Log[1-p1*(1-w)-p2*w],w]==0, w]
    Out[1]= { w -> (p1 - y)/(p1 - p2) }

    With y = bit value and p1,p2 = probabilities of y==1.
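
    For completeness, here's the calculation behind that Solve[] output written out by hand (just a sketch reproducing the same result in the same notation). With $p = p_1(1-w) + p_2 w$:

    \[
    \frac{\partial}{\partial w}\bigl[\,y\log p + (1-y)\log(1-p)\,\bigr]
      = \frac{y\,(p_2-p_1)}{p} - \frac{(1-y)\,(p_2-p_1)}{1-p} = 0
    \]
    \[
    \Rightarrow\; y(1-p) = (1-y)\,p
    \;\Rightarrow\; p = y
    \;\Rightarrow\; p_1 + (p_2-p_1)\,w = y
    \;\Rightarrow\; w = \frac{y-p_1}{p_2-p_1} = \frac{p_1-y}{p_1-p_2}.
    \]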
