
Thread: Parameter optimization

  1. #1
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Parameter optimization

    Hello everyone!

    I'm currently working on bit model parameter optimization for my study thesis. My bit models have got two parameters, 1-c and d:

    p_k+1 = p_k + (y_k-p_k)*(1-c) + d

    I've already written a working optimizer for simple models (direct lookup - order 0-2) and will possibly include a wrapper which dumps coding steps to a data file to allow an easy integration into compression programs (so you can tune your bit models on a large data set with less effort). My first results show that the expected improvement will be somewhere around 0.1%..1% - not very stunning, but hey - the improvement doesn't require extra effort.

    I might include more advanced bit models, e.g. using different parameter sets for 1-c and d for the startup phase (small k), similar to Matt's approach in PAQ, to allow rapid initial learning - but the learning curve will be optimized. Or using different parameters whether the current bit is a more probable symbol or a less probable symbol. Do you have any other ideas here?

    As a short example I ripped off some fpaq version and included a (locally) optimal 1-c and d at 15-bit precision. It was tuned on enwik8 - and shows the best results for such a simple model.

    http://freenet-homepage.de/toffer_86/fpaq0cm.cpp

    Greets

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    > I'm currently working on bit model parameter optimization
    > for my study thesis.
    > My bit models have got two parameters, 1-c and d:
    >
    > p_k+1 = p_k + (y_k-p_k)*(1-c) + d

    I normally use counters like this:
    p'=(p*(1-wr)+y*wr)*(1-mw)+pm*mw =
    = p + mw*(pm-p) + (1-mw)*wr*(y-p) =
    = p + (y-p)*(mw*(1-wr)+wr) - mw*(y-pm)
    but with fixed pm=0.5 (of course it can be optimized too)
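
    In plain floating point that's roughly the following (a minimal sketch for illustration, with wr/mw/pm as tuned constants - not the actual counter code):

    Code:
    // sketch of p' = (p*(1-wr)+y*wr)*(1-mw) + pm*mw
    struct Counter {
      double p = 0.5;                                  // current P(bit==1)
      void update(int y, double wr, double mw, double pm = 0.5) {
        p = (p*(1.0-wr) + y*wr)*(1.0-mw) + pm*mw;      // mix toward y, then toward pm
      }
    };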

    Its similar, but your parametrization can overflow
    and there'd be a correlation between your c and d optimal values.

    > I've already written a working optimizer for simple models
    > (direct lookup - order 0-2)

    What kind of optimizer, I wonder?

    > wrapper which dumps coding steps to a data file to allow an
    > easy integration into compression programs (so you can tune
    > your bit models on a large data set with less effort).

    Which "coding steps"? Contexts and context histories?
    I do that manually... as most of models are still written
    manually anyway - not really constructed from common blocks
    or anything.

    > first results show that the expected improvement will be
    > somewhere around 0.1%..1% - not very stunning, but hey -
    > the improvement doesn't require extra effort.

    Right, counter parameter optimization isn't very interesting
    (bruteforce optimization at least, as there're some applications
    for smarter techniques).
    But what's really pretty much impossible without special
    optimization tools is context quantization.
    And there're lots of useful non-probability statistics, like
    eg. match offset/length which need to be quantized to be used
    as a context.

    > I might include more advanced bit models, e.g. using
    > different parameter sets for 1-c and d for the startup phase
    > (small k), similar to Matt's approach in PAQ, to allow rapid
    > initial learning - but the learning curve will be optimized.

    Normally that's impractical.
    I mean, like that you'd get an improvement with simple counters, right.
    But with secondary models results would be worse...
    Although sometimes I do use a simpler model for initial coding...
    I call it a "bootstrap model" - usually its a small counter table
    and its estimation is quantized into a part of main model's context,
    but also is directly used for coding in initial phase,
    before switching to the main model.

    > Or using different parameters whether the current bit is a
    > more probable symbol or a less probable symbol.
    > Do you have any other ideas here?

    More than one context history bit would be useful like that,
    but updating the counter with bits and using the same bits in
    other contexts causes context correlation which is no good
    at least for memory usage.
    So its better to have a delayed counter with one static mapping
    (in context of some bit history) for probability estimation and
    another mapping for update.
    These mappings are similar to SSE, but they'd better be static,
    or you won't be able to apply another (dynamic) SSE stage there.
    Also quantization could be applied to bit histories.
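
    A rough sketch of such a delayed counter (the 3-bit delay, table layout and linear update here are assumptions for illustration, not actual code):

    Code:
    // k-bit delayed counter: the last K history bits select entries in two static
    // mappings - one for the estimate used in coding, one for the update.
    struct DelayedCounter {
      enum { K = 3 };                       // delay in bits (assumption)
      unsigned hist = 0;                    // last K bits seen in this context
      double p = 0.5;                       // internal state
      double est_w[1<<K], est_p[1<<K];      // static estimation mapping (SSE-like),
      double upd_rate[2][1<<K];             // static update mapping - all tuned offline

      double estimate() const {             // probability passed to the coder
        return p*(1.0 - est_w[hist]) + est_p[hist]*est_w[hist];
      }
      void update(int y) {
        p += upd_rate[y][hist]*(y - p);             // rate chosen by the bit history
        hist = ((hist<<1) | y) & ((1u<<K) - 1);     // shift the new bit into the history
      }
    };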

    > http://freenet-homepage.de/toffer_86/fpaq0cm.cpp

    You could at least separately optimize parameter values for
    different contexts - that would be somewhat interesting at least.

  3. #3
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Yep, i know about your counters.

    Calculations are done like this:

    p_k+1 = p_k + ((y_k-p_k)*(1-c) + d)/S

    where 1-c ranges from 0..S-1 and d can range from [-S/2, S/2). y_k-p_k can at most be an N-bit signed integer and 1-c an M-bit unsigned integer (S=2^M). I've chosen N+M < (native integer mantissa bits), N+M < 31, so nothing can overflow.

    I'm using this type of calculation since its simple and fast.
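
    In code that's just one multiply, one add and one division (a minimal sketch; N=12 and M=15 are assumptions for illustration, any N+M < 31 works):

    Code:
    // p_k+1 = p_k + ((y_k-p_k)*(1-c) + d)/S,  S = 2^M
    enum { N = 12, M = 15, S = 1<<M };
    struct BitModel {
      int p = 1 << (N-1);                    // N-bit probability, starts at 0.5
      void update(int y, int omc, int d) {   // omc = 1-c in [0,S), d in [-S/2,S/2)
        p += (((y<<N) - p)*omc + d)/S;       // '/' truncates toward zero; a '>>'
      }                                      // would round differently for negatives
    };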

    There are correlations between c and d, but this doesn't influence the optimization, since the derivatives by c and d or d and c are both zero.

    > What kind of optimizer, I wonder?

    I'm numerically minimizing the coding cost of the test data set; currently I use CG (calculating the Hessian matrix is too slow) and might switch to a second-order optimization algorithm and estimate the Hessian matrix using BFGS.
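
    To illustrate just the general idea (a plain finite-difference descent on a toy coding cost - not the actual CG/BFGS optimizer; the clamp and step sizes are arbitrary):

    Code:
    #include <cmath>
    #include <vector>

    // Toy coding cost: replay a dumped bit sequence through the counter
    // p += (y-p)*(1-c) + d and sum -log2 of the probability given to each bit.
    double codingCost(const std::vector<int>& bits, double omc, double d) {
      double p = 0.5, H = 0;
      for (int y : bits) {
        H += -std::log2(y ? p : 1.0 - p);
        p += (y - p)*omc + d;
        if (p < 1e-6) p = 1e-6;              // keep p away from 0/1
        if (p > 1.0 - 1e-6) p = 1.0 - 1e-6;
      }
      return H;
    }

    // Finite-difference gradient descent over (1-c, d).
    void tune(const std::vector<int>& bits, double& omc, double& d,
              int iters = 100, double step = 1e-4, double eps = 1e-5) {
      for (int i = 0; i < iters; i++) {
        double h0 = codingCost(bits, omc, d);
        double gc = (codingCost(bits, omc + eps, d) - h0)/eps;
        double gd = (codingCost(bits, omc, d + eps) - h0)/eps;
        omc -= step*gc;
        d   -= step*gd;
      }
    }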

    > Which "coding steps"? Contexts and context histories?
    > I do that manually... as most of models are still written
    > manually anyway - not really constructed from common blocks
    > or anything.

    All my work up to now uses one thing: "class bitmodel", which encapsulates the above counter. So calling this class can dump the needed information for the optimization to a file.

    > Normally that's impractical.

    In my experience it isn't. Recall some of our previous discussions about using 1-c_k instead of simply 1-c to simulate a counter like p1 = (c*n1_k + y_k) / (c*n_k + 1). I've seen a positive effect from doing this (even when using second-order models) - just test lpaq with the variable bit model adaptation speed disabled.

    I agree that this type of learning changes the output characteristic of a bit model, which makes the distribution a secondary mapping has to learn more unstable.

    The difference isn't that high, but as i said it doesn't require any extra effort.

    > I mean, like that you'd get an improvement with simple counters, right.
    > But with secondary models results would be worse...
    > Although sometimes I do use a simpler model for initial coding...
    > I call it a "bootstrap model" - usually its a small counter table
    > and its estimation is quantized into a part of main model's context,
    > but also is directly used for coding in initial phase,
    > before switching to the main model.

    I thought about something like this since it is something PAQ's state machine does.

    > Also quantization could be applied to bit histories.

    This is very interesting for me. I've made up a simple mathematical description (basically density and transition functions) which can be used to derive a state machine. The needed functions can be derived from my bit model parameters, which are divided into special classes.

    > http://freenet-homepage.de/toffer_86/fpaq0cm.cpp

    It's just a proof-of-concept, nothing special.

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    > Yep, i know about your counters.

    I just wanted to point out that your parameters are functions of mine.
    So in the case where my representation fits the source model better
    than yours, your parameters are harder to optimize because they're
    entangled.
    However I won't bother to prove that its really the case.

    > Calculations are done like this:
    > p_k+1 = p_k + ((y_k-p_k)*(1-c) + d)/S
    > where 1-c ranges from 0..S-1 and d can range from [-S/2,S/2).

    You skipped that before, so I thought that d is [0..S-1].
    But like that it won't overflow, you're right.

    > I'm using this type of calculation since its simple and fast.

    Its not really simple, as it won't work completely the same
    if you replace the division with a shift.

    > There are correlations between c and d, but this doesn't
    > influence the optimization, since the derivatives by c and d or d
    > and c are both zero.

    Well, I think it does.
    Even if you're able to find a global minimum (which I doubt)
    of a code length _function_, its still not the same as a real rangecoder's
    code length.

    > > What kind of optimizer, I wonder?
    >
    > I'm numerically minimizing coding cost of the test data set,
    > currently i use CG (calculating the hessian matrix is too
    > slow) and might switch to a second order optimization
    > algorithm and estimate the hessian matrix using bfgs.

    Somehow that doesn't feel right.
    Are you doing gradient descent parameter tuning for whole enwik8 or what?
    At least you could split it at some points, like long sequences of the
    same bit value or something - well, points where probability value is
    least dependent on the parameters.
    Also did you consider the reduced polynomial approximation which I was
    talking about once?
    Wonder if you're trying to be precise here... its no use because
    you cannot precisely compute the behaviour of discrete system like rc
    with mathematics.

    > > Which "coding steps"? Contexts and context histories?
    > > I do that manually... as most of models are still written
    > > manually anyway - not really constructed from common blocks
    > > or anything.
    >
    > All my work upon now uses one thing: "class bitmodel", which
    > encapsulates the above counter. So calling this class can
    > dump the needed information for the optimization to a file.

    Is it a single counter class, or counter table class?
    Because sometimes its useful to dump the whole context
    values along with the data as it allows to tune the quantization.
    However simple flattened bit histories are tricky too -
    you have to add terminators and implement the model flush
    (and it obviously doesn't make much sense tuning to a single context sequence)

    > > Normally that's impractical.
    >
    > In my experience it isn't.
    [...]
    > I agree that this type of learning changes the output
    > characteristic of a bit model, which makes the distribution
    > a secondary mapping has to learn more unstable.

    Yeah, so here I meant that not using any secondary model is impractical.

    > > Also quantization could be applied to bit histories.
    >
    > This is very interesting for me. I've made up a simple
    > mathematical describtion (basically densitiy and transistion
    > functions) which can be used to derive a state machine. The
    > needed functions can be derived from my bit model parameters
    > which are divided into special classes.

    Quantization is a simple case of context clustering
    (where any two contexts can correspond to the same table index, or not).
    In my implementation, there're bitmaps corresponding to contexts
    which are optimized in a somewhat "genetic" fashion and then used
    to generate a quantization code like ctx=ctx*4+(c>4)+(c>+(c+11).
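
    That is, the generated code is just a few threshold comparisons packed into the context, along these lines (the thresholds here are made up for illustration, not the actual optimized ones):

    Code:
    // quantize a raw statistic c (e.g. a match length) into 2 extra context bits;
    // the threshold values would come out of the bitmap optimization
    static inline unsigned quantize(unsigned ctx, unsigned c) {
      return ctx*4 + (c > 4) + (c > 8) + (c > 11);    // adds a bucket index 0..3
    }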

    > > http://freenet-homepage.de/toffer_86/fpaq0cm.cpp
    >
    > It's just a proof-of-concept, nothing special.

    Can you post the enwik8 result for this version and also
    for some neighbourhood like 32088+-10,20,30 and 7000+-10,20,30?

  5. #5
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    > I just wanted to point out that your parameters are functions of mine.
    > So in the case where my representation fits the source model better
    > than yours, your parameters are harder to optimize because they're
    > entangled.

    Mathematically it is the same, but the real-life numeric situation might be
    different. At the moment I don't consider changing the bit model.

    > Its not really simple, as it won't work completely the same
    > if you replace the division with a shift.

    The only difference is that if the compiler generates a SAR/SAL instruction it
    will carry over the sign bit (-1>>1 = -1). However writing this as a division
    (-1/2) solves the problem (gcc doesn't generate a division and you don't have the
    sign problem).
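
    A two-line check of the difference (integer division truncates toward zero, while an arithmetic right shift of a negative value rounds toward minus infinity on common compilers):

    Code:
    #include <cstdio>
    int main() {
      int x = -1;
      std::printf("%d %d\n", x/2, x>>1);   // prints "0 -1"
      return 0;
    }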

    > Well, I think it does.
    > Even if you're able to find a global minimum (which I doubt)
    > of a code length _function_, its still not the same as a real rangecoder's
    > code length.

    This is true. Minimizing the coding cost function represents a theoretical
    limit; the range coder is bounded by it. But the "direction" of parameter
    influence is *about* the same. Actually I use a range coder to estimate the
    coding cost, since the difference to the true entropy function is small (and,
    oddly, it is faster!).

    After falling into a local optimum I'll start a brute force search around it,
    which can compensate for the difference between the numerically found minimum and
    the "true" local minimum. This isn't perfect - I know.

    > Somehow that doesn't feel right.

    I can't have any other clue than the coding cost function - which can be used
    to optimize with limited time.

    > Are you doing gradient descent parameter tuning for whole enwik8 or what?

    It was just a test case, as i said - "proof of concept". My plans are to add
    tuned parameters to my compressors, which will detect the data type by some
    (maybe correlative) measure and use optimal parameters. I'll find "optimal"
    parameters by optimizing for large data sets per type.

    As I said before, another useful thing is that I can add state machines which
    are built from empirically collected statistics instead of using "ad-hoc" ones
    like in PAQ.

    > At least you could split it at some points, like long sequences of the
    > same bit value or something - well, points where probability value is
    > least dependent on the parameters.

    Since I probe for the minimum over the whole (large) data set, the influence of
    such "abnormal" regions should be tiny (averaged out).

    > Also did you consider the reduced polynomial approximation which I was
    > talking about once?

    Nope. Could you give me some more information about this?

    > Wonder if you're trying to be precise here... its no use because
    > you cannot precisely compute the behaviour of discrete system like rc
    > with mathematics.

    True.

    > Is it a single counter class, or counter table class?
    > Because sometimes its useful to dump the whole context
    > values along with the data as it allows to tune the quantization.

    It is a single class. Normally i call

    bitmodel bm;
    ...
    bm.Update(y);

    Which can easily be replaced by:

    bm.UpdateAndDump(y, x);

    In my case x is an integer which represents a "class of parameters". In some
    situations x needs to be passed as a parameter, in others it can be calculated
    by the bit model. It depends on the type of classes you want to consider. Of course
    I could add more advanced things, but this study thesis is nothing special. I'm
    just focusing on completing it, since I'm happy to have my advisor enthusiastic
    about this topic (since I could combine a hobby and work). Once this is completed
    I will probably do more advanced things with it - but I'll always listen to
    suggestions and remarks for the (future) development of this stuff.
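
    Roughly like this, if it helps to picture it (a sketch only - the dump format and the constants are assumptions):

    Code:
    #include <cstdio>
    // sketch of "class bitmodel": the counter from above plus an optional dump of
    // each coding step (parameter class x, bit y, probability before the update)
    struct bitmodel {
      enum { N = 12, M = 15 };
      int p = 1 << (N-1);                  // N-bit probability, starts at 0.5
      int omc, d;                          // 1-c and d, in units of 1/2^M
      bitmodel(int omc_, int d_) : omc(omc_), d(d_) {}

      void Update(int y) {
        p += (((y<<N) - p)*omc + d)/(1<<M);
      }
      void UpdateAndDump(int y, int x, std::FILE* f) {
        std::fprintf(f, "%d %d %d\n", x, y, p);   // enough to replay the step offline
        Update(y);
      }
    };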

    > > Normally that's impractical.
    >
    > In my experience it isn't.
    [...]
    > I agree that this type of learning changes the output
    > characteristic of a bit model, which makes the distribution
    > a secondary mapping has to learn more unstable.

    Yeah, so here I meant that not using any secondary model is impractical.

    > Quantization is a simple case of context clustering
    > (where any two contexts can correspond to the same table index, or not).
    > In my implementation, there're bitmaps corresponding to contexts
    > which are optimized in a somewhat "genetic" fashion and then used
    > to generate a quantization code like ctx=ctx*4+(c>4)+(c>+(c+11).

    We discussed something similar before, didn't we? But i can't remember the
    thread. Or am i wrong here?

    > Can you post the enwik8 result for this version and also
    > for some neighbourhood like 32088+-10,20,30 and 7000+-10,20,30?

    You want to see if it's really a minimum? I can do that, but at the moment I'm
    working on improving my optimizer. I still need to add simple things like
    automatic stopping or a line search algorithm.

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    > > Its not really simple, as it won't work completely the same
    > > if you replace the division with a shift.
    >
    > The only difference is that if the compiler generates a SAR/SAL instruction it


    Right:

    > ;;; zz/=32768;
    > mov eax, DWORD PTR [esp+28]
    > mov edx, eax
    > sar edx, 14
    > shr edx, 17
    > add edx, eax
    > sar edx, 15

    I meant that its totally slow comparing to a single SHR.
    Probably slower than multiplication.

    > > Somehow that doesn't feel right.
    >
    > I can't have any other clue than the coding cost function - which can be used
    > to optimize with limited time.

    I just hoped that you made something smarter than a bruteforce search.
    Btw there's nothing wrong with that, I still do it like that myself,
    but its too heavy a method so can't be applied to really interesting
    tasks like automated context clustering.

    > As i said before another useful thing is that i can add state machines which
    > are built from empirically collected statistics instead of using "ad-hoc" ones
    > like in PAQ.

    Well, I've been advertising this approach since my first post on that forum.

    > > At least you could split it at some points, like long sequences of the
    > > same bit value or something - well, points where probability value is
    > > least dependent on the parameters.
    >
    > Since i probe for the minimum over the whole (large) data set the influence of
    > such "unnormal" regions should be tiny (averaged out).

    Guess you misunderstood... I meant that its possible to split it to increase
    optimization speed.

    > > Also did you consider the reduced polynomial approximation which I was
    > > talking about once?
    >
    > Nope. Could you give me some more information about this?

    http://encode.su/forums/index.php?ac...ic=648#msg8913

    > > Quantization is a simple case of context clustering
    > > (where any two contexts can correspond to the same table index, or not).
    > > In my implementation, there're bitmaps corresponding to contexts
    > > which are optimized in a somewhat "genetic" fashion and then used
    > > to generate a quantization code like ctx=ctx*4+(c>4)+(c>+(c+11).
    >
    > We discussed something similar before, didn't we? But i can't remember the
    > thread. Or am i wrong here?

    http://encode.su/forums/index.php?ac...page=0#msg8093

  7. #7
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    I meant that its totally slow comparing to a single SHR.
    Probably slower than multiplication.
    I've tested against an order 0 model. There is no difference in processing time.
    A solution could be: (y-p)*(1-c)+d & (1<<LD_SCALE)-1 >> LD_SCALE
    Losing the lowest bit shouldn't make any difference (it is almost random).

    Quote Originally Posted by Shelwien View Post
    > Guess you misunderstood... I meant that its possible to split it to increase
    > optimization speed.
    You mean to optimize just "typical" pieces of data and discard the rest?

    You mean one should approximate log2 using a Taylor polynomial? This would avoid
    divisions.

    BTW:

    Code:
    toffer@debian:~$ ./opt enwik8 32069 -8069
       0: [c d] = [0.97867 -0.24625] ~ 1/32768 [32069 -8069]  grad H = [-0.00546  0.00005]  H = 61185836  (47.11 s)
       1: [c d] = [0.97870 -0.24625] ~ 1/32768 [32070 -8069]  grad H = [-0.00514  0.00003]  H = 61185856  (47.10 s)
       2: [c d] = [0.97873 -0.24625] ~ 1/32768 [32071 -8069]  grad H = [-0.00479 -0.00005]  H = 61185832  (47.09 s)
       3: [c d] = [0.97876 -0.24625] ~ 1/32768 [32072 -8069]  grad H = [-0.00444 -0.00008]  H = 61185821  (47.09 s)
       4: [c d] = [0.97879 -0.24625] ~ 1/32768 [32073 -8068]  grad H = [-0.00407 -0.00000]  H = 61185839  (47.11 s)
       5: [c d] = [0.97882 -0.24625] ~ 1/32768 [32074 -8068]  grad H = [-0.00372 -0.00000]  H = 61185826  (47.09 s)
       6: [c d] = [0.97885 -0.24625] ~ 1/32768 [32075 -8068]  grad H = [-0.00332 -0.00000]  H = 61185780  (47.10 s)
       7: [c d] = [0.97888 -0.24625] ~ 1/32768 [32076 -8068]  grad H = [-0.00296 -0.00000]  H = 61185781  (47.09 s)
       8: [c d] = [0.97891 -0.24625] ~ 1/32768 [32077 -8068]  grad H = [-0.00260 -0.00000]  H = 61185750  (47.10 s)
       9: [c d] = [0.97894 -0.24625] ~ 1/32768 [32078 -8068]  grad H = [-0.00225 -0.00000]  H = 61185729  (47.10 s)
      10: [c d] = [0.97897 -0.24625] ~ 1/32768 [32079 -8068]  grad H = [-0.00190 -0.00002]  H = 61185721  (47.10 s)
      11: [c d] = [0.97900 -0.24625] ~ 1/32768 [32080 -8068]  grad H = [-0.00152 -0.00002]  H = 61185692  (47.09 s)
      12: [c d] = [0.97903 -0.24625] ~ 1/32768 [32081 -8068]  grad H = [-0.00102 -0.00000]  H = 61185799  (47.10 s)
      13: [c d] = [0.97906 -0.24625] ~ 1/32768 [32082 -8068]  grad H = [-0.00066 -0.00000]  H = 61185767  (47.10 s)
      14: [c d] = [0.97910 -0.24624] ~ 1/32768 [32083 -8068]  grad H = [-0.00033 -0.00001]  H = 61185797  (47.10 s)
      15: [c d] = [0.97913 -0.24624] ~ 1/32768 [32084 -8068]  grad H = [ 0.00005 -0.00001]  H = 61185745  (47.08 s)
      15: [c d]* = 1/32768 [32080 -8068]   H* = 61185692

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    > > I meant that its totally slow comparing to a single SHR.
    > > Probably slower than multiplication.
    >
    > I've tested against an order 0 model. There is no difference in processing time.
    > A solution could be: (y-p)*(1-c)+d & (1<<LD_SCALE)-1 >> LD_SCALE

    Not sure where's "no difference".
    That code for division via SAR is surely slower than a single SHR.

    Also your d parameter has almost no meaning.
    So I think that my formula would have better effect even not touching
    the third parameter.

    Edit: Here's a fpaq0p version with my formula, btw
    (and parameters tuned with my system, not to enwik8 though)
    http://shelwien.googlepages.com/fpaq0p-sb_sh.cpp

    > > Guess you misunderstood... I meant that its possible to split it to increase
    > > optimization speed.
    >
    > You mean to optimize just "typical" pieces of data and discard the rest?

    I mean that at some points the probability value depends on update parameter
    less than on average. Like after a long sequence of 1 the probability of 1
    would be close to 1.0 with almost any parameters.
    So its possible to cut the file at these points and improve optimization speed.

    Edit: Guess its even possible to _force_ the counter to specific
    probability values at some contexts, to make the recurrent
    regions shorter.
    As to how it allows to improve optimization speed - at least
    by factoring the similar sequences.
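
    For example, a cut-point scan could be as simple as this (the run length R is an arbitrary assumption - the only point is to cut where the counter is saturated for practically any parameter values):

    Code:
    #include <vector>
    // positions right after a run of R equal bits, where the counter state
    // hardly depends on the parameters being optimized
    std::vector<size_t> cutPoints(const std::vector<int>& bits, size_t R = 32) {
      std::vector<size_t> cuts;
      size_t run = 1;
      for (size_t i = 1; i < bits.size(); i++) {
        run = (bits[i] == bits[i-1]) ? run + 1 : 1;
        if (run == R) cuts.push_back(i + 1);   // cut right after the long run
      }
      return cuts;
    }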

    > You mean one should approximate log2 using a Taylor polynomial?
    > This would avoid divisions.

    No, that's not necessary.
    I mean that you can formally calculate a string probability and it
    would be a pure polynomial.
    And probability is nowhere worse than its log2 for optimization.

    > 15: [c d]* = 1/32768 [32080 -8068] H* = 61185692

    That's not [32088 -7000]
    Last edited by Shelwien; 7th May 2008 at 21:43.

  9. #9
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    Not sure where's "no difference".
    That code for division via SAR is surely slower than a single SHR.
    I still don't measure any speed difference, the code is longer though.

    Quote Originally Posted by Shelwien View Post
    Also your d parameter has almost no meaning.
    So I think that my formula would have better effect even not touching
    the third parameter.
    Well, the way I've rewritten your counter I would interpret d as a rounding
    constant. Your parameters are more meaningful than mine - but both representations
    are equal (mathematically).

    Quote Originally Posted by Shelwien View Post
    Edit: Here's a fpaq0p version with my formula, btw
    (and parameters tuned with my system, not to enwik8 though)
    http://shelwien.googlepages.com/fpaq0p-sb_sh.cpp
    Comparing against my simple bitmodel is unfair here, since this model has
    4096*2 "classes" (different pairs of [c d]) compared to just a single class
    with my model. Btw I plan to separate the parameter classes exactly that way.
    But altogether I've got far better (in terms of how much you can squeeze out
    of this) results.

    61499101 bytes vs 61185692 bytes

    Quote Originally Posted by Shelwien View Post
    Edit: Guess its even possible to _force_ the counter to specific
    probability values at some contexts, to make the recurrent
    regions shorter.
    Yep, this is what I want to do, introducing several classes that distinguish by
    probability and by more/less probable symbols - but first I have to get the
    simpler things to work (to complete my work).

    Quote Originally Posted by Shelwien View Post
    No, that's not necessary.
    I mean that you can formally calculate a string probability and it
    would be a pure polynomial.
    Approximate the whole training-string as a polynomial?


    Quote Originally Posted by Shelwien View Post
    And probability is nowhere worse than its log2 for optimization.
    That's true (monotone transform), but in my case (iterative minimization) log2
    changes the problem's optimization metric.

    Quote Originally Posted by Shelwien View Post
    That's not [32088 -7000]
    Yep, since I frequently change my optimizer's source it behaves slightly
    differently. Besides some internal changes I used a different initial estimate.

  10. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    > > Not sure where's "no difference".
    > > That code for division via SAR is surely slower than a single SHR.
    >
    > I still don't measure any speed difference, the code is longer though.

    Guess you have something much slower there
    But in fpaq0pv4B this could easily cause ~10% slowdown.

    > Well, the way i've rewritten your counter i would interpret d as a rounding
    > constant. Your parameters are more meaningful than mine - but both representations
    > are equal (mathematically).

    Your rounding constant is fourth parameter actually, if I'd add it to my formula.

    > > Here's a fpaq0p version with my formula, btw
    > > (and parameters tuned with my system, not to enwik8 though)
    > > http://shelwien.googlepages.com/fpaq0p-sb_sh.cpp
    >
    > Comparing against my simple bitmodel is unfair here, since this model has
    > 4096*2 "classes" (different pairs of [c d]) compared to just a single class
    > with my model.

    Only potentially. In fact, there're different parameter values, but not that many,
    specifically 6.

    > Btw I plan to separate the parameter classes exactly that way.
    > But altogether I've got far better (in terms of how much you can squeeze out
    > of this) results.
    >
    > 61499101 bytes vs 61185692 bytes

    I was talking about my p'=(p*(1-wr)+y*wr)*(1-mw)+pm*mw
    update formula; that's not exactly the same thing.

    > > No, that's not necessary.
    > > I mean that you can formally calculate a string probability and it
    > > would be a pure polynomial.
    >
    > Approximate the whole training-string as a polynomial?

    The probability of encountering the given string, yes.
    But the point is that we can shift the parameters into
    [-0.5,0.5] range and then only a handful of orders
    would have any meaning.

    For example, leaving a single parameter:
    p'=p*(1-wr)+y*wr = p*(0.5-w)+y*(0.5+w)
    Then, if we start with p0 and process a
    string "00110100", we'd have

    p = ( (44+p0) -
    16*(9+p0)*w +
    16*(1+7*p0)*w^2 -
    64*(-6+7*p0)*w^3 +
    32*(-22+35*p0)*w^4 -
    256*(-3+7*p0)*w^5 +
    256*(-1+7*p0)*w^6 - 1024*p0*w^7 + 256*p0*w^8 )/256

    And even here eg. w^8 is already relatively insignificant.
    (Optimization target is product of such p's, though).

  11. #11
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien
    Guess you have something much slower there
    But in fpaq0pv4B this could easily cause ~10% slowdown.
    Could be true, there is one virtual function call per nibble. In addition,
    estimating the gradient makes the overall process about 4 times slower. I tested
    without a gradient estimation, speed stays the same. The speed is comparable to
    fpaq0p.

    Quote Originally Posted by Shelwien
    Your rounding constant is fourth parameter actually, if I'd add it to my formula.
    Unfortunately you are right. My first look wasn't careful enough. If I rewrite
    your model I get:

    p' = (p*(1-wr)+y*wr)*(1-mw)+pm*mw
    = (p + (y-p)*wr)*(1-mw) + pm*mw
    = p + (y-p)*wr*(1-mw) + (pm-p)*mw
    = p + (y-p)*w0 + (pm-p)*w1

    p' = p + (y-p)*c + d

    This way c=w0 weights the new data and d is the weighted average difference
    between a mixed distribution and the estimated probability. However this is a
    bit "around the corner".

    > > Here's a fpaq0p version with my formula, btw
    > > (and parameters tuned with my system, not to enwik8 though)
    > > http://shelwien.googlepages.com/fpaq0p-sb_sh.cpp
    >
    > Comparing against my simple bitmodel is unfair here, since this model has
    > 4096*2 "classes" (different pairs of [c d]) compared to just a single class
    > with my model.

    Quote Originally Posted by Shelwien
    Only potentially. In fact, there're different parameter values, but not that many,
    specifically 6.
    Code:
    int j = 19 + (i < 2048) + (i < 1024) + (i < 512) + (i < 256) + (i < 128);
    Yep, my fault. I only looked at the probability update. But I think there's
    much potential in introducing several classes dependent on p and y.

    Quote Originally Posted by Shelwien
    For example, leaving a single parameter:
    p'=p*(1-wr)+y*wr = p*(0.5-w)+y*(0.5+w)
    Then, if we start with p0 and process a
    string "00110100", we'd have

    p = ( (44+p0) -
    16*(9+p0)*w +
    16*(1+7*p0)*w^2 -
    64*(-6+7*p0)*w^3 +
    32*(-22+35*p0)*w^4 -
    256*(-3+7*p0)*w^5 +
    256*(-1+7*p0)*w^6 - 1024*p0*w^7 + 256*p0*w^8 )/256

    And even here eg. w^8 is already relatively insignificant.
    (Optimization target is product of such p's, though).
    To be a bit more precise you need more powers of w. My experience says that
    0.5+w will be close to 0, so w will be close to -.5. Taking 2^-8 still
    contributes in the domain of 1e-3. But this approach seems to be interesting too,
    thanks for the explanation!
    So the optimization target would be to maximize the probability of the training
    string dependent on the rescaled parameters?
    How do you compute the coefficients a_i for w^i? I couldn't really identify the
    patterns here - especially not at 3 in the morning

    I really like the idea of shifting the parameters into this range to improve the
    convergence speed

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    > > Guess you have something much slower there
    > > But in fpaq0pv4B this could easily cause ~10% slowdown.
    >
    > Could be true, there is one virtual function call per nibble.

    Well, I checked and it seems that virtual functions are called
    statically (or even inlined) if the compiler is able to resolve them,
    and via a single pointer otherwise - in which case it might be equivalent
    to static calls due to branch prediction.
    So I guess that virtual functions are really problematic only
    with multiple inheritance etc.

    > The speed is comparable to fpaq0p.

    fpaq0p binary or the one you compiled?
    Somehow I'm not sure if you have a good compiler.

    > > Your rounding constant is fourth parameter actually, if I'd add it to my formula.
    >
    > Unfortunately you are right. My first look wasn't careful enough.

    Its quite good to have more parameters actually.
    However using a delayed counter I can get as many parameters as I want.

    > Yep, my fault. I only looked at the probability update. But I think there's
    > much potential in introducing several classes dependent on p and y.

    In the end, counter optimization turns into context clustering problem anyway.
    I mean that "ideal counter" can precisely predict the next
    bit by context, without any coding.

    > > For example, leaving a single parameter:
    > > p'=p*(1-wr)+y*wr = p*(0.5-w)+y*(0.5+w)
    > > Then, if we start with p0 and process a
    > > string "00110100", we'd have
    > >
    > > p = ( (44+p0) -
    > > 16*(9+p0)*w +
    > > 16*(1+7*p0)*w^2 -
    > > 64*(-6+7*p0)*w^3 +
    > > 32*(-22+35*p0)*w^4 -
    > > 256*(-3+7*p0)*w^5 +
    > > 256*(-1+7*p0)*w^6 - 1024*p0*w^7 + 256*p0*w^8 )/256
    >
    > To be a bit more precise you need more powers of w.

    Its just an example, I never said I'd stop at that.
    Though order 8 might be barely enough for a current
    probability function representation...

    > My experience says that 0.5+w will be close to 0, so w will be close to -.5.

    Its always possible to adjust it to wr range like [0..0.1] instead of [0..1]

    > Taking 2^-8 still contributes in the domain of 1e-3.

    Note that its a bit probability function, not string probability.
    I'd expect order 16 to be enough for a bit probability, but
    with a real optimization target it'd be more tricky.

    > So the optimization target would be to maximize the probability of the training
    > string dependent on the rescaled parameters?

    You can take a log2 of a polynomial approximation after you calculate it,
    and then a derivative of that.

    > How do you compute the coefficients a_i for w^i. I couldn't really identify the
    > patterns here - especially not at 3 in the morning

    There're no common patterns... it somehow correlates with the processed data, probably.
    I just wrote a couple of update functions and calculated
    b0[b0[b1[b0[b1[b1[b0[b0[p0]]]]]]]] in Mathematica, and then sorted it by powers of w.

    But of course, for practical use you'd need a polynomial class with custom
    number class for coefficients (not necessarily infinite precision, but
    with longer exponent part). Or coefficients could be polynomials too, but
    I doubt that there'd be enough precision even for simultaneous optimization
    of two parameters.

    But still, even if 1000s of coefficients would appear necessary, it would
    be incomparably faster than processing all the data.

    > I really like the idea of shifting the parameters into this range
    > to improve the convergence speed

    Actually I think that [-1;1] is better for target parameter range as it'd
    allow to determine what can be cut off directly by polynomial coefficients.

  13. #13
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien
    Well, I checked and it seems that virtual functions are called
    statically (or even inlined) if the compiler is able to resolve them,
    and via a single pointer otherwise - in which case it might be equivalent
    to static calls due to branch prediction.
    So I guess that virtual functions are really problematic only
    with multiple inheritance etc.

    fpaq0p binary or the one you compiled?
    Somehow I'm not sure if you have a good compiler.
    I've compiled everything using gcc 4.1.2 with the switches -O3 -lstdc++. Since
    in this stage the program is a single file gcc might optimize the function calls
    out. I haven't looked at the disassembly, though.

    Quote Originally Posted by Shelwien
    Its quite good to have more parameters actually.
    However using a delayed counter I can get as many parameters as I want.
    Do you update your counter based on all delayed bits?

    p_k+1 = t(p_k, y_k, y_k-1, ..., y_k-n+1)

    I think converting the bits like x_k = y_k XOR (p_k>.5) could improve
    compression, since you distingush dependent on more/less probable bits.
    Doing things like this intorduces the possibility of distingushing a change
    in statistics (the bit history), like ...00000111111....

    EDIT: This is something Matt's newest fpaq0xx variant does: You can measure a _rapid_ increase in compression by increasing the "bit history length" over 2, which makes you able to distinguish a change in statistics like
    ...y (not y) ... However there are no further (parameter) optimizations

    Quote Originally Posted by Shelwien
    In the end, counter optimization turns into context clustering problem anyway.
    I mean that "ideal counter" can precisely predict the next
    bit by context, without any coding.
    Unfortunately ideality isn't practical
    But the above thing sounds very much like context clustering...


    No patterns? I meant writing it like that: a_i = (b_i0 * p0 + b_i1 * y0 + ...)

    Maximizing the string probability should be the same as minimizing its coding
    cost, the metric would change though. Using the coding cost would be better
    numerically, I think.

    Quote Originally Posted by Shelwien
    But still, even if 1000s of coefficients would appear necessary, it would
    be incomparably faster than processing all the data.
    This is very nice - since constructing the necessary coefficients once from the
    training data and iterating over the polynomial is far faster than calculating
    the coding cost function every step. The speedup would be huge - but how
    precise would it be? I mean the slower way I'm optimizing atm has the advantage
    of calculating the coding cost *exactly* using the system I'm targeting my
    optimizations with (range coder, maybe a hash mapping of contexts to bit
    models, etc...)
    Last edited by toffer; 8th May 2008 at 18:28.

  14. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    > I've compiled everything using gcc 4.1.2 with the switches -O3 -lstdc++. Since
    > in this stage the program is a single file gcc might optimize the function calls
    > out. I haven't looked at the disassembly, though.

    Well, generally imho IntelC is better, but for basic fpaq0p clones gcc/mingw is quite ok.
    Consider using -fprofile-generate though... it might make things significantly faster.

    > > Its quite good to have more parameters actually.
    > > However using a delayed counter I can get as much parameters as I want.
    >
    > Do you update your counter based on all delayed bits?
    > p_k+1 = t(p_k, y_k, y_k-1, ..., y_k-n+1)

    Well, you can skip some bits but how'd that help?
    Another point is that estimation mapping is different
    from update mapping... also p0's are selected by context too.
    So for k bits "delay" there's at least 2*2^k+2*2^(k+1)+2^(k+1)-1
    parameters.

    > I think converting the bits like x_k = y_k XOR (p_k>.5) could improve
    > compression, since you distinguish dependent on more/less probable bits.

    Its a static SSE, so p can be used in context of course.
    And I agree that correlations in the optimized mapping coefficients
    could be used to improve further optimization speed.
    But I can't say that your suggestion would be helpful in general.

    > Doing things like this introduces the possibility of distinguishing a change
    > in statistics (the bit history), like ...00000111111....

    You should check first whether such substrings make a significant amount of
    output code.

    > This is something Matt's newest fpaq0xx variant does:

    You mean fpaq0f2?

    > You can measure a _rapid_ increase in compression by increasing the "bit history length" over 2,

    Its just a basic SSE coder, nothing special, except that Matt made it.
    And its not even a good example because normally SSE is used to improve
    an already good primary model, which is not the case with order0, and
    secondary models for barely redundant data (like usually) and quite redundant
    (like in fpaq0f) are not very similar.

    > > In the end, counter optimization turns into context clustering problem anyway.
    > > I mean that "ideal counter" can precisely predict the next
    > > bit by context, without any coding.
    >
    > Unfortunately ideality isn't practical

    This case is kinda special.
    Static models have their use for asymmetrical statistical coding.

    > But the above thing sounds very much like context clustering...

    Yeah, lossy model compression with a perfect metric

    > No patterns? I meant writing it like that: a_i = (b_i0 * p0 + b_i1 * y0 + ...)

    Wonder what you meant by that.
    The point of this polynomial representation is to be able to shrink it
    by removing the insignificant powers of target parameter, while
    a precise polynomial is easily computable but would have n+1 coefficients
    for n bits of data, so no sense doing it.

    > Maximizing the string probability should be the same as minimizing its coding
    > cost, the metric would change though. Using the coding cost would be better
    > numerically, I think.

    After you calculate the approximate probability function, you can
    easily turn it into approximate code length or whatever.
    Point is that with linear update its possible to have a _precise_
    polynomial representation of the string probability, while log2
    approximations would cause additional errors.

    > > But still, even if 1000s of coefficients would appear necessary, it would
    > > be incomparably faster than processing all the data.
    >
    > This is very nice - since constructing the necessary coefficients once from the
    > training data and iterating over the polynomial is far faster than calculating
    > the coding cost function every step.

    Its not exactly once... As I said, it would probably only allow to calculate
    an optimal value of a single parameter without multiple tests, but another
    pass would be necessary to switch to another parameter.
    So its basically about removing one dimension from optimization process.
    And it would be surely impossible to have a precise enough approximation
    with 3 or more parameters, while 2D polynomial might be possible but probably
    impractical.

    > The speedup would be huge - but how precise would it be?

    As much as you want. Well, except for rc specifics.

    > I mean the slower way i'm optimizing atm has the advantage
    > of calculating the coding cost *exactly* using the system i'm targeting my
    > optimizations with (range coder, maybe a hash mapping of contexts to bit
    > models, etc...)

    Right, but that blocks off many advanced options.
    Like, its hardly possible to optimize the parameters even for a single
    delayed counter with a 3-bit context, and full context clustering is
    just impossible.
    But if this polynomial approximation proves successful, that would allow
    at least affordable optimization of multiple delayed counters.

  15. #15
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien
    Well, you can skip some bits but how'd that help?
    I meant you could be applying some function(s) of the delayed bits.

    Quote Originally Posted by Shelwien
    Another point is that estimation mapping is different
    from update mapping...
    I never took this into account. One could interpret the PAQ implementation like
    this: the automata itself represents a static update mapping, while the APMs
    try to dynamically model the estimation mapping.

    Quote Originally Posted by Shelwien
    also p0's are selected by context too.
    So for k bits "delay" there's at least 2*2^k+2*2^(k+1)+2^(k+1)-1
    parameters.
    How do you get so many parameters? With k bits of delay there are 2^k possible
    discrete states; taking N parameters per state I get N*2^k parameters. Am I
    missing something here?

    Quote Originally Posted by Shelwien
    > Doing things like this introduces the possibility of distinguishing a change
    > in statistics (the bit history), like ...00000111111....

    You should check first whether such substrings make a significant amount of
    output code.

    -> fpaq0f2

    > You can measure a _rapid_ increase in compression by increasing the "bit history length" over 2,

    Its just a basic SSE coder, nothing special, except that Matt made it.
    And its not even a good example because normally SSE is used to improve
    an already good primary model, which is not the case with order0, and
    secondary models for barely redundant data (like usually) and quite redundant
    (like in fpaq0f) are not very similar.
    What i wanted to point out:

    I haven't checked if there are lots of places where such a change of statistics
    happens. But I have a clue - make fpaq0f2's bit history length equal to 1, 2 and 3.
    You'll see a notable increase in compression when coming across 2; increasing it
    further doesn't result in a larger step in compression.
    I would interpret it as a clue that such changes of statistics happen frequently
    (especially in lower order models, since the contextual separation isn't as
    strong as with higher orders) and the jump in compression gain is caused by the
    ability of bit histories longer than two to distinguish such changes (..000111..)

    Quote Originally Posted by Shelwien
    > > In the end, counter optimization turns into context clustering problem anyway.
    > > I mean that "ideal counter" can precisely predict the next
    > > bit by context, without any coding.
    >
    > Unfortunately ideality isn't practical

    This case is kinda special.
    Static models have their use for asymmetrical statistical coding.

    > But the above thing sounds very much like context clustering...

    Yeah, lossy model compression with a perfect metric

    > No patterns? I meant writing it like that: a_i = (b_i0 * p0 + b_i1 * y0 + ...)

    Wonder what you meant by that.
    The point of this polynomial representation is to be able to shrink it
    by removing the insignificant powers of target parameter, while
    a precise polynomial is easily computable but would have n+1 coefficients
    for n bits of data, so no sense doing it.
    I meant, the approx. string probability would be:

    prod i=1..n [ (1-y_i) + (2*y_i-1)*p_i ]

    (i.e. each factor is p_i if y_i = 1 and 1-p_i if y_i = 0.) Here n is the number
    of input bits and p_i is the estimate (for y_i = 1) our model gives in the i-th
    step. If you estimate using a contextual separation, the p_i are functions of the
    bit histories under each coding step's context. Let the bit history under the
    i-th step be bh_i = b_1 ... b_k (k bits in the current bit history), so
    p_i = f(bh_i), where f is the probability estimate function dependent on our
    parameter(s).
    p_i(k+1) = p_i(k)*(.5-w) + b_k*(.5+w)
    (can be rewritten as a series to depend just on b_k, p_0 and w)

    Rewriting p_i(k) dependent on the last bit history updates and ordering by w
    yields a polynomial in w, the thing you posted. This polynomial only depends on:
    (1) the bit history b_1 .. b_k under the current context,
    (2) the initial estimate p_0 (no influence for k->infinity ~ k>>1)
    (3) the parameter w

    p_i(k+1) ~ sum i=1..L a_i*w^i (your polynomial approximation)

    So the polynomial's coefficients a_i only depend on (1) and (2), since you ordered
    by the powers of w.

    My question is, how can we compute the coefficients a_i? I tried to find patterns
    substituting p_k into p_k+1 and ordering by w, ... but i was hardly able to find
    any. There must be a way to compute a_i(k+1) = g(a_0(k) ... a_i-1(k), b_k).
    Otherwise one can't implement it.

    Some more stuff this afternoon, i have to catch my train
    Last edited by toffer; 9th May 2008 at 14:52.

  16. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    >> Well, you can skip some bits but how'd that help?
    >
    > I meant you could be applying some function(s) of the delayed bits.

    And I already said that it should be done after correlation analysis
    of optimized coefficients, not by correlations that you _think_
    there'd be.

    > > Another point is that estimation mapping is different
    > > from update mapping...
    >
    > I never took this into account. One could interpret the PAQ implementation like
    > this: the automata itself represents a static update mapping, while the APMs
    > try to dynamically model the estimation mapping.

    Yes, but paq uses this just to fit counters into bytes.

    > > So for k bits "delay" there's at least 2*2^k+2*2^(k+1)+2^(k+1)-1 parameters.
    >
    > How do you get so many parameters?

    As I said... there're N*2^k estimation parameters, N*2^(k+1) update mapping
    parameters (including the just processed bit) and 2^(k+1)-1 initial probability
    values for every context of lengths 1..k.
    Well, also estimation and update mappings can have different contexts, but
    you'd need enough initial values for update mapping to work.
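
    E.g. plugging k=2 into that formula: 2*2^2 + 2*2^3 + 2^3-1 = 8 + 16 + 7 = 31 parameters.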

    Also its important that all these things are linear, so its easy to
    calculate a set of coefficients for a perfect simulation of a simple
    linear-update counter.

    > I would interpret it as a clue that such changes of statistics happen frequently
    > (especially in lower order models, since the contextual separation isn't as
    > strong as with higher orders) and the jump in compression gain is caused by the
    > ability of bit histories longer than two to distinguish such changes (..000111..)

    Whatever... Its the same as any other primary context model.
    Actually fpaq0f2 has very similar performance to a traditional order1 model
    and uses the same number of counters: http://shelwien.googlepages.com/fpaq0f2.htm
    And there's no reason to think that previous data bytes are less significant
    than context history bits... in fact, its kinda obvious that source models
    don't use history bits - there're even data types with known generator, like
    executables.

    Its another case when the primary model grows huge and lacks statistics.
    Then we just have to merge some contexts and context history similarity
    is a good indicator of what to merge.
    And that's what SSE is, basically, though it uses a quantized version of
    context history aka probability.

    > Rewriting p_i(k) dependent on the last bit history updates and ordering by w
    > yields a polynomial in w, the thing you posted. This polynomial only depends on:
    > (1) the bit history b_1 .. b_k under the current context,
    > (2) the initial estimate p_0 (no influence for k->infinity ~ k>>1)
    > (3) the parameter w
    >
    > p_i(k+1) ~ sum i=1..L a_i*w^i (your polynomial approximation)
    >
    > So the polynomial's coefficients a_i only depend on (1) and (2), since you ordered
    > by the powers of w.
    >
    > My question is, how can we compute the coefficients a_i? I tried to find patterns
    > substituting p_k into p_k+1 and ordering by w, ... but i was hardly able to find
    > any. There must be a way to compute a_i(k+1) = g(a_0(k) ... a_i-1(k), b_k).
    > Otherwise one can't implement it.

    Guess I was wrong to leave p0 as a parameter in that example...
    Look, imagine that we want to optimize just a single w parameter

    p' = p*(1-wr)+y*wr = p*(0.5-w)+y*(0.5+w) = w*(y-p)+(y+p)/2

    Now lets assume that p=Sum[ a[i]*w^i, i=0..k ]. Then

    p' = Sum[ (a[i]/2-a[i-1])*w^i, i=2..k ] + (a[1]/2+y-a[0])*w + (y+a[0])/2

    a'[0] = (y+a[0])/2
    a'[1] = (a[1]/2+y-a[0])
    a'[i] = (a[i]/2-a[i-1])

    Is it clear now?
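
    A minimal sketch of that recurrence, truncated at a fixed maximum power (the truncation degree and the temporary buffer are just a choice of presentation):

    Code:
    #include <vector>
    // one counter step p' = w*(y-p) + (y+p)/2, with p kept as a polynomial in w:
    // p = sum a[i]*w^i, truncated at degree a.size()-1 (start with a = {p0, 0, ...})
    void stepPoly(std::vector<double>& a, int y) {
      std::vector<double> b(a.size(), 0.0);
      b[0] = (y + a[0])/2;                              // a'[0] = (y+a[0])/2
      if (a.size() > 1) b[1] = a[1]/2 + y - a[0];       // a'[1] = a[1]/2+y-a[0]
      for (size_t i = 2; i < a.size(); i++)
        b[i] = a[i]/2 - a[i-1];                         // a'[i] = a[i]/2-a[i-1]
      a = b;                                            // the next higher power is dropped
    }
    // feed the training bits, then p(w) can be evaluated for any candidate w
    // without touching the data again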

    Then, its possible to work with more parameters in the same way,
    eg. we can use the same polynomial class based on p0 for a[i] coefficients.
    But I suspect that it'd take too much memory to get a tolerable precision
    even with two variables.

  17. #17
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    >> Well, you can skip some bits but how'd that help?
    >
    > I meant you could be applying some function(s) of the delayed bits.

    And I already said that it should be done after correlation analysis
    of optimized coefficients, not by correlations that you _think_
    there'd be.

    > > Another point is that estimation mapping is different
    > > from update mapping...
    >
    > I never took this into account. One could interpret the PAQ implementation like
    > this: the automata itself represents a static update mapping, while the APMs
    > try to dynamically model the estimation mapping.

    Yes, but paq uses this just to fit counters into bytes.

    > > So for k bits "delay" there's at least 2*2^k+2*2^(k+1)+2^(k+1)-1 parameters.
    >
    > How do you get so much parameters?

    As I said... there're N*2^k estimation parameters, N*2^(k+1) update mapping
    parameters (including the just processed bit) and 2^(k+1)-1 initial probability
    values for every context of lengths 1..k.
    Well, also estimation and update mappings can have different contexts, but
    you'd need enough initial values for update mapping to work.

    Also its important that all these things are linear, so its easy to
    calculate a set of coefficients for a perfect simulation of a simple
    linear-update counter.

    > I would interpret it as a clue that such changes of statistics happen frequently
    > (especially in lower order models, since the contextual seperation isn't as
    > strong as with higher orders) and the jump in compression gain is caused by the
    > ability of bit histories longer than two to distingush such changes (..000111..)

    Whatever... Its the same as any other primary context model.
    Actually fpaq0f2 has very similar performance to a traditional order1 model
    and uses the same number of counters: http://shelwien.googlepages.com/fpaq0f2.htm
    And there's no reason to think that previous data bytes are less significant
    than context history bits... in fact, its kinda obvious that source models
    don't use history bits - there're even data types with known generator, like
    executables.

    Its another case when the primary model grows huge and lacks statistics.
    Then we just have to merge some contexts and context history similarity
    is a good indicator of what to merge.
    And that's what SSE is, basically, though it uses a quantized version of
    context history aka probability.

    > Rewriting p_i(k) dependent on the last bit history updates and ordering by w
    > yields a polynomial in w, the thing you posted. This polynomial only depends on:
    > (1) the bit history b_1 .. b_k under the current context,
    > (2) the initial estimate p_0 (no influence for k->infinity ~ k>>1)
    > (3) the parameter w
    >
    > p_i(k+1) ~ sum i=0..L a_i*w^i (your polynomial approximation)
    >
    > So the polynomial's coefficients a_i only depend on (1) and (2), since you ordered
    > by the powers of w.
    >
    > My question is, how can we compute the coefficients a_i? I tried to find patterns
    > substituting p_k into p_k+1 and ordering by w, ... but I was hardly able to find
    > any. There must be a way to compute a_i(k+1) = g(a_0(k) ... a_i-1(k), b_k).
    > Otherwise one can't implement it.

    Guess I was wrong to leave p0 as a parameter in that example...
    Look, imagine that we want to optimize just a single w parameter

    p' = p*(1-wr)+y*wr = p*(0.5-w)+y*(0.5+w) = w*(y-p)+(y+p)/2

    Now let's assume that p=Sum[ a[i]*w^i, i=0..k ]. Then

    p' = Sum[ (a[i]/2-a[i-1])*w^i, i=2..k ] + (a[1]/2+y-a[0])*w + (y+a[0])/2

    a'[0] = (y+a[0])/2
    a'[1] = (a[1]/2+y-a[0])
    a'[i] = (a[i]/2-a[i-1])

    Is it clear now?

    Then, its possible to work with more parameters in the same way,
    eg. we can use the same polynomial class based on p0 for a[i] coefficients.
    But I suspect that it'd take too much memory to get a tolerable precision
    even with two variables.
    Sorry, I've been busy the last few weeks.

    Substituting the polynomial like this is easier (and cleverer). Things are clear now, thanks again!

    Do you have any experience/results with using different update and prediction mappings? This is something new to me.

  18. #18
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    Here's a real optimized 4-bit delayed counter.
    Not that these specific values would be of any use, though

    Code:
    P0                Xa              Xb                Ya              Yb
    110110011000000 | 001111000010000 111010110110000 | 000011000011100 111111011100100
    110110101011100 | 001011110011000 111001111110000 | 000011000001001 111111111010110
    110101110000111 | 001011100001110 111001111000110 | 000011011010101 111111001111010
    110011000000000 | 000100000000000 111001000000010 | 000011101111001 111111111111010
    110011001010001 | 001110101000001 111000101100110 | 000001011010111 111100101100110
    110010001001000 | 000101001100000 111011011011100 | 000000110010110 111110000000100
    110010100100000 | 001001010100000 111001011001011 | 000000111011010 111101111110000
    110000000000100 | 000000001100000 110011000100000 | 000000111111111 111011101111100
    110110011000000 | 001111000010000 111010110110000 | 000011000011100 111111011100100
    110110101011100 | 001011110011000 111001111110000 | 000011000001001 111111111010110
    110101110000111 | 001011100001110 111001111000110 | 000011011010101 111111001111010
    110011000000000 | 000100000000000 111001000000010 | 000011101111001 111111111111010
    110011001010001 | 001110101000001 111000101100110 | 000001011010111 111100101100110
    110010001001000 | 000101001100000 111011011011100 | 000000110010110 111110000000100
    110010100100000 | 001001010100000 111001011001011 | 000000111011010 111101111110000
    110000000000100 | 000000001100000 110011000100000 | 000000111111111 111011101111100
    
    ctx = counter context
    b = B[ctx] = bit history
    c = C[ctx] = number of bits in context (history length)
    P[ctx] = stored probability for this context
    
    // estimation (before coding the bit)
    if( c>=4 ) {
      p = ( c==4 ) ? P0[0] : P[ctx];          // P0[0] seeds the first full-history step
      p = Xa[b&15]*(1-p) + Xb[b&15]*p;        // extrapolation mapping, selected by 4 history bits
    } else {
      s = (1<<c); s += b & (s-1);             // short histories: optimized initial values P0[1..15]
      p = P0[s];
    }
    
    [...]
    
    // update (after coding the bit)
    if( c>=4 ) {
      p = ( c==4 ) ? P0[0] : P[ctx];
      P[ctx] = Ya[b&15]*(1-p) + Yb[b&15]*p;   // update mapping; per above, its history includes the just-processed bit
    }
    Last edited by Shelwien; 18th May 2008 at 23:55.
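
    To make the estimate/update split above concrete, here is a compilable sketch of such a delayed counter with a 4-bit history. The table values are placeholders (the real ones come from an optimizer), and the history/count bookkeeping is a guess at the surrounding model, not Shelwien's code:
    Code:
    // Illustrative delayed counter: a 4-bit history selects separate
    // extrapolation (Xa/Xb) and update (Ya/Yb) mappings; short histories
    // fall back to initial values P0[]. All table values are placeholders.
    #include <cstdio>
    
    const int H = 4;                          // history length ("delay")
    double Xa[16], Xb[16], Ya[16], Yb[16], P0[16];
    
    struct DelayedCounter {
      double p = 0;                           // stored probability (meaningful once c > H)
      unsigned b = 0, c = 0;                  // bit history and its current length
    
      double estimate() const {
        if (c >= H) {
          double q = (c == H) ? P0[0] : p;    // P0[0] seeds the first full-history step
          return Xa[b & 15] * (1 - q) + Xb[b & 15] * q;
        }
        unsigned s = (1u << c) + (b & ((1u << c) - 1));
        return P0[s];                         // optimized value for this short history
      }
      void update(int y) {
        unsigned bn = ((b << 1) | unsigned(y)) & 15;   // history including the just-coded bit
        if (c >= H) {
          double q = (c == H) ? P0[0] : p;
          p = Ya[bn] * (1 - q) + Yb[bn] * q;
        }
        b = bn; if (c <= H) ++c;
      }
    };
    
    int main() {
      for (int i = 0; i < 16; ++i) {          // placeholder initialization
        P0[i] = 0.5; Xa[i] = Ya[i] = 0.1; Xb[i] = Yb[i] = 0.9;
      }
      DelayedCounter dc;
      int bits[] = {1, 1, 0, 1, 1, 1, 0, 1};
      for (int y : bits) { std::printf("%.3f ", dc.estimate()); dc.update(y); }
      std::printf("\n");
    }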

  19. #19
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    The implementation is clear to me, I meant: how does it perform compared to a simple counter a la fpaq0p? (Or a counter with optimized prediction=update mapping.) And how did you optimize the parameters? Have you printed the parameters in binary (the patterns somehow look like binary search)?

  20. #20
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    > The implementation is clear to me,

    Really? This initialization workaround wasn't that clear to me,
    I was actually quite disappointed with delayed counters until
    accidentally finding the real importance of initial values.

    > i meant, how does it perform, compared to a simple counter a-la fpaq0p?

    Well, you know, I don't use these counters (mine provide better compression).
    And anyway, the results I have were taken under SSE.
    But such a delayed counter can perfectly simulate the behavior of usual
    counter, so it can never perform worse.
    Likewise, using longer context (bit history) never hurts compression.
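
    For what it's worth, a quick numerical check of that simulation claim in the steady state (the identity estimation mapping and the constants are illustrative choices; startup and initial values are ignored, both counters simply start from the same probability):
    Code:
    // With Xa=0, Xb=1 (identity estimation) and Ya[h] = wr*(h&1),
    // Yb[h] = Ya[h] + (1-wr), the delayed counter's update mapping reduces
    // algebraically to the plain linear update p' = p + wr*(y-p).
    #include <cstdio>
    #include <cstdlib>
    
    int main() {
      const double wr = 0.05;
      double Ya[16], Yb[16];
      for (int h = 0; h < 16; ++h) { Ya[h] = wr * (h & 1); Yb[h] = Ya[h] + (1 - wr); }
    
      double p_lin = 0.5, p_del = 0.5;
      unsigned hist = 0;
      for (int i = 0; i < 1000; ++i) {
        int y = rand() & 1;
        p_lin = p_lin + wr * (y - p_lin);                   // plain linear-update counter
        hist = ((hist << 1) | unsigned(y)) & 15;            // history incl. the just-coded bit
        p_del = Ya[hist] * (1 - p_del) + Yb[hist] * p_del;  // delayed-counter update mapping
        if (i % 250 == 0) std::printf("%f %f\n", p_lin, p_del);
      }
    }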

    > (Or a counter with optimized prediction=update mapping).

    There's no average comparison for that.
    I can only say that its never worse, and can be better adapted to
    any environment, so I'd probably be able to construct bit sequences
    where it'd have as much advantage as you'd like.

    Well, I also admit that improvement won't look impressive to the audience here.
    Afaik it was like this:
    371699 // normal counters
    370757 // 3bit delayed
    370606 // 4bit delayed
    But for me it was quite significant - I couldn't gain this much
    by "heavy" techniques like mixing in another submodel or expanding contexts.

    > Have you printed the parameters in binary (the patterns somehow look like binary search)?

    Of course that's binary. But I keep them that way, so these were copy-pasted from the source.

    > And how did you optimize the parameters?

    My optimizer is even dumber than your implementation - it doesn't use any gradients, at least.
    Its a perl script which manipulates these bit strings in the compressor executable and runs
    it until satisfied.
    Well, its also a part of my C++ syntax extensions so can't be replaced that easily even
    if/when I'd switch to polynomial optimization.

  21. #21
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien
    Really? This initialization workaround wasn't that clear to me,
    I was actually quite disappointed with delayed counters until
    accidentally finding the real importance of initial values.
    What do you mean here? I haven't read anything in your source which I wouldn't have done in the same way?! The right parameters are another thing.


    Quote Originally Posted by Shelwien
    > (Or a counter with optimized prediction=update mapping).

    There's no average comparison for that.
    I can only say that its never worse, and can be better adapted to
    any environment, so I'd probably be able to construct bit sequences
    where it'd have as much advantage as you'd like.

    Well, I also admit that improvement won't look impressive to the audience here.
    Afaik it was like this:
    371699 // normal counters
    370757 // 3bit delayed
    370606 // 4bit delayed
    But for me it was quite significant - I couldn't gain this much
    by "heavy" techniques like mixing in another submodel or expanding contexts.
    Improving the overall performance by something in the range of .1 - 1% is significant, especially when it doesn't come from adding more models, etc.


    Quote Originally Posted by Shelwien
    > And how did you optimize the parameters?

    My optimizer is even dumber than your implementation - it doesn't use any gradients, at least.
    Its a perl script which manipulates these bit strings in the compressor executable and runs
    it until satisfied.
    Well, its also a part of my C++ syntax extensions so can't be replaced that easily even
    if/when I'd switch to polynomial optimization.
    And your script does a binary search? Somehow your parameter patterns look like a result of this (simply the "look and feel") and since you output binary strings instead of using the actual values...

    The gradient is just another source of information for the optimization... well, I don't see what should be wrong with this optimization approach. My counters aren't as good as yours, but this is related to the optimization target, not the optimization technique.

  22. #22
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    > > Really? This initialization workaround wasn't that clear to me,
    > > I was actually quite disappointed with delayed counters until
    > > accidentally finding the real importance of initial values.
    >
    > What do you mean here? I haven't read anything in your source

    Well, its just some example code, not "my source".
    Originally X and Y had fixed 2 MSBs (00 and 11), and there
    was a single multiplication in extrapolation/update etc.

    > which i wouldn't have done in the same way?!

    I don't say its not trivial.
    But then again, there's a lot of space for alternative
    implementations. Like, I doubt you'd write it the same
    way with P0[0].
    Also, about the thing I mentioned, its quite appealing
    to use the same mappings for incomplete contexts, by
    padding them with zeroes or something, like its always
    done with contexts and counters. Without the P0 array.
    At least that's how I implemented it first, and the compression
    difference was quite unexpected. In fact, usual combinatorial
    counters (with variable-rate update) worked better.

    > Improving the overall performance by something in the range of .1 - 1%
    > is significant, especially when it doesn't come from adding more models, etc.

    Well, yes. Specifically here it was like going from 203k to 202k
    on book1 without text tricks.

    > And your script does a binary search?

    It doesn't. What it does is imho closer to genetic optimization.
    Also this representation (and optimizer itself) was first used
    for context quantization and other discrete values, and only after
    that I found that its good enough for usual parameters too.
    Well, it isn't that good with update rates optimization, but I have
    some tricks to overcome that.

    > Somehow your parameter patterns look like a result of this
    > (simply the "look and feel") and since you output binary strings
    > instead of using the actual values...

    Add binary "line numbers" (0000..1111) to that table and
    you might see why. That'd be history bits, I mean.

    > The gradient is just another source of information for the
    > optimization... well, i don't see what should be wrong with
    > this optimization approach.

    1. Processing the whole sample for each measurement is dumb.
    2. As I said, my optimizer is even dumber than yours in that
    respect, because it works only with actual parameter values, not gradients.

    But I can't do much about that, as gradients are not applicable
    for discrete parameters, and its troublesome to automate their
    calculation for a practical model and even more troublesome to
    manually write another model specifically for optimization.

    > My counters aren't as good as
    > yours, but this is related to the optimization target, not
    > the optimization technique.

    What I'm basically saying is that your "optimization technique"
    is more advanced, but maybe not as widely applicable.

    Btw, my C++ preprocessor has separate modes for release versions,
    and optimization builds, where these binary patterns/values are
    embedded into executable as is and accessible to optimizer.
    Last edited by Shelwien; 19th May 2008 at 04:06.

  23. #23
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    The first thing which came to my mind was to adaptively learn the mapping for bit histories of length < 4. Especially in higher order contexts this is significant, or at least having a "better than p=.5" mapping for short bit histories.

    Quote Originally Posted by Shelwien
    Add binary "line numbers" (0000..1111) to that table and
    you might see why. That'd be history bits, I mean.

    p = Xa[b&15]*(1-p) + Xb[b&15]*p;
    It's obvious. But I'm quite surprised that the two mappings sometimes differ that much (e.g. X, Y at the 4th line).

    Quote Originally Posted by Shelwien
    1. Processing the whole sample for each measurement is dumb.
    2. As I said, my optimizer is even dumber than yours in that
    respect, because it works only with actual parameter values, not gradients.

    But I can't do much about that, as gradients are not applicable
    for discrete parameters,
    Well, that is the way things are done in (mixed) integer programming. Find a continuous solution, keep discretizing it and keep track of how the discretization hurts the optimum (e.g. with some search tree). Since the influence of my rounding parameter is tiny compared to the learning rate, I find a good enough solution using this iterative optimization and then run a brute force search around the found minimum in parameter space. Basically it's a question of what you numerically call "continuous" and discrete.
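
    A toy illustration of that two-stage approach (continuous descent first, then a brute-force scan of the quantized neighborhood); the objective, the step size and the 15-bit grid are stand-ins, not toffer's actual cost function:
    Code:
    // Stage 1: crude gradient descent on a smooth toy objective f(w).
    // Stage 2: quantize w to a 15-bit grid and brute-force a small
    // neighborhood around the continuous optimum. Toy example only.
    #include <cmath>
    #include <cstdio>
    
    double f(double w) { return (w - 0.3137) * (w - 0.3137) + 0.01 * std::sin(40 * w); }
    
    int main() {
      // stage 1: continuous descent with a numerical gradient
      double w = 0.5, step = 1e-2;
      for (int it = 0; it < 2000; ++it) {
        double g = (f(w + 1e-6) - f(w - 1e-6)) / 2e-6;
        w -= step * g;
      }
    
      // stage 2: search +/-16 quantization levels around the rounded solution
      const int Q = 1 << 15;                       // 15-bit parameter grid on [0,1)
      int center = int(w * Q + 0.5), best = center;
      for (int q = center - 16; q <= center + 16; ++q)
        if (q >= 0 && q < Q && f(double(q) / Q) < f(double(best) / Q)) best = q;
    
      std::printf("continuous: %.6f  quantized: %d/32768 = %.6f\n", w, best, double(best) / Q);
    }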

    Quote Originally Posted by Shelwien
    and its troublesome to automate their
    calculation for a practical model and even more troublesome to
    manually write another model specifically for optimization.
    My sources are designed to be extensible. The optimizer class's argument is a target function, which can return its value and gradient at a desired point in parameter space. This can be replaced. There's a similar mechanism within the cost function to replace a certain model.

  24. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    > The first thing which came to my mind was to adaptively
    > learn the mapping for bit histories of length < 4.
    > Especially in higher order contexts this is significant, or
    > at least having a "better than p=.5"-mapping for short bit
    > histories.

    1. Adaptation would work like that only for stationary contexts.
    In fact, there's a lot of cases where _any_ adaptive mapping
    would be bad. But even if it improves compression, I doubt
    that you'd be able to detect an optimal configuration among
    all mapping's states during processing of nonstationary data.

    2. Optimal static mapping is even harder to find if its used
    under SSE or mixing or both. But local entropy optimization
    won't work for sure.

    3. Anyway, if an adaptive mapping's snapshot works well
    as a static mapping, imho it means that adaptive mapping
    itself would work even better.

    > > Add binary "line numbers" (0000..1111) to that table and
    > > you might see why. That'd be history bits, I mean.

    > p = Xa[b&15]*(1-p) + Xb[b&15]*p;
    > It's obvious. But I'm quite surprised that the two
    > mappings sometimes differ that much (e.g. X, Y at the 4th line).

    Its an extrapolation at the _end_ of 0100 sequence, and
    update at its start - just before 1.
    Well, here're sorted tables:
    Code:
    Xa                   Xb                     Ya                   Yb                  
    000000001100000 111 110011000100000 111 | 000000110010110 101 111011101111100 111
    000100000000000 011 111000101100110 100 | 000000111011010 110 111100101100110 100
    000101001100000 101 111001000000010 011 | 000000111111111 111 111101111110000 110
    001001010100000 110 111001011001011 110 | 000001011010111 100 111110000000100 101
    001011100001110 010 111001111000110 010 | 000011000001001 001 111111001111010 010
    001011110011000 001 111001111110000 001 | 000011000011100 000 111111011100100 000
    001110101000001 100 111010110110000 000 | 000011011010101 010 111111111010110 001
    001111000010000 000 111011011011100 101 | 000011101111001 011 111111111111010 011
    Like that its more visible how it depends on bits in different positions
    and number of 1s in the context.

    > Basically its a question of what you numerically call "continuous" and discrete.

    Well, _there're_ some more discrete things than others.
    For example, a bit in a binary number can determine whether
    two adjacent columns in the table are merged into single context, or not.

    Of course that can be somewhat changed eg. with mixing, but
    resources don't allow to mix _all_, so...

    > > and its troublesome to automate their
    > > calculation for a practical model and even more troublesome to
    > > manually write another model specifically for optimization.
    >
    > My sources are designed to be extensible. The optimizer
    > class's argument is a target function,

    Is it a template argument or member function argument?

    > which can return its value and gradient at a desired point in parameter space.
    > This can be replaced. There's a similar mechanism within the
    > cost function to replace a certain model.

    I still think that this is possible only for a simple CM where counters
    are directly used for coding. And even there it probably causes a significant
    slowdown... if you don't use different classes for optimization and
    real compressor.

  25. #25
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    With adaptive learning I meant something like a stationary counter (n1, n) -> (n1+y, n+1) for each partial context (here: len<4), which is optimal for a stationary source. After enough learning this counter should provide a near optimal solution (same as minimizing coding cost), since (n1+y)/(n+1) ~ n1/n when n1, n >> 1. The learned adaptive mapping (n>>1) could be used like your P0[s]. I don't know if online learning is worse than offline learning here.
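
    A minimal sketch of such a learned initial mapping: one frequency counter per partial bit history, indexed like the P0[] table above. The (n1+1)/(n+2) smoothing is an illustrative choice, not something specified in the post:
    Code:
    // One (n1, n) frequency counter per short bit history (length c < 4),
    // indexed as s = (1<<c) + low c bits of b, same as the P0[] table.
    #include <cstdio>
    
    struct FreqCounter {
      unsigned n1 = 0, n = 0;                    // ones seen / total seen
      void update(int y) { n1 += unsigned(y); ++n; }
      double p1() const { return (n1 + 1.0) / (n + 2.0); }   // -> n1/n for n >> 1
    };
    
    FreqCounter init_map[16];                    // slots for histories of length 0..3
    
    unsigned slot(unsigned b, unsigned c) { return (1u << c) + (b & ((1u << c) - 1)); }
    
    double estimate_short(unsigned b, unsigned c) { return init_map[slot(b, c)].p1(); }
    void   learn_short(unsigned b, unsigned c, int y) { init_map[slot(b, c)].update(y); }
    
    int main() {
      learn_short(0, 0, 1); learn_short(1, 1, 1); learn_short(3, 2, 0);
      std::printf("p1 after history '11' (len 2): %.3f\n", estimate_short(3, 2));
    }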

    Quote Originally Posted by Shelwien
    > > Add binary "line numbers" (0000..1111) to that table and
    > > you might see why. That'd be history bits, I mean.

    > p = Xa[b&15]*(1-p) + Xb[b&15]*p;
    > It's obvious. But I'm quite surprised that the two
    > mappings sometimes differ that much (e.g. X, Y at the 4th line).

    Its an extrapolation at the _end_ of 0100 sequence, and
    update at its start - just before 1.
    Well, here're sorted tables:

    Xa                  Xb                    Ya                  Yb
    000000001100000 111 110011000100000 111 | 000000110010110 101 111011101111100 111
    000100000000000 011 111000101100110 100 | 000000111011010 110 111100101100110 100
    000101001100000 101 111001000000010 011 | 000000111111111 111 111101111110000 110
    001001010100000 110 111001011001011 110 | 000001011010111 100 111110000000100 101
    001011100001110 010 111001111000110 010 | 000011000001001 001 111111001111010 010
    001011110011000 001 111001111110000 001 | 000011000011100 000 111111011100100 000
    001110101000001 100 111010110110000 000 | 000011011010101 010 111111111010110 001
    001111000010000 000 111011011011100 101 | 000011101111001 011 111111111111010 011

    Like that its more visible how it depends on bits in different positions
    and number of 1s in the context.
    Could you please explain this again and mark the bits? ATM I don't know what you mean.

    Quote Originally Posted by Shelwien
    > Basically its a question of what you numerically call "continuous" and discrete.

    Well, _there're_ some more discrete things than others.
    For example, a bit in a binary number can determine whether
    two adjacent columns in the table are merged into single context, or not.
    What do you mean exactly? The contextual input separation? The mapping isn't directly involved in the optimization maths - it only selects bit models. The whole math thing is related to bit modelling, which is continuous.


    Quote Originally Posted by Shelwien
    Is it a template argument or member function argument?

    I still think that this is possible only for a simple CM where counters
    are directly used for coding. And even there it probably causes a significant
    slowdown... if you don't use different classes for optimization and
    real compressor.
    Most of the stuff in the optimizer is bloated up to allow easier extension and maintenance (virtual functions, some ABCs, ...). No template arguments, since I don't require (very) fast code. Most of the time is eaten up by calculations anyway, which aren't directly involved with modelling, like gradient calculations, some 64 bit divisions, etc.
    Of course I separate the optimizer and the actual implementation of the compressor.

    Yesterday I wanted to see how well the paq state mapping does and if it can be improved; in most cases the optimizer tends to diverge here, and a short brute force search revealed that there's not much chance of improvement. Somehow Matt's y-p>>8 was quite a good choice. I tried orders 0-2 only. The 15 bit parameter quantization seemed to be too coarse here, since changing a parameter by one quantization level often caused the gradient's sign to swap.

  26. #26
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    > which is optimal for a stationary source. After enough learning this counter should
    > provide a near optimal (same as minimizing coding cost)
    > solution, since (n1+y)/(n+1) ~ n1/n when n1, n >> 1.

    Of course it would converge somewhere, but at best that would give us an optimal
    probability for static encoding of the sequence.
    1. "At best" = only for stationary contexts... which don't actually exist as it seems.
    2. We are actually talking about mapping optimization, and that mapping is
    recurrent and nonlinear. Well, there may be some sense in optimizing
    the extrapolation mapping like that... after finishing the optimization
    of update mapping and initial values.

    Also imho the whole idea of markov model convergence is misleading.
    I mean, it really makes sense only for a determined state machine,
    but is commonly applied to the cases where the probability "converges"
    to some value far from 0 or 1, and that is considered "stationary".
    But how can it be truly stationary if the data was generated with
    a pretty much determined source (like eg. executables are) and
    the probability distribution looks like some curve only because
    the context is far from what it should be?
    So that's why there's no stationary data and probability
    distributions can only be locally optimal.

    > The learned adaptive mapping (n>>1) could be used like your p0[s].

    As I said, these P0's have much greater effect than expected.
    At least, statistics should be collected among similar contexts,
    but that wouldn't be optimal either

    > I don't know if online learning is worse than offline learning here.

    Adaptation is only good when it catches the local dependencies.
    But in this case it would be too hard to trace the dependency
    of current estimation back to P0.

    > > 000000001100000 111 110011000100000 111 | 000000110010110 101 111011101111100 111
    > [...]
    > > 001111000010000 000 111011011011100 101 | 000011101111001 011 111111111111010 011
    > >
    > > Like that its more visible how it depends on bits in different positions
    > > and number of 1s in the context.
    >
    > Could you please explain this again and mark the bits. ATM i don't know what you mean.

    3-bit columns are contexts, most recent bit on the right.
    And also I forgot to mention that my example worked with probability of 0.

    > > Well, _there're_ some more discrete things than others.
    > > For example, a bit in a binary number can determine whether
    > > two adjacent columns in the table are merged into single context, or not.
    >
    > What do you mean exactly? The contextual input separation?
    > The mapping isn't directly involved in the optimization
    > maths - it only selects bit models. The whole math thing is
    > related to bit modelling, which is continuous.

    I was talking about my optimizer, which has to optimize such
    discrete parameters as a context quantization map, so is unable
    to use gradients.

    Btw, have you tried using gradients in the model?
    That might be interesting.

    > Most of the stuff in the optimizer is bloated up to allow
    > easier extension and maintenance (virtual functions, some
    > ABCs, ...). No template arguments, since I don't require
    > (very) fast code. Most of the time is eaten up by
    > calculations anyway, which aren't directly involved with
    > modelling, like gradient calculations, some 64 bit
    > divisions, etc.

    I still don't understand how you're going to eg. optimize
    a counter under SSE with it.

    > Of course I separate the optimizer and the actual
    > implementation of the compressor.

    I don't see how it is "of course" - imho that totally slows
    down the development and also introduces the possibility of
    bugs due to release and optimization models mismatch.
    On the other hand, my optimizer allows me to fully retune
    the production model to new sample data, including
    new context quantization, in 3-5 days and mostly automatically.
    And it could be much faster too, if not for my modest
    computational resources.

    > Yesterday i wanted to see, how well the paq state mapping
    > does and if it can be improved,

    The biggest problem with Matt's models is that they're too solid -
    its almost impossible to replace some element with a completely
    different implementation and keep the same results.

    > changing a parameter by one quantization level often caused the gradient's sign to swap.

    That's basically why I don't use gradients at all
    Last edited by Shelwien; 20th May 2008 at 23:17.

  27. #27
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien
    1. "At best" = only for stationary contexts... which don't actually exist as it seems.
    If the contextual separation is strong, this might hold as an approximation. A perfect context should give uniform bit histories; in practice we are far from perfect (as in your example - x86 data with order-n contexts).


    Quote Originally Posted by Shelwien
    2. We are actually talking about mapping optimization, and that mapping is
    recurrent and nonlinear. Well, there may be some sense in optimizing
    the extrapolation mapping like that... after finishing the optimization
    of update mapping and initial values.
    In my previous post I didn't think of separate mappings (concerning this). BTW: How do you *separately* formulate/work out an optimized update/prediction mapping? This might be some kind of misunderstanding, but you first generate and optimize a mapping. This will be the update mapping. Based on this mapping you optimize the "prediction" mapping? I can't imagine another way of doing this.


    Quote Originally Posted by Shelwien
    Also imho the whole idea of markov model convergence is misleading.
    I mean, it really makes sense only for a determined state machine,
    but is commonly applied to the cases where the probability "converges"
    to some value far from 0 or 1, and that is considered "stationary".
    But how can it be truly stationary if the data was generated with
    a pretty much determined source (like eg. executables are) and
    the probability distribution looks like some curve only because
    the context is far from what it should be?
    So that's why there's no stationary data and probability
    distributions can only be locally optimal.
    True, since "perfect" contextual seperation is impossible (knowing every property of the source), in general at least. Global optimality is impossible anyway.


    Quote Originally Posted by Shelwien
    As I said, these P0's have much greater effect than expected.
    At least, statistics should be collected among similar contexts,
    but that wouldn't be optimal either
    I don't expect you optimized the P0's for every context (or some larger amount of contexts) within a model? One P0 set per update/prediction mapping for every model?


    Quote Originally Posted by Shelwien
    3-bit columns are contexts, most recent bit on the right.
    And also I forgot to mention that my example worked with probability of 0.
    I tried to read the numbers in both directions and didn't get it. I'll reread everything tomorrow. Too late to get useful thoughts at this time.


    Quote Originally Posted by Shelwien
    I was talking about my optimizer, which has to optimize such
    discrete parameters as a context quantization map, so is unable
    to use gradients.
    Ok, interesting. Do you mean contexts or bit history quantization, or both? Can you tell me something more about this? And show some results on "general data"? I don't know which data you deal with.


    Quote Originally Posted by Shelwien
    Btw, have you tried using gradients in the model?
    That might be interesting.
    You mean online optimization of the parameters, or adding the gradient itself as an additional degree of freedom? Atm I haven't, and I don't expect it to be practical (computation effort vs gain). But I might try it, it would be interesting to see what happens.


    Quote Originally Posted by Shelwien
    I still don't understand how you're going to eg. optimize
    a counter under SSE with it.
    Having optimized the models which generate the input probability distribution you basically have the same situation as with context mappings - map some input (context, probability, ...) to some output (probability). The input basically is a number which selects a bit model.
    My optimization target is the estimated entropy H(source) ~ sum i=1 to n -ln p_i, where p_i is the estimated probability of the actually coded bit in the i-th coding step. p_i is related to "some output" (see above) and the connected "some input" is the context under the i-th step.
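
    A minimal sketch of that cost function (the names are illustrative; the natural log gives nats, divide by ln 2 for bits):
    Code:
    // Accumulate the estimated code length sum_i -ln p_i over all coding steps,
    // where p[i] is the model's probability for bit=1 and y[i] the actual bit.
    #include <cmath>
    #include <cstdio>
    #include <vector>
    
    double code_length_nats(const std::vector<double>& p, const std::vector<int>& y) {
      double H = 0;
      for (size_t i = 0; i < p.size(); ++i)
        H += -std::log(y[i] ? p[i] : 1 - p[i]);   // -ln of the probability of the coded bit
      return H;                                   // divide by log(2) to get bits
    }
    
    int main() {
      std::vector<double> p = {0.7, 0.6, 0.9, 0.2};
      std::vector<int>    y = {1, 1, 1, 0};
      std::printf("%.3f bits\n", code_length_nats(p, y) / std::log(2.0));
    }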

    Good night

    EDIT: I said that I have to write some more or less short text. I've completed a section about arithmetic coding. Comments and suggestions are welcome. I can't go too deep into detail, since the whole work is limited to about 20 pages and this part (together with others) is just "accessories". While rereading several times I found lots of typos. Every review will help. Thanks in advance!
    Attached Files: preview.pdf (63.7 KB)
    Last edited by toffer; 23rd May 2008 at 01:25.

  28. #28
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    > > after finishing the optimization of update mapping and initial values.

    > In my previous post i didn't think of seperate mappings (concerning this).

    Fix your dictionary, please. Starting from sep_a_rate and quite!=quiet.

    > How do you *seperately* formulate/work out an optimized update/prediction mapping?
    > This might be some kind of misunderstanding, but you first
    > generate and optimize a mapping. This will be the update
    > mapping. Based on this mapping you optimize the "prediction"
    > mapping? I can't imagine another way of doing this.

    My optimizer just randomly modifies parameters until it finds some metric improvement
    (btw, that metric isn't necessarily only a code length. In some cases it includes
    other components, like a function of model memory size).
    And parameter subsets may be optimized separately or not depending on the requirements,
    but either way it continues iteratively until there're no improvements for some long
    enough time.
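
    A bare-bones skeleton of that kind of optimizer (perturb randomly, keep the change on improvement, stop after a long dry spell); the bit-string encoding and the dummy metric below are placeholders, not Shelwien's perl script:
    Code:
    // Hill climbing over a raw parameter bit string: flip a few random bits,
    // re-measure the metric (a dummy stand-in for "run the compressor and read
    // the output size"), keep the change if it improved, stop after a dry spell.
    #include <bitset>
    #include <cstdio>
    #include <cstdlib>
    
    const int NBITS = 4 * 15;                    // e.g. four 15-bit parameters
    
    double metric(const std::bitset<NBITS>& b) { // dummy: smaller is better
      double x = 0;
      for (int i = 0; i < NBITS; ++i) x += b[i] ? (i % 7) : (7 - i % 7);
      return x;
    }
    
    int main() {
      std::bitset<NBITS> cur;                    // current best parameter bits
      double best = metric(cur);
      int dry = 0;
      while (dry < 10000) {                      // "no improvements for long enough"
        std::bitset<NBITS> cand = cur;
        for (int k = 0; k < 1 + std::rand() % 3; ++k)   // flip 1..3 random bits
          cand.flip(std::rand() % NBITS);
        double m = metric(cand);
        if (m < best) { best = m; cur = cand; dry = 0; } else ++dry;
      }
      std::printf("best metric: %.1f\n", best);
    }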

    > > Also imho the whole idea of markov model convergence is misleading.
    >
    > True, since "perfect" contextual seperation is impossible
    > (knowing every property of the source), in general at least.
    > Global optimality is impossible anyway.

    That's exactly what I meant... its misleading.
    Because there're no probabilities in the source.
    Well, probabilities are applicable with analog source,
    but even text is not like that.

    > > As I said, these P0's have much greater effect than expected.
    > > At least, statistics should be collected among similar contexts,
    > > but that wouldn't be optimal either

    > I don't expect you optimized the P0's for every (or some
    > larger amount) of contexts within a model?

    Why, of course its supposed to be optimized for every context separately.
    In this model for a codec which I'm currently supporting, there're
    48 counter tables, 24 SSE1 tables and 12 SSE2 tables, each with its own
    optimized set of parameters (including context).

    Also this is a specialized model for a known data type, so I'm sure
    that a proper universal model would have 10x of that (or more).

    > One P0 set per update/prediction mapping for every model?

    There's no real reason to use the same counter parameters for
    different contexts.
    However I don't widely use delayed counters yet, I think that
    should wait until successful implementation of polynomial (or another)
    fast optimizer.
    The problem there is that I cannot just replace all the counters
    with delayed ones, though that would surely be the best for compression.
    But each delayed counter requires two additional multiplications and
    two table lookups, so its too slow... and searching for a new balanced
    model structure with current methods would take a lot of time.

    > Ok, interesting. Do you mean contexts or bit history quantization, or both?

    Both.

    Here's a context declaration and code generated for it:

    Index y
    ych: ch, 1!1
    yfb: fb, 1!00100010
    yp0: p01, 1!0000000000010000
    y_s: As, -7!000000000000000
    y_p: p, -7!010000000000001
    y_q: q, -7!001000000000111
    yr1: r1, -7!000000000000000
    y_r: r , -7!100000000000000
    yr2: r2, -7!000000000000000
    y_col: col, 1!0000000000000000000

    y = 0;
    y = y*2 + ((ch>0+(1-1)));
    y = y*3 + ((fb>2+(1-1))+(fb>6+(1-1)));
    y = y*2 + ((p01>11+(1-1)));
    y = y*3 + ((p>1+(-7-1))+(p>14+(-7-1)));
    y = y*5 + ((q>2+(-7-1))+(q>12+(-7-1))+(q>13+(-7-1))+(q>14+(-7-1)));
    y = y*2 + ((r >0+(-7-1)));

    p,q,r are some adjacent values (its a table),
    ch is a channel, col is a table column,
    and fb,p01 are scaled probabilities from other counters.
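
    The generated code follows a simple pattern: each input contributes the number of its thresholds it exceeds, and the multiplier makes room for that input's level count. A generic sketch of the same idea (the inputs and threshold values below are made up, not Shelwien's):
    Code:
    // Quantized-context construction in the style of the generated code above:
    // per input, count how many thresholds are exceeded, then fold that level
    // into a mixed-radix index. Inputs and thresholds are illustrative only.
    #include <cstdio>
    #include <vector>
    
    struct QInput {
      int value;
      std::vector<int> thresholds;   // sorted ascending; levels = thresholds.size()+1
    };
    
    int quantized_context(const std::vector<QInput>& in) {
      int y = 0;
      for (const QInput& q : in) {
        int level = 0;
        for (int t : q.thresholds) level += (q.value > t);   // like (p>1)+(p>14)+...
        y = y * (int(q.thresholds.size()) + 1) + level;      // y = y*levels + level
      }
      return y;
    }
    
    int main() {
      // e.g. a match length, a binary flag and a scaled probability (made-up values)
      std::vector<QInput> in = { { 37, {0, 4, 16, 64} }, { 1, {0} }, { 1900, {512, 1024, 2048} } };
      std::printf("ctx = %d\n", quantized_context(in));
    }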

    > And show some results on "general data", i don't know which data you deal with.

    I didn't make any "universal" models since ash, so no "general data" for you.
    Thing is, I really want to create a proper model for text, but it gets too complex.

    > You mean on online optimization of the parameters or add the
    > gradient itself as an additional degree of freedom?

    Yeah, if a probability can be a context, why can't a probability gradient?
    Somehow that reminds me of my idea with using interval arithmetics for more
    precise statistics.

    > > I still don't understand how you're going to eg. optimize a counter under SSE with it.

    > Having optimized the models which generate the input
    > probability distribution you basically have the same
    > situation as with context mappings - map some input
    > (context, probability, ...) to some output (probability).
    > The input basically is a number which selects a bit model.
    > My optimization target is the estimated entropy H(source) ~
    > sum i=1 to n -ln p_i, where p_i is the estimated probability of
    > the actually coded bit in the i-th coding step. p_i is related to "some output" (see
    > above) and the connected "some input" is the context under
    > the i-th step.

    Seems like a misunderstanding again...
    You see, if you have a prediction p from some model, already
    optimized by entropy of p sequence, that doesn't mean that
    SSE(p) sequence would be optimal right away.
    In fact, my counters, which were optimized under SSE, are
    no good without SSE.
    And SSE is not a continuous function despite its interpolation,
    so you cannot really calculate a gradient for it.
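
    For reference, a minimal sketch of the usual SSE shape being discussed: a small context selects a row of bins over the quantized input probability, the output is interpolated between the two nearest bins, and those two bins are then nudged towards the coded bit. Bin count, update rate and the linear quantization are illustrative choices, not Shelwien's implementation:
    Code:
    // Minimal SSE/APM-style map; all constants are illustrative.
    #include <cstdio>
    
    const int CTX = 256, BINS = 33;
    
    struct SSE {
      double t[CTX][BINS];
      SSE() {
        for (int c = 0; c < CTX; ++c)
          for (int i = 0; i < BINS; ++i)
            t[c][i] = double(i) / (BINS - 1);      // start as the identity mapping
      }
      // refine an input probability; remember which bins to update later
      double p(int ctx, double pin, int& lo, double& frac) const {
        double x = pin * (BINS - 1);
        lo = int(x); if (lo > BINS - 2) lo = BINS - 2;
        frac = x - lo;
        return t[ctx][lo] * (1 - frac) + t[ctx][lo + 1] * frac;
      }
      void update(int ctx, int lo, double frac, int y, double rate = 0.02) {
        t[ctx][lo]     += (y - t[ctx][lo])     * rate * (1 - frac);
        t[ctx][lo + 1] += (y - t[ctx][lo + 1]) * rate * frac;
      }
    };
    
    int main() {
      static SSE sse;                              // static: the table is ~67 KB
      int lo; double frac;
      double p = sse.p(42, 0.7, lo, frac);         // refine a model estimate of 0.7 under ctx 42
      sse.update(42, lo, frac, 1);                 // then adapt the two touched bins to bit y=1
      std::printf("%.3f\n", p);
    }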

    > preview.pdf (63.7 KB, 10 views)
    > Comments and suggestions are welcome.

    Don't call the result of arithmetic coding a "floating-point number".
    Its a rational number, or real, but not FP, as that implies limited
    precision.

    Also the description is more or less traditional, but I think that
    its impossible to really write a rangecoder using it.

    > While rereading several times i found lots of mistypes.

    Yeah, there're lots, but I'm too lazy to extract that text from pdf,
    and I'd say the same as whatever spellchecker anyway.
    Last edited by Shelwien; 23rd May 2008 at 19:29.

  29. #29
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien
    The problem there is that I cannot just replace all the counters
    with delayed ones, though that would surely be the best for compression.
    But each delayed counter requires two additional multiplications and
    two table lookups, so its too slow
    ...
    I'm surprised to hear something like this from you

    BTW exactly this slowness was the reason for my counters to have such a weird parameter d. Even the single multiplication is notable. I didn't want to add additional overhead.

    Quote Originally Posted by Shelwien
    Index y
    ych: ch, 1!1
    yfb: fb, 1!00100010
    yp0: p01, 1!0000000000010000
    y_s: As, -7!000000000000000
    y_p: p, -7!010000000000001
    y_q: q, -7!001000000000111
    yr1: r1, -7!000000000000000
    y_r: r , -7!100000000000000
    yr2: r2, -7!000000000000000
    y_col: col, 1!0000000000000000000

    y = 0;
    y = y*2 + ((ch>0+(1-1)));
    y = y*3 + ((fb>2+(1-1))+(fb>6+(1-1)));
    y = y*2 + ((p01>11+(1-1)));
    y = y*3 + ((p>1+(-7-1))+(p>14+(-7-1)));
    y = y*5 + ((q>2+(-7-1))+(q>12+(-7-1))+(q>13+(-7-1))+(q>14+(-7-1)));
    y = y*2 + ((r >0+(-7-1)));
    What do the bit patterns in the declarations mean? I can see how you describe your quantization and I think I know where the numbers come from (comparisons = bit positions, multipliers make room for the number of quantization levels per quantization input). Maybe I should ask: what kind of data do you work with?

    Quote Originally Posted by Shelwien
    Seems like a misunderstanding again...
    What I wanted to point out here is that a context model and an SSE map are implemented in the same fashion (a mapping of a context to a bit model). So after optimizing the counters of the model, I can optimize the SSE parameters. An SSE implementation can have more parameters (like for interpolation, etc...) but atm I'm just focusing on optimizing bit models, so these are the only degrees of freedom in my SSE implementation.

    I don't expect anybody to be my spellchecker, I'll fix the article. Thanks for reading.

    BTW: my reason for being so lazy is that I'm running a Gentoo distribution. Installing new things requires compiling them.

  30. #30
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    > > But each delayed counter requires two additional multiplications and
    > > two table lookups, so its too slow...
    >
    > I'm surprised to hear something like this from you

    Well, I had gone too far even without that already,
    by using SSE2 instead of mixers. So eventually I'd
    probably be forced to do some speed optimizations.
    For now, luckily, the psychoacoustic part is much slower.

    > BTW exactly this slowness was the reason for my counters to
    > have such a weird parameter d. Even the single
    > multiplication is notable. I didn't want to add additional overhead.

    Unary coding allows using relatively expensive formulas.
    Eg. if you use bitwise encoding with a single multiplication per bit,
    and 50% of bytes in your data have the same value, then you
    can use 4 multiplications for separate modelling of that value
    and still have the same number of multiplications (and supposedly speed)
    with significantly better compression: half the bytes then cost 4
    multiplications instead of 8, the other half 4+8=12, for the same average of 8.
    Its also possible to encode a whole string like that, if it
    appears frequently enough.

    > (comparisons = bit positions, multipliers make room for the number of quantization levels
    > per quantization input).

    Exactly.

    > Maybe I should ask: what kind of data do you work with?

    That doesn't really matter... its fairly generic.
    These p,q,r,col could be replaced with distance to last context occurrence,
    escape count, sum of bits in previous byte and last 3 bits of current
    data byte's offset, and after optimization it would work pretty well.

    And there're different kinds of data I work with, mostly originating from
    different kinds of psychoacoustic functions of audio samples.

    > > Seems like a misunderstanding again...
    >
    > What i wanted to point out here is that a context model and
    > a sse map are implemented in the same fashion (a mapping of
    > a context to a bit model). So after optimizing the counters
    > of the model, i can optimize the sse parameters. A sse
    > implementation can have more parameters (like for
    > interpolation, etc...) but atm i'm just focusing on
    > optimizing bit models, so these are the only degree of
    > freedom in my sse implementation.

    And still I don't understand: are you measuring the entropy of the counter
    or of the counter mapping, for optimization?
