1. Hi!

Somehow got into an argument about model orders and such.
That is, my opponent was saying that a single context is enough
for "most cases" - he made an exception for text - at least when
proper initialization is used (inheritance from lower orders,
actually), and that even PPM is a waste of time (although it's a
trade-off for speed compared to normal context mixing). So I tried
to prove that there's a good enough gain from using multiple models
simultaneously. And, after some messing around with copy/paste,
here's my comparison:

----------------------------------------------------------
(compressed size, (de)coding time)

mcpcom.exe wcc386.exe

11223040 536624 uncompressed
8175655 5.469s 398842 0.312s plain order0
5714465 5.391s 318126 0.328s plain order1
5658022 9.063s 310300 0.469s mix(order0,order1)
4980971 8.062s 314199 0.625s plain order2
4913790 8.422s 299592 0.640s order0->order1->order2 switch
4755565 20.187s 291268 1.234s mix(mix(order0,order1),order2)

4976344 3.703s 295342 0.281s PPMd vJ -m256 -o2

Well, in the end it looks like all this only confirmed his opinion;
maybe somebody will find something of interest there, though.

Btw, I also tried using the mixer ripped out from paq8 (included),
but wasn't able to make it do what I wanted. Wonder if anybody
could explain whether I made a mistake somewhere, or if that mixer
really is just as "efficient" as it looks.

2. I think I should make some experiments with an order-1-0 model.

Using two predictors a la FPAQ0P we can mix two probabilities:

For example:
p1 - order-1 probability
p0 - order-0 probability

p = ((p1 * 2) + p0) / 3

or

p = ((p1 * 3) + p0) / 4

Here we can eliminate the division, since division by four can be replaced with a shift by 2.

i.e.

p = ((p1 * 3) + p0) >> 2

etc.

3. That's static mixing, it's not cool. Also it doesn't improve compression much anyway.

4. Btw, fpaq0p results for the same files:

mcpcom wcc386
8199543 401051 // fpaq0p result
2.203s 0.140s // times by included binary
1.969s 0.125s // compiled with IC10 /ML /O3 /QxK /Qipo
4.703s 0.265s // compiled with IC10 /MD /O3 /QxK /Qipo (like executables in order_test1.rar)

5. I'm just thinking about my LZPM and its decompression speed. Adaptive weighting as done in PAQ6 may seriously slow down the process.

Fixed weight mixing as done with PAQ1 is also slow.

I think I should carefully look at yours. Can you please post a formula for your mixing?

6. Originally Posted by Shelwien
Btw, fpaq0p results for the same files:

mcpcom wcc386
8199543 401051 // fpaq0p result
2.203s 0.140s // times by included binary
1.969s 0.125s // compiled with IC10 /ML /O3 /QxK /Qipo
4.703s 0.265s // compiled with IC10 /MD /O3 /QxK /Qipo (like executables in order_test1.rar)
Looking at this I think I must eventually get into IC10. By the way, do you use Visual C++ 2005 or ? As far as I know, with the latest VC all libraries are MT (multi-threaded), so for single-threaded use _CRT_DISABLE_PERFCRIT_LOCKS should be defined.

Anyway, FPAQ0P is faster. LZPM uses even far more optimized version of this type of coder...  7. Originally Posted by encode
By the way, do you use Visual C++ 2005 or ? As far as I know, with the latest VC all libraries are MT (multi-threaded), so for single-threaded use _CRT_DISABLE_PERFCRIT_LOCKS should be defined.
Just because:
icl: warning: option /Qvc8 or higher used with /ML[d] is not supported

(ICL 9.1)

8. 1. Mixing itself can't be slow, it's just a single multiplication.
But additionally updating another counter _is_ slow.

2. Look at Update() in sh_mixer.inc, I think it's fairly obvious.
Also I think that paq's mixer is really dumb, and it's not my mistake that it
doesn't work as well as mine.

3. I don't use Studio, but I ripped out the console compilers from all of them.
So I tend to use recent compilers with VC6 libs - really helps with size,
and with using msvcrt.dll instead of msvcr9.dll etc.

4. I didn't optimize anything - not only for speed, but for size too, in fact.
It's what I've got by gathering some libs and copy-pasting from other projects.
So I think that making a faster coder is no problem.
Actually, some old coders like from http://compression.ru/sh/coders6c2.rar might do.

5. IC10 looks like it has a completely rewritten (again) code generation engine.
So its better to use it together with IC9.1 - sometimes 9.1 is better, sometimes not.
Though IC10 might do much better for newer CPUs.

9. OK, thanks for the reply! Will start diggin'...

10. Hi!

My experience shows that mixing itself isn't slow, but the weight adaptation is time consuming (cmm2 dropped from ~700kb/s to 420 on my pc). Mixing and higher order models are definitely worth the effort, just have a look at lpaq (strip off the match model). Most compression comes from orders 3, 4 and higher ones, where you have to adapt quicker; the lower orders can adapt faster and give a decent baseline of compression.

Your mixing seems to be hierarchical, like mix( p0, mix( p1, p2 ) ), which gives two averages instead of three. Implicitly this gives higher weights to lower orders (the orders in the "outer mix", here p0); try to average over more than two probabilities at once, or swap the mixes. Has anybody tried this hierarchical stuff before? Maybe this can be made efficient?

I got a nice compression improvement from selecting the mixer by a small context (a few msbs from the last byte, coding order and some other synthetic stuff, like steady state information). I only had a brief look at your paq rip-off. You only use two mixers, independent of context? You should at least select a mixer by the coding order; I noticed that this is the most important mixing context (which should be obvious, after a second thought).

Looking forward to further discussions.

n8

11. > My experience shows that mixing itself isn't slow, but the weight adaption
> is time consuming (cmm2 dropped from ~700kb/s to 420 on my pc).

Not exactly sure what you are talking about, but in my implementation
the mixer uses the same update as a counter, so with mixing 2 models you
need to update 3 counters, though there're trade-offs like PPM with

Anyway, I was initially talking about custom models for custom data,
like audio/video codecs and such. And it still remains a question
whether these kinds of bruteforce approaches, like combining the results
of multiple models etc, are worth their speed impact.
Eg. is 2% smaller file worth 2x slower decompression speed or not.
(It seems that even 100x slower compression speed is considered acceptable
for almost any visible size reduction, if decoding speed remains the same.)

> Mixing and higher order models are definitley worth the effort, just have a look at
> lpaq (strip off the match model).

I won't. If you don't know me: I'm not anyone new to the scene, so you don't need
to explain stuff like that. For example, here's my compressor from 2003:
http://compression.ru/sh/ash04.rar (04 with sources, remains obscure for some reason)
http://compression.ru/sh/ash04a.rar (04 binary with optimized parameters)

And also there's this "compressors.txt" file in my archive from the first post,
with some statistics on that, and it shows that eg for executables using more
than order2 "isn't worth the speed impact".
That is, I do personally agree with you, but am unable to persuade those who don't.

> Most compression comes from orders 3,4
> and higher ones, where you have to adapt quicker, the lower orders can
> adapt faster and give a decent baseline of compression.

> Your mixing seems to be hierarchical, like mix( p0, mix( p1, p2 ) ),
> which gives two averages instead of three. Implicitly this gives higher
> weights to lower orders (the orders in the "outer mix", here p0),

It doesn't. Because it gives higher weights to the model which is more
efficient entropy-wise. Also the mixers are initialized with the weight
which doesn't include the first probability at all.

> try to average over more than two probabilities at once or swap the mixes.

I did it like that 5 years ago, thanks.
And now I think that it's better to use the mathematically valid mixer.

> Has anybody tried this hirarchical stuff before?
> Maybe this can be made efficient?

Well, it _is_ efficient.
I think you'd see it if you tried to reach the same compression level
with "dumb" contexts like used in my examples (whole previous byte and such).

I did try to prove it your way, though, by stripping lpaq, but it seems
that it has too many "unsportsmanlike" tricks in it. I did manage
to disable the models except order1 and 2 (despite the #defines),
and text/exe preprocessing, but there's too much stuff still left,
as even the hash function and counters are somewhat tuned to the "real data".
So I hope that for objective mixer comparison some "paq-way" representative
can quickly compose a fpaq0-or-something based direct analog of my coders,
because it'll take me several hours either to tweak my coders that way
or to break down lpaq even further.

> I got a nice compression improvement in selecting the mixer by a small context
> (few msbs from the last byte, coding order and some other synthetic stuff,

Well, that's normal. In my examples the mixers have some context too, btw.

> You only use two mixers, independent of context?

I use it the same way as my mixer.
That is, I wrote a wrapper that allows commenting out
"sh_mixer.inc" and uncommenting "paq8_mixer.inc", and it'll work and
even give some improvement over unmixed contexts, but
much worse than my mixer.

> You shoud at least select a mixer by the coding order,
> i noticed that this is the most important mixing context
> (which should be obvious, after a second thought).

Well, that's only if you use the same mixer array for mixing different orders.
Which isn't obvious. Instead, it's more obvious imho that you'd need different
contexts for mixing different orders - at least different quantization.
Anyway, I guess you didn't look at my sources.

Thanks, though I haven't actually used this kind of approach for a long time
already. But I needed some examples with a clear distinction between using
and not using different orders, and with orders in their historical sense.

> Looking forward to further discussions.

Okay.

12. I had no look at your source, sorry, and maybe I didn't read your posts "exactly" enough, I apologize (and I didn't know you either). I can't look at your sources right now, since I'm at work (blocking .zips, etc); btw I'm new to the scene.

You do mixing like "counter" updates? What is a "counter" in your designations? Do you refer to a probability with adjustment by the error, or a counter counting the occurrences of 0/1 bits?

I tried to make a mixer based on this idea with poor results, nice speed though. For mixing I use an optimization approach (minimizing the entropy), and I can't see how this can do better?! (Concerning the update rule - simply like a counter/bit model, this isn't optimal.) The optimization (weight adjustment) after every bit can suffer from (I don't know how to say it clearly, language barriers) "local noise", which aggravates convergence. PAQ's mixer is a neural network, but not that different from weighted averaging (except the stretch/squash transformation preceding and succeeding mixing) with adjustment based on the gradient. Matt modified the gradient descent to minimize ld( pbit ), rather than standard back propagation (minimizing RMSE). One problem you always face is getting stuck in a local minimum.

Could you please explain "efficient entropy-wise"? Better predictions lower the entropy - that is clear. My first thought was: the "inner" mixer joins the higher orders into a probability like (a*p1+b*p2) / (a+b) = p', and now you mix again with p0, the same way. The inner "mix" completely ignores the relation to the lower order - this looked "suspicious" to me.
Something which comes to my mind now was an experiment to improve mixing speed by only using 3 or 4 active models out of 6 to 8 in mixing, e.g. orders 654 when coding at order 6, and completely ignoring lower ones - which improved speed AND compression.
Maybe this is exactly what you meant!?

Mixer initialization isn't that critical in my experience.

I agree with you that a slight compression gain at a huge amount of extra time isn't worth the effort. The general order-n models are somewhat universal (since data is stored in bytes), but special models will definitely do better - like a bi- or trigram word model for text, surrounding pixels for images and so on. I think a specialized model along with maybe order210 will give nice results; and higher orders are definitely not worth the effort here.

13. > and i didn't know you too

Well, it's no wonder, as I've got a compression-related job
which blocks me from advertising my inventions, and nothing
interesting was happening in the field, beside Matt's works,
which are great as programming examples, but imho a complete
dead end considering further development of compression theory
(maybe not for Matt himself, but surely for people trying to follow).
And just recently I accidentally found this forum and discovered
that maybe there's some interesting development again.

> You do mixing like "counter" updates?
> What is a "counter" in your designations?
> Do you refer to a probability with adjustment by the error,
> or a counter like counting the occurences of 0/1 bits.

Well, by counter I mean a statistic counter like these "states".
Actually my counter is derived from the same origin as states -
an integer count of occurrences, that is.
Btw, here's my first compressor with more straight-forward
counters than these derived from "ash", which I used in
"order_test1.rar" examples.
http://compression.ru/sh/ppmy_9a8.rar

So, it evolved like this:
1. P0 = bitcount[0]/(bitcount[0]+bitcount[1]);
   bitcount[bit]++;
2. P0 = bitcount[0]/(bitcount[0]+bitcount[1]);
   (bitcount[bit]*=(1-1/InvSpeed))++;
3. P0 = bitcount;
   (bitcount*=(1-1/bitcount))+=(bit?0:1/bitcount);
   (bitcount*=(1-1/InvSpeed))++;
4. P0 = bitcount;
   wr = Tbl[InvSpeed][bitcount];
   (bitcount*=(1-wr))+=(bit?0:wr);
   bitcount++;

So step 2 here is a nonstationarity countermeasure, and the other steps
are mostly speed optimizations. And Matt made another step here,
by quantizing such a counter into a single byte - which costs
some redundancy. So of course his approach has its place in the
industrial compression schemes (like h264), but I think it really
complicates the development of new models.

> I tried to make a mixer based on this idea with poor results, nice speed
> though. For mixing i use an optimization approach (minimizing the entropy),

My old BFA mixer used that kind of bruteforce entropy
minimization, which accumulates all the possible codelengths for a given
set of weight values and uses the weight with minimal length for actual
encoding:

for( w=0,i=0; w<=1; w+=1/63,i++ ) {
  p = p1*(1-w) + p2*w;
  (len[i]*=(1-1/InvSpeed)) += -log2(bit?1-p:p);
}

Here's a working example from that time:
http://compression.ru/sh/o1bfa.rar

Well, this is what I usually call "entropy-wise"... but my mixer from
the first post is based on a newer idea... and has even better performance,
though I was surprised when I compared it with the BFA mixer and found that out.

And that new idea is really simple: we update the mixed probability as
if it was from a normal counter, and then "back-propagate" the change
into reachable parameters... weight, this time.
Well, it looks like this (here's a "translated" version from my sh_mixer.inc):

int w;
iMixer::Process( bit, p1, p2, InvSpeed, Limit ) {
  Counter tmp;
  tmp.P = p1+(p2-p1)*w;        // mixed probability of 0
  tmp.Encode( bit );           // encode the bit with mixing result
  tmp.Update( bit, InvSpeed ); // update tmp counter like usual
  // speed opt - no sense to change weight when p1 and p2 are too close
  if( ABS(p2-p1)>Limit ) {
    // change weight so that we'll get an "updated" value from repeated mixing
    w = (tmp.P-p1)/(p2-p1);
    w = max(0,min(1,w));
  }
}

> PAQ's mixer is a neural network, but not that different from
> weighted averaging (except the stretch/squash transformation preceding
> and succeeding mixing) with adjustment based on the gradient. Matt modified
> the gradient descent to minimize ld( pbit ), rather than standard back
> propagation (minimizing RMSE). One problem you always face is getting
> stuck in a local minimum.

Well, I think that neural network origin is what makes Matt's mixer so
inefficient in my example. That is, maybe he did manage somehow to make
it output something useful (it's always possible when you build a model
from a given set of functions, compared to trying to insert a function
into a working model), but it surely doesn't adapt well to some abstract
probabilities, out of paq context.

> Could you please explain "efficient entropy wise". Better predictions lower
> the entropy - that is clear. My first thought was - the "inner" mixer joins
> the higher orders to a probability like (a*p1+b*p2) / (a+b) = p' and now
> you mix again with p0, the same way. The inner "mix" completly ignores the
> relation to the lower order - this is looked "suspicious" to me.

Well, my 2-1-0 mix looks like this:

// (de)coding a bit from "flag" variable.
// TTable_NN,MMM are counter update mode settings (kinda random here)
// also "100" in mixer calls is a limit value = 0.003 in 0..1 interval
p1 = order0[z].P;
p2 = order1[p][z].P;
p3 = order2[pp][p][z].P;
p4 = p1 + (p2-p1)*w1[p][z].w;
w2[p][z].Process( flag, TTable_5,1,100, p4,p3 );
order0[z].Update( flag, TTable_20, 700 );
order1[p][z].Update( flag, TTable_10, 600 );
order2[pp][p][z].Update( flag, TTable_8, 400 );
w1[p][z].Process( flag, TTable_5,1,100, p1,p2 );

> Something which comes to my mind now was an experiment to improve mixing
> speed, by only using 3 or 4 active models out of 6 to 8 in mixing, e.g.
> orders 654, when coding at order 6 and completly ignoring lower ones -
> which improved speed AND compression.
> Maybe this is exactly what you meant!?

I didn't, though I'm pretty sure that you'd get a valid result from
any set of estimations you run through my mixer.

> Mixer initialization isn't that critical in my experience.

It is quite significant with wide enough mixer context.

> I agree with you, that a slight compression gain compared to a huge amount
> of extra time isn't worth the effort.

Well, actually you agree with my opponent here. What I think is that it's always useful to have the best result you're able to
reach within common sense restrictions (eg no supercomputers etc).
Because only when you have values to compare, you'd be able to make a
correct decision on where to stop.
And the current problem is that paq8 isn't quite that. It's obviously more
useful than some widespread archivers for data redundancy analysis, but it's
too easy to make a very simple model with better results for custom data -
and no wonder as paq8 uses lots of tricks for common files compression, but
its compression classes, which aren't particularly good even in theory, are

> The general order-n models are
> somehow universal (since data is stored in bytes), but special models will
> definitley do better - like a bi- or trigram word for text, sourrounding
> pixels for images and so on. I think a specialized model along with maybe
> order210 will give nice results; and higher orders are definitely not worth
> the effort here.

Well, it depends on the target.
That is, yes, if you aim for best places in composite comparisons (size+speed),
but this approach completely fails in data analysis.

14. @encode: wonder if there's a way to have my indents back and no smilies.

15. I have to state that I never saw this weighting approach before. Looks interesting to me, maybe I'll give it a try. To summarize your "binary" (two inputs) weighting (correct me if I'm wrong):

w0 = some initial weight

update - binary averaging:
p' = p1 + (p2-p1)*wk = (1-wk)*p1 + wk*p2
(saves one multiplication, nice)
note: (p' - p1)/(p2-p1) = wk
adjust p' in the direction of the prediction, maybe like Ilia said:
p'' = p' + (p* - p') / S
p* = optimal prediction, probability = 0 or 1 depending on the bit
and extract the new weight:
w(k+1) = (p''-p1)/(p2-p1)

I think the main problem is the division - a speed impact. This could be improved by using a (nonlinearly quantized) lookup table for 1/(p2-p1), scaled by some factor.

But I can't see a "clear" mathematical motivation behind this - like minimization of coding cost. It is clear that this (somehow - but how, mathematically?) pushes wk in the direction of the more accurate prediction.

Your BFA is locally optimal (per bit coding step) but loses the relation to the other orders. This reminds me of lz77 parsing: optimal match length (viewed as entropy) for the current position (greedy; the "innermost mix()"), or optimal length for the current (innermost mix()) + the next (the next outer to innermost mix()), etc... (a few bytes lookahead, or even optimal parsing of a whole file/block). Ilia is the expert here. I don't think that a NN mixer is inefficient - it does what it is forced to do mathematically - minimizing coding cost by gradient descent.

Btw Matt doesn't lose the ability to change adaptation speed. The bit probabilities aren't directly calculated from counter values. Instead the counter state is mapped to a probability, which is adjusted. The adjustment speed can be modified; in fact it is decreased (somehow) proportionally to the frequency of the corresponding state's occurrence.

To correct my previous statement: I agree with your conclusion. Using specialized modelling will do better than general modelling. I do general modelling, since I'm implementing some compressor (hobbyist) - I don't do data analysis.

16. Originally Posted by Shelwien
@encode: wonder if theres a way for having my idents back and no smilies.
I don't quite understand the question. If it's about the forum engine - it's just as is.
The forum should remember you via cookies.
To insert a quote, press the "Quote" link on the message, or use the "" tags.
Smilies are always enabled.

17. 2Shelwien
By the way, do you know more efficient model updates schemes for pure order-N coders? Can you recommend something for my LZPM?
Current well known schemes are:
1.
if (y) pr += ((4096 - pr) >> 5);
else pr -= (pr >> 5);

2.
pr += (((y << 12) - pr) >> 5);
// or
pr += (((y << 12) - pr) >> 5) + (pr < 2048);

3.
if (y) pr += ((65536 - pr) >> 5);
else pr -= (pr >> 5);
// (P() = pr >> 4)
Uwe proposed to use different shift values depending on the current probability value - i.e. more aggressive update when the probability is more skewed.
I just tested LZPM with an order-1-0 coder and a simple mixer. Well, it helps mainly with binary files; at the same time the code should be fully rewritten and is not as simple.
Just interesting to ask an expert like you.
For those who didn't know: Eugene Shelwien is one of the world's biggest experts in data compression and math. It's a pride and pleasure to see such people on this forum!

18. Taking a few minutes of my break I derived the following (using an update rule like Ilia posted):

p' = pa * w(k) + pb * (1-w(k))
p* = optimal choice, 0 or 1
p'' = p' + (p*-p') * alpha
w(k+1) = (p'' - pb) / (pa - pb)
adjusted by alpha*100% of the prediction error

...calculations, simplify:

w(k+1) = (1-alpha)*w(k) + alpha * (p* - pb)/(pa - pb)

(maybe there's a mistake, i only wrote it down quickly)

Maybe the second term (p*-pb)/(pa-pb) can be viewed as some approximation of the gradient regarding ld( p(bit) )?! I would interpret it as model b's prediction error normalized by the difference between the two predictions.

Any ideas?

Greets

19. >> wonder if there's a way to have my indents back and no smilies.

{
test
}
"test" is supposed to have 2 spaces at the line start.

> Not quite understand the question. If its about forum engine - its just as is.
> Forum should remember you via cookies.
> To insert a quote, press the "Quote" link on the message, or
> use the "" tags.
> Smilies are always enabled.

Well, not my problem if it messes up my explanations, then.

20. Will look at the miniBB FAQ, however, AFAIK, there is no solution for indents...

21. @encode:
> By the way, do you know more efficient model updates schemes for pure
> order-N coders? Can you recommend something for my LZPM?
> Current well known schemes are:
>
> 1. if (y) pr += ((4096 - pr) >> 5); else pr -= (pr >> 5);

So, let's rewrite this for pr in [0..1]:
wr = 1/32
pr = pr*(1-wr) + y*wr

I think that with this you'll suddenly be able to
tune the counter much better than in the form quoted above.
Like, the 32 in 1/32 can even be dynamic,
eg. the number of context occurrences.
And it all can be implemented with one multiplication,
which is not any worse than shifts nowadays.

> 2. pr += (((y << 12) - pr) >> 5);

This one is a speed optimization of (1).
With a small number of such updates per iteration, if()
version might be faster for skewed pr (far from 0.5),
due to CPU's branch prediction, but I suppose that
(2) is better for common use anyway.

> 2a. pr += (((y << 12) - pr) >> 5) + (pr < 2048);

wr = 1/32
pr = pr*(1-wr) + y*wr + pr*wr/64
pr = pr*(1-wr*63/64) + y*wr

So I guess this is basically a usual update, but with slightly
different speed and an extra possibility of going out of the
[0..1] interval, so an additional check is needed.

> 3. if (y) pr += ((65536 - pr) >> 5); else pr -= (pr >> 5);
> // (P() = pr >> 4)

Well, this one is (1) with 16bit precision instead of 12bit.
I can say that my tests showed that with probability
precision from 12 to 16 bits you can "get lucky" with any of them,
depending on the optimization target.

To be specific, I have results like this:
370648 370607 // 15 bits
370652 370607 // 12 bits
370646 370628 // 11 bits
371075 370708 // 8 bits
And it's not about a simple change of the shift value - a complete
model reoptimization (contexts and other parameters) was
performed for an objective comparison (numbers in the first column
are results before optimization).
So I think that you need 12bit precision, but anything more
is masked by the "model noise".

Here's a list of "error weight" values for different precisions
(bits lost due to the given order of redundancy in 1Mbit),
which kinda supports my statement too.

In:= N[Table[{i, 1000000*Log[2, 1/(1 - 2^-i)]}, {i, 8, 15}]]

Out= {{8, 5646.56}, {9, 2820.52}, {10, 1409.57}, {11, 704.613},
{12, 352.263}, {13, 176.121}, {14, 88.0578}, {15, 44.0282}}

> Uwe proposed to use different shift values depending on the current
> probability value - i.e. more aggressive update when probability is more
> skewed.

Well, it's basically what SSE/APM does...
So obviously you can use a static map and include other contexts too.
If only there was a good way to optimize such things,
other than bruteforce...

> I just tested LZPM with order-1-0 coder and simple mixer. Well, it helps
> mainly with binary files, at the same time the code should be fully
> rewritten and not as simple.

Why, surely where you see the difference depends on your model...
Also some more things to mention:
1. PPM compresses better than LZP and nobody proved that LZ is necessarily faster.
2. Bit tweaking and table lookups are good for embedded solutions,
but can cause even worse slowdown than a bunch of divisions on current x86 arch.
3. You don't need to do all things the same way as Matt does. For example,
unary symbol coding (symbol can as well be an offset in LZ) is much faster
than binary coding with modeling of each bit.

So, say, what do you think about ppmonstr?..

22. Ah, forgot one more thing.
My counters actually have one more parameter,
tuning of which helps too. It's kinda like this:

wr = 1/Speed
w0 = 1/Weight // order(-1) weight
pr = pr*(1-wr) + (y*(1-w0)+(1-y)*w0)*wr

dp = w0*wr
pr = pr*(1-wr) + (y ? wr-dp : dp)

So your (2a) is imho a glitchy version of this.

23. @Shelwien:

Hi,
Even though I'm not an expert on data compression, your explanations are very clear from my point of view. Thanks a lot. I'm designing an archive format for game / simulation software. Its name is Bit Archive Format. I'm trying to use a ROLZ based approach on the compression side. I've found that ROLZ is very fast and efficient. I don't like PPM or pure CM. But, I'll try to use your mixing ideas. I'm sure this will help a lot for encoding literal/match length symbols instead of bare order-0 range coding. As "encode" said before, order 1-0 mixing seems very good.

Main design goal is effective compression (can be very slow and resource hungry) and fast decompression. Game files are generally in binary format, aligned to 32-bit data (lots of ints and floats). Do you have any tips for me?

Thanks a lot
Osman Turan

24. Btw, about these table lookups you all seem to like so much.
It doesn't look very good for lookup-only compressor
implementations, as they would obviously have L1 stuffed
with counter values and such - even more so when using
hashtables for context lookups.
Wonder if somebody could run this on some Core Duo;
lookups should be even slower (relatively) there.

Test result for P4 3.1:

16k: DivConst(imul)=0.015s, Div=0.078s, Lookup=0.015s
32k: DivConst(imul)=0.031s, Div=0.157s, Lookup=0.015s
64k: DivConst(imul)=0.094s, Div=0.312s, Lookup=0.032s
128k: DivConst(imul)=0.172s, Div=0.625s, Lookup=0.078s
256k: DivConst(imul)=0.390s, Div=1.282s, Lookup=0.531s
512k: DivConst(imul)=0.812s, Div=2.532s, Lookup=1.562s
1024k: DivConst(imul)=1.609s, Div=5.875s, Lookup=4.219s
2048k: DivConst(imul)=3.187s, Div=10.969s, Lookup=15.547s

So
L1Lookup = 1/2*Mul = 1/10*Div
L2Lookup = 2*Mul = 3/5*Div
DDR2Lookup = 5*Mul = 3/2*Div

25. I've just run it on my Asus laptop. Here are details about the computer and the test results:

Core 2 Duo T7500 @ 2.20 GHz
2 GB RAM

16k: DivConst(imul)=0.016s, Div=0.046s, Lookup=0.016s
32k: DivConst(imul)=0.031s, Div=0.094s, Lookup=0.015s
64k: DivConst(imul)=0.047s, Div=0.187s, Lookup=0.032s
128k: DivConst(imul)=0.109s, Div=0.343s, Lookup=0.078s
256k: DivConst(imul)=0.219s, Div=0.717s, Lookup=0.156s
512k: DivConst(imul)=0.453s, Div=1.388s, Lookup=0.328s
1024k: DivConst(imul)=0.889s, Div=2.793s, Lookup=0.920s
2048k: DivConst(imul)=1.825s, Div=5.600s, Lookup=2.262s

26. @osmanturan:

1. Thanks for testing, though 2M is too small to get out of the
Core Duo's L2, I guess. But even here multiplications look
comparable to lookups... and they promised to make divisions
2x faster in Penryn...

2.
> I'm trying to use ROLZ based approach on compressing side. I've found that
> ROLZ is very fast and efficient. I don't like PPM or pure CM.

Well, CM surely is not the best method if you aim for fast (de)compression.
But PPM is worth considering, I think. Actually, PPM with unary encoding isn't
that much different from LZ; the only difference is that you kinda use unary
encoding for the match length in PPM. Guess somebody should try optimizing
PPM for speed instead of improving LZ's compression...

> But, I'll try
> to use your mixing ideas. I'm sure this will help a lot for encoding
> literal/matching length symbols instead of bare order-0 range coding. As
> "encode" said before, order 1-0 mixing seems very good.

There's a good trick even within a single order.
Like, you can have two counters - one with a "fast" update and one
with a "slow" one - so after mixing the result would be close to
optimal both for stationary and nonstationary data.
Actually, Matt has explained this in the paq source.

> Main design goal is effective compression (can be very slow and resource
> hungry), fast decompression.

Well, your work would be more original/promising if you tried to take some
good CM model and make it asymmetric. For example, it can be a static model
(Bulat did that already, though), or realtime tuning of a fast simple model
to the data.

> Game files are generally in binary format which aligned with 32 bits data
> (lots of int and floats). Do you have any benefit for me?

Well, for numbers you surely should consider at least something like
http://en.wikipedia.org/wiki/Regression_analysis

Also unary coding is much better than binary unless the data is really random.

And for floats the best thing seems to be transforming them to fixed-point
form and using interval encoding.

Actually, I did something similar for CalgaryCorpus/geo compression
(it mostly contains floats in some ancient format):
http://compression.ru/sh/fltgeo3b.rar
The model there is wrong in many ways, but it still shows how you're supposed
to compress floats. Also, it appears that it still compresses the geo file
better than paq8... quite surprisingly:

geodump.ha 40047
GEO.paq8o6 43911
GEO.pmm 49638

27.
Btw, to the Russians here: I wonder if everybody has seen this:
http://compression.ru/sh/2003.htm
(as lots of stuff I linked in this thread is mentioned there).

28.
16k: DivConst(imul)=0.016s, Div=0.016s, Lookup=0.000s
32k: DivConst(imul)=0.015s, Div=0.047s, Lookup=0.000s
64k: DivConst(imul)=0.031s, Div=0.079s, Lookup=0.015s
128k: DivConst(imul)=0.063s, Div=0.172s, Lookup=0.031s
256k: DivConst(imul)=0.109s, Div=0.344s, Lookup=0.062s
512k: DivConst(imul)=0.219s, Div=0.672s, Lookup=0.172s
1024k: DivConst(imul)=0.437s, Div=1.375s, Lookup=0.453s
2048k: DivConst(imul)=0.969s, Div=2.781s, Lookup=1.344s

This is on a Q6600 @ 2.4 GHz on my side.

29.
2.2 GHz Athlon-64 3500+, 2 GB, WinXP (32-bit), compiled with g++ -O2 -Os -march=pentiumpro -fomit-frame-pointer -s

16k: DivConst(imul)=0.219s, Div=0.125s, Lookup=0.016s
32k: DivConst(imul)=0.187s, Div=0.188s, Lookup=0.031s
64k: DivConst(imul)=0.359s, Div=0.375s, Lookup=0.063s
128k: DivConst(imul)=0.750s, Div=0.734s, Lookup=0.141s
256k: DivConst(imul)=1.562s, Div=1.531s, Lookup=0.422s
512k: DivConst(imul)=3.125s, Div=3.047s, Lookup=0.968s
1024k: DivConst(imul)=6.265s, Div=6.110s, Lookup=2.484s
2048k: DivConst(imul)=12.562s, Div=12.266s, Lookup=6.328s

Edit: my bad. Looking at the assembler source (-S), I see that -Os (optimize for size) prevents g++ from optimizing a division by a constant into a multiplication. Repeating the test:

C:\mp>g++ -O2 divtest.cpp

C:\mp>a
16k: DivConst(imul)=0.031s, Div=0.203s, Lookup=0.047s
32k: DivConst(imul)=0.063s, Div=0.219s, Lookup=0.046s
64k: DivConst(imul)=0.047s, Div=0.360s, Lookup=0.078s
128k: DivConst(imul)=0.109s, Div=0.719s, Lookup=0.187s
256k: DivConst(imul)=0.375s, Div=1.485s, Lookup=0.453s
512k: DivConst(imul)=0.750s, Div=2.984s, Lookup=1.031s
1024k: DivConst(imul)=1.500s, Div=5.938s, Lookup=2.687s
2048k: DivConst(imul)=3.016s, Div=11.953s, Lookup=6.547s

C:\mp>g++ -O2 -Os divtest.cpp

C:\mp>a
16k: DivConst(imul)=0.218s, Div=0.141s, Lookup=0.016s
32k: DivConst(imul)=0.187s, Div=0.172s, Lookup=0.047s
64k: DivConst(imul)=0.359s, Div=0.360s, Lookup=0.078s
128k: DivConst(imul)=0.750s, Div=0.719s, Lookup=0.156s
256k: DivConst(imul)=1.562s, Div=1.516s, Lookup=0.422s
512k: DivConst(imul)=3.110s, Div=3.031s, Lookup=1.000s
1024k: DivConst(imul)=6.234s, Div=6.078s, Lookup=2.422s
2048k: DivConst(imul)=12.516s, Div=12.203s, Lookup=6.531s

C:\mp>g++ divtest.cpp

C:\mp>a
16k: DivConst(imul)=0.063s, Div=0.265s, Lookup=0.032s
32k: DivConst(imul)=0.062s, Div=0.266s, Lookup=0.078s
64k: DivConst(imul)=0.109s, Div=0.516s, Lookup=0.156s
128k: DivConst(imul)=0.234s, Div=1.047s, Lookup=0.328s
256k: DivConst(imul)=0.547s, Div=2.125s, Lookup=0.703s
512k: DivConst(imul)=1.078s, Div=4.265s, Lookup=1.516s
1024k: DivConst(imul)=2.157s, Div=8.531s, Lookup=3.578s
2048k: DivConst(imul)=4.313s, Div=17.156s, Lookup=8.500s

I used -Os because it makes paq8 faster for some reason. But paq8 doesn't use integer division.

BTW, this is with 128K L1 cache, 512K L2 cache, 2 GB DDR2 3-3-4-8.

30.
@toffer:

> w(k+1) = (p''-p1)/(p2-p1)
>
> I think the main problem is the division - a speed impact, this could be
> improved by using a (nonlinear quantized) lookup table for 1/(p2-p1) scaled
> by some factor.

Well, that'd be a minor problem if you managed to gain any worthwhile results
out of this.

> But I can't see a "clear" mathematical motivation behind this - like
> minimization of coding cost. It is clear, that this (somehow - but how,
> mathematically) pushes wk in the direction of the more accurate prediction.

Okay, we don't have any objective metric for update optimization,
because the only real metric, the size of the compressed data, can't be
completely recomputed on every iteration.
So actually I was talking about:
1. An update like pr = pr*(1-wr) + y*wr
is mathematically optimal for the codesize sum with exponential weights.
(That is, if we count it like len = len*(1-wr) - log2(y*pr + (1-y)*(1-pr)).)
2. My mixer behaves exactly the same as a counter if both input
probabilities are constant.

But of course, I can't call it an ideal mixer or anything like that...
That'd be plain wrong, considering that I don't actually use it in my
models now. Just that IMHO it's better than paq's mixer.

> Your BFA is locally optimal (per bit coding step) but loses relation to
> the other orders.

Yes, but it's better than nothing anyway.
And you can always calculate estimates for all weight combinations, etc.

Also, the same approach applies to any parameters of non-recurrent
functions, so I once made a BFA-driven ppmy version.

Btw, did you see the bfa2 includes with log() replaced with division?
The funny thing is, it's not some random empirical optimization; it just
uses the derivative of the length function with respect to the w variable.

> I don't think that a NN mixer is inefficient - it does what it is forced
> to do mathematically - minimizing coding cost by gradient descent.

Well, I can believe that it might reach some optimum (probably a local one)
on stationary data, but it just isn't adapting fast enough for a real environment.

> Btw, Matt doesn't lose the ability to change adaptation speed. The bit
> probabilities aren't directly calculated from counter values. Instead the
> counter state is mapped to a probability, which is adjusted. The adjustment
> speed can be modified; in fact it is decreased (somehow) proportionally to
> the frequency of the corresponding state's occurrence.

Well, SSE isn't actually used like that... it does partial compensation,
but it's kinda the same as saying that you can map any hash value back
to its source string. So if you're able to change both the counter behavior
and the SSE parameters, it works much better than with SSE parameters only.

> ==========
>
> Taking a few minutes of my break i derived the following (using an update
> rule, like ilia posted):
>
> p' = pa * w(k) + pb * (1-w(k))
> p* = optimal choice 0 or 1
> p'' = p' + (p*-p') * alpha
> w(k+1) = (p'' - pa) / (pb - pa)
> adjusted by alpha*100% of prediction error
>
> ...calculations, simplify:
>
> w(k+1) = (1-alpha)*w(k) + alpha * (p* - pb)/(pa - pb)
>
> (maybe there's a mistake, i only wrote it down quickly)

No, it's the same thing. Actually, I had initially written it like that
too, but the current version appears slightly better due to different
rounding or something like that.

> Maybe the second term (p*-pb)/(pa-pb) can be viewed as some approximation
> of the gradient regarding ld( p(bit) )?! I would interpret it as model b's
> prediction error normalized by the compliance of both predictions.
> Any ideas?

Well, AFAIR it was like this:

In:= Solve[ D[y*Log[p1*(1-w)+p2*w] + (1-y)*Log[1-p1*(1-w)-p2*w],w]==0, w]
Out= { w -> (p1 - y)/(p1 - p2) }
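Spelled out, the Mathematica one-liner is just setting the derivative of the coding cost to zero (my expansion of the step):

```latex
% mixed prediction, with y \in \{0,1\} the coded bit:
p(w) = p_1 (1-w) + p_2 w

\frac{\partial}{\partial w}\Bigl[\, y \log p + (1-y)\log(1-p) \,\Bigr]
  = (p_2 - p_1)\left(\frac{y}{p} - \frac{1-y}{1-p}\right) = 0
\;\Longrightarrow\; p = y

% substituting back:
p_1(1-w) + p_2 w = y
\;\Longrightarrow\;
w = \frac{y - p_1}{p_2 - p_1} = \frac{p_1 - y}{p_1 - p_2}
```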

With y = the bit value and p1, p2 = the probabilities of y==1.