1. Hi!

In my recent work I discovered something that seemed odd to me. My first attempt at building a match model was explicit modelling of the match length under some context (orders 0-4, current match length). I found that all of these work far worse than simply setting one of the mixer's inputs to 1 when a long match is found. The threshold is a length of 8 in my current implementation. My mixer is based on iterative minimization; it's no NN.

Any experience? Is this normal?!

Greets

2. CMM1 is already complex enough as a compressor, judging from the source! Without seeing the source of CMM2 it is really difficult to give an opinion; nevertheless, I think those who have contributed to PAQ, which seems similar to your program, could lend you a hand!

3. CMM1 was my first try and it was based on PAQ1 (basically reading Matt's report and writing a program that did what I understood). CMM2 and 3 use a different architecture. If you would compare it to PAQ, it would be "most" similar to PAQ7.

A description of my weighting scheme:

Let w_i be the weight of the mixer's i-th input and p_i the i-th model's prediction. p is the mixed prediction:

s = sum from i=1 to n of w_i
p = (1/s) * (sum from i=1 to n of w_i * p_i)

Adjustment is one step in weight space reducing the coding cost (ld 1/p_x), where x is the coded bit:

w_i = w_i + alpha * (p - p_i) / ((x - p) * ws)

The second term is truncated if it exceeds a certain range. I only combine the 4 most predictive models, and if I've got a match of at least 8 characters, p_0 is set to 1 (this is the odd thing).
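As a sketch, the scheme above could look like this (all names are illustrative, not CMM's actual code; the update transcribes the posted formula literally, including the truncation of the second term, so sign conventions may differ from other gradient-descent mixers):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Minimal sketch of the mixer described above (hypothetical names).
// p_i are model predictions in (0,1); w_i are their weights.
struct Mixer {
    std::vector<double> w;
    explicit Mixer(size_t n) : w(n, 1.0) {}

    // p = (1/s) * sum w_i * p_i, with s = sum w_i
    double mix(const std::vector<double>& p) const {
        double s = 0.0, num = 0.0;
        for (size_t i = 0; i < w.size(); ++i) { s += w[i]; num += w[i] * p[i]; }
        return num / s;
    }

    // One step in weight space: w_i += alpha * (p - p_i) / ((x - p) * s),
    // with the second term clipped to a fixed range ("truncated").
    void update(const std::vector<double>& pr, double p, int x, double alpha) {
        double s = 0.0;
        for (double wi : w) s += wi;
        for (size_t i = 0; i < w.size(); ++i) {
            double step = (p - pr[i]) / ((x - p) * s);
            step = std::max(-1.0, std::min(1.0, step));  // truncation
            w[i] += alpha * step;
        }
    }
};
```

With equal weights the mix is just the average of the inputs, which matches the formula above for w_i = 1.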

BTW: I plan to make everything open source when I release CMM3.

One could try to take PAQ7 and replace the match model's input with a static one as I said above. Maybe something weird will happen. I don't know. If I've got some spare time during the next week, I'll try it.

4. You didn't explain what kind of probabilities/weights these are,
so again I don't understand anything.

So the only thing I can say is that SSE is much more useful for
merging the predictions than mixing.

5. Same mistake

I do order-n modelling, where n is 0...4, 6 (I tried 0-5, but 0-4,6 works better). A mixer has 4 inputs here, where the topmost 4 models are selected, e.g. having seen a byte at the order-6 context I use orders 6432. Probabilities are scaled down to 15 bits and weights have 14 bits of precision. Weights are selected by a small context (coding order + 6 of the previous character's bits). A probability estimate is created by mapping an order-n context's bit history (coded by a state machine, similar to paq, but the design behind this is a bit different) to a bitmodel. A bitmodel is updated by k * prediction error; these probabilities are used as input. After mixing there's another SSE stage: the probability is adjusted by an SSE map selected depending upon bit position, a measure of previous prediction errors, and the match length fitted into a few bits.
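The "bitmodel updated by k * prediction error" part can be sketched like this (a minimal illustration with made-up constants, not CMM's actual code; the shift-based update approximates p += k * (bit - p) with k = 2^-rate, and the 15-bit scale follows the description above):

```cpp
#include <cassert>

const int PBITS = 15;                 // probabilities scaled to 15 bits
const int PMAX  = 1 << PBITS;

struct BitModel {
    int p = PMAX / 2;                 // P(bit==1), initially 0.5
    // update by k * prediction error, k = 2^-rate (rate is illustrative)
    void update(int bit, int rate = 5) {
        int target = bit ? PMAX : 0;
        p += (target - p) >> rate;
    }
};
```

Repeated updates with the same bit drive p geometrically toward that bit's end of the scale, which is the usual behaviour of this counter type.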

My "problem" is that modelling the match length and using this as an input for the mixer works worse than simply adding a static probability of one. E.g. at order 6, instead of using orders 6432 as inputs I use 643 + (match probability). Setting the match probability to 1 works better.

Any suggestions?

How would you use SSE for mixing?

6. > I do order-n modelling, where n is 0...4, 6 (i tried 0-5, but 0-4,6 works
> better).

That's ok, though 0-6 should be even better if mixing is right.

> A mixer has 4 inputs here, where the topmost 4 models are selected,
> e.g. seen a byte at the order 6 context i use orders 6432.

Hope you don't use the same weights for different orders...

> A probability estimation is created by mapping an order-n context's bit
> history (coded by a state machine, similar to paq, but the design behind this is
> a bit different) to a bitmodel. A bitmodel is updated by k * prediction error;
> these probabilities are used for input.

Wonder if you checked these estimations without a secondary model...
it's not optimal, but usually more stable when counters are good.

> After mixing there's another SSE stage,

Afaik "after mixing" is wrong idea as SSE obviously doesn't like
distribution changes.

> My "problem" is that modelling the match length and using this as an input for
> the mixer input works worse than simply adding a static probability of one,
> e.g. order 6, instead of using orders 6432 as inputs i use 643,(match
> probability). Setting match probability to 1 works better.
> Any suggestions?

You still didn't explain what you call "modelling the match length and using this as input".
"This" = what? Long match probability? And you're mixing it with bit==1 probability?

> How would you use SSE for mixing?

Well, ppmonstr simply uses things like (p_submodel>0. in SSE context.
Further generalization of that is multidimensional (eg. 2D) interpolated SSE,
which is what I normally use instead of simple mixing, as it shows significantly
better results even comparing to my mixer, which is in turn better than whatever

7. Originally Posted by Shelwien
That's ok, though 0-6 should be even better if mixing is right.
I want to keep the number of models low to keep memory usage in an "acceptable" range.

Originally Posted by Shelwien
Hope you don't use the same weights for different orders...
Originally Posted by toffer
Weights are selected by a small context (coding order + 6 of the previous character's bits).

Originally Posted by Shelwien
Wonder if you checked these estimations without secondary model...
its not optimal, but usually more stable when counters are good.
These predictions yield worse results without SSE. Compression is hit by about .04 bpc on my test set.

Originally Posted by Shelwien
Afaik "after mixing" is wrong idea as SSE obviously doesn't like
distribution changes.
I use several SSE functions selected by a context, so there shouldn't be a large hit if the context separates distinct distributions well enough.

Originally Posted by Shelwien
You still didn't explain what you call "modelling the match length and using this as input"
"This" = what? Long match probability? And you're mixing it with bit==1 probability?
I tried modelling the probability for a match as follows: L is the current match length. I modelled P(L+1 | L) and P(L+1 | order x context), where x is 0-4 (tried all). I also tried mixing several of these as a combination.
When I found a match of length 8, I discard the lowest order's input p_0 and replace it with p_0 = next bit==1 ? P(L+1 | L) : 1-P(L+1 | L). As I said, I tried several different variants for input. But setting p_0 = next bit==1 ? 1 : 0 gives better results.

Multidimensional SSE for mixing? You mean using each model's prediction as an input? I'm unsure what you actually mean here.

I use gradient descent, since it can be implemented using vector math, which is fast!

8. >> Hope you don't use the same weights for different orders...
>
> Weights are selected by a small context (*coding order* + 6 of the previous
> character's bits).

I just tried to ensure that it's order and not context rank (like in "topmost")
or something like that.

>> Wonder if you checked these estimations without secondary model...
>> its not optimal, but usually more stable when counters are good.
>
> These predictions yield worse result without SSE. Compression is hit by about
> .04 bpc on my testset.

I meant that if you're not using the whole-model bruteforce parameter optimization,
then its better to ensure that submodels are optimal for their specific cases.

>> Afaik "after mixing" is wrong idea as SSE obviously doesn't like
>> distribution changes.
>
> I use several SSE functions selected by a context, so there shouldn't be a large
> hit, if the context separates distinct distributions well enough.

That depends on what you call "large", but it's not a matter of context.
It's a fact that SSE over a counter with static update works better than
over a counter with dynamic update, and functions with a more complex
relation to the original data (like a mix or switch or another SSE) are
even less efficient, so that SSE could even add redundancy.
In other words, SSE adapts relatively slowly and so its results are rather bad
when the nature of the index probability changes frequently.

>> You still didn't explain what do you call "modelling the match length and using
>> this as input"
>> "This" = what? Long match probability? And you're mixing it with bit==1 probability?
>
> I tried modelling the probability for a match as follows: L is the current match
> length. I modelled P(L+1 | L) and P(L+1 | order x context), where x is 0-4
> (tried all). I also tried mixing several of these as a combination.
> When i found a match of the length 8, i discard the lowest order's input p_0 and
> replace it with p_0 = next bit==1 ? P(L+1 | L) : 1-P(L+1 | L). As i said i
> tried several different variants for input. But setting p_0 = next bit==1 ? 1 : 0
> gives better results.

I'm still not sure what relation this has to a bit probability...
Is that "next bit" a last bit of match?
If it is, that might have some sense in principle, so then you simply made
a bad model there. Consider testing your compressor with match length increase
as the only predictor and make sure that it shows at least some compression.

Somehow it seems like you didn't notice that you can separately tune the model
parts... basically the idea is that if you have model A, then you can optimize
its parameters to one dataset (say, stationary) and call the result B, and
also make C the same way using other dataset (say, nonstationary), and then
mix(B,C) using a good mixer would "always" (of course there'd be exceptions)
perform better than simple A.
Then, it's also always better to optimize all parameters of mix(B,C) at once,
but that takes much more resources too.
Well, what I'm trying to say is that it's a good strategy to ensure that
every building block in your model performs optimally in its place - it
kinda guarantees that they'd show better results when combined.

> Multidimensional SSE for mixing? You mean using each model's prediction as an
> input? I'm unsure here, what you actually mean.

SSE is probability mapping.
So multidimensional SSE is similar mapping of several probabilities into one.
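A one-dimensional version of such a mapping with linear interpolation might be sketched like this (purely illustrative, not ash's or paq's actual code; QUANTS and the update rate are made-up constants). The input probability selects two neighbouring quantization vertices; the output is interpolated between them and both vertices are nudged toward the observed bit:

```cpp
#include <cassert>
#include <cmath>

const int QUANTS = 6;           // vertices per probability axis (illustrative)

struct SSE {
    double v[QUANTS];           // vertex values, initialized as identity mapping
    SSE() { for (int i = 0; i < QUANTS; ++i) v[i] = double(i) / (QUANTS - 1); }

    // map an input probability through the interpolated vertex table
    double map(double p) const {
        double pos = p * (QUANTS - 1);
        int i = int(pos); if (i >= QUANTS - 1) i = QUANTS - 2;
        double f = pos - i;
        return v[i] * (1 - f) + v[i + 1] * f;   // linear interpolation
    }

    // nudge the two active vertices toward the observed bit
    void update(double p, int bit, double rate = 0.02) {
        double pos = p * (QUANTS - 1);
        int i = int(pos); if (i >= QUANTS - 1) i = QUANTS - 2;
        double f = pos - i;
        v[i]     += rate * (1 - f) * (bit - v[i]);
        v[i + 1] += rate * f       * (bit - v[i + 1]);
    }
};
```

A 2D variant would map two probabilities via bilinear interpolation over a QUANTS x QUANTS grid, which is consistent with the "6x6=36 if its SSE2" remark later in the thread. Initializing as an identity mapping means an untrained SSE passes the input probability through unchanged.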

> I use gradient descent, since it can be implemented using vector math,
> which is fast!

Well, I already said multiple times that normally you get necessary functionality
first (compression level here), then apply the speed optimization techniques.
Otherwise you simply won't know which redundancy you trade for which speed gain.

9. If you've got a match, say ...ab|0001011 (in the history buffer) when coding ...ab|0001101, where ab is the octet context and 0001011, 0001101 are the following bits in the history buffer / look-ahead buffer respectively (in terms of an LZ-like implementation). If I'm currently coding bit , then the "next bit", the predicted one, is 1.
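The idea above, a match supplying a predicted bit whose probability grows with the match length, could be sketched like this (illustrative only, not CMM's code; the confidence formula is a made-up example of length-dependent scaling, which is what the explicit length modelling tried to learn):

```cpp
#include <cassert>

// After a byte-context match of length matchLen, the "next bit" is read
// from the matched position in the history buffer; the probability given
// to it grows with the match length, saturating for long matches.
double matchPrediction(int predictedBit, int matchLen) {
    // longer match -> prediction closer to 0 or 1 (formula is illustrative)
    double conf = 1.0 - 1.0 / (2.0 * matchLen + 2.0);   // in (0.5, 1)
    return predictedBit ? conf : 1.0 - conf;
}
```

The "odd thing" in the thread is that replacing this length-dependent probability with the constant 1 (maximal confidence once the match reaches length 8) compressed better.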

Using vector math is a design constraint (at least for me), since it can be implemented using SIMD stuff like MMX. This will always be faster than the multiple instructions which other mixing schemes require.

I know about viewing models separately, since this is the real "niceness" of context mixing. It combines the ability to have differently adapting models with a separation of the contextual alphabet and the coding alphabet (per model).

I wonder if you tried multidimensional SSE. Even using just 5 bits per input would require 2^20 SSE functions (in my case), which is quite a mess. Maybe this can be cascaded to reduce memory requirements and improve adaptation speed (this might somehow conflict with your argument about the SSE's input distributions).

10. > Using vector math is a design constraint (at least for me), since it can be
> implemented using simd stuff like mmx. This will always be faster than using
> multiple instructions which appear using other mixing schemes.

Well, you know, I suspect that my mixer could be called at every 4th
update and result still would be better... and faster.

> I wonder if you tried multidimensional SSE.

I didn't really need anything more than 2D, but that is quite common in
my models.

> Even using just 5 bits per input
> would require 2^20 SSE functions (in my case), which is quite a mess. Maybe this
> can be cascaded to reduce memory requirements and improve adaptation speed (this
> might somehow conflict with your argument about the SSE's input distributions)

I use SSE with linear interpolation and never needed more than 6 quants
per probability... 6x6=36 if it's SSE2 (2D).

11. Originally Posted by Shelwien
Well, you know, I suspect that my mixer could be called at every 4th
update and result still would be better... and faster.
This is speculative.

I might try other things once I'm confident with my present results.

My SSE does linear interpolation too, but I need at least 16 vertices to get good results, even with quantization. Any details about this? Maybe your SSE context is larger than mine (I've got 3*3*5 = 45 SSE functions), moving information from the actual SSE function into the context? Or do you restrict the probability range you apply SSE to, e.g. 0..0.3 and 0.7...1.0?

One thing I'll try is to select SSE vertices per context, e.g. the "middle" range where p=.4...0.5 under an order-2 context; the closer p approaches 0 or 1, the higher the context will be for p. Don't know how well this will do.

Concerning your considerations I might swap the SSE and mixing stages or somehow merge them. As I said in a post before, I've got 8 unused bytes per nibble model.

12. >> Well, you know, I suspect that my mixer could be called at every 4th
>> update and result still would be better... and faster.
>
> This is speculative.

Yes, but I meant it as a general example of how faster things are not
always faster.

> My SSE does linear interpolation too, but i need at least 16 vertices to get
> good results, even with quantization. Maybe your SSE context is larger than
> mine (i've got 3*3*5 = 45 SSE functions), moving information from the actual
> SSE function into the context? Or do you restrict the probability range to
> apply SSE to e.g. 0..0.3 and 0.7...1.0?

No range limits.
And I guess that my SSE contexts are larger than that, but even a single
mapping without context usually improves compression, just not that much.
The difference is in update probably...

You can look at my SSE in ash sources:
http://compression.ru/sh/ash04.rar /src/sse_data.inc
Its update is based on the same idea as that mixer.
And you can see its effect by comparing results with /s0 and /s1 - that
option sets a rank limit for SSE application, so with /s0 it isn't used
at all and with /s1 it's only applied to the last symbol encountered in the
maxorder context (which has rank 0).

Also it's really pitiful, but I haven't been able to improve that function since then.
Of course it now looks totally different, but mathematically is the same.

> One thing i'll try is to select SSE vertices per context, e.g. the "middle"
> range where p=.4...0.5 under an order 2 context, the closer p approaches 0 or 1,
> the higher the context will be for p. Don't know how well this will do.

I doubt that it'd be compatible with interpolation...
might work for normal SSE though.

> Concerning your considerations i might swap the SSE and mixing stage or somehow
> merge them. As i said in a post before, i've got 8 unused bytes per nibble model.

Well, I forgot already, but seems that ash did apply SSE to mixed probabilities too.
So I guess it depends on how dynamic your mixing is.

13. I only had a short look at the sources during breakfast. You are using floating-point math? Could you explain how you form the SSE context?

struct {
uint f1:3; // char order function
uint f2:2; // (c>64) combination
uint f3:1; // [Ord] LES check
uint f4:1; // [Ord-1] LES check
uint f5:3; // [Ord-2+] LES check comb.
uint f6:1; // sse_prevOk.b0
uint f7:1; // [Ord] determination check
};

btw: Why do all experts write horrible code (giving variables single-letter names instead of "significant" names like codingOrder or something...)?

14. > You are using floating point math?

Was at that time.
It really simplifies the research.
But completely incompatible even with different compiler options.

> Could you explain how you form the SSE context?

I thought we were talking about SSE itself...
And also ash has a byte-oriented model and keeps counters in a
suffix tree, so I doubt you'd find something useful.

> struct {
> uint f1:3; // char order function
> uint f2:2; // (c>64) combination
> uint f3:1; // [Ord] LES check
> uint f4:1; // [Ord-1] LES check
> uint f5:3; // [Ord-2+] LES check comb.
> uint f6:1; // sse_prevOk.b0
> uint f7:1; // [Ord] determination check
> };

"LES" is the last encoded symbol.
And "sse_prevOk" is a bit history in the SSE context.

> btw: Why do all experts write horrible code (giving variable names a single
> letter and not "significant" names like codingOrder or something...)

1. Guess you don't know how really horrible code looks
I think that my coding style is actually relatively clean.
2. Ash wasn't designed to teach somebody with its source or anything.
It was simply a techniques test after successful reverse-engineering of ppmonstr.

15. Originally Posted by Shelwien
I thought we were talking about SSE itself...
Indeed, but I was interested in this "LES"; now I know what it is. I could include some measure of predictability (like rank) in the SSE context.

Floating-point math and bytewise modelling surprised me - since it *may* be slower, and I thought context mixing is done best using a binary alphabet, since you only need to combine several models' predictions for one symbol.

Coding style - maybe I'm used to too much eye candy

BTW: the initial reason for my post still seems unresolved. I tried coding using only P(len+1 | len) and I can say that it shows compression, with good results for longer lengths (which should be obvious). This seems even more odd, since a static mixing input of p=1 shows significantly better results. I'm still confused...

Any other points for discussion / especially about SSE?

Am I insane or did you post a link to some ppmonstr stuff?!

16. >> I thought we were talking about SSE itself...
>
> Indeed, but i was interested in this "les", now i know what it is. I could
> include some measure of predictibility (like rank) in the SSE context.

Not sure what you're talking about again.
A measure of predictability is code length aka likelihood.

> Floating point math and bytewise modelling suprised me - since it *may* be
> slower

It's not necessarily slower, and there is a whole FP-only vector instruction set.
And model logic is much clearer in FP, so it's really good for research.
The only drawback is that one is basically unable to compile encoder and decoder
separately, as they won't be compatible due to different compiler optimizations.

> and i thought context mixing is done best using a binary alphabet,

I'm sure that's wrong - at least for text.
You cannot simulate a whole-alphabet update with binary or unary updates,
and it's noticeably better for text, though a word alphabet would be even better.

> since you only need to combine several model's prediction for one symbol.

That stays the same everywhere

> Coding style - maybe i'm used to too much eye candy

Dunno. What _I_ really hate in that sense is STL and other complex libraries,
with which you'd never know what a given function really does.

> BTW: the initial reason for my post still seems to be unresolved. I tried coding
> using only P( len+1 | len ) and i can say that it shows compression, good
> results for longer lengths (which should be obvious). This seems to be even more
> odd since a static mixing input of p=1 shows significantly better results. I'm
> still confused...

1. Check if there's a significant amount of cases where match-length prediction
is better than others in the mix.
2. If there's at least 10% of such cases, there may be some error
(eg. you're trying to use probability of 0 for 1), or else you have to blame your mixer.

> Any other points for discussion

Well, I'd like to discuss many things... eg. what about entropy-coded contexts?
That is, how contexts can have internal correlations and uneven distributions,
and how to avoid that.

SSE is magical stuff, so I can discuss it forever
Eg. it works due to an index duality, similar to the wave-particle one in physics.
That is, the index (probability) you feed into SSE is a relative frequency of
event occurrences, and a context (history hash) at the same time.
And SSE uses the first interpretation for merging the statistics of similar
contexts and for general stabilization (initialization as a direct mapping guarantees
that the secondary estimation is never significantly worse than the index probability),
and still easily adapts to the cases where the context interpretation is preferred.

> Am i insane or did you post a link to some ppmonstr stuff?!

Yes, ppmonstr vIr1, I decompiled it in 2002.
Had to, basically, as nobody understood how to use SSE then,
except for its inventor, and the availability of some ambiguous
papers didn't help, and there was no possibility of any competition without it.
Well, I wasn't even sure that it was a matter of SSE, so I had to decompile the
whole model. And it was worth it, because it's apparently hard to e.g. think
of how to merge models using SSE by yourself... and other things.
Incidentally, this decompilation project was the reason for SSE's appearance in paq too.

17. I think SSE is nothing that complicated, like lots of other stuff. The most "complicated" thing in my CM implementation was to find a good state machine to represent bit histories.
But the difficult thing is to get the idea - this can be found everywhere.

Originally Posted by Shelwien
Not sure what are you talking about again.
Measure of predictability is code length aka likelihood.
I meant the symbol rank (only rank 0, the last symbol seen, is allowed) as a measure of predictability. This is good especially in higher order contexts

Originally Posted by Shelwien
I'm sure that's wrong - at least for text.
You cannot simulate a whole alphabet update with binary or unary updates,
and it's noticeably better for text, though a word alphabet would be even better.
For your arithmetic coder you need two parameters specifying the symbol's position in code space and its code space width, so either [low, high) or [low, low+range). How do you combine the models' predictions? You can't simply join several models' 'low', since the number of symbols (and the relation to (not) missing code space!) can/will be different in some situations.

Originally Posted by Shelwien
Dunno. What _I_ really hate in that sense is STL and other complex libraries,
with which you'd never know what a given function really does.
That's a question of bugs; I sometimes observe that I tend to reinvent the wheel - having in mind the thing you just said, you don't know if it's a good thing / if it works (not limited to STL; btw STL is fairly "bug free", I never encountered an STL bug). The dumb thing is that this reinvention raises the possibility of bugs - a kind of vicious circle.

Originally Posted by Shelwien
Well, I'd like to discuss many things... eg. what about entropy-coded contexts?
That is, how contexts can have internal correlations and uneven distributions,
and how to avoid that.
You mean the thing of using an encoder output as a context? Give me some more

Originally Posted by Shelwien
SSE is magical stuff, so I can discuss it forever
Eg. it works due to a index duality, similar to the wave-particle one in physics.
That is, the index (probability) you feed into SSE is a relative frequency of
event occurences, and a context (history hash) at the same time.
And SSE uses the first interpretation for merging the statistics of similar
contexts and general stabilization (initialization as a direct mapping guarantees
that secondary estimation is never significantly worse than index probability),
and still easily adapts to the cases where context interpretation is preferred.
The bad thing about this is that you don't know what things actually do - you can interpret it, yes, but how can two totally different things explain a phenomenon from two different viewpoints? I think the photon-wave dualism is a more extreme example.

About PPMonstr: the link disappeared?! I wanted to have a look after work, but there was nothing...

18. > I think SSE is nothing that complicated, like lots of other stuff.

1. It's naive to think that it's possible to design a simple-and-clean
universal compression model with state-of-the-art performance.
A good model just has to be complicated these days.

2. We're talking about different kinds of SSE here.
Eg. Matt's mappings are closer to the original SSE in ppmonstr
despite the interpolation. And I was talking about my SSE,
like the one used in ash, which is significantly more efficient.

> The most "complicated" thing in my CM implementation was to find
> a good state machine to represent bit histories.

And from my point of view that's not really complicated, as
it's basically a matter of bruteforce.

>> Not sure what are you talking about again.
>> Measure of predictability is code length aka likelihood.
>
> I meant the symbol rank (only rank0, the last symbol seen is allowed) as a
> measure of predictability. This is good especially in higher order contexts

1. Terminology problem again - "rank" when I use it, is a derivative of some model,
and not necessarily MtF. You can rank/sort symbols by whatever metric and
it still would be "symbol ranking". So I'd call the rank you probably meant
there the "MtF rank". (* MtF = move-to-front).

2. MtF rank is not a measure of predictability anyway, just a good context.
A higher MtF rank only means that there's a high probability (unknown, too)
that the probability of the given symbol is greater than those of lower-ranked symbols -
what kind of measure is that?

>> I'm sure that's wrong - at least for text.
>> You cannot simulate a whole alphabet update with binary or unary updates,
>> and its noticeably better for text, though word alphabet would be even better.
>
> For your arithmetic coder you need two parameters specifying the symbol's
> position in code space and its code space width, so either [low, high) or [low,
> low+range).

That's kinda common for any arithmetic coders.

> How do you combine the model's predictions? You can't simply join
> several model's 'low', since the number of symbols (and the relation to (not)
> missing code space!) can/will be different in some situations.

1. You're too biased or something. Guess you have to investigate some
prev generation CM compressors. For example, look at
http://compression.ru/sh/ppmy_3c.rar /model.inc

2. Arithmetic coders don't necessarily have to be binary.

3. When nonlinear transformations are used, like SSE, it's best (but not the only way)
to apply some binary decomposition - like unary or binary or Huffman or whatever.

4. When counter updates are nonlinear too, there's no more sense in
trying to keep the probabilities within one distribution, but with unary coding
each symbol still can be associated with a single counter, instead of 8 like in
binary code.

5. My point was that a linear update of the whole alphabet is good for modelling texts,
and it's really troublesome to simulate it in a binary decomposition.
So 1-4 don't really matter.

>> Dunno. What _I_ really hate in that sense is STL and other complex libraries,
>> with which you'd never know what a given function really does.
>
> btw stl is fairly "bug free", i never encountered a STL bug

I wasn't talking about bugs (I just recently reported another bug in IntelC btw,
so there are probably some in STL too). Guess you don't understand why recent
paq8 versions contain a tiny STL replacement.

>> Well, I'd like to discuss many things... eg. what about entropy-coded contexts?
>> That is, how contexts can have internal correlations and uneven distributions,
>> and how to avoid that.
>
> You mean the thing of using an encoder output as a context?

Not exactly, though that was the origin of SEE/SSE - C.Bloom had something like
that in ppmz, if I'm not mistaken.

I was talking about interval coding of contexts. As a simple example, consider
an order-2 model with a counter array like o2[CNUM][CNUM]. Obviously it would
have unused space and some locations more heavily used than others.
So it would be more efficient if you'd add a static model (static, or you'd have
to deal with continuous statistics remapping) and interval encoding of context
symbols into the [0..CNUM*CNUM) interval. Of course, such a context index model
doesn't necessarily have to be similar to the main model, as it might have a completely
different metric - eg. cache optimization by getting more hits or something.

Then, interval coding has some additional benefits:
1. Interpolation over normal counter arrays.
2. Variable-depth contexts
3. Good compatibility with models built using interval math

> Give me some more

Well, if you don't like that, then what about parametric coding?
That is, a further generalization of the idea of storing the probabilities
on encoding (instead of immediately generating the code) and directly
copying the data if the model failed to compress it, or just encoding it
faster, as a simple encoding loop allows that.
Like that its implemented in http://shelwien.googlepages.com/order_test4.rar
And the next idea is to divide the stored probability into even more parts.
For a simple example, if you have two statically mixed submodels
: p = p1*(1-w) + p2*w
then instead of storing values of p, you can store both p1 and p2, and
the optimal value of w for the block. So we get a long-awaited asymmetric CM coder
with a faster and simpler decoder.
And of course it's tempting to apply that to all non-recurrent transformations
(and maybe to somewhat unroll the recurrent ones), so ideally you'd just store
the used counter values and optimize all the parameters at the end of the block.
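The two-submodel case above could be sketched as follows (purely illustrative, not an existing coder): record both submodel predictions per bit during encoding, then search for the w that minimizes the block's total code length of p = p1*(1-w) + p2*w, and transmit that w with the block.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One recorded coding event: both submodel predictions and the actual bit.
struct Obs { double p1, p2; int bit; };

// total code length (in bits) of the block under mixing weight w
double codeLength(const std::vector<Obs>& obs, double w) {
    double bits = 0;
    for (const Obs& o : obs) {
        double p  = o.p1 * (1 - w) + o.p2 * w;  // mixed P(bit==1)
        double px = o.bit ? p : 1 - p;
        bits += -std::log2(px);
    }
    return bits;
}

// pick the block's optimal w by simple grid search (illustrative)
double optimalWeight(const std::vector<Obs>& obs, int steps = 64) {
    double best = 0, bestLen = codeLength(obs, 0);
    for (int i = 1; i <= steps; ++i) {
        double w = double(i) / steps;
        double len = codeLength(obs, w);
        if (len < bestLen) { bestLen = len; best = w; }
    }
    return best;
}
```

The decoder then only evaluates the fixed mix with the transmitted w, which is where the "faster and simpler decoder" of the asymmetric scheme comes from.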

>> SSE is magical stuff, so I can discuss it forever
>> Eg. it works due to a index duality, similar to the wave-particle one in physics.
>> That is, the index (probability) you feed into SSE is a relative frequency of
>> event occurences, and a context (history hash) at the same time.
>> And SSE uses the first interpretation for merging the statistics of similar
>> contexts and general stabilization (initialization as a direct mapping guarantees
>> that secondary estimation is never significantly worse than index probability),
>> and still easily adapts to the cases where context interpretation is preferred.
>

Well, here again, I was talking about my own SSE, which I've been trying and failing
to improve for the last 5 years, so believe me, I know what I'm talking about.
There are ugly things like SSE coefficients getting out of the [0..1] interval, and
additional coefficients for dynamic update speed, and a complicated update formula
in the integer version, but it gets worse if I try to change anything - and
my conclusion after all these years of analysis is quoted above.

> you can interpret it, yes, but how can two totally different things explain a
> phenomenon from two different viewpoints.
> I think the photon-wave-dualism is an extremer example.

And I think that _similarity_ is extreme, considering statistical nature of
quantum mechanics.

As to "how" - it's simple. These two interpretations _do_ simultaneously apply
to the same SSE index probability, right? A counter is supposed to contain a
relative frequency by definition, and its value has a high correlation with the
last bits in the context history, right?
Then, the data in different context instances is supposed to have different
properties (or there'd be no sense in using this context at all). So there're
context instances without significant correlations in the history, and a simple
linear mapping (even static) is good for them as a compensation for
prediction imprecision. And there're some which do benefit from further
context modelling, where the SSE coefficients are used just like another counter array.
Now, the problem is that this subdivision is not at all stable - the same
context can contain completely random sequences and runs of the same value.
And that's why I still cannot replace SSE with something else - the damned thing
works well enough in both extreme cases and allows for smooth transitions.
But still it just has to be replaced with an interval-coded context mix or something,
as it's not perfect in any specific case - just that nothing better is available

> About PPMonstr: The link disappeared?! I wanted to have a look after work, but
> there was nothing....

I can't keep it up anywhere, or somebody would get angry... there was a case already.
Write me a mail to shelwien at gmail.com or something.

19. PAQ holds records for compression ratio and its models aren't complicated. Just a chain of mappings from contexts to probabilities. I don't say it's a simple piece of software altogether, though.
In your answer you removed the most important thing - *But the difficult thing is to get the idea*

Originally Posted by Shelwien
And from my point of view that's not really complicated, as
it's basically a matter of brute force.
Right.
BTW: somewhere I read about language models which used symbolic AI stuff to distinguish word types and semantic relations. But I haven't seen any practical implementation of something like this.

Can you clarify the difference between these SSE variants? I thought they basically do the same thing, you just use fewer bins and a different representation of the vertices (apart from implementation details like fp math).

Originally Posted by Shelwien
1. Terminology problem again - "rank", when I use it, is a derivative of some model,
and not necessarily MtF. You can rank/sort symbols by whatever metric and
it would still be "symbol ranking". So I'd call the rank you probably meant
there the "MtF rank" (MtF = move-to-front).

2. MtF rank is not a measure of predictability anyway, just a good context.
A higher MtF rank only means that there's a high probability (unknown, too)
that the probability of the given symbol is greater than those of lower-ranked symbols -
what kind of measure is that?
I know about that and I was talking about the MtF rank.
Why shouldn't it be a measure of predictability? As you said, a higher rank implies a higher probability (don't get me wrong - in practical implementations the highest rank usually is the lowest queue position, 0). I did some experiments, and I can say that symbol ranking (MtF under some context) outputs around 70% zero ranks on "normal", compressible data (I don't know how to express it differently; of course this doesn't apply to random data, and it doesn't say what "normal" data is) - Peter Fenwick mentioned this in a paper about symbol ranking, too.
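The symbol-ranking experiment mentioned here can be reproduced with a small sketch (hypothetical helper; MtF queue kept per order-1 context, first occurrences ranked by queue length):

```python
from collections import defaultdict

def mtf_ranks(data):
    # one move-to-front queue per preceding byte (order-1 context)
    queues = defaultdict(list)
    prev = 0
    ranks = []
    for c in data:
        q = queues[prev]
        if c in q:
            r = q.index(c)
            q.pop(r)
        else:
            r = len(q)          # unseen symbol: rank = current queue length
        q.insert(0, c)          # move the symbol to the front
        ranks.append(r)
        prev = c
    return ranks

# on highly redundant input, nearly all ranks come out as 0
ranks = mtf_ranks(b"abababababababab")
zero_frac = ranks.count(0) / len(ranks)
```

The ~70% rank-0 figure quoted for "normal" data obviously depends on the file; this toy input is an extreme case where every rank is 0.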

Originally Posted by Shelwien
1. You're too biased or something.
This may be true.

Originally Posted by Shelwien
2. Arithmetic coders don't necessarily have to be binary.
I never said that.

3.-4.

I meant something else: as I said, you need something like [low, low+range) for every coded symbol. The positions in code space can be arranged in any way, so "low" depends on how you build your model/combine statistics. The code space "range" is the important thing. So if I combine the predictions of N models, I need to take a weighted average of "range" for every symbol to obtain the weighted "range" and the position in code space, "low", since it depends on the other symbols preceding the current one. So you need to weight at least all symbols which precede ("whose position in code space is lower than") the coded one. Since the decoder doesn't know which symbol will be the next one, you don't know which symbols precede the coded one. So you have to combine all symbols' ranges to obtain high/low.

I think that combining up to (average symbols per model)*(number of models) probabilities for coding one symbol is more computation-intensive than combining (number of models)*8 per symbol coding step.
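The point can be sketched in code (toy example, names hypothetical): to code one symbol with byte-wise mixing, every symbol's probability has to be averaged across the models, because "low" is the sum of the ranges of all preceding symbols, which the decoder cannot know in advance.

```python
def mix_and_intervals(models, weights):
    # models: list of per-symbol probability lists; weights: one per model
    total_w = sum(weights)
    alphabet = range(len(models[0]))
    mixed = [sum(w * m[s] for m, w in zip(models, weights)) / total_w
             for s in alphabet]
    # cumulative intervals: symbol s gets [low, low + range)
    intervals, low = [], 0.0
    for p in mixed:
        intervals.append((low, p))
        low += p
    return intervals

# two toy order-0 models over a 4-symbol alphabet, equal weights
m1 = [0.7, 0.1, 0.1, 0.1]
m2 = [0.4, 0.4, 0.1, 0.1]
iv = mix_and_intervals([m1, m2], weights=[1.0, 1.0])
# coding symbol 1 needs low = mixed p(0) and range = mixed p(1)
```

Every entry of `mixed` costs one multiply-add per model, which is the (symbols per model)*(number of models) work being compared against 8 binary mixing steps.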

Originally Posted by Shelwien
I wasn't talking about bugs (just recently reported another bug in IntelC btw,
so there're probably some in STL too). Guess you don't understand why recent
paq8 versions contain a tiny STL replacement.
Here again, you snipped off the important part: don't do things twice if you don't need to. The only thing which I would interpret as a "replacement" in PAQ is the memory allocation which guarantees cache line alignment...

I'll answer the rest later, this took quite a lot of time.

20. > PAQ holds records for compression ratio and its models aren't complicated.
> Just a chain of mappings from contexts to probabilities.
> I don't say it's a simple piece of software altogether, though.

PAQ is actually the best argument for _my_ point.
It's not only complex, it's also hand-tuned throughout.
If you don't believe me, try getting the same quality out of it on
data with randomly permuted byte values -
it'd be simpler to just write a new one.

Code:
```
book1_r  = bitrev(book1)
wcc386_r = bitrev(wcc386)

                   book1   book1_r  wcc386  wcc386_r  sum_diff
ash04a /s255       202932  204086   260758  261426     -0.39%
ppmonstr vJ -o128  203247  205607   236758  241486     -1.61%
paq8o6 -7          192227  202909   194062  219119     -9.25%
```

Here we can see the level of tuning to "standard" data -
some "c>65" flags in ash's SSE context, much more of that
in ppmonstr's contexts, and lots of stuff in paq.
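The byte-value permutation used for such a test might look like this - a guess at the `bitrev` transform above, taken to mean reversing the bit order inside every byte, which permutes byte values and so defeats any tuning to "standard" ranges like `c>65`:

```python
def bitrev_byte(b):
    # reverse the 8 bits of a single byte value
    r = 0
    for _ in range(8):
        r = (r << 1) | (b & 1)
        b >>= 1
    return r

def bitrev(data):
    # apply the per-byte bit reversal to a whole buffer via a lookup table
    table = bytes(bitrev_byte(b) for b in range(256))
    return bytes(table[b] for b in data)
```

Since bit reversal is an involution, applying it twice restores the original file, so compression results on the transformed data are directly comparable.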

> In your answer you removed the most important thing - *But the difficult thing
> is to get the idea*

Because I wasn't talking about that.

> BTW: somewhere i read about language models, which used symbolic AI stuff
> to distinguish word types and semantic relations. But i haven't seen any
> practical implementation of something like this.

There probably weren't any.
The problem is basically that there aren't many people able to write
a state-of-the-art CM compressor.
Basically, in that sense I'm confident only about Matt, D. Shkarin and myself.
And that doesn't mean that we're especially smart or something - it's just that
most programmers still think that compression = LZ + huffman and don't have
enough experience with CM. And it's clearly impossible to gain anything
out of semantic correlation with the likes of zlib. Even most of the people here
are working on LZ and have never tried to write a custom model for some specific
data type (preprocessors and paq modifications don't count).

> Can you clarify the difference between these SSE variants? I thought they
> basically do the same thing, you just use fewer bins and a different representation of the vertices

The main difference is the update, I guess. My update is similar to that mixer's,
just that the weight is fixed here and the SSE coefficients are updated instead.
But the idea is the same - to simulate an update of the mixed probability as if
it were a counter.
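That "update the mixed probability as if it were a counter" idea might be sketched like this (hypothetical interpolated-vertex SSE row; `rate` is a made-up constant): compute the counter-style target for the mixed value, then shift the two interpolation vertices so that interpolating again yields exactly that target.

```python
def sse_interp(row, p):
    # linear interpolation between the two vertices nearest to p
    x = p * (len(row) - 1)
    i = min(int(x), len(row) - 2)
    f = x - i
    return row[i] * (1 - f) + row[i + 1] * f

def indirect_update(row, p, bit, rate=0.05):
    x = p * (len(row) - 1)
    i = min(int(x), len(row) - 2)
    f = x - i
    cur = row[i] * (1 - f) + row[i + 1] * f
    target = cur + rate * (bit - cur)      # counter update of the mixed value
    # distribute the change over both vertices; the denominator makes the
    # re-interpolated value hit the target exactly
    delta = (target - cur) / ((1 - f) ** 2 + f ** 2)
    row[i] += delta * (1 - f)
    row[i + 1] += delta * f

row = [i / 32 for i in range(33)]   # identity-initialized row, 33 vertices
indirect_update(row, 0.7, 1)
```

The vertices move in proportion to their interpolation weights, so frequently-hit regions of the curve adapt while the rest of the row stays put.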

> (apart from implementation details like fp math).

That was 4 years ago. Now I use an integer implementation with 15-bit counters,
similar to my demo coders.

> Why shouldn't it be a measure of predictability?

Well... By definition of "measure"?
Probability is a measure of predictability, and so is likelihood, but a higher
rank doesn't _always_ mean better predictability, so I can't call it a "measure" -
it doesn't have a monotonic mapping to predictability.

> highest rank usually is the lowest queue position, 0). I did some experiments and
> I can say that symbol ranking (MtF under some context) outputs around 70%
> zero ranks (on "normal", compressible data,

That's normal, but it doesn't really mean anything, as a file recoded into a
sequence of ranks would compress worse than the original.

> I think that combining up to (average symbols per model)*(number of models)
> probabilities for coding one symbol is more computation-intensive than
> combining (number of models)*8 per symbol coding step.

Well, you're wrong about that, as it's actually faster than binary models for
redundant data. Of course we do have to compute the probabilities
of symbols with higher ranks (and it's done e.g. in ash), but the key is that
we only have to compute the probabilities up to the rank of the current symbol,
so the average amount is 1-2 such computations per symbol. And that's not
surprising if rank 0 is the right guess 70% of the time, so you don't have to
compute any further.

>> I wasn't talking about bugs (just recently reported another bug in IntelC btw,
>> so there're probably some in STL too). Guess you don't understand why recent
>> paq8 versions contain a tiny STL replacement.
>
> Here again, you snipped off the important part: Don't do things twice,
> if you don't need to.

I skip what I don't need

> The only thing which I would interpret as a "replacement" in PAQ
> is the memory allocation which guarantees cache line alignment...

I meant the Array and String templates, and the fact that paq used to
#include STL headers but doesn't now.

> I'll answer the rest later, this took quite a lot of time.

Well, I'll keep you busy

21. Originally Posted by Shelwien
I was talking about interval coding of contexts. As a simple example, consider
an order-2 model with a counter array like o2[CNUM][CNUM]. Obviously it would
have unused space, and some locations more heavily used than others.

Ok, this is clear to me. Could you explain this further:

Originally Posted by Shelwien
So it would be more efficient if youd add a static model (static, or youd have
to deal with continuous statistics remapping) and interval encoding of context
symbols into [0..CNUM*CNUM) interval. Of course, such a context index model
doesnt necessarily have to be similar to main model, as it might have a completely
different metric - eg. cache optimization by getting more hits or something.

Then, interval coding has some additional benefits:
1. Interpolation over normal counter arrays.
2. Variable-depth contexts
3. Good compatibility with models built using interval math

Originally Posted by Shelwien
Well, if you don't like that, then what about parametric coding?
That is, a further generalization of the idea of storing the probabilities
during encoding (instead of immediately generating the code) and directly
copying the data if the model failed to compress it, or just encoding it
faster, as the simple encoding loop allows that.
That's how it's implemented in http://shelwien.googlepages.com/order_test4.rar
And the next idea is to divide the stored probability into even more parts.
For a simple example, if you have two statically mixed submodels
: p = p1*(1-w) + p2*w
then instead of storing values of p, you can store both p1 and p2, and
the optimal value of w for the block. So we get a long-awaited asymmetric CM coder
with a faster and simpler decoder.
And of course it's tempting to apply that to all non-recurrent transformations
(and maybe to somewhat unroll the recurrent ones), so ideally you'd just store
the used counter values and optimize all the parameters at the end of the block.
We mentioned this some time ago. My idea here was to create a weight function based on a few parameters (fewer than the number of models) to allow simpler (and faster) coding cost optimization. You mentioned optimizing over a block of certain length and storing the weights/function parameters. I think even just calculating optimal weights/weighting function parameters will improve compression, since you find a global minimum (at least for the coded block), like Bloom's method for LZ parsing with adaptive literal/match coding. In my experience weights tend to slowly grow from lower to higher orders, which mirrors the models' adaptation, since lower-order models adapt faster and give better predictions in the beginning.
Splitting the weight calculation and the modelling could be beneficial in such a case. Without this there wouldn't be a great speed improvement, since the main bottleneck is the cache, and the coding loop mainly eats up CPU time. In general, separating memory-intensive tasks like LZ preprocessing/modelling should speed things up, since there are fewer cache-conflicting processes running at the same time. My improvement here was to join the memory of conflicting stuff (storing LZP pointers, CM models, etc. in an aligned cache chunk).
I think this point is really worth discussing & implementing!
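The blockwise-static variant from the quote (store both submodel predictions, pick one optimal w per block) can be sketched as follows - toy code with hypothetical names, scanning a small fixed grid of weights brute-force style:

```python
import math

def block_cost(preds, bits, w):
    # total coding cost of a block under a single static weight w
    cost = 0.0
    for (p1, p2), x in zip(preds, bits):
        p = p1 * (1 - w) + p2 * w          # mixed probability of bit == 1
        cost += -math.log2(p if x else 1 - p)
    return cost

def best_weight(preds, bits, grid=64):
    # brute-force scan over a fixed grid, staying strictly inside (0, 1)
    ws = [(k + 1) / (grid + 1) for k in range(grid)]
    return min(ws, key=lambda w: block_cost(preds, bits, w))

# toy block where the second submodel is clearly better
preds = [(0.5, 0.9)] * 20
bits = [1] * 20
w = best_weight(preds, bits)   # pushed toward 1
```

The encoder pays for the scan once per block; the decoder just reads w and runs the plain mixing loop, which is the asymmetric-CM appeal mentioned above.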

You got me wrong, I didn't say you don't know what you are talking about. You can interpret it. In physics this may apply even more. People don't know how things really are - but they can model them and explain the models. This doesn't mean that the models *must* mirror reality, even if they deliver good results. I think this is more of a philosophical question. To draw a relation to physics: the photoelectric effect can't be explained using the wave model, only using a photon model (disregarding quantum mechanics, which somehow joins these). Other effects like interference can't be explained using a photon model.

And now answering the more recent post

Originally Posted by Shelwien
Well... By definition of "measure"?
Probability is a measure of predictability, and so is likelihood, but a higher
rank doesn't _always_ mean better predictability, so I can't call it a "measure" -
it doesn't have a monotonic mapping to predictability.
This is the point. It doesn't always apply, but it does in most practical cases. Context-driven MTF gives fairly good compression with a simple algorithm. I don't associate any metric with "measure" here.

Originally Posted by Shelwien
Well, you're wrong about that, as it's actually faster than binary models for
redundant data. Of course we do have to compute the probabilities
of symbols with higher ranks (and it's done e.g. in ash), but the key is that
we only have to compute the probabilities up to the rank of the current symbol,
so the average amount is 1-2 such computations per symbol. And that's not
surprising if rank 0 is the right guess 70% of the time, so you don't have to
compute any further.

If you only mix the symbols present in the current (highest) coding order with lower ones, you still need an escape symbol. I thought this is more related to PPM than CM; Bloom called this "blending". Otherwise you'd have to mix all symbols, as I said, which is a larger amount of work.

22. > If you only mix the symbols present in the current (highest) coding
> order with lower ones, you still need an escape symbol. I thought this
> is more related to PPM than CM; Bloom called this "blending".
> Otherwise you'd have to mix all symbols, as I said, which is a larger
> amount of work.

There's no difference, you still have your "escape probability" in
binary CM, just in the implicit form of the sum of the probabilities of unseen symbols.
So, btw, the approach with direct modelling of escape probabilities is more
powerful, as you can simulate your binary CM with it, and go further.

As to byte vs bit modelling, my statement stays the same: you can mix
the probabilities for each symbol sorted by rank, then assign it
to "range" if the symbol matched or add it to "low" if it didn't.
That's called unary coding.
Of course this could get really slow on really unpredictable data...
like 40x slower than a binary model in the worst case, but it's still
significantly faster for normal, redundant data.
And, btw, that's why ash has a composite model which uses the unary
approach only for high ranks (up to the /s limit).
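The unary-coded ranking described above can be sketched like this (toy code; `probs` are assumed per-rank hit probabilities, conditioned on all higher-ranked candidates having failed). It also counts how many probability evaluations a symbol actually costs, which is the claimed speed advantage:

```python
import math

def unary_code_cost(ranked, probs, symbol):
    # walk candidates in rank order; each "is it this symbol?" decision is a
    # binary event, so only probabilities up to the coded symbol's rank are
    # ever computed
    cost, computed = 0.0, 0
    for r, cand in enumerate(ranked):
        computed += 1                      # one probability evaluation
        if cand == symbol:
            cost += -math.log2(probs[r])   # "hit": assign to range
            return cost, computed
        cost += -math.log2(1 - probs[r])   # "miss": add to low, next rank
    raise ValueError("symbol not in ranking")

ranked = ["e", "t", "a", "o"]
probs = [0.7, 0.5, 0.5, 0.9]
cost, n = unary_code_cost(ranked, probs, "e")
# a rank-0 hit needs a single probability evaluation
```

If rank 0 is right ~70% of the time, the average number of evaluations stays near 1-2, while truly unpredictable data forces a walk down the whole ranking - the worst case mentioned above.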

> You got me wrong, i didn't say you don't know what you are talking about. You
> can interpret it. In physics this may apply even more. People don't know how things
> really are - but they can model them and explain the models. This doesn't mean
> that the models *must* mirror reality, even if they deliver good results. I
> think this is more of a philosophical question. To draw a relation to physics: the
> photoelectric effect can't be explained using the wave model, only using a
> photon model (disregarding quantum mechanics, which somehow joins these). Other
> effects like interference can't be explained using a photon model.

Now explain where the difference with SSE is, if we compare the frequency
interpretation with "waves" and the context interpretation with "particles".

24. I didn't exactly know how you were applying mixing to an alphabet larger than two symbols. Now I know that you unary-encode the rank and assign the mixed code space ("range") of your model to the "hit bit" of the unary encoding. It is clear that you only need to mix the code space for the current (unary) bit - now I got it, thanks!

Concerning SSE - I never said anything is wrong. And I didn't say there's a difference between these two interpretations (physics and SSE). I simply said that there are different interpretations which explain a phenomenon even if they are not directly related to each other.

And what about blockwise storage of CM parameters?

25. > Ok, this is clear to me. Could you explain this further:
>
>> So it would be more efficient if you'd add a static model (static, or you'd have
>> to deal with continuous statistics remapping) and interval encoding of context
>> symbols into the [0..CNUM*CNUM) interval. Of course, such a context index model
>> doesn't necessarily have to be similar to the main model, as it might have a completely
>> different metric - e.g. cache optimization by getting more hits or something.
>>
>> Then, interval coding has some additional benefits:
>> 1. Interpolation over normal counter arrays.

You cannot generally get a compression improvement from mixing
adjacent counters in a simple p[byte] array, but it's possible
if you correctly sort the symbols. And that "correct sort" is
obviously done using an entropy model. With properly sorted
symbols, it'd allow using different weights for different
pairs (or any clusters) of adjacent symbols.

>> 2. Variable-depth contexts

There's no real reason to use tables and not trees of contexts, right?
All the tree-based PPM/CM compressors implement it anyway, but there's
a more general and efficient approach.

>> 3. Good compatibility with models built using interval math

Real computations are always inaccurate, more so if they're speed-optimized.
But it's possible to precisely determine the interval of function values
corresponding to given (imprecise) arguments.
And if the whole model is built like that, it allows applying the same
statistical methods to the model's output.
Of course, somewhat similar results can be acquired using SSE, but
it'd be less efficient than an interval-based SSE generalization.
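The interval idea can be illustrated on the simplest building block, the two-model mix p = p1*(1-w) + p2*w: since the mix is affine in each argument, evaluating all corner combinations gives guaranteed bounds on the output (a toy sketch, not the actual interval-SSE design):

```python
def mix_interval(p1, p2, w):
    # p1, p2, w are (lo, hi) pairs; the mix is affine in each argument,
    # so its extrema over the box lie at the corners
    candidates = [a * (1 - c) + b * c
                  for a in p1 for b in p2 for c in w]
    return (min(candidates), max(candidates))

# imprecise submodel outputs, exactly known weight
bounds = mix_interval((0.4, 0.5), (0.6, 0.7), (0.5, 0.5))
```

Any "correct" mixed probability is guaranteed to lie inside the returned pair, which is what lets the same statistical machinery be applied to the model's output.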

26. > We mentioned this some time ago. My idea here was to create a weight function
> based on few parameters (less than the number of models) to allow simpler (and
> faster) coding cost optimization.

ppmy/ash have something like that - look at weighten.inc.

> You mentioned to optimize over a block of
> certain length and storing the weights/function parameters. I think even only
> calculating optimal weights/weighting function parameters will improve
> compression, since you found a global minimum (at least for the coded block),

Here, I found some experimental sources for that (summer 2002).
It's a dynamic parameter optimization applied to a limited version
of the ppmy compressor: http://compression.ru/sh/ppmy_3c.rar
You can directly compare them using ppmy's /o4 option, because
/oN means "use orders 0..N-1".
Quite surprisingly, it's not even that slow.

Code:
```
            book1   wcc386
ppmy3c /o4  237620  302159
o3bfa       233180  267960
```

So it's possible all right, and of course better than a blockwise-static version.
But the problem is that it's e.g. quite easy to use up all the allowed time in
the Hutter contest even without doing such optimizations in the decoder.

> I think this point is really worth discussing & implementing!

Well, that's a clear technique, which is only a matter of programming.
It's so clear that it even allows estimating the pros and cons without really
implementing it - so I never used it in practice.

Also there are much less clear and interesting topics, like replacing SSE
with something less shamanic, or designing a language for model definitions.

27. I can't view packed sources, since I'm at work.

If I remember correctly, your bfa was a simple brute force, e.g. trying all possible values of the weight parameter (you limited it to 63/64...1/64 or something?). One could use a divide-and-conquer scheme to find a minimum quicker: like first trying w = 1/4...3/4 and then subdividing based on the best partition.
But didn't you also mention that your iterative two-input mixer did better than bfa? If I implement this, I don't want to add too much overhead on the encoder side to obtain a faster decoder - what is "not so slow" in seconds?
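The subdivision idea can work here because, for a two-input linear mix under log-loss, the blockwise coding cost is convex in w, so something like ternary search safely narrows in on the minimum (illustrative sketch, names hypothetical):

```python
import math

def cost(preds, bits, w):
    # blockwise coding cost of a static two-input mix; convex in w,
    # being a sum of -log2 of functions affine in w
    total = 0.0
    for (p1, p2), x in zip(preds, bits):
        p = p1 * (1 - w) + p2 * w
        total += -math.log2(p if x else 1 - p)
    return total

def ternary_search_w(preds, bits, lo=1/64, hi=63/64, iters=30):
    # repeatedly discard the worse third of the interval
    for _ in range(iters):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if cost(preds, bits, a) < cost(preds, bits, b):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2
```

Each iteration shrinks the interval to 2/3 of its size, so ~30 iterations pin the weight down far more precisely than a fixed 64-value grid, at 2 cost evaluations per step.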

I'm the wrong person to discuss language models with; I think Matt would do a better job here.

However, an SSE replacement sounds interesting. But I've got no idea where to start, I'm just a hobbyist. And as I said - the most difficult thing is to get the initial idea.

28. > I can't view packed sources, since i'm at work.

Should I give you a link to unrar?

> If I remember correctly, your bfa was a simple brute force, e.g. trying all
> possible values of the weight parameter (you limited it to 63/64...1/64 or
> something?).

Yes, but it can be any set of values.

> One could use a divide-and-conquer scheme to find a minimum quicker:
> like first trying w = 1/4...3/4 and then subdividing based on the best partition.

I had such implementations too, and it's slow anyway, as you either have
to update all the length accumulators for all values, or keep the
parameters for dynamic length function estimation for what you described.
In the end, length accumulators are easier to manage.

Btw, I used binary search etc. not for a speed improvement, but for
better precision, as it isn't limited to a fixed set of values.
But there weren't any significant differences.

> But didn't you also mention that your iterative two-input mixer did better than bfa?

Well, I didn't know about that approach (I call it "indirect update") at the time,
and I won't try it now on that coder, as it's time-consuming to implement it for a
complicated function.

Guess indirect update would be better anyway, but it's really
implementation-dependent (we have to find a smooth enough metric to optimize),
while BFA always allows finding a real global optimum, just over a limited set
of values.

> If I implement this, I don't want to add too much overhead on the encoder
> side to obtain a faster decoder - what is "not so slow" in seconds?

Still 10x faster than paq8?

Also, dunno if there's a misunderstanding again, but that o3bfa coder is
an example of dynamic optimization, so encoder and decoder are equally slow.

And it's really easy to balance improvement/speed with this, so you don't
have to worry about my implementation details.

29. > I'm the wrong person to discuss language models with; i think Matt would
> do a better job here.

Well, probably, but he doesn't bother to discuss anything more practical
than philosophy.

And you didn't understand me actually, as I was talking about a different
kind of language.
To be specific, I already use some simple syntax for context definitions;
for example, a context is defined like this:

Code:
```
Rate Bt_wr, 0!10111
Number Bt_mw, 10,0!00011100
Number Bt_lim, 1,0!000010

Index t
tch: ch, 1!0
t_p: Ap, 1!000010010000
t_q: Aq, 1!100000000000
tr1: Ar1, 1!000111001000
t_r: Ar, 1!010000000000
tr2: Ar2, 1!001000000000
```

and preprocessed from

Code:
```
MakeIndex t
fb = (%M%Bt[t].P*7)>>SCALElog;
```

to

Code:
```
t = 0;
t = t*3 + ((Ap>4+(1-1))+(Ap>7+(1-1)));
t = t*2 + ((Aq>0+(1-1)));
t = t*5 + ((Ar1>3+(1-1))+(Ar1>4+(1-1))+(Ar1>5+(1-1))+(Ar1>8+(1-1)));
t = t*2 + ((Ar>1+(1-1)));
t = t*2 + ((Ar2>2+(1-1)));
fb = (M_00_Bt[t].P*7)>>SCALElog;
```

Well, the main purpose of this is automated optimization, not
a concise and readable model description or something like that,
but it helps in that sense too.

So ideally I'd like to have a special language for the whole model definition,
not only contexts, as C++ is too inconvenient for that.

However, for that I have to determine the necessary sets of "perfect"
construction blocks, techniques and templates, and their options.
So it's still kinda far off.
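Guessing at the semantics from the generated code above (each field quantized against its own thresholds, the results packed into one table index by mixed-radix arithmetic), the index construction might be re-expressed as:

```python
def quantize(value, thresholds):
    # number of thresholds the value exceeds: 0 .. len(thresholds),
    # mirroring expressions like ((Ap>4)+(Ap>7))
    return sum(value > t for t in thresholds)

def make_index(fields):
    # fields = [(value, thresholds), ...] in definition order;
    # each field contributes a digit in base (len(thresholds) + 1),
    # mirroring "t = t*3 + ...; t = t*2 + ...;"
    t = 0
    for value, thresholds in fields:
        t = t * (len(thresholds) + 1) + quantize(value, thresholds)
    return t

# hypothetical field values for the Ap/Aq portion of the generated code
Ap, Aq = 6, 1
t = make_index([(Ap, [4, 7]), (Aq, [0])])
```

The threshold sets themselves (the `0!000010010000`-style masks) are exactly what an automated optimizer can tune per context.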

> However a SSE replacement sounds interesting.
> But i've got no idea where to start, i'm just a hobbyist
> And as i said - the most difficult thing is to get the initial idea

1. SSE tracing and decomposition into a set of functions specific to
some types of input sequences, then simulation using a set of submodels
optimized for these distinct cases, then further simplification and
optimization.
2. First replace SSE with another construction - e.g. the interval
SSE generalization, then deal with it... probably along the lines of (1).
To be more specific, SSE maps an "inaccurate" value into a "fixed" version.
Which can be improved if we knew an interval containing the "correct" value.

Generalized representations and additional parameters always help

30. Originally Posted by Shelwien
Should I give you a link to unrar?
This must be a joke to someone who writes a more or less complex compressor... I simply can't download several file types, including archives and executables!

I'll continue later on, no more time atm
