>> There's a belief that

>> "MSE criterion yields an optimal solution only if the input signal is Gaussian"

> I don't know if we both mean the same, but it is easy to proof that

Sure, with a formal definition. But the quote I posted is relatively

fuzzy in that sense.

And my point there was that actual signal is not really like that -

it may be locally similar, but isn't stationary anyway.

>> Did you ever try optimization metrics based on entropy?

> No, but on the other side that would lead to completely different predictors.

Yes, and likely better, because entropy is what you care about (=compressed size).

Btw, afaik, Matt tested both MSE- (in http://mattmahoney.net/dc/paq.html#neural)

and entropy-optimized update functions, and the latter was clearly superior.

> Was the subband-decomposition of MDCT fixed?

It does a 1024-point MDCT, then compresses the result as a 1024xN table

of integers. Surely improvements are possible there too, but LPC seems much

more promising comparing to that.

> The math behind the current paq-structure is not too hard and I

> think the many stages of refinement are only needed because of

> inappropriate modelling.

Not really. Or, rather, you just can't implement "appropriate modelling",

because it implies infinite memory/complexity.

> I like the idea of optimality in some sense, so using a a better

> algorithm for minimum search instead of gradient descent could

> possibly lead to better improvements.

Its not gradient descent. Its a bayesian inference, like

Code:

C[n,k] = n!/k!/(n-k!)
p0 = 1/C[n0+n1+1,n0+1] // probability of string with (n0+1) 0's and n1 1's
p1 = 1/C[n0+n1+1,n0] // probability of string with n0 0's and (n1+1) 1's
p = p0/(p0+p1) = (n0+1)/(n0+n1+2) // normalize because its either this or that

Or a minimum entropy extrapolation, like

Code:

f[p] = n1*Log[1-p]+n0*Log[p] + Log[p]+Log[1-p] // n1 1's, n0 0's, then 01 for extrapolation
f'[p] = (n0+1)/p - (n1+1)/(1-p)
f'[p]==0 -> p=(n0+1)/(n0+n1+2)

(In another thread here, I demonstrated derivation of paq8 mixer update in a similar way)

> I already do mixer selection via contexts but the improvements are very tiny.

How did you check that?

By manually adding context and seeing that improvement is tiny?

Actually such approach won't always work even for simple counters

(ie higher order context doesn't always mean better compression).

At least, you have to tune mixer parameters after changing its context,

and ideally you have to retune the whole model, starting with parameters

of mixed submodels (including contexts used there).

> Perhaps, if you have the time and nerves you can explain me that

> thing called "StateMap".

paq's StateMap is Matt's thing, so you can look at Matt's explanations

(there're plenty - description in paq source, DCE, http://cs.fit.edu/~mmahoney/compression/stategen.cpp, etc).

Alternatively, here's a version of ccm counter state table generator:

Code:

byte freqmask[64*64];
struct fsm {
byte s[2]; // next state after bits 0,1
byte p5; // 5-bit prob quant
byte p6; // 6-bit prob quant
word n1; // bit==1 freq
word n0; // total freq (0+1)
} cstate[256];
int Init_FSM_States( int x=0, int y=0 ) {
static int state = 0;
if( state==0 ) memset( freqmask, 0xFF, sizeof(freqmask) );
if( (x==57) || (y==57) ) x<=y ? y-- : x--;
byte& r = freqmask[y*64+x];
if( r==0xFF ) {
r = state;
int s = state++;
cstate[s].n0 = y+x;
cstate[s].n1 = y;
cstate[s].p5 = (y*4+1)*32/(x*4+1 + y*4+1);
cstate[s].p6 = (y*4+1)*64/(x*4+1 + y*4+1);
cstate[s].s[0] = Init_FSM_States( x+1, (y+3)/4 );
cstate[s].s[1] = Init_FSM_States( (x+3)/4, y+1 );
}
return r;
}

But its ad hoc both in paq and ccm, and there's a better (afaik) approach.

Suppose that we have a simple counter model with two 12-bit counters and static mixing:

Code:

unsigned r:12;
unsigned q:12;
[...]
p = (r*5+q*3)>>3; // compute the probability
rc.Encode( bit, p ); // encode bit
// update the model
if( bit==1 ) {
r += (((1<<12)-r)>>5);
q += (((1<<12)-q)>>4);
} else {
r -= (r>>5);
q -= (q>>4);
}

Now, this thing already looks relatively complex, but it only has

2^24 possible different states, so its possible to implement it

as a direct state machine:

Code:

unsigned state:24;
[...]
p = LUT[state].p; // compute the probability
rc.Encode( bit, p ); // encode bit
state = LUT[state].next[bit]; // update the model

But such a plain version is huge, so it would make sense to look

for similar states and merge them, to reduce the size of state table.

One possibility (on which afaik paq stategen is based) is to merge

states with equal coding probabilities, which would already reduce

the table size to 2^12.

But overall, I don't think that some perfect universal solution

exists for this task, and you don't have to do it in realtime anyway,

so I think that bruteforce methods would be better -

ie for all state pairs compute how much merging them

hurts compression, then merge some (by entropy metric)

and repeat until getting under state number threshold.

> That's what I understood:

> You map contexts with similar bit-histories to a counter/state and

> use that. But why is this better than just using the smaller history

> as a context?

I like to compare counters with lossily compressed codes of histories.

In other words, counters actually store more (useful) information

about context history than the same amount of most recent bits.

Actually I experimented with this, and found that I needed 80 bits

of history to simulate the counter in my coder (by allocating a new

counter, then updating it with preserved history bits, then using its prediction).

> The idea of using 64-bit low is good and I fitted the coder into my

> classes. Gives a tiny improvment size/speed-wise

Good. Btw, one very good rc-related speed optimization is to declare

the rc state as a bunch of local variables in a function, instead of

structure/class fields.

You can see it here: http://ctxmodel.net/files/fpaq0pv4b3.rar

The idea is like this: local vars can be mapped to cpu registers

and separately preserved in random stack locations when necessary,

without caring about object state consistency. But compilers don't

bother to analyze structures in that sense and so allocate them

in memory, with fields in specific order too. And when frequently-accessed

values are stored in memory, its really scary, because compiler

would store and reload the values in registers around any (non-"restrict")

pointer accesses and (not inlined) function calls.

But unfortunately C++ doesn't have local functions, so I'm still

searching for a good solution for this - the one in fpaq above

uses perl preprocessors to unroll template functions and convert them into macros,

thus simulating the local functions, but its not easy to apply to complex coders.

Another idea was to define a macro like "int& low, int& range, ..." and add it

to parameters of all related functions, but the number of function arguments

affects inlining heuristics, and compilers may not even see through it after

enough nested calls.

Do you have any workaround ideas for this?

Actually the issue with scratch vars declared as class fields is imho very common.

In some cases it can be solved by replacing iterators with coroutines, but that's

a "heavy" method which is hardly applicable to arithmetic coding.

>> Well, yes, because for now its the same closed-source as others

> If you say: "give me the code" I'll upload it immediately.

I can't say that I need it, and can't even promise to read it.

But I think that overall it could be helpful to have it available -

at least somebody would compile a faster executable for you.

Anyway, people here won't troll you for the code quality, so

why not attach the source, if you intend to open it at all?

> I really have to go deeper into this stuff again to improve

> something but I collected some simple ideas to improve it a tiny bit

Afaik the common mistake in compression is when people implement

complex high-level models (with MSE or at best approximated likelihood metrics),

but don't care to spend time on actual entropy coding.

For example, a few times I was asked to implement a postcoder for some data

and given some delta-coded samples (they noticed that values are similar

and decided to compress deltas). So I had to guess that these are deltas

(obviously nobody would tell right away) and sum them back to actual values,

to get better compression with context modelling of actual values.