> Of course I always may add some extra models and dynamic mixing to beat BWTmix...
> But it will be SLOW!
Not necessarily. Well, it might be, if you'd really just _add_ something.
But the current BWTmix model is really simple, and doesn't use SSE or fsm
counters, so there are certainly ways to beat it.
> keep the most important models with a static mixing plus one SSE
Don't forget about fsm. Basically, it should be possible to approximate
a few counters (maybe even with some small contexts) + dynamic mixing + SSE
with a pair of lookup tables (p=P[state] and state=update[state][bit]).
Of course, as the original state grows (all the counters, mixer and SSE states),
it becomes harder to enumerate all the submodel states and determine which ones are
relevant, but that's what makes it interesting.
And a pair of counters with static mixing is an obvious target for that.
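For illustration, here's a minimal sketch of that table pair - the names, the probability
values and the 4-state machine are all made up just to show the shape; a real fsm counter
would be built offline by enumerating the combined states of the counters + mixer + SSE it
approximates and merging the ones that behave alike:
Code:
  #include <cstdio>

  enum { SCALE = 1<<12 };

  // probability of bit 1 for each state, in SCALE units (hypothetical values)
  static const unsigned short P[4] = { 400, 1600, 2500, 3700 };
  // next state for (state,bit); would be derived offline in a real design
  static const unsigned char update[4][2] = { {0,1}, {0,2}, {1,3}, {2,3} };

  struct FsmCounter {
    unsigned char state = 1;
    int  p1() const    { return P[state]; }              // prediction: one table lookup
    void upd( int bit ) { state = update[state][bit]; }  // update: one table lookup
  };

  int main() {
    FsmCounter c;
    const int bits[] = { 1,1,0,1,1,1,0,1 };
    for( int bit : bits ) {
      printf( "p1=%4d  bit=%d\n", c.p1(), bit );
      c.upd( bit );
    }
  }
So at runtime the whole submodel collapses into two array lookups per bit, whatever it
originally consisted of.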
> At the same time, experience with such over-weighed
> CM-back-ends as unreleased BCM009 and BWTmix really help
> to find out what kind of models/techniques are the best
> and how close the lighter CM is...
Well, as I was always saying... it's just rational to get as
much compression as possible - basically to estimate
the data redundancy - and only after that start optimizing
for speed, knowing the cost in compression you'd be losing.
And BWTmix is not really "overweighted" - it's just a very
straightforward implementation, which probably would be
able to reach the speed of bcm008 if optimized.
But as a result of such statements, I'm currently working
on speed optimization, instead of SSE2 and stuff, contrary
to my own theory.
Well, it's just a rangecoder optimization for now, so I hope
that it won't mess up the prediction stage. Btw, I managed
to reduce the number of rc calls by more than half:
http://shelwien.googlepages.com/p2.txt
Somehow it didn't improve the speed right away, but
that's probably because of some compiler weirdness.
It was like this:
Code:
  if( DECODE ) {
    dbit = rc.BProcess<1,1>( p0 );   // one rangecoder call for every decoded bit
    if( dbit ) { UPDATE(1) } else { UPDATE(0) }
  }
and now its like this:
Code:
  if( DECODE ) {
    // dbitx marks whether p0 is in the upper half; pr = min(p0,SCALE-p0)
    dbitx = (p0>=hSCALE); //p0>>(SCALElog-1);
    int pr = p0 + ((SCALE-p0-p0)&(-dbitx));
    if( pr<M_p0lim+M_p1lim ) {
      // skewed prediction: decode the distance to the next "rare" bit once,
      // then replay the likely bits without touching the rangecoder
      if( pr<M_p0lim ) {
        if( b0count<0 ) dec_dist( rc, b0count, 0, M_ex0wr, M_ex0mw );
        dbit = (b0count>0); b0count--;
      } else {
        if( b1count<0 ) dec_dist( rc, b1count, 1, M_ex1wr, M_ex1mw );
        dbit = (b1count>0); b1count--;
      }
    } else {
      // not skewed enough: normal per-bit rangecoder call
      dbit = rc.BProcess<1,1>( pr );
    }
    dbit ^= dbitx;   // map back to the actual bit value
    if( dbit ) { UPDATE(1) } else { UPDATE(0) }
  }
And these counts go into the millions on enwik8, so
I'm not sure what exactly eats up the time saved
on rangecoder calls.
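Btw, in case the structure above isn't obvious, here's a toy, self-contained version of
the same shortcut - not the actual p2.txt code: dec_dist, the separate 0/1 counters and
the real thresholds are simplified away, the distance is a fixed dummy value, and p0 is
taken as the probability of bit 0. The point is just that a distance to the next "rare"
bit is fetched once, then the likely bits are replayed for free, so the coder is only
touched on refills:
Code:
  #include <cstdio>

  enum { SCALElog=12, SCALE=1<<SCALElog, hSCALE=SCALE/2, LIM=64 };

  struct ToyDecoder {
    int run = -1;      // remaining likely bits before the next rare one
    int rc_calls = 0;  // how many rangecoder calls a real decoder would make

    // stand-in for dec_dist(): a real decoder would read the distance
    // from the compressed stream; here it's just a fixed toy value
    int fetch_distance() { rc_calls++; return 7; }

    int decode_bit( int p0 ) {
      int dbitx = (p0>=hSCALE);                    // which half p0 falls into
      int pr = p0 + ((SCALE-p0-p0)&(-dbitx));      // pr = min(p0,SCALE-p0)
      int dbit;
      if( pr<LIM ) {                               // skewed enough for the run path
        if( run<0 ) run = fetch_distance();        // refill: likely bits until a rare one
        dbit = (run>0);                            // 1 = likely bit, 0 = the rare bit
        run--;
      } else {
        rc_calls++;                                // would be rc.BProcess<1,1>(pr) here
        dbit = 1;                                  // toy stand-in: assume the likely bit
      }
      return dbit^dbitx;                           // map back to the actual bit value
    }
  };

  int main() {
    ToyDecoder d;
    for( int i=0; i<1000; i++ ) d.decode_bit( 10 ); // p0=10/4096: very skewed
    printf( "coder calls for 1000 skewed bits: %d\n", d.rc_calls );
  }
Here that's one coder call per 8 decoded bits instead of one per bit; the real gain
obviously depends on the data and the thresholds.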
Anyway, it seems like I'm going to keep experimenting with
BWT postcoders too, for a while, as there are many more ideas
which I'd like to try out.