> compared to standard speed = 40:

> - after hit: prob = prob + (1 - prob) / 40 = 0.5125

> - after miss: prob = prob - prob / 40 = 0.4875

> speed controls how much of the error should be used for update, it isn't a

> direct update value itself.

Well, this calls for some comments...

I think that this prediction/error interpretation is bad for theory,

because its inherently associated with exact prediction of next value in series,

LMC metric, and such - while the only metric a compressor needs is a Shannon metric.

So, there're two interpretations that I generally use, depending on a case:

1. Combinatoric

For binary sequence with only added information that there're n0 zeroes and n1 ones,

there're (n0+n1)!/n0!n1! equiprobable cases, and its easily provable that a counter

like this would be optimal:

p1 = n1/(n0+n1); // static

if( y ) c1++; else c0++; p = c1/(c1+c0); // dynamic

Then, for nonstationary data with local density variations,

a weighted Shannon metric can be used, and so an

interval range sequence for optimal encoding would be

if( y ) c1=c1*w+1,c0=c0*w; else c0=c0*w+1,c1=c1*w; p = c1/(c1+c0);

And as a speed optimization, it can be rewritten to avoid

division:

p = c1/(c1+c0);

if( y ) p'=(c1*w+1)/(c0*w+c1*w+1); else p'=(c1*w+1)/(c0*w+c1*w+1);

---

if( y ) p'=(c1+1/w)/(c0+c1+1/w); else p'=c1/(c0+c1+1/w);

---

if( y ) p'=p*(c0+c1)/(c0+c1+1/w)+(1/w)/(c0+c1+1/w);

else p'=p*(c0+c1)/(c0+c1+1/w);

---

T = c0+c1

T' = c0*w+c1*w+1 = (T+1/w)/(1/w) = T*w+1

wr = 1-(c0+c1)/(c0+c1+1/w) = (1/w)/(c0+c1+1/w) = 1/T'

if( y ) p'=p*(1-wr)+wr; else p'=p*(1-wr);

p = p'; T=T' [

Then, to compensate for supposed completely random bits

(not correlated with recent history), the prediction of

counter can be further improved by static mixing with

a fixed "order-(-1)" distribution:

P = p*(1-mw)+0.5*mw

which means that supposedly counter's prediction "p"

would be applicable for current symbol/bit with

probability (1-mw) and its a completely random bit

otherwise. (Of course, any distribution can be used

instead of {0.5,0.5}, but lets skip it for simplicity).

As experience shows that this suggestion is widely

useful, the mix can be merged with actual counter:

P = p*(1-mw)+0.5*mw

T = T*w+1; wr = 1/T;

if( y ) (p*=(1-wr))+=wr; else p*=(1-wr);

---

P' = ((p*(1-wr)+y*wr)*(1-mw)+0.5*mw) =

= p*(1-wr)*(1-mw)+0.5*mw+y*wr*(1-mw) =

= p*((1-mw)-wr)+0.5*mw+y*wr*(1-mw) =

= ((P-0.5*mw)/(1-mw))*(1-wr)*(1-mw)+0.5*mw+y*wr*(1-mw) =

= (P-0.5*mw)*(1-wr)+0.5*mw+y*wr*(1-mw) =

= P*(1-wr) - 0.5*mw + 0.5*mw*wr + 0.5*mw + y*wr*(1-mw) =

= P*(1-wr) + 0.5*mw*wr + y*wr*(1-mw) =

= P*(1-wr) + (y*(1-mw)+0.5*mw)*wr

// y*(1-mw)+0.5*mw == y?1-0.5*mw:0.5*mw

---

T = T*w+1; wr = 1/T; omp = 0.5*mw*wr;

P = P*(1-wr)+(y?wr-omp : omp);

And that's exactly what I mostly use.

The point here is that such a counter is a generalization

of combinatorical counter and so is able to precisely simulate

the implementation with dumb counts of occurences.

And combinatorical counter is good because it allows to

efficiently encode even completely non-redundant bit sequence.

Its only redundancy is information contained in parameter values

(which include initial values for P and T).

2. Bit sequence hash

When dynamic probability mapping is used (aka SSE etc),

a hash interpretation has a lot of sense too.

Well, its obvious that in fact a "probability" updated

like that is a convolution of bit sequence and so it

highly correlates with current suffix bits, especially

when a fast "update speed" is used (and mostly it is fast enough).

Thus, because of indirect use, a counter-context like that

has a completely different optimization metric comparing to

combinatoric counter and so may (and should actually) be

implemented with different properties.

For example, the combination of a normal counter delayed by

a few bits (that is, updated with bit[x-k] when bit[x] is modelled)

and a static linear mapping ( a[bit[x-k..x-1]]*p+b[bit[x-k..x-1]]*(1-p) )

does show good results.

371699 // test with a single SSE array over normal counter array (fully tuned)

370757 // counters delayed by 3 bits and a static mapping before SSE added

And I consider this improvement fairily major as there's no model structure

change, no contexts or dynamic elements added, and parameters for the first

result were fully optimized by bruteforce search (well, for second as well).

But still, this option is highly dependent on implementation details of the

used dynamic mapping (SSE) and is not understood as well as "normal" counters.

That is, in most general case of this approach counter can be replaced by

static lookup table mapping the bit sequence in context to some precalculated

hash values. But, forgetting about the necessary size of this lookup table

(~2^40 should be a common value), there's no known way for now to optimize

the table contents except by directly checking the compression result with

different settings.

> speed should be lower for probs like 0.01 or 0.99 because with high speed

> occasional negative bit in a equal bits run would ruin the statistics.

> ie. if we have sequence:

> 0000000000000000000000100000000000000000000000000

> then on sequence:

> 0000000000000000000000

> we should consistently lower the speed because if we encounter bit 1

> it shouldn't change statistics very much because the following bits are

> 0000000000000000000000000

A simple combinatoric counter handles this situation pretty well actually.

With completely precise implementation it is able to map strings like

('0' x N1 . '1' . '0' x N2) directly to values from 0 to N-1=N1+N2

(with pre-known N).

Also, of course, sometimes counter tweaks give a gain in compression anyway.

But then, if a gain origin can be analyzed, it is usually possible to

add an explicit support for some special cases and leave the counter alone.

The quoted example is exactly the case like that.