> compared to standard speed = 40:
> - after hit: prob = prob + (1 - prob) / 40 = 0.5125
> - after miss: prob = prob - prob / 40 = 0.4875
> speed controls how much of the error should be used for update, it isn't a
> direct update value itself.
Well, this calls for some comments...
I think that this prediction/error interpretation is bad for theory,
because it's inherently associated with exact prediction of the next value
in a series, the LMC metric, and such - while the only metric a compressor
needs is the Shannon metric.
So, there are two interpretations that I generally use, depending on the case:
1. Combinatoric
For a binary sequence where the only added information is that there are
n0 zeroes and n1 ones, there are (n0+n1)!/(n0!*n1!) equiprobable cases,
and it's easily provable that a counter like this would be optimal:
p1 = n1/(n0+n1); // static
if( y ) c1++; else c0++; p = c1/(c1+c0); // dynamic
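For illustration, here's a minimal C sketch of the dynamic version
(the struct/function names and the (1,1) initial counts are just
illustrative, not taken from any specific coder):
typedef struct { double c0, c1; } CombCounter;
void cc_init( CombCounter* c ) { c->c0=1; c->c1=1; }                 // initial counts act as the prior
double cc_p1( const CombCounter* c ) { return c->c1/(c->c0+c->c1); } // p for the next bit
void cc_update( CombCounter* c, int y ) { if( y ) c->c1++; else c->c0++; }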
Then, for nonstationary data with local density variations,
a weighted Shannon metric can be used, and so the counter producing
the interval range sequence for optimal encoding would be
if( y ) c1=c1*w+1,c0=c0*w; else c0=c0*w+1,c1=c1*w; p = c1/(c1+c0);
And as a speed optimization, it can be rewritten to avoid
division:
p = c1/(c1+c0);
if( y ) p'=(c1*w+1)/(c0*w+c1*w+1); else p'=(c1*w)/(c0*w+c1*w+1);
---
if( y ) p'=(c1+1/w)/(c0+c1+1/w); else p'=c1/(c0+c1+1/w);
---
if( y ) p'=p*(c0+c1)/(c0+c1+1/w)+(1/w)/(c0+c1+1/w);
else p'=p*(c0+c1)/(c0+c1+1/w);
---
T = c0+c1
T' = c0*w+c1*w+1 = (T+1/w)/(1/w) = T*w+1
wr = 1-(c0+c1)/(c0+c1+1/w) = (1/w)/(c0+c1+1/w) = 1/T'
if( y ) p'=p*(1-wr)+wr; else p'=p*(1-wr);
p = p'; T = T';
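In code, a minimal sketch of this counter looks like the following
(w and the initial p,T are tunable parameters; one division for wr
remains, but wr depends only on the number of updates so far, so in
a fixed-point version it could be taken from a small precomputed table):
typedef struct { double p, T; } WCounter;
void wc_init( WCounter* c, double p0, double T0 ) { c->p=p0; c->T=T0; }
void wc_update( WCounter* c, int y, double w ) {
  c->T = c->T*w + 1;           // T' = c0*w+c1*w+1
  double wr = 1/c->T;          // adaptation rate
  c->p = c->p*(1-wr) + y*wr;   // p*(1-wr)+wr for y=1, p*(1-wr) for y=0
}
Note also that T=T*w+1 converges to 1/(1-w), so wr tends to the constant
1-w; in that stationary limit this is exactly the plain "p += (1-p)/speed"
update from the quote, with speed = 1/(1-w).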
Then, to compensate for presumably completely random bits
(not correlated with recent history), the counter's prediction
can be further improved by static mixing with
a fixed "order-(-1)" distribution:
P = p*(1-mw)+0.5*mw
which means that, supposedly, the counter's prediction "p"
is applicable to the current symbol/bit with
probability (1-mw), and it's a completely random bit
otherwise. (Of course, any distribution can be used
instead of {0.5,0.5}, but let's skip that for simplicity.)
As experience shows this assumption to be widely
useful, the mix can be merged with the actual counter:
P = p*(1-mw)+0.5*mw
T = T*w+1; wr = 1/T;
if( y ) (p*=(1-wr))+=wr; else p*=(1-wr);
---
P' = ((p*(1-wr)+y*wr)*(1-mw)+0.5*mw) =
= p*(1-wr)*(1-mw)+0.5*mw+y*wr*(1-mw) =      // substitute p = (P-0.5*mw)/(1-mw), from P = p*(1-mw)+0.5*mw
= ((P-0.5*mw)/(1-mw))*(1-wr)*(1-mw)+0.5*mw+y*wr*(1-mw) =
= (P-0.5*mw)*(1-wr)+0.5*mw+y*wr*(1-mw) =
= P*(1-wr) - 0.5*mw + 0.5*mw*wr + 0.5*mw + y*wr*(1-mw) =
= P*(1-wr) + 0.5*mw*wr + y*wr*(1-mw) =
= P*(1-wr) + (y*(1-mw)+0.5*mw)*wr
// y*(1-mw)+0.5*mw == y?1-0.5*mw:0.5*mw
---
T = T*w+1; wr = 1/T; omp = 0.5*mw*wr;
P = P*(1-wr)+(y?wr-omp : omp);
And that's exactly what I mostly use.
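Put together, a minimal C sketch of that counter (the parameter names
w, mw and the initial P,T are mine):
typedef struct { double P, T; } MixCounter;
void mc_init( MixCounter* c, double P0, double T0 ) { c->P=P0; c->T=T0; }
double mc_p1( const MixCounter* c ) { return c->P; }  // probability of bit==1 for the coder
void mc_update( MixCounter* c, int y, double w, double mw ) {
  c->T = c->T*w + 1;
  double wr  = 1/c->T;
  double omp = 0.5*mw*wr;
  c->P = c->P*(1-wr) + (y ? wr-omp : omp);
}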
The point here is that such a counter is a generalization
of the combinatoric counter and so is able to precisely simulate
the implementation with plain counts of occurrences.
And the combinatoric counter is good because it allows
efficient encoding even of a completely non-redundant bit sequence.
Its only redundancy is the information contained in the parameter values
(which include the initial values for P and T).
2. Bit sequence hash
When dynamic probability mapping is used (aka SSE etc),
a hash interpretation makes a lot of sense too.
Well, it's obvious that in fact a "probability" updated
like that is a convolution of the bit sequence, and so it
correlates highly with the current suffix bits, especially
when a fast "update speed" is used (and mostly it is fast enough).
Thus, because of this indirect use, a counter-context like that
has a completely different optimization metric compared to the
combinatoric counter, and so may (and actually should) be
implemented with different properties.
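To make the convolution view concrete, here's a small self-contained
check (the constant rate wr and the sample bits are arbitrary): the usual
update produces exactly an exponentially weighted sum of the past bits,
so with a fast rate the state is almost a function of the last few bits only.
#include <stdio.h>
#include <math.h>
int main(void) {
  int bits[] = {1,0,1,1,0,0,1,0,1,1}, n = 10;
  double wr = 0.25, p0 = 0.5, p = p0, q;
  for( int i=0; i<n; i++ ) p = p*(1-wr) + bits[i]*wr;        // usual counter update
  q = p0*pow(1-wr,n);                                        // same state written as a convolution,
  for( int i=0; i<n; i++ ) q += bits[i]*wr*pow(1-wr,n-1-i);  // with weights wr*(1-wr)^age
  printf( "%f %f\n", p, q );                                 // identical values
  return 0;
}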
For example, the combination of a normal counter delayed by
a few bits (that is, updated with bit[x-k] when bit[x] is modelled)
and a static linear mapping ( a[bit[x-k..x-1]]*p+b[bit[x-k..x-1]]*(1-p) )
does show good results.
371699 // test with a single SSE array over a normal counter array (fully tuned)
370757 // counters delayed by 3 bits, plus a static mapping added before SSE
And I consider this improvement fairly major, as there's no model structure
change and no contexts or dynamic elements added, while the parameters for the
first result were fully optimized by brute-force search (and for the second as well).
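For reference, a rough sketch of how such a "counter delayed by K bits plus
a static linear mapping" could be organized (all names are illustrative, the
table contents a[],b[] would have to be tuned offline, a plain counter stands
in for the real one, and the SSE stage that uses the result is not shown;
this is not the actual tuned implementation behind the numbers above):
#define K 3
typedef struct {
  double p;            // the "normal" counter's prediction
  int    hist;         // previously coded bits, newest in bit 0
  int    delay[K], n;  // coded bits the counter hasn't seen yet
} DelayedCounter;
void dc_init( DelayedCounter* d ) { d->p=0.5; d->hist=0; d->n=0; }
// called when bit[x] is modelled: the counter is first updated with
// bit[x-K] (if available), then its prediction is remapped
double dc_predict( DelayedCounter* d, const double a[1<<K],
                   const double b[1<<K], double rate ) {
  if( d->n==K ) {
    d->p += (d->delay[0] - d->p)*rate;               // counter sees bit[x-K] only now
    for( int i=1; i<K; i++ ) d->delay[i-1]=d->delay[i];
    d->n--;
  }
  int ctx = d->hist & ((1<<K)-1);                    // bit[x-K..x-1]
  return a[ctx]*d->p + b[ctx]*(1-d->p);              // static mapping; SSE would follow
}
// called after bit[x] is coded
void dc_push( DelayedCounter* d, int y ) {
  d->delay[d->n++] = y;
  d->hist = (d->hist<<1) | y;
}
The idea, as I read it, is that the counter lags behind by K bits, while
the a[],b[] tables statically correct its prediction according to those
most recent bits.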
But still, this option is highly dependent on the implementation details of the
dynamic mapping (SSE) used, and is not as well understood as "normal" counters.
That is, in the most general case of this approach, the counter can be replaced
by a static lookup table mapping the bit sequence in the context to some
precalculated hash values. But, leaving aside the necessary size of such a
lookup table (~2^40 would be a common value), there's no known way for now to
optimize the table contents except by directly checking the compression result
with different settings.
> speed should be lower for probs like 0.01 or 0.99 because with high speed
> occasional negative bit in a equal bits run would ruin the statistics.
> ie. if we have sequence:
> 0000000000000000000000100000000000000000000000000
> then on sequence:
> 0000000000000000000000
> we should consistently lower the speed because if we encounter bit 1
> it shouldn't change statistics very much because the following bits are
> 0000000000000000000000000
A simple combinatoric counter actually handles this situation pretty well.
With a completely precise implementation it is able to map strings like
('0' x N1 . '1' . '0' x N2) directly to values from 0 to N-1=N1+N2
(with pre-known N).
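As a quick sanity check of that claim (N and the position of the '1' here
are arbitrary values, only roughly matching the quoted string), the exact
sequential probabilities multiply out to 1/N, i.e. log2(N) bits total,
wherever the single '1' sits:
#include <stdio.h>
#include <math.h>
int main(void) {
  int N = 48, one_pos = 22;              // arbitrary example values
  int seen = 0;
  double codelen = 0;
  for( int i=0; i<N; i++ ) {
    int y = (i==one_pos);
    double p1 = seen ? 0 : 1.0/(N-i);    // exact p(1): the '1' is equally likely
    double p  = y ? p1 : 1-p1;           // to be in any of the remaining positions
    codelen += -log2(p);
    seen |= y;
  }
  printf( "%f %f\n", codelen, log2((double)N) );  // both equal log2(N)
  return 0;
}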
Also, of course, counter tweaks sometimes give a gain in compression anyway.
But then, if the origin of a gain can be analyzed, it is usually possible to
add explicit support for some special cases and leave the counter alone.
The quoted example is exactly such a case.