1. linear components

Its quite possible to implement a statistical model
only with components which use linear combinations of inputs -
there're counters and mixers like that.

2. approximation of a linear counter

The update for my usual linear counter looks like this:
p' = p*(1-wr)+(mw*bit+(1-mw)*(1-bit))*wr;
Here wr and mw are constant parameters -
wr is "update speed" and mw is the "noise level".
Commonly even simpler counters are used, just something like
p' = p*(1-wr)+(1-bit)*wr
Either way, its obvious that the p(wr) function for a given
bit sequence can be described as a polynomial.
For example, with p0=0.5 and bits 0110 it would be:
step 0: 0.5
step 1: 0.5 + 0.5*wr
step 2: 0.5 - 0.5*wr^2
step 3: 0.5 - 0.5*wr - 0.5*wr^2 + 0.5*wr^3
step 4: 0.5 + 1.0*wr^3 - 0.5*wr^4

In practice, it appeared that an order-44 polynomial could be
used to approximate the probabilities for files of reasonable size
(book1bwt,enwik7bwt were tested).

3. attempt to approximate a string probability

The compressed file size (which is somewhat incorrectly called "entropy" here)
can be estimated as sum of -log2(p) for each data bit, or as -log2 of the whole
file likelihood (the product of all p's, aka string probability).
But the product of linear-counter polynomials is also a polynomial in wr,
so it should be technically possible to build a parametric entropy(wr)
function, as entropy(wr)=-log2(string_probability).

Well, it was actually implemented, but unfortunately didn't work good enough -
instead of converging, its coefs kept growing in amplitude until the limit of precision,
and then the estimations become totally random.

4. polynomial approximation of log2(p)

The next idea was to try a different approximation method, which won't require
handling polynomials of infinite orders.
The log2(p) function was approximated with a order-15 polynomial
(using a a least-squares fit for coefs - I didn't find any series for log(x) with x in 0..1).
Initial precision seemed pretty bad (like +/-10%), but some iterations with parameter
optimizer seemed to fix it up to a reasonable level (surprisingly it wasn't overtuned to
a single file, but really improved the estimation precision for several tested files).

5. results

Now we have an implementation which actually works, kind of:
This is the entropy(wr) plots using different estimation functions:

There're still quite a few flaws:
- The polynomials are actually based on (1-wr) powers instead of wr,
so precision drops for small wr values.
Well, it shouldn't be hard to change it to powers of wr.
- It appeared necessary to use "long double" type for storage of coefs,
even for fairly small files (<1k). Probably have to find a different
representation for coefs (log space, FFT, etc).
- Due to long doubles and polynomial multiplications the process is fairly slow.

However these are technical problems, while the overall idea seemed to work -
its possible to build a reasonably small (676 coefs) _parameteric_ entropy
function which can be used to find the optimal parameter value without
having to process the data multiple times.