Imagine we have a single sequence of symbols. Thinking about predicting the probability of the next symbol based on previous occurrences naturally leads to a simple and elegant approach, probably(?) first documented in the 1991 Howard & Vitter paper ( ftp://ftp.cs.brown.edu/pub/techreports/91/cs91-45.pdf ), and nicely described e.g. in Fabian's post:

CDF[i] -= (CDF[i] - newCDF[i]) >> rate

where CDF[i] = sum_{j<i} Pr(j) is the cumulative distribution function, which is operated on directly in arithmetic coding and rANS.

newCDF is the CDF estimated from a new segment of data, used (once per segment) to update the CDF used by the entropy coder.

The best we can do is to shrink to 1-symbol segments - updating the probability after every symbol, as is done e.g. in LZNA.
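To make this concrete, here is a sketch of such a per-symbol update (the alphabet, total scale and rate below are my own illustrative assumptions, not taken from any particular codec): after coding symbol s, each CDF entry is pulled toward the step CDF of s by the same shift rule as above.

```python
TOTAL = 1 << 15  # assumed CDF scale (probabilities sum to TOTAL)
RATE = 5         # assumed adaptation rate; larger = slower adaptation

def update_cdf(cdf, s, rate=RATE):
    """Pull cdf toward the step CDF of the just-seen symbol s.

    cdf[i] = scaled sum_{j<i} Pr(j); cdf[0] = 0 and cdf[-1] = TOTAL stay fixed.
    For a 1-symbol segment, newCDF is the step function: 0 for i <= s and
    TOTAL for i > s, so CDF[i] -= (CDF[i] - newCDF[i]) >> rate splits into
    the two branches below.
    """
    for i in range(1, len(cdf) - 1):
        if i <= s:
            cdf[i] -= cdf[i] >> rate
        else:
            cdf[i] += (TOTAL - cdf[i]) >> rate
    # A real coder would additionally enforce a minimum frequency per symbol,
    # so that no probability decays to zero.
    return cdf

# Example: uniform CDF over 4 symbols; seeing symbol 0 raises its probability.
cdf = [0, 8192, 16384, 24576, 32768]
update_cdf(cdf, 0)
```

After this single update, cdf[1] (the scaled probability of symbol 0) has grown by (TOTAL - 8192) >> 5 = 768, while the other symbols shrank accordingly.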

Choosing the optimal rate is a nontrivial problem.

Here is my slide with examples of such symbol-wise adaptation:

The natural question is whether this is the best we can do.

This approach just averages over the past, so it has no chance to exploit a trend in the data, like a rising probability of some symbol - in which case it might be better to assume a probability larger than the average over the past.

Intuitively, we would like to "fit a first- or second-order polynomial" to the history of the probability distribution of a given symbol and extrapolate to the current situation.

However, fitting a polynomial to a series of spikes (occurrences) is a nontrivial problem ... but it can be done.

My first approach, originally for describing the density profile of a molecule with coefficients of polynomials (page 9 of https://arxiv.org/pdf/1509.09211 ), was to first find points describing the number of occurrences before given positions:

(x, number of occurrences or sum of weights before x)

and fit a polynomial to these points, then take the derivative of this polynomial - this derivative (still a polynomial) is an approximation of the density history and allows for the extrapolation we need ...
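A minimal sketch of this cumulative-fit idea (the degree and the sample below are illustrative assumptions): fit a polynomial to the points (x, fraction of occurrences before x), then differentiate it to get a polynomial density estimate.

```python
import numpy as np

def density_from_cumulative(sample, degree=4):
    """Fit a polynomial to the empirical cumulative counts, then differentiate."""
    xs = np.sort(np.asarray(sample, dtype=float))
    # points (x, fraction of occurrences up to and including x)
    cum = np.arange(1, len(xs) + 1) / len(xs)
    poly = np.polynomial.Polynomial.fit(xs, cum, degree)
    return poly.deriv()  # still a polynomial: an estimate of the density

# Example: an (almost) uniform sample on [0,1] gives a density close to 1.
dens = density_from_cumulative(np.linspace(0.0, 1.0, 101))
```

Evaluating the returned polynomial past the end of the history is the extrapolation step; its value there is the trend-adjusted density estimate.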

Later that year I tried to use it to design trend-seeking predictors for data compression ... along the way finding a simpler way with better behavior:

smoothen the sample with e.g. a Gaussian kernel ( https://en.wikipedia.org/wiki/Kernel_density_estimation ) and fit e.g. a polynomial to it ... it turns out that for mean-square fitting we can take the zero-width limit of the kernel ("fit to spikes": Dirac deltas), which leads to the optimal parameters.
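A hedged sketch of this zero-width-limit fitting (the basis choice and degree are my assumptions): with an orthonormal polynomial basis f_n on [0,1], the mean-square-optimal coefficients of the fit to the Dirac-delta "spikes" reduce to plain averages of f_n over the sample.

```python
import numpy as np
from numpy.polynomial import legendre

def orthonormal_basis_value(n, x):
    """f_n(x) = sqrt(2n+1) * P_n(2x-1): orthonormal Legendre basis on [0,1]."""
    c = np.zeros(n + 1)
    c[n] = 1.0
    return np.sqrt(2 * n + 1) * legendre.legval(2 * np.asarray(x) - 1, c)

def fit_density(sample, degree=4):
    """Mean-square fit of sum_n a_n f_n(x) to the sample's Dirac deltas:
    the optimal coefficients are simply a_n = mean of f_n over the sample."""
    return np.array([orthonormal_basis_value(n, sample).mean()
                     for n in range(degree + 1)])

def density(x, a):
    """Evaluate the fitted density sum_n a_n f_n(x)."""
    return sum(a[n] * orthonormal_basis_value(n, x) for n in range(len(a)))

# Example: for a uniform sample, a_0 = 1 and the higher coefficients vanish.
a = fit_density(np.linspace(0.0, 1.0, 101))
```

Note that a_0 = 1 always holds (f_0 = 1), which automatically gives normalization of the fitted density to 1.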

It is an extremely cheap and natural method for parametric density estimation, for example literally fitting a polynomial to a sample ... but surprisingly I couldn't find it in the literature (?), so recently I have returned to it and submitted it to arXiv yesterday: https://arxiv.org/pdf/1702.02144

However, while in theory it could be used for trend-seeking predictors for data compression ... it has a real problem: the fitted polynomial often extrapolates outside the [0,1] range allowed for probability ...

I am looking for a practical way to repair this ... ?

One possibility is to try to enforce normalization of the probability distribution to 1 ... but that is mathematically complex and could still extrapolate the probability to negative values.

A much better-looking possibility is to use what Matt calls "squash/stretch" functions (a bijection between [-infinity, infinity] and [0,1]), for example squash(x) = 1/(1 + exp(-x)), or an arctangent-based variant.

So instead of fitting polynomials, we should try to fit squash(polynomial) and then squash the extrapolation ... however, this currently requires costly numerical integration for the fitting ...

I wanted to try to start a discussion about improving the Howard-Vitter predictor.

Is it the only widely used adaptation method?

Have you seen approaches to improve it, maybe by adding some trend-seeking? Any ideas how to do it?