I thought that I knew what a "finite-context" model was, but now I'm not so sure. I'm not sure I properly understood what I read when I formed my understanding of the term, or whether (possibly) the author(s) were mistaken. Or maybe the term is used in two different ways by different people.

---

My understanding of a finite-context model is that it's basically a plain old PPM model, or something more or less equivalent to it---you (explicitly or implicitly) predict the next symbol based on the previous n symbols, for some finite n, in a very simple way: for whatever literal sequences of symbols you see, you predict the same symbols you've seen follow that symbol in the past, with probabilities proportional to the frequencies you've seen them follow those literal sequences in the past.

(I realize that this is not fully specified, because you get different predictions depending on what n is, and how you deal with conflicting predictions based on different-length sequences you're paying attention to, and whether higher-order models fail to predict, hence the whole PPMC fallback scheme with escape codes when you don't get a prediction based on the n previous symbols because you've never seen that sequence before, and PPM with Information Inheritance to avoid most of the escapes by using a lower-order model to predict in lieu of the higher-order model and avoid escapes, etc., etc...)

But that's not quite right, maybe, because I was under the impression that a similar static model based on global frequencies from looking a whole file is also a finite-context model. So if you just build the obvious (non-hidden) Markov model, embodying all the orders from the longest string repeat down to the shortest, that's also a finite-context model. So I've been thinking that "finite context model" basically means "non-hidden Markov model". Basically PPM or a static PPM-like model, where the probabilities are based on global frequencies rather than frequencies of things seen so far. (Or a dictionary algorithm, like LZ77 or LZ78 which is unobviously doing pretty much the same thing, or in a sense more obviously---they're clearly predicting that literal strings you've seen before are likely to repeat.)

---

But maybe it'stotally wrong, and finite-context doesnotmean predicting based on probabilities of symbols following literal sequences of symbols.

I'm trying to read a paper that talks about "finite context models" just being anything where the new context (after reading a symbol s) depends on the previous context, with an arbitrary function f(c,s) that maps the previous context c to the new one,... apparently f can be any function you want. That sounds like what "finite-context"oughtto mean, but I didn't think it was---I thought that "finite-context" was a more specific, badly-chosen term, meaning that you look at some finite literal sequence of previous symbols. (But maybe an indefinite number of them, not necessarily a fixed highest order.)

So now sounds like a finite-context model is any model with a finite number of contexts, constructed in whatever way you can think of, that predicts the next symbol based on some function or other of the previous context and the current symbol. (And some of those functions might have combinatorially explosive costs, as long Is that right?

Here's one of the ways it matters to me right now. I was under the impression that a move-to-front model isnota "finite context" model. It can predict things that PPM can't. PPM can only predict that literal strings it's seen before will repeat with some probability higher than strings it hasn't seen before. (Although since its contexts generally overlap, it may do a good job on novel strings that are locally not novel.)

But MTF can predict that some strings are more likely than others, because their constituents may repeat soon, in some order that's never been seen before. It is less "brittle" than PPM in that respect, but worse at predicting literal strings that do repeat (which they often do). Similarly---or so I thought---an algorithm that notices numerical similarities and predicts the next value using differencing (or whatever) would not be a finite-context model, either, because it's assigning probabilities to strings it may never have seen before.

Is that wrong? Is MTF a "finite-context" model, too, because the state of the move-to-front list is a function of the previous state of the move-to-front list and the current symbol? Is using differences to predict linearly-predictable numbers "finite context" as well, because you can do either by updating some model based on the previous state of the model and whatever you see next?

If so, is there a name for the distinction I'm talking about, between predicting literal string repeats (like LZ77, LZ78, and PPM) and predicting things based on noticing different kinds of patterns, like context mixers such (PAQ cmix), where you may notice skip-ngram patterns, arithmetic relationships, etc.?