What people seem to have systematically missed is that (as you've said before) MTF is just a transform, and like any handy transform it will make some regularities clearer and others more obscure. You don't want to use it directly as a model; you want to use it as a feature extractor that tells you which model is appropriate for the data at hand.
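To make the "just a transform" point concrete, here's a minimal move-to-front sketch (a generic MTF over symbols, not any particular compressor's variant): each symbol is replaced by its index in a recency-ordered table, so recently repeated symbols become small numbers, exposing some regularities while hiding others.

```python
def mtf(seq):
    """Move-to-front transform: emit each symbol's index in a
    recency-ordered table, then move that symbol to the front."""
    table = sorted(set(seq))           # initial symbol order
    out = []
    for s in seq:
        i = table.index(s)
        out.append(i)
        table.insert(0, table.pop(i))  # move to front
    return out

print(mtf("banana"))  # → [1, 1, 2, 1, 1, 1]
```

Note how the alternating a/n tail of "banana" collapses into a run of 1s: that regularity got clearer, while the identity of the symbols got more obscure.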
...
I think there's room for improvement, though, if you don't just use the output of MTF for prediction, as LZMA does (as I understand it). A better predictor could look at the actual data items, the match distances, and the MTF-of-match distribution, and combine that information intelligently. In the example above, it might notice that the extra A's in the pattern are what's causing the problem: the stride is really just 5, with a bobble at 4. It would know to just predict whatever's 5 bytes back, and ignore the bobble, because the bobble doesn't actually matter. (Accounting for it doesn't improve the prediction over just noticing the stride of 5.)
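A quick sketch of that "just read back 5 bytes" idea (with a made-up period-5 pattern containing a doubled symbol standing in for the example above, since the exact data isn't shown here):

```python
def stride_predict(data, stride):
    """Predict each byte as the byte `stride` positions back;
    return the fraction of positions predicted correctly."""
    total = len(data) - stride
    if total <= 0:
        return 0.0
    hits = sum(1 for i in range(stride, len(data))
               if data[i] == data[i - stride])
    return hits / total

# A period-5 pattern with a doubled A, so a greedy matcher would
# sometimes see short match distances even though the period is 5.
data = b"AABCD" * 20
print(stride_predict(data, 5))  # → 1.0 (the bobble doesn't matter)
print(stride_predict(data, 4))  # much worse
```

The stride-5 predictor is already perfect on this data, which is exactly why the bobble can be ignored.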
You could do that in an adaptive context mixer, if you had an MTF queue of recently seen match distances (not for LZ77 coding, just for prediction). You'd have a predictor for each match distance, and pick whichever one works best. The predictor for a stride of 5 would clearly dominate the predictors for 2 and 3.
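Here's a toy sketch of that scheme, with heavy caveats: it is not LZMA, the "match distances" come from a naive last-occurrence search standing in for a real LZ matcher, and the per-predictor weighting is a crude hit counter rather than a proper logistic mixer. It just shows the shape of the idea: an MTF queue of candidate distances, one back-reference predictor per distance, and adaptive selection of whichever predictor has been most reliable.

```python
def mix_predict(data, max_queue=8):
    """Count correctly predicted bytes when, at each position, we trust
    the queued match-distance predictor with the best running score."""
    queue = []   # MTF queue of recently seen candidate distances
    score = {}   # per-distance hit counter (a crude adaptive weight)
    hits = 0
    for i in range(1, len(data)):
        # Predict with the currently best-scoring queued distance.
        if queue:
            d = max(queue, key=lambda q: score.get(q, 0))
            if data[i] == data[i - d]:
                hits += 1
        # Update every queued predictor's score against the actual byte.
        for q in queue:
            score[q] = score.get(q, 0) + (1 if data[i] == data[i - q] else -1)
        # Naive stand-in for an LZ matcher: the distance to the nearest
        # previous occurrence of this byte goes to the front of the queue.
        j = data.rfind(data[i:i + 1], 0, i)
        if j != -1:
            d_new = i - j
            if d_new in queue:
                queue.remove(d_new)
            queue.insert(0, d_new)
            del queue[max_queue:]
    return hits

data = b"AABCD" * 20
print(mix_predict(data), "of", len(data))  # stride-5 predictor soon dominates
```

After a short warm-up the distance-5 predictor's score pulls ahead of the short "bobble" distances and stays there, so nearly every byte is predicted correctly.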