Hi
I recently took the initiative to create a deep-context entropy encoder, building on my recent experience with order-0 entropy coders.
This new software can use as many bytes of history as possible as context.
It is called ILE (Infinite Length Entropy), as it can use an unbounded context history (which of course is extremely memory hungry, but it nonetheless works on small files, or files with little variation such as log files...). The maximum context length can also be selected as an argument.
For a time, it looked as if the program was a benchmark winner, but there was a flaw: the initial version was "selecting" the context with the best probability for the next byte, but the "context selector" itself ended up costing much more than initially anticipated, sometimes even more than the data itself!
I naively thought that the deepest context was naturally the most relevant, but that proved not accurate enough.
Therefore, I switched from context selection to context mixing, with more consistent results.
The main observed effect is to lower the cost of "bad guesses", typically to one additional bit per wrong guess. Results are consistent, but average.
For example, Enwik8 with level 5 (= order 4) compresses to 26 MB.
With level 6 (= order 5), it results in 24 MB.
This is merely average when compared to the LTCB rankings.
Now, for the question:
On small enough files (or log files), I can trigger the unbounded depth level.
The surprise is that it can actually decrease compression.
This is unexpected, as I thought that memorizing larger contexts could only improve the guess rate.
This could be due to the context-mixing formula, which provides a strong bias in favor of the highest-depth context.
For insight, here is the proposed context-mixing formula.
At a given depth n:
P(n) = Freq(n)/Total(n) x (1-Error(n)) + P(n-1) x Error(n)
The idea is that as the number of samples grows, the frequencies stored at the current context become more and more relevant. When they are not relevant enough, the probability of the previous context (n-1) is used as a complement to get the final probability. This is recursive. The idea seems interesting, but results are apparently not that great.
Error evaluation is key to this approach. It is therefore calculated by taking into consideration the number of samples at the current context. I tried several different formulas, and mostly settled on one which provides quite "regular" results:
Error(n) = 0.5 / sqrt(Total(n))
Very deep or unbounded context depths get better results with the following formula: Error(n) = 1 / sqrt(Total(n)). But results worsen for smaller limits, such as L=5.
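
To make the recursion concrete, here is a minimal C sketch of how such a mix could be computed, starting from the shallowest context and walking up to the deepest order. The struct, function name, and the base probability are my own assumptions for illustration, not ILE's actual code:

#include <math.h>

/* Hypothetical per-symbol statistics for one context order
   (names are my own, not ILE's):
     freq  = occurrences of the candidate byte in this context,
     total = total number of times this context was seen.      */
typedef struct { unsigned freq; unsigned total; } ctx_stats_t;

/* Recursive mix as described above:
     P(n) = Freq(n)/Total(n) * (1 - Error(n)) + P(n-1) * Error(n)
   with Error(n) = 0.5 / sqrt(Total(n)).
   stats[0..max_order] hold the statistics from order 0 up to the
   deepest order; p_base is the starting probability when no context
   statistics exist at all (a uniform 1/256 would be a natural choice;
   the post does not say what ILE uses).                          */
static double mix_probability(const ctx_stats_t *stats, int max_order, double p_base)
{
    double p = p_base;
    for (int n = 0; n <= max_order; n++) {
        if (stats[n].total == 0)
            continue;                       /* unseen context: keep P(n-1) */
        double error = 0.5 / sqrt((double)stats[n].total);
        double p_n   = (double)stats[n].freq / (double)stats[n].total;
        p = p_n * (1.0 - error) + p * error;
    }
    return p;
}

In this sketch, an unseen deep context simply passes P(n-1) through unchanged; how ILE handles that case is not stated here.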
Well, anyway, it seems I'm now interested in context mixing, a new area I was not even aware of less than a week ago.
A few areas of interest:
1) The error calculation does not depend on the previous context's frequency. But maybe it should? After all, if the current context has a very low frequency count (for example 2 samples), and the previous one has a very high count (for example 5800 samples), it would seem much safer to use Context(n-1) as the main contributor to the probability estimation.
There is no such relation in the current formula (see the sketch after this list for one possible variation).
(Note that indirectly, there is a relation, as P(n) depends on P(n-1), which itself depends on Total(n-1) through Error(n-1). But it is not a strong one...)
2) The error does not depend on the current context depth level (n).
Should it?
After all, it seems that 'highest context is most relevant' becomes less and less correct at higher context depths. But is that only due to the lower number of samples?
3) There is no "backward feeding" nor learning process.
Should there be? And in which way? Could it be integrated into "Error", for example?
4) There is no "secondary evaluation". Note that I do not really understand what it means, except that the probability is further processed by a secondary system which "corrects" it, but it seems unclear how, why, etc.
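
Regarding point 1, one purely hypothetical variation (my own sketch, not something ILE does) would be to let the error term also depend on the parent context's sample count, so that a thinly populated deep context defers to a heavily populated shallower one:

#include <math.h>

/* Hypothetical alternative error term (an assumption for discussion,
   not ILE's formula): confidence in context n is discounted when the
   parent context (n-1) has far more samples.
   Example from point 1: Total(n)=2, Total(n-1)=5800 gives
   Error ~= 0.97, so P(n-1) dominates, whereas the original
   0.5/sqrt(Total(n)) would still give context n about 65% of the weight. */
static double error_vs_parent(unsigned total_n, unsigned total_prev)
{
    if (total_n == 0) return 1.0;   /* unseen context: rely entirely on P(n-1) */
    double parent_conf = sqrt((double)total_prev);
    return parent_conf / ((double)total_n + parent_conf);
}

When Total(n) and Total(n-1) are comparable, this falls off roughly like 1/sqrt(Total(n)), so it behaves similarly to the original formula in well-sampled contexts; whether it actually helps is of course an open question.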