So, I've been experimenting with the design of a new coder,
and got rather confusing results, so comments are quite welcome.
Well, at least I've got some proof for my claim that a bytewise
coder is better for text compression, since texts are built from
symbols, and a symbol's specific binary code doesn't mean much.
But still, my own binary models apparently give better overall
results - they're only insignificantly worse on texts, but much
better on binaries... and could probably be improved further by
separately adjusting the update speed for different bit positions.
Thus, I can only guess that intersymbol correlations are actually
not as weak as I thought... meaning that not only should the probability
of the just-encountered symbol be incremented, but it's also useful
to increment some other probabilities - and binary encoding
does that, though maybe not in an optimal way.
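To illustrate what I mean by binary encoding incrementing "some other probabilities" (a minimal Python sketch, not my actual coder): in a bit-tree model over bytes, updating for one symbol shifts the counters on its whole bit path, so any symbol sharing a bit prefix gets pushed up too - e.g. 'e' (0x65) and 'f' (0x66) share their top six bits.

```python
class BitTreeModel:
    """Bit-tree model over 8-bit symbols: one P(bit=1) counter per node,
    node index = (1 << depth) | bit_prefix."""
    def __init__(self, rate=5):
        self.p = [0.5] * 512   # all nodes start unbiased
        self.rate = rate       # update speed; could differ per bit position

    def prob(self, sym):
        # probability of a whole byte = product of its 8 bit probabilities
        ctx, result = 1, 1.0
        for k in range(7, -1, -1):
            bit = (sym >> k) & 1
            result *= self.p[ctx] if bit else 1.0 - self.p[ctx]
            ctx = ctx * 2 + bit
        return result

    def update(self, sym):
        # shift every node on the symbol's path toward the observed bits;
        # shared-prefix nodes are shared with other symbols
        ctx = 1
        for k in range(7, -1, -1):
            bit = (sym >> k) & 1
            self.p[ctx] += (bit - self.p[ctx]) / (1 << self.rate)
            ctx = ctx * 2 + bit
```

After a few updates for 'e', the model's probability for 'f' rises as well, even though 'f' was never seen - that's the implicit cross-symbol increment.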
Also, there was an additional surprise - my PPM-like implementation,
with inheritance and all (with a special escape symbol in the alphabet
for falling back to a lower order), performs worse than a more "dumb"
alternative where the escape is coded separately as a binary flag.
I guess that could be explained by the escape encoding having additional
parameters of its own, which can be tuned too.
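That effect can be shown with a toy sketch (all names here are made up, and the fallback is just a flat 8 bits rather than a real lower-order model) - the point is where the extra tunable parameter lives: the escape flag gets its own probability and its own update speed.

```python
import math

class FlagEscapeModel:
    """Toy context model where escape is a separately-coded binary flag
    with its own adaptive probability - a hypothetical sketch, not the
    actual implementation discussed above."""
    def __init__(self, esc_rate=5):
        self.freq = {}           # context -> {symbol: count}
        self.p_esc = {}          # context -> adaptive escape-flag probability
        self.esc_rate = esc_rate # separate, tunable update speed for escapes

    def cost_bits(self, ctx, sym):
        stats = self.freq.setdefault(ctx, {})
        p_esc = self.p_esc.get(ctx, 0.5)
        if sym in stats:
            total = sum(stats.values())
            bits = -math.log2(1.0 - p_esc) - math.log2(stats[sym] / total)
            esc = 0
        else:
            # escape flag, then a flat 8-bit code instead of a lower order
            bits = -math.log2(p_esc) + 8.0
            esc = 1
        # the escape flag adapts independently of the symbol counts
        self.p_esc[ctx] = p_esc + (esc - p_esc) / (1 << self.esc_rate)
        stats[sym] = stats.get(sym, 0) + 1
        return bits
```

Here `esc_rate` is exactly the kind of knob that an in-alphabet escape symbol doesn't give you.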
To sum this up, bytewise coders are very complex and really slow at that,
so now I've even got (unnecessary) experimental proof of how much better
binary coders are.
But, unfortunately, there are problems.
1. A context needs significantly more counters on average, and if one
wants to avoid allocating 255 counters for each context right after its
first occurrence, the only option is hashing... which isn't convenient
for lots of tasks.
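For reference, the hashing variant I mean looks roughly like this (a sketch with assumed constants): a fixed-size table of bit counters indexed by a hash of (context, node), so no context ever allocates its 255 counters up front - at the price of collisions silently sharing counters.

```python
class HashedCounters:
    """Fixed-size table of bit counters addressed by hash(context, node).
    Memory is constant regardless of how many contexts occur; colliding
    contexts just share a counter (the usual hashing trade-off)."""
    def __init__(self, bits=16):
        self.mask = (1 << bits) - 1
        self.p = [0.5] * (1 << bits)   # allocated once, never grows

    def slot(self, ctx, node):
        # multiplicative hash; the constants are arbitrary choices here
        h = (ctx * 2654435761 + node * 0x9E3779B1) & 0xFFFFFFFF
        return (h ^ (h >> 15)) & self.mask

    def prob(self, ctx, node):
        return self.p[self.slot(ctx, node)]

    def update(self, ctx, node, bit, rate=5):
        i = self.slot(ctx, node)
        self.p[i] += (bit - self.p[i]) / (1 << rate)
```

The inconvenience is visible right in the interface: you can't enumerate a context's symbols, merge contexts, or free them selectively - the counters have no address you can walk.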
2. There's no way around unary coding when SSE is to be used.
It's pretty obvious that the last symbol encountered in a context has
to be processed in a special way (binary contexts etc.), and the same
applies to other ranks as well.
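As a sketch of what I mean (hypothetical code, not from my coder): keep the context's symbols in move-to-front order and emit a unary sequence of "is it rank r?" flags - each rank's flag then gets its own probability, which is the natural place to plug in SSE.

```python
class RankCoder:
    """Unary coding by recency rank: symbols of a context are held in
    move-to-front order and a symbol is coded as flags 'is it rank 0?',
    'is it rank 1?', ... Each (context, rank) flag would get its own
    probability / SSE mapping in a real coder."""
    def __init__(self):
        self.mtf = {}   # context -> list of symbols, most recent first

    def flags(self, ctx, sym):
        # return the (rank, bit) decisions a coder would actually encode
        order = self.mtf.setdefault(ctx, [])
        out, found = [], False
        for r, s in enumerate(order):
            if s == sym:
                out.append((r, 1))   # hit at rank r ends the unary run
                found = True
                break
            out.append((r, 0))
        if not found:
            out.append(('new', 1))   # run exhausted: escape to "new symbol"
        if sym in order:
            order.remove(sym)
        order.insert(0, sym)         # move-to-front update
        return out
```

Rank 0 (the last symbol seen in the context) is coded by a single dedicated flag, which is exactly the "special way" above; lower ranks each get their own flag too.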
3. With text modelling, there are lots of cases where the alphabet
should be restricted. For example, consider a custom model for
preprocessed text, where a reduced charset is guaranteed by the preprocessor.
Of course, the codes of the allowed symbols could be mapped to a 0..n interval,
but binary encoding still wouldn't be that convenient with n != 2^k,
and there might be cases where the charset has to be switched dynamically
while preserving statistics (e.g. a model for a sorted dictionary).
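One possible direction (an assumed scheme, not something from my coder): decompose the symbol index over an arbitrary interval [0, n) by repeated halving, keying each counter by the (lo, hi) split itself rather than by a fixed bit position - then statistics attach to splits, and a charset switch keeps every split that remains valid.

```python
class RestrictedAlphabetCoder:
    """Binary decomposition of an index over [0, n) for arbitrary n,
    by repeated interval halving. Counters are keyed by the (lo, hi)
    split, so they are independent of the power-of-two layout."""
    def __init__(self, rate=5):
        self.p = {}       # (lo, hi) -> probability that index >= mid
        self.rate = rate

    def decisions(self, index, n):
        # return the ((lo, hi), bit) decisions a coder would encode
        lo, hi, out = 0, n, []
        while hi - lo > 1:
            mid = (lo + hi) // 2
            bit = 1 if index >= mid else 0
            out.append(((lo, hi), bit))
            p = self.p.get((lo, hi), 0.5)   # adaptive per-split counter
            self.p[(lo, hi)] = p + (bit - p) / (1 << self.rate)
            if bit:
                lo = mid
            else:
                hi = mid
        return out
```

For n = 5 a symbol costs two or three decisions instead of a full 8-bit path, and the decoder replays the same halving to recover the index.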
Well, any ideas on how to resolve these contradictions?