I quoted the last part of the ICQ log, with links, into the post
comments on ctxmodel, but somehow it looks much less attractive
to me like this.
I mean, I was interested in sequential modelling of data,
not tree-based, because hierarchical processing blocks
the possibilities for syntax parsing and for using
alternative submodels.
So I guess I'd go for the 2-stage architecture with context
maps + a small hash, or even a tree.
Anyway, I just need a data structure which would allow me
to keep bytewise statistics and would have a reasonable
workaround for memory overflows.
A hash is certainly not that.
And making a good tree would be too complicated, and not
really efficient, as it'd have to contain both occurrence
counters and probability counters (to provide a way to
remove the contexts going out of the window).
And then there's kinda nothing else.
Though, well, string search algorithms are also applicable
if it's a small block.
I mean, with a block size like 1k it wouldn't be that slow
to just search for higher-order contexts from the block
start.
There were also http://ctxmodel.net/rem.pl?-32
and http://shelwien.googlepages.com/stat3.txt
And today I made a dumb but working coder implementation
http://ctxmodel.net/files/tree_v1.rar
It's an order-16 bytewise coder with full context statistics
(the precise number of the context's occurrences in the already
processed data) and a single rangecoder call per byte,
which is rare these days.
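To illustrate the "single rangecoder call per byte" point (a sketch of the general technique, not the actual tree_v1 code): instead of 8 binary coder calls per byte, the byte's coding interval is derived once from the context's cumulative counts.

```cpp
#include <cstdint>

// Hypothetical sketch: one arithmetic-coder call per byte.
// The (low, freq, total) triple is computed from the context's
// exact byte counts and passed to a single rc.Encode(low,freq,total).
struct Interval { uint32_t low, freq, total; };

Interval byte_interval(const uint32_t freq[256], uint8_t sym) {
    uint32_t low = 0, total = 0;
    for (int i = 0; i < 256; i++) {
        if (i < sym) low += freq[i];   // cumulative count below sym
        total += freq[i];
    }
    return { low, freq[sym], total };
}
```

A real coder would keep the cumulative counts incrementally rather than summing 256 entries per byte; this just shows the interface.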
Alas, for now it's just another remake of the same scheme
as used in ppmy/ash.
+ It's much more portable and doesn't use floats
+ Tree implementation is compatible with 64-bit systems
(should be, at least)
+ 215.7k on book1 with a very primitive mixing implementation
- No tree overflow handling. However, with precise counters
it's possible to implement a sliding window.
- Even simple parsing (collection of statistics) is considerably
slow. It can be somewhat improved by using native
32-bit pointers for the tree, and other optimizations are
possible, but something very significant has to be changed
to make it practical... like using a faster secondary
structure for global statistics.
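The sliding-window point above can be sketched like this (a hypothetical illustration with a map in place of the actual tree): because the counters are exact, the contribution of a byte leaving the window can be subtracted back out, and a context whose counts reach zero can be dropped entirely.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch of sliding-window maintenance with precise
// counters: when a byte leaves the window, the (context, byte)
// counts it contributed are decremented, and emptied contexts
// are removed, bounding memory use.
struct WindowStats {
    std::map<std::string, std::map<uint8_t, uint32_t>> stats;

    void add(const std::string& ctx, uint8_t sym) { stats[ctx][sym]++; }

    void remove(const std::string& ctx, uint8_t sym) {
        auto it = stats.find(ctx);
        if (it == stats.end()) return;
        auto& f = it->second;
        if (--f[sym] == 0) f.erase(sym);
        if (f.empty()) stats.erase(it);   // context left the window
    }
};
```

This only works because the counters are exact occurrence counts; with decayed or rescaled counters the departing byte's contribution can't be reconstructed.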
But still, it's something different from paq: it provides access
to statistics like the number of symbols in a context and their
probability distribution, which are not easy to acquire with
bitwise hashes but are quite useful for text compression.
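For example (my own sketch, assuming exact per-context byte counts), the number of distinct symbols falls out directly and can feed a PPM-C style escape estimate, something a bitwise hash model can't easily provide:

```cpp
#include <cstdint>

// Hypothetical sketch: derive per-context summary values from
// exact byte counts. In a PPM-C style model the escape frequency
// equals the number of distinct symbols seen in the context, so
// P(sym) = freq[sym] / (total + distinct).
struct CtxInfo { uint32_t distinct, total; };

CtxInfo ctx_info(const uint32_t freq[256]) {
    CtxInfo c = { 0, 0 };
    for (int i = 0; i < 256; i++) {
        if (freq[i]) c.distinct++;   // symbol seen at least once
        c.total += freq[i];
    }
    return c;
}
```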
So at least it's applicable to the Hutter challenge, despite its
poor speed.