I finally made the toolkit which was mentioned in my old post here:
http://nishi.dreamhosters.com/u/contrepl_v1.rar
Current results for book1 are in the table further down.

MM> Instead, the context model should develop lexical, semantic, and grammar models,
Btw, my approach to HP is a multi-pass scheme which can potentially
take these into account.
The idea is that instead of sequential compression of all data,
we can move some information from input data to archive iteratively,
also with custom models for specific cases.
It's supposed to be something like a controlled regexp engine -
we write a replacement regexp and its inverse - eg. s/\n/ /
and s/ /\n/ and compress the flags which tell the decoder where
to apply the inverse regexp for lossless restoring of the data.
Obviously the flags are only generated for positions where
the inverse regexp matches at all, ie for all spaces in processed
text in the example above.
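
A minimal sketch of the s/\n/ / example above, in Python rather than the perl used in the toolkit. The function names `patch` and `restore` are made up for illustration and are not taken from contrepl; the flags are kept as a plain list here, while the real scheme would compress them.

```python
import re

def patch(text):
    # Forward regexp: s/\n/ /
    patched = re.sub(r"\n", " ", text)
    # One flag per position where the inverse regexp (a single space) matches
    # in the patched text: 1 = this space was a newline, 0 = it was a space.
    # The replacement keeps the length, so positions line up with the original.
    flags = [int(text[m.start()] == "\n") for m in re.finditer(r" ", patched)]
    return patched, flags

def restore(patched, flags):
    # Inverse regexp: s/ /\n/, applied only where the flag is set.
    it = iter(flags)
    return re.sub(r" ", lambda m: "\n" if next(it) else " ", patched)

if __name__ == "__main__":
    sample = "first line\nsecond line with spaces\n"
    patched, flags = patch(sample)
    print(repr(patched), flags)
    print(restore(patched, flags) == sample)
```

The decoder needs only the patched text and the flag stream, so the two can be modelled and compressed separately.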
Benefits are:
- reduced memory usage
- bi-directional contexts
- separately tuned models
- easy incremental development
As to grammar models, the simplest workaround here would be
eg. to replace all verbs with "verb" and all nouns with "noun"
etc, and we'd end up with plain sentence structure.
Though of course it would also be useful to generate additional
input streams (also by processing original input with regexps).
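
A rough sketch of the word-class idea, assuming a toy lexicon. `WORD_CLASS`, `split_structure` and `merge` are hypothetical names, and a real version would need a proper dictionary or tagger. The replaced words go to a separate stream, and a literal "noun" or "verb" in the input is logged as well so that restoring stays lossless.

```python
import re

# Hypothetical toy lexicon, for illustration only.
WORD_CLASS = {"cat": "noun", "dog": "noun", "mat": "noun",
              "sat": "verb", "chased": "verb"}
CLASSES = set(WORD_CLASS.values())
WORD = r"[A-Za-z]+"

def split_structure(text):
    words = []  # side stream with the words that were replaced

    def tag(m):
        w = m.group(0)
        if w in CLASSES:
            # Literal "noun"/"verb" in the input: keep it, but log it too,
            # so the decoder consumes the side stream consistently.
            words.append(w)
            return w
        cls = WORD_CLASS.get(w)
        if cls is None:
            return w  # unknown word stays in the main stream
        words.append(w)
        return cls

    return re.sub(WORD, tag, text), words

def merge(structure, words):
    it = iter(words)
    # Every class tag in the structure stream takes the next logged word.
    return re.sub(WORD, lambda m: next(it) if m.group(0) in CLASSES else m.group(0),
                  structure)

if __name__ == "__main__":
    s, w = split_structure("the dog chased the cat")
    print(s, w)
    print(merge(s, w))
```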
Code:
                     c.bat   c0.bat
book1.out           768771   768771  // patched book1
book1.out_ari       197926   197926  // compressed patched book1
book1.flg           142173   142173  // uncompressed flags
book1.flg_ari         5676     8163  // compressed flags
book1.ari           205688   205688  // original book1 (compressed)
book1.out+flg_ari   203610   206097  // patched book1 with restore flags (compressed)

The difference here is that the "c.bat" script compresses the flags
using actual data (book1.out here) as context,
while "c0.bat" compresses them as an independent file
(best paq8 results are also around 70xx there).
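
To illustrate why the data context helps, here is a small sketch that estimates the ideal code length of the flags in two ways: with a single shared counter (roughly the c0.bat situation) and with counters selected by the character before each space in the patched text (a much weaker stand-in for what c.bat does). This is an assumption-laden toy with order-1 left context and plain adaptive counts, not the o5 model from mod_CM, and the names are made up.

```python
import math
from collections import defaultdict

def adaptive_cost_bits(flags, contexts):
    # Ideal code length under adaptive, Laplace-smoothed per-context counts.
    counts = defaultdict(lambda: [1, 1])  # [count of 0s, count of 1s]
    bits = 0.0
    for flag, ctx in zip(flags, contexts):
        c = counts[ctx]
        bits -= math.log2(c[flag] / (c[0] + c[1]))
        c[flag] += 1
    return bits

def flag_costs(original):
    patched = original.replace("\n", " ")
    pos = [i for i, ch in enumerate(patched) if ch == " "]
    flags = [int(original[i] == "\n") for i in pos]
    # Independent file: the coder sees only the flags themselves.
    independent = adaptive_cost_bits(flags, [None] * len(flags))
    # Data as context: the character left of each space selects the counter.
    left = [patched[i - 1] if i > 0 else "" for i in pos]
    with_text = adaptive_cost_bits(flags, left)
    return independent, with_text

if __name__ == "__main__":
    sample = "the quick brown fox jumps.\n" * 20
    ind, ctx = flag_costs(sample)
    print(f"independent: {ind:.1f} bits, with text context: {ctx:.1f} bits")
```

On text where line breaks correlate with the preceding character, the context version should come out cheaper; on real data the gap depends on the model.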
Don't test this on enwik as is, since the perl scripts take too much time
when there are many matches, and LF-to-space hurts compression on
enwik anyway, since it's not plaintext.
There's another example in http://nishi.dreamhosters.com/u/contrepl_v0.rar
("don't" and "do not" - see book1.cfg), that should work on enwik8,
but might not improve compression.
So, now we need good examples of replacement regexps for enwik8,
which would let us incrementally extract information from enwik to flag streams,
in such a way that after replacement enwik8 could be compressed better,
and compressed flags would take less space than the gain.
As the book1 example above shows, it is possible.
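
The criterion can be checked directly on the book1 numbers from this post. The snippet below only redoes that arithmetic; the constant and function names are made up for illustration.

```python
# Compressed sizes in bytes, copied from the book1 results in this post.
ORIGINAL = 205688                 # book1.ari
COMBINED = {
    "c.bat": 203610,              # book1.out+flg_ari, flags coded with data as context
    "c0.bat": 206097,             # book1.out+flg_ari, flags coded as an independent file
}

def net_gain(combined, original=ORIGINAL):
    # Positive means the patched text plus flags is smaller than the original.
    return original - combined

if __name__ == "__main__":
    for script, size in COMBINED.items():
        print(script, net_gain(size))
```

So with c.bat the replacement pays off by 2078 bytes, while with c0.bat the flags cost more than the gain and the total is 409 bytes worse.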
Meanwhile I'll continue improving the flag coder -
atm it just uses the same o5 model from mod_CM,
so higher orders, right-side context and SSE
still have to be added.
Of course it would also make sense to implement actual replacements
in C++ too (for speed), but that can wait until we have more examples.