Still quite a draft, but it has a more serious order-9 PPM model.
This version has a special compile for the Intel Core 2 Duo CPU - utilizing some new CPU features, PPMX runs faster!
Code:
ENWIK8      ->  22,580,291 bytes (1.806 bpb)
ENWIK9      -> 194,298,469 bytes (1.554 bpb)
bookstar    ->   9,613,763 bytes (2.161 bpb)
book1       ->     227,280 bytes (2.365 bpb)
world95.txt ->     493,122 bytes (1.320 bpb)
fp.log      ->     625,450 bytes (0.243 bpb)
3200.txt    ->   3,927,886 bytes (1.962 bpb)
bible.txt   ->     764,447 bytes (1.511 bpb)
It's a bit stronger, but almost twice as slow (on bookstar)...
~61 s vs. 33.422 s
For comparison, performance on my machine (Core 2 Duo @ 2.40 GHz, 2 GB RAM, Windows XP SP2):
bookstar
PPMX v0.01 - 18 sec
PPMX v0.02 - 31 sec
PPMX v0.02CORE2DUO - 26 sec
Thanks Ilia!
Could you post some technical information about your current scheme? I'm just curious.
A quite baseline PPM: the maximum model order is 9, with symbol masking (full exclusion), no SEE, SSE, etc. I use hashing, as you noticed before. The escape strategy is different for each model order - a PPMD-like variant for the highest orders and a PPMC-like one for the lowest...
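PPMX's source isn't public, so here is only a minimal sketch of what a hashed, order-descending PPM lookup with full exclusion might look like; the table layout, hash function and all names are assumptions, not PPMX's actual code:

Code:
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical hashed context record: per-symbol counts plus a tag
// to detect collisions. Real PPM implementations pack this tighter.
struct Context {
    uint32_t check = 0;       // full hash, used as a collision tag
    uint8_t  freq[256] = {};  // per-symbol counts
};

const int TBITS = 18;  // 256K slots, sized for illustration only
std::vector<Context> table(1u << TBITS);

// Hash the last 'order' bytes before 'pos' (FNV-1a, just as an example).
uint32_t HashContext(const uint8_t* buf, int pos, int order) {
    uint32_t h = 2166136261u;
    for (int i = pos - order; i < pos; i++)
        h = (h ^ buf[i]) * 16777619u;
    return h ^ order;  // keep contexts of different orders apart
}

// One modelling step: descend from order 9 to order 0, masking
// symbols already seen in the contexts that were escaped from.
void ModelStep(const uint8_t* buf, int pos) {
    bool excluded[256] = {};
    for (int order = 9; order >= 0; order--) {
        if (pos < order) continue;
        uint32_t h = HashContext(buf, pos, order);
        Context& cx = table[h & ((1u << TBITS) - 1)];
        if (cx.check != h) {  // empty slot or collision: start fresh
            cx.check = h;
            memset(cx.freq, 0, sizeof cx.freq);
        }
        // ...code the next symbol from cx.freq, skipping excluded[];
        // if it escapes, mask everything this context has seen:
        for (int s = 0; s < 256; s++)
            if (cx.freq[s]) excluded[s] = true;
    }
}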
I started with a tree... But hashing is far more flexible - I can use models of any order; additionally I can skip some "redundant" orders. In addition, no need for model rebuilding or reset!
I'm experimenting with many PPM tricks - SEE, II (Information Inheritance), new counters, etc.
I want extra compression (to finally beat PPMd) at the same or even faster speed. And it is possible. I already made a stronger version with a more advanced escape strategy that runs at the same speed, but the overall difference was quite small, so I kept the basic variant - for me it's easier to add real SEE to such a scheme than to rewrite parts of the code...
> > And do you have a plan to implement a tree-like approach?
> I started with a tree...
Really? And what happened?
> But hashing is far more flexible -
I don't see how it's more flexible.
It's just very simple to write.
And even that holds for bitwise models only.
Although in fact, there won't be much
difference from a tree once you (finally) start
using variable-length context nodes.
The main difference would be that you'd still have
to compute the hash functions, while with a tree
you can just follow a pointer.
> I can use models of any order;
Which is quite possible with a suffix tree as well.
You can store the "ode" context's address in the suffix
pointer of the "encode" context, and there you'd have
your escape from o6 to o3.
> additionally I can skip some "redundant" orders.
There's a matter of masked contexts though, but that'd
only require some intermediate structures, like
suffix pointer arrays or something.
And only for the cases when a lower order context
has some bits of data which are removed from a higher
order context.
Anyway, it's not really necessary to regard this
structure as a tree... it's just a set of context
nodes with links between them, which let you reach
all the context nodes incrementally (with a single
pointer lookup per symbol) while parsing the data.
Technically, a hash has the benefit of not keeping
these pointers, but in exchange it has to be very
redundant to provide good speed.
Another possible benefit is the ability to
access each symbol's nodes separately, which
significantly simplifies allocation -
but again, that would thoroughly slow down any
normal bytewise model.
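A minimal sketch of such a "set of context nodes with links" (all names are hypothetical; statistics and allocation are omitted):

Code:
#include <cstdint>
#include <unordered_map>

// One context node; 'suffix' points at a shorter context, which is
// also where an escape lands. It need not be exactly one order lower:
// retargeting it gives the o6 -> o3 escape mentioned above.
struct Node {
    std::unordered_map<uint8_t, Node*> next;  // transitions by symbol
    Node* suffix = nullptr;                   // escape / suffix link
    // ...per-symbol statistics would live here
};

// Incremental parse: one pointer lookup per symbol keeps 'cur' at the
// longest tracked context, and suffix links reach all shorter ones.
Node* Step(Node* cur, Node* root, uint8_t c) {
    for (Node* p = cur; p != nullptr; p = p->suffix) {
        // ...update the statistics of context p with symbol c
    }
    auto it = cur->next.find(c);
    return it != cur->next.end() ? it->second : root;  // simplified fallback
}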
> In addition, no need for model rebuilding or reset!
That's really the problem.
But it's just a matter of more programming.
However, model reset doesn't actually require any programming,
and model cutoff can be implemented by reparsing a block
of data after resetting the stats (I think that would be
faster, seeing how much time ppmd's tree processing with
-r1 takes).
And then, it's also easy enough to implement a sliding window
(though there are doubts about the performance of this approach).
In any case, like this you always know what kind of statistics
you have and why, which is not really possible with a hash.
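As an illustration of that cutoff-by-reparsing idea (a sketch under assumed interfaces, not ppmd's actual -r1 logic):

Code:
#include <cstdint>
#include <cstddef>

// Stand-ins for a real PPM model's operations.
struct Model {
    void Reset() { /* drop all statistics, free the node pool */ }
    void Update(const uint8_t*, size_t) { /* stats-only pass, no coding */ }
    bool OutOfMemory() const { return false; /* real check: pool exhausted */ }
};

const size_t WARMUP = 1 << 20;  // how much recent data to replay; illustrative

// When memory runs out, instead of restructuring the tree, reset
// everything and replay the most recent WARMUP bytes through the
// model to rebuild fresh statistics before coding continues.
void MaybeCutoff(Model& m, const uint8_t* buf, size_t pos) {
    if (m.OutOfMemory()) {
        m.Reset();
        size_t start = pos > WARMUP ? pos - WARMUP : 0;
        m.Update(buf + start, pos - start);  // reparse the recent block
    }
}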
As you mentioned before, it's not so easy to deal with trees... At some point, I could end up spending more time fighting tree structures rather than experimenting with actual PPM modeling.
With my hash tables I may change one digit to choose any model order, easily skipping from order-7 to order-3, or from order-12 to, say, order-5... Just in one click! I bet it wouldn't be so easy with a suffix tree... That's what I call flexibility!
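To illustrate the kind of one-digit edit meant here (a hypothetical order list, not PPMX's actual code):

Code:
#include <cstdio>

// Each order is hashed independently, so skipping "redundant" orders
// or changing an order is literally a one-digit edit in this table:
const int ORDERS[] = { 12, 5, 3, 1 };  // e.g. change 12 to 7, or drop the 5

int main() {
    for (int order : ORDERS)
        printf("look up the order-%d context\n", order);  // hash & query
}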
Anyway, I think I should keep an eye on the actual modeling, say, on things like counters or, finally, a proper SEE implementation.
In other words, I'm too lazy to fight with complex structures and weird memory allocation... and prefer to spend my time reinventing the wheel!
http://cs.fit.edu/~mmahoney/compression/text.html#1943
Version 0.02 compresses much better, but slower. The Core 2 Duo version runs faster on my new computer (Pentium T3200, dual core, 32-bit mode) than the regular version, but the regular version has a smaller .exe, so I tested with that one. I tested both on enwik8, and both run on only one core.
For a speed comparison, see lpaq9l which I tested on my old and new computers. The new one is about 10% faster on one core.
> At some point, I could end up spending more time fighting tree structures
> rather than experimenting with actual PPM modeling.
Well, I got stuck at that stage myself, as there's no known "perfect"
implementation of a "statistical parser".
But it's like that for a good reason too - the model design is highly
affected by the kinds of available statistics.
For example, PPM and CM approaches work with completely different
statistics, and the choice to use some nonstationary counters
instead of direct numbers of occurrences basically blocks the
possibility of a sliding window implementation.
> With my hash tables I may change one digit to choose any model order,
> easily skipping from order-7 to order-3, or from order-12 to, say, order-5...
> Just in one click! I bet it wouldn't be so easy with a suffix tree...
Well, you could do just that with a tree as well.
And with the kind of memory consumption you have,
it'd be pretty much ok to keep full statistics for all
orders, and just skip some of them.
The only real complication could be with non-uniform
contexts (masked or sparse or constructed in some other way),
which you don't yet use anyway.
Though I guess it's always possible to use hashes for
some submodels where they're really convenient
(like models predicting a single symbol).
> That's what I call flexibility!
That's ok for now I guess, but somehow I think that to beat
ppmd in speed you'd need a tree anyway, and then building
a tree with exactly the same properties as your hash system
would be really problematic.
That's basically why I don't try to skip that stage myself.
And I did try out the bitwise approach too (in o6mix), just
to make sure, and it still seems that it doesn't really
work as a workaround for bytewise processing.
Of course we have some good examples of how competitive
bitwise models are in some range of the speed/compression/memory
balance, but imho it's also obvious that they're quite
close to the limit there, leaving very little space
for improvement. While with something like ppmd it's
not obvious at all, considering fast compression, and
even less obvious with ppmonstr, considering high compression.
However, the bytewise approach itself is problematic too, as
larger alphabets are easily available (like unicode, or
numbers, or words in text); but anyway it's still much
better than bitwise in these cases, and I don't have
even the slightest idea how to implement a real word-wise
model (it's not only a matter of counting word occurrences
and encoding a word from a dictionary; the main problem is
that probability estimations, updates, inheritance and such
all have to be implemented in some unknown but complex way).
> Anyway, I think I should keep an eye on the actual modeling, say,
> on things like counters or, finally, a proper SEE implementation.
So, did you already try finding a good example of a context history
and separately (from the PPM) developing a model for it?
In fact, it's the easiest and most convenient way to do it
(dumping a context history into a file and making a standalone
compressor for it), as that way you don't have to care
about a lot of stuff (of course, all the extra info, like
the contexts themselves, has to be dumped as well).
Btw, escapes can be replaced with some unused symbol, and
it's also useful to store a lot of context histories
with terminators in a single file, and process that file
with a single simple model that flushes on these terminators.
That way you can even get the same output size as with
a full compressor, just processing the histories sequentially
instead of in parallel, without having to care about
memory management and such at the same time.
It's also possible to use other compressors for estimation.
Anyway, that's how I do it.
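A sketch of the dumping side of this workflow (order-2 contexts; the terminator value and file layout are assumptions, not a fixed format):

Code:
#include <cstdio>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// 0xFF stands in for an unused symbol, doubling as the terminator.
const uint8_t TERMINATOR = 0xFF;

// Collect the history of every order-2 context seen in 'data', then
// write all histories into one file, each preceded by its context and
// followed by a terminator that the standalone model can flush on.
void DumpHistories(const std::vector<uint8_t>& data, const char* path) {
    std::map<std::string, std::string> hist;
    for (size_t i = 2; i < data.size(); i++) {
        std::string cx(data.begin() + i - 2, data.begin() + i);
        hist[cx] += (char)data[i];  // append the next symbol seen here
    }
    FILE* f = fopen(path, "wb");
    if (!f) return;
    for (auto& [cx, h] : hist) {
        fwrite(cx.data(), 1, cx.size(), f);  // dump the extra info too
        fwrite(h.data(), 1, h.size(), f);
        fputc(TERMINATOR, f);
    }
    fclose(f);
}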
> In other words, I'm too lazy to fight with complex structures,
> weird memory allocation... ...and prefer to spend some time
> for reinventing the wheel!
Of course it's your choice; just remember that any model really depends on
the kind of available statistics, and that is affected by the
data structure used - things like counter size, availability of context
totals, or hash collisions.
Thus, especially if it's a fast model (= simple and not very adaptive),
it's better to always remember that it'd probably be almost
unusable when attached to any other source of statistics.
@Matt Mahoney:
Btw, I always wanted to ask, did/do you perform any analysis on
context histories stored in separate files?
And also are all these numeric constants found manually or
automatically?
Thanks Matt!
To be honest, I never tried it. I tune and play with all model parameters on the fly, within the PPM model itself. I think at some stage I will try it!
BTW, I'm thinking about sparse and word models. In PAQ terms: apart from the standard N-gram models, I could also use sparse and word models. But while PAQ does mixing - i.e. it uses predictions from ALL models simultaneously - with my PPM I have to choose an exact model sequence, say:
word model
order-7 model
order-5 model
sparse model
order-3 model
etc.
i.e. if a model can't predict the symbol, it emits an escape and moves on to the next model. Additionally, I could use some kind of LOE (Local Order Estimation) heuristic to choose the best initial model. This would be really interesting!
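A sketch of such an escape chain (the submodel interface is hypothetical):

Code:
#include <cstdint>
#include <vector>

// Hypothetical submodel interface: each model either codes the symbol
// or escapes to the next one in the chain. On escape it is expected to
// mark the symbols it predicted in excluded[], implementing masking.
struct SubModel {
    virtual bool TryCode(uint8_t c, bool* excluded) = 0;  // false = escape
    virtual ~SubModel() = default;
};

// The fixed sequence from the post: word, order-7, order-5, sparse,
// order-3, ... A LOE-style heuristic could instead pick the starting
// index per symbol rather than always beginning at the word model.
void CodeSymbol(std::vector<SubModel*>& chain, uint8_t c) {
    bool excluded[256] = {};
    for (SubModel* m : chain)
        if (m->TryCode(c, excluded))
            return;  // coded successfully, stop descending
    // fell through every model: code c flat over the unmasked symbols
}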
Still, there are some doubts about the compatibility of non-ordered
submodels with escaping. Certainly LOE would be useful, but
even with it, really good results are not guaranteed.
Well, I guess it's still worth experimenting.
Btw, for reference, the only PPM with non-ordered submodels
known to me is ppmonstr, and the sparse submodel there
predicts the next symbol instead of a probability;
a comparison with that symbol is used in SSE contexts.
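Here is one way to read that description (a minimal sketch; the mapping layout and names are assumptions about ppmonstr, not its actual code):

Code:
#include <cstdint>

// The sparse submodel contributes a predicted symbol, not a probability;
// the flag "does the candidate match it" selects a separate SSE mapping.
struct SSE {
    uint16_t map[2][64];  // [match flag][quantized p] -> adjusted p
    uint16_t Adjust(bool sparseMatch, int pq) const {
        return map[sparseMatch][pq];
    }
    // ...the selected cell would also be updated after coding
};

uint16_t Estimate(const SSE& sse, uint8_t sparsePrediction,
                  uint8_t candidate, int pq) {
    return sse.Adjust(candidate == sparsePrediction, pq);
}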
BTW, can you explain the difference between PPMDet (where we keep only one symbol per context) and unary coding (ppmonstr-styled)?
Unary coding loops in ppmonstr look nearly the same as those in ppmd;
the only difference is that the symbol's probability is adjusted.
And btw, it's not necessary to have an rc call per each symbol, but then
some interval operations similar to rc have to be used anyway:

Code:
s = total;
for( i=0; i<NSymbols; i++ ) {
  p0 = freq[i]/s;         // probability under the remaining mass
  s -= freq[i];           // mask the processed symbol
  p  = SSE( p0 );         // adjust via secondary estimation
  flag = (c==symbol[i]);  // unary question: "is it this symbol?"
  rc.Process( flag, p );  // one binary rc call per candidate
  if( flag ) break;
}
Looks like you missed my automatic parameter optimization via a genetic algorithm. Note that quantization mappings, context masks and things like that can be optimized as well. Have a look at this: http://encode.su/forum/showthread.php?t=159
BTW: did anybody ever try to integrate the optimizer into a different program?
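For a flavour of the approach (a minimal sketch, not toffer's actual optimizer): treat the coder's tunables as a genome and score candidates by compressed size. CompressedSize() is an assumed hook into the compressor.

Code:
#include <cstdlib>
#include <vector>

// Assumed hook: compress a test file with these parameters, return the size.
size_t CompressedSize(const std::vector<int>& params);

// Tiny (1+1)-style evolutionary loop: mutate one parameter at a time and
// keep the child only if it compresses better. A real GA adds populations
// and crossover, but the fitness function is the same idea.
std::vector<int> Optimize(std::vector<int> best, int generations) {
    size_t bestSize = CompressedSize(best);
    for (int g = 0; g < generations; g++) {
        std::vector<int> child = best;
        size_t i = rand() % child.size();
        child[i] += (rand() % 2 ? 1 : -1) * (1 + rand() % 8);  // mutate
        size_t s = CompressedSize(child);
        if (s < bestSize) { best = child; bestSize = s; }      // select
    }
    return best;
}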
epm (described at http://cs.fit.edu/~mmahoney/compression/text.html#1749 ) is one of the few compressors I know of that use automatic parameter optimization.
The first archiver with a public parameter optimizer was, in fact, bee
by Andrew Filinsky: http://compression.ru/fa
It's been there since at least 1999-2000, and it even has a config with
different parameter strings per file extension or something.
http://compression.ru/fa/files2/bee077src.rar
(0.7.8 doesn't include a compiled beeopt for some reason.)
So, it seems only 5 authors use automatic parameter optimization in the compression area.
If someone wants, it's possible to continue bee. It's open-source (GPLv2) and has a good GUI.
I'm working hard on PPMX v0.03. I have already rewritten the entire PPMX code, added many optimizations and a new escape strategy, making PPMX both notably faster and stronger.
Code:
ENWIK8 ->  22,489,310 bytes
ENWIK9 -> 193,275,280 bytes
To be continued...