Anyway, some testing results, for reference:
Test file: cracklib.txt (16,326,038 bytes)
Plain coding with a single counter table at each point, but taking the actual data structure into account.
For example, consider order-0 coding of word indexes: it's still order-0 coding, although the alphabet is a bit large,
yet compression would be much better than byte-wise order-0. The catch is that you need to parse the specific
data structure to find these words in the first place.
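As an illustration of coding word indexes with a single adaptive counter table, here is a minimal hypothetical sketch (not code from any tool mentioned here): it estimates the coded size of a word stream with order-0 statistics over the word alphabet; a real codec would feed the same counts to a range coder.

    // order0_words.cpp - hypothetical sketch: order-0 modeling over word
    // indexes instead of bytes. Estimates coded size in bits with an
    // adaptive counter table.
    #include <cstdio>
    #include <cmath>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    int main() {
        std::string text = "the cat sat on the mat the cat ran";
        std::map<std::string, int> ids;   // word -> index
        std::vector<int> stream;          // input as word indexes
        std::istringstream in(text);
        std::string w;
        while (in >> w) {
            auto it = ids.emplace(w, (int)ids.size()).first;
            stream.push_back(it->second);
        }
        // Adaptive order-0 model over the word alphabet, counts start at 1.
        std::vector<unsigned> freq(ids.size(), 1);
        unsigned total = (unsigned)ids.size();
        double bits = 0;
        for (int sym : stream) {
            bits += -std::log2((double)freq[sym] / total);
            freq[sym] += 1;               // update after coding
            total += 1;
        }
        std::printf("%zu words, %zu distinct, ~%.1f bits (%.2f bits/word)\n",
                    stream.size(), ids.size(), bits, bits / stream.size());
        return 0;
    }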
That's probably a dictionary.
Normally you may apply something like this http://compression.ru/sh/dictpack3.rar
before compressing a dictionary.
Can't run 16-bit programs under my 64-bit W7 (too lazy to run all that stuff under DOSBox).
There was only one DOS exe, and it's not the one you actually need.
There are also sources.
Anyway, here's a version with replaced turn.exe (source included):
http://nishi.dreamhosters.com/u/dictpack3b.rar
While uploading it, I happened to find this:
http://nishi.dreamhosters.com/u/dictpack3a.rar
http://encode.su/threads/564-a-very-...ll=1#post11200
Test file: english.dic
And what if we compress a Chinese dictionary? PPM is universal - it will compress a text file or a dictionary in any language.
The same approach (prefix/suffix cutting) would work with any sorted wordlist, Chinese or whatever.
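A minimal sketch of the prefix-cutting idea, under the assumption that the wordlist is sorted (names are illustrative): each word is stored as the number of leading characters it shares with the previous word, plus the remaining suffix.

    // frontcode.cpp - hypothetical sketch of prefix cutting (front coding)
    // for a sorted wordlist.
    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        std::vector<std::string> words = {"abandon", "abandoned", "abandons",
                                          "abate", "abbey"};
        std::string prev;
        for (const auto& w : words) {
            size_t n = 0;                      // shared prefix length
            size_t lim = std::min(prev.size(), w.size());
            while (n < lim && prev[n] == w[n]) ++n;
            // Emit <prefix-len><suffix>; a real tool would pack these tighter.
            std::printf("%zu%s\n", n, w.c_str() + n);
            prev = w;
        }
        return 0;
    }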
So does indirect modeling (context -> bit history -> prediction). In a sorted list it is common for many contexts to see a string of 0 bits followed by a string of 1 bits, but not the other way. An indirect model will learn different probabilities for these.
english.dic (4,067,439 bytes):
bsc -> 1,177,373 (BWT)
bcm -> 1,165,350 (BWT)
zip -9 -> 1,050,725 (LZ77)
ppmd -> 787,319 (PPM order 4)
zp c1 (fast.cfg) -> 561,378 (CM, indirect, order 2,4)
ppmonstr -> 515,592 (PPM order 12)
zp c2 (mid.cfg) -> 490,804 (CM, indirect, order 0,1,2,3,4,5, match, mix)
zp c3 (max.cfg) -> 450,707 (CM, like mid.cfg + sparse, word, and SSE; 22 PAQ-like components)
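To illustrate the indirect modeling mentioned above, here is a minimal hypothetical sketch (not taken from zpaq's actual model; all names and constants are illustrative): a context selects a bit history, and the history, not the context itself, selects the adaptive probability, so "0s then 1s" and "1s then 0s" histories learn different predictions.

    // indirect.cpp - hypothetical sketch of indirect modeling:
    // context -> bit history -> prediction.
    #include <cstdint>
    #include <cstdio>

    int main() {
        uint8_t history[256] = {0};   // per-context 8-bit bit histories
        uint16_t p[256];              // history -> P(bit=1), 12-bit fixed point
        for (int i = 0; i < 256; ++i) p[i] = 2048;   // start at 0.5

        // Toy input for one context: a run of 0 bits, then a run of 1 bits.
        int bits[] = {0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1};
        int ctx = 0;
        for (int bit : bits) {
            uint8_t h = history[ctx];
            int pr = p[h];                             // indirect prediction
            std::printf("history=%02x P(1)=%.3f bit=%d\n",
                        (unsigned)h, pr / 4096.0, bit);
            p[h] = (uint16_t)(pr + (((bit << 12) - pr) >> 4)); // adapt toward bit
            history[ctx] = (uint8_t)((h << 1) | bit);  // extend the history
        }
        return 0;
    }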
english.dic can be compressed from 4,067,439 to 546,752 bytes using bsc with the ST4 transform and preceding contexts, without any other preprocessing filters.
If you replace each newline in english.dic with 5 spaces, then bsc -m2 -cp (ST5, preceding contexts) compresses it to 513,841 bytes, but with following contexts it compresses to 1,102,222.
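A minimal sketch of that newline transform as a stdin-to-stdout filter (the filter itself is my assumption of how it was done; bsc options are as given above):

    // nl2sp.cpp - replace each newline with 5 spaces.
    #include <cstdio>

    int main() {
        int c;
        while ((c = std::getchar()) != EOF) {
            if (c == '\n') std::fputs("     ", stdout);  // 5 spaces
            else std::putchar(c);
        }
        return 0;
    }

Usage would be something like: nl2sp < english.dic > english.5sp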
With the dictionary preprocessing example I just tried to explain that there's no reason to work too hard
on implementing a "universal" codec, especially a "fast universal" one.
For most known formats it's possible to implement some kind of preprocessing to improve compression,
and it makes much more sense to tune simple postcoders to the output of these preprocessors than to the original files.
At the same time, I'm quite certain that the best compression requires
CM with a format-specific model... but that won't ever be too fast.
Continuing to work on PPMX. A nice trick: keep the number of symbols coded with no escapes - usually called a run - and if the run is above some threshold, decrease the current escape probability by some amount. Many PPMs (PPMd, PPMN, ...) have a flag for SEE of the form run>X. The trick can also be used in simple PPM compressors with no SEE at all, adding some compression gain.
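A minimal sketch of the run trick as described (hypothetical; names and constants like RUN_THRESHOLD are illustrative, not PPMX's actual values):

    // run_escape.cpp - hypothetical sketch: count symbols coded without an
    // escape, and once the run passes a threshold, shrink the escape
    // frequency before computing coding probabilities.
    #include <cstdio>

    struct Context {
        unsigned symFreqTotal = 20;   // sum of symbol counts in this context
        unsigned escFreq      = 4;    // nominal escape count
    };

    static const unsigned RUN_THRESHOLD = 8;
    static unsigned run = 0;          // symbols coded with no escape so far

    // Effective escape frequency used by the coder.
    unsigned EscapeFreq(const Context& c) {
        unsigned e = c.escFreq;
        if (run > RUN_THRESHOLD && e > 1)
            e = e / 2;                // long run of hits -> escapes less likely
        return e;
    }

    void OnSymbolCoded(bool escaped) {
        run = escaped ? 0 : run + 1;  // reset the run on every escape
    }

    int main() {
        Context c;
        for (int i = 0; i < 12; ++i) {
            unsigned e = EscapeFreq(c);
            std::printf("run=%2u  P(esc)=%u/%u\n", run, e, c.symFreqTotal + e);
            OnSymbolCoded(false);     // pretend every symbol was a hit
        }
        return 0;
    }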
Well, what I've got currently:
- Fast enough order-4 PPM (many times faster than PPMX 0.05)
- More stable on binary, random, and analog sources like audio files - i.e. PPMX can be a few times faster than PPMd on such files
- Memory usage <70 MB
- Extremely simple implementation - the entire source is just ~8 KB
I hope it's interesting enough. At least it's a new PPM compressor that is not based on PPMd - i.e. it has completely different properties, with its own strong and weak sides.
Is there really a reason to avoid using SEE? I don't think it would work faster without it.
With an order-4, SEE's role is not that important - and yep, SEE doesn't make the compressor too slow. Anyway, apart from SEE there are many other interesting things to play with - recency/deterministic scaling, etc. - i.e. things that improve the symbol prediction itself. That matters because with an order-4 model about 97% of ENWIK8 is coded in the highest (order-4) context - i.e. just 3% escapes to lower orders. So I'm thinking about cheap enough ways of improving the symbol prediction. But mostly I'm about to build a strong basement - a strong baseline PPM - all the other stuff can be easily attached at any time...
Timings on ENWIK8:
PPMX - 20 sec
PPMX+SEE - 23 sec
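For the recency scaling mentioned above, a minimal sketch under the assumption that it simply boosts the count of the most recently coded symbol before computing probabilities (the factor 2 is illustrative, not PPMX's):

    // recency.cpp - hypothetical sketch of recency scaling: a just-seen
    // symbol tends to repeat, so its count is scaled up when coding.
    #include <cstdio>

    int main() {
        unsigned freq[4] = {5, 3, 1, 1};   // symbol counts in one context
        int lastSym = 0;                    // most recently coded symbol
        for (int s = 0; s < 4; ++s) {
            unsigned total = 0, eff[4];
            for (int i = 0; i < 4; ++i) {
                eff[i] = freq[i] * (i == lastSym ? 2u : 1u);  // boost recency
                total += eff[i];
            }
            std::printf("P(sym %d) = %u/%u\n", s, eff[s], total);
        }
        return 0;
    }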
I still think that PPM only makes sense in conjunction with trees.
And there's surely enough stuff to research - like various hybrid structures (e.g. BWT + small tree) and cache-friendly layouts.
We should have something cache-friendly. That means we should keep a symbols/counts list, ordered or not; how we access those lists - by hash or by a tree-like structure - is a secondary question. Better if we are able to find freq, cumFreq and totFreq instantly. For sure it's a tricky data structure. The current PPMX has a tricky enough structure and symbol lists - in most cases getting cumFreq takes no time. So I just don't think that a tree is the answer, or the only answer. Summarizing: an efficient PPM should maintain (organized) symbol lists to be more cache-friendly. The symbol list search can be done in different ways, and a suffix tree with symbol sorting can be inefficient - I just tested PPMd and PPMX on WAV and already-compressed files, and it's a sad picture: PPMX is much faster, sometimes many times faster. Yes, on text files PPMd is usually faster, though not by that much.
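A minimal sketch of such an organized symbol list (hypothetical layout, not PPMX's actual structure): symbols and counts sit contiguously per context, so one linear pass returns freq, cumFreq and totFreq together.

    // symlist.cpp - hypothetical cache-friendly symbol list: one linear
    // scan yields freq, cumFreq and totFreq, no tree walk needed.
    #include <cstdio>

    struct SymbolList {
        unsigned char sym[16];   // symbols seen in this context
        unsigned short cnt[16];  // their counts
        int n = 0;               // number of entries
        unsigned tot = 0;        // cached total of all counts
    };

    // One pass: find s, returning its freq and the cumulative count before it.
    bool Lookup(const SymbolList& L, unsigned char s,
                unsigned& freq, unsigned& cumFreq) {
        cumFreq = 0;
        for (int i = 0; i < L.n; ++i) {
            if (L.sym[i] == s) { freq = L.cnt[i]; return true; }
            cumFreq += L.cnt[i];
        }
        return false;            // not present -> escape
    }

    int main() {
        SymbolList L;
        const char* seen = "eetaeoe";
        for (const char* p = seen; *p; ++p) {   // build a tiny context
            int i = 0;
            while (i < L.n && L.sym[i] != (unsigned char)*p) ++i;
            if (i == L.n) { L.sym[i] = (unsigned char)*p; L.cnt[i] = 0; ++L.n; }
            L.cnt[i]++; L.tot++;
        }
        unsigned f, cf;
        if (Lookup(L, 'a', f, cf))
            std::printf("freq=%u cumFreq=%u totFreq=%u\n", f, cf, L.tot);
        return 0;
    }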
Did you really compare it to ppmd -o4 -r0?
Also these types of data don't mean much anyway - they'd be handled by a detector before ppmd in a real archiver.
Also, CM can work with o1+ Huffman codes for bitwise decomposition of symbols, so imho there's no sense in using bytewise PPM with hashtables.
There may be a place for a bitwise PPM like that, though (aka DMC).
And the good thing with trees is that you can just read the pointer to the next context once you've found the current symbol's node.
So it's both compact (which is cache-friendly) and requires fewer calculations.
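A minimal sketch of that point (illustrative node layout, not from any particular PPM): the node matching the current symbol already carries the pointer to the successor context, so no second lookup is needed.

    // treenode.cpp - hypothetical context-tree node: finding the symbol's
    // node gives the next context as a single pointer read.
    #include <cstdio>

    struct Node {
        unsigned char sym;   // symbol this node represents
        unsigned short cnt;  // its count in the parent context
        Node* next;          // sibling in the same context
        Node* child;         // the context that follows this symbol
    };

    // Find s among the children of ctx; on a hit the successor context
    // comes for free as node->child.
    Node* CodeSymbol(Node* ctx, unsigned char s) {
        for (Node* n = ctx; n; n = n->next)
            if (n->sym == s) return n->child;  // next context, one pointer read
        return nullptr;                        // miss -> escape to lower order
    }

    int main() {
        Node c2 = {'c', 1, nullptr, nullptr};
        Node b  = {'b', 2, nullptr, &c2};      // after 'b' comes context {c}
        Node a  = {'a', 3, &b, nullptr};
        Node* next = CodeSymbol(&a, 'b');
        std::printf("next context head: %c\n", next ? next->sym : '?');
        return 0;
    }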
Carefully tested the latest Intel C++ compiler with PPMX. Still, Visual C++ with no PGO is superior, even compared to ICL with Profile-Guided Optimization. All options were tested - both with standalone ICL and with ICL integrated into the Visual C++ environment - i.e. the same option sets were tested and tuned. And again, with SSE2 instructions Intel C++ shows a nice improvement:
ICL - 23 sec
ICL+PGO - 21 sec
MSC - 20 sec
ICL+SSE2 - 18 sec
But I'm afraid of the executable working incorrectly on AMD and older machines because of that SSE2 instruction set. As a note, enabling SSE2 in MSC gives nothing and even slightly decreases the compressor's speed. Same with MSC's PGO - it shows nearly no improvement over static compilation.
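One common guard for that concern is a runtime check before taking the SSE2 path; a minimal sketch (the CPUID bit test is standard - EDX bit 26 from leaf 1 - the rest is illustrative):

    // sse2check.cpp - detect SSE2 at runtime and refuse to run the SSE2
    // build on hardware without it.
    #include <cstdio>
    #ifdef _MSC_VER
    #include <intrin.h>
    #else
    #include <cpuid.h>
    #endif

    bool HasSSE2() {
    #ifdef _MSC_VER
        int regs[4];
        __cpuid(regs, 1);
        return (regs[3] & (1 << 26)) != 0;   // EDX bit 26 = SSE2
    #else
        unsigned a, b, c, d;
        if (!__get_cpuid(1, &a, &b, &c, &d)) return false;
        return (d & (1u << 26)) != 0;        // EDX bit 26 = SSE2
    #endif
    }

    int main() {
        if (!HasSSE2()) {
            std::fputs("This build requires SSE2.\n", stderr);
            return 1;
        }
        std::puts("SSE2 present.");
        return 0;
    }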