Page 2 of 2 FirstFirst 12
Results 31 to 50 of 50

Thread: PPMX v0.05 - new PPM-based compressor

  1. #31
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    Anyway, some testing results, for reference:

    Test file: cracklib.txt (16,326,038 bytes)

    No code has to be inserted here.

  2. #32
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
    Plain coding with a single counter table at each point, but where you take into account the actual data structure.
    For example, consider order0 coding of word indexes - its still order0 coding, although the alphabet is a bit large,
    but compression would be much better than order0, but you need to parse the specific data structure to find
    these words and get good compression.

  3. #33
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
    That's probably a dictionary.
    Normally you may apply something like this http://compression.ru/sh/dictpack3.rar
    before compressing a dictionary.

  4. #34
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    Can't run 16-bit programs under my 64-bit W7 (too lazy to run all that stuff under DOSBox)

  5. #35
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
    There was only one DOS exe and its not the one which you actually need.
    There're also sources.
    Anyway, here's a version with replaced turn.exe (source included):
    http://nishi.dreamhosters.com/u/dictpack3b.rar

    While uploading it, I incidentally found this:
    http://nishi.dreamhosters.com/u/dictpack3a.rar
    http://encode.su/threads/564-a-very-...ll=1#post11200

  6. #36
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts

    Lightbulb

    Test file: english.dic

    No code has to be inserted here.

    And what if we will compress Chinese dictionary? PPM is universal, it will compress text file or dictionary on any language.

  7. #37
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
    The same approach (prefix/suffix cutting) would work with any sorted wordlist, chinese or whatever

  8. #38
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    So does indirect modeling (context -> bit history -> prediction). In a sorted list it is common for many contexts to see a string of 0 bits followed by a string of 1 bits, but not the other way. An indirect model will learn different probabilities for these.

    bsc english.dic 4,067,439 -> 1,177,373 (bwt)
    bcm -> 1,165,350 (bwt)
    zip -9 -> 1,050,725 (lz77)
    ppmd -> 787,319 (ppm order 4)
    zp c1 (fast.cfg) -> 561378 (CM indirect order 2,4)
    ppmonstr -> 515,592 (ppm order 12)
    zp c2 (mid.cfg) -> 490804 (CM indirect order 0,1,2,3,4,5,match,mix)
    zp c3 (max.cfg) -> 450707 (CM, like mid.cfg + sparse, word, and SSE, 22 PAQ like components)
    Last edited by Matt Mahoney; 12th July 2010 at 17:50. Reason: added more results

  9. #39
    Programmer Gribok's Avatar
    Join Date
    Apr 2007
    Location
    USA
    Posts
    159
    Thanks
    0
    Thanked 1 Time in 1 Post
    english.dic can be compressed from 4067439 into 546752 using bsc with ST4 transform and preceding contexts without other preprocessing filters.
    Enjoy coding, enjoy life!

  10. #40
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    If you replace each newline in english.dic with 5 spaces, then bsc -m2 -cp (ST5, preceding contexts) compresses to 513841, but following contexts compresses to 1102222.

  11. #41
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
    With the dictionary preprocessing example I just tried to explain that there's no reason to work too hard
    on implementing a "universal" codec, especially "fast universal".
    For most known formats its possible to implement some kind of preprocessing to improve compression,
    and there's much more sense to tune simple postcoders to results of these than original files.

    At the same time, I'm quite certain that for the best compression its necessary
    to use CM with format-specific model... but that won't ever be too fast.

  12. #42
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    Continue working on PPMX. A nice trick - we keep number of symbols coded with no escapes - such thing usually called run - and if run is above some threshold we decrease the current escape probability by some amount. Many PPMs (PPMd, PPMN, ...) has a flag for SEE a la run>X. Such trick can be used for simple PPM compressors with no SEE at all adding some compresison gain.
    Well, what I've got currently:
    • Fast enough order-4 PPM (many times faster than PPMX 0.05)
    • More stable on binary, random and analog sources like audio files - i.e. PPMX can be a few times faster than PPMd on such files
    • Memory usage <70 MB
    • Extremely simple implementation - the entire source is just ~8 KB
    I hope it's interesting enough. At least it's a new PPM compressor that is not based on PPMd - i.e. it has completely different properties - with its storng and weak sides.

  13. #43
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
    Is there really a reason to avoid using SEE? I don't think that without SEE it would work faster.

  14. #44
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    Quote Originally Posted by Shelwien View Post
    Is there really a reason to avoid using SEE? I don't think that it can make anything faster.
    With an order-4, SEE role is not that important. Yep SEE not makes compressor too slow. Anyway, apart from SEE there are many other interesting things to play with - Recency/Deterministic scaling, etc. i.e. things that improve a symbol prediction, it's important, because with an order-4 about 97% of ENWIK8 coded with the highest order-4 context - i.e. just 3% of escaped/lower order contexts. SO. I'm thinking about cheap enough ways of improving symbol prediction. But mostly I'm about to build a strong basement - strong baseline PPM - all other stuff can be easily attached at any time...

  15. #45
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    Timings on ENWIK8:
    PPMX - 20 sec
    PPMX+SEE - 23 sec

  16. #46
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
    I still think that PPM only makes sense in conjunction with trees.
    And there's surely enough stuff to research - like various hybrid structures (eg. BWT + small tree) and cache-friendly layouts.

  17. #47
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    We should have something cache friendly. That means we should keep a symbols/counts list, ordered or not. How we would access to that lists - by hash or by tree like structure is second thing. Better if we will able to find freq, cumFreq and totFreq instantly. For sure it's tricky data structure. Current PPMX has tricky enough structure and symbol lists - in most cases getting cumFreq takes no time. So, I just don't think that a tree is the answer, or the only answer. Summarizing, efficient PPM should maintain (organized) symbol lists - to be more cache friendly. The symbol lists search can be done differently, suffix tree and symbol sorting can be inefficient - I just tested PPMd and PPMX on WAV or already compressed files - sad picture - PPMX is much faster, sometimes many times faster. Yes, on text files PPMd is usually faster, not that much though.

  18. #48
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
    Did you really compare it to ppmd -o4 -r0?
    Also these types of data don't mean much anyway - they'd be handled by a detector before ppmd in a real archiver.

    Also CM can work with o1+ huffman codes for bitwise decomposition of symbols, so imho there's no sense to use bytewise PPM with hashtables.
    There may be a place for a bitwise PPM like that though (aka DMC).

    And the good thing with trees is that you can just read the pointer to next context, once you found the current symbol's node.
    So its both compact (which is cache-friendly) and requires less calculations.

  19. #49
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    Quote Originally Posted by Shelwien View Post
    Did you really compare it to ppmd -o4 -r0?
    I do.

    Quote Originally Posted by Shelwien View Post
    Also these types of data don't mean much anyway - they'd be handled by a detector before ppmd in a real archiver.
    Same applies to binaries. Especially not that highly compressible binaries. So that means something.

  20. #50
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts

    Lightbulb

    Carefully tested the latest Intel C++ compiler with PPMX. Still, Visual C++ with no PGO superior even compared to ICL with Profile Guided Optimization. All options are tested - as with standalone ICL, as well as with integrated ICL into Visual C++ environment - i.e. even same option set was tested and tuned. And again, using SSE2 instructions Intel C++ shows nice improvement:

    ICL - 23 sec
    ICL+PGO - 21 sec
    MSC - 20 sec
    ICL+SSE2 - 18 sec

    But I'm afraid about incorrect working executable on AMD and an older machines with that SSE2 instruction set. As a note, enabling SSE2 in MSC gives nothing and slightly decreases compressor's speed. Same thing with MSC's PGO - it shows nearly no improvement over static compiling.

Page 2 of 2 FirstFirst 12

Similar Threads

  1. BCM v0.09 - The ultimate BWT-based file compressor!
    By encode in forum Data Compression
    Replies: 22
    Last Post: 6th March 2016, 10:26
  2. BCM v0.08 - The ultimate BWT-based file compressor!
    By encode in forum Data Compression
    Replies: 78
    Last Post: 12th August 2009, 11:14
  3. BCM v0.01 - New BWT+CM-based compressor
    By encode in forum Data Compression
    Replies: 81
    Last Post: 9th February 2009, 16:47
  4. PPMX - a new PPM encoder
    By encode in forum Data Compression
    Replies: 14
    Last Post: 30th November 2008, 17:03
  5. TURTLE incoming... Fast PPM file compressor.
    By Nania Francesco in forum Forum Archive
    Replies: 104
    Last Post: 8th August 2007, 21:40

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •