Page 1 of 3 123 LastLast
Results 1 to 30 of 86

Thread: mod_ppmd

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,271
    Thanks
    201
    Thanked 985 Times in 511 Posts

    mod_ppmd

    So I randomly tried turning ppmd (ppmd_sh to be specific) into a submodel for paq,
    and the effect was surprisingly significant, so I decided to post it.

    https://sites.google.com/site/shelwi...edirects=0&d=1

    Code:
    ppmd model for paq8:
    (order=14, mem=256, cutoff=1)
    
    paq8p -7 book1:    192227 -> 191290
    paq8px69 -7 book1: 191577 -> 190870
    
    enwik8:
    17,904,748 // paq8p -7 enwik8
    17,536,045 // paq8p+mod_ppmd(o15 m1680) -7
    There's a source example in the archive - basically its usually enough to add a #include
    and ppmdModel(m); in the right place.
    But note that with the current model settings (o15 m1680 r0, see the tail of mod_ppmd.inc file, ppmdModel function)
    its necessary to compile with large-address-aware flag (-Wl,--large-address-aware with gcc)
    Attached Files Attached Files

  2. The Following 17 Users Say Thank You to Shelwien For This Useful Post:

    Bulat Ziganshin (1st June 2016),byronknoll (1st June 2016),Cyan (1st June 2016),Darek (1st June 2016),encode (2nd June 2016),inikep (1st June 2016),Intrinsic (2nd June 2016),JamesB (2nd June 2016),kaitz (1st June 2016),Matt Mahoney (2nd June 2016),Mauro Vezzosi (2nd June 2016),Mike (1st June 2016),Minimum (4th June 2016),m^3 (2nd June 2016),schnaader (2nd June 2016),Simorq (4th May 2017),xinix (2nd June 2016)

  3. #2
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    226
    Thanks
    108
    Thanked 106 Times in 65 Posts
    Nice results! Do you also have a comparison with time/memory usage?

    I also found that adding a PPM model into cmix (http://www.byronknoll.com/cmix.html) gave substantial gains. I used my own PPM implementation - I should try replacing it with ppmd since my implementation is probably much worse!

    How do you convert the byte-level predictions from ppmd into bit predictions for paq? Is it something like: once per byte, use ppmd to get the predictions for all 256 possible byte values. Then, normalize these into bit-predictions as you see the subsequent bits for that byte?

  4. #3
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by byronknoll View Post
    I also found that adding a PPM model into cmix (http://www.byronknoll.com/cmix.html) gave substantial gains. I used my own PPM implementation - I should try replacing it with ppmd since my implementation is probably much worse!
    I studied ppmd to figure out how to implement arithmetic coding for GLZA. In general, I think it is a good implementation. A couple of things I changed for GLZA were:

    1. Doing a linear search is slow for relatively flat distributions so I augmented the Freq arrays with what is essentially a (partial) binary tree containing the sum of the Freqs for the leaves for each node. (It would be pretty easy to use this to make bit predictions, but probably not optimally).

    2. Bulat had some ideas in another thread that are useful for making PPMd's Freq updates more fair and I used some of those ideas in some of GLZA's models. For intance, replacing UP_FREQ's constant value of 20 with something like min(16, range.scale >> 9) can improve the stability of the frequency updates.

  5. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,271
    Thanks
    201
    Thanked 985 Times in 511 Posts
    > Nice results! Do you also have a comparison with time/memory usage?

    Not really, but its much slower than you'd normally expect from a ppm model -
    overall paq8p+mod_ppmd is around 1.5x slower than plain paq8p.
    And the memory usage is directly set by the memory parameter.
    There's also the third parameter, which determines what ppmd does
    after using up all the memory - r0 means complete reset, and r1 cuts
    some branches to free up memory.

    > How do you convert the byte-level predictions from ppmd into bit
    > predictions for paq?

    See the ConvertSQ function.
    At least, the conversion seems precise enough, bitwise version of
    ppmd_sh only adds 9 bytes to compressed book1 size (ie 209941->209952).

    > Is it something like: once per byte, use ppmd to get
    > the predictions for all 256 possible byte values.

    Yes, but luckily english texts don't use all the possible byte values,
    or it would be even slower - for example, there're only 82 of 256
    present in book1.

    > I should try replacing it with ppmd

    It'd be nice if you could test various contexts and mixes with it.
    I mean, its possible to create multiple ppmd models there, with
    different parameters or something.
    Maybe even test a delayed version - init ppm#1 at file start,
    then init ppm#2 after processing 10M, etc.
    Also I suspect that ppm only has better predictions in specific contexts -
    maybe bit7/6 of symbols or some such.

    However there could be problems with x64 - x64 build of modified paq8p
    seems to work normally (needs -fpermissive to compile though), but I'm not certain
    that it would work with actual >4G addresses.
    Maybe I'd try to fix it, if it'd appear really useful.

  6. The Following 2 Users Say Thank You to Shelwien For This Useful Post:

    byronknoll (2nd June 2016),xinix (3rd December 2016)

  7. #5
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    226
    Thanks
    108
    Thanked 106 Times in 65 Posts
    I made an attempt at porting ppmd into cmix. Unfortunately it looks like ppmd was designed for 32 bit pointers - it is difficult to port it to 64 bit. Does anyone know if there is already a 64 bit ppmd version?

  8. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,498
    Thanks
    741
    Thanked 664 Times in 358 Posts
    yes, the latest ppmd J is 64-bit compatible (and allow to run multiple instances): http://www.compression.ru/ds/

  9. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,271
    Thanks
    201
    Thanked 985 Times in 511 Posts
    > if there is already a 64 bit ppmd version?

    Unfortunately its not that simple - ppmd_sh is quite different from original ppmd, and original ppmd doesn't support creating multiple model instances.
    As to lack of x64 support - its plainly that old. But I can look into it if its really necessary.

    Anyway, I think that it can be used in cmix even as is, although maybe you'd have to allocate memory for it before any other models.

  10. The Following User Says Thank You to Shelwien For This Useful Post:

    xinix (3rd December 2016)

  11. #8
    Member
    Join Date
    Sep 2015
    Location
    Italy
    Posts
    239
    Thanks
    104
    Thanked 142 Times in 103 Posts
    Quote Originally Posted by Shelwien View Post
    > Can you restore your website ctxmodel.net?
    http://web.archive.org/web/201502060.../ctxmodel.net/
    Unfortunately it does seem that the hosting company didn't save any backups.
    But it'd be helpful if you could tell me what specifically do you look for.
    At the moment, I'm looking for nothing specific.
    There is information that I like to reread every so often, like delayed counters (my idea is to use them to guess the best adaption rate to update the counters knowing the last appeared bits and check if it can be better than to mix 2 counters, one with slow and one with fast adaption rate).

  12. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,498
    Thanks
    741
    Thanked 664 Times in 358 Posts
    Quote Originally Posted by Kennon Conrad View Post
    2. Bulat had some ideas in another thread that are useful for making PPMd's Freq updates more fair and I used some of those ideas in some of GLZA's models. For intance, replacing UP_FREQ's constant value of 20 with something like min(16, range.scale >> 9) can improve the stability of the frequency updates.
    do you looked into https://fgiesen.wordpress.com/2015/0...hmetic-coding/ ? Fabian has developed much better implementation of this idea, making 4-5 bit coder almost as fast as binary coder, and allowing pretty arbitrary weight functions

  13. The Following User Says Thank You to Bulat Ziganshin For This Useful Post:

    Kennon Conrad (3rd June 2016)

  14. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,271
    Thanks
    201
    Thanked 985 Times in 511 Posts
    > something like min(16,range.scale>>9)

    That could be interesting, maybe i'd repost ppmd_sh and somebody could try
    improving its compression?

    But for the purpose of this thread its not really the point -
    I just took ppmd_sh8 which I made in 2009 and added probability collection
    and conversion to it.
    So the next step for me would be doing the same to deppm - unfortunately,
    its much harder there to make a clean version of coding loop which won't
    modify anything (for p.d. collection).

    > do you looked into https://fgiesen.wordpress.com/2015/0...hmetic-coding/ ?

    Yes, its always nice to have some progress.
    But unfortunately these methods are not really compatible with any
    stronger models, and I don't feel like trying to compete with lots
    of people at coder speed optimization.
    Instead of that, I specialize in making custom CM coders for
    structured data and, I guess, recompression.

  15. The Following User Says Thank You to Shelwien For This Useful Post:

    xinix (3rd December 2016)

  16. #11
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    226
    Thanks
    108
    Thanked 106 Times in 65 Posts
    Quote Originally Posted by Shelwien View Post
    > if there is already a 64 bit ppmd version?

    Unfortunately its not that simple - ppmd_sh is quite different from original ppmd, and original ppmd doesn't support creating multiple model instances.
    As to lack of x64 support - its plainly that old. But I can look into it if its really necessary.

    Anyway, I think that it can be used in cmix even as is, although maybe you'd have to allocate memory for it before any other models.
    What are the differences between ppmd_sh and the original ppmd? Could I just add the byte-level prediction to bit-level prediction code to ppmd J, which is already 64 bit compatible?

    I tried allocating ppmd_sh memory before other models (and truncating 64 bit pointers to 32 bits for integer storage), but this still segfaults. I think ASLR (https://en.wikipedia.org/wiki/Addres..._randomization) would not make this a reliable solution: https://blogs.msdn.microsoft.com/old...09-00/?p=45181

  17. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,271
    Thanks
    201
    Thanked 985 Times in 511 Posts
    > What are the differences between ppmd_sh and the original ppmd?

    Basically ppmd_sh is a refactored version, with merged encoding/decoding functions,
    and everything wrapped into a class, to allow using multiple model instances at once.

    > Could I just add the byte-level prediction to bit-level prediction code to ppmd J,
    > which is already 64 bit compatible?

    Well, I guess you can try porting the probability conversion code from mod_ppmd to ppmd_J,
    but it would require a lot of work I think - there're multiple (4?) functions which traverse
    the tree, update counters, and encode something.
    And to turn it into a paq model you'd have to remove updates and rangecoder calls.

    > I tried allocating ppmd_sh memory before other models (and truncating 64
    > bit pointers to 32 bits for integer storage), but this still segfaults. I
    > think ASLR would not make this a reliable solution

    Try linking with /DYNAMICBASE:NO for now?

    Though I guess I'd try porting mod_ppmd to x64

  18. The Following User Says Thank You to Shelwien For This Useful Post:

    xinix (3rd December 2016)

  19. #13
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    226
    Thanks
    108
    Thanked 106 Times in 65 Posts
    Quote Originally Posted by Shelwien View Post
    >
    Try linking with /DYNAMICBASE:NO for now?
    I am compiling with g++ in Linux, so I think ASLR is controlled at the OS level rather than the compiler level. I am not very familiar with how memory allocation works in Linux - I am not even sure if ASLR is the main issue. I see the "HeapStart" variable (mod_ppmd.inc line 153) is at 0x7fff0e174010, which is already too big for 32 bits.

  20. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,271
    Thanks
    201
    Thanked 985 Times in 511 Posts
    Well, here's a new version

    http://nishi.dreamhosters.com/u/mod_ppmd_v2.rar

    Changes:
    1. x64 allocation issue should be fixed; now all the pointers in the tree are relative to HeapStart.
    2. Tree node pointers are turned into node indices (there're also file data pointers, these are untouched).
    That is, the numbers are basically divided by 12, so much more can fit into 32 bits.
    The new limit is (maxmem/8+maxmem*7/8/12<4096*1024*1024) which is around 21.7G
    3. ppmdModel() now includes an example of two ppmd models - it hurts compression on book1 somewhat (comparing to single model),
    and I didn't test it on enwik yet.

  21. The Following 4 Users Say Thank You to Shelwien For This Useful Post:

    byronknoll (5th June 2016),comp1 (5th June 2016),Matt Mahoney (7th June 2016),xinix (3rd December 2016)

  22. #15
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    226
    Thanks
    108
    Thanked 106 Times in 65 Posts
    Thanks, mod_ppmd_v2 works in cmix! cmix v10 compresses BOOK1 to 178010. Adding mod_ppmd_v2 as a bit-level model improves BOOK1 to 177967. What license is the mod_ppmd source code - is it GPL?
    Last edited by byronknoll; 6th June 2016 at 04:52.

  23. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,271
    Thanks
    201
    Thanked 985 Times in 511 Posts
    Well, GPL is too much actually.
    You can check readme at http://compression.ru/ds/ppmdj1.rar for the license, but basically it says "not to misattribute authorship".
    So I'd say public domain, same as ppmd_sh.
    Just add a note that ppmd is written by Dmitry Shkarin.

    Also try removing the 2nd ppmd model from the mix, it might actually compress better :)
    Simply remove the marked lines:

    Code:
    __declspec(align(4096)) static ppmd_Model ppmd_12_256_1;
    __declspec(align(4096)) static ppmd_Model ppmd_6_32_1;          <------
    
    void ppmdModel( Mixer& m ) {
      static int init_flag = 1;
      if( init_flag ) { 
        ppmd_12_256_1.Init(15,1680,1,0);
        ppmd_6_32_1.Init(6,32,1,0);                                 <------
      }
    
      m.add( stretch(4096-ppmd_12_256_1.ppmd_Predict(4096,y)) );
      m.add( stretch(4096-ppmd_6_32_1.ppmd_Predict(4096,y)) );      <------
    
      init_flag=0;
    }

  24. The Following 2 Users Say Thank You to Shelwien For This Useful Post:

    byronknoll (7th June 2016),xinix (3rd December 2016)

  25. #17
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    226
    Thanks
    108
    Thanked 106 Times in 65 Posts
    Here are the results for enwik8:


    cmix v10: 15587868 bytes
    33894.8 seconds
    26341664 KiB memory


    cmix v10 + mod_ppmd_v2: 15570923 bytes
    34459.95 seconds
    27310968 KiB memory

  26. #18
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    226
    Thanks
    108
    Thanked 106 Times in 65 Posts
    @Shelwien, would it be difficult to get byte-level predictions for 256 symbols from mod_ppmd_v2? cmix has a separate byte-level mixer (a backpropagation neural network) - compression rate would probably improve if I used mod_ppmd as a byte-level model rather than a bit-level model.

  27. #19
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,271
    Thanks
    201
    Thanked 985 Times in 511 Posts
    See ConvertSQ() function.
    Basically, after you get the first bit probability estimation for a byte from ppmd_Predict(),
    there'd be the "uint sqp[256]" table filled with symbol freqs.

    Also did you use one ppmd model in your test, or two, like in v2 example?
    Because if its two, then the small gain is probably more about 2nd model hurting the results.

  28. The Following 2 Users Say Thank You to Shelwien For This Useful Post:

    byronknoll (7th June 2016),xinix (3rd December 2016)

  29. #20
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    226
    Thanks
    108
    Thanked 106 Times in 65 Posts
    Quote Originally Posted by Shelwien View Post
    Also did you use one ppmd model in your test, or two, like in v2 example?
    Because if its two, then the small gain is probably more about 2nd model hurting the results.
    For both the BOOK1 test and enwik8 test, I used one model: "ppmd_12_256_1.Init(15,1680,1,0);"

  30. #21
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,271
    Thanks
    201
    Thanked 985 Times in 511 Posts
    Ok then, though I expected at least 100-200k gain.

  31. The Following User Says Thank You to Shelwien For This Useful Post:

    xinix (3rd December 2016)

  32. #22
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    If anyone wants to run some benchmarks for enwik9 or silesia, I will post. I don't have 32 GB to run cmix.

  33. #23
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Results courtesy of byronknoll.
    Code:
                Compression     Compressed size      Decompresser  Total size   Time (ns/byte)
    Program       Options      enwik8      enwik9     size (zip)   enwik9+prog  Comp   Decomp   Mem   Notes
    -------       -------    ----------  -----------  -----------  -----------  ------ ------  -----  -----
    cmix v10                 15,587,868  123,257,156    164,263 s  123,421,419  355721 355850  29924  66
    cmix v9                  15,627,536  123,874,398    161,911 s  124,036,309  346436 345681  26929  66
    
      Silesia dicke mozil   mr   nci ooff  osdb reym samba  sao webst x-ray  xml Compressor -options
    --------- ----- ----- ---- ----- ---- ----- ---- ----- ---- ----- ----- ---- -------------------
     30298715  1887  7549 2014   826 1329  2013  764  1696 3764  4645  3564  243 precomp v0.4.4 -cn | cmix v10
     30353202  1888  7568 2017   829 1333  2018  766  1699 3764  4657  3565  244 precomp v0.4.4 -cn | cmix v9

  34. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,271
    Thanks
    201
    Thanked 985 Times in 511 Posts
    Thanks, but according to https://github.com/byronknoll/cmix , cmix v10 doesn't use mod_ppmd (yet?).

  35. The Following User Says Thank You to Shelwien For This Useful Post:

    xinix (3rd December 2016)

  36. #25
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    226
    Thanks
    108
    Thanked 106 Times in 65 Posts
    Quote Originally Posted by Shelwien View Post
    Thanks, but according to https://github.com/byronknoll/cmix , cmix v10 doesn't use mod_ppmd (yet?).
    Correct, I didn't include mod_ppmd in v10. I ran into an issue where mod_ppmd segfaults whenever it reaches the memory limit and resets. My enwik8 test didn't have that issue because the memory limit was high and it never had to reset.

    If I can fix the issue, I will try to include mod_ppmd in cmix v11. It might make sense to remove my original PPM models once mod_ppmd is added.

  37. #26
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,271
    Thanks
    201
    Thanked 985 Times in 511 Posts
    Ok here's v3: http://nishi.dreamhosters.com/u/mod_ppmd_v3.rar
    Now it includes an unmerged libpmd source and a standalone coder for ppmd model testing.
    Also I switched index<->pointer conversion functions to simpler ones, so now the ppmd memory limit is 4G, but hopefully it won't crash on model reset.

  38. The Following 2 Users Say Thank You to Shelwien For This Useful Post:

    byronknoll (18th June 2016),xinix (3rd December 2016)

  39. #27
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    226
    Thanks
    108
    Thanked 106 Times in 65 Posts
    Thanks, mod_ppmd_v3 works well. I have added it to cmix: https://github.com/byronknoll/cmix/c...48a7d96652643f

    Here are the results for enwik8:

    cmix v10: 15587868 bytes
    33894.8 seconds
    26341664 KiB memory

    cmix v10 + mod_ppmd_v3: 15567067 bytes
    35400.84 seconds
    24578800 KiB memory

    This performs better than my earlier test because:
    1) I added ppmd as a byte-level model rather than a bit-level model.
    2) I reduced overall memory usage by decreasing memory used by my original PPM models.

  40. #28
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,271
    Thanks
    201
    Thanked 985 Times in 511 Posts
    you can also try testing different model orders (first parameter).

    Also did you mix it on cmix level? What if you'd add it to one of paq models instead? Or all of them?

  41. The Following User Says Thank You to Shelwien For This Useful Post:

    xinix (3rd December 2016)

  42. #29
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    226
    Thanks
    108
    Thanked 106 Times in 65 Posts
    Quote Originally Posted by Shelwien View Post
    you can also try testing different model orders (first parameter).

    Also did you mix it on cmix level? What if you'd add it to one of paq models instead? Or all of them?
    Sure, I will try testing different model orders and memory limits. I will also try combining multiple ppmd models. Yes, it is mixed on the cmix-level. It would be possible to also add to the paq mixers, but I am guessing this won't make a significant difference.

  43. #30
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,271
    Thanks
    201
    Thanked 985 Times in 511 Posts
    Made a new version, with added SSE - http://nishi.dreamhosters.com/u/mod_ppmd_v4_dll.rar
    Standalone compression improved, 209952->204730 for book1, 245370->237279 for enwik6.
    But with paq8p:
    Code:
    192,227 BOOK1.paq8p
    191,293 BOOK1.paq8p_v3
    191,309 BOOK1.paq8p_v4
    211,370 enwik6.paq8p
    210,168 enwik6.paq8p_v3
    210,166 enwik6.paq8p_v4

Page 1 of 3 123 LastLast

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •