
Thread: ppmx v0.03 is here!

  1. #1
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts

    ppmx v0.03 is here!

    Complete rewrite of the encoder this time. Mostly, it's about code optimization and parameter tuning. ppmx v0.03 is now order-10 PPM. I tried many order/model combinations, but can't find the best one - it's data dependent. I may tune for SFC - getting 12,000,000 bytes with no filters - but then I lose too much on ENWIK8/9; tuning for the ENWIKs, I lose on other files, including SFC. Anyway, I tried to hit the average optimum... Mostly I tried not to lose on LTCB and to be OK with calgary.tar (795 KB)!
    Attached Files

  2. #2
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    My suggestions can be grouped into two items:

    1 - It's not a good idea to optimize a byte-wise coder for mixed data types (generally binary). I think the best choice is stationary data (in other words, text: BOOK1, ENWIK8 etc). If you want to deal with binary data, you should apply filters; otherwise, you cannot get the maximal benefit of byte-wise coding. But first of all, you should implement a tree structure. I think you will gain a lot of compression.

    2 - You can implement optimal parsing for small fixed blocks (say 32-64 KiB). This could lead to superior compression. You can easily undo changes in a hash table (create a unified queue for each memory modification to the hash table), so you don't need to worry about the "optimal" max order. Also, this will work for both kinds of data: text and binary.

    If I were you, I would implement the second suggestion, since it fits your current structure and you don't need to change the core scheme. You would add only a queue which stores memory modifications - see the sketch below.
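
    A minimal sketch of such an undo queue, assuming the model's counters live in flat memory cells (names here are illustrative, not from ppmx):
    Code:
    #include <cstdint>
    #include <vector>

    // Journal of (cell, old value) pairs recorded before each trial update,
    // so a whole trial parse of a block can be rolled back in one call.
    struct UndoLog {
        struct Entry { uint32_t* slot; uint32_t oldValue; };
        std::vector<Entry> entries;

        void record(uint32_t* slot) {            // call before writing a cell
            entries.push_back({slot, *slot});
        }
        void rollback() {                        // undo in reverse order, so
            for (auto it = entries.rbegin();     // repeated writes to one cell
                 it != entries.rend(); ++it)     // unwind correctly
                *it->slot = it->oldValue;
            entries.clear();
        }
        void commit() { entries.clear(); }       // keep the trial's updates
    };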
    BIT Archiver homepage: www.osmanturan.com

  3. #3
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    A tree structure may give some speed gain - not extra compression. In my ppmx, with its order skipping, using trees is questionable. So, I will keep my hashing.

    PPM is potentially less efficient on binary data, since it uses only the highest order available.

    The answer is LOE. I will try this thing. As an example, after coding byte "s" we check each order for the probability of "s" - to find out which model order was more accurate. Afterwards, we set this best order as the initial one - i.e. the higher orders will simply be skipped.
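
    A rough sketch of that order check (the interface is hypothetical; ppmx's real model API may differ):
    Code:
    #include <cstdint>

    // After coding symbol 'sym', ask each order's model what probability it
    // had assigned, and pick the most confident order as the next start order.
    // orderProb() is a hypothetical accessor into the per-order statistics.
    int estimateStartOrder(int maxOrder, uint8_t sym,
                           double (*orderProb)(int order, uint8_t sym)) {
        int best = 0;
        double bestP = 0.0;
        for (int o = 0; o <= maxOrder; ++o) {
            double p = orderProb(o, sym);
            if (p > bestP) { bestP = p; best = o; }
        }
        return best;   // orders above 'best' are simply skipped next time
    }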


  4. #4
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Yet another idea is to choose one high-order model and then use a regular PPM model set below it; in other words, we may use different model sets. For example, order-9-3-2-1-0 or order-5-3-2-1-0 - here we switch between order-9 and order-5. The "best order" estimation can be context sensitive.
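
    One cheap way to make that estimation context-sensitive could be a small score table keyed by a low-order context (a sketch under assumed names, not actual ppmx code):
    Code:
    #include <cstdint>

    // score[c] > 0 means the order-9 set has recently predicted better than
    // the order-5 set in contexts whose last byte is c.
    int8_t score[256];

    int pickSet(uint8_t lastByte) {             // 0: order-5 set, 1: order-9 set
        return score[lastByte] >= 0;
    }
    void updateScore(uint8_t lastByte, double p9, double p5) {
        int8_t& s = score[lastByte];            // bounded counter update
        if (p9 > p5 && s <  127) ++s;
        if (p5 > p9 && s > -128) --s;
    }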

  5. #5
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by encode
    A tree structure may give some speed gain - not extra compression. In my ppmx, with its order skipping, using trees is questionable. So, I will keep my hashing.
    Surely a tree structure should give a significant speed-up, since random access will be greatly reduced. As for compression, I can easily say you will gain, because hashes tend to collide (though sometimes a collision can even help compression - context merging) and may wipe out the most probable symbols' statistics at some coding stage. Also, I bet your hashing coefficients were selected randomly - no real optimization at all, probably just some random prime numbers. So, instead of increasing complexity, you could apply some optimizations to gain extra compression with the current scheme.
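
    For reference, a typical multiplicative context hash looks something like this (a generic sketch, not ppmx's actual function); the multiplier and the final shift are exactly the kind of coefficients worth tuning on real data:
    Code:
    #include <cstdint>

    // 'p' points at the most recent byte; hash the previous 'order' bytes
    // into a table index of 'tableBits' bits.
    uint32_t contextHash(const uint8_t* p, int order, int tableBits) {
        uint32_t h = 0;
        for (int i = 0; i < order; ++i)
            h = h * 0x6F4F7285u + p[-i];   // multiplier: a tunable constant
        return h >> (32 - tableBits);      // top bits of the product mix best
    }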

    Quote Originally Posted by encode
    PPM is potentially less efficient on binary data, since it uses only the highest order available.
    I think you should look again at the SFC chart: DURILCA (9,828,249 bytes) and PPMonstr J rev.1 (10,389,821 bytes). With optimally parsed PPM you would select a short context at most on binary files, so binary files wouldn't be a headache. Well-modelled PPM + filters = good compression for me. If I weren't writing LWCX, I would write a PPM codec which does near-optimal parsing and filtering.

    Quote Originally Posted by encode
    The answer is LOE. I will try this thing. As an example, after coding byte "s" we check each order for the probability of "s" - to find out which model order was more accurate. Afterwards, we set this best order as the initial one - i.e. the higher orders will simply be skipped.
    We'll see the results

    Quote Originally Posted by encode
    Yet another idea is to choose one high-order model and then use a regular PPM model set below it; in other words, we may use different model sets. For example, order-9-3-2-1-0 or order-5-3-2-1-0 - here we switch between order-9 and order-5. The "best order" estimation can be context sensitive.
    You could even use some short-context models as the SSE context; SSE will then weight things toward the right probability.
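
    Roughly like this - a generic SSE sketch for one binary decision (say, an escape flag); the table sizes and names are assumptions, not BIT's code:
    Code:
    #include <cstdint>

    // Map a 12-bit input probability, together with a short-context byte,
    // through an adaptive table; the table learns the corrected probability.
    struct SSE {
        uint16_t t[32][256];                    // 32 buckets x order-1 context
        SSE() {
            for (auto& row : t)
                for (auto& cell : row) cell = 2048;   // start at p = 0.5
        }
        uint16_t p(uint16_t pIn, uint8_t ctx) const {
            return t[pIn >> 7][ctx];            // quantize pIn to a bucket
        }
        void update(uint16_t pIn, uint8_t ctx, int bit) {
            uint16_t& c = t[pIn >> 7][ctx];
            c += (((bit << 12) - int(c)) >> 5); // move toward the outcome
        }
    };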
    BIT Archiver homepage: www.osmanturan.com

  6. #6
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts

  7. #7
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Order-9 works better with ENWIKs, but order-10 is better on calgary.tar and on many other files...

  8. #8
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    I will dig into a better counter implementation - not only SEE, LOE and a proper model combination.

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,372
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    > A tree structure may give some speed gain - not extra compression.

    Compression can change due to hash collisions.
    In general, after switching from a hash to a tree,
    text compression would be slightly better,
    and binary data compression somewhat worse.

    > In my ppmx, with its order skipping, using trees
    > is questionable. So, I will keep my hashing.

    As I already said, when traditional prefix contexts
    are used (even if selectively), it's not a problem to
    keep them linked with suffix pointers... and with
    order tables (like in ppmy/ash/tree) nothing special
    has to be done at all.

    But with your kind of hashing, where a whole symbol
    distribution is looked up by a context hash, it's
    just inefficient, but doesn't affect the compression,
    I guess, so it may be reasonable to keep it to simplify the research.

    > PPM is potentially less efficient on binary data,
    > since it uses only the highest order available.

    That's not the only problem, and probably isn't even the
    main one. Also, probability updates in frequency-based
    models and bitwise models are completely different.
    And while frequency distributions are more compatible with
    stationary data, it still seems that bitwise updates are
    much better balanced. Here's a good example:
    Code:
    book1  wcc386
    436029 411381 // o0 alphabet model, sum-optimized
    438861 401395 // o0 binary model, sum-optimized
    
    350571 320331 // binary o1 only, sum optimized
    345737 332711 // binary o1 only, book1-optimized
    353437 319029 // binary o1 only, wcc386-optimized
    345917 337759 // escaped o1, sum optimized
    345345 340459 // escaped o1, book1-optimized
    346935 337416 // escaped o1, wcc386-optimized
    As we can see, the frequency-based model is better
    for stationary data (book1) in all cases.
    But on average, the bitwise update is clearly superior,
    as it allows much faster, nonlinear adaptation.

    How to deal with it is still a question though.
    It's possible both to implement the bitwise update
    along with bytewise coding and distributions, and
    to improve the adaptivity using SSE or other methods.
    But people tend to avoid thinking about it and just
    use bitwise models despite their known speed restrictions
    (each bit has to be processed, one way or another) and
    redundancy on stationary data.
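
    To make the difference in update rules concrete, here is a generic
    sketch with typical parameters (not the exact models measured above):
    Code:
    #include <cstdint>

    // Frequency counter: the effective step shrinks as 1/n, which suits
    // stationary data but adapts slowly to change.
    struct FreqCounter {
        uint32_t n0 = 1, n1 = 1;
        double p1() const { return double(n1) / (n0 + n1); }
        void update(int bit) { bit ? ++n1 : ++n0; }
    };

    // Shift-based bit counter: always moves a fixed fraction toward the
    // outcome, i.e. exponential forgetting of old statistics.
    struct BitCounter {
        uint16_t p = 2048;                      // P(bit = 1) in 12 bits
        void update(int bit) {
            p += (((bit << 12) - int(p)) >> 5);
        }
    };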

    > The answer is LOE. I will try this thing. As an example,
    > after coding byte "s" we check each order for the probability
    > of "s" - to find out which model order was more accurate.
    > Afterwards, we set this best order as the initial one -
    > i.e. the higher orders will simply be skipped.

    I found that such statistics are useful for CM, but I'm not
    sure about PPM. It's kind of the reverse there - you shouldn't
    update higher-order contexts with "noisy" symbols
    (which can be implemented as lazy/delayed updates).

    Also, if you paid attention, you could see that order
    selection in ppmd is highly affected by alphabet sizes.
    For example, masked contexts with the same alphabet size as
    the unmasked one are simply skipped without consideration.

    > The "best order" estimation can be context sensitive.

    Such techniques do work, but the kinds of statistics
    necessary to make them work are also quite resource-intensive.
    Basically, you have to evaluate a few kinds of
    context sets per symbol and accumulate the code length
    (aka log-likelihood) for each case.
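
    For example, the bookkeeping per candidate can be as small as this
    (hypothetical sketch):
    Code:
    #include <cmath>

    // Accumulate the code length each candidate context set *would* have
    // produced; the cheapest one so far wins.
    double cost[2] = {0.0, 0.0};                // e.g. two rival model sets

    void account(int set, double pSym) {        // pSym: that set's probability
        cost[set] -= std::log2(pSym);           // of the symbol actually coded
    }
    int bestSet() { return cost[1] < cost[0]; }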

    > I will dig into a better counter implementation - not only
    > SEE, LOE and a proper model combination.

    I also mentioned the "PPM mixed statistics problem".
    In short, unmasked and masked statistics have different
    properties, while practical PPM statistics are a random mix of both.
    For example, a context can be updated after an escape (as a masked context)
    and then next time serve as an unmasked one. And then there are
    partial updates etc. So in the end, nobody knows what kind of
    source is approximated by these submodels - there isn't even a
    clear "context history".

  10. #10
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Just tested ppmx v0.03 with huge memory (1700 MB), and order-9:

    ENWIK8
    ppmx v0.03, original: 22,572,808 bytes
    ppmx v0.03, huge-mem. order-10: 22,326,544 bytes
    ppmx v0.03, huge-mem. order-9: 22,278,828 bytes

    ENWIK9
    ppmx v0.03, original: 193,643,464 bytes
    ppmx v0.03, huge-mem. order-10: 187,400,809 bytes
    ppmx v0.03, huge-mem. order-9: 187,177,480 bytes

    In other words, with better or less greedy hashing I may gain some compression! I'm thinking about a complete rewrite again...

  11. #11
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    873
    Thanks
    49
    Thanked 106 Times in 84 Posts
    What is the license of PPMX?

    Would I be breaking the license if I distributed this version with a GUI, intended as a CPU stress-testing tool?

    I wish to use various CPU-intensive compression utilities and do MD5 and CRC32 verification of the output.

  12. #12
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Well, wait some time. Currently, you may use BALZ or QUAD - both programs are finished. PPMX should replace BALZ at SF.net - i.e. it might become my flagship compressor. It's just in an early stage of development at the moment; PPMX v0.04 will be released soon!

  13. #13
    Member
    Join Date
    Sep 2007
    Location
    Denmark
    Posts
    873
    Thanks
    49
    Thanked 106 Times in 84 Posts
    I'll look into BALZ and QUAD.

    Actually, I'm just looking for a lot of different CPU-intensive CLI programs.

    I would like to use RZM and CCMZ as well, but I'm not very good at understanding all the different licenses.

  14. #14
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    The current PPMX v0.04 uses ~280 MB; it is now an order-12 PPM, and it compresses:
    calgary.tar -> 787 KB
    book1 -> 216 KB
    world95.txt -> 474 KB


