
Thread: Simple bytewise context mixing demo

  1. #1
    Shelwien (Administrator)

    Simple bytewise context mixing demo

    This is the C++ source of about the simplest possible order-16 bytewise mixing coder.
    It's called "green" because it doesn't use any trees.
    There are no complex data structures at all, so it should be easy to understand.
    And it still compresses book1 to 215815 bytes, which should provide
    a reasonable baseline for statistical text compression, since no dynamic mixing
    or adaptive counters are used.

    http://ctxmodel.net/files/green/green_v0.rar

    Still, the algorithm has O(N^2)-ish complexity due to source rescans
    for collecting high-order statistics, so for tests with larger files
    (say >3M) I'd suggest using a "less ecological" coder, which
    has the same mixing function but uses a tree to store the
    statistics:

    http://ctxmodel.net/files/tree_v1.rar
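
    As a rough illustration (hypothetical code, not the actual green.cpp),
    the treeless approach boils down to rescanning the already-processed
    input for earlier occurrences of the current context; DEPTH is the
    rescan limit mentioned in the chat below:
    Code:
    // Hypothetical sketch of treeless order-k statistics collection:
    // scan backwards over the processed buffer and count which bytes
    // followed earlier occurrences of the current k-byte context.
    // DEPTH caps the number of counted matches (the rescan limit).
    #include <cstdint>
    #include <cstring>
    
    void collect_order_k( const uint8_t* buf, int pos, int k,
                          uint32_t (&freq)[256], int DEPTH ) {
      memset( freq, 0, sizeof(freq) );
      int found = 0;
      for( int i=pos-1; (i>=k) && (found<DEPTH); i-- )
        if( memcmp( buf+i-k, buf+pos-k, k )==0 ) { // same k-byte context?
          freq[buf[i]]++;                          // count its successor
          found++;
        }
      // freq[c]/sum(freq) now estimates p( next==c | order-k context )
    }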

  2. #2
    Shelwien (Administrator)
    Code:
    <toffer> very simple, indeed
    <toffer> but that weighting formula is just ad-hoc i guess?
    <Shelwien> not quite
    <Shelwien> that weight divides the freq by total
    <Shelwien> to convert to probabilities
    <Shelwien> while the actual weight for probability distributions is (order+1)
    <Shelwien> as to 8192 - 1/8192 is the weight of order-(-1)
    <toffer> i see
    <toffer> or probability precision
    <toffer> well
    <toffer> isn't that too inaccurate?
    <toffer> anyway it'd be ok just for a demonstration
    <Shelwien> sure, if you can suggest a better simple weighting formula - you're welcome :)
    <toffer> try paq1's w ~ order^2
    <Shelwien> i tried order^2 and it worked better for low orders (like o2-o3)
    <Shelwien> but it causes some precision problems with higher orders
    <toffer> i meant actually inaccurate due to the two involved divisions
    <Shelwien> that's intended actually
    <toffer> deterministic contexts should receive higher weights
    <Shelwien> i know :)
    <Shelwien> I have a separate mixing function in ppmy and ash
    <Shelwien> with whole tables of constants
    <Shelwien> but is there really a reason to promote that kind of approach now?
    <Shelwien> instead, i'd better implement a bytewise logistic mixer or something
    <toffer> you mean non-tree based?
    <Shelwien> i mean ad hoc weighting functions
    <toffer> i don't see any reason in it - only attracting people
    <Shelwien> anyway, if you can improve it with some simple modification - just do it :)
    <toffer> what i asked for previously is the "depth" parameter
    <Shelwien> its a rescan limit
    <toffer> but why green?
    <Shelwien> because i saved a tree with it :)

  3. #3
    Kaw (Member)
    Funny, I'm currently doing the same. After fine-tuning, some more fine-tuning, and using 80 MB of memory to calculate escape probabilities, I decided to try to find a simple unpolished version that mixes the orders together. book1 comes out at 219000 in my case.

  4. #4
    Shelwien (Administrator)
    http://www.ctxmodel.net/files/green/green_v1.rar
    - book1 = 214150 (with DEPTH=8192)
    - Mixup() unrolled
    - fixed the weighting function
    - some simplifications

    Made a mistake in v0 - with DEPTH=512 the actual book1
    result was 217815, not 215815 as I thought.

    > I decided to try to find a simple unpolished version that mixes the orders together.

    Well, it's different in my case... I just got tired of people getting scared off
    by the ppmd/ppmy/ash/tree/etc data structures... and I couldn't suggest
    anything simpler that still compressed well. So now we have a simple example.

  5. #5
    Shelwien (Administrator)
    http://www.ctxmodel.net/files/green/green_v2.rar
    v2
    book1 = 211691
    - added hardcoded coefs for the weighting function

    -----

    The weighting function is still static.

  6. #6
    Kaw (Member)
    Shelwien,

    Nice compact source code, with impressive results given the size and complexity of the algorithm.

    One question: how is the coef array generated?

    One comment: your example would be more instructive with a few more comments explaining the brief notation - for example, the theory behind the simple mixer.

    Edit: 493187 on world95.txt for an order-16 compressor. Compared to the 211k on book1, it feels like the weighting is over-optimised for book1: similarly efficient algorithms of mine that reach 211k on book1 get around 460k on world95.txt. Maybe I'm wrong, but the weighting being this simple isn't by itself enough to rule out over-optimisation.

  7. #7
    Member (France)
    > Nice compact source code, with impressive results given
    > the size and complexity of the algorithm.

    I concur.

  8. #8
    toffer (Programmer)
    In modelling you generally divide a model into two components: structure and parameters. The model structure is usually fixed, but different parameter values yield different performance. In almost every engineering science it is common sense to tune model parameters to a given training set - but not in compression. I wonder why so few people around here have realized the importance of that... Wrong model parameters introduce redundancy, and thus worse compression.

    The static weights are parameters and were generated via parameter optimization - minimizing the entropy w.r.t. the model parameters. Shelwien has a simple perl script for that. The algorithm can be described like this (a code sketch follows the list):

    - the model has N parameters x1, ..., xN; xi is a bitstring
    - infinitely cycle through parameters i=1..N and search along a single degree of freedom (a single xi)
    - modify the current xi by flipping bits randomly and keep it if the cost function value improves (compression)
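
    A minimal sketch of that loop (hypothetical code - the real tuning used
    Shelwien's perl script, and cost() here is just a stand-in for running
    the coder and measuring the compressed size):
    Code:
    #include <cstdint>
    #include <cstdlib>
    #include <vector>
    
    // Stand-in for the real cost function: compress the training set
    // with parameters x and return the output size (entropy).
    double cost( const std::vector<uint32_t>& x );
    
    void tune( std::vector<uint32_t>& x ) {
      double best = cost(x);
      for(;;)                                   // cycle "infinitely"
        for( size_t i=0; i<x.size(); i++ ) {    // one degree of freedom (xi)
          uint32_t saved = x[i];
          x[i] ^= 1u << (rand()%32);            // flip a random bit of xi
          double c = cost(x);
          if( c<best ) best=c;                  // keep it if cost improved
          else x[i]=saved;                      // otherwise revert
        }
    }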

    As a bit of advertisement: I have a small set of rather complex genetic optimization algorithms implemented in M1. Btw, my current implementation with just four active models (compared to 16) compresses book1 to 210391. I guess that's a pretty good example of the importance of parameter tuning.

    AFAIR world95.txt is rather large. book1 is a poor choice for tuning toward big files, since it's just ~770k. Longer inputs amortize the statistics of higher-order contexts, so an optimization will give higher weights to the higher orders.

  9. #9
    Matt Mahoney (Expert)
    I tried compressing enwik8. After 10 hours the output file filled up my disk. Here is where it got stuck when I killed it:
    Code:
    C:\tmp\green_v2>timer green c \res\enwik8 enwik8.ari
    
    Timer 3.01  Copyright (c) 2002-2003 Igor Pavlov  2003-07-10
    Encoding.
    ^Cp=30856583/100000000 out=7101125
    
    C:\tmp\green_v2>dir
     Volume in drive C is OS
     Volume Serial Number is 66E6-426E
    
     Directory of C:\tmp\green_v2
    
    01/22/2010  11:55 AM    <DIR>          .
    01/22/2010  11:55 AM    <DIR>          ..
    05/22/2000  05:47 PM           768,771 BOOK1
    01/18/2010  10:45 PM               108 c.bat
    01/22/2010  11:55 AM    22,594,666,496 enwik8.ari
    01/20/2010  05:09 PM             4,268 green.cpp
    01/20/2010  05:14 PM            21,504 green.exe
    01/19/2010  03:02 PM               715 icl.cfg
    01/20/2010  05:16 PM               371 readme.txt
    01/18/2010  10:47 PM             1,581 sh_v1m.inc
    01/18/2010  10:46 PM                87 test.bat
    01/18/2010  10:47 PM               684 timer.inc
                  10 File(s) 22,595,464,585 bytes
                   2 Dir(s)               0 bytes free

  10. #10
    Shelwien (Administrator)
    > Nice compact source code, with impressive results given
    > the size and complexity of the algorithm.

    Why, thanks.
    The idea was to get somebody else to write a better new CM, though.

    > One question: How's the array coef generated?

    Brute-force search for a day, more or less as toffer explained.
    Sometime long ago I did that manually, though - it's not really a
    problem with only that many coefs.

    > your example would be more instructive with a few more
    > comments explaining the brief notation.

    There are already more comments than I usually write,
    and I really hate the "i++; // increment i" style,
    so just ask if you really don't understand something.

    > For example, the theory behind the simple mixer.

    You see, there's no theory.

    Code:
    w = A/(B+total/(C+1))
      = A*(C+1)/(B*(C+1)+total)
     ~= W/total
    
    P[c] = Sum[ W[j] * o[j].freq[c]/o[j].total, j=0..maxorder ] / Sum[ W[j] ]
    It's just a weighted sum of context probability distributions
    with static weights.
    It's only written that way to avoid a 32-bit overflow.
    The only important point there is that the weights for
    deterministic contexts are much bigger.
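
    In (hypothetical, simplified) code, with 64-bit sums to dodge the overflow:
    Code:
    #include <cstdint>
    
    struct Order { uint32_t freq[256], total; };
    
    // P[c] = Sum_j( W[j]*freq_j[c]/total_j ) / Sum_j( W[j] ),
    // returned scaled to 1<<16; W[] are the static weights.
    uint32_t MixedP( const Order* o, const uint64_t* W,
                     int maxorder, int c ) {
      uint64_t num=0, den=0;
      for( int j=0; j<=maxorder; j++ ) {
        if( o[j].total==0 ) continue;        // context not seen yet
        num += W[j]*o[j].freq[c]/o[j].total;
        den += W[j];
      }
      return den ? uint32_t((num<<16)/den) : (1u<<16)/256; // flat fallback
    }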

    > Compared to the 211k on book1, it feels like the weighting
    > is over-optimised for book1: similarly efficient
    > algorithms of mine that reach 211k on book1 get
    > around 460k on world95.txt.

    Note that it's _static_ mixing.
    Also, world95 is hardly a good example of stationary text.

    > Maybe I'm wrong, but the weighting being this simple
    > isn't by itself enough to rule out
    > over-optimisation.

    There's no weighting, simple or not - only static weights.
    That was the point of this demo - not the development of
    a new CM coder for my collection.

    So testing it on world95, let alone enwik8, wasn't intended -
    there wasn't enough precision for that.
    But ok, here I fixed the overflows and retuned it for world95
    (btw, its coefs show how weird it is).
    Also added a modification with a nibble tree for faster testing
    (see http://ctxmodel.net/rem.pl?-31 and http://ctxmodel.net/rem.pl?-32).
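
    For reference, the nibble-tree idea (an assumed layout, not necessarily
    what tree_v2 actually does): each context byte is consumed as two 4-bit
    halves, so a node only needs 16 child links instead of 256:
    Code:
    #include <cstdint>
    #include <cstring>
    
    struct Node {
      uint32_t child[16]; // indices into a node pool, 0 = no child
      // per-context counters would hang off the leaf nodes
    };
    
    // Descend one context byte = two nibbles, allocating from the pool.
    uint32_t Descend( Node* pool, uint32_t& used, uint32_t n, uint8_t b ) {
      for( int half=1; half>=0; half-- ) {
        int nib = (b >> 4*half) & 15;
        if( pool[n].child[nib]==0 ) {
          pool[n].child[nib] = ++used;           // grab the next free node
          memset( &pool[used], 0, sizeof(Node) );
        }
        n = pool[n].child[nib];
      }
      return n;
    }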

    http://ctxmodel.net/files/green/green_v3.rar
    Code:
                 green       tree
      book1     215040     215037
    world95     473691     471902
     enwik8          ?   21605873*
    * tested with order10, 1.2G memory
    - world95 parameter profile
    - 64-bit mixing
    - tree-based version (aka tree_v2)

  11. #11
    Matt Mahoney (Expert)
    green_v3 enwik8 -> 21,819,822 in 111475 sec. (about 31 hours). Didn't test decompression.

  12. #12
    Shelwien (Administrator)
    Well, I'd prefer to see some comments on the actual algorithm
    instead of benchmarks.
    Like whether it makes sense as a CM tutorial, or whether
    it's possible to simplify it further somehow.

    Also, an update with extra 2 days of world95 tuning:
    http://ctxmodel.net/files/green/green_v3a.rar
    Code:
               green     tree
      book1   214846   214843
    world95   473047   471244
    - updated world95 parameter profile
