
Thread: Context mixing

  1. #1
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts

    Context mixing

    Hi

    I recently set out to create a deep-context entropy encoder, building on my recent experience with order-0 entropy coders.

    This new software can use as many bytes of history as possible as context.
    It is called ILE (Infinite Length Entropy), as it can use an unbounded context length (which of course is extremely memory hungry, but it nonetheless works on small files, or files with little variation such as log files...). The maximum length can also be selected as an argument.

    For a time, it looked as if the program was a benchmark winner, but there was a flaw: the initial version was "selecting" the context with the best probability for the next byte, but the "context selector" itself ended up costing much more than initially anticipated, sometimes even more than the data itself!

    I naively thought that the deepest context was naturally the most relevant, but that assumption proved not accurate enough.

    Therefore, I switched from context selection to context mixing, with more consistent results.
    The main observed effect is to lower the cost of "bad guesses", typically to one additional bit per wrong guess. Results are consistent, but average.
    For example, enwik8 with level 5 (= order 4) compresses to 26 MB.
    With level 6 (= order 5) it results in 24 MB.
    This is only about average compared to the LTCB database.

    Now, for the question:
    on small enough files (or log files), I can enable unbounded context depth.
    The surprise is that it can actually decrease compression.
    This is unexpected, as I thought that memorising larger contexts could only improve the guess rate.

    This could be due to the context mixing formula, which provides a strong bias in favor of the highest-depth context.

    As an insight, here is the proposed context-mixing formula.
    At a given depth n:
    P(n) = Freq(n)/Total(n) x (1-Error(n)) + P(n-1) x Error(n)

    The idea is that as the number of samples grows, the frequencies stored at the current context become more and more relevant. If they are not relevant enough, the probability from the previous context (n-1) is used as a complement to get the final probability. This is recursive. The idea seems interesting, but results are apparently not that excellent.

    Error evaluation is key to this approach. It is calculated by taking into consideration the number of samples at the current context. I tried several different formulas, and mostly settled on one which provides quite "regular" results:

    Error(n) = 0.5 / sqrt(Total(n))

    Very deep or unbounded context depths get better results with the following formula: Error = 1 / sqrt(Total). But results are worse for smaller limits, such as L=5.
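
    To make the recursion concrete, here is a minimal sketch (in C) of how the formula could be evaluated, from the shallowest to the deepest available context. The array layout freq[d][sym] / total[d] is just an assumption for the sketch, not how ILE actually stores its statistics:

    #include <math.h>

    /* Minimal sketch of the recursive blend:
       P(n) = Freq(n)/Total(n) * (1-Error(n)) + P(n-1) * Error(n)
       with Error(n) = 0.5 / sqrt(Total(n)).
       freq[d][sym] : occurrences of 'sym' at context depth d (assumed layout)
       total[d]     : total occurrences recorded at context depth d          */
    double mixed_probability(int max_depth,
                             const unsigned freq[][256],
                             const unsigned total[],
                             unsigned char sym)
    {
        double p = 1.0 / 256.0;                   /* depth -1 fallback: uniform  */
        for (int d = 0; d <= max_depth; d++) {
            if (total[d] == 0) continue;          /* unseen context: keep P(d-1) */
            double err = 0.5 / sqrt((double)total[d]);
            if (err > 1.0) err = 1.0;             /* clamp for tiny sample counts */
            double local = (double)freq[d][sym] / (double)total[d];
            p = local * (1.0 - err) + p * err;    /* deeper contexts dominate     */
        }
        return p;
    }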


    Well, anyway, it seems I'm now interested in context mixing, a new area I was not even aware of less than a week ago.

    A few areas of interest:

    1) Error calculation does not depend on the previous context's frequency. But maybe it should? After all, if the current context has a very low frequency count (for example 2 samples) and the previous one has a very high count (for example 5800 samples), it would seem much safer to use Context(n-1) as the main contributor to the probability estimation.
    There is no such relation in the current formula.
    (Note that indirectly there is a relation, as P(n) depends on P(n-1), which itself depends on Total(n-1) through Error(n-1). But it is not a strong one...)

    2) Error does not depend on the current context depth (n).
    Should it?
    After all, it seems that 'the highest context is the most relevant' is less and less correct at higher context depths. But is that only due to the lower number of samples?

    3) There is no "backward feeding" nor learning process.
    Should there be? And in which way? Could it be integrated into "Error", for example?

    4) There is no "secondary evaluation". Note that I do not really understand what it means, except that the probability is further processed by a secondary system which "corrects" it; it remains unclear how, why, etc.
    Last edited by Cyan; 2nd December 2009 at 22:43.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
    > or files with little variations such as log files...

    In fact redundant files like these tend to also cause
    redundancy in statistics, and CM/PPM/BWT commonly have
    problems in such cases.

    > For a time, it looked as if the program was a benchmark
    > winner, but there was a flaw:

    A nice property of statistical models is that they're
    naturally symmetric, so it's possible to use the same
    algorithm both to encode and decode the data.
    And I think that it's always better to start with that,
    despite speed considerations etc, because at least
    you would always be able to test decompression.

    > but the "context selector" itself ended up costing
    > much more than initially anticipated,

    Being able to encode the same data in multiple ways
    is always bad for compression.
    All LZ coders are like that too, though.

    > I naively thought that the deepest context was naturally
    > the most relevant, but it proved not correct enough.

    Try it with a wordlist file (like "english.dic") - it's kind
    of a worst case for unbounded context schemes.

    > This is unexpected, as i though that larger contexts
    > memorisation would only improve guess rate.

    You can test PPM compressors, like PPMd, to see the same effect.
    Having more statistics is better, but harder to analyse.

    Also, I really wonder how you store the statistics?
    Probably, in a hashtable?
    In that case extra statistics would also cause more
    collisions and thus prediction errors.

    > This could be due to the context mixing formula, which
    > provides a strong bias in favor of highest-depth context.

    Why don't you try static weights first?
    Also I can suggest a simple example: http://ctxmodel.net/files/tree_v1.rar
    Mixing is in "model.inc".
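
    For instance, a fixed-weight linear mix of two submodel predictions could
    look like the sketch below (the weight value is just an example, not the
    one used in tree_v1):

    /* Static linear mixing: the weight is a constant picked by hand
       (or tuned offline), instead of being derived from counts at runtime. */
    double mix_static(double p_low_order, double p_high_order)
    {
        const double w = 0.7;   /* example weight for the higher-order model */
        return w * p_high_order + (1.0 - w) * p_low_order;
    }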

    > The idea is that with improved number of samples, the
    > frequencies stored at current context are more and more
    > relevant.

    In fact, they're less relevant.
    That is, the context weight is inversely proportional to
    the number of the context's occurrences.

    > Well, anyway, it seems i'm now interested in context
    > mixing, a new area i was not even aware of less than a
    > week ago.

    Great

    > 1) Error calculation does not depend on previous context frequency.

    It's not really an "error" - it's the probability of a context
    being _not_ optimal for prediction.

    > After all, it seems that 'highest context is most
    > relevant' is less and less correct with higher context
    > depths. But is it only due to the lower number of samples?

    It's data-specific.
    In some types of data (like english.dic), contexts of a higher
    order than X are purely random.
    So you have to collect statistics on that too.

    > 4) There is no "secondary evaluation". Note that i do not
    > really understand what it means, except that the
    > probability is further processed by a secondary system
    > which "corrects" it, but it seems unclear how, why, etc.

    You can count the symbol frequencies in the case of (x<=P(n)<y),
    and then use these instead of P(n), if you know that P(n)
    is too imprecise as is.
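
    A minimal sketch of that idea, shown for a binary event for simplicity
    (the bucket count and the 30-sample threshold are arbitrary choices here):

    #define SSE_BUCKETS 32

    /* Per-bucket statistics: how often the predicted event actually happened
       when the primary prediction fell into this probability interval.      */
    static unsigned sse_hits[SSE_BUCKETS]  = {0};
    static unsigned sse_total[SSE_BUCKETS] = {0};

    static int sse_bucket(double p) {               /* map probability to bucket */
        int b = (int)(p * SSE_BUCKETS);
        return b < 0 ? 0 : (b >= SSE_BUCKETS ? SSE_BUCKETS - 1 : b);
    }

    /* Replace the primary prediction by the rate observed in its bucket,
       falling back to the primary prediction while the bucket has few samples. */
    double sse_predict(double p_primary) {
        int b = sse_bucket(p_primary);
        if (sse_total[b] < 30) return p_primary;
        return (double)sse_hits[b] / (double)sse_total[b];
    }

    /* After coding, record whether the predicted event occurred. */
    void sse_update(double p_primary, int event_occurred) {
        int b = sse_bucket(p_primary);
        sse_total[b]++;
        if (event_occurred) sse_hits[b]++;
    }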

    Here you can have some more mixing examples:
    http://ctxmodel.net/files/PPMd/ppmd_Jr1_sh8.rar -- PPM
    http://compression.ru/sh/ppmy_3c.rar -- plain context mixing
    http://compression.ru/sh/ppmy_3c_sse.rar -- good as SSE intro
    http://ctxmodel.net/files/ASH/ash04.rar -- better weighting and SSE
    http://sites.google.com/site/shelwien/o3bfa.rar -- O3 ppmy with runtime bruteforce parameter opt
    Last edited by Shelwien; 3rd December 2009 at 01:39.

  3. #3
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Thanks Shelwien

    It seems that what I'm trying to achieve is just another PPM implementation. Then again, I was not even able to understand what PPM was about a few days ago...

    I ran some tests with PPMd/PPMs for comparison. I found them pretty fast for the job, and their memory consumption is really reasonable given the amount of data they are analysing.
    ILE is not too far behind in terms of compression ratio at an equivalent order, but it is slower and considerably more memory hungry (not optimised).
    So I guess that continuing in the same direction can only lead to something equivalent to PPM at best.

    > Also, I really wonder how you store the statistics?
    > Probably, in a hashtable?
    > In that case extra statistics would also cause more
    > collisions and thus prediction errors.
    They are stored in a tree, so no collisions.
    This of course also means a more complex data structure and variable memory consumption depending on the data source (binary files make the tree grow much faster than texts).
    But as all this is a learning exercise, I'm actually looking for clean data for analysis. Memory optimisation is not a priority yet.

    > In fact, they're less relevant.
    > That is, the context weight is inversely proportional to
    > the number of the context's occurrences.
    That part I don't understand. Or maybe we are not talking about the same thing.
    I agree that within a context with a lot of occurrences, each occurrence has less "influence" on the statistics. But that was not my point.

    The idea is that a probability based on a large number of statistics is more "reliable" than one based on few examples.
    If at depth "n" I get a 33% probability based on 3000 samples, that looks like a pretty solid statistic.
    If at depth "n+1" I get a 33% probability based on 3 samples, that looks much more random, even though it is the same probability.

    That's all that "error" is supposed to capture.

    Now, it is based on different grounds than "context weighting", although it seems to serve the same purpose.
    Maybe an idea could be to mix both...


    > Why don't you try static weights first?
    > Also I can suggest a simple example: http://ctxmodel.net/files/tree_v1.rar
    > Mixing is in "model.inc".
    Thanks for the example. I shall have a look at it.


    Additional thought:

    I was lately pondering the principles at stake, and I wondered about the limitation of dynamic range coding being based solely on past data.

    The main issue is that a "new" byte (never seen before at a given context depth) is going to cost considerably more. I could witness this with debug statistics: when a "new" situation arises, the new byte can cost a huge number of bits.

    This is in contrast with semi-static statistics (with a header), where we know in advance whether a given byte is going to be present, and in which proportion. So there is no penalty here.

    Of course, semi-static has its own drawbacks, especially header cost, and by definition it does not adapt within a block. Therefore more blocks are needed to adapt, hence more headers, etc. Maybe one idea could be to work on dynamic block sizes.

    On the other hand, the main advantage of dynamic coding seems to be its capability to change its statistics "on the fly".
    Therefore, if we are just "accumulating" statistics, then we miss the main advantage of being dynamic. The statistics should morph all the time.

    At which rate, and according to which rule?
    That is the question.

    I currently update my statistics every 10000 occurrences, by dividing the stored statistics by 2, thus giving newer occurrences a higher "share".
    This may sound fine for order-0 or order-1, but obviously order-5 statistics are going to adapt very slowly, maybe hardly ever.

    So something faster should be found, probably depending on context depth.
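
    For reference, the periodic halving described above amounts to something like this (a sketch with hypothetical field names, assuming a simple per-context frequency table):

    #define RESCALE_PERIOD 10000   /* occurrences between rescales (current choice) */

    typedef struct {
        unsigned freq[256];        /* per-symbol counts for this context */
        unsigned total;            /* sum of freq[]                      */
        unsigned since_rescale;    /* occurrences since the last rescale */
    } ContextStats;

    /* Halve all counts so that newer occurrences get a larger "share". */
    static void rescale(ContextStats *c)
    {
        c->total = 0;
        for (int s = 0; s < 256; s++) {
            c->freq[s] = (c->freq[s] + 1) / 2;   /* round up: seen symbols stay >= 1 */
            c->total  += c->freq[s];
        }
        c->since_rescale = 0;
    }

    void update_stats(ContextStats *c, unsigned char sym)
    {
        c->freq[sym]++;
        c->total++;
        if (++c->since_rescale >= RESCALE_PERIOD)
            rescale(c);
    }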

    One idea documented by Matt Mahoney is to update statistics when the model guesses "badly". It is probably much easier to define "bad" and "good" for bit coders, but nonetheless it is probably something to consider for byte coding too...
    Last edited by Cyan; 3rd December 2009 at 15:16.

  4. #4
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Well, to get into the subject you should read about PPM. Here's the classical paper to get some better understanding:

    https://eprints.kfupm.edu.sa/33300/1/33300.pdf

    Googling for the stuff linked at Wikipedia helps, too.

    http://en.wikipedia.org/wiki/Predict...rtial_Matching

    Shelwien has Shkarin's PPMII papers available. They're a bit unreadable but are an interesting source of information, too. They also describe a quite usable data structure.

    Generally, longer contexts predict more accurately. But you've got the problem of small and inaccurate statistics in higher-order contexts. That's called the zero frequency problem. A more or less efficient solution is the use of secondary models to cluster higher-order statistics.

    Then there's PPM* and PPMZ, which might be relevant to improve your understanding of higher-order statistics and secondary models. There are also some papers by Fenwick which describe a symbol-ranking algorithm and include some information about the predictive power of long matching contexts.

    http://www.jucs.org/jucs_3_2/symbol_.../Fenwick_P.pdf
    http://www.lontra.org/pub/compressio...chRep132.ps.gz
    http://www.cbloom.com/papers/index.html
    http://look.yourself.for.ppm*.com/or_something

    You may want to have a look at the counter thread. To make your counters adaptive *per update* you can use something like

    N(s) = c*N(s) + 1,
    N(t) = c*N(t) for all other symbols t!=s
    c in (0, 1]

    where N(x) is the frequency of the symbol x.
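
    A direct (unoptimized) transcription of that update rule, assuming a plain
    array of per-symbol frequencies:

    #define ALPHABET 256

    /* Exponential-decay update: after seeing symbol s,
       N(s) = c*N(s) + 1 and N(t) = c*N(t) for t != s, with c in (0,1].
       c = 1 is plain counting; smaller c forgets old data faster.       */
    void update_counters(double N[ALPHABET], unsigned char s, double c)
    {
        for (int t = 0; t < ALPHABET; t++)
            N[t] *= c;                    /* decay every symbol's count    */
        N[s] += 1.0;                      /* reinforce the observed symbol */
    }

    /* The probability estimate is still the normalized frequency. */
    double symbol_probability(const double N[ALPHABET], unsigned char s)
    {
        double total = 0.0;
        for (int t = 0; t < ALPHABET; t++)
            total += N[t];
        return total > 0.0 ? N[s] / total : 1.0 / ALPHABET;
    }

    (In practice the per-update loop can be avoided by keeping a single running
    scale factor and adding 1/scale to the observed symbol only, renormalizing
    occasionally.)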

    For statistical compressors using orders > 0, adaptivity is a must-have, since it would be practically impossible to store the statistics for higher orders in a header. As you know, the number of possible contexts grows exponentially (though not all are used).

    Hope it helps.

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
    > It seems that what i'm trying to achieve is just another
    > PPM implementation.

    The key difference between PPM and CM is that PPM tries to predict
    the symbols using a single context, mostly for speed optimization
    reasons, while CM merges predictions from multiple submodels.

    > I had some tests with PPMd/PPMs, for comparison. I found
    > them pretty fast for the job, and their memory consumption
    > is really reasonable given the amount of data they are
    > analysing.

    There are some applicable ideas which can improve all the
    characteristics of PPMd, and I hope to eventually find time
    to implement them in ppmd_sh.
    For example, the symbol table pointer in context nodes is
    redundant there - it can be removed by merging the context
    node and the symbol table into a single structure.

    > So i guess that continuing in the same direction can only
    > lead to something equivalent to PPM at best.

    PPMd is very far from being perfect, and I don't see why
    you can't write a better PPM coder.

    > They are stored in a tree, no collision.

    Good. It's been a long time since somebody bothered to use a
    dynamic data structure.
    But then, do you have any ideas on how to contain the
    memory usage? Just resetting the tree when there's no
    more memory is a bad choice.

    >> That is, the context weight is inversely proportional to
    >> the number of context's occurences.
    > That part i don't understand. Or maybe we are not talking about the same.

    In other words,
    when you have more samples in a context, it also becomes more likely
    that higher-order contexts would be more relevant.

    > The idea is that a probability based on a large number of
    > statistics is more "reliable" than based on few exemples.

    Sure, it may become more reliable for this specific context
    (with data from a stationary source).
    But higher-order contexts may be able to provide predictions
    of much better precision.

    > If at depth "n", i get a 33% probability based on 3000
    > samples, this looks pretty solid statistic.
    > If at depth "n+1" i get 33% probability based on 3
    > samples, this looks much more random, even though this is
    > the same probability.

    Right, but not if you have 33% probability at order-n
    and 66% (or ~100%) at order-n+1.

    > Of course, semi-static has its own drawbacks, especially
    > header cost, and by definition it does not adapt within a
    > block. Therefore more blocks to adapt, so more headers,
    > etc. Maybe one idea could be to work on dynamic block
    > size.

    There's still Bulat's static CM at least
    http://haskell.org/bz/cm.zip

    Also PPMd can be used in a static fashion too -
    with ppmtrain etc.

    > On the other hand, the main advantage of dynamic seems to
    > be its capability to change its statistics "on the flow".

    In that sense, "adaptive" might be a better term.

    > At which rate, and according to which rule ?
    > That is the question.

    That's data-specific.
    Basically you have to find some coefficients which work
    well on your testset.

    > One idea documented by Matt Mahoney is to update
    > statistics when it guesses "bad". This is probably much
    > easier to define "bad" and "good" for bit-coders. But
    > nonetheless, it is probably something to consider for
    > Byte-coding too...

    "Common sense" ideas rarely work right in compression.
    For example, LZ seems "common sense" (we replace string
    repetitions with shorter references), but competing with
    7-zip at that requires implementing some non-trivial
    parsing optimizations (we can't just use the longest
    match at each point), and even that can be easily beaten
    (in compression ratio) by a very trivial CM, which just
    tries to minimize the entropy instead of some notion of "good" or "evil".
    Last edited by Shelwien; 3rd December 2009 at 17:31.

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
    By toffer's request, links to Shkarin's paper(s):
    http://ctxmodel.net/files/PPMd/ShkarinPPMII.pdf - "PPM: one step to practicality", DCC 2002
    http://ctxmodel.net/files/PPMd/ENGPAPER.txt - "PPM? Into practice!" -
    looks like a plaintext version of previous one
    http://ctxmodel.net/files/PPMd/practicalppm.pdf - "Повышение эффективности алгоритма PPM" ("Improving the efficiency of the PPM algorithm")

  7. #7
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Thanks Toffer, Thanks Shelwien

    These make for very interesting reading.
    Indeed, the papers from Charles Bloom are also very enlightening, and now I'm starting to understand the relation between LZP and PPM*, and by generalisation, between LZ* algorithms (and most especially ROLZ ones) and order-n entropy coders.
    So, to correct my initial vocabulary: unbounded context length already exists and is called PPM*, and what I called a "context selector" is more properly called an "escape symbol".

    > The key difference between PPM and CM is that PPM tries to predict
    > the symbols using a single context, mostly for speed optimization
    > reasons, while CM merges predictions from multiple submodels.
    OK, so that means ILE is a CM coder now.
    I changed the formula a bit, and it slightly improved the compression ratio.
    However, files compressed by ILE are still typically 4% bigger than PPMd's at a comparable depth level. And increasing the maximum context depth still has a negative impact beyond a given threshold (file-dependent). So obviously I'm missing some improvements somewhere.

    > do you have any ideas on how to contain the
    > memory usage? Just resetting the tree when there's no
    > more memory is a bad choice.
    Sure, this part of the program needs serious optimisation, and there are ideas for it. But this is currently not my main focus.
    ILE can handle any context depth with no memory reset for small files (such as the Calgary corpus, for example), and up to order-5 with enwik8. This is enough to provide some useful comparisons with equivalent programs.
    It makes more sense to start optimizing memory after improving the compression ratio, because perspectives are otherwise limited.

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
    > However, compressed file by ILE are still typically 4%
    > bigger than PPMd at comparable depth level. And increasing
    > maximum context depth still has a negative impact beyond a
    > given threshold (file-dependant). So obviously i'm missing
    > some improvements somewhere.

    1. PPMd (and practically all other PPM coders) still has problems
    with max. context order setting.
    2. A very important feature for high-order PPMs is
    so-called "inheritance" - the method of counter initialization
    when new symbol is added to context (the idea is to set it up in
    such a way that it won't hurt the compression even in the
    worst case).
    3. Symbol masking - if you "escape" from some context, you
    can't encode its symbols anymore, but there will be corresponding
    counters in lower orders, which have to be "removed" (see the sketch after this list).
    4. Escape probability estimation. It's the most important one for any PPM - that's
    where secondary estimation (SEE) first appeared, etc.
    5. Context updates. Updating only the coding context is not
    quite good for compression.
    6. There's no such thing as stationary data in practice.
    PPMd rescales roughly every 16 symbols.
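
    As an illustration of point 3, exclusion can be as simple as carrying a
    "seen in a higher order" mask down the escape chain. A sketch, not the
    actual PPMd implementation:

    typedef struct {
        unsigned freq[256];   /* per-symbol counters of one context */
        unsigned total;
    } Context;

    /* After escaping from 'higher', mark its symbols as excluded:
       the escape already told the decoder that none of them occurred. */
    void mark_excluded(const Context *higher, unsigned char excluded[256])
    {
        for (int s = 0; s < 256; s++)
            if (higher->freq[s] > 0)
                excluded[s] = 1;
    }

    /* When coding in the lower-order context, skip the masked symbols,
       both in the individual counters and in the total.                */
    unsigned masked_total(const Context *c, const unsigned char excluded[256])
    {
        unsigned total = 0;
        for (int s = 0; s < 256; s++)
            if (!excluded[s])
                total += c->freq[s];
        return total;
    }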

    > ILE can handle any context depth with no memory reset for
    > small files (such as calgary corpus for example), and up
    > to order-5 with enwik8. This is enough to provide some
    > usefull comparison with equivalent programs.

    Sure, but designing a tree with a sliding window or something similar first
    might actually be a good idea,
    because a tree designed to optimize the compression ratio likely
    won't be compatible with windowed processing.

    > It makes more sense to start optimizing memory after
    > improving compression rate, because perspectives are
    > otherwise limited.

    I thought so too before, but as a result we don't have
    any bytewise PPM/CM now which could properly handle
    the tree overflow.
    Even the complicated "tree cutting" in PPMd doesn't completely
    solve the problem, and as a result ppmd loses to simpler
    compressors in modern benchmarks, which use files large enough
    to cause overflows, even in cases where
    it can easily win with smaller files.
    Last edited by Shelwien; 4th December 2009 at 19:15.
