
Thread: A new approach?

  1. #1
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts

    A new approach?

    First of all I want to say that English is not my native language and I have never been to an English speaking country, so please be patient with me.

    Being a programmer with an almost obsessive interest in data compression, I decided to try to write my own compression software. My goal is to use a different approach rather than an existing algorithm.
    For the actual encoding I believe that arithmetic coding is the best option and that it is near perfect, so there is no room for optimization here.
    That leaves the data modeling (if that's the correct term). My reasoning is that finding every possible pattern of any length in the known (past) data, plus a 'perfect' way to turn that information into a prediction for the next bit or byte, would be a general improvement for the compression scene.
    If I'm correct, PPM (bounded and unbounded) relies on hash tables and a limited amount of memory. PPM is not able to find all patterns of any length, except maybe for small files or very big machines.
    I like DMC and its simplicity, but the results are not perfect and it tends to use massive amounts of memory when trying to map every possible pattern.
    CM is nice and produces the best results so far, but isn't CM only a way of reducing the prediction errors made by the combined imperfect data modeling algorithms?
    Next are the LZ-like models. Good for speed, but the compression can't match the PPM and CM-like compression algorithms.
    BW is cool, but it won't give you every possible pattern.
    CTW is rare. Implementations give good results but use a lot of memory, and so far no one has managed to build a popular compressor on this algorithm.
    But enough about other types of algorithms. I gave my algorithm the following boundaries: for enwik8 it must be able to find all patterns within 1 GB of memory. No, I'm not pursuing the Hutter Prize. My algorithm is allowed to be as slow as it needs to be, and it will be too slow for the 10-hour rule. Yet the speed should be somewhat realistic, so 1 byte per second is not my intention.

    The results so far:
    40,000,000 bytes in 24 hours on enwik8, in C# without using pointers and calling Array.Resize() much too often, so some improvement is still possible. A worst-case test shows a usage of 1100 MB of memory after 100,000,000 completely random bytes, so memory usage needs work and there is plenty of room for improvement. The algorithm could easily be limited to about 950 MB, but that would slow things down considerably. This is just a test for finding patterns; an actual prediction will slow things down even more. Book1 is processed within 10 seconds, world95.txt in about 3 minutes. After 36 hours enwik8 reached 50,000,000 bytes.
    For now I think part 1 of the two goals is in good enough shape to start on part 2: the actual prediction.

    Part 1 results in a really long list of bytes usable for prediction, together with their pattern lengths. The list is mixed and absolutely not sorted. How should I create a good prediction using that information? I have no clue. Do any of you have a good idea?

    My second question is: should I continue my quest, or has this been tried before and did it turn out to be a dead end?

    (After part 1 and part 2 I will try to make a 'windowed' version of the algorithm, so that very large files can be processed too and small windows give the user the option of more speed. The code for arithmetic coding is already finished and works perfectly.)

  2. #2
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    681
    Thanks
    38
    Thanked 168 Times in 84 Posts
    Kaw, I'm not someone who is involved in programming, but at least I can say one thing - you found a good place to share your experience and expand your knowledge. Don't be ashamed to ask, and maybe you'll get some good advice from the pack gurus. Best wishes

  3. #3
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Quote Originally Posted by Skymmer View Post
    you found a good place to share your experience and expand your knowledge. Don't be ashamed to ask, and maybe you'll get some good advice from the pack gurus. Best wishes
    I will and thanks for the best wishes

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,012 Times in 537 Posts
    > I decided to try to make my own piece of software to
    > achieve compression.

    Welcome.

    > My goal is to use a different approach and not using an
    > existing algorithm.

    There's no such thing as an "existing algorithm".
    Even for deflate, only the stream format is fixed, which leaves
    a lot of freedom for possible encoder/decoder implementations.
    So when people say that their compressor uses LZP or CM
    or even BWT, it doesn't really mean much, at least these days.

    > For the actual encoding I believe that arithmetic encoding
    > is the best option and that this encoding is near perfect,
    > so no room for optimization here.

    Well, somehow I still end up using a different rangecoder version in
    each new project... there are really a lot of possible features that
    might be necessary, and that's before we even consider the possible
    speed optimizations, which are practically endless.
    One example is that all the common implementations show
    some visible redundancy on very redundant data (compared to the log2
    sum), and with large alphabets (e.g. if you tried to implement
    a word-based coder).
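
    To illustrate what "comparing to the log2 sum" means: sum the ideal code length -log2(p) over the probabilities the model actually fed to the coder, and compare that with the real output size. A rough sketch (the function and variable names are made up for the example):
    Code:
    #include <cmath>
    #include <vector>

    // ideal code length in bits, given the probability assigned to each coded bit/symbol
    double ideal_bits(const std::vector<double>& probs)
    {
        double bits = 0;
        for (double p : probs) bits += -std::log2(p);
        return bits;
    }

    // coder redundancy in bytes = actual output size minus the ideal size
    double redundancy_bytes(const std::vector<double>& probs, size_t compressed_bytes)
    {
        return (double)compressed_bytes - ideal_bits(probs) / 8.0;
    }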

    > That leaves me to the data modeling (if that's the correct term).

    Sure, but useful models are normally far more advanced than
    match finding (and btw, I prefer to use "modelling" -
    both spellings are in the dictionary and mean the same thing, but Google
    shows more links with the scientific meaning for "modelling"
    than for "modeling").

    > My logic tells me that finding every possible pattern of
    > any length in the known (past) data and the 'perfect' way
    > to make a prediction for the next bit or byte out of this
    > information would be a general improvement for the
    > compression scene.

    Well, that's a really good way to use up any available computing
    resources, but not especially helpful for compression.

    > If I'm correct PPM (bounded and unbounded) is relying on
    > hash tables and a limited amount of memory. PPM is not
    > able to find all the patterns of any length, except maybe
    > for small files or very big computers.

    I'm not sure what you mean by "patterns" here - just substrings,
    or something more like regexps?
    And PPM is not limited to hashes in any way. It's just a skewed
    (for speed) CM variant, which attempts to use the statistics from
    one specific prefix context to predict the next symbol, with fallback
    to shorter prefixes. Btw, ppmd, the most widely used PPM implementation,
    is based on a suffix tree, not a hash.
    And regarding "patterns", many compressors use something called
    "sparse contexts" - basically fixed patterns other than the usual prefixes.
    From that it's known that the possible gain from something like that
    is very small, especially for texts - on the scale of 0.1%.
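
    As an illustration of the "one prefix context with fallback to shorter prefixes" idea, here is a rough prediction-only sketch (mine, not ppmd's actual code; a real PPM would also code escape symbols instead of silently falling through):
    Code:
    #include <array>
    #include <string>
    #include <unordered_map>

    struct ContextStats { std::array<int, 256> count{}; int total = 0; };

    std::unordered_map<std::string, ContextStats> stats;   // context -> symbol counts

    // probability of 'sym' after 'history', falling back from order max_order downwards
    double predict(const std::string& history, unsigned char sym, int max_order = 5)
    {
        for (int order = max_order; order > 0; --order) {
            if (order > (int)history.size()) continue;
            std::string ctx = history.substr(history.size() - order);
            auto it = stats.find(ctx);
            if (it != stats.end() && it->second.count[sym] > 0)
                return (double)it->second.count[sym] / it->second.total;
        }
        return 1.0 / 256;   // nothing seen: fall back to a uniform estimate
    }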

    > I like DMC and its simplicity, but the results are not
    > perfect and it tends to use massive amounts of memory when
    > trying to map every possible pattern.

    DMC is basically a bitwise PPM.

    > CM is nice and produces the best results so far, but isn't
    > CM only a way of reducing the prediction errors made by
    > the combined imperfect data modeling algorithms?

    Yes, but a single perfect model for all data doesn't exist.
    (Proof: sequences generated by two different PRNGs would require
    two different perfect models.)
    And in CM some of the mixing can be avoided if you know a correct
    way to parse the data (and can thus use only specific models for
    Time and String fields in database records, without mixing them).
    But even a Time model would require mixing (for example, only
    Times from some limited set might occur most of the time, but
    sometimes a new one might appear - so we would need a probability
    estimation for a new Time... basically an order-0 PPM).
    And with text there's no better way than to keep a really large number
    of different models, because we only know that cases which can
    be best handled by these models will appear, but we don't know
    exactly when... and if all we know is a probability distribution
    for model fitting, that only means we have to use a linear
    mix of the model estimations to predict the next symbol.
    (There are CMs which don't use linear mixing, but that only
    means they don't produce the probabilities of model matches
    in explicit form.)
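
    For illustration, here's what that linear mix looks like (a rough sketch; the weight adaptation itself is left out):
    Code:
    #include <vector>

    // p[i] = submodel i's estimate of P(bit = 1), w[i] = its non-negative weight,
    // i.e. the implicit probability that submodel i is the one that fits here
    double mix_linear(const std::vector<double>& p, const std::vector<double>& w)
    {
        double num = 0, den = 0;
        for (size_t i = 0; i < p.size(); ++i) { num += w[i] * p[i]; den += w[i]; }
        return den > 0 ? num / den : 0.5;
    }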

    > Next are the LZ-like models. Good for speed, but the
    > compression can't match the PPM and CM-like compression
    > algorithms.

    It's really a separate modelling approach, and we're
    still waiting for an LZ-mixing implementation which would
    show LZ's real compression ratio, without its usual
    multiple-encoding redundancy.
    Of course, LZ won't ever be the best model for text,
    but there are other data types where even existing
    LZ compressors show better results than existing CM.

    > CTW is rare.

    CTW is a bitwise CM with a specific weighting function.

    > For enwik8 it must be able to find all patterns within 1GB
    > of memory.

    Now, there's really no need to search for anything in enwik8.
    That is, first you have to remove all the non-textual
    syntax from it (XML, links, images, tables, urls, unixtimes, utf8, etc.),
    and then think really hard about what to consider a pattern
    in what's left. (Like, <noun> <verb> certainly would be
    a useful pattern, but who knows how to recognize it.)

    > The results so far: 40.000.000 bytes in 24 hour on enwik8
    > in c# without using pointers and using Array.Resize()

    C# (or any other modern language) is a really bad idea
    for any compression application. Well, it's still better
    than Python or Ruby, I guess.
    Basically, you must use C++ for all data processing
    (GUI etc. doesn't matter, of course)
    if you care about speed or memory usage.
    (Well, there's Intel's Fortran compiler too, but it's kind of a bad joke.)

    > Part 1 results in a really long list of bytes usable for
    > prediction and their pattern lengths. The list is mixed
    > and absolutely not sorted. How should I create a good
    > prediction using that information? I have no clue: Any of
    > you having a good idea?

    Good models are adaptive; you can't hope for good results
    with globally transformed data, as there are always some local
    correlations.
    And if you still want to try, you'd need a custom model for
    any extra information you collected, and then a data coder
    which would be able to use that extra information to reduce
    the code size.
    The best-known implementations of this scheme compress the
    statistics first, and then the data using them. Of course, their
    compression level is rather bad, but there's some hope for
    fast compressors based on this approach.
    And "of course" is there because it's much harder to write
    a good model for a data type with complex structure and correlations
    (like statistics on substring occurrences in a file) than
    for simple text, and even after succeeding in that,
    the possible compression level would still be limited by
    lots of unused correlations hidden by the transformation.

    > My second question is: Should I continue my quest or is
    > this tried before and was it a dead end?

    Considering direct competition with paq8, it's fairly dead, but
    maybe you can manage to make an analysis tool whose results
    could be used by some other CM.
    For example, it could be useful to somehow detect and describe
    text structure, as in apache log files (fp.log), or something
    like world95.txt, with lots of articles written from a fixed template.
    Last edited by Shelwien; 23rd April 2009 at 13:29.

  5. #5
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Shelwien,

    Thanks for your reply. I think you are too negative about the idea. In a past (much slower) implementation I was able to compress world95.txt to 448000 bytes and book1 to somewhere below 210000, and that with an unscientific, self-made, half-Bayes-like weighted average. It's not that good, but it also proves that I'm not a complete idiot.

    By patterns I mean the found 'substrings' matching the bytes/bits just before the byte/bit that needs to be predicted. My experiments show that longer patterns generally predict the next bit/byte better than shorter patterns. For example, in world95.txt a pattern length of 14 bytes predicts the most significant bit correctly 99% of the time, and the following bits even better.

    Thanks for explaining some more about the other compression algorithms.

    By the way: CTW is absolutely not a CM algorithm. A PPM using a tree structure and a good weighting algorithm comes closer to describing CTW. (http://www.ele.tue.nl/ctw/)
    Last edited by Kaw; 23rd April 2009 at 13:42.

  6. #6
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hi!

    Your idea really reminds me of something I did a long time ago. You may want to read about context sorting.

    http://naomi.is.s.u-tokyo.ac.jp/~sad...adaseq97.ps.gz
    http://comjnl.oxfordjournals.org/cgi...2_and_3/94.pdf

    Well, CTW is a context mixing algorithm. It combines different models' predictions (= CM). Trees and hashes are just a matter of statistics storage, i.e. the implementation, not the algorithm itself. It'd be possible to implement a PAQ-like CM with a tree. CM is a more general algorithm class and CTW a subset, like PAQ. Eugene's ASH is a CM algorithm too - but completely different from CTW and PAQ. Still, it mixes different models' predictions.

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,012 Times in 537 Posts
    > Thanks for your reply. I think you are too negative about
    > the idea. In a past (much slower) implementation I was
    > able to compress world95.txt to 448000 bytes and book1
    > somewhere below 210000 and that with an unscientific
    > selfmade half-bayes-like weighting average. It's not that
    > good, but it also proves that I'm not a complete idiot.

    That's surely better than nothing, but 210k on book1
    doesn't require much of a model. Here I have a coder with
    precise occurrence numbers and mixing with fixed ad hoc weights,
    and it still compresses book1 to 215.7k.
    http://ctxmodel.net/files/tree_v1.rar

    > I call patterns the found 'substrings' equaling the
    > bytes/bits just before the byte/bit that needs to be
    > predicted.

    Well, call them substrings or prefixes or contexts or n-grams
    then. "Pattern" sounds like something more advanced.
    And if it's simple n-gram statistics, then most CMs collect
    them, including the thing I linked above.
    And paq8 is beyond that, as it has a text-specific word model
    and other tricks, in addition to prefix context models.

    > My experiments show that longer patterns generally predict
    > the next bit/byte better than shorter patterns.

    Well, I'd surely appreciate it if somebody developed a new
    compressor, not bitwise and/or not hash-based.
    And n-gram statistics are certainly necessary for any statistical compressor.
    But for compression better than 206k on unfiltered book1,
    plain weighting is not enough.
    Or maybe I just mistakenly decided that that's what you were aiming for, after seeing
    "hutter prize" mentioned and all.

    > By the way: CTW is absolutely not a CM algorithm. A PPM
    > using a tree structure and a good weighting algorithm
    > comes closer to the explanation of CTW.

    And I think it is.
    For one, "Weighting" in CTW means the same as "mixing", and
    I don't remember anything similar to escapes there.
    Last edited by Shelwien; 23rd April 2009 at 14:55.

  8. #8
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    681
    Thanks
    38
    Thanked 168 Times in 84 Posts
    I can't resist the urge to reply to Shelwien's words...
    Fascinating

  9. #9
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Quote Originally Posted by toffer View Post
    Hi!

    Your idea really reminds me of something I did a long time ago. You may want to read about context sorting.

    http://naomi.is.s.u-tokyo.ac.jp/~sad...adaseq97.ps.gz
    http://comjnl.oxfordjournals.org/cgi...2_and_3/94.pdf
    Thanks toffer!
    That's exactly like my first implementation. The current one is a lot faster because there is no need to sort things.

  10. #10
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Well, I did a lot of stuff there. There's no need to sort, actually; just read about the reinsertion process for new strings.
    You should really consider switching to C++.
    My implementation ended up being faster than the implementation pointed out there, because I managed to:

    1. merge the string buffer with the actual linked list (better cache efficiency)
    2. perform RLE compression of paths via skip links
    3. use a sliding window and delete the oldest phrase when inserting the newest (instead of a static buffer)

  11. #11
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Quote Originally Posted by Shelwien View Post
    That's surely better than nothing, but 210k on book1
    doesn't require much of a model. Here I have a coder with
    precise occurrence numbers and mixing with fixed ad hoc weights,
    and it still compresses book1 to 215.7k.
    http://ctxmodel.net/files/tree_v1.rar
    Good for you for having made a compression algorithm that does not need much of a model and reaches a compression of 215.7k, compared to my first try, which reached 205k (210000 bytes = 205k). Impressive. That will help me a lot in developing my algorithm.

    Quote Originally Posted by Shelwien View Post
    Well, call them substrings or prefixes or contexts or n-grams
    then. "Pattern" sounds like something more advanced.
    And if it's simple n-gram statistics, then most CMs collect
    them, including the thing I linked above.
    And paq8 is beyond that, as it has a text-specific word model
    and other tricks, in addition to prefix context models.
    I called them patterns because my knowledge of the English language is smaller than english.dic

    Quote Originally Posted by Shelwien View Post
    Well, I'd surely appreciate it if somebody developed a new
    compressor, not bitwise and/or not hash-based.
    And n-gram statistics are certainly necessary for any statistical compressor.
    But for compression better than 206k on unfiltered book1,
    plain weighting is not enough.
    Or maybe I just mistakenly decided that that's what you were aiming for, after seeing
    "hutter prize" mentioned and all.
    Sure I want to beat paq8. Who doesn't want that? The Hutter Prize? I'll quote myself on that one: "No, I'm not pursuing the Hutter Prize."

    Congratulations on your knowledge/skills show-off. I'll try to remember the meaningful parts of your postings and just forget the superior attitude.
    Last edited by Kaw; 23rd April 2009 at 15:57.

  12. #12
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Quote Originally Posted by toffer View Post
    Well, I did a lot of stuff there. There's no need to sort, actually; just read about the reinsertion process for new strings.
    You should really consider switching to C++.
    My implementation ended up being faster than the implementation pointed out there, because I managed to:

    1. merge the string buffer with the actual linked list (better cache efficiency)
    2. perform RLE compression of paths via skip links
    3. use a sliding window and delete the oldest phrase when inserting the newest (instead of a static buffer)
    Do you have a (compiled) version of the implementation to compare to my current implementation?

    I will switch to C++ when part 1 and part 2 are finished. C# helps me implement new ideas quickly, and I admit that C++ is a nice language, but working an entire day in C++ gives me headaches, so I use it only when absolutely necessary.

  13. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,012 Times in 537 Posts
    > Good for you having made a compression algorithm that does
    > not need much of a model

    Well, I kinda expected such a reaction.
    That's not a compression algorithm, just a parser.
    You could at least have looked around that site before posting this.
    Then you'd be able to find, for example, http://ctxmodel.net/files/ASH/ash04a.rar
    (202932 on book1 using the /s255 option, 203294 by default).

    > and reached a compression of
    > 215.7k compared to my first try that reached 205k of
    > compression (210000b = 205k).

    Well, obviously by 215.7k I mean 2157xx here.

    > Impressive. That will help me a lot developing my algorithm.

    I hoped it could, actually, taking into account that the tree
    implementation there is not very common.

    > I called them patterns because my knowledge of the english
    > language is less than english.dic

    It's not much of a problem, but because of it I was somewhat
    disappointed, since initially I thought you were trying
    to find all the useful sparse contexts or something.

    > Congratulations on your knowledge/skills show-off. I'll
    > try to remember the meaningful parts of your postings and
    > just forget the superior attitude.

    In fact, because of some other people known for their "attitude",
    I usually try hard to avoid that.
    Can you be more specific and quote exactly what you didn't like in my posts?

  14. #14
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    OK Shelwien, I was wrong about you. You have proven that with your last post. I'm sorry. I just felt 'nuked' by your replies.

    Since you seem to know a lot: please explain to me what sparse contexts are. Which types of 'patterns' are actually useful in your opinion, and why isn't knowing every matching substring from the past useful?
    Last edited by Kaw; 23rd April 2009 at 16:12.

  15. #15
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I just grabbed my backups and found an old version. It's incomplete and doesn't have the RLE path compression or a cache-efficient data structure. However, it implements the stuff from the paper I posted and modifies the context sorting to work on windows.
    Oddly, the more recent discs are broken - after just 3 years! What crap! When I'm at home I'll have a look at my backups on HDD, maybe I can find something there. You should have a look at the move() function within dict.ccp.
    Since this one is pretty old, I don't know if it's fully bug-free. The latest source implemented some kind of symbol ranking compressor which estimated R0 matches by merging statistics from different similar contexts.
    Attached Files

  16. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,012 Times in 537 Posts
    > Since you seem to know a lot: please explain to me what
    > sparse contexts are.

    Well, quoting paq8.cpp:
    Code:
    - Sparse.  Usually 1 or 2 of the last 8 bytes preceding the byte containing
      the predicted bit, e.g (2), (3),..., (8), (1,3), (1,4), (1,5), (1,6),
      (2,3), (2,4), (3,6), (4,8).  The ordinary order 1 and 2 context, (1)
      or (1,2) are included above.  Useful for binary data.
    Basically "sparse contexts" = "contexts with gaps".
    So it's what I'd call "patterns".
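
    To make that concrete, here's a rough sketch (mine, not paq8's actual code) of forming a sparse (2,4) context - the 2nd and 4th bytes before the current one - and hashing it into an index for a table of bit counters. The table size and hash constants are arbitrary choices:
    Code:
    #include <cstdint>
    #include <vector>

    uint32_t sparse_ctx_24(const std::vector<uint8_t>& buf, size_t pos, uint32_t table_bits = 22)
    {
        uint8_t a = pos >= 2 ? buf[pos - 2] : 0;   // 2nd byte before the current one
        uint8_t b = pos >= 4 ? buf[pos - 4] : 0;   // 4th byte before the current one
        uint32_t h = a * 2654435761u ^ b * 0x9E3779B9u;
        return h >> (32 - table_bits);             // index into a table of bit counters
    }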

    > Which types of 'patterns' are actually useful in your opinion

    Binary data is one thing, but for texts the fixed sparse contexts
    are not especially useful (though it never hurts to add an extra
    submodel when you have a good mixer).
    However, for texts it's possible to invent many specific alternative
    contexts, like the word model used in the paq series, or a word-length
    model (= a distance model for spaces), a line-length model (= a distance
    model for LFs), or a bracket model (for quotes and things like [],{},<>,(),
    which usually occur in pairs), etc.

    > and why knowing every matching substring from the past isn't useful?

    I didn't say it isn't useful. I said that such statistics are
    "certainly necessary". Just that to beat paq, you need to take
    into account some text-specific stuff. In fact, all bitwise
    coders, including paq, are not very good with text, and paq loses
    a lot with the word model disabled (or if you xor the input or something):
    192296 book1.paq8p-6
    200767 book1.xorFF.paq8p-6
    (it would become even worse without some other non-prefix submodels)
    And then, context clustering is really important.
    That is, being able to derive better estimations for a context with unstable
    statistics from several "similar" contexts.
    Also, simple occurrence counts are not good enough either,
    as even texts are not stationary enough, so the method of probability
    estimation for a given context history is important too... afair
    paq uses an o1 model there in some cases.

  17. #17
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    I replaced some code to create a simple pattern finder.
    M - match
    L - lowercase match
    G - Gap (of 1 byte)
    * - Any char

    Some results with P(true) above 0.5
    Code:
    MMG*MMLM		 #1		 p(1)=1
    GLMMLM*G		 #1		 p(1)=1
    MMMGL*ML		 #3		 p(1)=1
    GLMML*MM		 #1		 p(1)=1
    MM*GLMML		 #1		 p(1)=1
    GLMMLMMG		 #1		 p(1)=1
    MMLGLG*L		 #9		 p(1)=1
    MMLGLMML		 #1		 p(1)=1
    GLMMLM**		 #1		 p(1)=1
    GLMMLMM*		 #1		 p(1)=1
    MM*GL*ML		 #4		 p(1)=1
    MMLGL*GL		 #9		 p(1)=1
    GLMMMM*L		 #1		 p(1)=1
    MMMGMMLM		 #1		 p(1)=1
    MM*GMMLM		 #1		 p(1)=1
    MMG*LMGL		 #1		 p(1)=1
    MMLGMMLM		 #1		 p(1)=1
    MMG*LMML		 #1		 p(1)=1
    MGM*LMML		 #1		 p(1)=1
    MMMGLG*L		 #6		 p(1)=1
    MGGMLLMM		 #1		 p(1)=1
    MGLMGML*		 #1		 p(1)=1
    GLMMLMG*		 #1		 p(1)=1
    MM*GLG*L		 #9		 p(1)=1
    MMG*L*GL		 #9		 p(1)=1
    MG*MLMML		 #1		 p(1)=1
    MLGMLMML		 #1		 p(1)=1
    M*GMLMML		 #1		 p(1)=1
    MMG*LG*L		 #9		 p(1)=1
    MMLGL*ML		 #4		 p(1)=1
    MMGML**L		 #1		 p(1)=1
    GLMMLMGG		 #1		 p(1)=1
    GLM*LMM*		 #1		 p(1)=1
    MGMMLMML		 #1		 p(1)=1
    MG*ML*ML		 #2		 p(1)=1
    MG*MLM*L		 #2		 p(1)=1
    MM*GL*GL		 #9		 p(1)=1
    MMMGL*GL		 #6		 p(1)=1
    MGMGL*LM		 #1		 p(1)=1
    MM*GLM*L		 #2		 p(1)=1
    MLGMLM*L		 #2		 p(1)=1
    GLMGLGMM		 #1		 p(1)=1
    MGGML*LM		 #2		 p(1)=1
    MLMGLM*L		 #1		 p(1)=1
    MGM*L*ML		 #3		 p(1)=1
    M*MGLM*L		 #1		 p(1)=1
    MMG*L*ML		 #4		 p(1)=1
    MGMMLM*L		 #2		 p(1)=1
    M*GMLM*L		 #2		 p(1)=1
    MMLGLM*L		 #2		 p(1)=1
    MGMML*ML		 #2		 p(1)=1
    GLM*LMMG		 #1		 p(1)=1
    GL*MLMMM		 #1		 p(1)=1
    GML*MMLM		 #1		 p(1)=1
    MM*GLMGL		 #1		 p(1)=1
    GLM*LMMM		 #1		 p(1)=1
    MGLMGMLG		 #1		 p(1)=1
    MMLGLMGL		 #1		 p(1)=1
    M*GML*ML		 #2		 p(1)=1
    GLMMLM*M		 #1		 p(1)=1
    MGM*LM*L		 #4		 p(1)=1
    MLGML*ML		 #2		 p(1)=1
    MMG*LM*L		 #2		 p(1)=1
    GLMMLMMM		 #1		 p(1)=1
    MMMGL**L		 #15		 p(1)=0,933333333333333
    MGMML**L		 #10		 p(1)=0,8
    MLMGL*ML		 #5		 p(1)=0,8
    M*MGL*ML		 #5		 p(1)=0,8
    M*MGLG*L		 #8		 p(1)=0,75
    MLMGL*GL		 #8		 p(1)=0,75
    M*GML*GL		 #4		 p(1)=0,75
    MMGL*GLG		 #8		 p(1)=0,75
    LMMGL*GL		 #8		 p(1)=0,75
    *MMGL*GL		 #8		 p(1)=0,75
    MLGML*GL		 #4		 p(1)=0,75
    MMGLG*LG		 #8		 p(1)=0,75
    LMMGLG*L		 #8		 p(1)=0,75
    *MMGLG*L		 #8		 p(1)=0,75
    M*MGL*GL		 #8		 p(1)=0,75
    MLMGLG*L		 #8		 p(1)=0,75
    MG*ML*GL		 #4		 p(1)=0,75
    MMMGM*LM		 #4		 p(1)=0,75
    M*GMLG*L		 #4		 p(1)=0,75
    MMGL*GL*		 #8		 p(1)=0,75
    MG*MLG*L		 #4		 p(1)=0,75
    MLGMLG*L		 #4		 p(1)=0,75
    LGMMLM*L		 #4		 p(1)=0,75
    *GMMLM*L		 #4		 p(1)=0,75
    MMGLG*L*		 #8		 p(1)=0,75
    G*MMLM*L		 #4		 p(1)=0,75
    MM*GL**L		 #33		 p(1)=0,727272727272727
    MMG*L**L		 #33		 p(1)=0,727272727272727
    MMLGL**L		 #33		 p(1)=0,727272727272727
    MGM*L**L		 #37		 p(1)=0,702702702702703
    MMMGLGL*		 #115		 p(1)=0,669565217391304
    MMMGLGLG		 #115		 p(1)=0,669565217391304
    GLMML**M		 #3		 p(1)=0,666666666666667
    MMMG*LMM		 #9		 p(1)=0,666666666666667
    MGG*LMLM		 #3		 p(1)=0,666666666666667
    MMGMMMGL		 #3		 p(1)=0,666666666666667
    M*MGL**L		 #42		 p(1)=0,666666666666667
    LGMMLMML		 #3		 p(1)=0,666666666666667
    MLMGL**L		 #42		 p(1)=0,666666666666667
    MG*GLMLM		 #3		 p(1)=0,666666666666667
    M*GGLMLM		 #3		 p(1)=0,666666666666667
    GL*ML*MM		 #3		 p(1)=0,666666666666667
    *GMMLMML		 #3		 p(1)=0,666666666666667
    G*MMLMML		 #3		 p(1)=0,666666666666667
    GLMGLMM*		 #3		 p(1)=0,666666666666667
    MMMGML*M		 #3		 p(1)=0,666666666666667
    MMMGLMGM		 #6		 p(1)=0,666666666666667
    MMM*GLMM		 #9		 p(1)=0,666666666666667
    MMMLGLMM		 #9		 p(1)=0,666666666666667
    MGMLMGMM		 #3		 p(1)=0,666666666666667
    GLMGLMMG		 #3		 p(1)=0,666666666666667
    MLGGLMLM		 #3		 p(1)=0,666666666666667
    MG*ML**L		 #37		 p(1)=0,648648648648649
    M*GML**L		 #37		 p(1)=0,648648648648649
    MLGML**L		 #37		 p(1)=0,648648648648649
    MMMMGLMG		 #27		 p(1)=0,62962962962963
    MMMMGLM*		 #27		 p(1)=0,62962962962963
    MLLGLMGL		 #5		 p(1)=0,6
    MG**LMGL		 #5		 p(1)=0,6
    MMGL*MLG		 #5		 p(1)=0,6
    *MMGL*ML		 #5		 p(1)=0,6
    MMLMGMLM		 #5		 p(1)=0,6
    MMGGMMLM		 #5		 p(1)=0,6
    M*G*LMGL		 #5		 p(1)=0,6
    MMGLGLMM		 #5		 p(1)=0,6
    M**GLMGL		 #5		 p(1)=0,6
    MLG*LMGL		 #5		 p(1)=0,6
    M*LGLMGL		 #5		 p(1)=0,6
    MMGL*ML*		 #5		 p(1)=0,6
    ML*GLMGL		 #5		 p(1)=0,6
    MM*MGMLM		 #5		 p(1)=0,6
    LMMGL*ML		 #5		 p(1)=0,6
    MMMGM*ML		 #14		 p(1)=0,571428571428571
    MMMMG*ML		 #55		 p(1)=0,563636363636364
    MMMM*GML		 #55		 p(1)=0,563636363636364
    MMMMLGML		 #55		 p(1)=0,563636363636364
    MMMGGLGL		 #100		 p(1)=0,55
    MMMMMMGL		 #84		 p(1)=0,547619047619048
    MMMLMMGL		 #86		 p(1)=0,546511627906977
    MMM*MMGL		 #86		 p(1)=0,546511627906977
    MM*MGLMM		 #11		 p(1)=0,545454545454545
    MMLMGLMM		 #11		 p(1)=0,545454545454545
    MMMGLGLM		 #11		 p(1)=0,545454545454545
    MMGLGL*M		 #11		 p(1)=0,545454545454545
    MMMGMGMM		 #35		 p(1)=0,542857142857143
    MGMGL**L		 #15		 p(1)=0,533333333333333
    MMLMMMGL		 #91		 p(1)=0,527472527472527
    MM*MMMGL		 #91		 p(1)=0,527472527472527
    MMM*G*ML		 #117		 p(1)=0,512820512820513
    MMMG**ML		 #117		 p(1)=0,512820512820513
    MMMLLGML		 #117		 p(1)=0,512820512820513
    MMM**GML		 #117		 p(1)=0,512820512820513
    MMML*GML		 #117		 p(1)=0,512820512820513
    MMMLG*ML		 #117		 p(1)=0,512820512820513
    MMM*LGML		 #117		 p(1)=0,512820512820513
    MMMMMMMM		 #823		 p(1)=0,511543134872418
    MMMLMMMM		 #859		 p(1)=0,511059371362049
    MMM*MMMM		 #859		 p(1)=0,511059371362049
    M*MMMMGL		 #92		 p(1)=0,510869565217391
    MLMMMMGL		 #92		 p(1)=0,510869565217391
    MMML*MMM		 #969		 p(1)=0,510835913312694
    MMM**MMM		 #969		 p(1)=0,510835913312694
    MMM*LMMM		 #969		 p(1)=0,510835913312694
    MMMLLMMM		 #969		 p(1)=0,510835913312694
    MMMMLMMM		 #869		 p(1)=0,508630609896433
    MMMM*MMM		 #869		 p(1)=0,508630609896433
    MMMML*MM		 #1122		 p(1)=0,505347593582888
    MMMM**MM		 #1122		 p(1)=0,505347593582888
    MMMM*LMM		 #1122		 p(1)=0,505347593582888
    MMMMLLMM		 #1122		 p(1)=0,505347593582888

  18. #18
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,012 Times in 537 Posts
    > I replaced some code to create a simple pattern finder.
    > M - match
    > L - lowercase match
    > G - Gap (of 1 byte)
    > * - Any char
    > Some results with P(true) above 0.5
    > MMG*MMLM #1 p(1)=1
    > GLMMLM*G #1 p(1)=1
    > MMMGL*ML #3 p(1)=1

    Well, you certainly imply a lot here...
    I suppose, "any char" here is predicted by the pattern?
    And the sample file is book1? (doesn't look like enwik).

    But anyway, I guess it'd still be cool if you could implement
    a statistical model based on this analysis. Like, take the 100
    best patterns (btw, a pattern is only good if it occurs frequently;
    a single 100% match can be safely ignored) and mix their
    predictions. It's a bit tricky (and damned slow) to
    compute the estimations for patterns with non-empty suffixes,
    but you mentioned Bayesian inference before, so it should be ok.
    Btw, there's already an existing compressor based on this approach:
    http://encode.su/forum/showthread.php?t=296
    In one sense it's more advanced, as it allows arbitrary bit patterns
    compared to your patterns consisting of only 00,FF,DF byte masks.
    And in another sense it's not, as there are no "right contexts" there.

    Also, if it's about text, I'd add spaces to the patterns (LF as
    space for plaintext and as a separate class for enwik), and
    punctuation marks too.
    And then, an order-7 context might be ok for book1, but it is
    certainly not enough for enwik.

  19. #19
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Let me explain some more about the experiment described above. It was only made for myself, to test whether patterns other than pure 'M' would be useful for predictions. I think the outcome is surprising to me and well known to you. It's indeed a test on book1, and I believe it's more like order-8 because I test the match on byte 9, and that match decides success or failure.

    The link you provided has 1200 32-bit bitmasks. I have 64k 64-bit bytemasks. 64k contexts (is that the correct term in this case?) is too much for a useful compressor, but for experimenting I believe it is cool and somewhat useful. You could take the 100 most successful patterns and add them to your context mixer, and that would theoretically increase compression.

    By the way: * means no match checking, just go to the next byte.
    'G' means skip 1 byte in the selected bytes that you check against the 8 bytes preceding the byte you want to predict. 'G' wasn't correctly implemented in the first version; the current results are slightly better.

    I'm thinking of 4 more pattern classes. That would increase the number of contexts for 8 bytes to 16.7M and make the software uselessly slow, but it's just for this experiment. I'll give an update tomorrow.

  20. #20
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,012 Times in 537 Posts
    > I believe it's more like order-8 because I test
    > the match on byte 9, and that's the match which
    > decides success or failure.

    Ah, I misunderstood then.
    Still, what do you think about Bayesian inference
    for the symbol in the middle of the pattern?
    It would surely be a "new approach" (it was never
    used in compressors, due to slowness).
    To be specific, I mean that statistics P(ABCD|AB*D)
    can be collected and then used for probability
    estimation at the point where only AB is known (in sequential coding).
    For that, we have to estimate the probabilities of all
    xD pairs of next symbols (using other contexts etc.), and
    then we'd have (after integrating over the xD probability intervals)
    another, alternative estimation for the next symbol being C,
    which can be mixed with the other predictions.
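
    A rough sketch of that idea (my reading of it, not an existing coder, and damned slow as said): n4 holds 4-gram counts, q2[d] is some other model's estimate that the symbol after the next one will be d, and the alternative next-symbol estimate for C is obtained by marginalizing P(C|AB*d) over d.
    Code:
    #include <cstdint>
    #include <unordered_map>

    std::unordered_map<uint32_t, int> n4;   // counts of 4-byte contexts, filled elsewhere
    inline uint32_t key4(unsigned a, unsigned b, unsigned c, unsigned d)
    {
        return (a << 24) | (b << 16) | (c << 8) | d;
    }

    // alternative estimate of P(next = C), given known A,B and an estimate q2[]
    // of the symbol two positions ahead
    double estimate_middle(int A, int B, int C, const double q2[256])
    {
        double p = 0;
        for (int d = 0; d < 256; ++d) {
            long total = 0;
            for (int c = 0; c < 256; ++c) {
                auto it = n4.find(key4(A, B, c, d));
                if (it != n4.end()) total += it->second;
            }
            auto it = n4.find(key4(A, B, C, d));
            if (total > 0 && it != n4.end())
                p += q2[d] * it->second / (double)total;   // q2[d] * P(C | A,B,*,d)
        }
        return p;
    }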

    > The link you provided me has 1200 32bit bitmasks.

    Well, there's a different approach too (used by toffer
    in http://encode.su/forum/showthread.php?t=159
    for example) - we can just find a good set of context
    masks for a specific compressor architecture using a GA
    or other similar techniques.

    Anyway, I tried to point out here (by showing paq8k2
    results on common data) that fixed masks are not
    enough, though they can have their share too.

    For example, paq uses a whole previous word
    (up to the nearest space and lowercased) as a context,
    which can't be simulated with fixed masks.

    Btw, I'd suggest looking at zpaq:
    http://encode.su/forum/showthread.php?t=281
    Its context description syntax might give you some insight,
    and also you might be able to use it for practical tests.

    > By the way: * means no match checking, just go to the next byte.
    > 'G' means skip 1 byte from the selected bytes you check
    > against the 8 preceding bytes before the byte you want to
    > predict. 'G' wasn't correctly implemented in the first
    > version. The current results are slightly better.

    I don't understand how 'G' is different from '*' then.

    > I'm thinking of 4 more patterns. That would increase the
    > amount of contexts for 8 bytes to 16.7M and make the
    > software uselessly slow, but just for this experiment. I'll
    > give an update tomorrow.

    What about building a list of all existing contexts
    (there aren't that many in book1) and then clustering
    them by applying some masks?
    Last edited by Shelwien; 26th April 2009 at 21:46.

  21. #21
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Quote Originally Posted by Shelwien View Post
    I don't understand how 'G' is different from '*' then.
    * is a byte skip on both sides and 'G' is a skip on one side.

    No time for replying properly. More tomorrow.

  22. #22
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Currently the model is able to find patterns with the following characteristics:
    * - 1 byte skip (in needle and haystack)
    G - 1 byte skip in history data (haystack)
    g - 1 byte skip in current pattern (needle)
    M - 1 byte match
    L - 1 byte case-insensitive match
    T - 1 byte type match (is it a consonant, vowel or other character?)
    W - multibyte word match
    S - multibyte word skip (in needle and haystack)

    It's slowwwwww... I'm going to call the software 'Gary the Snail'. It's doing about 1 KB a day on the first 1 KB of book1.
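
    To give an idea of what these classes mean, here is a simplified sketch of checking one pattern mask against the history (not the actual implementation; the multi-byte W/S classes are left out for brevity). 'needle' is the bytes just before the position to predict and 'hay' is a candidate earlier occurrence; both must hold at least strlen(pattern) bytes.
    Code:
    #include <cctype>
    #include <cstring>

    static bool is_vowel(unsigned char c) { return std::strchr("aeiouAEIOU", c) != nullptr; }
    static int  char_type(unsigned char c) { return std::isalpha(c) ? (is_vowel(c) ? 0 : 1) : 2; }

    // does pattern (e.g. "MMG*LT") match?  i and j advance independently so that
    // 'G' can skip a byte in the haystack only and 'g' in the needle only
    bool match(const char* pattern, const unsigned char* needle, const unsigned char* hay)
    {
        size_t i = 0, j = 0;
        for (const char* p = pattern; *p; ++p) {
            switch (*p) {
            case 'M': if (needle[i] != hay[j]) return false;                             ++i; ++j; break;
            case 'L': if (std::tolower(needle[i]) != std::tolower(hay[j])) return false; ++i; ++j; break;
            case 'T': if (char_type(needle[i]) != char_type(hay[j])) return false;       ++i; ++j; break;
            case '*': ++i; ++j; break;   // skip a byte on both sides
            case 'G': ++j;       break;  // skip a byte in the history data (haystack) only
            case 'g': ++i;       break;  // skip a byte in the current pattern (needle) only
            }
        }
        return true;
    }
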
    Last edited by Kaw; 27th April 2009 at 19:27.

  23. #23
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    If there is a big "for loop" (I'm sure there is), you can accelerate it by using threads in C#. AFAIR, there is an easy way to turn a "for loop" into a multithreaded loop (similar to OpenMP). That way you can speed up the search with almost zero effort.

    Edit:
    I've found something. If you want to use "brand-new" techniques, you can use the Task Parallel Library Extensions for .NET from Microsoft. If you want to use a home-made trick, have a look here: http://www.codeproject.com/KB/threads/csThreading.aspx
    Last edited by osmanturan; 27th April 2009 at 19:42.
    BIT Archiver homepage: www.osmanturan.com

  24. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,012 Times in 537 Posts
    It seems like there could be different but equivalent patterns.
    Also, I still don't understand how 'G' is different from '*' -
    you skip a byte of the pattern (the G) anyway, right?
    Last edited by Shelwien; 27th April 2009 at 19:36.

  25. #25
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Quote Originally Posted by Shelwien View Post
    It seems like there could be different but equivalent patterns.
    Also, I still don't understand how 'G' is different from '*' -
    you skip a byte of the pattern (the G) anyway, right?
    * is a skip on both sides and 'G' is a skip on one side only.

  26. #26
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    Suppose I have a tree path with nodes for the current situation (each node representing an order n), and every node contains counts of bit 0 and bit 1. How do I mix these into a good prediction?

  27. #27
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,012 Times in 537 Posts
    > every node contains counts of bit0 and bit1

    In that case it's actually unlikely that you'll get a good prediction
    even with a very good mixing function, because of the usual
    data nonstationarity.
    At least, you have to use counters like
    f = f*(1-wr) + flag
    Also, counter parameters (wr here) commonly have to be different
    for different orders etc.
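
    For example, a rough sketch of such a counter (parameter values are arbitrary): f0 and f1 are exponentially decayed counts of zero and one bits, updated with the formula above, and the bit probability is taken from their ratio.
    Code:
    struct BitCounter {
        double f0 = 1, f1 = 1;              // decayed counts, with a small prior
        void update(int bit, double wr)     // bit is 0 or 1, 0 < wr < 1
        {
            f0 = f0 * (1.0 - wr) + (bit == 0);
            f1 = f1 * (1.0 - wr) + (bit == 1);
        }
        double p1() const { return f1 / (f0 + f1); }   // estimate of P(bit = 1)
    };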

    > How do I mix this for a good prediction?

    There's no single answer to this.
    Basically it's very similar to asking how to design a good model.
    That is, the most general approach here is to treat the inputs
    as a context for a secondary model.

    And as for practical mixing functions, the NN-based approach
    seems to be the most widely used (linear mixing in logistic space,
    introduced by Matt Mahoney).
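
    A rough sketch of that kind of mixer (names and the learning rate are arbitrary): submodel probabilities are stretched into logit space, combined with adaptive weights, squashed back, and the weights are nudged toward the observed bit.
    Code:
    #include <cmath>
    #include <vector>

    struct LogisticMixer {
        std::vector<double> w;                        // one weight per submodel
        std::vector<double> st;                       // stretched inputs of the last mix
        explicit LogisticMixer(size_t n) : w(n, 0.0), st(n, 0.0) {}

        double mix(const std::vector<double>& p)      // p[i] = submodel i's P(bit=1), in (0,1)
        {
            double dot = 0;
            for (size_t i = 0; i < p.size(); ++i) {
                st[i] = std::log(p[i] / (1.0 - p[i]));  // stretch
                dot += w[i] * st[i];
            }
            return 1.0 / (1.0 + std::exp(-dot));        // squash
        }

        void update(double p_mixed, int bit, double lr = 0.01)
        {
            double err = bit - p_mixed;               // gradient of the coding cost
            for (size_t i = 0; i < w.size(); ++i) w[i] += lr * err * st[i];
        }
    };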

    Then there are a few other methods, mentioned at
    http://ctxmodel.net/rem.pl?-12 for example.

    And then the method with the best speed/quality balance seems to
    be adaptive binary mapping (SSE2) of quantized state counters (FSM).
    Well, pairs of primary bit probability estimations are used as
    context for a simple secondary model, as I said.
    This scheme is used in toffer's coders (m1) and, probably, ccm.
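
    Again just a rough sketch of the SSE idea (the table size, the update rate and the single-context simplification are arbitrary choices): the primary probability is quantized into an index of a table of adaptive probabilities, the table entry is the refined estimate, and it is nudged toward each observed bit.
    Code:
    struct SSE {
        static const int QUANT = 64;
        double table[QUANT];                          // could also be indexed by [ctx][QUANT]
        SSE() { for (int i = 0; i < QUANT; ++i) table[i] = (i + 0.5) / QUANT; }

        int    index(double p) const { int i = (int)(p * QUANT); return i < QUANT ? i : QUANT - 1; }
        double refine(double p) const { return table[index(p)]; }
        void   update(double p, int bit, double rate = 0.02)
        {
            int i = index(p);
            table[i] += rate * (bit - table[i]);      // move toward the observed bit
        }
    };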

    Also, there are many other ways, including CTW and PPM escape
    methods, but they seem to have worse performance than the ones listed
    above (there's no proof of their inferiority, though).

  28. #28
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    49
    Thanks
    0
    Thanked 3 Times in 2 Posts
    I'm clearly having big problems making myself understood, so I added my latest version of the pattern finder as an attachment. I believe that my bitCounters are far from stationary, but take a look for yourself.
    Attached Files

  29. #29
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,348
    Thanks
    212
    Thanked 1,012 Times in 537 Posts
    As far as I can see, you just count context occurrences there,
    like I previously supposed, and that _is_ stationary.
    Stationary doesn't mean static; see http://en.wikipedia.org/wiki/Stationary_process
    In short, with a stationary model and data, the compression ratio
    variance would converge to zero.
    But in practice, examples of such data are really hard to find.
