Thread: Help with compression

  1. #1
    Big Muscle (Member)

    Help with compression

    Hello,
    I am new here and this seems to be a good forum about compression, so I'm asking my question here.

    Currently I'm preparing my diploma thesis, and one part of it is to develop my own (database-like) file format for storing data. The bytes (or rather, the "table columns") in this file are compressed, and I need to achieve very good compression with good speed. Let's consider the following test case:

    I have a binary test file of 592 MB uncompressed. With ZLIB it compresses to 190 MB in 30 seconds; the speed is good, but the compression isn't. BZip2 (or some other BWT implementation) and LZMA are too slow.

    Then I found the source code for GRZipII; using ST4 + MTF + arithmetic coding (according to the source, it does RLE too), the file was compressed to 100 MB in 1.5 minutes. That is very good, but there is no documentation and not a single comment in the GRZip source code, so I can't use this code.

    So I tried to find an implementation of each method. Now I use ST2 + MTF + RLE + a range coder and get 120 MB in 1 minute. I'm trying to find out whether some other method could compress it further. I found the OpenBWT library, which contains a test for RLE-0 + a "Simple Structured Coding Model" and reports a very good file size, but it doesn't output any compressed data, and there is no decoder.

    I also can't find any other implementation of RLE-0 or of the "Simple Structured Coding Model". The ST2 and range coder I use can be found here: http://www.compressconsult.com/. Using any other ST order or arithmetic coder doesn't help (both time and file size get bigger).
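
    For reference, here is a minimal sketch of the MTF + RLE-0 front end being discussed (my own code, not taken from OpenBWT or compressconsult; the RLE-0 here is the common Wheeler-style "1-2" zero-run coding, and the names RUNA/RUNB and the buffer handling are my own):

    #include <stddef.h>
    #include <stdint.h>

    enum { RUNA = 0, RUNB = 1 };   /* zero-run digits; nonzero MTF ranks are shifted up by 1 */

    /* Encode n input bytes into out[]; n output slots are always enough.
       Returns the number of output symbols (alphabet 0..256, hence uint16_t). */
    static size_t mtf_rle0_encode(const uint8_t *in, size_t n, uint16_t *out)
    {
        uint8_t table[256];
        size_t  o = 0, run = 0;

        for (int i = 0; i < 256; i++)
            table[i] = (uint8_t)i;              /* initial MTF alphabet */

        for (size_t i = 0; i <= n; i++) {
            unsigned rank = 0;

            if (i < n) {
                uint8_t c = in[i];
                while (table[rank] != c)        /* current rank of c */
                    rank++;
                for (unsigned j = rank; j > 0; j--)
                    table[j] = table[j - 1];    /* move c to the front */
                table[0] = c;

                if (rank == 0) {                /* extend the current zero run */
                    run++;
                    continue;
                }
            }

            /* flush the pending zero run: write (run+1) in binary, LSB first,
               dropping the top bit; digit 0 -> RUNA, digit 1 -> RUNB */
            for (size_t r = run + 1; r > 1; r >>= 1)
                out[o++] = (r & 1) ? RUNB : RUNA;
            run = 0;

            if (i < n)
                out[o++] = (uint16_t)(rank + 1);  /* nonzero rank, shifted past RUNA/RUNB */
        }
        return o;
    }

    The decoder is symmetric: rebuild the zero runs from the RUNA/RUNB digits, shift the other symbols back down, and undo the MTF. The resulting symbol stream is what would then go into the range coder.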

    Could you help me, please? Thank you!

  2. #2
    Gribok (Programmer, USA)
    Quote Originally Posted by Big Muscle View Post
    Hello,
    Then I found the source code for GRZipII; using ST4 + MTF + arithmetic coding (according to the source, it does RLE too), the file was compressed to 100 MB in 1.5 minutes. That is very good, but there is no documentation and not a single comment in the GRZip source code, so I can't use this code.

    Could you help me, please? Thank you!
    You can write your questions here. I will try to answer them.
    Enjoy coding, enjoy life!

  3. #3
    Big Muscle (Member)
    The main thing I am interested in is the RLE-0 and the Simple Structured Coding Model used in OpenBWT. Does a good C/C++ implementation of them exist? Google didn't find anything, nor any description of the algorithm.

    My other question is for someone who knows GRZipII: could they describe the algorithm used there (mainly the MTF + RLE + EC part)?

  4. #4
    Gribok (Programmer, USA)
    Quote Originally Posted by Big Muscle View Post
    My other question is for someone who knows GRZipII: could they describe the algorithm used there (mainly the MTF + RLE + EC part)?
    I am the author of GRZipII.
    Enjoy coding, enjoy life!

  5. #5
    Big Muscle (Member)
    Fine. Could you tell me what kind of algorithm is used in the file MTF_Ari.c? Is it plain MTF + RLE + arithmetic coding, or some special modification of these algorithms?

  6. #6
    Gribok (Programmer, USA)
    Quote Originally Posted by Big Muscle View Post
    Fine. Could you tell me what kind of algorithm is used in the file MTF_Ari.c? Is it plain MTF + RLE + arithmetic coding, or some special modification of these algorithms?
    In general, the BWT output is divided into (Char, RunSize) pairs. Next, for each pair, Char is converted to a rank; the ranking is MTF or WFC depending on the selected algorithm. The rank is then encoded by the following algorithm:

    if (Rank < 3)
    {
        Level 0: order-4 CM encodes the rank
    }
    else
    {
        Level 0: order-4 CM encodes ESCAPE
        Level 1: order-1 CM encodes the rank with a Fenwick Structure Model
    }

    Next, RunSize is coded with a Fenwick Structure Model; for RunSize encoding, the level-1 context together with the actual character is used.
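
    For illustration, a rough sketch of that two-level structure in C (a reconstruction from the description above, not the actual GRZipII code; encode_level0, encode_level1 and encode_runsize are hypothetical placeholders standing in for the order-4 CM, the order-1 CM with the Fenwick Structure Model, and the RunSize model):

    #include <stdint.h>
    #include <stdio.h>

    #define ESCAPE 3   /* level-0 alphabet: ranks 0..2 plus an escape symbol */

    /* Placeholder coders: they just print what they would encode. */
    static void encode_level0(int sym, uint32_t ctx)       { printf("L0  ctx=%u sym=%d\n", (unsigned)ctx, sym); }
    static void encode_level1(int rank, uint32_t ctx)      { printf("L1  ctx=%u rank=%d\n", (unsigned)ctx, rank); }
    static void encode_runsize(uint32_t run, uint32_t ctx) { printf("RUN ctx=%u len=%u\n", (unsigned)ctx, (unsigned)run); }

    /* Encode one (Char, RunSize) pair whose Char was already turned into a rank by MTF or WFC. */
    static void encode_pair(int rank, uint32_t runsize,
                            uint32_t order4_ctx, uint32_t order1_ctx, uint8_t ch)
    {
        if (rank < 3) {
            encode_level0(rank, order4_ctx);       /* frequent ranks: level 0 only        */
        } else {
            encode_level0(ESCAPE, order4_ctx);     /* rare ranks: escape at level 0 ...   */
            encode_level1(rank, order1_ctx);       /* ... then code the rank at level 1   */
        }
        /* RunSize is coded from the level-1 context together with the actual char */
        encode_runsize(runsize, (order1_ctx << 8) | ch);
    }

    int main(void)
    {
        encode_pair(0, 5, 0, 0, 'a');    /* a frequent rank           */
        encode_pair(17, 1, 0, 0, 'b');   /* a rare rank, via ESCAPE   */
        return 0;
    }
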
    Enjoy coding, enjoy life!

  7. #7
    encode (The Founder, Moscow, Russia)
    Quite shamanic... All you need is love... Oops... All you need is proper CM, as I prove with my BCM, and will prove even more with BCM v0.08. No MTF, DC or that shamanic RLE - just pure CM at its best!


  8. #8
    Big Muscle (Member)
    encode: is the source code available, so I could try it with my files?

  9. #9
    encode (The Founder, Moscow, Russia)
    Quote Originally Posted by Big Muscle View Post
    encode: is the source code available, so I could try it with my files?
    The source is not available...

  10. #10
    Gribok (Programmer, USA)
    Quote Originally Posted by encode View Post
    Quite shamanic... All you need is love... Oops... All you need is proper CM, as I prove with my BCM, and will prove even more with BCM v0.08. No MTF, DC or that shamanic RLE - just pure CM at its best!

    According to my experiments with CM, and especially with PAQ-like coders, the encoder speed is slow. That is not my goal. My goal is a practical compressor.
    Enjoy coding, enjoy life!

  11. #11
    Member (The Netherlands)
    What's the definition of shamanic? A type of neural network with a bunch of reasonably well-working models as input sounds pretty shamanic to my ears. I would say that a more scientific approach would calculate probabilities based on statistics and scientifically accepted rules and algorithms. Yet the shamanic fuzzy-network approach easily beats the straightforward one, or am I missing information? Which Bayesian network or hidden Markov chain can beat PAQ?

  12. #12
    Shelwien (Administrator, Kharkov, Ukraine)
    > What's the definition of shamanic?

    I'd say, acquired by trial-and-error without understanding
    the underlying correlations.

    > A type of neural network with a bunch of reasonably
    > well-working models as input sounds pretty shamanic
    > to my ears.

    Not necessarily, as a specific behavior can be mathematically proven
    for a specific model (asymptotic convergence or something).
    Also, you won't be able to get any good results simply by
    applying an existing NN framework to compression.

    > I would say that a more scientific approach would
    > calculate probabilities based on statistics and
    > scientifically accepted rules and algorithms.

    As to rules, yes, but there are many cases where it's much easier to replace
    a complex deterministic source model with an approximate probability distribution -
    compression might become somewhat worse, but the model would be
    much easier to handle, and also much more universal.
    An example of this is executables - for many of them we know
    the specific generation algorithm (the gcc source is available),
    but does it really make sense to write a perfect decompiler specifically
    for gcc-compiled executables just to achieve somewhat better
    compression of said executables?

    And for most of the data we don't really know how the source
    works, but we do know that it's not any simpler than gcc.

    As to statistics - it's what all CM compressors are based on,
    even if there are no plain symbol occurrence counters or unhashed contexts.
    In fact, it could be good (for compression ratio) to be able
    to use occurrence counters and context histories everywhere,
    but this "everywhere" is too much for that.
    So in practice it seems to be better to store a heavily quantized
    state of occurrence counters + run length + recent history in
    a single byte and have a rough probability estimation for 20x more
    contexts than to keep precise stats for whatever fits in memory.
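
    As a toy illustration of that kind of packing (my own example, not code from any particular coder), quantized counts, run length and recent history can share one byte:

    #include <stdint.h>

    /* Pack a heavily quantized model state into one byte:
       3 bits of quantized counter state, 2 bits of quantized run length,
       3 bits of recent bit history.  The bit widths are arbitrary here. */
    static uint8_t pack_state(unsigned counts3, unsigned run2, unsigned hist3)
    {
        return (uint8_t)(((counts3 & 7u) << 5) | ((run2 & 3u) << 3) | (hist3 & 7u));
    }

    static void unpack_state(uint8_t s, unsigned *counts3, unsigned *run2, unsigned *hist3)
    {
        *counts3 = (s >> 5) & 7u;
        *run2    = (s >> 3) & 3u;
        *hist3   =  s       & 7u;
    }

    A table of such 1-byte states (plus a small table mapping each state to a probability estimate) covers many more contexts in the same memory than full counters would.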

    > Yet the shamanic fuzzy-network approach easily beats the
    > straightforward one, or am I missing information?

    Actually there are reasons, and possible proofs of convergence,
    behind most of the techniques used in paq.
    You just don't take into account that straight event counters
    are not everything possible, as any model automatically produces
    lots of secondary events.

    > Which Bayesian network or hidden Markov chain can beat PAQ?

    Well, all the paq elements are pretty imprecise, so I wouldn't say
    that it's impossible. For example, paq's performance on redundant
    data can be somewhat improved just by replacing the range coder.

    Also I think that durilca still holds first place in LTCB
    (http://cs.fit.edu/~mmahoney/compression/text.html), and
    for compression it uses ppmonstr, which is a descendant of
    Markov chains.

    It's just that any pure initial model eventually turns into
    something weird if you build a compressor from it and have
    enough patience to improve it for a few years.
    Paq is a good example of that: even if it was based on
    NN theory originally, I don't think that anything in it
    now matches any element actually used in NNs. In fact, only the
    mixer is similar, but its update function has been changed
    considerably since the NN version.

  13. #13
    encode (The Founder, Moscow, Russia)
    Quote Originally Posted by Gribok View Post
    According to my experiments with CM, and especially with PAQ-like coders, the encoder speed is slow. That is not my goal. My goal is a practical compressor.
    BCM has a PAQ-like encoder and it's quite practical... If you're trying to modify a PAQ source... you must have your own CM that is built specially for the BWT output!

  14. #14
    Matt Mahoney (Expert, Melbourne, Florida, USA)
    Quote Originally Posted by Shelwien View Post
    Also I think that durilca still holds first place in LTCB
    (http://cs.fit.edu/~mmahoney/compression/text.html), and
    for compression it uses ppmonstr, which is a descendant of
    Markov chains.
    However durilca uses twice as much memory as the other compressors in the benchmark, so the question is still open. If you look at the Hutter prize and maximumcompression.com where memory constraints are not so severe, CM wins over PPM.

  15. #15
    Shelwien (Administrator, Kharkov, Ukraine)
    > However durilca uses twice as much memory as the other
    > compressors in the benchmark, so the question is still open.

    It's also _much_ simpler than paq8, both in the number of submodels
    and in the preprocessor implementation (it only uses external
    preprocessing, not data-specific models).

    > If you look at the Hutter prize and maximumcompression.com
    > where memory constraints are not so severe, CM wins over PPM.

    As to CM over PPM in general, it's obvious, and my choice is CM too.
    But as to some new PPM vs paq8, I think there are chances.

    > However, its top position in the LTCB is probably because
    > it uses twice as much memory.

    That's mainly because of its use of a suffix tree for statistics.
    And it's known that there are at least two unnecessary 32-bit
    pointers per context in its suffix tree - the suffix pointer
    and the symbol-table pointer. By removing these it might become
    somewhat slower (or not), but much lower memory requirements
    would be certain.
    Also, durilca/x64 might have 64-bit field alignments,
    like ppmd Jr1.
    And then it should be possible to switch it to some
    kind of hash, in which case its memory requirements
    would become even smaller (at the expense of speed).
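
    For illustration, a hypothetical per-context node of the kind being discussed
    (not the real ppmd/durilca layout) - the two 32-bit pointers are the ones
    that could be removed or replaced by hashing:

    #include <stdint.h>

    struct ppm_context {
        uint16_t num_symbols;    /* distinct symbols seen after this context           */
        uint16_t total_freq;     /* sum of their frequencies                           */
        uint32_t symbol_table;   /* pointer/offset to the symbol entries (removable)   */
        uint32_t suffix;         /* pointer/offset to the shorter context (removable)  */
    };
    /* On x64 builds, real pointers and their natural alignment would widen
       these fields further - the "64-bit field alignments" mentioned above. */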

    > The model itself is more primitive because it uses PPM
    > rather than context mixing,

    Well, it's true... to some extent.

    > which restricts the semantic model to a 1-dimensional space.

    And this certainly is not.
    Ppmonstr has its "inheritance", which is an equivalent of
    static mixing, and then there's unary SSE, and even some
    alternate models attached via SSE contexts (sparse etc).
