
Thread: CM model

  1. #1 - toffer (Programmer)
    My current candidate for cmm3 has up to 8 bytes left per nibble model for additional stuff in orders 3 and 6. I'm already storing LZP pointers, a secondary hash for fast exclusion of false matches, and LZP length models (lengths from MIN_LEN...MIN_LEN+14). Any ideas for additional things to use for modelling?

    Unfortunately, multiple LZP pointers don't pay off... (compression is slightly better, but performance drops)

    Greets
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  2. #2 - Nania Francesco (Tester)
    CMM is good!
    CMM2 is very good!
    CMM3 will be fantastic?

  3. #3 - toffer (Programmer)
    Thanks - I hope I can make something useful.

  4. #4 - Moderator
    Can't wait for the release of the v3 powerhouse!!!

  5. #5 - Nania Francesco (Tester)
    In my opinion, you only have to increase the compression speed! For the rest, it is indeed a high-level compressor for every type of file!

  6. #6 - toffer (Programmer)
    Thanks for the praise, but...

    No ideas?

    I don't want to leave the extra space unused.

  7. #7 - Shelwien (Administrator)
    As for me, I didn't understand what you were talking about.

  8. #8 - toffer (Programmer)
    I'm modelling bytes by splitting them into nibbles; each nibble is looked up in a hash table depending on the context - pretty much the same as paq, with only minor differences in hashing and collision handling.

    Under each order-6 and order-3 context I've got 8 bytes of unused memory per such nibble model - maybe it is clearer now?
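
    For illustration, a minimal sketch of this kind of nibble-wise hashed lookup; the slot layout, hash function and reset-on-collision policy here are assumptions made up for the example, not cmm's actual code:

    Code:
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // One hashed slot: 15 bit models covering a nibble (a binary tree over 4 bits),
    // a check byte for collision detection, and 8 spare bytes for extra data.
    struct NibbleSlot {
        uint8_t check;      // secondary hash to reject false matches
        uint8_t p[15];      // bit probabilities, one per node of the nibble tree
        uint8_t spare[8];   // the unused space discussed in this thread
    };

    struct NibbleModel {
        std::vector<NibbleSlot> table;
        unsigned bits;
        explicit NibbleModel(unsigned b) : table(size_t(1) << b), bits(b) {}

        NibbleSlot& lookup(uint64_t ctx) {
            uint64_t h = ctx * 0x9E3779B97F4A7C15ull;   // multiplicative hash of the context
            NibbleSlot& s = table[h >> (64 - bits)];
            uint8_t chk = uint8_t(h);                   // low hash bits as a check value
            if (s.check != chk) {                       // collision: recycle the slot
                std::memset(&s, 0, sizeof(s));
                std::memset(s.p, 128, sizeof(s.p));     // flat initial probabilities
                s.check = chk;
            }
            return s;
        }
    };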

  9. #9 - Shelwien (Administrator)
    > Under each order 6 and 3 context i've got 8 bytes of unused memory per such a
    > nibble model - maybe it is clearer now?

    Yes, this is much better.

    1. I don't understand why people are so obsessed with using hashes,
    especially if that's the same people who care about speed so much.
    The only real hash advantage is simplicity - never the speed, even
    if it's a small hash fitting into L1.
    So it'd be much better imho if you'd forget about hashes and make
    some trees, though they're more complex to code.

    Well, entropy-based hashes are better than trees, but I doubt that
    you're using those.

    2. The first thing that comes to mind is that you can store your context
    sequence, i.e. the last symbols encountered in this context -
    they're very useful for SSE, and you can also do a delayed update
    with them, which might improve both speed and compression.
    Also you can store any context-related parameters there - like
    mixer weights for mixing with order-(-1) or something else, or just
    some useful statistics like the probability of the best prediction
    coming from this context.
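
    For concreteness, one hypothetical way to lay out such an 8-byte slot; the field choice is only an illustration of the ideas above, not a recommendation:

    Code:
    #include <cstdint>

    // Hypothetical use of the 8 spare bytes attached to each context slot.
    struct SpareBytes {
        uint8_t history[4];   // last 4 symbols seen in this context: SSE context / delayed-update queue
        uint8_t hits;         // how often the best prediction came from this context
        uint8_t total;        // occurrence count, so hits/total estimates the context's reliability
        int16_t mixWeight;    // fixed-point mixer weight, e.g. for mixing with order-(-1)
    };
    static_assert(sizeof(SpareBytes) == 8, "must fit into the spare space");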

  10. #10 - toffer (Programmer)
    That's true, hashes suffer from the small cache (and at the moment I'm too lazy to code tricky trees - actually I never thought about using trees). One of my concepts doesn't work with trees: a hash seeding function based on file type (it allows you to optimize hash/coding-related stuff and to do specialized modelling).

    But I don't want maximum speed - I want to achieve a good tradeoff.

    Your ideas are a bit vague; I had similar ones concerning weights and SSE. Looks like I'll have to try some of these.

    Entropy hashes? Give me some more info, please.

    Delayed update? Does this really help? I can think of detecting an infrequently seen symbol (via an LRU queue) and excluding it from updates. Again, could you give me any sources?

    Order-(-1)? I don't use a special context for anything. Could this be another point for improvement?

    Sorry if I'm being a bit diffuse - I'm just collecting ideas to try.

    Thanks for any further tips.

  11. #11 - Shelwien (Administrator)
    > That's true, hashes suffer from the small cache (and at the moment I'm too lazy
    > to code tricky trees - actually I never thought about using trees). One of my
    > concepts doesn't work with trees: a hash seeding function based on file type
    > (it allows you to optimize hash/coding-related stuff and to do specialized modelling).

    A hash is a direct memory lookup.
    A tree is a sequence of lookups.
    Obviously you can use a binary (or 16-way, if you like nibbles) tree to associate data with a hash value.
    And of course multiple lookups are only better than a single lookup
    when they improve the cache hit rate.
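
    A minimal sketch of such a 16-way tree keyed by the nibbles of a hash value; the node layout and the grow-on-demand pool are assumptions for the example:

    Code:
    #include <cstdint>
    #include <vector>

    // 16-way trie over the nibbles of a context hash: each lookup becomes a short
    // chain of small memory accesses instead of one wide random access.
    struct TreeNode {
        uint32_t child[16];   // indices into the node pool, 0 = no child yet
        uint8_t  data[8];     // per-context payload (counters, spare bytes, ...)
    };

    struct HashTree {
        std::vector<TreeNode> pool;
        HashTree() : pool(1) {}                       // node 0 is the root

        TreeNode& find(uint32_t hash, int depth) {
            uint32_t cur = 0;
            for (int i = 0; i < depth; i++) {
                uint32_t nib = (hash >> (4 * i)) & 15;
                uint32_t next = pool[cur].child[nib];
                if (next == 0) {                      // allocate the node on first visit
                    next = uint32_t(pool.size());
                    pool.push_back(TreeNode{});
                    pool[cur].child[nib] = next;
                }
                cur = next;
            }
            return pool[cur];
        }
    };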

    > But I don't want maximum speed - I want to achieve a good tradeoff.

    Whatever.
    My opinion is that one has to achieve a perfect result
    (=compression ratio here), and only then start
    to optimize the resource usage.

    And I bet that you don't even have a good model template for
    a sequence of 32-bit integer numbers, for example.
    Meanwhile, good text models are much more complex, and for,
    say, x86 executables you even need a disassembler and pattern
    libraries.

    > Entropy hashes? Give me some more info, please.

    Well, a hash is a direct mapping of data to a memory address, which
    is a number in a given interval.
    And the most effective mapping of data to a number is the arithmetic code.
    So the best hash is made by using the real data statistics in a static
    model for this mapping. (Well, a dynamic model would be perfect, but you'd
    need to remap the whole table on every update.)
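
    A rough sketch of such an "entropy hash"; the static order-0 byte model and the floating-point arithmetic are simplifications made up for the example:

    Code:
    #include <cstdint>
    #include <cstddef>

    // Map a context to a table index like a truncated arithmetic code under a
    // static model: contexts that are frequent in typical data get spread over
    // more slots, rare contexts are packed together.
    // freq[256] is a static byte distribution measured offline, total = sum(freq).
    size_t entropyHash(const uint8_t* ctx, int len,
                       const uint32_t freq[256], uint32_t total, size_t tableSize) {
        double low = 0.0, range = 1.0;
        for (int i = 0; i < len; i++) {
            uint32_t cumLow = 0;
            for (int s = 0; s < ctx[i]; s++) cumLow += freq[s];  // cumulative frequency below the symbol
            low   += range * cumLow / total;
            range *= double(freq[ctx[i]]) / total;
        }
        // Keep only the leading bits of the code value as the table index.
        return size_t(low * double(tableSize)) % tableSize;
    }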

    > Delayed update? Does this really help? I can think of detecting an unfrequently
    > seen symbol (by a lru queue) and excluding it form updates.

    Delayed update for a binary counter:
    1. You have a counter in its initial state.
    Also you have storage for the last 4 bits encountered in this context
    (and presumably the length of this bit sequence, though it's not really
    necessary).
    2. Until all of these first 4 bits appear in the context, you don't need
    to perform any updates, as you can just look up the precalculated
    counter value.
    3. Then, to emulate your normal counter, you have to update the
    counter with the first bit from the queue and shift the queue.
    But, instead, you can use all the queued bits as a context for
    counter update parameter selection and such = profit.

    So in the worst case the delayed counter only uses some extra memory
    and just performs slightly faster on average, while fully emulating the
    normal counter.
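
    A minimal sketch of the delayed counter described above, assuming a simple shift-based bit counter; the precalculated values and the rate-selection rule are placeholder choices:

    Code:
    #include <cstdint>

    // Delayed-update counter: the last 4 bits seen in this context sit in a small
    // queue; the underlying counter is only updated with the bit that falls out of
    // the queue, and the queued bits select the update parameter.
    struct DelayedCounter {
        uint16_t p     = 2048;   // P(bit==1), 12-bit fixed point
        uint8_t  queue = 0;      // up to 4 recent bits, newest in bit 0
        uint8_t  count = 0;      // how many bits are queued (0..4)

        static int ones4(uint8_t x) {                   // popcount of a nibble
            return (x & 1) + ((x >> 1) & 1) + ((x >> 2) & 1) + ((x >> 3) & 1);
        }
        // Step 2: while the queue is filling, predictions come from a precalculated
        // table indexed by the short history (a Laplace-style placeholder here).
        static uint16_t initP(uint8_t hist, uint8_t n) {
            return uint16_t(4096 * (ones4(hist) + 1) / (n + 2));
        }

        uint16_t predict() const { return count < 4 ? initP(queue, count) : p; }

        void update(int bit) {
            if (count < 4) {                            // no counter update yet
                queue = uint8_t((queue << 1) | bit);
                count++;
                return;
            }
            int oldest = (queue >> 3) & 1;              // step 3: the bit leaving the queue
            int rate   = 4 + ones4(queue);              // update rate chosen by the queued bits
            if (oldest) p += (4096 - p) >> rate;
            else        p -= p >> rate;
            queue = uint8_t(((queue << 1) | bit) & 15);
        }
    };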

    > Again, could you give me any sources?

    I'm talking from my own experience, as I'm not into investigating others'
    work unless they show some results that I can't match.
    And I'm not going to post a demo of this technique on a forum, though
    I can probably send you an example if you mail me a message.

    > Order-(-1) ? I don't use a special context for anything. Could this be another
    > point for improvement?

    Order-(-1) is the context "before order-0", with static (and usually equal)
    probabilities for all symbols.
    Also it's very useful, and you should read my posts about counters if
    you haven't.

  12. #12 - toffer (Programmer)
    Thanks for the feedback.

    I already used a technique to update bit models based on the nibbles seen in an order, but I did an update on every hit, not delayed (a modification of something I posted some time ago). It improved compression but hurt speed. I will give it a try.

    You seem to be obsessed by "perfect" (or, well, good) modelling.

    I meant sources, but not in the sense of *.cpp.

    If you've got some other thoughts, please let me know.

  13. #13 - Shelwien (Administrator)
    > You seem to be obsessed by "perfect" (or, well, good) modelling.

    I wouldn't call that obsession, as I'm too lazy to even take on the Hutter contest.
    But I still think that it's common sense to start with a more precise
    algorithm and optimize its speed, rather than try improving the quality of
    an initially fast-and-rough algorithm, because in the latter case you
    don't even know what you are aiming for.
    And in that sense I especially don't like the LZ approach, as it
    gives at least 50% redundancy for most formats and that can't be
    fixed, as it's natural for this kind of algorithm.

    And I also have to add that a "precise algorithm" is not necessarily
    as slow as paq8 at the same compression level, because Matt made it
    in the same "speed first" way, and then tried to compensate for that
    by mixing a multitude of submodels.

    > I meant sources not in the sense of *.cpp.

    I wasn't sure, so I answered for both meanings.

    > If you've got some other thoughts, please let me know.

    I have lots of ideas, but I can't just list them all; you have
    to ask specific questions.
    E.g. here's one piece of advice: pay more attention to statistics
    initialization - most CM models are very redundant in the
    starting phase, so it's better to use a "bootstrap model" first,
    while updating the main one, and fully switch when the main one becomes
    effective. And at the very least you should notice that initializing a
    bit probability in context "0000" with 0.5 is dumb.
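
    To illustrate the last point, a tiny sketch of context-dependent initialization; the particular estimator is just an assumption for the example:

    Code:
    #include <cstdint>

    // Initialize a bit model from the bits of its own context instead of a flat
    // 0.5: a context like "0000" already suggests the next bit is more likely 0.
    uint16_t initialP1(uint32_t ctxBits, int ctxLen) {
        int ones = 0;
        for (int i = 0; i < ctxLen; i++) ones += (ctxBits >> i) & 1;
        // Krichevsky-Trofimov style estimate, 12-bit fixed point.
        return uint16_t(4096 * (2 * ones + 1) / (2 * ctxLen + 2));
    }
    // e.g. initialP1(0x0, 4) gives ~0.1*4096 = 409, while a flat init would give 2048.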

  14. #14 - Member (Kraków, Poland)
    Quote Originally Posted by Shelwien
    But I still think that it's common sense to start with a more precise
    algorithm and optimize its speed, rather than try improving the quality of
    an initially fast-and-rough algorithm, because in the latter case you
    don't even know what you are aiming for.
    You can't make a universal most-precise algorithm, as the structures of data are different from file to file.

    The best compressor, the one that gives the best average compression ratio, is the 'copy' command. If you don't believe it, make some files with random content and try to compress them.

    Compressors just utilize specific properties of commonly used data structures, and that makes them useful. You can't design a good compressor with only a theoretical approach. Experimenting ('shamanic ways') gives you a chance to discover new ways of compressing data (and new properties of commonly used data).

    The only way to design a better compressor is to analyze the input data more deeply.

    Quote Originally Posted by Shelwien
    And at the very least you should notice that initializing a
    bit probability in context "0000" with 0.5 is dumb.
    Why? What if you want to compress hardly compressible data (with fairly random content)?

    In theory you shouldn't do any pre-initialization (i.e. you should set everything to 0.5 probability), but you should adapt quickly at the beginning.




    A good compressor is a bijective compressor (at least almost bijective - a few bytes of overhead per data block are acceptable).
    Quote Originally Posted by Shelwien
    And in that sense I especially don't like the LZ approach, as it
    gives at least 50% redundancy for most formats and that can't be
    fixed, as it's natural for this kind of algorithm.
    I'm using a bijective variation of the LZP method: I'm encoding one flag per byte telling whether it's matched by the LZP model. If it's not matched, then it's coded using another method that knows the byte predicted by the LZP model won't appear - so it excludes it from the statistics. This way there's only one way to encode a particular byte, so we have a bijective compressor.

    But I agree, most LZ methods are very redundant. Still, they provide fast decompression and they aren't so bad (look at LZMA or uharc:alz with multimedia filters).

  15. #15 - Shelwien (Administrator)
    >> But I still think that its common sense to start with more precise
    >> algorithm and optimize its speed, not try improving the quality of
    >> initially fast-and-rough algorithm, because in the latter case you
    >> don't even know what you are aiming for.
    >
    > You can't make a universal most-precise algorithm, as the structures of data are
    > different from file to file.

    Thanks, I didn't know.

    ...I wonder where you've got the idea that I was talking about anything
    universal. My point is that it's wrong to start making a fast and
    redundant model while not having a precise one. I didn't specify the
    kind of model, and it could be any.

    > The best compressor, the one that gives the best average compression ratio, is the 'copy' command.
    > If you don't believe it, make some files with random content and try to compress them.

    I don't believe it, because I don't have any purely random files.
    Even archives and other compressed formats contain some compressible
    headers, and can usually be recompressed with a better ratio too.

    > Compressors just utilize specific properties of commonly used data structures,
    > and that makes them useful. You can't design a good compressor with only a
    > theoretical approach. Experimenting ('shamanic ways') gives you a chance to
    > discover new ways of compressing data (and new properties of commonly used data).

    A "theoretical approach" doesn't mean that you devise some scheme in your
    head and it's supposed to be perfect.
    Instead, it suggests that if you don't understand something, you should
    set up some experiments to gather statistics that might help you
    understand more.
    And with the "shamanic way" you don't actively gather any knowledge, but try
    more-or-less random code patterns until something works.
    Well, it might be quite valuable if you're able to explain some result
    and replicate it in other circumstances, but it's more convenient with
    genetic algorithms or other bruteforce schemes, where you at least know
    that your optimal result is not scalable or anything.

    > The only way to design a better compressor is to analyze the input data more deeply.

    It doesn't help much with "speed first" models.

    >> And at least you should notice that initializing a
    >> bit probability in context "0000" with 0.5 is dumb.
    >
    > why? what if you want to compress hardly compressible data (with fairly random
    > content)?

    Normally you don't.
    And I was just saying that context is context even in the initialization stage, so whatever model you use, you should also use it during initialization
    to compute the probability after processing the given context.
    That includes a "direct" model, which would estimate a 0.5 probability
    even in context "0000", but that's still dumb.

    Also, if one wants to support random data, then he makes a "direct"
    submodel and assigns some weight to it. In which case a "normal"
    model, designed for redundant data, should still be initialized
    using internal context correlations.

    > in theory you shouldn't do any preinitialization (ie. you should set everything
    > to 0.5 probability) but you should adapt quickly at the beginning of adaptation.

    Good adaptability is good, but it is almost impossible to implement.
    The most that can practically be done is maintaining (and mixing) several
    models of different complexity and adaptation speed.
    In which case you still need to properly initialize them, though it
    will help less than with a single model.

    > i'm using bijective variation of lzp method i'm encoding one flag per each
    > byte telling if it's matched by lzp model. if it's not matched then it's coded
    > using other method that knows that byte predicted by lzp model won't appear - so
    > it excludes it from statistics. so this way we have only one way to encode
    > particualr byte, so we have bijective compressor.

    Anyway, there's probably still a possibility of decoding the same file from different match sequences or something... or it won't be LZ.

    > bu i agree, most lz methods are very reduntant. but they provide fast
    > decompression and they aren't so bad (look at lzma or uharc:alz with multimedia
    > filters).

    They won't help with audio/video coding, so I won't look.
    (Also, with uharc there's kinda nowhere to look - LZMA is open-source at least.)

    And I'd like to hear about any modern use cases for fast LZ-like compressors.
    E.g. ages ago I was getting my FIDO echomail via 26kbit dialup, and
    ppmonstr was of great help in that sense. But I don't know who
    would find any real advantage in using a non-standard LZ compressor, or why.

  16. #16 - Member (Kraków, Poland)
    Quote Originally Posted by Shelwien
    Anyway, there's probably still a possibility of decoding the same file from different match sequences or something... or it won't be LZ.
    No, it's bijective. A simple explanation of the method:

    Code:
    a = getpredictedvaluefromlzpmodel();
    if (inputvalue == a)
    {
        encodeflag(0);
    }
    else
    {
        encodeflag(1);
        encodeliteral(inputvalue, a);
    }


    where encodeliteral(inputvalue, a) is a bijective coder. It codes inputvalue knowing that the value a can't appear (effectively removing it from the statistics) - as the simplest example it can be an order(-1) coder with a 255-symbol alphabet (i.e. the 256-byte alphabet with one value removed) - so it's unable to code the value a.
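
    A minimal sketch of such an exclusion coder over a flat 255-symbol alphabet; the encodeSymbol/decodeSymbol callbacks stand in for an arbitrary arithmetic coder and are assumptions, not any real implementation:

    Code:
    #include <cstdint>
    #include <cassert>

    // Encode 'value' knowing that 'excluded' cannot occur: remap the remaining
    // 255 byte values onto 0..254 and code them with an order(-1) coder.
    void encodeLiteralExcluding(uint8_t value, uint8_t excluded,
                                void (*encodeSymbol)(int sym, int alphabetSize)) {
        assert(value != excluded);                        // flag==1 already told us it's a mismatch
        int sym = (value < excluded) ? value : value - 1; // close the gap left by 'excluded'
        encodeSymbol(sym, 255);
    }

    // Decoding mirrors the remapping: read sym in 0..254, then re-open the gap.
    uint8_t decodeLiteralExcluding(uint8_t excluded,
                                   int (*decodeSymbol)(int alphabetSize)) {
        int sym = decodeSymbol(255);
        return uint8_t(sym < excluded ? sym : sym + 1);
    }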

    So you see that there's only one way to encode a file, and it's still LZ, and it's bijective.

    LZ77 and LZSS can't be bijective, but you can easily make LZW bijective.

    How to make LZW bijective? Simply, we temporarily disable (remove) entries that could be coded using other codes - i.e. we must always use the longest match, because if we use a shorter one then we won't be able to encode the file.

    Suppose we have the dictionary:

    abcd
    ab
    cd
    a
    b
    c
    d

    Suppose we have the input:

    abcd

    There are 3 cases:

    1. We encode it using the entry abcd. OK.

    2. We first encode only the ab part. Then our dictionary looks like:

    abcd
    ab
    a
    b
    c
    d

    We can only code the symbol c here. So we encode c. Then we have the dictionary:

    ab
    a
    b
    c

    So we're unable to code d.

    3. We first encode only the symbol a. Then we have the dictionary:

    abcd
    ab
    cd
    a
    c
    d

    And we're unable to code b.

    So we have bijective LZW!!!
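
    A small sketch of the matching rule used above: among the dictionary entries that fit the remaining input, only the longest one is admissible (dictionary updates and the coder itself are omitted, and the helper name is made up):

    Code:
    #include <string>
    #include <vector>

    // Bijective-LZW match selection: return the index of the longest dictionary
    // entry that is a prefix of the remaining input and not longer than it;
    // shorter matches are never allowed, so the code for each position is unique.
    int pickEntry(const std::vector<std::string>& dict, const std::string& in, size_t pos) {
        int best = -1;
        size_t bestLen = 0;
        for (size_t i = 0; i < dict.size(); i++) {
            const std::string& e = dict[i];
            if (e.size() <= in.size() - pos && e.size() > bestLen &&
                in.compare(pos, e.size(), e) == 0) {
                best = int(i);
                bestLen = e.size();
            }
        }
        return best;
    }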

    Quote Originally Posted by Shelwien
    I wonder where you've got the idea that I was talking about anything universal. My point is that it's wrong to start making a fast and redundant model while not having a precise one. I didn't specify the kind of model, and it could be any.
    What do you mean by a "redundant model" and a "precise model"? Assuming both are bijective - i.e. each input to the encoder produces a different output, and each input to the decoder is decodable and produces a different output.

    Edit:
    Of course you must know the size of the data in advance for LZW to be bijective - this is needed to exclude entries that are longer than the unencoded part of the file.

  17. #17 - Shelwien (Administrator)
    > no. it's bijective. simple explanation of method:
    > a = getpredivtedvaluefromlzpmodel();
    > if (inputvalue == a)

    Ah. Then imho it's not LZ.
    It's probably symbol ranking + order-0 CM
    (or not order-0, depending on how you encode your "literals").

    Also, then it's not as fast as a normal LZ,
    and doesn't have the usual LZ asymmetry bonus.

    So there's no sense at all in doing it that way, as
    this method has the same speed (i.e. computational complexity)
    as e.g. PPMd, with significantly worse compression.

    >> Wonder where you've got an idea that I was talking about anything universal. My
    >> point is that its wrong to start making a fast and redundant model while not
    >> having a precise one, I didn't specify the kind of model, and it could be any.
    >
    > what do you mean by a 'redundant model' and a 'precise model'? assuming both are
    > bijective - ie. each input in encoder produces different output and each input
    > in decoder is decodable and produces different output.

    I call a model redundant when the same algorithm could compress better
    if better computing and storage precision were used.
    And I think that while some low level of redundancy is acceptable
    (e.g. a carryless rangecoder instead of a full-interval one), and other
    people are OK with even higher levels (like those introduced by using
    Matt's p-=p>>n counters with an embedded PRNG), there are still levels
    where no common-sense explanation is applicable anymore.
    And your "bijective LZ" algorithms are a perfectly fitting example of
    the latter case, as representatives of the real LZ approach at least
    have their decoding speed.

  18. #18 - Member (Kraków, Poland)
    Quote Originally Posted by Shelwien
    I call a model redundant when the same algorithm could compress better
    if better computing and storage precision were used.
    Can you elaborate on that? Give examples or something.
    I still don't understand what you are talking about.

  19. #19 - Shelwien (Administrator)
    >> I call a model redundant when the same algorithm could compress better
    >> if better computing and storage precision were used.
    >
    > Can you elaborate on that? Give examples or something.
    > I still don't understand what you are talking about.

    1. Read my topic about counters.

    2. Information is lost due to quantization (including rounding).
    Say you're calculating the interval range r in 0..N and using
    it for arithmetic encoding. Let's suppose that an average
    error e is introduced due to computational imprecision,
    such that e.g. the value
    r'=floor(r/(2*e))*(2*e)
    is used instead of the precise r.
    Then you'd have a redundancy of log2(N/(N-e)) bits per encoded symbol.
    Also, if r is computed with a precision of less than log2(N) bits, the
    average error would be e=N/2^(rbits+1).

    The precise redundancy can also be calculated by feeding an interval
    [x..x+1) instead of each integer value into a model implemented
    with interval arithmetic.
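
    For example, plugging numbers into the formulas above: with N=2^32 and r stored with rbits=12 bits of precision, e = 2^32/2^13 = 2^19, so the redundancy is log2(2^32/(2^32-2^19)) = -log2(1-2^-13), which is about 1.8e-4 bits per encoded symbol, or roughly 22 KB per 10^9 symbols.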

  20. #20 - Member (Kraków, Poland)
    Hmm, the model is fed with input data, not some intervals.

    If I want to improve precision then I use larger variables, e.g. 64-bit or so. What's the problem?

  21. #21 - Shelwien (Administrator)
    > Hmm, the model is fed with input data, not some intervals.

    The modelled sequence might be fixed, but all the statistics and
    intermediate variables can be evaluated with intervals, as they have
    limited precision.

    Also, in lossy methods even the model input is uncertain.

    > If I want to improve precision then I use larger variables, e.g. 64-bit or so.

    Yes, but it's not that simple.
    Better precision for variable values _allows_ you to improve the overall
    precision, but that doesn't mean you get better compression simply
    by using larger storage types - the algorithms and parameters would probably
    need to be retuned.

    > what's the problem?

    The problem is that people here are busy writing fast compressors
    which don't use any sophisticated modelling, so no new developments
    are made in that direction.
    Meanwhile I want to talk about this kind of thing, as there are many
    unsolved problems and I'm not that great at solving them myself.
    And speed optimization techniques don't concern me much, as they're
    bound to specific tasks and architectures and so have relatively limited
    use. Also I think that optimization is not a very creative area, as it's
    a problem of hardware/software documentation availability and coding
    time (including coding time for various tools) - and better results
    are usually achieved by investing more time rather than imagination or
    something like that.

  22. #22 - Member (Kraków, Poland)
    The problem is that you have a very good understanding of current compression methods, but the authors of new compressors do not. Instead they have new smart ideas - and they perform tuning by ad hoc manual changing of constants.

    Usually rewriting and redesigning some parts of the algorithm gives more efficiency gain than squeezing the current model to the limit. As long as the model is bijective, it's imo a good practice.

    There are many ways to develop compressors and all are good if they produce satisfactory results (in terms of efficiency or a high compression ratio).

    If you're so good, why didn't you write a paq competitor?

    OK, I want to stop the discussion. I think both approaches (mine and yours) are fine and useful.


  23. #23 - Shelwien (Administrator)
    > they have new smart ideas -

    Really? I still haven't heard anything that wasn't known
    at least 5 years ago.
    Well, with the exception of ABS, but it's impractical, though
    interesting. Kinda similar to BWT in that sense.

    > there are many ways to develop compressors and all are good if they produce
    > satisfactory results (in terms of efficiency or high compression ratio).

    I'd say that's not exactly right.
    Of course general model design is a question of taste, but one way or
    another, the model is composed of simple submodels designed for
    simple specific cases... and the correct design of these building blocks is
    more or less fixed and mathematically provable.
    But nobody uses that.

    > if you're so good, why didn't you write any paq competitor?

    I hope I will, eventually. Well, not exactly a paq replacement,
    but something for the Hutter contest.
    The problem is that it's not exactly a question of better theory or
    something - text modelling is a really complex task that takes
    a lot of working time and can't be solved by just one or two
    simple inventions.

    But I guess the main cause is a lack of motivation. The prize money is
    too small for that amount of work, I don't have any practical
    use for text compressors, and nobody even wants to discuss interesting
    stuff with me.

    > ok, i want to stop discussion. i think both approaches (mine and your) are fine
    > and useful.

    Whatever.

  24. #24 - Matt Mahoney (Expert)
    But I would like to continue discussion. I put together the large text benchmark because text compression is so hard. You have to model the human brain! A few years ago I applied for a grant from NSF to offer prize money but was turned down. Then Hutter offered his own money. He was one of the few people who could see that text compression is an AI problem.

    A good text compressor should model semantics and grammar. paq8hp12 and durilca4linux both do in a primitive way. If you look at the dictionaries, you see that related words are grouped together. This allows groups of words with the same meaning to be treated as the same context.

    The modeling is primitive. For example, they can't use constraints like every sentence needs a verb and a period. I have some ideas on how to model language at higher levels using neural networks but I think it would require hundreds of GB of memory and weeks or months to run. If you look at enwik9, the biggest factor is not how sophisticated the model is, but how much memory it uses. Simple compressors like BBB do well just because they use memory efficiently. It is not so great on enwik8.

    Hutter based the contest on enwik8 to keep the computational limits reasonable. IMHO I think this is too small to solve the AI problem. I based my benchmark on enwik9 because I think you need about 1 GB to train an adult level language model. Children are exposed to 100 MB of language by around age 3. We never did resolve our differences, so that is why there are all these annoying little differences between the Hutter Prize and LTCB. I expect LTCB to be around for a long time (like the Calgary corpus), but better compression will have to wait for Moore's law.

  25. #25 - Shelwien (Administrator)
    > But I would like to continue discussion.

    Yeah, it's a good way to motivate me to do something.

    > I put together the large text benchmark because text compression
    > is so hard. You have to model the human brain!

    Regrettably, text compression is not hard in the cryptographic sense;
    it just requires manually building a really complex parser and then
    a lot of relatively simple but different models for all the cases.
    You can say that these can be built automatically by statistical analysis,
    but I don't really believe in the possibility of decrypting a text in an
    unknown language just based on the correlations in it.
    So I somehow feel that the prize money would be less than the salary I'd get
    for my work time, especially considering that the work would be
    compression-related anyway.

    > Then Hutter offered his own money. He was one of the few people who could
    > see that text compression is an AI problem.

    Well, I don't know about AI, but imho proper CM techniques are good for any task involving modelling... which is almost everything.
    So the experience gained from writing a "universal" compressor could
    technically be used for, e.g., building a better caching mechanism for a
    DB cluster or something.

    > A good text compressor should model semantics and grammar. paq8hp12 and
    > durilca4linux both do in a primitive way. If you look at the dictionaries, you
    > see that related words are grouped together. This allows groups of words with
    > the same meaning to be treated as the same context.

    Well, of course preprocessing is outdated already.
    So it should be done with a word-based model.
    But both paq8 and ppmonstr (durilca = preprocessors + ppmonstr) are still
    inefficient even in what they already do.

    > The modeling is primitive. For example, they can't use constraints like every
    > sentence needs a verb and a period. I have some ideas on how to model language
    > at higher levels using neural networks but I think it would require hundreds of
    > GB of memory and weeks or months to run.

    Well, I don't think that semantic analysis is really needed for beating
    the current record.

    > If you look at enwik9, the biggest
    > factor is not how sophisticated the model is, but how much memory it uses. Simple
    > compressors like BBB do well just because they use memory efficiently. It is not
    > so great on enwik8.

    For that I have an idea that might work.
    Specifically, storing statistics in memory is a time-memory
    tradeoff, so statistics storage could be treated as a caching
    system for calculation results.

    > Hutter based the contest on enwik8 to keep the computational limits reasonable.
    > IMHO I think this is too small to solve the AI problem. I based my benchmark on
    > enwik9 because I think you need about 1 GB to train an adult level language
    > model. Children are exposed to 100 MB of language by around age 3.

    I think that's kinda arbitrary, as children don't usually learn from written text. And I think they couldn't even if they wanted to, without some context external
    to the written text.

    > We never did
    > resolve our differences, so that is why there are all these annoying little
    > differences between the Hutter Prize and LTCB. I expect LTCB to be around for a
    > long time (like the Calgary corpus), but better compression will have to wait
    > for Moore's law.

    Well, I'm not really interested in the AI aspect here, so imho it's as
    good as any other dataset.

  26. #26 - Matt Mahoney (Expert)
    How does a text compressor know to assign a higher probability to "the cat caught a mouse" than "the cat caught a moose"? This is what I mean when I say text compression is an AI problem. Text is almost unique in that it requires so much memory to compress well. You don't need lots of memory to compress images, audio, executables, etc.

    Hutter wrote some theoretical papers that relate to universal compression and AI. His AIXI theory equates intelligence to the ability to model the environment with the shortest possible program consistent with observation. This is just the compression problem. A universal compressor would code any string as the shortest program that outputs it. But Kolmogorov proved that the problem cannot be solved. http://en.wikipedia.org/wiki/Kolmogorov_complexity

    For example, suppose I encrypted a 1 GB block of zero bytes with AES, so the output would appear random. How would you compress it? An ideal compressor would output a small program that created the file the same way I did. That means you would have to guess the key or break the encryption.

    But anyway I agree the prize is too small. There is not much I can do about that. But I don't think this will stop people from working on the problem. Getting your name at the top of a list is enough reward for many people. The Calgary challenge has a very small prize but it has been around for 12 years so far. http://mailcom.com/challenge/
    Plus, the reward for solving AI will be much greater than just the Hutter prize

    For now I think AI is not so important for text compression, but only because computers are not powerful enough to implement it. Eventually Moore's law will change the rules and we won't be able to ignore it any longer. That is why I designed LTCB with no speed or memory limits. It is an open benchmark where anyone can submit or verify results.

  27. #27 - Shelwien (Administrator)
    > How does a text compressor know to assign a higher probability to "the cat
    > caught a mouse" than "the cat caught a moose"?

    1. I'm not sure it should - your sentence is a good example.
    2. It could also be a typo.

    So that might be a wrong example, but I agree that there are some
    predictions that can be derived from knowing the subject's properties.

    > This is what I mean when I say
    > text compression is an AI problem. Text is almost unique that it requires so
    > much memory to compress well. You don't need lots of memory to compress images,
    > audio, executables, etc.

    Well, that's clearly wrong, as complete models for all of these data
    types should include a text submodel.
    It's just that structure recovery is more complex there.
    Also there's confusion due to lossy compression, which allows better
    results to be achieved, but actually makes proper modelling even harder.

    Executables are kinda an exception here, but disassembly/decompilation
    and pattern analysis for code segment compression might require even
    more memory, and executables are also more of a container format
    nowadays - they usually include images, text and other kinds of stuff.

    > For example, suppose I encrypted a 1 GB block of zero bytes with AES, so the
    > output would appear random. How would you compress it? An ideal compressor
    > would output a small program that created the file the same way I did. That
    > means you would have to guess the key or break the encryption.

    If there was a demand for compression of unsalted AES-encoded redundant
    (uncompressed) data, like if MS used it in their .doc containers
    or something, I'd bet there'd be a solution already.
    Like using a rainbow-table approach or some hardware.

    Btw, have you considered using a GPU for some paq version?

    > But anyway I agree the prize is too small. There is not much I can do about
    > that.

    Well, you could change the dataset from enwik8 to some audio or video files...
    With lossless compression required, the available compressors won't be much
    of a competition, so working on it would be worth the money.

    > But I don't think this will stop people from working on the problem.
    > Getting your name at the top of a list is enough reward for many people.

    Well, A.R.'s name there might scare away some promising participants,
    as that's a guy who never wrote a real compressor of his own.
    That is, he probably does deserve that money for his work, but I think
    that it's still you who actually holds the record.

    > The Calgary challenge has a very small prize but it has been around for 12 years so
    > far. http://mailcom.com/challenge/

    You might not know, but at the time it was very attractive for Russian programmers.
    Also, here's proof that I know about it - http://mailcom.com/sh

    > Plus, the reward for solving AI will be much greater than just the Hutter prize

    Was there ever even a proof that a person with congenital deafblindness
    can pass the Turing test? And that's what you're trying to emulate with
    compressors.

    > For now I think AI is not so important for text compression

    Well, what I actually wanted to discuss is not philosophy, but some
    practical problems. Like whether there's a way to optimize the counter
    parameters without directly trying them all, or context model
    definition language syntax (I use one via a preprocessor for C++),
    or applications of interval arithmetic, or decorrelation and
    entropy coding of contexts, or something else like that -
    or I'd never stop.

  28. #28 - Matt Mahoney (Expert)
    Quote Originally Posted by Shelwien
    If there was a demand for compression of unsalted AES-encoded redundant
    (uncompressed) data, like if MS used it in their .doc containers
    or something, I'd bet there'd be a solution already.
    Like using a rainbow-table approach or some hardware.
    I don't think so. The only way to compress encrypted data is to decrypt it. If you think there is an easier way, try compressing sharnd_challenge.dat from http://cs.fit.edu/~mmahoney/compression/#sharnd

    Quote Originally Posted by Shelwien
    Btw, have you considered using a GPU for some paq version?
    It would be the best way to accelerate a neural language model, but it would also need a lot of memory. I haven't tried it.

    But anyway I am also interested in practical compression, but first there are some theoretical problems to solve. Have you seen http://cs.fit.edu/~mmahoney/compression/rationale.html ?

  29. #29 - Shelwien (Administrator)
    >> If there was a demand for compression of unsalted AES-encoded redundant
    >> (uncompressed) data, like if MS used it in their .doc containers
    >> or something, I'd bet there'd be a solution already.
    >> Like, using a rainbow table approach or some hardware.
    >
    > I don't think so. The only way to compress encrypted data is to decrypt it.

    That was implied. I just tried to say that there are relatively fast ways
    to crack a fixed encryption (non-parametric, etc.).
    http://en.wikipedia.org/wiki/Rainbow_table

    > If you think there is an easier way, try compressing sharnd_challenge.dat from
    > http://cs.fit.edu/~mmahoney/compression/#sharnd

    Well, we weren't talking about that previously.
    You use your key on each iteration there, so it might be impossible to
    crack without fully recovering the key, and the keyspace is around 2^520
    (from log2(1+96+...+96^79)).

    But with a keyspace up to 2^128 it could be possible.

  30. #30 - Shelwien (Administrator)
    > But anyway I am also interested in practical compression, but first there are
    > some theoretical problems to solve. Have you seen
    > http://cs.fit.edu/~mmahoney/compression/rationale.html ?

    Well, I did see it, but had already forgotten it, so I had to read it again.
    I still don't understand what you are calling "theoretical problems", though.

    Also, I do agree that text compression is AI-hard, but that still doesn't
    mean that you can learn an unknown language by using _any_ sufficiently
    large text corpus. E.g. even an English-speaking "normal" person wouldn't
    be able to guess from the previous sentence "what part of text" is a "corpus".


