
Thread: need advice on low-order backend of TarsaLZP

  1. #1
    Member donkey7 (Kraków, Poland; joined Jun 2009)
    Hi.

    Maybe I will find some spare time to work on TarsaLZP someday. Currently I'm thinking about the techniques to be used in it.

    My current scheme is:

    1. A mixed order-4 and order-8 LZP model. It outputs one predicted symbol and codes a flag indicating whether that symbol occurred, or it signals that context confirmation failed and doesn't code any flag.
    2. If the symbol wasn't coded by the previous step, code it using an aridemo-like order-2 coder.
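    As a rough illustration of step 1 only (a single order-4 model; the struct name, hash, and table size are mine, not TarsaLZP's), an LZP predictor that emits hit/miss flags could be sketched like this:

```cpp
#include <cstdint>
#include <string>
#include <vector>
#include <cassert>

// Minimal single-order LZP sketch: a hashed order-4 context predicts one
// byte; a flag says whether the prediction matched. Bytes with no
// confirmed context (-1) and misses (0) would go to the literal coder.
struct Lzp4 {
    std::vector<uint8_t> table;          // predicted byte per hashed context
    std::vector<bool>    seen;           // context confirmation
    Lzp4() : table(1 << 16, 0), seen(1 << 16, false) {}

    static uint32_t hash(uint32_t ctx) { // order-4 context -> table index
        return (ctx * 2654435761u) >> 16;
    }

    // Returns per-byte flags: 1 = prediction matched, 0 = miss,
    // -1 = context confirmation failed (no flag would be coded).
    std::vector<int> process(const std::string& data) {
        std::vector<int> flags;
        uint32_t ctx = 0;
        for (unsigned char c : data) {
            uint32_t h = hash(ctx);
            if (seen[h])
                flags.push_back(table[h] == c ? 1 : 0);
            else
                flags.push_back(-1);
            table[h] = c;                // update the prediction
            seen[h]  = true;
            ctx = (ctx << 8) | c;        // slide the order-4 context
        }
        return flags;
    }
};
```

    The real model would mix two orders and attach a bit coder to the flag; this only shows the predict/confirm/update cycle.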

    I will work on step 1 myself, as almost nobody uses rank0 (or LZP, in my terminology) codecs; maybe except toffer, who uses it in cmm, but cmm is a CM, which is very different.

    I plan to improve step 2. I want to implement an aridemo-like order-3 coder and an fpaq0f2-like order-1 coder, both with symbol masking (exclusion). What I don't know is escape modelling and the handling of 'pollution'.

    What's 'pollution'? A typical example is PDFs: they contain a lot of text and a lot of random (compressed) data. That compressed data pollutes the statistics gathered on the text; additionally, the compressed data gets expanded by a large amount. What I want is some workaround for that.

    My current workaround is delayed addition of symbols: a symbol is added to the model on its second occurrence. That improves performance on PDFs a lot (it's usually better than PPMII on that type of data) but hurts on many other, more or less textual, kinds of data.
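    A minimal sketch of the delayed-addition idea (the struct and member names are hypothetical): a symbol seen once is only remembered; it enters the context's statistics on its second occurrence, so one-off bytes from a random region never pollute the model.

```cpp
#include <map>
#include <cassert>

// Delayed symbol addition: admit a symbol into the alphabet only on its
// second occurrence.
struct DelayedFreqs {
    std::map<unsigned char, int> pending; // seen once, not yet in model
    std::map<unsigned char, int> freqs;   // admitted symbols

    void update(unsigned char c) {
        if (freqs.count(c)) { freqs[c]++; return; }
        if (pending.count(c)) {           // second occurrence: admit
            freqs[c] = 2;
            pending.erase(c);
        } else {
            pending[c] = 1;               // first occurrence: remember only
        }
    }
    bool in_model(unsigned char c) const { return freqs.count(c) != 0; }
};
```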

    If nobody knows an efficient solution to the 'pollution' problem, then maybe I will skip it, as it's not very important.

    The most important thing for me is efficient escape modelling.

  2. #2
    Programmer osmanturan (Mersin, Turkiye; joined May 2008)
    Quote Originally Posted by donkey7
    What's 'pollution'? A typical example is PDFs: they contain a lot of text and a lot of random (compressed) data. That compressed data pollutes the statistics gathered on the text; additionally, the compressed data gets expanded by a large amount. What I want is some workaround for that.
    I had the same problem with BSP files; they are much more "polluted". I thought data segmentation could help on such files with my ROLZ-based compressor, and I tested this theory with durilca's hidden segmentation feature, but it was a failure. Maybe your compressor would fit durilca-like segmentation, which seems to be based on order-1. I think you should try it, and I wonder about the results. I still insist that segmentation would help compression; I think durilca's power mostly comes from segmentation.
    BIT Archiver homepage: www.osmanturan.com

  3. #3
    Administrator Shelwien (Kharkov, Ukraine; joined May 2008)
    > 1. mixed order-4 and order-8 lzp model. it outputs one predicted symbol and
    > codes a flag indicating whether that symbol occurred or outputs information that
    > context confirmation failed and doesn't code any flag.
    > 2. if symbol wasn't coded by previous step then code it using an aridemo-like
    > order-2 coder.
    >
    > i will work on step 1. myself as almost nobody uses rank0 (or lzp in my
    > terminology) codecs. maybe except toffer who uses it in cmm, but cmm is a cm
    > which is very different.

    I wonder whether you've considered coding a match length instead of a flag,
    followed by proper masking etc., of course.
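    For illustration, the match length meant here is just the length of the common prefix of the predicted byte string and the actual data (the helper name is mine); the coder would then transmit this length rather than one flag per byte, and mask the first mismatching prediction out of the literal alphabet:

```cpp
#include <string>
#include <algorithm>
#include <cassert>

// Length of the common prefix of the predicted and the actual bytes.
int match_length(const std::string& predicted, const std::string& actual) {
    int lim = (int)std::min(predicted.size(), actual.size());
    int n = 0;
    while (n < lim && predicted[n] == actual[n]) ++n;
    return n;
}
```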

    > i plan to improve step 2. i want to implement an aridemo-like order-3 coder and
    > an fpaq0f2-like order-1 coder, both with symbol masking (excluding). what i don't
    > know is escape modelling and handling of 'pollution'.

    Well, escaping is simple to explain (though not so simple to implement).
    It's all about separately modelling the symbols which didn't occur in the context yet.
    With the PPM approach (e.g. ppmd) you just keep an explicit escape symbol in the
    alphabet and encode it when the encountered symbol is not in the alphabet. Then you
    switch to another context, mask out the already-seen symbols, and continue like
    that until you encode the symbol or run out of contexts.
    But with such a model, escape codes amount to a significant part of the compressed
    file, so they are an obvious target for further improvement, such as secondary
    estimation (SEE is the origin of SSE, in fact).
    This approach also benefits from so-called "inheritance": cutting off the part of the
    escape's frequency corresponding to the actual symbol's probability in the other
    context and assigning it to the new symbol in the context.
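    The escape chain described above can be sketched without any real coder attached (the function name and the representation of contexts as a list of alphabets are mine): on a miss we "code" an escape, mask the symbols already tried, and fall back to a lower order.

```cpp
#include <set>
#include <vector>
#include <cassert>

// Counts how many escapes a PPM-style chain would code before finding
// symbol `c`, walking contexts from highest order to lowest with
// exclusion. Returns -1 if all contexts are exhausted (a real coder
// would then fall through to a flat order(-1) model).
int count_escapes(const std::vector<std::set<char>>& orders, char c) {
    std::set<char> masked;                 // symbols excluded so far
    int escapes = 0;
    for (const auto& alphabet : orders) {  // highest order first
        std::set<char> visible;
        for (char s : alphabet)
            if (!masked.count(s)) visible.insert(s);
        if (visible.count(c)) return escapes;
        escapes++;                         // code the escape symbol
        masked.insert(visible.begin(), visible.end());
    }
    return -1;
}
```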

    Another possibility is to encode the escape flag separately, before processing
    the context alphabet. Recent results presented in another thread show that this
    allows for better compression at the cost of one extra coder call per symbol.

    Then again, this all applies to real PPM models, where only the highest order
    corresponds to the actual data, while all the others process the leftovers.
    That minimizes context updates and coder calls, so it's relatively fast.

    With CM, on the other hand, you simultaneously model the data with all the context
    orders, so it's enough to leave some interval for the escape, onto which the mixed
    estimations from the other orders can be mapped.
    Normally CM shows better compression, but it requires mixing (at least static)
    and an update of all contexts to process each symbol, so it's much slower.

    > what's 'pollution'? typical example is PDFs. they contain a lot of text and a
    > lot of random (compressed) data. that compressed data pollutes statistics
    > gathered on text; additionally, compressed data gets expanded by a large amount.
    > what i want to achieve is to make some workaround for it.

    One possible workaround is what I did in the order_test4 demo, where the
    rangecoder is not used immediately during encoding; instead, the probabilities
    are collected and then discarded when they turn out to be redundant.

    But that only stops the expansion. To also prevent the "statistics pollution",
    it's possible to keep a log of statistics modifications and use it to undo the
    changes when redundancy is detected.
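    The undo-log idea can be sketched like this (all names are mine): every counter update records the old value, and if the just-processed block turns out to be redundant the log is replayed backwards to restore the model.

```cpp
#include <vector>
#include <utility>
#include <cassert>

// Counters with a modification log: commit() keeps the updates,
// rollback() undoes them in reverse order.
struct LoggedCounters {
    std::vector<int> freq;
    std::vector<std::pair<int,int>> log;   // (index, previous value)

    explicit LoggedCounters(int n) : freq(n, 0) {}

    void update(int sym) {
        log.push_back({sym, freq[sym]});   // remember the old value
        freq[sym]++;
    }
    void commit() { log.clear(); }         // block was compressible: keep
    void rollback() {                      // block was redundant: undo
        for (auto it = log.rbegin(); it != log.rend(); ++it)
            freq[it->first] = it->second;
        log.clear();
    }
};
```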

    It's also possible to keep a delayed copy of the full model at low orders,
    but I doubt that it would be any faster than the patches described above.

    And another idea might be to perform a segmentation pass before the
    actual compression and roughly determine which segments don't fit the model.

    > an fpaq0f2-like order-1 coder

    Btw, fpaq0f2 is nice by itself, but I don't think it's fit for use as a submodel:
    you can exploit the same correlations throughout the whole context model, which
    would be better than a simple model that escapes to fpaq0f2, and then there
    would be no reason to do the same modelling twice.

    > my current workaround is delayed addition of symbols. symbols are added on
    > second occurrence. that improves performance on PDFs a lot (it's usually better
    > than ppmii on such type of data) but hurts on many other types of more or less
    > textual data.

    That's just a matter of implementation.
    It's obviously possible to simulate a normal update with a delayed one,
    so delayed update is always superior in prediction... at the cost of
    extra computation.

    > if nobody knows an efficient solution for the 'pollution' problem then maybe
    > i will skip it as it's not very important.

    I agree that it's wrong to provide support for random data while not having
    proper models for known formats. But that's probably not your reason.

    > most important thing for me is efficient escape modelling.

    Well, again, it's simple.
    You just define what exactly an "escape" is, and then model it,
    with secondary estimation, submodel mixing and whatever else,
    if simple statistics are not enough.

  4. #4
    Member donkey7 (Kraków, Poland; joined Jun 2009)
    I think I will opt for segmentation to handle 'pollution'; it's easy to separate normal text from other types of data.

    And I will use something like SEE in PPMII. In PPMII there's no escape symbol in the alphabet: escapes are effectively flags, and they are modelled by the number of symbols in the context and their total frequency, rather than by a separate escape symbol's frequency. I think that represents the probability of escapes even more closely than the normal aridemo-like escape handling.
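    A minimal sketch of that SEE-style idea (table shape, quantization, and adaptation rate are my assumptions, not PPMII's): the escape probability is looked up in a secondary table indexed by coarse context features (number of distinct symbols, quantized total frequency), and each cell is a small adaptive bit model.

```cpp
#include <cstdint>
#include <cassert>

// One adaptive escape-probability cell (p out of 65536).
struct SeeCell {
    uint16_t p = 1 << 14;        // start at ~0.25, an arbitrary prior
    void update(bool escaped) {  // shift-based adaptation
        if (escaped) p += (65535 - p) >> 5;
        else         p -= p >> 5;
    }
};

// Secondary table indexed by quantized context features instead of by
// a per-context escape frequency.
struct SeeTable {
    SeeCell cells[8][8];
    static int quant(int x) {    // coarse log2-style quantization
        int q = 0;
        while (x > 1 && q < 7) { x >>= 1; q++; }
        return q;
    }
    SeeCell& ctx(int num_syms, int total_freq) {
        return cells[quant(num_syms)][quant(total_freq)];
    }
};
```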

    Of course I'm masking symbols (e.g. when going from the LZP models to literal coding). Without such masking, compression is much worse.

  5. #5
    Administrator Shelwien (Kharkov, Ukraine; joined May 2008)
    > i think i will opt for segmentation to handle 'pollution'. it's easy to separate
    > normal text from other types of data.

    The only rational way to define "text" here is data-compressibility estimation using
    some model which is good for text. Of course it's possible to use a simpler model for
    the estimation and a more complex (and slower) one for the compression, but I'm not
    sure whether that's really necessary.
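    Compressibility estimation for segmentation can be sketched like this (model, increment, and threshold are my choices for illustration): run a cheap adaptive order-0 model over a block and sum -log2(p) instead of actually range coding; blocks whose estimate stays near 8 bits/byte are flagged as "polluted".

```cpp
#include <cmath>
#include <string>
#include <cassert>

// Estimated coded size of a block in bits, using an adaptive order-0
// byte model (uniform prior, fast increment) without any real coder.
double estimate_bits(const std::string& block) {
    int freq[256];
    for (int i = 0; i < 256; ++i) freq[i] = 1;       // uniform prior
    int total = 256;
    double bits = 0;
    for (unsigned char c : block) {
        bits += -std::log2(double(freq[c]) / total); // ideal code length
        freq[c] += 32;                                // fast adaptation
        total   += 32;
    }
    return bits;
}

// Crude text detector: text compresses well below 6 bits/byte here.
bool looks_like_text(const std::string& block) {
    return estimate_bits(block) < 6.0 * block.size();
}
```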

    > and i will use something like see in ppmii. in ppmii there's no escape symbol in
    > alphabet. escapes are effectively flags and they are modeled by number of
    > symbols and their total frequency rather than separate escape symbol frequency.
    > and i think that it even more closely represents the probability of escapes than
    > normal aridemo like escape handling.

    Well, Shkarin's sources are tiresome to read, but I'm pretty sure that the escape's
    frequency is included in the total context frequency, which means, like I was saying,
    that the escape is modelled as a special symbol in the context's alphabet.
    And as to flags and SSE: ppmonstr uses fully unary encoding, so all the symbols
    are basically flags. But that doesn't change the fact that there are different
    possible primary context models.
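    The fully unary coding mentioned above can be sketched as follows (function name and ranking are mine, not ppmonstr's): candidates are tried in some ranking order and one binary "is it this symbol?" flag is produced per candidate, so every decision is a flag that could go through SSE.

```cpp
#include <vector>
#include <cassert>

// Returns the flags a unary coder would emit: 0 for each rejected
// candidate, then a final 1 when the actual symbol is reached. If the
// symbol is not in the ranking, all flags are 0 (a real coder would
// then escape to another model).
std::vector<int> unary_flags(const std::vector<char>& ranked, char c) {
    std::vector<int> flags;
    for (char s : ranked) {
        if (s == c) { flags.push_back(1); break; }
        flags.push_back(0);
    }
    return flags;
}
```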

    > of course i'm masking symbols (e.g. when going from lzp models to literal
    > coding). without such masking compression is much worse.

    I was talking about match-length encoding instead of simple flags, with the
    corresponding masking of symbols from previously skipped substrings.

    Btw, it would be nice if you could release some separate order-0, order-1 or order-2
    coder which we could compare to my coders, toffer's cmmd, and the fpaq series.

