
Thread: Advanced Lazy Matching

  1. #1
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts

    Advanced Lazy Matching

    Previously, I described a version of lazy matching with 2-byte lookahead: instead of just checking the next byte for a longer match, as GZIP does with the -9 switch, we also check the second byte for a longer match. If we check from the second byte, we probably should raise the threshold for dropping a match: for example, we drop the current match only if we find a match longer than the current match length plus one.
    OK, but what if we found a longer match at the first byte, yet an even longer match may follow right after the current match? Here we can use advanced lazy matching: we keep the sum of the current match length and the length of the match that immediately follows it. If at the first, second, or a farther position we find a match longer than that sum, we drop the current match. This scheme can be regarded as a simplified or restricted Flexible Parsing.
    I think I will add such parsing to BALZ v1.06!
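
    A minimal sketch of the decision rule described above, with a naive brute-force match finder standing in for a real one; this is an illustration only, not BALZ's actual code:

        #include <cstddef>

        // Naive brute-force match finder (illustration only): length of the
        // longest earlier occurrence of the string starting at pos.
        static size_t find_longest_match(const unsigned char* buf, size_t pos, size_t n)
        {
            size_t best = 0;
            for (size_t start = 0; start < pos; start++) {
                size_t len = 0;
                while (pos + len < n && buf[start + len] == buf[pos + len]) len++;
                if (len > best) best = len;
            }
            return best;
        }

        // Advanced lazy matching: given a match of length cur_len found at pos,
        // decide whether to drop it in favour of a better parse starting later.
        bool drop_current_match(const unsigned char* buf, size_t pos, size_t n,
                                size_t cur_len, size_t lookahead /* e.g. 2 */)
        {
            // Sum of the current match and the match that immediately follows it.
            size_t follow = (pos + cur_len < n) ? find_longest_match(buf, pos + cur_len, n) : 0;
            size_t sum = cur_len + follow;

            // If a match starting 1..lookahead bytes later is longer than that sum,
            // drop the current match (emit a literal and reparse from pos+1).
            for (size_t i = 1; i <= lookahead && pos + i < n; i++)
                if (find_longest_match(buf, pos + i, n) > sum)
                    return true;
            return false; // keep the current match
        }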

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    Wonder if you tried to collect any statistics on that.
    I mean, imagine that you're trying to encode the optimal parsing results
    using the data + fast parsing as a context.

  3. #3
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    Interesting idea. I think it depends on the type of optimal parsing - on what approach exactly we use...

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    Optimal parsing is the parsing that gives the best compression.
    Though I guess any specific parsing scheme can be approximated
    with a statistical model.

  5. #5
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts


    The problem is that there is no REAL optimal parsing for LZ+ARI - i.e. when we deal with adaptive encodings. Using small blocks (4 KB or so) we may reach some local, but not the global, minimum.
    Furthermore, Storer & Szymanski parsing with my LZ+ARI can perform worse than simple lazy matching, for example on 'canterbury.tar', which indicates that calculating the real coding cost is necessary.

    Statistical coders have no parsing problems!

  6. #6
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts

    BALZ 1.05 option EX tested on MOC!

    new results on MOC test passed [option ex]!
    425.231.725 b
    Compression time=1.032,278 sec.
    Decompression time=66,899 sec.
    Hi!

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    Well, I think that it's relatively simple to implement real optimal parsing -
    compared to statistical models, at least.
    Basically you just need to keep up multiple coding threads at once.

    Also, it might be possible to do parsing based on a statistical model.
    I mean, it's obvious what the match is in deterministic contexts,
    and in other places you can select the match, e.g., by coding costs.
    In fact, it's actually the same thing - just using the results (like symbol
    code lengths) of a statistical model as a context instead of fast parsing.

    But well, I also think that fast compressors should be built
    by speed-optimizing good statistical models, instead of trying
    to improve the efficiency of LZ, so whatever.
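
    One common approximation of the "multiple coding threads" idea is a shortest-path parse over the block; the sketch below assumes static per-token bit costs (toy values), which sidesteps the adaptive-model state problem discussed in this thread:

        #include <algorithm>
        #include <cmath>
        #include <cstddef>
        #include <cstdint>
        #include <limits>
        #include <vector>

        struct Match { size_t offset, length; };

        // Toy, static bit-cost estimates; in a real adaptive coder these depend
        // on the model state, which is exactly what makes true optimal parsing hard.
        static double literal_cost(uint8_t) { return 9.0; }
        static double match_cost(size_t offset, size_t length) {
            return 8.0 + std::log2(double(offset) + 1) + std::log2(double(length) + 1);
        }

        // Naive match finder: all maximal matches starting at pos, 4 KB window.
        static std::vector<Match> find_matches(const std::vector<uint8_t>& buf, size_t pos) {
            std::vector<Match> out;
            for (size_t start = (pos > 4096 ? pos - 4096 : 0); start < pos; start++) {
                size_t len = 0;
                while (pos + len < buf.size() && buf[start + len] == buf[pos + len]) len++;
                if (len >= 2) out.push_back({pos - start, len});
            }
            return out;
        }

        // Cheapest-path parse: cost[i] = minimal bits needed to encode buf[0..i).
        // (A full optimal parse would also try truncated match lengths.)
        std::vector<double> optimal_parse(const std::vector<uint8_t>& buf) {
            const size_t n = buf.size();
            const double INF = std::numeric_limits<double>::infinity();
            std::vector<double> cost(n + 1, INF);
            cost[0] = 0.0;
            for (size_t i = 0; i < n; i++) {
                if (cost[i] == INF) continue;
                cost[i + 1] = std::min(cost[i + 1], cost[i] + literal_cost(buf[i]));
                for (const Match& m : find_matches(buf, i))
                    cost[i + m.length] = std::min(cost[i + m.length],
                                                  cost[i] + match_cost(m.offset, m.length));
            }
            return cost; // backtrack over the argmin choices to recover the actual parse
        }

    With an adaptive model the cost functions themselves depend on which parse was chosen so far, hence the need to carry several coder states ("threads") in parallel.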

  8. #8
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts


    Quote Originally Posted by Nania Francesco
    new results on MOC test passed [option ex]!
    425.231.725 b
    Compression time=1.032,278 sec.
    Decompression time=66,899 sec.
    Hi!
    Thank you!

  9. #9
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts


    Quote Originally Posted by Shelwien
    But well, I also think that fast compressors should be built
    by speed-optimizing good statistical models, instead of trying
    to improve the efficiency of LZ, so whatever.
    Yep, optimized LZ77 might be slower than, say, PPM or CM - and thus less efficient in terms of compression time vs. compression ratio. The only reason to use it is the very simple and fast decoder - LZ77 is the leader here.

  10. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    > The only reason is very simple and fast decoder - LZ77 is the leader here.

    The only requirement for such fast decoding is copying strings from somewhere
    instead of single bits or bytes.
    So I believe that, e.g., a word-oriented statistical model would be quite competitive
    in text decoding speed (while having much better compression).
    Also, symbol ranking + RLE might be fast enough too.

    And isn't plain LZ77 a little boring?
    For example, have you tried inserting the length codes at the actual match start position,
    instead of offset;length pairs?

    And then, statistical models are better only because they don't have LZ's redundancy.
    But the parsing problem remains intact... As a demo, you can try compressing a text and its
    version with all spaces removed.

  11. #11
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts


    Quote Originally Posted by Shelwien
    For example, have you tried inserting the length codes at the actual match start position,
    instead of offset;length pairs?
    I don't quite understand the idea. What do you mean?

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    1) inserting the length codes at the actual match start position instead of offset;length pairs
    2) insert<5;2>g the le<8;2><7;2> codes at<19;5>actual m<14;2>c<25;2>sta<45;2> posi<51;2>on <61;3>tead of<2;3>f<73;2>t;<65;7>pairs
    3) 23;;2;;2;2;;2;;5;2;;;;7;;;;;2;;;;;;;;2;;;;;;;;;;;; ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;3;;;;;;;;;;;;;;;;;;; ;;;;
    4) insert{0}g the le{4}{5} codes at{4}actual m{6}c{5}sta{2} posi{2}on {0}tead of{2}f{0}t;{0}pairs

    It turned out to be unexpectedly troublesome to explain, so I wrote this:
    http://shelwien.googlepages.com/marks.cpp

    (1) is the processed string
    (2) is traditional LZ77 match coding
    (3) is the length codes per location
    (4) is the string encoded with match symbols according to (3)
    Actually (3) and (4) are supposed to be the same (merged) stream, but then it would be unreadable.

    The main idea is that this way there is no offset limit.
    Last edited by Shelwien; 7th May 2008 at 18:43.
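
    A rough sketch of how such a marker-based representation could be produced from an ordinary LZ77 parse; the stream layout and the recency numbering of marks are assumptions for illustration, not necessarily what marks.cpp does:

        #include <cstddef>
        #include <string>
        #include <vector>

        // One LZ77 match as produced by a normal parser.
        struct LZMatch { size_t pos, offset, length; };

        // Marker-based form: lengths are attached to the *source* positions of
        // matches, and each match in the token stream is just an index into the
        // marks seen so far - so there is no offset limit.
        struct MarkerStream {
            std::vector<size_t> mark_len;   // per position: match length starting here, 0 = unmarked
            std::vector<size_t> match_refs; // per match, in order: recency index of the referenced mark
        };

        MarkerStream to_markers(const std::string& text, const std::vector<LZMatch>& matches) {
            MarkerStream out;
            out.mark_len.assign(text.size(), 0);
            std::vector<size_t> mark_pos;                // marks in order of creation
            for (const LZMatch& m : matches) {
                size_t src = m.pos - m.offset;           // where the copied string really starts
                if (out.mark_len[src] == 0) {            // first reference creates the mark
                    out.mark_len[src] = m.length;        // NB: repeated sources with different
                    mark_pos.push_back(src);             //     lengths need extra handling
                }
                // Code the reference as "how many marks ago" (small numbers dominate).
                for (size_t i = 0; i < mark_pos.size(); i++)
                    if (mark_pos[mark_pos.size() - 1 - i] == src) {
                        out.match_refs.push_back(i);
                        break;
                    }
            }
            return out;
        }

    The decoder would read the mark stream alongside the text, remember the marked positions, and expand a match token by copying mark_len[src] bytes from the referenced mark, which is why no offsets (and no offset limit) are needed.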

  13. #13
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    Am I correct that it is similar to this:
    LZCB #32 an optimal parser which codes without offsets. Uses "Match-Marker-References". Not as successful as it should be.

    (It's about Charles Bloom's LZCB coder)


  14. #14
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    Compiled marks.cpp:

  15. #15
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts


    Furthermore, you can see how your scheme generates a larger number of literals. Maybe even pure LZP is better in this case - there we can completely eliminate offset coding...

  16. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    There'd be fewer extra literals if we merged the streams as I said -
    additional literals would be needed only for marking the symbols inside matches.
    Though it might be faster if we don't merge the streams and instead use
    RLE for the unmarked areas.

    And anyway, of course it's a slower scheme than simple LZ77,
    but so what? The real question is whether it gives better compression
    or not.

    As for LZP, this scheme is still faster, since it doesn't require any
    hash accesses during decoding.
    Last edited by Shelwien; 8th May 2008 at 00:32.
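
    A minimal sketch of the "RLE for unmarked areas" idea, assuming the per-position mark_len array from the earlier sketch: instead of sending a (mostly zero) code for every position, send the gap to the next mark plus that mark's length:

        #include <cstddef>
        #include <utility>
        #include <vector>

        // Run-length encode the mark stream: one (gap, match_length) pair per mark
        // instead of a code for every position.
        std::vector<std::pair<size_t, size_t>> rle_marks(const std::vector<size_t>& mark_len) {
            std::vector<std::pair<size_t, size_t>> runs;
            size_t gap = 0;
            for (size_t len : mark_len) {
                if (len == 0) { gap++; continue; }   // unmarked position: extend the gap
                runs.push_back({gap, len});          // flush gap, then the mark's length
                gap = 0;
            }
            if (gap) runs.push_back({gap, 0});       // trailing unmarked run (no mark)
            return runs;
        }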
