
Thread: Looking at repeated Offsets

  1. #1
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts

    Looking at repeated Offsets

    Hello

    I've been puzzled for a long time as to why well-known compressors keep a special escape sequence for "repeated" offset values. I found the idea odd, having never seen a good illustration of it.

    However, the concept seems popular. The best example is probably 7zip, but you also have LZX/Cabarc, and many contributors on this forum also seem to use it.

    So, wanting to get a better understanding of this behavior, I made some statistics with a very simple LZ77-like compression program of my own.

    On average, I got the following results:

    Repeated previous offset : ~0.35%
    Repeated offset distance 2 : ~0.6%
    Repeated offset distance 3 : ~0.28%
    ----------------------------------- (7z and LZX only use the first 3)
    Repeated offset distance 4 : ~0.2%
    Repeated offset distance 5 : ~0.15%
    Repeated offset distance 6 : ~0.1%
    and this keeps getting lower.
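    To be concrete about what I'm counting, here is a minimal sketch (a hypothetical helper, much simplified from my actual program), where "distance d" means the match's offset equals the one used d matches earlier:

```python
def count_repeat_offsets(match_offsets, depth=6):
    """hits[d-1] counts matches whose offset equals the offset used d matches earlier."""
    hits = [0] * depth
    for i, off in enumerate(match_offsets):
        for d in range(1, depth + 1):
            if i >= d and match_offsets[i - d] == off:
                hits[d - 1] += 1
                break  # credit only the nearest repeat
    return hits
```

    Dividing each count by the total number of matches gives the percentages above.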

    So, here we have a few conclusions :

    1) A repeated offset at distance 2 happens more often than a repeated previous offset.
    This is partly because my compressor has no limit on match length, so very long matches are never split across multiple sequences.
    I guess that with a maximum match length (like ~250, which is quite common) this would boost the distance-1 occurrence a bit.

    However, I guess distance 2 will keep the lead.
    I believe this is consistent with a scenario where two long sequences are nearly identical, save for a small difference in the middle.

    2) Repeated offsets do happen, yes, but not that often.
    Look, it is not even 1% for distance 2.
    Distances 1-2-3 combined do reach about 1%, but that's about it.

    Common sense would therefore suggest that the escape sequence designating a repeated offset should itself be a fairly long code, around 7-8 bits.
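    As a quick sanity check on that figure (my own back-of-the-envelope arithmetic, nothing more):

```python
import math

# If a repeated offset fires with probability ~1%, an ideal entropy coder
# would assign its escape code about -log2(0.01) bits:
escape_cost = -math.log2(0.01)   # ~6.64 bits, consistent with a 7-8 bit code
```

    So the escape only pays off if it replaces an offset field noticeably longer than ~7 bits.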


    So my questions are:
    - What do you think of repeated offsets? Are they really effective?
    - Is there any flaw in my proposed calculation? Is it consistent with other statistics?


    Best Regards
    Last edited by Cyan; 24th September 2009 at 16:28.

  2. #2
    Member Fu Siyuan
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts


    Before the experts weigh in, I can say something, since I've been playing with LZ77 these past months.

    First, I want to say I'm quite sure repeated offsets are very necessary.

    I haven't thought hard about why the same offsets occur near each other in the data. But in my tests, it gave more than 150 KB of extra compression on a 50 MB file which compressed to about 6.2 MB (very close to what Bulat told me before I actually tried it: about a 1-2% improvement on binary). So the benefit is very obvious.

    Second, it often works well on some kinds of data, like binary data or executables. But on enwik8, I got fewer than 100K repeated matches.

    Third, parsing is the most important part. If you just use greedy parsing and code repeated offsets as 0/1/2/3 when you meet them, it's nearly useless. In my CSC, I use some tricks like LZMA does: if one of the 3 or 4 repeated offsets yields a match, we prefer to output the repeat code and match_len rather than just the longest match pair. That helped a lot in my experiments.
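    Roughly, the preference rule I mean looks like this (a simplified sketch; the names and the 2-byte margin are illustrative assumptions, not CSC's or LZMA's actual code):

```python
REP_MARGIN = 2  # assumed: a rep-match may be this many bytes shorter and still win

def choose_match(best_len, best_off, rep_matches):
    """rep_matches: list of (slot, length) for matches found at recent offsets.
    Prefer a slightly shorter match at a recent offset, since its code is cheap."""
    for slot, length in rep_matches:
        if length >= 2 and length + REP_MARGIN >= best_len:
            return ("rep", slot, length)
    return ("normal", best_off, best_len)
```

    With greedy parsing alone this choice rarely triggers, which is why the parsing strategy matters so much.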


    I hope this can help you.

  3. #3
    Programmer Bulat Ziganshin
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    Cyan, probably the problem is the data you are testing on.

    Repeated offsets come from sequences of regular binary data structures. Imagine any table: it has some repetitions, and the matches encoding those repetitions will have exactly the same offsets.
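    A toy illustration with synthetic data (my own example): an array of fixed-size records with one changing field makes nearly every match land at the same offset, the record size.

```python
record = b"\x01\x02\x03\x04ABCD\x00\x00\x00\x00"   # one 12-byte "record"
data = bytearray()
for k in range(100):
    rec = bytearray(record)
    rec[0] = k            # a small per-record difference
    data += rec

# Count positions that match the byte 12 back, i.e. in the previous record:
repeats = sum(1 for i in range(12, len(data)) if data[i] == data[i - 12])
# Everything matches except the one changed byte per record, so an LZ parser
# keeps emitting matches at the same offset, 12.
```

    Text has no such fixed record structure, which is why the statistic looks so different there.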

  4. #4
    Member
    Thanks for the suggestion.

    I changed my test suite from mainly text files to mainly binary files.

    And indeed, it changed the picture dramatically.
    Repeated offsets happen much more often, with a distribution in the vicinity of 5% / 3% / 2%. This is much more palatable, and makes the prospects for savings much better.

    I also noticed that offset values tend to be smaller on average, meaning there is more advantage in using variable-size offset encodings.

    On the other hand, match lengths and literal-run lengths both tend to be longer. Especially literal runs.

    Last but not least, the match hit rate plummeted, resulting in a compressed file with a lot of literals. I mean it: the share of offset fields within the compressed file went down from 80% (for text) to 20% (for binaries). Quite a dramatic change.

    However, the overall compression ratio remains equivalent, because there are fewer matches, but they are longer...

    Well, I was not expecting such a big difference across the whole picture.
    It looks as if compression improvements are completely different for the two sets of files...

    Note: thanks Fu for mentioning that parsing makes a difference. This is an interesting lead, since I have to admit that my LZ77 test program is really very simple, so results could be different with better parsing and matching...
    Last edited by Cyan; 24th September 2009 at 18:31.

  5. #5
    Programmer Bulat Ziganshin
    They are so different that I even use a very simple LZ match finder for file-type detection.
