Thread: LZ77 speed optimization, 2 mem accesses per "round"

  1. #1
    Programmer
    Join Date
    May 2008
    Location
    denmark
    Posts
    94
    Thanks
    0
    Thanked 2 Times in 2 Posts
    I just implemented an idea I've had for some time now:

    http://www.quicklz.com/quicklz_experimental.c
    http://www.quicklz.com/quicklz_experimental.exe

    It shouldn't be included on public benchmark websites yet because this is really just a dirty hack/experiment and not an official release at all.

    Description:

    Normal sliding-window LZ compression requires 3 memory reads per "round": 1) fetch some input data, 2) look up the offset in a hash table, 3) read past input data at that offset. The addresses of accesses 2 and 3 are often distant from each other and thus slow, even though both are in L1 cache.

    This compressor saves the first 4 bytes of the fetched input in the hash table together with the offset (a consecutive data access), so that it can quickly be determined whether we have a) no match, b) an exactly 3-byte LZ match, or c) a match longer than 3. In cases a and b, which together usually cover more than 50% of the rounds, we avoid the 3rd memory read.
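
    To make the idea concrete, here is a minimal sketch of such a hash table entry and the classification step. The entry layout, names, and thresholds are hypothetical (this is not the actual quicklz_experimental.c code), and the byte masking assumes a little-endian host:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical entry layout: the offset plus a cached copy of the
   first 4 input bytes at that offset, read in one consecutive access. */
typedef struct {
    uint32_t first4;  /* first 4 bytes of the input at 'offset' */
    uint32_t offset;  /* position in the input buffer */
} entry;

enum { NO_MATCH, MATCH_3, MATCH_4_OR_MORE };

/* Classify the current position using only the cached bytes, so the
   third (distant) memory read is needed only in the MATCH_4_OR_MORE case. */
static int classify(const entry *e, const unsigned char *src, uint32_t pos)
{
    uint32_t cur, diff;
    memcpy(&cur, src + pos, 4);            /* read 1: current input data */
    diff = cur ^ e->first4;                /* read 2 already happened: e is the hash entry */
    if (diff == 0)
        return MATCH_4_OR_MORE;            /* only now is src + e->offset read */
    if ((diff & 0x00ffffffu) == 0)         /* low 3 bytes equal (little-endian) */
        return MATCH_3;
    return NO_MATCH;
}
```

    In the NO_MATCH and MATCH_3 cases the length is fully decided from the entry itself, which is the point of the optimization.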

    Benchmarks on the same files and same Celeron as on www.quicklz.com:

    QuickLZ 1.30 beta 6:
    pic.bmp Compressed 18108198 bytes into 16333503 bytes (90.2%) at 78.5 Mbyte/s.

    WINWORD.EXE Compressed 10577312 bytes into 7260047 bytes (68.6%) at 94.2 Mbyte/s.

    proteins.txt Compressed 7254685 bytes into 2602253 bytes (35.9%) at 144.4 Mbyte/s.

    This experimental version:
    pic.bmp Compressed 18108198 bytes into 16333503 bytes (90.2%) at 83.6 Mbyte/s.

    WINWORD.EXE Compressed 10577312 bytes into 7260047 bytes (68.6%) at 101.2 Mbyte/s.

    proteins.txt Compressed 7254685 bytes into 2602253 bytes (35.9%) at 149.1 Mbyte/s.

    EXPERIMENTAL CODE, DO NOT USE IN PRODUCTION

  2. Thanks:

    Bulat Ziganshin (24th November 2013)

  3. #2
    Moderator

    Join Date
    May 2008
    Location
    Tristan da Cunha
    Posts
    2,034
    Thanks
    0
    Thanked 4 Times in 4 Posts
    Thanks for sharing this with us!

  4. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,507
    Thanks
    742
    Thanked 665 Times in 359 Posts
    i use exactly the same idea in my CachingMatchFinderN: see http://www.haskell.org/bz/tor.rar

    oh, no, i save the bytes at offsets 3..6 while hashing on 4 bytes (i.e. the bytes at indexes 0..3). the first cached byte allows us to check that it is not just a hash collision (with a high, but not 100%, probability), while the other bytes allow detecting matches up to 7 bytes long. as a result, in my current code i use the 3rd access only for matches 7+ bytes long, OR only for the best match if it is shorter than 7 bytes. my code uses the caching match finder only for hashing with 4+ entries, while the smaller and faster hashes are set to be small enough to fit into L2 cache memory

    but this matters only for a large hash table - otherwise all 3 accesses hit the cache with good probability, and each access costs only several ticks

    access to memory outside of the L2 cache costs around 50-100 ticks, so reducing the number of these accesses has become almost the sole goal of the whole optimization nowadays. but if your program's data doesn't go outside of the L2 cache (which is 256k-1024k typically these days), it's better to improve something else

    also note that each access outside of the cache reads a whole 16-byte line, so you can significantly improve speed by using all 16 bytes once they are read
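
    A sketch of this variant, under the assumption of one cached 32-bit word holding input bytes 3..6 of the stored position (entry layout and names are hypothetical, not CachingMatchFinderN itself, and the byte order assumes a little-endian host):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical entry: position plus a cached copy of input bytes 3..6
   at that position; bytes 0..3 are implied by the hash bucket itself. */
typedef struct {
    uint32_t bytes3to6;  /* input[pos+3 .. pos+6] */
    uint32_t pos;
} centry;

/* Estimate the match length from the cached bytes alone.
   Returns 0 for a probable hash collision (byte 3 differs), 4..6 for a
   short match whose length is known without touching the old data, or 7
   meaning "at least 7 bytes -- only now is the distant read at pos needed".
   Bytes 0..2 are assumed equal via the hash, so a nonzero result is
   probabilistic, exactly as described in the post. */
static int cached_match_len(const centry *e, const unsigned char *src, uint32_t cur)
{
    uint32_t w, diff;
    memcpy(&w, src + cur + 3, 4);
    diff = w ^ e->bytes3to6;
    if (diff == 0)
        return 7;                 /* bytes 3..6 also match: verify/extend via src + e->pos */
    if (diff & 0xff)
        return 0;                 /* byte 3 differs: likely just a collision */
    /* the first differing byte (little-endian) gives the exact length 4..6 */
    int len = 3;
    while ((diff & 0xff) == 0) { diff >>= 8; len++; }
    return len;
}
```

    The third memory access is thus paid only when the cached word promises a 7+ byte match (or, as in the post, once for the final best candidate).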

  5. #4
    Programmer
    Join Date
    May 2008
    Location
    denmark
    Posts
    94
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Nice, I tried almost the same thing. Of course, caching bytes 3...6 (and 1?) is mostly an advantage when you want to find the longest match out of N possible, because you don't save the match verification of bytes 0..2.

    For finding the best match out of N possible (N being 8 in this sample code), I once experimented with caching bytes 0...7 on x64 and looping through them like:

    unsigned long long best_diff = 0;
    int best_i = 0, i;
    for(i = 0; i < 8; i++)
    {
        /* XOR of the cached candidate bytes against the current 8 input bytes */
        unsigned long long diff = cache[hash][i] ^ *ptr_input;
        /* diff & -diff isolates the lowest set bit; a larger value means the
           first differing byte is further in, i.e. a longer match. A full
           8-byte match (diff == 0) needs separate handling. */
        if ((diff & -diff) > best_diff)
        {
            best_i = i;
            best_diff = diff & -diff;
        }
    }

    It utilizes the fact that x & -x returns a word where only the lowest bit in x is set (see http://www.jjj.de/bitwizardry/ for more code snippets), and it's a good alternative to using the bsf/bsr bit-scan instructions of x86/x64 (or their ARM counterparts), which unfortunately aren't standard in C.
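
    The trick can be checked in isolation (a minimal illustration; the helper name is mine, not from the sample code):

```c
#include <stdint.h>

/* Isolate the lowest set bit of x: for x = 0b101100 the result is 0b000100.
   When x is the XOR of two byte strings, a larger result means the lowest
   set bit lies higher up, i.e. (on little-endian) a longer common prefix. */
static uint64_t lowest_set_bit(uint64_t x)
{
    return x & (0 - x);  /* two's complement: -x flips every bit above the lowest set bit */
}
```

    For example, a diff of 0xff00000000000000 (only the last byte differs) yields a larger isolated bit than a diff of 0xff (the first byte differs), so comparing isolated bits ranks candidates by match length.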

    I only got a ~10% speedup compared to the more naive method of just comparing byte N of *ptr_input with byte N of cache[hash][i], where N is the length of the best match so far, excluding worse matches immediately. I thought the speedup would be greater, and it's probably worth looking into again.

  6. Thanks:

    Bulat Ziganshin (24th November 2013)

  7. #5
    Programmer
    Join Date
    May 2008
    Location
    denmark
    Posts
    94
    Thanks
    0
    Thanked 2 Times in 2 Posts
    "only lowest bit is set" should have been "only lowest set bit is set"
