
Thread: SSE, BMI do not accelerate LZ77 (un)compression

  1. #1
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    125
    Thanks
    36
    Thanked 13 Times in 9 Posts

    SSE, BMI do not accelerate LZ77 (un)compression

    I tried to use pcmpestri to find num_matched_bytes (in addition to the 4 already-matched bytes that confirm the hash hit). This instruction compares 16 bytes and returns in ecx the number of matching bytes (the index of the first mismatch):
    Code:
     mov rdx,16                              ; explicit length of the memory operand
     mov rax,rdx                             ; explicit length of the xmm operand (both 16)
    .compare:                                ; rbx = current offset, assumed zeroed beforehand
     movups xmm0,[rsi+rbx]
     pcmpestri xmm0,[rdi+rbx],00011000b      ; unsigned bytes, equal-each, negative polarity
     jc .difference_found                    ; CF=1: mismatch found, its index is in ecx
     add rbx,16
     jmp .compare
    After that, compression time on enwik8 increased by ~10-15%. The reason is clear: pcmpestri costs many CPU cycles, while the typical match_length is short.

    I also tried pdep and bextr for decoding pieces of data during decompression and got no speedup either.
    I tested on an i3-5005U (Broadwell). Maybe these instructions are faster on newer CPUs; I don't know, I don't have a Skylake.

  2. Thanks (2):

    Bulat Ziganshin (23rd May 2016),Cyan (23rd May 2016)

  3. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    http://encode.su/threads/1970-Fast-L...ll=1#post38465

    You can find instruction timings in Agner Fog's manuals; the string instructions are known to be slow since they compare everything against everything. Maybe they are a good match for MTF?

  4. #3
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    "SSE, BMI do not accelerate LZ77 (un)compression "

    Of course they do. See LZSSE for example.

    Also if you just compile scalar code with something like clang/llvm these days, you are actually getting massive amounts of SSE.

    Just compile LZ4 and look at the disasm - the auto-vectorizer is actually turning it into SSE instructions all over, whether you explicitly wrote them or not.

    e.g. when you write a loop like this:

    do { LZ4_copy8(d,s); d+=8; s+=8; } while (d<e);

    the generated code is nothing like that. What you actually get is an unrolled SSE copy (with any modern clang/llvm/gcc on x64 which enable SSE2 and the auto-vectorizer by default). YMMV, it doesn't always actually vectorize or help.

    If you're just looking at match lengths, most people already use the method of doing a U64 xor and cntlz, which is already a kind of SIMD in the scalar register, so just going from 8-wide to 16-wide is pretty pointless (and has other costs). SIMD-in-scalar can be quite fast.
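    For reference, a minimal C sketch of that xor + count-trailing-zeros match finder (assuming little-endian, unaligned loads via memcpy, and GCC/Clang's __builtin_ctzll; the function name is made up for illustration):
    Code:
    #include <stdint.h>
    #include <string.h>

    /* Count how many leading bytes of a[0..limit) equal b[0..limit). */
    static size_t match_length(const uint8_t *a, const uint8_t *b, size_t limit)
    {
        size_t len = 0;
        while (len + 8 <= limit) {
            uint64_t x, y;
            memcpy(&x, a + len, 8);                  /* unaligned-safe 64-bit loads */
            memcpy(&y, b + len, 8);
            uint64_t diff = x ^ y;
            if (diff)                                /* lowest set bit marks the first differing byte */
                return len + (__builtin_ctzll(diff) >> 3);
            len += 8;
        }
        while (len < limit && a[len] == b[len])      /* finish the tail byte by byte */
            len++;
        return len;
    }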

    Obviously this area is our bread-and-butter and we could say a lot about it, but we won't.
    Last edited by cbloom; 3rd August 2016 at 19:08.

  5. Thanks (3):

    DirtyPunk (30th May 2016),encode (17th November 2016),willvarfar (24th May 2016)

  6. #4
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    125
    Thanks
    36
    Thanked 13 Times in 9 Posts
    Quote Originally Posted by cbloom View Post
    If you're just looking at match lengths, most people already use the method of doing a U64 xor and cntlz , which is already a kind of SIMD in the scalar register, so just going from 8-wide to 16-wide is pretty pointless (and has other costs). SIMD-in-scalar can be quite fast.
    I saw this fast memcmp using 64-bit registers (xor, tzcnt) in a PDF, but right now I can't find it... I doubt that this trick will speed up the memory comparison for the small match_length values seen while compressing enwik8, because tzcnt/bsf is slow. Currently I'm using a binary-search-like method to compute match_length (maybe the code below is wrong?).

    Suppose rsi points to the current position and rdi points to the previous match found via the hash table, and suppose we should compare no more than N bytes. First I read the byte at address rdi+N, add 1 to it and write it to address rsi+N (i.e. immediately after the end of the source data). This sentinel stops the comparison after the end. (But we may read up to 8 bytes past the end.) rcx accumulates match_length. (A C sketch of this sentinel setup follows the assembly below.)

    Code:
    .compare:
     mov rax,[rsi+rcx]
     sub rax,[rdi+rcx]    ; rax = 0 iff the 8 bytes match
     jnz .1
     add rcx,8
     jmp .compare
    
    .1:                   ; the count of low zero bytes in rax = number of matching bytes
     test eax,eax
     jnz .2
     add rcx,4            ; low 4 bytes matched
     bswap rax            ; these two bswaps move bytes 4..7 into eax in original order,
     bswap eax            ; equivalent to shr rax,32 (with the same execution time...)
     
    .2:
     test ax,ax
     jnz .3
     add rcx,2            ; 2 more bytes matched
     bswap eax            ; bring the next byte to test into ah
     test ah,ah
     jnz .next
     inc rcx              ; 1 more byte matched
     jmp .next
    
    .3:
     test al,al
     jnz .next
     inc rcx              ; 1 more byte matched
    
    .next:
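    For reference, a minimal C sketch of the sentinel setup described above (hypothetical names; it assumes one writable byte exists right past offset N in the current buffer):
    Code:
    #include <stdint.h>
    #include <stddef.h>

    /* Make cur[n] differ from prev[n], so a compare loop that walks both
       buffers in parallel is guaranteed to stop at offset n at the latest. */
    static void plant_sentinel(uint8_t *cur, const uint8_t *prev, size_t n)
    {
        cur[n] = (uint8_t)(prev[n] + 1);
    }
    Here cur plays the role of rsi and prev the role of rdi; as noted above, the 8-byte loads may still read up to 8 bytes past the end.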

  7. #5
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 114 Times in 36 Posts
    Quote Originally Posted by lz77 View Post
    because tzcnt/bsf works slow
    What makes you think that?

    TZCNT (where present)/BSF are single-cycle, single-uop on most x86s released in the last 10 years. Some AMD chips have microcoded BSF but direct TZCNT, so in that case you want to prefer TZCNT.

  8. #6
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 114 Times in 36 Posts
    Correction: just checked, you want to use LZCNT/TZCNT on all AMD chips that support them, and sometimes these scans are 2-uop/2-cycle. But either way, even the "slow" microcoded bit scans on AMD HW are significantly cheaper than a single mispredicted branch.

  9. #7
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    125
    Thanks
    36
    Thanked 13 Times in 9 Posts
    Quote Originally Posted by fgiesen View Post
    What makes you think that?
    I changed my code above to the following:
    Code:
    .compare:
     mov rax,[rsi+rcx]
     sub rax,[rdi+rcx]    ; rax = 0 iff the 8 bytes match
     jnz .1
     add rcx,8
     jmp .compare
    
    .1:
     tzcnt rdx,rax        ; index of the lowest differing bit
     shr rdx,3            ; -> number of matching bytes in this qword
     add rcx,rdx
    
    .next:
    After that, compression time increased by ~4%. I tested on my i3-5005U on enwik8. I will test it on an AMD Athlon X4 later...

  10. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    780
    Thanked 687 Times in 372 Posts
    try this:
    Code:
    mov rax,[rsi]
    xor rax,[rdi]
    jnz l0
    mov rax,[rsi+8]
    xor rax,[rdi+8]
    jnz l1
    ...
    add rsi,64
    add rdi,64
    mov rax,[rsi]
    xor rax,[rdi]
    jz loop
    
    l0:...

  11. #9
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    125
    Thanks
    36
    Thanked 13 Times in 9 Posts
    I doubt that this will accelerate the code. I found a fast algorithm that emulates a 32-bit tzcnt: https://en.wikipedia.org/wiki/Find_first_set#CTZ
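    For reference, that page's de Bruijn multiplication variant for a 32-bit ctz looks roughly like this (a sketch using the widely published constant and table, not code from this thread):
    Code:
    #include <stdint.h>

    /* De Bruijn trick: isolate the lowest set bit, multiply by a de Bruijn
       constant, and use the top 5 bits of the product as a table index.
       x must be nonzero (the function returns 0 for x == 0). */
    static const int debruijn_ctz32[32] = {
         0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
    };

    static int ctz32(uint32_t x)
    {
        return debruijn_ctz32[((x & -x) * 0x077CB531u) >> 27];
    }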

  12. #10
    Member
    Join Date
    Nov 2015
    Location
    Śląsk, PL
    Posts
    81
    Thanks
    9
    Thanked 13 Times in 11 Posts
    Quote Originally Posted by lz77 View Post
    I doubt that this will accelerate the code. I found a fast algorithm, that emulates 32 bit tzcnt: https://en.wikipedia.org/wiki/Find_first_set#CTZ
    Yann used it in LZ4, though he quickly replaced it with __builtin_ctz on platforms that have it. I don't know how the code evolved since then.

  13. #11
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    889
    Thanks
    483
    Thanked 279 Times in 119 Posts
    Yann used it in LZ4, though he quickly replaced it with __builtin_ctz on platforms that have it. I don't know how the code evolved since then.
    Both the intrinsic and software versions are present, selected at compilation time according to target capabilities.

    https://github.com/Cyan4973/lz4/blob...lib/lz4.c#L248

    There are also versions for 64-bit and big-endian targets.
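    The selection itself usually boils down to a preprocessor check; an illustrative sketch (not the actual lz4.c code, and the function name is invented) of picking the intrinsic when available and falling back to a portable loop otherwise:
    Code:
    #include <stdint.h>

    /* Number of matching low-order bytes, given a nonzero xor/diff of two
       64-bit little-endian words. */
    static unsigned count_common_bytes(uint64_t diff)
    {
    #if defined(__GNUC__) || defined(__clang__)
        return (unsigned)__builtin_ctzll(diff) >> 3;   /* intrinsic path */
    #else
        unsigned n = 0;                                /* software fallback */
        while ((diff & 0xFF) == 0) { diff >>= 8; n++; }
        return n;
    #endif
    }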

  14. #12
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    125
    Thanks
    36
    Thanked 13 Times in 9 Posts
    I've tested my example on an Athlon II X4 840: tzcnt decreases the compression speed on enwik8 by 8%... My multi-branch code in #4 works faster than the code in #7, hehe... AMD is worse than Intel here.

  15. #13
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    125
    Thanks
    36
    Thanked 13 Times in 9 Posts
    I tried a 64-bit ctz(x) using Reiser's algorithm:
    Code:
    #include <stdint.h>
    
    /* reiser_table[(UINT64_C(1) << k) % 67] = k for 0 <= k <= 64 */
    static const int8_t reiser_table[67] = {
            64, 0, 1, 39, 2, 15, 40, 23, 3, 12, 16, 59, 41, 19, 24, 54, 4, -1, 13,
            10, 17, 62, 60, 28, 42, 30, 20, 51, 25, 44, 55, 47, 5, 32, -1, 38, 14,
            22, 11, 58, 18, 53, 63, 9, 61, 27, 29, 50, 43, 46, 31, 37, 21, 57, 52,
            8, 26, 49, 45, 36, 56, 7, 48, 35, 6, 34, 33};
    
    int ntz64(uint64_t x) {
            x = x & -x;
            return reiser_table[x % 67];
    }
    I found it here: http://www.hackersdelight.org/corres.txt
    This algorithm increases compression time on enwik8 by 15% compared with my binary-search code above.

