I tried to use pcmpestri to find num_matched_bytes (in addition to the 4 matched bytes that mean a cache hit). This instruction compares 16 bytes and returns the number of matched bytes in ecx/rcx:

Code:
```
    mov rdx,16
    mov rax,rdx
.compare:
    movups xmm0,[rsi+rbx]
    pcmpestri xmm0,[rdi+rbx],00011000b ; compare unsigned bytes (equal each) with negative polarity
    jc .difference_found
    add rbx,16
    jmp .compare
```

After that, compression time on enwik8 increased by ~10-15%. The reason is clear: pcmpestri requires a large number of CPU cycles, but match_length is short.
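For anyone who wants to repeat the experiment from C, here is a rough sketch of the same 16-byte compare loop using the SSE4.2 intrinsic `_mm_cmpestri`. The function names `match_len_sse42` and `match_len_scalar`, and the `max_len` bound, are my own assumptions and not part of the code above, which has no length limit. The scalar version is only there as a reference to check results against. The `target` attribute assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <nmmintrin.h>  /* SSE4.2 intrinsics: _mm_cmpestri */

/* Hypothetical reference: count leading equal bytes one at a time. */
static size_t match_len_scalar(const uint8_t *a, const uint8_t *b, size_t max_len)
{
    size_t i = 0;
    while (i < max_len && a[i] == b[i])
        i++;
    return i;
}

/* Hypothetical SSE4.2 version: compare 16 bytes per step with PCMPESTRI.
   The mode below equals the immediate 00011000b from the assembly:
   unsigned bytes, "equal each", negative polarity, least significant index. */
__attribute__((target("sse4.2")))
static size_t match_len_sse42(const uint8_t *a, const uint8_t *b, size_t max_len)
{
    size_t i = 0;
    while (i + 16 <= max_len) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* Returns the index of the first differing byte, or 16 if all match. */
        int idx = _mm_cmpestri(va, 16, vb, 16,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                               _SIDD_NEGATIVE_POLARITY);
        if (idx != 16)
            return i + (size_t)idx;
        i += 16;
    }
    /* Tail shorter than 16 bytes: finish byte by byte. */
    while (i < max_len && a[i] == b[i])
        i++;
    return i;
}
```

Timing both functions on short matches should show whether the slowdown reported above reproduces on a given CPU; I make no claim about the outcome.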
I also tried pdep and bextr for decoding pieces of data during decompression, but did not get any speedup.
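For readers who have not used these two BMI instructions, here is a portable C sketch of what they compute. The names `bextr_ref` and `pdep_ref` are hypothetical, and this only models the instruction semantics. It does not show how they were used in the decoder, which the post does not describe.

```c
#include <stdint.h>

/* Hypothetical model of BEXTR: extract `len` bits of `src` starting at bit `start`. */
static uint64_t bextr_ref(uint64_t src, unsigned start, unsigned len)
{
    if (start >= 64)
        return 0;               /* nothing left to extract */
    src >>= start;
    if (len >= 64)
        return src;             /* field covers all remaining bits */
    return src & ((1ULL << len) - 1);
}

/* Hypothetical model of PDEP: deposit the low bits of `src` into the
   positions of the set bits of `mask`, from lowest to highest. */
static uint64_t pdep_ref(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  /* isolate lowest set bit of mask */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                      /* clear that bit */
    }
    return result;
}
```

For example, `bextr_ref(0xABCD, 4, 8)` gives `0xBC`, and `pdep_ref(0x5, 0x1A)` gives `0x12`.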
I tested on an i3-5005U (Broadwell). Maybe these instructions work faster on newer CPUs; I don't know, since I have no Skylake.