Now, jumpless match length computation:
Code:
MOV r0, [data_ptr+4]
XOR r0, [match_ptr+4]   ; non-zero bits mark differing bytes
BSF r1, r0              ; index of the first differing bit (undefined if r0=0)
SHR r1, 3               ; bits -> bytes
MOV r0, [data_ptr+8]
XOR r0, [match_ptr+8]
BSF r2, r0
SHR r2, 3
... now r1 and r2 hold the number of matching bytes at match_ptr+4 and match_ptr+8, respectively (valid only when the corresponding dword actually contains a mismatch, since BSF leaves its destination undefined for a zero source).
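For reference, here is a minimal C sketch of the dword-at-a-time idea (untested; assumes GCC/Clang's __builtin_ctz, which maps to BSF/TZCNT, and a little-endian target so the first differing byte corresponds to the lowest set bit):
Code:
#include <stdint.h>
#include <string.h>

/* Number of equal bytes inside one 4-byte word; only valid when the
   words actually differ (ctz of 0, like BSF of 0, is undefined). */
static unsigned equal_bytes_in_dword(const uint8_t *data, const uint8_t *match)
{
    uint32_t a, b;
    memcpy(&a, data, 4);            /* unaligned-safe loads */
    memcpy(&b, match, 4);
    uint32_t x = a ^ b;             /* non-zero bits mark differences */
    return __builtin_ctz(x) >> 3;   /* first differing bit -> byte index */
}
We can even use SSE/AVX to do the XOR and then extract individual dwords from the SSE register: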
Code:
MOVDQU xmm0, [data_ptr]
MOVDQU xmm1, [match_ptr]    ; unaligned load; PXOR with m128 would require alignment
PXOR xmm0, xmm1             ; non-zero bytes mark differences
MOVD r0, xmm0               ; dword 0 = bytes 0..3
TEST r0, r0
JNZ match_not_found         ; mismatch within the first 4 bytes
PEXTRD r0, xmm0, 1          ; SSE4.1: dword 1 = bytes 4..7
BSF r1, r0                  ; undefined if r0=0; the elided code handles that case
SHR r1, 3
...
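The same variant in intrinsics form might look like this (untested sketch; _mm_extract_epi32 requires SSE4.1, matching PEXTRD above, and __builtin_ctz is GCC/Clang):
Code:
#include <stdint.h>
#include <smmintrin.h>   /* SSE4.1 (pulls in SSE2 as well) */

/* Bytes matched among the first 8; the remaining dword lanes follow
   the same pattern. Little-endian x86 assumed. */
static unsigned sse_match_bytes(const uint8_t *data, const uint8_t *match)
{
    __m128i d = _mm_loadu_si128((const __m128i *)data);
    __m128i m = _mm_loadu_si128((const __m128i *)match);
    __m128i x = _mm_xor_si128(d, m);                 /* non-zero bytes differ */

    uint32_t w0 = (uint32_t)_mm_cvtsi128_si32(x);    /* dword 0 = bytes 0..3 */
    if (w0)
        return __builtin_ctz(w0) >> 3;               /* mismatch in first 4 */
    uint32_t w1 = (uint32_t)_mm_extract_epi32(x, 1); /* bytes 4..7 */
    if (w1)
        return 4 + (__builtin_ctz(w1) >> 3);
    /* ... dwords 2 and 3 are handled the same way ... */
    return 8;
}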
The SSE variant needs more arithmetic µops, but fewer registers and L1 cache reads. Using CMOV, we can combine the individual BSF results:
Code:
; Note: still not quite correct: if two consecutive dwords match fully, the
; second CMOVZ copies an undefined BSF result into r3; I will fix it later
MOV r0, [data_ptr+20]
XOR r0, [match_ptr+20]
BSF r1, r0              ; ZF=1 if the whole dword matches
JZ probably_24_or_more_bytes_are_equal
ADD r1, 20*8            ; BSF counts bits while offsets are bytes, hence *8
MOV r3, 4*8             ; r3/8 = match length; default: just the 4 verified bytes
MOV r0, [data_ptr+16]
XOR r0, [match_ptr+16]
BSF r2, r0
CMOVZ r3, r1            ; dword at +16 fully matches -> the mismatch was at +20
ADD r2, 16*8
MOV r0, [data_ptr+12]
XOR r0, [match_ptr+12]
BSF r1, r0
CMOVZ r3, r2            ; dword at +12 fully matches -> take the +16 result
ADD r1, 12*8
...
SHR r3, 3               ; now r3 = match length in bytes
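The intended computation, rendered as an untested C sketch: scan dwords from the far end down and let the lowest mismatch win. The selection inside the loop is branch-free; on x86, compilers usually lower the ternary to CMOV. Like the assembly, it assumes the first 4 bytes were already verified via the hash:
Code:
#include <stdint.h>
#include <string.h>

/* Match length over bytes 4..23, branch-free selection per dword. */
static unsigned match_len_24(const uint8_t *data, const uint8_t *match)
{
    unsigned bits = 24 * 8;             /* default: all 24 bytes match */
    for (int off = 20; off >= 4; off -= 4) {
        uint32_t a, b, x;
        memcpy(&a, data + off, 4);
        memcpy(&b, match + off, 4);
        x = a ^ b;
        /* lower offsets overwrite higher ones, so the lowest mismatch wins;
           __builtin_ctz is only evaluated for a non-zero argument */
        bits = x ? (unsigned)(off * 8) + __builtin_ctz(x) : bits;
    }
    return bits >> 3;                   /* match length in bytes */
}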
The SSE4.2 PCMPESTRI instruction can also be used to find the number of equal bytes:
Code:
MOV eax, 16             ; PCMPESTRI takes explicit operand lengths in EAX/EDX
MOV edx, 16
MOVDQU xmm0, [data_ptr]
; we look for the first *different* byte, hence the negative polarity
PCMPESTRI xmm0, [match_ptr], EQUAL_EACH + NEGATIVE_POLARITY
JA all_16_bytes_are_equal
; here ECX holds the number of equal bytes
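With intrinsics, the same thing might be written as follows (untested sketch; the _SIDD_* flags from <nmmintrin.h> correspond to EQUAL_EACH and NEGATIVE_POLARITY, and both explicit lengths are 16):
Code:
#include <stdint.h>
#include <nmmintrin.h>   /* SSE4.2 */

/* Index of the first differing byte within 16, or 16 if all are equal. */
static unsigned first_diff_16(const uint8_t *data, const uint8_t *match)
{
    __m128i d = _mm_loadu_si128((const __m128i *)data);
    __m128i m = _mm_loadu_si128((const __m128i *)match);
    return (unsigned)_mm_cmpestri(d, 16, m, 16,
                                  _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                                  _SIDD_NEGATIVE_POLARITY);
}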
According to Agner Fog's instruction tables, PCMPESTRI is rather slow. We can replace it with the faster PCMPISTRI and lose a few matches containing zero bytes (in implicit-length mode a 0 terminates the string). Or we can employ the SSE2 PCMPEQB instruction:
Code:
MOVDQU xmm0, [data_ptr]
MOVDQU xmm1, [match_ptr]
PCMPEQB xmm0, xmm1
; now equal bytes in xmm0 hold FF while non-equal ones hold 0
The question is how to combine all those bits now in order to get something useful. For example, it would be great to combine the results of two 16-byte PCMPEQB comparisons into a single 32-bit GPR containing just one bit per source byte, and then use BSF to find the first mismatch:
Code:
; PMADDUBSW/PHADDW require SSSE3, i.e. Core 2 / Bulldozer
ALIGN 16
Mask DB 1,2,4,8,16,32,64,128,1,2,4,8,16,32,64,128
MOVDQU xmm0, [data_ptr]
MOVDQU xmm1, [data_ptr+16]
MOVDQU xmm2, [match_ptr]
MOVDQU xmm3, [match_ptr+16]
PCMPEQB xmm0, xmm2          ; equal bytes -> FF, non-equal -> 0
PCMPEQB xmm1, xmm3
PABSB xmm0, xmm0            ; FF = -1 -> 1, so equal bytes now hold 1, others 0
PABSB xmm1, xmm1
MOVDQA xmm2, [Mask]         ; PMADDUBSW treats its first operand as unsigned
MOVDQA xmm3, [Mask]         ; and the second as signed, so the 128 weight must
PMADDUBSW xmm2, xmm0        ; sit on the Mask side; each word becomes a 2-bit
PMADDUBSW xmm3, xmm1        ; sub-mask of its two source bytes
PHADDW xmm2, xmm3           ; 8 words = 4-bit sub-masks of 4 source bytes each
PHADDW xmm2, xmm2           ; 4 words = full 8-bit masks, one per 8 source bytes
PACKUSWB xmm2, xmm2         ; condense words into bytes
MOVD r1, xmm2               ; 32-bit mask: bit i set <=> byte i equal
NOT r1                      ; set bits now mark mismatches
BSF r1, r1
JZ all_32_bytes_are_equal
; now r1 contains the number of equal bytes (index of the first non-equal one)
According to my calculations, this code executes in 11 cycles due to the long dependency chains, but uses only 12 arithmetic µops. So it can be overlapped with something useful, like Huffman encoding of the previous match. It also lowers register pressure, so you can easily check 64 bytes, start checking the next match position for a lazy/multi-entry match finder, or just prepare for the case where the current match fails.
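To round this off, here is the whole PCMPEQB+PMADDUBSW scheme as an untested C-intrinsics sketch (SSSE3; function and variable names are mine):
Code:
#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3 (pulls in SSE2 as well) */

/* Number of equal leading bytes among 32, or 32 if everything matches. */
static unsigned equal_prefix_32(const uint8_t *data, const uint8_t *match)
{
    /* -128 has the bit pattern 0x80; PMADDUBSW reads its first operand
       as unsigned, so this lane really weighs 128 */
    const __m128i mask = _mm_setr_epi8(1, 2, 4, 8, 16, 32, 64, -128,
                                       1, 2, 4, 8, 16, 32, 64, -128);
    __m128i e0 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)data),
                                _mm_loadu_si128((const __m128i *)match));
    __m128i e1 = _mm_cmpeq_epi8(_mm_loadu_si128((const __m128i *)(data + 16)),
                                _mm_loadu_si128((const __m128i *)(match + 16)));
    e0 = _mm_abs_epi8(e0);                    /* FF -> 1: equal bytes hold 1 */
    e1 = _mm_abs_epi8(e1);
    __m128i w0 = _mm_maddubs_epi16(mask, e0); /* word = 2-bit sub-mask */
    __m128i w1 = _mm_maddubs_epi16(mask, e1);
    __m128i h  = _mm_hadd_epi16(w0, w1);      /* word = 4-bit sub-mask */
    h = _mm_hadd_epi16(h, h);                 /* word = 8-bit mask per 8 bytes */
    h = _mm_packus_epi16(h, h);               /* condense words into bytes */
    uint32_t bits = ~(uint32_t)_mm_cvtsi128_si32(h); /* set bits = mismatches */
    return bits ? (unsigned)__builtin_ctz(bits) : 32;
}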