I'm trying to make optimized builds of some archivers, and I ran into a small problem with paq8px. While trying to save some cycles, I found an SSE2 version of dot_product() and train() in paq8p, optimized by Dark Shikari.
I built paq8p with it and it is indeed (slightly) faster than with the regular paq7asmsse.asm.
But paq8px (tested v30 & v42) crashes. I didn't test all variants, but paq8l, Matt's last paq8, does not crash, so this optimized asm code is not specifically tied to paq8p. paq8l outputs identical files whether compiled with the regular or the optimized code, and the same goes for paq8p. So it looks like Dark Shikari's code can perfectly replace the "not optimized" code.
Then why doesn't it work with paq8px? Does paq8px contain some change that makes it incompatible with this asm code? I don't think I made a mistake during the assembly/compilation stages.
If I keep the optimized dot_product() and replace the optimized train() with the regular train(), it does not crash. So the problem seems to be in train().
The regular SSE2 code:
Code:
; Train n neural network weights w[n] on inputs t[n] and err.
; w[i] += t[i]*err*2+1 >> 17 bounded to +- 32K.
; n is rounded up to a multiple of 8.
; Train for SSE2
; Use this code to get some performance...
global train ; (short* t, short* w, int n, int err)
align 16
train:
mov eax, [esp+4] ; t
mov edx, [esp+8] ; w
mov ecx, [esp+12] ; n
add ecx, 7 ; n/8 rounding up
and ecx, -8
jz .done
sub eax, 16
sub edx, 16
movd xmm0, [esp+16]
pshuflw xmm0,xmm0,0
punpcklqdq xmm0,xmm0
.loop: ; each iteration adjusts 8 weights
movdqa xmm3, [eax+ecx*2] ; t[i]
movdqa xmm2, [edx+ecx*2] ; w[i]
paddsw xmm3, xmm3 ; t[i]*2
pmulhw xmm3, xmm0 ; t[i]*err*2 >> 16
paddsw xmm3, [_mask] ; (t[i]*err*2 >> 16)+1
psraw xmm3, 1 ; (t[i]*err*2 >> 16)+1 >> 1
paddsw xmm2, xmm3 ; w[i] + xmm3
movdqa [edx+ecx*2], xmm2
sub ecx, 8
ja .loop
.done:
ret
align 16
_mask dd 10001h,10001h,10001h,10001h ; 8 copies of 1 (16-bit words)
The optimized SSE2 code:
Code:
; Train n neural network weights w[n] on inputs t[n] and err.
; w[i] += t[i]*err*2+1 >> 17 bounded to +- 32K.
; n is rounded up to a multiple of 16.
; Train for SSE2
; Use this code to get some performance...
global train ; (short* t, short* w, int n, int err)
align 16
train:
mov eax, [esp+4] ; t
mov edx, [esp+8] ; w
mov ecx, [esp+12] ; n
add ecx, 15 ; n/16 rounding up
and ecx, -16
jz .done
sub eax, 32
sub edx, 32
movd xmm0, [esp+16]
pcmpeqb xmm6, xmm6
pshuflw xmm0, xmm0, 0
psrlw xmm6, 15 ; pw_1
punpcklqdq xmm0, xmm0
.loop: ; each iteration adjusts 16 weights
movdqa xmm3, [eax+ecx*2 +0] ; t[i]
movdqa xmm5, [eax+ecx*2+16]
paddsw xmm3, xmm3 ; t[i]*2
paddsw xmm5, xmm5
pmulhw xmm3, xmm0 ; t[i]*err*2 >> 16
pmulhw xmm5, xmm0
paddsw xmm3, xmm6 ; (t[i]*err*2 >> 16)+1
paddsw xmm5, xmm6
psraw xmm3, 1 ; (t[i]*err*2 >> 16)+1 >> 1
psraw xmm5, 1
paddsw xmm3, [edx+ecx*2+ 0] ; w[i] + xmm3
paddsw xmm5, [edx+ecx*2+16]
movdqa [edx+ecx*2+ 0], xmm3
movdqa [edx+ecx*2+16], xmm5
sub ecx, 16
ja .loop
.done:
ret
I don't understand x86 assembler, but according to the comments, Dark Shikari changed the rounding: n is now rounded up to a multiple of 16 instead of 8, and each iteration adjusts 16 weights instead of 8. paq8l and paq8p don't mind, but maybe paq8px does not like this: if paq8px sizes or aligns its t[] and w[] arrays assuming the old multiple-of-8 rule, the 16-wide loop would read and write up to 16 bytes past the end of the arrays with aligned movdqa, which could cause exactly this kind of crash.
EDIT: not all input files produce the crash, and not every level does either. At the very least, it crashes on a file containing the first megabyte of ENWIK8 at levels 4, 5, 6, 7 and 8. Levels 0, 1, 2 and 3 are OK.