Code:
.00401330: 89C8 mov eax,ecx
.00401332: 25FFFFFF3F and eax,03FFFFFFF
.00401337: 021C06 add bl,[esi][eax]
.0040133A: 69C117CD5B07 imul eax,ecx,0075BCD17
.00401340: 4A dec edx
.00401341: 8D88765EDF01 lea ecx,[eax][01DF5E76]
.00401347: 79E7 jns .000401330 --- (6)

Code:
.00401340: 89C8 mov eax,ecx
.00401342: 43 inc ebx
.00401343: 25FFFFFF3F and eax,03FFFFFFF
.00401348: 8D3407 lea esi,[edi][eax]
.0040134B: 0FB60416 movzx eax,b,[esi][edx]
.0040134F: 0045EF add [ebp][-11],al
.00401352: 69C117CD5B07 imul eax,ecx,0075BCD17
.00401358: 0FBE75EF movsx esi,b,[ebp][-11]
.0040135C: 8D88765EDF01 lea ecx,[eax][01DF5E76]
.00401362: 21F2 and edx,esi
.00401364: 81FBFFC99A3B cmp ebx,03B9AC9FF
.0040136A: 7ED4 jle .000401340 --- (6)
The 2nd version is much more complex overall: there's this eax/al/eax dependency chain (the multiplication has
to wait for the read, etc.), and there are other memory accesses too.
But I still think that the 2nd version is a better test, because that simple "add bl,[addr]" sequence may
be optimized by the CPU (nothing else in the loop depends on BL, so the loop can continue after issuing a read operation),
while in the 2nd version each loop iteration certainly waits until the read finishes.
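
For reference, here is a rough C sketch of what the two loops appear to do, reconstructed by hand from the listings above. It is not the original source. The multiplier 0x075BCD17 and increment 0x01DF5E76 are read off the imul/lea instructions; the function names, the `seed` and `d` parameters, and the use of `volatile` for the byte kept at [ebp-11] are my assumptions. In the 2nd listing the loaded byte ends up in edx (via [ebp-11] and esi), and edx is part of the next read's address, which is what the `d` variable models here.

```c
#include <stdint.h>

/* LCG constants taken from the imul/lea instructions in the listings. */
#define LCG_MUL 0x075BCD17u
#define LCG_ADD 0x01DF5E76u

/* Sketch of the 1st version: the address depends only on the LCG
   state x, never on a previously loaded byte. */
uint8_t sum_independent(const uint8_t *buf, uint32_t mask,
                        uint32_t n, uint32_t seed)
{
    uint8_t sum = 0;
    uint32_t x = seed;              /* hypothetical start value */
    for (uint32_t i = 0; i < n; i++) {
        sum = (uint8_t)(sum + buf[x & mask]);
        x = x * LCG_MUL + LCG_ADD;
    }
    return sum;
}

/* Sketch of the 2nd version: d is added to the address and then
   updated from the running sum, so the next address cannot be formed
   until the current read has completed. The caller is assumed to pass
   d = 0 at run time; d then stays 0, but the CPU still has to wait
   for it. The buffer must hold at least mask + 1 + d bytes. */
uint8_t sum_dependent(const uint8_t *buf, uint32_t mask,
                      uint32_t n, uint32_t seed, uint32_t d)
{
    volatile uint8_t sum = 0;       /* assumed: the byte at [ebp-11] */
    uint32_t x = seed;              /* hypothetical start value */
    for (uint32_t i = 0; i < n; i++) {
        sum = (uint8_t)(sum + buf[(x & mask) + d]);
        d &= sum;                   /* ties the next address to this read */
        x = x * LCG_MUL + LCG_ADD;
    }
    return sum;
}
```

With `d` passed as 0 both functions walk the same addresses and return the same sum, so any timing difference between them should come from the extra dependency, not from different memory traffic. The sketch zero-extends the byte where the listing uses movsx, which makes no difference while `d` is 0.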