After some experimentation with crc32 as well (software implementations, following pointers from Bulat Ziganshin to Stephan Brumme's code), I have come to these conclusions:
1) actual (real-world) results are extremely different from the theoretical maximums
On my PC, running over a 1GB in-memory array (useless in the real world):
Code:
bitwise : CRC=221F390F, 8.082s, 126.706 MB/s
half-byte : CRC=221F390F, 3.850s, 265.980 MB/s
tableless (byte) : CRC=221F390F, 3.628s, 282.283 MB/s
tableless (byte2): CRC=221F390F, 3.555s, 288.010 MB/s
1 byte at once: CRC=221F390F, 1.909s, 536.368 MB/s
4 bytes at once: CRC=221F390F, 0.640s, 1600.459 MB/s
8 bytes at once: CRC=221F390F, 0.440s, 2325.023 MB/s
4x8 bytes at once: CRC=221F390F, 0.317s, 3231.518 MB/s
16 bytes at once: CRC=221F390F, 0.197s, 5193.688 MB/s
16 bytes at once: CRC=221F390F, 0.196s, 5213.763 MB/s (including prefetching)
chunked : CRC=221F390F, 0.199s, 5138.350 MB/s
The same "16 bytes at once" code on a "real" case: about 2900 MB/s (13GB fully cached by the filesystem)
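For reference, the MB/s figures in the table are consistent with 1 MB = 2^20 bytes (1024 MB in 8.082 s gives about 126.7 MB/s). A minimal sketch of that kind of measurement follows. The names `checksum_fn`, `time_checksum` and `mb_per_second` are hypothetical, not taken from Brumme's benchmark, and `clock()` reports CPU time, so treat the result as a rough figure only.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature: any running checksum taking (previous value, buffer, length). */
typedef uint32_t (*checksum_fn)(uint32_t crc, const void *buf, size_t len);

/* Throughput in MB/s, with 1 MB = 2^20 bytes (assumed from the numbers in the table). */
double mb_per_second(size_t bytes, double seconds)
{
    return (double)bytes / (1024.0 * 1024.0) / seconds;
}

/* Runs fn once over the whole buffer and returns the elapsed CPU time in seconds.
   The checksum is stored in *out so the compiler cannot discard the call. */
double time_checksum(checksum_fn fn, const void *buf, size_t len, uint32_t *out)
{
    clock_t t0 = clock();
    *out = fn(0, buf, len);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Timing a single pass over an array that is already in memory measures only the checksum loop, which is exactly why these numbers are so far from the file-based ones.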
2) considering that roughly all algorithms work somewhat like this
(reading a file block by block into a buffer, then processing each block)
Code:
while ((n = fread(data, sizeof(char), mysize, myfile)) > 0)
    checksum = calc(data, n, checksum);
for many reasons (latency, operating system cache loading, loss of CPU cache locality and so on) the actual performance is vastly reduced (at least with Brumme's implementations), even for 16-byte slices.
With very rough estimates, certainly not scientific: about -60%.
However, it is still around 3GB/s, which for a software implementation is not bad at all
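To make the loop in point 2 concrete, here is a minimal self-contained sketch. It uses the plain bitwise CRC-32 (the slowest row of the table) only because it is short; it is not Brumme's slicing code. The names `crc32_update` and `crc32_stream` and the 64 KiB buffer size are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320).
   Pass 0 for the first block, then the previous return value for each next block. */
uint32_t crc32_update(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    crc = ~crc;                      /* undo the final inversion of the previous call */
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)  /* one bit per iteration */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Reads the stream block by block and carries the checksum across blocks. */
uint32_t crc32_stream(FILE *f)
{
    static unsigned char data[1 << 16];  /* hypothetical 64 KiB buffer */
    uint32_t crc = 0;
    size_t n;
    while ((n = fread(data, 1, sizeof data, f)) > 0)
        crc = crc32_update(crc, data, n);
    return crc;
}
```

Feeding the nine ASCII bytes "123456789" through `crc32_update`, in one call or split across several, gives the standard check value 0xCBF43926. For a file, open it with `fopen(path, "rb")` and pass the handle to `crc32_stream`.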
3) crc32c with SSE turns out to be really fast, it is easy to program, and it does not require any particularly "strange" source code
4) SHA1 (!) is not that bad, again for a software implementation
5) xxHash64 (simplified software implementation) is somewhere in between, but, always measuring the actual version that processes a file split into blocks and not an in-buffer benchmark, I don't know if it's worth it.
So I'll put all the possibilities into my ZPAQ, and maybe SHA256 too, for the "real paranoid".
Unfortunately I have not found a usable hardware SHA1 implementation; it seems that Intel sells libraries of this type. Oh well.
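As an illustration of point 3, a minimal sketch of crc32c using the SSE4.2 `crc32` instruction through the `_mm_crc32_u64` and `_mm_crc32_u8` intrinsics. It assumes GCC or Clang on x86-64 and falls back to a portable bitwise loop elsewhere. The function names `crc32c_sw`, `crc32c_hw` and `crc32c` are made up for this sketch. Note that crc32c uses the Castagnoli polynomial, so its values are not interchangeable with the plain crc32 results in the table.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable bitwise CRC-32C (reflected Castagnoli polynomial 0x82F63B78). */
uint32_t crc32c_sw(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <nmmintrin.h>               /* SSE4.2 intrinsics */
#define HAVE_CRC32C_HW 1

/* Hardware version: 8 bytes per instruction, then the remaining tail bytes. */
__attribute__((target("sse4.2")))
uint32_t crc32c_hw(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    uint64_t c = (uint32_t)~crc;
    while (len >= 8) {
        uint64_t v;
        memcpy(&v, p, 8);            /* safe unaligned load */
        c = _mm_crc32_u64(c, v);
        p += 8;
        len -= 8;
    }
    uint32_t c32 = (uint32_t)c;
    while (len--)
        c32 = _mm_crc32_u8(c32, *p++);
    return ~c32;
}
#endif

/* Picks the hardware path when the CPU reports SSE4.2, otherwise the software one. */
uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
#ifdef HAVE_CRC32C_HW
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_hw(crc, buf, len);
#endif
    return crc32c_sw(crc, buf, len);
}
```

Both paths should return 0xE3069283 for the ASCII string "123456789", the standard CRC-32C check value. As with the crc32 loop, pass 0 for the first block and the previous result for each following block.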