I want to develop one more hash for usage in SREP. Typically it will be applied to 256/512-byte blocks, less often - to blocks of 32..65536 bytes. It doesn't need to be too fair - now i use just polynomial hash (a[0]+K*a[1]+K*K*a[2]+...) since it supports rolling computation and anything same or better will be enough and CRC quality will be fine. crc32c is already used as the first, rolling hash so it's not an option. Preferably it would be rolling but not 100% necessary - on my 100 gb testfile, i need to compute 80 millions of 512-byte hashes or 2 millions of 64KB hashes, so fast non-rolling hash would be not worse than several times slower rolling one

now i consider the following variants:

1) simd-tuned fast semi-rolling hash - since Intel cpus can perform SSE (4*int32) multiplication with throughput 1 and latency afair 5, it seems that we can reach speed of 16 bytes/cycle using polynomial hash over 32-bit words with a stripe of 5*4:

hash0 = w[0]+K^4*w[4]+K^8*w[8]+...

hash1 = w[1]+K^4*w[5]+K^8*w[9]+...

...

hash = hash0+K*hash1+K^2*hash2+K^3*hash3

where hash0,hash1,hash2,hash3 are 32-bit words of the single SSE register and actual scheme should use 5 SSE registers containing 20 32-bit words in order to deal with SSE MUL latency. The scheme is easily ported to AVX2, Phi or plain 32-bit cpus with given number of registers and MUL latency. I call it semi-rolling since you can roll it by 4 bytes, so it just needs keeping 4 last hash values instead of 1. So it's both fast and rolling, but i'm pretty sure that its fairness would be even lower than of byte-grain polynomial hash.

2) something more fair, but non-rolling. The main idea is to keep exploiting intel SSE ability to perform 4*32-bit MUL each cycle, so it would process 16 bytes/cycle with MUL and a few other operations. like xxHash but use something faster for SSE instead of ROL operations

3) semi-universal hashing. single cycle loads 16 bytes of input data, 16 bytes of constants, performs 4*32-bit MUL bertween them, and adds 4*32-bit results to the accumulator register. So it would process the same 16 bytes/cycle, non-rolling, with good fairness although lower bits will be less fair

4) almost universal hashing. hash = (data0+const0)*(data1+const1) + (data2+const2)*(data3+const3)+... it is universal hashing if used MUL operation is 32*32=64-bit, but instead i plan to accumulate separately high and low 32 bits of products and then just add them together to produce 32-bit hash. It needs 4 loads, 4 adds and 2 muls per 32 bytes of input data so it may also fit in 16 bytes/cycle niche

What you think about these hashes? Can you propose other fast ones?

Existing fast hashes: