"universal hash" is a mathematical term that, from practical grounds, guarantees ideal hashing result. it's impossible to check statistical properties of single fixed hash in mathematically sound way, so universal hash by definition is a family of hashes (i.e. parameterized hash). Kaser&Lemire proved that sum(32*32->64) string hash is universal (here "string" means variable-sized input data as opposed to fixed-width M-bit value). In C++, their hashing algorithm looks as the following:

Code:
uint32 data[N], key[N]
uint64 hash = 0
for i = 0..N-1:
    hash += uint64(data[i]) * key[i]
uint32 result = hash >> 32

Here key[] is the algorithm parameter (pseudo-random numbers) and has the same size as the input data. But we can limit key[] to, say, 128 bytes and use a hash tree to combine partial results (as implemented by the VMAC and UMAC algorithms):
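As a compilable sketch of the pseudocode above (the function name and fixed-width types are mine, not from the paper):

```cpp
#include <cstdint>
#include <cstddef>

// Multilinear sum(32*32->64) string hash: key[] must hold n pseudo-random
// 32-bit values, i.e. the same length as the data being hashed.
uint32_t multilinear_hash(const uint32_t* data, const uint32_t* key, size_t n)
{
    uint64_t hash = 0;
    for (size_t i = 0; i < n; ++i)
        hash += (uint64_t)data[i] * key[i];   // 32*32->64 multiply, 64-bit sum
    return (uint32_t)(hash >> 32);            // high half is the hash result
}
```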

Code:
result[0] = hash(data[0..31],  key[0..31])
result[1] = hash(data[32..63], key[0..31])
...
result = hash(result[0..31], key[0..31])

Alternatively, the last step may employ SHA-256 or any other good hash, since it has much less data to process.
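A hedged C++ sketch of this two-level hash tree (block size of 32 dwords = 128 bytes as in the text; the helper names are illustrative, not taken from VMAC/UMAC, and a real implementation would recurse until one value remains):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Leaf hash: the same multilinear sum as before, over at most one block.
static uint32_t block_hash(const uint32_t* data, const uint32_t* key, size_t n)
{
    uint64_t h = 0;
    for (size_t i = 0; i < n; ++i)
        h += (uint64_t)data[i] * key[i];
    return (uint32_t)(h >> 32);
}

// Two-level tree: hash each 32-dword block with the same 32-dword key,
// then hash the vector of 32-bit partial results with that key again.
uint32_t tree_hash(const uint32_t* data, size_t n, const uint32_t key[32])
{
    std::vector<uint32_t> partial;
    for (size_t i = 0; i < n; i += 32)
    {
        size_t len = (n - i < 32) ? n - i : 32;
        partial.push_back(block_hash(data + i, key, len));
    }
    return block_hash(partial.data(), key, partial.size());
}
```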

------------------------------------------------------------------------------------------------------------------------

So, the only thing that determines speed is the loop summing the results of 32*32->64 multiplications:

mov eax, data[i]
mul key[i]          ; edx:eax = eax * key[i]
add ebx, eax
adc ecx, edx

It should run at 2 cycles/dword since the Pentium Pro, i.e. 8 GB/s on a 4 GHz CPU. The loop speed is limited by the two load operations (data[i] and key[i]), so Haswell, with its two load ports, can run it at 1 cycle/dword.

------------------------------------------------------------------------------------------------------------------------

Employing SSE2/AVX2 allows computing 2/4 products simultaneously. Combined with Haswell's ability to perform 2 loads per cycle, it should run at 1 cycle per 8/16 bytes (for SSE2/AVX2, respectively), providing the abovementioned 64 GB/s hashing speed for the following AVX2 code:

vmovdqu  ymm0, [data+i*32]
vpmuludq ymm0, ymm0, [key+i*32]
vpaddq   ymm1, ymm1, ymm0
vmovdqu  ymm0, [data+i*32+4]    ; shifted load brings odd dwords into even lanes
vpmuludq ymm0, ymm0, [key+i*32+4]
vpaddq   ymm1, ymm1, ymm0
i = i+1
vmovdqu  ymm0, [data+i*32]
...
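A hedged intrinsics sketch of this loop (names are mine): vpmuludq (_mm256_mul_epu32) multiplies only the even 32-bit lanes, so the odd dwords must be brought into even positions. To keep the sketch self-contained and free of out-of-bounds reads, it shifts them down in-register with _mm256_srli_epi64 instead of issuing the +4-byte shifted load used in the assembly; n is assumed to be a multiple of 8 dwords.

```cpp
#include <cstdint>
#include <cstddef>
#include <immintrin.h>

__attribute__((target("avx2")))
uint32_t avx2_multilinear(const uint32_t* data, const uint32_t* key, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 8)
    {
        __m256i d = _mm256_loadu_si256((const __m256i*)(data + i));
        __m256i k = _mm256_loadu_si256((const __m256i*)(key + i));
        // even dwords: lanes 0,2,4,6
        acc = _mm256_add_epi64(acc, _mm256_mul_epu32(d, k));
        // odd dwords: shift them down into the even lanes, then multiply
        acc = _mm256_add_epi64(acc, _mm256_mul_epu32(
                  _mm256_srli_epi64(d, 32), _mm256_srli_epi64(k, 32)));
    }
    // horizontal sum of the four 64-bit lanes, then take the high half
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i*)lanes, acc);
    return (uint32_t)((lanes[0] + lanes[1] + lanes[2] + lanes[3]) >> 32);
}
```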

------------------------------------------------------------------------------------------------------------------------

On pre-Haswell CPUs we are again limited by load throughput: 2 load operations (data and key) per 16-byte SSE element mean 2 cycles. We can reduce the number of loads by processing six 128-byte blocks in parallel (since they share the same key values):

movdqu  xmm7, [key+i*16]        ; load the key once
movdqu  xmm0, [data+i*16]
pmuludq xmm0, xmm7
paddq   xmm1, xmm0
movdqu  xmm0, [data+128+i*16]
pmuludq xmm0, xmm7
paddq   xmm2, xmm0
...
movdqu  xmm0, [data+128*5+i*16]
pmuludq xmm0, xmm7
paddq   xmm6, xmm0
i = i+1
movdqu  xmm7, [key+i*16]        ; load the next key...

This code requires 7 loads (6 data values and one key value) per 6*8 bytes of input data, so a 4 GHz Core2+ CPU will run it at 4*8*6/7 ≈ 27 GB/s using SSE2, saturating its memory bandwidth.
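The shared-key idea can be sketched with SSE2 intrinsics (SSE2 is baseline on x86-64, so this compiles without special flags). For brevity the sketch hashes four blocks in parallel rather than six, and, as in the AVX2 sketch, it covers the odd dwords with an in-register shift instead of extra loads; names and the four-block choice are mine:

```cpp
#include <cstdint>
#include <cstddef>
#include <emmintrin.h>

static const size_t BLOCK = 32;   // per-block length in dwords = 128 bytes

// Hash four 128-byte blocks in parallel against the same key, so each key
// vector is loaded once per four data loads. Raw 64-bit sums are returned.
void sse2_hash4(const uint32_t* data, const uint32_t* key, uint64_t out[4])
{
    __m128i acc[4] = { _mm_setzero_si128(), _mm_setzero_si128(),
                       _mm_setzero_si128(), _mm_setzero_si128() };
    for (size_t i = 0; i < BLOCK; i += 4)
    {
        __m128i k  = _mm_loadu_si128((const __m128i*)(key + i)); // key loaded once
        __m128i kh = _mm_srli_epi64(k, 32);                      // odd key dwords
        for (int b = 0; b < 4; ++b)
        {
            __m128i d = _mm_loadu_si128((const __m128i*)(data + b*BLOCK + i));
            acc[b] = _mm_add_epi64(acc[b], _mm_mul_epu32(d, k));  // even dwords
            acc[b] = _mm_add_epi64(acc[b],
                         _mm_mul_epu32(_mm_srli_epi64(d, 32), kh)); // odd dwords
        }
    }
    for (int b = 0; b < 4; ++b)
    {
        uint64_t lanes[2];
        _mm_storeu_si128((__m128i*)lanes, acc[b]);
        out[b] = lanes[0] + lanes[1];
    }
}
```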

------------------------------------------------------------------------------------------------------------------------

It's important to note that all these code snippets work in x86 (32-bit) mode as well as in x64 mode. So, for practical usage we only need a way to generate around 128..4096 bytes of pseudo-random data from the 32..256-bit key provided by the library caller. This can be implemented by applying AES or SHA-256 to the values key, key+1, ... (i.e. using AES in CTR mode), as already implemented by the VMAC algorithm.
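A hedged sketch of this CTR-style key expansion. To stay self-contained it uses the splitmix64 mixing function as a stand-in for the block cipher; it is NOT a substitute for AES/SHA-256 where real universality guarantees are needed, but it shows the shape: "encrypt" a counter derived from the seed and emit the output as key material.

```cpp
#include <cstdint>
#include <cstddef>

// splitmix64 finalizer, standing in for AES/SHA-256 in this sketch only.
static uint64_t mix64(uint64_t x)
{
    x += 0x9E3779B97F4A7C15ull;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ull;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBull;
    return x ^ (x >> 31);
}

// Expand a small seed into n pseudo-random dwords of key material,
// CTR-style: each 64-bit output comes from mixing seed + counter.
void expand_key(uint64_t seed, uint32_t* key, size_t n)
{
    for (size_t i = 0; i < n; i += 2)
    {
        uint64_t r = mix64(seed + i/2);        // "encrypt" the counter
        key[i] = (uint32_t)r;
        if (i + 1 < n) key[i+1] = (uint32_t)(r >> 32);
    }
}
```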

Strictly speaking, this algorithm returns a 32-bit hash value. For a wider result, we can either calculate and later combine multiple 32-bit hashes (the same data[i] multiplied by several key[K][i] values) with proportional speed degradation, or use some form of mixing of intermediate data, losing the universal-hash property but keeping the same high speed. I expect that a properly designed mixing step would provide a good practical hash, as is done in CityHashCrc256, for example.
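The "multiple keys" widening can be sketched like this (names are mine): the same data stream is hashed against two independent key arrays, giving a 64-bit result from two multiply-accumulate streams, i.e. at roughly half the speed of the 32-bit version.

```cpp
#include <cstdint>
#include <cstddef>

// 64-bit result built from two independent 32-bit multilinear hashes of
// the same data, each using its own pseudo-random key array.
uint64_t multilinear_hash64(const uint32_t* data, size_t n,
                            const uint32_t* key_a, const uint32_t* key_b)
{
    uint64_t ha = 0, hb = 0;
    for (size_t i = 0; i < n; ++i)
    {
        ha += (uint64_t)data[i] * key_a[i];
        hb += (uint64_t)data[i] * key_b[i];
    }
    return ((uint64_t)(uint32_t)(ha >> 32) << 32) | (uint32_t)(hb >> 32);
}
```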

------------------------------------------------------------------------------------------------------------------------

Conclusion: these code snippets may be used to implement an alternative to xxHash/CityHash with the following properties:

- fast in both 32-bit and 64-bit code:
- __ 2 bytes/cycle at least on any CPU since the Pentium Pro
- __ 7 bytes/cycle at least on any Intel CPU since Core2
- __ 16 bytes/cycle on Haswell, 32 bytes/cycle on Skylake with AVX-512
- 32-bit result by default, wider results with proportional speed degradation
- almost math-proven guarantees of hash quality (real math guarantees require a truly random key)
- wider results at the same speed as the 32-bit one - needs further development and carries no math guarantees anymore
- key generation and result combining need some good-quality algorithms, but I believe that AES and SHA-256 can perform this duty
- the algorithm is parameterized, meaning that you can calculate as many independent hashes as you need

So, comparing it to the best non-crypto hashes I know of, xxHash/CityHash, I expect this hash to have better speed but the following drawbacks:

- much more compiled code, since we need to employ AES and SHA-256 implementations
- higher cache occupancy (those 128..4096 bytes of key[])
- wide results (>32 bits) without speed degradation need further development