Over the last few weeks I tried to wrap my head around hashing algorithms like xxHash, MurmurHash and SpookyHash. Now I'm ready to offer some criticism of xxHash and propose ways to improve it.

1. xxHash's main loop updates 4 32-bit variables. Each variable depends only on the data in the corresponding 32-bit lane of each 128-bit block, so essentially we run 4 independent hashes, each processing its own data stream.

Since the state of each individual stream is held in a 32-bit variable, the upper limit on the entropy it can carry is 32 bits, and in practice it will be lower. So a 32-bit state is too small for producing a high-quality 32-bit hash, and combining 4 hashes of independent data streams can't help either: all the entropy of a particular input may be concentrated in just one of these streams (imagine a table with 16-byte rows, where a given column always lands in the same lane).
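For reference, this is the XXH32 per-lane round (constants taken from the xxHash source); note that the accumulator never sees the other three lanes, which is exactly the problem described above:

```c
#include <stdint.h>

#define PRIME32_1 2654435761u
#define PRIME32_2 2246822519u

static uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

/* One XXH32 round: the accumulator for lane i only ever
   combines with the i-th 32-bit word of each 16-byte block. */
static uint32_t xxh32_round(uint32_t acc, uint32_t input)
{
    acc += input * PRIME32_2;   /* pre-mix the input word */
    acc  = rotl32(acc, 13);
    acc *= PRIME32_1;
    return acc;
}
```

If every 16-byte row differs only in its first 4 bytes, lanes 1-3 accumulate identical values for all rows, so three quarters of the state carry no entropy at all.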

The solution to this problem is to mix the state variables as we go through the data, as SpookyHash and MurmurHash3_x86_128 do. The variables should be combined in a way that ensures the state is fully mixed, i.e. any input bit can affect any state bit after some number of loop iterations; we should both minimize that number and ensure good mixing of the data itself.

2. The hashes of these 4 streams are combined in a very minimalistic way. This was already fixed in xxHash64, so I simply reused its code in my hash. BTW, XXH32 with XXH64-style final mixing passes SMHasher as a 32-bit hash (both the upper and the lower 32 bits of the result) and fails only one test as a 64-bit hash.
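The xxHash64 finalization in question is the following avalanche sequence (constants from the xxHash64 source):

```c
#include <stdint.h>

/* XXH64 avalanche: alternating shift-xors and multiplies by the
   XXH64 primes, so every state bit affects every output bit. */
static uint64_t xxh64_avalanche(uint64_t h)
{
    h ^= h >> 33;
    h *= 0xC2B2AE3D27D4EB4FULL;  /* PRIME64_2 */
    h ^= h >> 29;
    h *= 0x165667B19E3779F9ULL;  /* PRIME64_3 */
    h ^= h >> 32;
    return h;
}
```

Every step is bijective (shift-xor and multiplication by an odd constant are both invertible), so no entropy is lost, and either half of the 64-bit result can serve as a 32-bit hash.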

3. The state variables are folded into a single 32-bit variable before the last 1-15 bytes of data are processed, which condenses the entropy too early. Also, after the last data byte, xxHash performs only a few mixing operations to finalize the hash, so the last bytes end up less thoroughly mixed into the result.

The solution I propose is to zero-pad the last chunk of data to 16 bytes and process it the same way as the previous blocks, i.e. updating all 4 state variables, and only then proceed to the final mixing of the state variables to compute the hash result.
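A minimal sketch of this tail handling (round32 is a stand-in for the regular per-lane update, with the XXH32 constants). One caveat worth noting: zero-padding alone makes inputs that differ only in trailing zero bytes collide, so the total length must still be folded into the final mix, which xxHash already does:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

static uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

/* stand-in for the regular per-lane update */
static uint32_t round32(uint32_t acc, uint32_t w)
{
    return rotl32(acc + w * 2246822519u, 13) * 2654435761u;
}

/* Pad the final partial block (len < 16) with zeros and run the
   normal 16-byte round over it, updating all 4 state variables. */
static void process_tail(uint32_t state[4], const uint8_t *p, size_t len)
{
    uint8_t block[16] = {0};
    memcpy(block, p, len);
    for (int i = 0; i < 4; i++) {
        uint32_t w;
        memcpy(&w, block + 4 * i, sizeof w);
        state[i] = round32(state[i], w);
    }
}
```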

4. xxHash multiplies the input word before combining it into the hash state variable. This construction increases ILP, but the multiplication is "wasted" in the sense that it doesn't mix the hash state itself. It's a matter of taste, but I prefer to apply all operations to the state variables, so that input bits "diffuse" through the state faster.

So I constructed two update functions. The first one, employed by SlowZZH, executes the same set of operations as xxHash (2 MUL, ADD, ROL/BSWAP) plus one ADD to mix the state variables. It has the same bulk speed as xxHash (2 bytes/cycle), since the performance of both functions (as well as of xxHash64) is strictly limited by the single MUL-capable port on every CPU I know of. (Note: SMHasher computes speed incorrectly on modern CPUs due to changes in RDTSC behavior, reporting results 10-15% better than the real ones.)

But unlike xxHash, it executes both MUL operations against state variables, making input bits "diffuse" faster. In theory, the second MUL should be followed by another ROL. I tried this change, but found no improvement in the SMHasher stats (32-bit, 64-bit), only decreased speed.
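Since the sources aren't quoted in this post, here is only a hedged sketch of the kind of step described, not the actual SlowZZH code: the constants (borrowed from XXH32), the lane order and the exact placement of the BSWAP are my assumptions. What it illustrates is the stated operation budget (2 MUL, 2 ADD, BSWAP per input word, with both multiplies applied to the state and a neighboring state variable added in):

```c
#include <stdint.h>

#define P1 2654435761u   /* assumed constants (XXH32 primes) */
#define P2 2246822519u

static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* One step per input word w on lane i of a 5-variable state:
   ADD, MUL, BSWAP, MUL, ADD -- both multiplies hit the state
   itself, and s[(i+1) % 5] cross-links the lanes. */
static void slow_step(uint32_t s[5], int i, uint32_t w)
{
    s[i] = bswap32((s[i] + w) * P1) * P2 + s[(i + 1) % 5];
}
```

Since every sub-operation is invertible for a fixed neighbor value, distinct input words always produce distinct lane values.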

Computing on the input value before combining it with the hash state requires one register for the intermediate data. Since my hashes avoid such computations, one register is freed up compared to xxHash. I spend this extra register on a fifth state variable, which reduces the latency pressure of chained operations. The net result is that on Haswell, SlowZZH runs at the same speed as xxHash.

A second consequence of the enlarged internal state is that a 160-bit state allows computing hashes up to 128 bits wide with some confidence.

The first ROL was replaced with BSWAP, available on Intel CPUs since the i486. I expected this to slightly improve bit mixing and therefore the SMHasher stats, but instead it improved performance, most likely due to suboptimal scheduling of the ROL-based code by GCC 4.9.2.

So, SlowZZH provides:

- the same speed as xxHash (at least on Haswell)

- better state mixing (two MUL operations on a state variable per loop iteration instead of one)

- 160-bit internal state; a new input word influences the entire state within the next 64 input bytes

- results up to 128 bits (currently only 32- and 64-bit results are implemented)

- passes SMHasher both as a 32-bit and as a 64-bit hash

5. Alongside SlowZZH development, I tried to simplify the xxHash hashing formula, assuming that bit mixing across 5 state variables would compensate for the weakening. I ended up with an update procedure that drops the second MUL, opening the door to speeds beyond 2 bytes/cycle.

Compared to xxHash (2 MUL + ADD + ROL), it employs one more ADD to combine state variables but drops the MUL on the input word, so it performs MUL + 2 ADD + ROL per input word. In theory it should run at 1 cycle/word on Haswell, but I haven't reached that speed (even in the x64-optimized version with 12 state variables to avoid instruction-chain delays).
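Again, only an illustrative sketch rather than the real zzHash code (the constant, rotation count and lane order are my assumptions); it shows the MUL + 2 ADD + ROL budget per input word:

```c
#include <stdint.h>

static uint32_t rotl32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

/* One step per input word w on lane i of a 5-variable state:
   ADD (input), ROL, MUL, ADD (neighbor state variable) -- the
   single multiply acts on the state, and the extra ADD replaces
   xxHash's input pre-multiplication as the cross-lane mixer. */
static void zz_step(uint32_t s[5], int i, uint32_t w)
{
    s[i] = rotl32(s[i] + w, 13) * 2654435761u + s[(i + 1) % 5];
}
```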

Nevertheless, zzHash passes all SMHasher tests and runs at 2.5 bytes/cycle in the x86 version (25% faster than xxHash) and 5.6 bytes/cycle in the x64 version (40% faster than xxHash64 and SpookyHash).

So, zzHash provides:

- 25% faster than xxHash; the x64 version is 40% faster than xxHash64/SpookyHash (on Haswell)

- the same per-step state mixing as xxHash, but with the input-word pre-mixing dropped

- 160-bit internal state; a new input word influences the entire state within the next 64 input bytes

- results up to 128 bits (currently only 32- and 64-bit results are implemented)

- passes SMHasher both as a 32-bit and as a 64-bit hash

So it may be considered a lightweight xxHash variant. Instead of pre-mixing the input word before combining it with a state variable, we mix the state variable with another state variable on every step. It's an open question whether this replacement improves or degrades hash quality.

6. Finally, I added x64 versions of both hashes (not compatible with the x86-optimized ones) that read input data in 64-bit words and hold the state in 12 64-bit registers. This makes the internal state 768 bits wide, so hashes up to 512 bits can be computed. Aside from the much bigger state, these algorithms are 1:1 translations of the x86-optimized hashes.

SlowWideZZH:

- 4 bytes/cycle, 5% slower than xxHash64/SpookyHash (on Haswell)

- 0.7 bytes/cycle in the x86 build: 5% slower than xxHash64 and 20% slower than SpookyHash (on Haswell)

- the same algorithm as SlowZZH

- 768-bit internal state; a new input word influences the entire state within the next 968 input bytes

- results up to 512 bits (currently only 32- and 64-bit results are implemented)

- passes SMHasher both as a 32-bit and as a 64-bit hash

WideZZH:

- 5.6 bytes/cycle, 40% faster than xxHash64/SpookyHash (on Haswell)

- 1.3 bytes/cycle in the x86 build: 80% faster than xxHash64 and 60% faster than SpookyHash (on Haswell)

- the same algorithm as zzHash

- 768-bit internal state; a new input word influences the entire state within the next 968 input bytes

- results up to 512 bits (currently only 32- and 64-bit results are implemented)

- passes SMHasher both as a 32-bit and as a 64-bit hash

The wide hashes definitely need more work. In particular, I will try to make WideZZH run at 8 bytes/cycle. I also need to modify the state mixing so that input data influences the entire state faster than within 968 bytes (for comparison, SpookyHash does it within 256 bytes). This should improve the "Differential Tests" results, which are currently a bit worse than in the non-wide versions.

Nevertheless, zzHash (for x86) and WideZZH (for x64) seem to be the fastest scalar hashes around that pass SMHasher.

Now about features that may be implemented in the future:

1. I plan to split hashing into two stages: a main-loop function that processes all input data and returns the final 160-bit or 768-bit state, and final-mix functions that take the state returned by the first function and produce 32/64/128... bits as the final hash result.
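A possible shape for that two-stage interface. All names and both function bodies below are hypothetical stand-ins, just to make the split concrete; the real stage 1 would be the SlowZZH/zzHash main loop:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t s[5]; } zz160;   /* raw 160-bit state */

/* Stage 1: main loop over all input, returning the raw state.
   Trivial placeholder body, not the real update function. */
static zz160 zz_stage1(const uint8_t *p, size_t len)
{
    zz160 st = {{ 0x9E3779B9u, 0x85EBCA77u, 0xC2B2AE3Du,
                  0x27D4EB2Fu, 0x165667B1u }};
    for (size_t i = 0; i < len; i++)
        st.s[i % 5] = (st.s[i % 5] + p[i]) * 2654435761u;
    return st;
}

/* Stage 2: mix the 160-bit state down to the requested width
   (again a placeholder xorshift-multiply mixer). */
static uint32_t zz_final32(zz160 st)
{
    uint32_t h = st.s[0];
    for (int i = 1; i < 5; i++) {
        h ^= st.s[i];
        h  = (h ^ (h >> 15)) * 2246822519u;
    }
    return h ^ (h >> 13);
}
```

A zz_final64 or zz_final128 would reuse the same stage-1 state, so adding result widths wouldn't touch the main loop at all.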

2. Algorithms like xxHash/zzHash may be called "sequential hashing", since they cannot be multi-threaded. I plan to also develop a family of "chunked hashing" algorithms. These will split the input data into fixed-size blocks (4 KB or so), compute a sequential hash over each block and then combine the individual hashes. This construction, already implemented in FARSH, allows multi-core CPUs and especially GPUs to employ all their resources to hash a single data buffer.

Moreover, combining the final hash results should be replaced with combining the final internal states, in order to improve both hashing speed and quality. This is an additional argument for splitting the hash algorithm into two stages.

Finally, I imagine an algorithm that collects the 160-bit results of processing each 4 KB chunk into a temporary buffer and runs the same algorithm on this buffer, yielding a 160-bit overall result. The same approach can be repeated over multiple levels, with the final 160-bit "super-result" then processed by the standard 160-to-32-bit mixing procedure (or 160-to-64, and so on).
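A sketch of the chunked construction. The chunk size and the FNV-1a stand-in for the real sequential hash are placeholders; the point is the structure: independent per-chunk hashing, then the same algorithm run over the collected per-chunk results:

```c
#include <stdint.h>
#include <stddef.h>

#define CHUNK 4096  /* hypothetical chunk size */

/* FNV-1a, used here only as a stand-in for the real sequential hash. */
static uint32_t seq_hash(const uint8_t *p, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Chunked construction: hash each CHUNK-sized block independently
   (these calls are embarrassingly parallel), then run the same
   sequential hash over the buffer of per-chunk results. */
static uint32_t chunked_hash(const uint8_t *p, size_t len)
{
    uint32_t states[64];   /* enough chunks for this sketch */
    size_t n = 0;
    for (size_t off = 0; off < len; off += CHUNK) {
        size_t part = len - off < CHUNK ? len - off : CHUNK;
        states[n++] = seq_hash(p + off, part);
    }
    return seq_hash((const uint8_t *)states, n * sizeof states[0]);
}
```

The refinement described above, combining final internal states instead of final hash results, would simply store the full 160-bit per-chunk states in the buffer instead of 32-bit results.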

3. And a last idea. By pushing hash speeds up, we let less speed-critical applications run multi-pass hashing, which may significantly improve hash quality. Consider the main source of undesired collisions: bit cancellation between successive input words. One way to fix this is to process the data in an arbitrary order. But that breaks hardware cache prefetching (which can be replaced with software prefetching), and it still allows bit cancellation between successively processed words, although with a much lower probability in practice.

Another approach is to split the data into chunks of ~1-4 KB and process each chunk 2 or more times, in different orders. The first pass may go through the data sequentially, the second in a different order. The small chunk size ensures that the data stays in L1 cache for the second pass. This way each input word has different neighbors on the first and second passes, making bit cancellation less probable, if possible at all.

To maximize the distance, in one pass, between words processed sequentially in the other pass, we should select the smallest K > sqrt(N) that is coprime to N (where N is the chunk size in words) and run the second pass in the following order:

    for (int i = 0; i < N; i++)
        process(data[(i*K) % N]);

i.e. process the data in steps of K, modulo N. Of course, for maximum speed the loop should be unrolled and all indexes made constants.
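The choice of K can be sketched as follows (the function names are mine):

```c
static unsigned gcd(unsigned a, unsigned b)
{
    while (b) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/* Smallest K > sqrt(N) with gcd(K, N) == 1: coprimality makes
   i*K mod N visit every index exactly once, and K > sqrt(N)
   keeps sequential neighbors of one pass far apart in the other. */
static unsigned pick_stride(unsigned N)
{
    unsigned K = 1;
    while (K * K <= N) K++;       /* now K = floor(sqrt(N)) + 1 */
    while (gcd(K, N) != 1) K++;
    return K;
}
```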

4. As you can see, we can build a huge family of hashes:

- sequential x86-optimized hashes with 32/64/128-bit results

- the same, but x64-optimized, with 32..512-bit results

- any of the above, performed in 2, 3, 4... passes

- any of the above, used as the base for a chunked algorithm

And we haven't yet counted the fast/slow zzHash variations, nor SIMD-optimized algorithms like xxFarsh/HighwayHash (which could also be targets for multi-pass/chunked algorithms)!

This calls for a Hash Constructor library providing a solid API to build higher-level hash algorithms (such as a 128-bit 2-pass chunked algorithm) from lower-level ones (such as a function that processes an input buffer and returns a 160-bit internal state).

And finally, I will be glad to see your answers to the following questions:

- How should this be packaged? I'd prefer to reuse the excellent existing infrastructure of xxHash and provide the new hashes as a kind of "extension module" to xxHash. The hashes may even end up as part of the xxHash library, if Yann prefers. Do you have better ideas?

- Final mixing. As a quick-and-dirty approach, I currently just borrow the corresponding xxHash64 code. It needs work; in particular, the x86-optimized hashes should use 32-bit operations. But I have no idea how to build a good algorithm here, so I'm calling for your ideas or code. Ignoring the x64 hashes for now, we need x86-optimized code mixing five 32-bit variables into a 32..128-bit final hash result.