I've been pondering the XXH3 main loop for large inputs. It is based on the UMAC construction / NH hash-function family. NH basically means you multiply the data by itself. The original NH goes like this:

void NH_update(uint64_t *state, uint32_t data[2], uint32_t key[2])
{
    *state += (uint64_t)(data[0] + key[0]) * (data[1] + key[1]);
}

Before the multiplication, the data is "normalized" with random key material. The purpose is to reduce the chance of the multiplication going wrong: long stretches of zero bits in the multipliers become unlikely. Still, multiplication by 0 cannot be ruled out; it happens with probability 2^-32. That is sort of acceptable for a 32-bit hash function, which the original UMAC strives to provide, but far too weak for even a remotely decent 64-bit hash. Another problem is that the multiplication is not invertible and may lose data too early. Both problems can be addressed by re-adding the data atop the product.

void update(uint64_t *state, uint64_t data, uint64_t key)
{
    uint64_t x = data + key;
    uint64_t m = (uint32_t) x * (x >> 32);
    *state += m + x;
}

But there is an even better way, which Yann calls cross-pollenization. Since any practical implementation would have multiple states anyway, we can re-add each lane's data atop the other lane's product:

void update(uint64_t state[2], uint64_t data[2], uint64_t key[2])
{
    uint64_t x0 = data[0] + key[0];
    uint64_t x1 = data[1] + key[1];
    uint64_t m0 = (uint32_t) x0 * (x0 >> 32);
    uint64_t m1 = (uint32_t) x1 * (x1 >> 32);
    state[0] += m0 + x1;
    state[1] += m1 + x0;
}

By this time we've moved far enough away from the original NH, which didn't promise much anyway. So, I thought, let's rethink the construction. Why do we even need the key? We want the multipliers to be random-looking. But if we manage to keep the state itself random-looking, then there is no need for the key. Instead, we can "normalize" the data by simply feeding it right into the state.

void update(uint64_t state[2], uint64_t data[2])
{
    uint64_t x0 = state[0] + data[0];
    uint64_t x1 = state[1] + data[1];
    uint64_t m0 = (uint32_t) x0 * (x0 >> 32);
    uint64_t m1 = (uint32_t) x1 * (x1 >> 32);
    state[0] += m0 + x1;
    state[1] += m1 + x0;
}

Now, what can go wrong with this construction? It feels like there's an increased risk of the data interfering with itself, but at least there is no trivial way for the data to self-cancel. So I tried various clever ways to subvert the construction, only to discover a rather simple one: just feed it zeros, and it breaks soon enough.

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t state[2] = {
        UINT64_C(0x736f6d6570736575),
        UINT64_C(0x646f72616e646f6d),
    };
    uint64_t data[2] = { 0, 0 };
    for (int i = 0; i < 256; i++) {
        update(state, data);
        if (i % 16 == 15)
            printf("%016" PRIx64 " %016" PRIx64 "\n", state[0], state[1]);
    }
    return 0;
}

The state then progresses like this:

d8a90cbd10387540 75f5ecc982682820
690f57b9d28a6e00 b925def9054bcc80
fefd84cd461dc400 555f26134bd28e00
0bd3ec9b99f63c00 24d907ba7d90fc00
69a987a47c4c4000 101ea11d2b04e000
13add4589f088000 f5ecfa1991620000
79c6eb3ebdd80000 0323bbe98f780000
fefea2d2a0380000 d7f9c6ecdd200000
5fa0e02d08e00000 6acf3ea23a200000
685e6e003b000000 518f376f1d800000
6baab97b40000000 70cc8dd9bc000000
44ecff32c0000000 c94b591f40000000
67bb4f2680000000 78cbf46b80000000
6b36a80000000000 6b36a80000000000
a800000000000000 a800000000000000
0000000000000000 0000000000000000

It is easier to understand what happens here if we revert, for a moment, to a non-interleaved variant, i.e. state[0] = m0 + x0. This variant degenerates much faster (in about 64 iterations instead of 256). So what happens here is that the iterations look like

lo,hi = lo*hi + lo,hi
lo,hi = lo*hi + lo,hi

where "lo,hi" designates a 64-bit quantity constructed from 32-bit halves. Now, I don't have a mathematical proof, but what seems to be happening here is that the "lo" half is eventually a multiple of the original "lo" half, and on each iteration, this multiple is more likely to be even than odd. Once "lo,hi" is even, there is no way for "lo*hi + lo,hi" to become odd. It can only stay even or become "more even". And becoming "more even" means pushing the multiplier towards zero.

Fortunately this flaw can be easily fixed: the iterations should be

lo,hi = lo*hi + hi,lo
lo,hi = lo*hi + hi,lo

i.e. we swap the halves coming from the last multiplication, which disrupts the pattern of becoming "more even". It further makes sense to combine the two multiplications in exactly this way: instead of combining good bits with good bits and bad bits with bad bits, we "normalize" the total avalanche effect of both products. This results in much faster diffusion of the data from the previous iterations.

So here's the final construction I would like to discuss:

#define rotl64(x, k) (((x) << (k)) | ((x) >> (64 - (k))))

void ZrHa_update(uint64_t state[2], uint64_t data[2])
{
    uint64_t x0 = state[0] + data[0];
    uint64_t x1 = state[1] + data[1];
    uint64_t m0 = (uint32_t) x0 * (x0 >> 32);
    uint64_t m1 = (uint32_t) x1 * (x1 >> 32);
    state[0] = m0 + rotl64(x1, 32);
    state[1] = m1 + rotl64(x0, 32);
}

Of course, rotl64(x, 32) comes for free with a SIMD shuffle: you can swap the halves along with swapping the lanes.

#include <emmintrin.h>

void ZrHa_update_sse2(__m128i *state, const void *data)
{
    __m128i x = _mm_add_epi64(*state, _mm_loadu_si128((const __m128i *) data));
    __m128i m = _mm_mul_epu32(x, _mm_shuffle_epi32(x, _MM_SHUFFLE(2, 3, 0, 1)));
    *state = _mm_add_epi64(m, _mm_shuffle_epi32(x, _MM_SHUFFLE(0, 1, 2, 3)));
}

Here is the avalanche diagram of ZrHa_update (along with the source code to generate it). It is a 128x128 XY-diagram: you flip one bit on the X-axis, and see which bits got flipped in response. The long horizontal lines do not matter, it's the ripples where the entropy is. The four rhombi are contributed by the two products, and above or below each rhombus there is an oblique line segment about 2 bits thick, those are the data re-added atop the products.

Finally, I constructed a maquette, a scaled-down copy of the complete hash function.

static inline uint64_t rrmxmx(uint64_t x)
{
#define rotl64(x, k) (((x) << (k)) | ((x) >> (64 - (k))))
#define rotr64(x, k) (((x) >> (k)) | ((x) << (64 - (k))))
    x ^= rotr64(x, 49) ^ rotr64(x, 24);
    x *= UINT64_C(0x9FB21C651E98DF25);
    x ^= x >> 28;
    x *= UINT64_C(0x9FB21C651E98DF25);
    x ^= x >> 28;
    return x;
}

uint64_t ZrHa64_maquette(const char *data, size_t len, uint64_t seed)
{
    /* assumes len >= 16; shorter inputs go to a separate short-string hash */
    uint64_t state[2], hunk[2];
    state[0] = UINT64_C(0x736f6d6570736575) ^ seed;
    state[1] = UINT64_C(0x646f72616e646f6d) ^ seed;
    const char *last16 = data + len - 16;
    do {
        memcpy(hunk, data, 16);
        ZrHa_update(state, hunk);
        data += 16;
    } while (data < last16);
    memcpy(hunk, last16, 16); /* the tail may overlap the last full hunk */
    ZrHa_update(state, hunk);
    state[0] ^= (uint32_t)(len * 2654435761);
    return rrmxmx(state[0]) + rrmxmx(state[1]);
}

The real implementation would probably have state[8] instead of state[2], enough to fill two AVX2 registers. But the maquette is easier to study and reason about. I actually have a sophisticated SIMD routine to merge state[8] down to state[2], but to study ZrHa_update, I just borrow a known-good mixer called rrmxmx. And short strings are handled with another known-good hash function (not shown here). The resulting hash function passes all SMHasher tests.