Thread: ZrHa_update - a fast construction for iterated hashing

1. ZrHa_update - a fast construction for iterated hashing

I've been pondering over the XXH3 main loop for big inputs. It is based on the UMAC construction / NH hash-function family. NH basically means you multiply the data with itself. The original NH goes like this:

void NH_update(uint64_t *state, uint32_t data[2], uint32_t key[2])
{
    *state += (uint64_t)(data[0] + key[0]) * (data[1] + key[1]);
}

Before multiplication, the data is "normalized" with random key material. The purpose is to reduce the chance of the multiplication going wrong: e.g. long stretches of zero bits in the multipliers become unlikely. Still, multiplication by 0 cannot be ruled out; it happens with probability 2^-32. This is kind of okay for a 32-bit hash function, which the original UMAC strives to provide, but it is way too bad for even a remotely decent 64-bit hash. Another problem is that the multiplication is not invertible, and may lose data too early. Both problems can be addressed by re-adding the data atop the product.

void update(uint64_t *state, uint64_t data, uint64_t key)
{
    uint64_t x = data + key;
    uint64_t m = (uint32_t) x * (x >> 32);
    *state += m + x;
}

But there is an even better way, which Yann calls cross-pollination. Since any practical implementation would have multiple states anyway, we can re-add the data atop the other lane's product:

void update(uint64_t state[2], uint64_t data[2], uint64_t key[2])
{
    uint64_t x0 = data[0] + key[0];
    uint64_t x1 = data[1] + key[1];
    uint64_t m0 = (uint32_t) x0 * (x0 >> 32);
    uint64_t m1 = (uint32_t) x1 * (x1 >> 32);
    state[0] += m0 + x1;
    state[1] += m1 + x0;
}

By this time we've moved far enough away from the original NH, and it didn't promise much anyway. So let's rethink the construction, I thought. Why do we even need the key? We want the multipliers to be random-looking. But if we manage to keep the state itself random-looking, then there is no need for the key. Instead, we can "normalize" the data by simply feeding it right into the state.

void update(uint64_t state[2], uint64_t data[2])
{
    uint64_t x0 = state[0] + data[0];
    uint64_t x1 = state[1] + data[1];
    uint64_t m0 = (uint32_t) x0 * (x0 >> 32);
    uint64_t m1 = (uint32_t) x1 * (x1 >> 32);
    state[0] += m0 + x1;
    state[1] += m1 + x0;
}

Now, what can go wrong with this construction? It feels like there's an increased risk of the data interfering with itself, but at least there is no trivial way for the data to self-cancel. So I tried various clever ways to subvert the construction, only to discover a rather simple one: just feed zeros into it, and it will break soon enough.

#include <stdio.h>
#include <inttypes.h>

int main()
{
    uint64_t state[2] = {
        UINT64_C(0x736f6d6570736575),
        UINT64_C(0x646f72616e646f6d),
    };
    uint64_t data[2] = { 0, 0 };
    for (int i = 0; i < 256; i++) {
        update(state, data);
        if (i % 16 == 15)
            printf("%016" PRIx64 " %016" PRIx64 "\n", state[0], state[1]);
    }
    return 0;
}

The state then progresses like this:

d8a90cbd10387540 75f5ecc982682820
690f57b9d28a6e00 b925def9054bcc80
fefd84cd461dc400 555f26134bd28e00
0bd3ec9b99f63c00 24d907ba7d90fc00
69a987a47c4c4000 101ea11d2b04e000
79c6eb3ebdd80000 0323bbe98f780000
fefea2d2a0380000 d7f9c6ecdd200000
5fa0e02d08e00000 6acf3ea23a200000
685e6e003b000000 518f376f1d800000
6baab97b40000000 70cc8dd9bc000000
44ecff32c0000000 c94b591f40000000
67bb4f2680000000 78cbf46b80000000
6b36a80000000000 6b36a80000000000
a800000000000000 a800000000000000
0000000000000000 0000000000000000

It is easier to understand what happens here if we revert, for a moment, to a non-interleaved variant, i.e. state[0] += m0 + x0. This variant degenerates much faster (in about 64 instead of 256 iterations). So what happens here is that we have iterations like

lo,hi = lo*hi + lo,hi
lo,hi = lo*hi + lo,hi

where "lo,hi" designates a 64-bit quantity constructed from 32-bit halves. Now, I don't have a mathematical proof, but what seems to be happening here is that the "lo" half is eventually a multiple of the original "lo" half, and on each iteration, this multiple is more likely to be even than odd. Once "lo,hi" is even, there is no way for "lo*hi + lo,hi" to become odd. It can only stay even or become "more even". And becoming "more even" means pushing the multiplier towards zero.

Fortunately this flaw can be easily fixed: the iterations should be

lo,hi = lo*hi + hi,lo
lo,hi = lo*hi + hi,lo

i.e. we swap the halves from the last multiplication, which disrupts the pattern of becoming "more even". And it further makes sense to combine two multiplications in exactly this way: instead of combining good bits with good bits and bad bits with bad bits, we "normalize" the total avalanche effect of both products. It results in much faster diffusion of the data from the previous iterations.

So here's the final construction I would like to discuss:

void ZrHa_update(uint64_t state[2], uint64_t data[2])
{
    uint64_t x0 = state[0] + data[0];
    uint64_t x1 = state[1] + data[1];
    uint64_t m0 = (uint32_t) x0 * (x0 >> 32);
    uint64_t m1 = (uint32_t) x1 * (x1 >> 32);
    state[0] = m0 + rotl64(x1, 32);
    state[1] = m1 + rotl64(x0, 32);
}

Of course, rotl64(x, 32) comes for free with a SIMD shuffle: you can swap the 32-bit halves along with swapping the 64-bit lanes.

void ZrHa_update_sse2(__m128i *state, const void *data)
{
    __m128i x = _mm_add_epi64(*state, _mm_loadu_si128((const __m128i *) data));
    __m128i m = _mm_mul_epu32(x, _mm_shuffle_epi32(x, _MM_SHUFFLE(2, 3, 0, 1)));
    *state = _mm_add_epi64(m, _mm_shuffle_epi32(x, _MM_SHUFFLE(0, 1, 2, 3)));
}

Here is the avalanche diagram of ZrHa_update (along with the source code to generate it). It is a 128x128 XY-diagram: you flip one bit on the X-axis, and see which bits got flipped in response. The long horizontal lines do not matter, it's the ripples where the entropy is. The four rhombi are contributed by the two products, and above or below each rhombus there is an oblique line segment about 2 bits thick, those are the data re-added atop the products. Finally, I constructed a maquette, a scaled-down copy of the complete hash function.

static inline uint64_t rrmxmx(uint64_t x)
{
#define rotl64(x, k) (((x) << (k)) | ((x) >> (64 - (k))))
#define rotr64(x, k) (((x) >> (k)) | ((x) << (64 - (k))))
    x ^= rotr64(x, 49) ^ rotr64(x, 24);
    x *= UINT64_C(0x9FB21C651E98DF25);
    x ^= x >> 28;
    x *= UINT64_C(0x9FB21C651E98DF25);
    x ^= x >> 28;
    return x;
}

uint64_t ZrHa64_maquette(const char *data, size_t len, uint64_t seed)
{
    uint64_t state[2], hunk[2];
    state[0] = UINT64_C(0x736f6d6570736575) ^ seed;
    state[1] = UINT64_C(0x646f72616e646f6d) ^ seed;
    const char *last16 = data + len - 16;
    do {
        memcpy(hunk, data, 16);
        ZrHa_update(state, hunk);
        data += 16;
    } while (data < last16);
    memcpy(hunk, last16, 16);
    ZrHa_update(state, hunk);
    state[0] ^= (uint32_t)(len * 2654435761);
    return rrmxmx(state[0]) + rrmxmx(state[1]);
}

The real implementation would probably have state[8] instead of state[2], for two AVX2 registers. But the maquette is easier to study and reason about. Actually I have a sophisticated SIMD routine to merge state[8] down to state[2], but to study ZrHa_update, I just borrow a known-good mixer called rrmxmx. And short strings are handled with another known-good hash function (not shown here). The resulting hash function passes all SMHasher tests.

2. The Following 2 Users Say Thank You to svpv For This Useful Post:

Bulat Ziganshin (5th October 2019),Cyan (5th October 2019)

3. This is a nice investigation @svpv. I like it.

A few (hopefully useful) comments :

Regarding quality :
passing SMHasher is a good first stage. It's already promising.
These days, I also use a massive collision generator (calculating 25 to 100 billion hashes) to properly estimate collision probabilities for 64-bit hashes. (SMHasher tests are too short for that; they are good enough for 32-bit hashes, but can't "see" collisions beyond ~50 bits.)
I could give your algorithm a try. The main downside is that getting results can take very long, and the machine is already in use right now. But I can find a slot later this week.

Regarding speed :
the algorithm introduces a dependency between rounds, which puts a lower bound on the achievable latency per loop.
This can be partially tamed by widening the number of accumulators, though there are other issues to consider (running out of registers, a heavier joining section, etc.).
So it's interesting to measure. Presumably, at this stage, you already have some numbers.
I could also test it on my side, but I guess the current "maquette" is not meant for that, and would produce non-representative results.

4. I think the lesson here should be: Don't do non-reversible state-mixing, i.e. when you have n possible internal states and you do some shuffling, there should be n possible results. Anything else is effectively discarding state.

This function does not adhere to that principle as, for instance, the state (0x8000000000000001, 0x8000000000000001) and the state (0x0000000000000002, 0x0000000000000002) both permute to (0x0000000200000000, 0x0000000200000000).

I don't know how badly the function as stated is discarding state, or what conditions might trigger a critical state drop (I wouldn't be surprised if this function has a state-destroying nemesis, the way the all-zeroes input is for its sibling), I just know that there is a whole world of potential problems that can be avoided by following this simple rule.

5. The Following User Says Thank You to NohatCoder For This Useful Post:

Bulat Ziganshin (6th October 2019)

6. For your information @svpv, I tested ZrHa64_maquette for collisions directly on my desktop.
It's more limited, so I had to reduce the test to 13 billion hashes.
That's not enough for a good estimation, but it can detect hashes with less than ideal distributions.

ZrHa64_maquette generated 4 collisions, which is in line with expectation (5).

This is a good beginning.
I hope to find the time to make a proper evaluation with more hashes within a week.

7. The Following User Says Thank You to Cyan For This Useful Post:

svpv (8th October 2019)

8. Originally Posted by NohatCoder I think the lesson here should be: Don't do non-reversible state-mixing, i.e. when you have n possible internal states and you do some shuffling, there should be n possible results. Anything else is effectively discarding state.
Right, the theory is that you should only feed the data atop a well-mixed state, and then mix the state again bijectively.
That said, when you combine the data with the state, you are discarding / merging down some possibilities anyway. I wish some form of hashing theory were developed which would explain and further quantify how bad the effect of non-reversible mixing could be. If we are only permitted to use one 32x32->64 multiplication per 64-bit input, we'll have to put up with non-reversible mixing, because you can only multiply the data by another piece of data. You see, the name of the game is not "how to get a high-quality hash". The name of the game is rather "can you get away with murder with that few instructions and still get a decent 64-bit hash".

Also, note that there are various degrees of non-reversibility. Using the technique found in this gist you can quantify non-reversibility by exhaustively counting all 32-bit results. So let's scale ZrHa_update down to 8x8->16 multiplication / 32-bit state and see how many states it can retain.

// the primitive that we're testing
static inline uint32_t mix(uint32_t x)
{
    uint8_t x0 = (x >> 0);
    uint8_t x1 = (x >> 8);
    uint8_t x2 = (x >> 16);
    uint8_t x3 = (x >> 24);
    uint16_t m0 = x0 * x1;
    uint16_t m1 = x2 * x3;
    uint16_t state0 = m0 + (x2 << 8 | x3);
    uint16_t state1 = m1 + (x0 << 8 | x1);
    return (uint32_t) state0 << 16 | state1;
}

I get 1768M collisions, which is less than 1 bit lost. Without cross-pollination, i.e. with m0 + (x0 << 8 | x1), you get 3218M collisions, which is about 2 bits lost (a quarter of the 2^32 states are possible). Interestingly, XOR seems to work slightly better: with m0 ^ (x2 << 8 | x3) you get 1662M collisions. If I remember correctly, a truly random function (values assigned at random) would trigger 2^32/e ≈ 1580M collisions. This is the best we can hope for, unless we explicitly target reversible mixing. So it's not too bad; at least it doesn't leak like a sieve.

9. The interesting part isn't so much how many bits you lose in one iteration, rather it is how many bits you lose in a lot of iterations. In the ideal case you should lose ~29 bits in 1G iterations, dropping another bit every time the iteration count doubles. But we just established that your function isn't the ideal case, and we don't know that it follows a similar pattern. The established theory of non-reversible mixing is that it is ******* complicated, and therefore best avoided.

I will mention that Merkle-Damgård does non-reversible mixing by xoring the state with a previous state, after a long series of reversible mixing. I believe that this pretty much guarantees the ideal case. I still don't like Merkle-Damgård though, periodically reducing the state to output size does nothing good.

It is by the way not true that you can't multiply the input with itself in a reversible manner, you could do:

//Note: This function is for demonstrating a point only, I do not condone this as a hash function.
void update(uint64_t state[2], uint64_t data)
{
    uint64_t x0 = state[0] + data;
    state[0] = state[1] + (uint32_t) x0 * (x0 >> 32);
    state[1] = x0;
}

Multiplication between two non-constants still has some weird behaviour, in that the numbers themselves decide how far the mixing goes, with the ultimate undesirable case that one of the numbers is zero.

10. I've pushed a non-maquette implementation to github. There are three variants:

uint64_t ZrHa64_long_generic(const void *data, size_t len, uint64_t seed0, uint64_t seed1);
uint64_t ZrHa64_long_sse2(const void *data, size_t len, uint64_t seed0, uint64_t seed1);
uint64_t ZrHa64_long_avx2(const void *data, size_t len, uint64_t seed0, uint64_t seed1);

which have the states, respectively,

uint64_t state[8];
__m128i state[4];
__m256i state[2];

I checked that the return value is the same. (There should be a runtime switch, but that requires an extra portability layer. I haven't made up my mind whether to press on with any kind of a release or whether it's just an experiment.)

This introduces the routine that merges two states into one. It is fast and relatively weak (matching in its speed and weakness the update routine).

void ZrHa_merge(uint64_t state[2], const uint64_t other[2])
{
    uint64_t x0 = state[0] + rotl64(other[1], 32);
    uint64_t x1 = state[1] + rotl64(other[0], 32);
    uint64_t m0 = (uint32_t) x0 * (x0 >> 32);
    uint64_t m1 = (uint32_t) x1 * (x1 >> 32);
    state[0] = m0 + x1;
    state[1] = m1 + x0;
}

In "state[0] + rotl64(other[1], 32)", the rotl64 is somewhat important: this is how we prefer to combine two products. Combining state[0] with other[1] rather than with other[0] is less important. (It breaks the symmetry though, and further avoids combining the data at power-of-two distances.) There is no rotl64 in "m0 + x1"; that rotl64 only makes sense in a loop. We end up with two products again. Thus the routine is applied recursively: merge two AVX registers into a single AVX register, worth two SSE registers; merge those into a single SSE register; the last merge stops short, and the function returns with "state[0] + rotl64(state[1], 32)". This makes it a pretty fast mixer with just 2 or 3 SIMD multiplications (with AVX2 resp. SSE2). The result still passes all SMHasher tests.

11. The Following User Says Thank You to svpv For This Useful Post:

Cyan (8th October 2019)

12. That shuffle is pretty ugly for NEON. The main cheap shuffles NEON has all take roughly 3 cycles apiece (except the first one). Edit: The other problem is that vmull.u32 requires an entirely different setup than pmuludq; in scalar terms:

uint64_t pmuludq(uint64_t a, uint64_t b)
{
return (a & 0xFFFFFFFF) * (b & 0xFFFFFFFF);
}
uint64_t vmull_u32(uint32_t a, uint32_t b)
{
return (uint64_t)a * (uint64_t)b;
}

13. The Following User Says Thank You to easyaspi314 For This Useful Post:

Bulat Ziganshin (9th October 2019)

14. In order to do the ZrHa_update routine on NEON, we would need to have, at the same time, a DCBA, AC, and a BD permutation (or CA/DB).

I'm thinking something like this:
Code:
ZrHa_update_neon32:
vld1.64     {d0, d1}, [r0:128]
vld1.8      {d2, d3}, [r1]
vadd.i64    q0, q0, q1
vrev64.32   q1, q0
vtrn.32     d0, d1
vswp        d2, d3
vmlal.u32   q1, d0, d1
vst1.64     {d1, d2}, [r0:128]
bx lr
Code:
ZrHa_update_neon64:
ld1    q0, [x0]
ld1    q1, [x1]
add    v0.2d, v0.2d, v1.2d
xtn    v1.2s, v0.2d
shrn   v2.2s, v0.2d, #32
rev64  v0.4s, v0.4s
ext    v0.16b, v0.16b, v0.16b, #8
umlal  v0.2d, v1.2s, v2.2s
st1    q0, [x0]
ret
Not as clean as SSE2 (Wow, never thought I'd say that one!)

Code:
ZrHa_update_sse2: // x86_64, sysv
movdqu   xmm0, xmmword ptr[rsi]
paddq    xmm0, xmmword ptr[rdi]
pshufd   xmm1, xmm0, _MM_SHUFFLE(2, 3, 0, 1)
pshufd   xmm2, xmm0, _MM_SHUFFLE(0, 1, 2, 3)
pmuludq  xmm0, xmm1
paddq    xmm0, xmm2
movdqa   xmmword ptr[rdi], xmm0
ret
"XXH3_64b_round":
Code:
XXH3_64b_round_neon32:
vld1.64    {d0, d1}, [r0:128]
vld1.8     {d2, d3}, [r1]
vadd.i64   q0, q0, q1
vld1.8     {d4, d5}, [r2]
veor       q2, q2, q1
vtrn.32    d4, d5
vmlal.u32  q0, d4, d5
vst1.64    {d0, d1}, [r0:128]
bx lr
Code:
XXH3_64b_round_neon64:
ld1     q0, [x0]
ld1     q1, [x1]
add     v0.2d, v0.2d, v1.2d
ld1     q2, [x2]
eor     v2.16b, v2.16b, v1.16b
xtn     v1.2s, v2.2d
shrn    v2.2s, v2.2d, #32
umlal   v0.2d, v1.2s, v2.2s
st1     q0, [x0]
ret

15. If I understand the produced assembly as a comparison between the current XXH3 kernel and the proposed ZrHa one,
it does not look much worse.
In particular, the number of instructions seems to be the same in both cases.

Now the order and complexity of these instructions differ,
but is the difference that large?

The way I see it, the change:
- removes a 'load' and a 'xor' from getting rid of the secret
- adds 2 instructions `vrev` + `vext` to emulate the _MM_SHUFFLE(0, 1, 2, 3)

The `xor` is likely trivial, and the `load` is likely fetching hot data from the L1 cache.
But even an L1 fetch costs a few cycles.
I would expect such a cost to be comparable to, or even slightly higher than, the announced 3 cycles for `vrev` and `vext`.

16. It's a bit faster, but no massive speedups like SSE2 gets.

Code:
./xxhsum 0.7.2 (64-bits aarch64 little endian), Clang 8.0.1 (tags/RELEASE_801/final), by Yann Collet
Sample of 100 KB...
XXH32                    :     102400 ->    27899 it/s ( 2724.5 MB/s)
XXH32 unaligned          :     102400 ->    22728 it/s ( 2219.5 MB/s)
XXH64                    :     102400 ->    31618 it/s ( 3087.7 MB/s)
XXH64 unaligned          :     102400 ->    31660 it/s ( 3091.8 MB/s)
XXH3_64b                 :     102400 ->    61503 it/s ( 6006.2 MB/s)
XXH3_64b unaligned       :     102400 ->    56964 it/s ( 5562.9 MB/s)
XXH3_64b seeded          :     102400 ->    61156 it/s ( 5972.3 MB/s)
XXH3_64b seeded unaligne :     102400 ->    56355 it/s ( 5503.4 MB/s)
XXH128                   :     102400 ->    51273 it/s ( 5007.1 MB/s)
XXH128 unaligned         :     102400 ->    49356 it/s ( 4819.9 MB/s)
XXH128 seeded            :     102400 ->    51421 it/s ( 5021.6 MB/s)
XXH128 seeded unaligned  :     102400 ->    50749 it/s ( 4955.9 MB/s)
ZrHa64                   :     102400 ->    50611 it/s ( 4942.4 MB/s)
ZrHa64 unaligned         :     102400 ->    41221 it/s ( 4025.4 MB/s)
ZrHa64 (NEON)            :     102400 ->    67734 it/s ( 6614.7 MB/s)
ZrHa64 (NEON) unaligned  :     102400 ->    60503 it/s ( 5908.5 MB/s)
The speed difference makes sense though: it is only 2 or 3 cycles faster.

However, the main problem is that Clang is emitting vaddhn, slowing down the code. (Darn it, instcombine)

GCC 9.2.0 obeys my code, though:

Code:
ZrHa64 (NEON)            :     102400 ->    70132 it/s ( 6848.8 MB/s)
ZrHa64 (NEON) unaligned  :     102400 ->    62468 it/s ( 6100.4 MB/s)
However, it could definitely be improved, as there are multiple ways to end up with the shuffle I need.

17. The Following User Says Thank You to easyaspi314 For This Useful Post:

Cyan (10th October 2019)

18. It's no big deal to write an .S file in assembly if you can't cajole the compiler into emitting the right sequence of instructions.

19. The Following User Says Thank You to svpv For This Useful Post:

easyaspi314 (16th October 2019)

20. I had some thoughts on non-reversible mixing. It must be bad if you only have a 64-bit state and you want a 64-bit result. Interleaving doesn't change this if the lanes are independent. But that's not what we have with ZrHa_update. Since the lanes are "cross-pollinated", the irreversible mixing applies to the 128-bit state. So this might be all good if you only want a 64-bit result. And you can't get a decent 128-bit result anyway, because a single 32x32->64 multiplication isn't enough for it.

So what's "the mechanics" of non-reversible mixing? The internal state may "wear out" or "obliterate" gradually, but how and when does this happen? After being mixed, the state cannot assume certain values, about 1/e of all possible values. But as we feed the next portion of data, it seems that the state "somewhat recovers" in that it can assume more values. If the next portion is somewhat dissimilar to the mixed state, it is plausible to say that the state can assume any value again. (If we fed ASCII strings in with XOR, the high bit in each byte could never get flipped; but we feed with ADD.) Assuming that the state recovers, I can't immediately see how the non-reversible mix is worse than a reversible one. Can it be shown that the construction leaks, specifically more than 1 bit after a few iterations?

Of course, the worst case is that the state assumes progressively less values on each iteration, which happens if you feed zeroes into it. We can see how this happens in the scaled-down constructions with 8x8->16 or 16x16->32 multiplications.

#include <stdio.h>
#include <inttypes.h>

static inline uint32_t mix(uint32_t x)
{
    uint8_t x0 = (x >> 0);
    uint8_t x1 = (x >> 8);
    uint8_t x2 = (x >> 16);
    uint8_t x3 = (x >> 24);
    uint16_t m0 = x0 * x1;
    uint16_t m1 = x2 * x3;
    uint16_t y0 = m0 + (x2 << 8 | x3);
    uint16_t y1 = m1 + (x0 << 8 | x1);
    return (uint32_t) y0 << 16 | y1;
}

int main()
{
    uint32_t x = 2654435761;
    while (1) {
        x = mix(x);
        printf("%08" PRIx32 "\n", x);
    }
    return 0;
}

The construction collapses in fewer than 2^10 iterations:

$ ./a.out |head -$((1<<9)) |tail
ecd22f4e
e13e0fc7
4a8afd8d
15a3b5e1
422aef14
3cee1fc3
05d9fae7
ba9bec37
ce6ea88a
c95ee32c
$ ./a.out |head -$((1<<10)) |tail
2900a500
002900a5
2900a500
002900a5
2900a500
002900a5
2900a500
002900a5
2900a500
002900a5

If you change the combining step to XOR, the construction collapses in under 2^14 iterations:

uint16_t y0 = m0 ^ (x2 << 8 | x3);
uint16_t y1 = m1 ^ (x0 << 8 | x1);

$ ./a.out |head -$((1<<13)) |tail
572eb816
2187191a
85ab0b7e
aeef26dc
cf067e54
2f9750a4
a46fbfe9
c273aea3
1d08f488
89bd881c
$ ./a.out |head -$((1<<14)) |tail
a400a400
00a400a4
a400a400
00a400a4
a400a400
00a400a4
a400a400
00a400a4
a400a400
00a400a4

Here's the 16x16->32 version, which collapses in under 2^25 resp. 2^30 iterations (with ADD resp. XOR; I'll spare you the outputs).

#include <stdio.h>
#include <inttypes.h>

static inline uint64_t mix(uint64_t x)
{
    uint16_t x0 = (x >> 0);
    uint16_t x1 = (x >> 16);
    uint16_t x2 = (x >> 32);
    uint16_t x3 = (x >> 48);
    uint32_t m0 = (uint32_t) x0 * x1;
    uint32_t m1 = (uint32_t) x2 * x3;
    uint32_t y0 = m0 + ((uint32_t) x2 << 16 | x3);
    uint32_t y1 = m1 + ((uint32_t) x0 << 16 | x1);
    return (uint64_t) y0 << 32 | y1;
}

int main()
{
    uint64_t x = 6364136223846793005;
    while (1) {
        x = mix(x);
        printf("%016" PRIx64 "\n", x);
    }
    return 0;
}

By extrapolation, the construction with 32x32->64 multiplication must collapse in about 2^60 iterations.

Can it be shown that XOR as the combining step works better than ADD also in the average case rather than just in the worst case?

@NohatCoder, how did you calculate that ~29 bits must be leaked after 1G iterations? Was that the average case or the worst-case analysis, or does the distinction not matter?

21. The Following User Says Thank You to svpv For This Useful Post:

Cyan (17th October 2019)

22. True, but there are many drawbacks to writing in assembly:
1. It is platform-specific (and even subarch specific)
2. It can be somewhat difficult to follow with multiple lanes
3. Trying to figure out the best way to inline/unroll is annoying
4. It can't be inlined
5. It can't be constexpr'd

I usually prefer using tiny inline assembly hacks, which, while they mess up ordering a bit, can usually help a lot, especially this magic

__asm__("" : "+r" (var));

which can break up patterns, murder SLP vectorization, move loads outside of loops, and more. It is like a temporary volatile.

For example, on ARMv7-A, vzip.32 (a.k.a. vtrn.32) modifies both operands in place. If we use this on the even and odd halves of the vector (on ARMv7, Q-forms are unions of 2 half-width D-forms), we can end up with our vmlal.u32 setup in one instruction by vzipping in place, at the cost of clobbering data_key (which is ok). Edit: Oops, the arrows in the diagram are pointing to the wrong lanes. They should point to a >> 32 and b & 0xFFFFFFFF.

However, Clang and GCC can't comprehend an operation modifying 2 operands, and emit an extra vmov (a.k.a. vorr) to copy an operand.
This line

__asm__("vzip.32 %e0, %f0" : "+w" (data_key));

forces an in-place modification.

I only write things in assembly when I want tiny code or when I am just messing around. It is just a pain otherwise.

23. (image)

24. The Following User Says Thank You to easyaspi314 For This Useful Post:

Cyan (17th October 2019)

25. The ideal case for non-reversible mixing is a random mapping function from one state to another, i.e. for 128-bit states there exist (2^128)^(2^128) different (mostly non-reversible) mapping functions (only (2^128)! of them are reversible); if you pick a random one of those functions, you'll expect the state space to deteriorate by 29 bits in a billion applications.

This number can be found by iteration: if at iteration x, n out of p different states are possible, then at iteration x+1 the expected number of possible states is p*(1 - ((p-1)/p)^n). By experimentation, that turns into the bit loss being asymptotic to log2(iteration count) - 1.

But if your mapping function is not an ideal random pick, and it has some distinguisher (like being equivalent to a simple series of arithmetic operations), then the only thing we can really say theoretically is that it is unlikely to be better in this regard than a random pick among those functions. Your test functions stand out by entering loops faster than expected for a random function, and especially by always (?) entering period-2 loops.

26. The Following User Says Thank You to NohatCoder For This Useful Post:

easyaspi314 (17th October 2019)

27. Speaking of assembly, I have spooky64v2-x86_64.S, which closely corresponds to a slightly improved C version. The compiler cannot avoid register spilling, which in fact can narrowly be avoided (12 state registers with 15 general-purpose registers available), so when spelled out in assembly, the loop runs about 10% faster (I don't remember the exact figure). As you can see, there is a fair amount of boilerplate, some of it to support both the System V and the Microsoft x64 calling conventions, and some of it to make the rest of the code less error-prone.

Then I started to have doubts about SpookyHash. Leo Yuriev reports that it has "collisions with 4bit diff" (I don't know the details), but more importantly, the way the last portion of data is fed atop a state that has not been mixed, contrary to the trickle-feed theory, looks very suspicious (this is possibly the same issue that leads to the 4-bit diff collisions). If it wasn't Bob Jenkins, I would call it a blunder. I also wrote a JIT compiler to explore ARX constructions like SpookyHash, and specifically to find better rotation constants that maximize avalanche. Measuring avalanche is not a trivial thing though: some of it comes from the fact that the state must be sufficiently random, and some from the strength of the construction proper... I must add that I don't understand well all aspects of Bob Jenkins' theories.

28. ZrHa64_update when scaled down to 4 bits instead of 64 bits (with 2 state nibbles and 2 input nibbles, i.e. 256x256 combinations), generated with libpng instead of terminal escape codes and a screenshot. It seems that some values just don't occur, everything occurs a multiple of 256 times, and there is a huge favoring of 136.
Chart of value occurrences

Here is the code to generate it.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <png.h>

// 4-bit ZrHa64_update
void ZrHa4_update(uint8_t state[2], uint8_t data[2])
{
    uint8_t x0 = (state[0] + data[0]) % 16;
    uint8_t x1 = (state[1] + data[1]) % 16;
    uint8_t m0 = ((x0 % 4) * (x0 / 4)) % 16;
    uint8_t m1 = ((x1 % 4) * (x1 / 4)) % 16;
    uint8_t rot1 = ((x1 >> 2) | (x1 << 2)) % 16;
    uint8_t rot0 = ((x0 >> 2) | (x0 << 2)) % 16;
    state[0] = (m0 + rot1) % 16;
    state[1] = (m1 + rot0) % 16;
}

// Shameless copy of the libpng example code.
int main(void)
{
int width = 256, height = 256;
int code = 0;
FILE *fp = NULL;
png_structp png_ptr = NULL;
png_infop info_ptr = NULL;
png_bytep row = NULL;

// Open file for writing (binary mode)
fp = fopen("xormul.png", "wb");
if (fp == NULL) {
fprintf(stderr, "Could not open file xormul.png for writing\n");
code = 1;
goto finalise;
}

// Initialize write structure
png_ptr = png_create_write_struct(PNG_LIBPNG_VER_STRING, NULL, NULL, NULL);
if (png_ptr == NULL) {
fprintf(stderr, "Could not allocate write struct\n");
code = 1;
goto finalise;
}

// Initialize info structure
info_ptr = png_create_info_struct(png_ptr);
if (info_ptr == NULL) {
fprintf(stderr, "Could not allocate info struct\n");
code = 1;
goto finalise;
}

// Setup Exception handling
if (setjmp(png_jmpbuf(png_ptr))) {
fprintf(stderr, "Error during png creation\n");
code = 1;
goto finalise;
}

png_init_io(png_ptr, fp);

// Write header (8 bit colour depth)
png_set_IHDR(png_ptr, info_ptr, width, height,
8, PNG_COLOR_TYPE_RGB, PNG_INTERLACE_NONE,
PNG_COMPRESSION_TYPE_BASE, PNG_FILTER_TYPE_BASE);

png_write_info(png_ptr, info_ptr);

// Allocate memory for one row (3 bytes per pixel - RGB)
row = (png_bytep) malloc(3 * width * sizeof(png_byte));

// Write image data
int x, y;

// Count the outputs
int outputs[256] = {0};
for (y=0 ; y<height ; y++) {
for (x=0 ; x<width ; x++) {
// Split up the nibbles
uint8_t state[2] = { (uint8_t)(y % 16), (uint8_t)(y / 16) % 16 };
uint8_t data[2] = { (uint8_t)(x % 16), (uint8_t)(x / 16) % 16 };

// Run our downscaled update routine
ZrHa4_update(state, data);
// Combine the state back together
uint8_t code = state[0] + (state[1] << 4);
// Log the occurrence
++outputs[code];
// Insert the pixel, with the R, G, and B being the outputted value.
row[x*3] = row[x*3+1] = row[x*3+2] = code;
}
png_write_row(png_ptr, row);
}
// Dump CSV of all of the outputs to stdout.
printf("value,amount\n");
for (int i = 0; i < 256; i++) {
printf("%4d,%4d\n", i, outputs[i]);
}
// End write
png_write_end(png_ptr, NULL);

finalise:
if (fp != NULL) fclose(fp);
if (info_ptr != NULL) png_free_data(png_ptr, info_ptr, PNG_FREE_ALL, -1);
if (png_ptr != NULL) png_destroy_write_struct(&png_ptr, (png_infopp)NULL);
if (row != NULL) free(row);

return code;
}

29. Yeah, Spooky v2 is broken. You can generate high-probability collisions by changing a single bit in the final 8 bytes of the second-to-last block, and then undo that change by flipping the two affected bits in the final 16 bytes of the last block. There is a pattern switch for the handling of the last block, and it means that data gets ingested into the same spot twice with virtually no mixing in between.

30. @easyaspi314 Yeah, that is the expected pattern: there are only 256 different possible states after mixing in the plaintext, and you are generating each of those states exactly 256 times. If you want to do something fun, take a slightly bigger version of the function, take each of the possible states and iterate a number of times with a fixed input; then we can do stats on the resulting distribution. Do things get worse/better with a different fixed input? I'd say probably, but I don't know.

31. The Following User Says Thank You to NohatCoder For This Useful Post:

easyaspi314 (18th October 2019)

32. Here's 10 bits instead of 4 bits:

ZrHa Chart

Here's XXH3_64b: Chart

For the key, I used different chunks of kSecret on each row.

About 1/4 of the values end up being zero in ZrHa, but XXH3 has a fairly even distribution.

33. The Following User Says Thank You to easyaspi314 For This Useful Post:

Cyan (18th October 2019)

34. Intriguing. It seems like XXH3_128b (swapped acc + input) has perfect distribution. Each value appears exactly 1024 times according to my output.

@Cyan is it correct that XXH128 has even distribution?

Wait, hold on, there is a huge error in my code - I'm only testing the low bits lol

35. XXH3_128b should have a slightly better distribution than XXH3_64b.
I would not expect the difference to be huge, just visible and measurable.

36. The Following User Says Thank You to Cyan For This Useful Post:

easyaspi314 (18th October 2019)

37. I used 16-bit values and if I only use the first lane, everything is roughly even.

To test both 16-bit accumulators, it would require…

• ~86 GiB of storage for a CSV with an average 21.7 byte line length
• 32 GB of memory to store the counter table ((2 ^ 32) * sizeof(uint64_t))
• A lot of processing power
• A lot of time

I currently have this:

• 147 GB available disk space on a mediocre SSD
• 8 GB RAM
• A 2.0 GHz i7-2635QM on a 2011 MacBook Pro, which has thermal management GPUs die for
• Zero patience

I think I could do it, right?

38. Originally Posted by easyaspi314 Here's 10 bits instead of 4 bits
I'm not sure exactly what you are graphing here, but it looks bad; we generally want white noise.

39. Nah, that was a screwup on my part: I accidentally did 5-bit inputs instead of 10, so we had a lot of duplicates. The correct one is just a larger version of the 4-bit one.
