ZrHa_update - a fast construction for iterated hashing
I've been pondering over the XXH3 main loop for big inputs. It is based on the UMAC construction / NH hash-function family. NH basically means you multiply the data with itself. The original NH goes like this:
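Here is one NH step, sketched from memory of the UMAC paper (packing two 32-bit halves into a 64-bit word is my choice, to match the code below):

void update(uint64_t *state, uint64_t data, uint64_t key)
{
    // offset each 32-bit half by key material (mod 2^32),
    // then multiply the halves into a 64-bit accumulator
    uint32_t lo = (uint32_t) data + (uint32_t) key;
    uint32_t hi = (uint32_t)(data >> 32) + (uint32_t)(key >> 32);
    *state += (uint64_t) lo * hi;
}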
Before multiplication, the data is "normalized" with random key material. The purpose is to reduce the chance of multiplication going wrong, e.g. long stretches of zero bits in the multipliers should become unlikely. But still, multiplication by 0 cannot be ruled out, with probability 2^-32. This is kind of okay for a 32-bit hash function, which the original UMAC strives to provide. But this is way too bad for even a remotely decent 64-bit hash. Another problem is that the multiplication is not invertible, and may lose data too early. Both problems can be addressed by re-adding the data atop the product.
void update(uint64_t *state, uint64_t data, uint64_t key)
{
    uint64_t x = data + key;               // "normalize" the data with key material
    uint64_t m = (uint32_t) x * (x >> 32); // 32x32->64 multiplication of the halves
    *state += m + x;                       // re-add the data atop the product
}
But there is an even better way, which Yann calls cross-pollination. Since any practical implementation would have multiple states anyway, we can re-add the data atop the other product:
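A sketch of that with two lanes (same keyed setup as above; the variable names are mine):

void update(uint64_t state[2], uint64_t data[2], uint64_t key[2])
{
    uint64_t x0 = data[0] + key[0];
    uint64_t x1 = data[1] + key[1];
    uint64_t m0 = (uint32_t) x0 * (x0 >> 32);
    uint64_t m1 = (uint32_t) x1 * (x1 >> 32);
    // re-add each lane's data atop the *other* lane's product
    state[0] += m0 + x1;
    state[1] += m1 + x0;
}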
By this time we've moved away from the original NH far enough, and it didn't promise much anyway. So, I thought, let's rethink the construction. Why do we even need the key? We want the multipliers to be random-looking. But if we manage to keep the state itself random-looking, then there is no need for the key. Instead, we can "normalize" the data by simply feeding it right into the state:
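In code, the keyless variant looks like this (my sketch; this is the update the test program below calls):

void update(uint64_t state[2], uint64_t data[2])
{
    // "normalize" the data against the (hopefully random-looking) state
    uint64_t x0 = state[0] + data[0];
    uint64_t x1 = state[1] + data[1];
    uint64_t m0 = (uint32_t) x0 * (x0 >> 32);
    uint64_t m1 = (uint32_t) x1 * (x1 >> 32);
    // cross-pollination: each lane re-added atop the other lane's product
    state[0] = m0 + x1;
    state[1] = m1 + x0;
}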
Now, what can go wrong with this construction? It feels that there's an increased risk of the data interfering with itself, but at least there is no trivial way for the data to self-cancel. So I tried various clever ways to subvert the construction, only to discover a rather simple one: just feed zeros into it, and it will break soon enough.
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main()
{
    uint64_t state[2] = {
        UINT64_C(0x736f6d6570736575),
        UINT64_C(0x646f72616e646f6d),
    };
    uint64_t data[2] = { 0, 0 };
    for (int i = 0; i < 256; i++) {
        update(state, data);
        if (i % 16 == 15)
            printf("%016" PRIx64 " %016" PRIx64 "\n", state[0], state[1]);
    }
    return 0;
}
It is easier to understand what happens here if we revert, for a moment, to a non-interleaved variant, i.e. state[0] = m0 + x0. This variant degenerates much faster (in about 64 instead of 256 iterations). So what happens here is that we have the iterations like
lo,hi = lo*hi + lo,hi
lo,hi = lo*hi + lo,hi
where "lo,hi" designates a 64-bit quantity constructed from 32-bit halves. Now, I don't have a mathematical proof, but what seems to be happening here is that the "lo" half is eventually a multiple of the original "lo" half, and on each iteration, this multiple is more likely to be even than odd. Once "lo,hi" is even, there is no way for "lo*hi + lo,hi" to become odd. It can only stay even or become "more even". And becoming "more even" means pushing the multiplier towards zero.
Fortunately this flaw can be easily fixed: the iterations should be
lo,hi = lo*hi + hi,lo
lo,hi = lo*hi + hi,lo
i.e. we swap the halves from the last multiplication, which disrupts the pattern of becoming "more even". And it further makes sense to combine two multiplications in exactly this way: instead of combining good bits with good bits and bad bits with bad bits, we "normalize" the total avalanche effect of both products. It results in much faster diffusion of the data from the previous iterations.
So here's the final construction I would like to discuss:
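Putting the pieces together (my rendering; the half-swap is written as a 32-bit rotation, with rotl64 as defined alongside rrmxmx below):

void ZrHa_update(uint64_t state[2], const uint64_t data[2])
{
    uint64_t x0 = state[0] + data[0];
    uint64_t x1 = state[1] + data[1];
    uint64_t m0 = (uint32_t) x0 * (x0 >> 32);
    uint64_t m1 = (uint32_t) x1 * (x1 >> 32);
    // cross-pollinate with the halves swapped
    state[0] = m0 + rotl64(x1, 32);
    state[1] = m1 + rotl64(x0, 32);
}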
Here is the avalanche diagram of ZrHa_update (along with the source code to generate it). It is a 128x128 XY-diagram: you flip one bit on the X axis and see which bits get flipped in response. The long horizontal lines do not matter; it's the ripples where the entropy is. The four rhombi are contributed by the two products, and above or below each rhombus there is an oblique line segment about 2 bits thick; those are the data re-added atop the products.
Finally, I constructed a maquette, a scaled-down copy of the complete hash function.
static inline uint64_t rrmxmx(uint64_t x)
{
#define rotl64(x, k) (((x) << (k)) | ((x) >> (64 - (k))))
#define rotr64(x, k) (((x) >> (k)) | ((x) << (64 - (k))))
    // rrmxmx, Pelle Evensen's known-good 64-bit mixer
    x ^= rotr64(x, 49) ^ rotr64(x, 24);
    x *= UINT64_C(0x9FB21C651E98DF25);
    x ^= x >> 28;
    x *= UINT64_C(0x9FB21C651E98DF25);
    x ^= x >> 28;
    return x;
}
The real implementation would probably have state[8] instead of state[2], for two AVX2 registers. But the maquette is easier to study and reason about. Actually I have a sophisticated SIMD routine to merge state[8] down to state[2], but to study ZrHa_update, I just borrow a known-good mixer called rrmxmx. And short strings are handled with another known-good hash function (not shown here). The resulting hash function passes all SMHasher tests.
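For concreteness, the maquette's tail end might look about like this (hypothetical sketch; only the rotl64 fold and the rrmxmx call are implied by the text):

static uint64_t ZrHa64_finalize(const uint64_t state[2])
{
    // fold the two lanes the way two products are combined elsewhere,
    // then apply the known-good mixer (hypothetical arrangement)
    return rrmxmx(state[0] + rotl64(state[1], 32));
}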
Regarding quality:
Passing SMHasher is a good first stage. It's already promising.
These days, I also use a massive collision generator (calculating 25-100 billion hashes) to properly estimate collision probabilities for 64-bit hashes. (SMHasher tests are too short for that; they are good enough for 32-bit hashes, but can't "see" collisions beyond 50 bits.)
I could give your algorithm a try. The main downside is that getting results can take very long, and the machine is already in use right now. But I can find a slot later this week.
Regarding speed:
The algorithm introduces a dependency between each round, which will put a lower bound on the possible latency per loop.
This can be partially tamed by widening the number of accumulators, though there are other issues to consider (running out of registers, heavier joining section, etc.).
So it's interesting to measure. I would expect that, at this stage, you already have some numbers.
I could also test it on my side, but I guess the current "maquette" is not meant for that, and would produce non-representative results.
I think the lesson here should be: Don't do non-reversible state-mixing, i.e. when you have n possible internal states and you do some shuffling, there should be n possible results. Anything else is effectively discarding state.
This function does not adhere to that principle as, for instance, the state (0x8000000000000001, 0x8000000000000001) and the state (0x0000000000000002, 0x0000000000000002) both permute to (0x0000000200000000, 0x0000000200000000).
I don't know how badly the function as stated is discarding state, or what conditions might trigger a critical state drop (I wouldn't be surprised if this function has a state-destroying nemesis, the way its sibling has the all-zero input); I just know that there is a whole world of potential problems that can be avoided by following this simple rule.
For your information @svpv, I tested ZrHa64_maquette for collisions directly on my desktop.
It's more limited, so I had to reduce the test to 13 billion hashes.
That's not enough for a good estimation, but it can detect hashes with less than ideal distributions.
ZrHa64_maquette generated 4 collisions, which is in line with expectation (5).
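(For reference, the birthday estimate n(n-1)/2^65 with n = 13 x 10^9 gives about 4.6 expected collisions for an ideal 64-bit hash, which is where the 5 comes from.)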
This is a good beginning.
I hope to find the time to make a proper evaluation with more hashes within a week.
> I think the lesson here should be: Don't do non-reversible state-mixing, i.e. when you have n possible internal states and you do some shuffling, there should be n possible results. Anything else is effectively discarding state.
Right, the theory is that you should only feed the data atop a well-mixed state, and then mix the state again bijectively.
That said, when you combine the data with the state, you are discarding / merging down some possibilities anyway. I wish some form of hashing theory were developed which would explain and further quantify how bad the effect of non-reversible mixing could be. If we are only permitted to use one 32x32->64 multiplication per 64-bit input, we'll have to put up with non-reversible mixing, because you can only multiply the data by another piece of data. You see, the name of the game is not "how to get a high-quality hash". The name of the game is rather "can you get away with murder with that few instructions and still get a decent 64-bit hash".
Also, note that there are various degrees of non-reversibility. Using the technique found in this gist you can quantify non-reversibility by exhaustively counting all 32-bit results. So let's scale ZrHa_update down to 8x8->16 multiplication / 32-bit state and see how many states it can retain.
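A minimal harness for that count could look like this (my sketch, not the gist's code; ZrHa16_update is my name for the 8x8->16 scale-down, with rotl16 playing the role of rotl64):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>

#define rotl16(x, k) ((uint16_t)((x) << (k) | (x) >> (16 - (k))))

static void ZrHa16_update(uint16_t state[2], const uint16_t data[2])
{
    uint16_t x0 = state[0] + data[0];
    uint16_t x1 = state[1] + data[1];
    uint16_t m0 = (uint8_t) x0 * (x0 >> 8);
    uint16_t m1 = (uint8_t) x1 * (x1 >> 8);
    state[0] = m0 + rotl16(x1, 8);
    state[1] = m1 + rotl16(x0, 8);
}

int main(void)
{
    // one bit per possible 32-bit state: a 512 MiB bitmap
    unsigned char *seen = calloc(1u << 29, 1);
    uint64_t distinct = 0;
    for (uint64_t s = 0; s < UINT64_C(1) << 32; s++) {
        uint16_t state[2] = { (uint16_t) s, (uint16_t)(s >> 16) };
        uint16_t data[2] = { 0, 0 }; // fixed data, all 2^32 states enumerated
        ZrHa16_update(state, data);
        uint32_t out = (uint32_t) state[0] | (uint32_t) state[1] << 16;
        if (!(seen[out >> 3] >> (out & 7) & 1)) {
            seen[out >> 3] |= (unsigned char)(1 << (out & 7));
            distinct++;
        }
    }
    printf("collisions: %" PRIu64 "\n", (UINT64_C(1) << 32) - distinct);
    free(seen);
    return 0;
}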
I get 1768M collisions, which is less than 1 bit lost. Without cross-pollination, i.e. with m0 + (x0 << 8 | x1), you get 3218M collisions, which is about 2 bits lost (only a quarter of the 2^32 states remain possible). Interestingly, XOR seems to work slightly better: with m0 ^ (x2 << 8 | x3) you get 1662M collisions. If I remember correctly, a truly random function (values assigned at random) would trigger 2^32/e ≈ 1580M collisions. This is the best we can hope for, unless we explicitly target reversible mixing. So it's not too bad; at least it doesn't leak like a sieve.
The interesting part isn't so much how many bits you lose in one iteration, rather it is how many bits you lose in a lot of iterations. In the ideal case you should lose ~29 bits in 1G iterations, dropping another bit every time the iteration count doubles. But we just established that your function isn't the ideal case, and we don't know that it follows a similar pattern. The established theory of non-reversible mixing is that it is ******* complicated, and therefore best avoided.
I will mention that Merkle-Damgård does non-reversible mixing by xoring the state with a previous state, after a long series of reversible mixing. I believe that this pretty much guarantees the ideal case. I still don't like Merkle-Damgård though, periodically reducing the state to output size does nothing good.
It is by the way not true that you can't multiply the input with itself in a reversible manner, you could do:
//Note: This function is for demonstrating a point only, I do not condone this as a hash function.
void update(uint64_t state[2], uint64_t data)
{
    uint64_t x0 = state[0] + data;
    // invertible: state[1] hands back x0, so the product can be recomputed
    // and subtracted, and (given data) the old state[0] is x0 - data
    state[0] = state[1] + (uint32_t) x0 * (x0 >> 32);
    state[1] = x0;
}
Multiplication between two non-constants still has some weird behaviour, in that the numbers themselves decide how far the mixing goes, with the ultimate undesirable case that one of the numbers is zero.
Last edited by NohatCoder; 17th October 2019 at 00:11.
Reason: Minor maths error, it is 29, not 22 bits.
I checked that the return value is the same. (There should be a runtime switch, but that requires an extra portability layer. I haven't made up my mind whether to press on with any kind of a release or whether it's just an experiment.)
This introduces the routine that merges two states into one. It is fast and relatively weak (matching in its speed and weakness the update routine).
In "state[0] + rotl64(other[1], 32)", rotl64 is somewhat important: this is how we prefer to combine two products. Combining state[0] with other[1] rather than with other[0] is less important. (It breaks the symmetry though and further eschews combining the data at power-of-two distances.) There is no rotl64 in "m0 + x1", that rotl64 only makes sense in a loop. We end up with two products again. Thus the routine is applied recursively: merge two AVX registers into a single AVX register, worth two SSE registers, merge them into a single SSE register, the last merge stops short and the function returns with "state[0] + rotl64(state[1], 32)". This makes it a pretty fast mixer with just 2 or 3 SIMD multiplications (resp. with AVX2 or SSE2). The result still passes all SMHasher tests.
If I understand the produced assembly correctly, as a comparison between the current XXH3 kernel and the proposed ZrHa one, it does not look much worse. In particular, the number of instructions seems to be the same in both cases. The order and complexity of these instructions differ, but is the difference that large?
The way I see it, the change:
- removes a `load` and a `xor` from getting rid of the secret
- adds 2 instructions, `vrev` + `vext`, to emulate _MM_SHUFFLE(0, 1, 2, 3)
The `xor` is likely trivial, and the `load` is likely fetching hot data from the L1 cache.
But even an L1 cache fetch costs a few cycles.
I would expect such a cost to be comparable to, or even slightly higher than, the announced 3 cycles for `vrev` + `vext`.
I had some thoughts on non-reversible mixing. It must be bad if you only have a 64-bit state and you want a 64-bit result. Interleaving doesn't change this if the lanes are independent. But that's not what we have with ZrHa_update. Since the lanes are "cross-pollinated", the irreversible mixing applies to the 128-bit state. So this might be all good if you only want a 64-bit result. And you can't get a decent 128-bit result anyway, because a single 32x32->64 multiplication isn't enough for it.
So what's "the mechanics" of non-reversible mixing? The internal state may "wear out" or "obliterate" gradually, but how and when does this happen? After being mixed, the state cannot assume certain values, about 1/e of all possible values. But as we feed the next portion of data, it seems that the state "somewhat recovers" in that it can assume more values. If the next portion is somewhat dissimilar to the mixed state, it is plausible to say that the state can assume any value again. (If we fed ascii strings with XOR, this could not have flipped the high bit in each byte, but we feed with ADD.) Assuming that the state recovers, I can't immediately see how the non-reversible mix is worse than a reversible one. Can it be show that the construction leaks, specifically more than 1 bit after a few iterations?
Of course, the worst case is that the state assumes progressively less values on each iteration, which happens if you feed zeroes into it. We can see how this happens in the scaled-down constructions with 8x8->16 or 16x16->32 multiplications.
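Here is the mix function used by the driver below, for the 16x16->32 case (packing the two 32-bit lanes into one uint64_t is my assumption):

// one zero-data iteration of the 16x16->32 scale-down
static uint64_t mix(uint64_t s)
{
    uint32_t x0 = (uint32_t) s;
    uint32_t x1 = (uint32_t)(s >> 32);
    uint32_t m0 = (uint16_t) x0 * (x0 >> 16);
    uint32_t m1 = (uint16_t) x1 * (x1 >> 16);
    uint32_t y0 = m0 + (x1 << 16 | x1 >> 16); // + rotl32(x1, 16)
    uint32_t y1 = m1 + (x0 << 16 | x0 >> 16); // + rotl32(x0, 16)
    return (uint64_t) y1 << 32 | y0;
}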
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main()
{
    uint64_t x = UINT64_C(6364136223846793005);
    while (1) {
        x = mix(x);
        printf("%016" PRIx64 "\n", x);
    }
    return 0;
}
By extrapolation, the construction with 32x32->64 multiplication must collapse in about 2^60 iterations.
Can it be shown that XOR as the combining step works better than ADD also in the average case rather than just in the worst case?
@NohatCoder, how did you calculate that 22 bits must be leaked after 1G iterations? Was that the average case or the worst-case analysis, or does the distinction not matter?
True, but there are many drawbacks to writing in assembly:
- It is platform-specific (and even subarch-specific)
- It can be somewhat difficult to follow with multiple lanes
- Trying to figure out the best way to inline/unroll is annoying
- It can't be inlined
- It can't be constexpr'd
I usually prefer using tiny inline assembly hacks, which while they mess up ordering a bit, can usually help a lot, especially this magic
__asm__("" : "+r" (var));
which can break up patterns, murder SLP vectorization, move loads outside of loops, and more. It is like a temporary volatile.
For example, on ARMv7-A, vzip.32 (a.k.a. vtrn.32) modifies in place:
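Roughly, for two D registers each holding two 32-bit lanes (a sketch of the effect):

@ vzip.32 d0, d1 interleaves the lanes of both registers, in place:
@ before: d0 = { a0, a1 }   d1 = { b0, b1 }
@ after:  d0 = { a0, b0 }   d1 = { a1, b1 }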
If we use this on the even and odd halves of the vector (on ARMv7, Q-forms are unions of 2 half-width D-forms), we can end up with our vmlal.u32 setup in one instruction by vzipping in place at the cost of clobbering data_key (which is ok):
Edit: Oops, the arrows are pointing to the wrong lanes. They should point to a >> 32 and b & 0xFFFFFFFF
However, Clang and GCC can't comprehend an operation modifying 2 operands, and emit an extra vmov (a.k.a. vorr) to copy an operand.
This line
__asm__("vzip.32 %e0, %f0" : "+w" (data_key));
forces an in-place modification.
I only write things in assembly when I want tiny code or when I am just messing around. It is just a pain otherwise.
The ideal case for non-reversible mixing is a random mapping function from one state to another, i.e. for 128-bit states, there exist (2^128)^(2^128) different (mostly non-reversible) mapping functions (of which (2^128)! are reversible); if you pick a random one of those functions, you'll expect the state-space to deteriorate by 29 bits in a billion applications.
This number can be found by iteration, if at iteration x, n out of p different states are possible, then at iteration x+1 the expected number of possible states is p*(1 - ((p-1)/p)^n). By experimentation that turns into the bit loss being asymptotic to log2(iteration count)-1.
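To illustrate, here's that recurrence iterated in terms of the ratio r = n/p, using 1 - exp(-r) as the large-p limit of 1 - ((p-1)/p)^n (a quick sketch, not NohatCoder's code):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double r = 1.0; // fraction of states still possible
    for (long i = 0; i < 1L << 30; i++)
        r = 1.0 - exp(-r); // E[n']/p = 1 - ((p-1)/p)^n -> 1 - exp(-n/p)
    printf("bits lost after 2^30 iterations: %.2f\n", -log2(r));
    // prints about 29, i.e. log2(iteration count) - 1
    return 0;
}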
But if your mapping function is not an ideal random pick, and it has some distinguisher (like being equivalent to a simple series of arithmetic operations), then the only thing we can really say theoretically is that it is unlikely to be better in this regard than a random pick amongst the functions. Your test functions stand out by entering loops faster than expected for a random function, and especially by always (?) entering into period 2 loops.
Speaking of assembly, I have spooky64v2-x86_64.S, which closely corresponds to a slightly improved C version. The compiler cannot avoid register spilling, which in fact can be narrowly avoided (12 state registers with 15 general-purpose registers available), so when spelled out in assembly, the loop runs about 10% faster (I don't remember the exact figure). As you can see, there is a fair amount of boilerplate, some of it to support both the System V and Microsoft x64 calling conventions, and some of it to make the rest of the code less error-prone.
Then I started to have doubts about SpookyHash. Leo Yuriev reports that it has "collisions with 4bit diff" (I don't know the details), but more importantly, the way the last portion of data is fed atop the state that's not been mixed, contrary to the trickle-feed theory, looks very suspicious (this is possibly the same issue that leads to the 4-bit-diff collisions). If it wasn't Bob Jenkins, I would call it a blunder. I also wrote a JIT compiler to explore ARX constructions like SpookyHash, and specifically to find better rotation constants that maximize avalanche. Measuring avalanche is not a trivial thing though: some of it comes from the fact that the state must be sufficiently random, and some from the strength of the construction proper... I must add that I don't understand well all aspects of Bob Jenkins' theories.
Here is ZrHa64_update scaled down to 4 bits instead of 64 (with 2 state nibbles and 2 input nibbles = 256x256), generated with libpng instead of terminal escape codes and a screenshot:
It seems that some values just don't occur, everything occurs a multiple of 256 times, and there is a huge favoring of 136. Chart of value occurrences
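The downscaled routine itself wasn't shown; here is a plausible 4-bit rendering (my sketch: 4-bit lanes, 2x2->4 multiplication, and a 2-bit rotation standing in for rotl64(x, 32)):

static void ZrHa4_update(uint8_t state[2], const uint8_t data[2])
{
    uint8_t x0 = (state[0] + data[0]) & 15;
    uint8_t x1 = (state[1] + data[1]) & 15;
    uint8_t m0 = (x0 & 3) * (x0 >> 2);
    uint8_t m1 = (x1 & 3) * (x1 >> 2);
    state[0] = (m0 + ((x1 << 2 | x1 >> 2) & 15)) & 15;
    state[1] = (m1 + ((x0 << 2 | x0 >> 2) & 15)) & 15;
}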
// Shameless copy of the libpng example code.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <png.h>

int main(void)
{
    int width = 256, height = 256;
    int code = 0;
    FILE *fp = NULL;
    png_structp png_ptr = NULL;
    png_infop info_ptr = NULL;
    png_bytep row = NULL;
    // Open file for writing (binary mode)
    fp = fopen("xormul.png", "wb");
    if (fp == NULL) {
        fprintf(stderr, "Could not open file xormul.png for writing\n");
        code = 1;
        goto finalise;
    }
    // Initialize the write and info structures (restored from the libpng
    // example this code copies; error checks condensed for brevity)
    png_ptr = png_create_write_struct(PNG_LIBPNG_VER_STRING, NULL, NULL, NULL);
    info_ptr = png_create_info_struct(png_ptr);
    if (png_ptr == NULL || info_ptr == NULL || setjmp(png_jmpbuf(png_ptr))) {
        fprintf(stderr, "Could not initialize libpng\n");
        code = 1;
        goto finalise;
    }
    png_init_io(png_ptr, fp);
    // Write header (8-bit colour depth, RGB)
    png_set_IHDR(png_ptr, info_ptr, width, height, 8, PNG_COLOR_TYPE_RGB,
                 PNG_INTERLACE_NONE, PNG_COMPRESSION_TYPE_BASE, PNG_FILTER_TYPE_BASE);
    png_write_info(png_ptr, info_ptr);
    // Allocate memory for one row (3 bytes per pixel - RGB)
    row = (png_bytep) malloc(3 * width * sizeof(png_byte));
    // Write image data
    int x, y;
    // Count the outputs
    int outputs[256] = {0};
    for (y = 0; y < height; y++) {
        for (x = 0; x < width; x++) {
            // Split up the nibbles
            uint8_t state[2] = { (uint8_t)(y % 16), (uint8_t)(y / 16) % 16 };
            uint8_t data[2] = { (uint8_t)(x % 16), (uint8_t)(x / 16) % 16 };
            // Run our downscaled update routine
            ZrHa4_update(state, data);
            // Combine the state back together
            uint8_t out = state[0] + (state[1] << 4);
            // Log the occurrence
            ++outputs[out];
            // Insert the pixel, with the R, G, and B being the outputted value.
            row[x*3] = row[x*3+1] = row[x*3+2] = out;
        }
        png_write_row(png_ptr, row);
    }
    // Dump CSV of all of the outputs to stdout.
    printf("value,amount\n");
    for (int i = 0; i < 256; i++) {
        printf("%4d,%4d\n", i, outputs[i]);
    }
    // End write
    png_write_end(png_ptr, NULL);
finalise:
    if (fp != NULL) fclose(fp);
    if (info_ptr != NULL) png_free_data(png_ptr, info_ptr, PNG_FREE_ALL, -1);
    if (png_ptr != NULL) png_destroy_write_struct(&png_ptr, (png_infopp)NULL);
    if (row != NULL) free(row);
    return code;
}
Yeah, Spooky v2 is broken. You can generate a high-probability collision by changing a single bit in the final 8 bytes of the second-to-last block, and then undo that change by flipping the two affected bits in the final 16 bytes of the last block. There is a pattern switch for the handling of the last block, and it means that data gets ingested into the same spot twice with virtually no mixing in between.
@easyaspi314 Yeah, that is the expected pattern: there are only 256 different possible states after mixing in the plaintext, and you are generating each of those states exactly 256 times. If you want to do something fun, take a slightly bigger version of the function, take each of the possible states and iterate a number of times with a fixed input, then we can do stats on the resulting distribution. Do things get worse/better with a different fixed input? I'd say probably, but I don't know.
Nah, that was a screwup on my part, I accidentally did 5-bit inputs instead of 10, so we had a lot of duplicates. The correct one is just a larger version of the 4-bit one.