It was soon discovered that this construction produces too many collisions. The main reason, as explained by Leo Yuriev down the thread, is that a single round of AES does not produce enough avalanche: only about 20 output bits respond to a single-bit flip in the input, while 32 or 64 are required to produce a decent 64- or 128-bit hash.
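The kind of measurement behind that figure is easy to reproduce. Below is a minimal avalanche harness (my own sketch, not from the thread); it uses the well-known splitmix64 finalizer as a stand-in mixer, but one AES round could be measured the same way by swapping in the intrinsic. It reports the average number of output bits that respond to a single-bit input flip.

```c
#include <stdint.h>

/* splitmix64 finalizer (public-domain constants), used here as a
   stand-in mixer; substitute one AESDEC round to reproduce the
   ~20-bit figure quoted above. */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

static int popcount64(uint64_t x)
{
    int n = 0;
    for (; x; x &= x - 1)
        n++;
    return n;
}

/* Average number of output bits that flip in response to a single
   input-bit flip, over `trials` inputs. An ideal 64-bit mixer
   scores about 32. */
double avg_avalanche(int trials)
{
    uint64_t seed = 0;
    long long total = 0;
    for (int t = 0; t < trials; t++) {
        uint64_t x = (seed += 0x9e3779b97f4a7c15ULL);
        uint64_t h = mix64(x);
        for (int bit = 0; bit < 64; bit++)
            total += popcount64(h ^ mix64(x ^ (1ULL << bit)));
    }
    return (double)total / ((double)trials * 64.0);
}
```

A full-strength mixer lands near 32 here; a construction that scores far below half the output width is the kind of thing smhasher-style testing is meant to catch.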
Recently they pushed another version, v0.5, which allegedly solves the problem of collisions, and it looks so different that it's hard to believe it was written by the same guy (or the same lady). It goes like this:
So the state is xmm0-xmm7 (128 bytes), and they consume 256 bytes at each iteration, 32 bytes per MEOW_MIX. If you look at the columns of arguments to MEOW_MIX, the registers are cycled - the construction is similar to the trickle-feed design used by Bob Jenkins in SpookyHash. The 32 bytes for each MIX further come in 4 overlapping loads. The details are as follows:
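To make the trickle-feed idea concrete, here is a scalar sketch of the register cycling - my own illustration, not Meow's actual MEOW_MIX: the lane count matches xmm0-xmm7 but the mixing function is a multiply standing in for AESDEC. Each block is injected into a rotating subset of lanes, so every lane receives input on a staggered schedule.

```c
#include <stdint.h>
#include <string.h>

/* 8 scalar lanes stand in for xmm0-xmm7. */
typedef struct { uint64_t lane[8]; } State;

/* Scalar stand-in for one AESDEC round (MurmurHash3 fmix64 constant). */
static uint64_t mix(uint64_t a, uint64_t b)
{
    a ^= b;
    a ^= a >> 29;
    a *= 0xff51afd7ed558ccdULL;
    return a;
}

/* Inject a 32-byte block into a rotating window of lanes; `i` is the
   block index, so consecutive blocks feed different lane subsets and
   a neighbouring lane "trickles" into each mix. */
void absorb_block(State *s, const uint8_t block[32], int i)
{
    uint64_t w[4];
    memcpy(w, block, 32);
    for (int j = 0; j < 4; j++) {
        int dst = (i + j) % 8;       /* lane receiving input */
        int src = (i + j + 1) % 8;   /* neighbour lane trickling in */
        s->lane[dst] = mix(s->lane[dst] + w[j], s->lane[src]);
    }
}
```

The point of the rotation is that no single lane sits on the critical path of every block, which is what lets the real thing keep several AES units busy at once.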
It looks plausible enough that this thing can deliver a decent 64-bit hash. Although what bothers me is that they do not disclose their design methodology, and specifically provide no estimates on the avalanche of the internal state. And I am increasingly coming to believe that all further attempts to design a better hash function without a sound methodology should be simply discarded. (I.e. if someone is just tweaking the code to pass the tests narrowly, that's not an achievement.)
One weakness in the MIX construction can already be seen: the 32 input bytes come in four overlapping loads: p+0, p+1, p+15, and p+16. So they basically try to inject the data twice, but they have to overlap inward, so bytes #0 and #31 are at a disadvantage: they are injected only once, and probably do not produce enough avalanche. I wonder whether the two loads (at p+0 and p+1) are any better than a single load at p+0 and a cheap permutation such as pshufd aka shuffle_epi32.
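The disadvantage is easy to tabulate. Assuming 16-byte loads at the stated offsets 0, 1, 15 and 16, counting how many loads cover each byte of the 32-byte block shows that bytes #0 and #31 are covered exactly once while the interior bytes are covered two or three times:

```c
/* For each byte of a 32-byte block, count how many of the four
   16-byte loads (at offsets 0, 1, 15, 16) cover it. */
void injection_counts(int counts[32])
{
    static const int offsets[4] = {0, 1, 15, 16};
    for (int b = 0; b < 32; b++)
        counts[b] = 0;
    for (int l = 0; l < 4; l++)
        for (int b = offsets[l]; b < offsets[l] + 16; b++)
            counts[b]++;
}
```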
Funnily enough, J. Andrew Rogers, the guy who designed MetroHash, recently came up with a new AES-NI hash function, AquaHash, whose main loop is essentially the same as in the early versions of Meow hash (as listed above). The old man's back again; it's a shame the construction's been debunked! That's why you shouldn't believe anyone who claims best speed and high quality, even someone with a good record, unless they disclose their design methodology.
I revamped meowfile.c for Linux. On Xeon E5-2660 (Sandy Bridge ca. 2012, host gcc122.fsffrance.org, gcc 4.8) I get this:
Code:
$ gcc -O2 -msse4 -maes meowfile.c -o meowfile
$ time ./meowfile <../enwik9
0c765efc731f3b07b5e65937685b8d4f
4.1 bytes per cycle
real 0m0.301s
user 0m0.113s
sys 0m0.188s
So it's not all that impressive (on older machines), but you can already see that system time (handling page faults etc.) dominates over user time. If we take into account only user time, it must be around 10^9/2^30/0.113 ≈ 8.2 GiB/s. Benchmarking is hard; Meow hash had better be integrated into smhasher and compared with other hashes on equal terms.
Jan and I looked at this in the design of HighwayHash, and our conclusion was that muls and shuffles give more entropy diffusion and more throughput. The AES instruction seemed useful only for backward compatibility.
The peril of multiplication is that high bits do not provide enough avalanche to the internal state (and may cancel too easily with the next portion of input). If we further take the view that a hash function is only as good as its "weakest link in a chain", then a single multiplication is never enough to provide decent diffusion; moreover, it is wasted in the sense that the "weak link" is still there, as if there were no multiplication at all! This is the problem with most fast designs with a single multiplication - such as MetroHash. I suspect that Bob Jenkins came to the same conclusion; that's why he didn't use multiplication in SpookyHash, but instead studied constructions which demonstrably maximize avalanche.
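The high-bit weakness is not just a heuristic; modulo 2^64 it is exact. Flipping the top bit of x changes x*m by 2^63*m mod 2^64, which for odd m is just 2^63, so exactly one bit of the product responds. A small demonstration:

```c
#include <stdint.h>

static int popcount64(uint64_t x)
{
    int n = 0;
    for (; x; x &= x - 1)
        n++;
    return n;
}

/* Number of bits of x*m (mod 2^64) that change when bit `bit` of x is
   flipped. For bit 63 and odd m this is always exactly 1, because the
   product changes by 2^63 * m mod 2^64 = 2^63; for bit 62 it is at
   most 2 (bit 62 plus a possible carry into bit 63). */
int product_bits_flipped(uint64_t x, uint64_t m, int bit)
{
    return popcount64(x * m ^ (x ^ (1ULL << bit)) * m);
}
```

Flipping a low bit, by contrast, typically changes many product bits, which is exactly the asymmetry the text describes.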
However if you base the hash too much on pure shift and add/sub then you end up stalling due to insufficient execution ports. In some algorithms the multiply can be free because it operates in parallel, in which case it's still beneficial surely? A mix of the two is perhaps optimal.
This is the problem with most fast designs with a single multiplication
In HighwayHash we fold the middle bits (best past dispersion) of the multiplication result into the low bits (best future dispersion) of the next multiplicand in the ZipperMerge function. See the ZipperMergeAndAdd and Update functions in
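A simplified scalar picture of that idea (HighwayHash's real ZipperMerge is a byte permutation over wide lanes; this rotation is only my illustration of "middle bits to low positions"):

```c
#include <stdint.h>

/* Rotate by 32 so the middle bits of the product (well mixed by the
   previous multiply) land in the low positions that most strongly
   influence the next multiply. Not HighwayHash's actual permutation. */
uint64_t fold_middle_to_low(uint64_t v)
{
    return (v >> 32) | (v << 32);
}
```

The design intent is that the bits a multiply disperses best feed the positions where the next multiply needs good input, instead of being stranded at the top.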
Originally Posted by svpv
It looks plausible enough that this thing can deliver a decent 64-bit hash. Although what bothers me is that they do not disclose their design methodology, and specifically provide no estimates on the avalanche of the internal state. And I am increasingly coming to believe that all further attempts to design a better hash function without a sound methodology should be simply discarded. (I.e. if someone is just tweaking the code to pass the tests narrowly, that's not an achievement.)
I have tried to analyse every candidate construction manually, and with brute-force code wherever I could find some way of using it meaningfully. The last few iterations before this one didn't have the alternate alignment loads, which meant that systematically changing the input in only one byte column would produce an inner state with changes in only 4 out of 16 columns (after one round of AES). So I wrote a brute forcer to find patterns where these changes might cancel out before avalanching to the rest of the columns. That brute forcer found candidate patterns for every such function I threw at it; that is why we changed the load alignment - those patterns simply became impossible. I haven't found a computationally feasible way of testing for similar patterns in this version, as the search space is simply too large. The good news is that such a large space also makes it unlikely that any pattern simple enough to be exploitable exists.
Regarding avalanching, it doesn't really matter how quickly the avalanching happens, as long as there is no way for changes to cancel out before they have avalanched to the point where cancellation is unlikely; that is what I have been focusing on.
I know the Smhasher-designed hashes; I have broken a few.
Originally Posted by svpv
One weakness in the MIX construction can already be seen: the 32 input bytes come in four overlapping loads: p+0, p+1, p+15, and p+16. So they basically try to inject the data twice, but they have to overlap inward, so bytes #0 and #31 are at a disadvantage: they are injected only once, and probably do not produce enough avalanche. I wonder whether the two loads (at p+0 and p+1) are any better than a single load at p+0 and a cheap permutation such as pshufd aka shuffle_epi32.
It does look like an odd choice, but it should be clear that there is no shortcut to collisions through altering only the singleton bytes. Any alteration to one of the end bytes will be AESed before being mixed with other input, so it becomes a difference in 4 bytes, dealing with that difference requires introducing changes to other bytes, and at that point the double ingestion comes into play as intended. The reason not to use a shuffle instruction is that it costs an extra instruction. The loads get fused with add and xor, so each half-round consists of just 3 instructions, which some processors can process in one clock cycle. What surprises many people is that unaligned memory access doesn't really bother modern X86 processors. In this construction, where we don't cross cache line bounds, it seems to have no adverse impact on performance.
Benchmarking Meow comes with some caveats. First of all, save for some game consoles, all modern X86 computers have main memory that is too slow to keep up with the CPU when running Meow, so the algorithm is fast enough to do close to 16 bytes per cycle on data in cache, memory just can't keep up. Second, while we don't have an assembly version for this pre-release, it is the plan to have an assembly version for the final release. What compilers do with the C code is very much hit and miss, typical issues include not combining the loads with add/xor, and reordering the instructions so dependencies are too close (there are limits to the amount of mess out-of-order execution can deal with).
@Shelwien:
We did have plain C code in the past, but current focus has been on designing the function cryptographically rather than making a zoo of implementations that would all need to change if we change the function further. Release should have a bunch of different versions. If you just want to see the AESDEC operation in code there is an implementation in the old Meow 0.4 codebase: https://github.com/cmuratori/meow_ha...re/meow_more.h
Last edited by NohatCoder; 29th September 2019 at 09:56.