Results 1 to 11 of 11

Thread: Meow hash 0.5 - an improved AES-NI hash

  1. #1
    Member
    Join Date
    Jan 2017
    Location
    Selo Bliny-S'edeny
    Posts
    24
    Thanks
    7
    Thanked 10 Times in 8 Posts

    Meow hash 0.5 - an improved AES-NI hash

    Meow hash aims to provide very fast hashing by harnessing AES-NI instructions. It was first released in November 2018. The main loop was as simple as:

    S0 = Meow128_AESDEC_Mem(S0, Source);
    S1 = Meow128_AESDEC_Mem(S1, Source + 16);
    S2 = Meow128_AESDEC_Mem(S2, Source + 32);
    S3 = Meow128_AESDEC_Mem(S3, Source + 48);
    Source += 64;

    It was soon discovered that this construction produces too many collisions, the main reason for this being (as explained by Leo Yuriev down the thread) that a single round of AES does not produce enough avalanche - only about 20 bits respond to a single bit flip in the input, while 32/64 are required to produce a decent 64/128-bit hash.

    Recently they pushed another version, v0.5, which allegedly solves the problem of collisions, and it looks so different that it's hard to believe it was written by the same guy (or the same lady). It goes like this:

    MEOW_MIX(xmm0,xmm4,xmm6,xmm1,xmm2, rax + 0x00);
    MEOW_MIX(xmm1,xmm5,xmm7,xmm2,xmm3, rax + 0x20);
    MEOW_MIX(xmm2,xmm6,xmm0,xmm3,xmm4, rax + 0x40);
    MEOW_MIX(xmm3,xmm7,xmm1,xmm4,xmm5, rax + 0x60);
    MEOW_MIX(xmm4,xmm0,xmm2,xmm5,xmm6, rax + 0x80);
    MEOW_MIX(xmm5,xmm1,xmm3,xmm6,xmm7, rax + 0xa0);
    MEOW_MIX(xmm6,xmm2,xmm4,xmm7,xmm0, rax + 0xc0);
    MEOW_MIX(xmm7,xmm3,xmm5,xmm0,xmm1, rax + 0xe0);
    rax += 0x100;

    So the state is xmm0-xmm7 (128 bytes), and they consume 256 bytes at each iteration, 32 bytes per MEOW_MIX. If you look at the columns of arguments to MEOW_MIX, the registers are cycled - the construction is similar to trickle-feed design as used by Bob Jenkins in SpookyHash. The 32 bytes for each MIX further come in 4 overlapping loads. The details are as follows:

    #define MEOW_MIX_REG(r1, r2, r3, r4, r5, i1, i2, i3, i4) \
    aesdec(r1, r2); \
    paddq(r3, i1); \
    pxor(r2, i2); \
    aesdec(r2, r4); \
    paddq(r5, i3); \
    pxor(r4, i4)
    #define MEOW_MIX(r1, r2, r3, r4, r5, ptr) \
    MEOW_MIX_REG(r1, r2, r3, r4, r5, \
    _mm_loadu_si128( (__m128i *) ((ptr) + 15) ), \
    _mm_loadu_si128( (__m128i *) ((ptr) + 0) ), \
    _mm_loadu_si128( (__m128i *) ((ptr) + 1) ), \
    _mm_loadu_si128( (__m128i *) ((ptr) + 16) ))

    It looks plausible enough that this thing can deliver a decent 64-bit hash. Although what bothers me is that they do not disclose their design methodology, and specifically provide no estimates on the avalanche of the internal state. And I am increasingly coming to believe that all further attempts to design a better hash function without a sound methodology should be simply discarded. (I.e. if someone is just tweaking the code to pass the tests narrowly, that's not an achievement.)

    One weakness in the MIX construction can already be seen: the 32 input bytes come in four overlapping loads: p+0, p+1, p+15, and p+16. So they basically try to inject the data twice, but they have to overlap inward, so bytes #0 and #31 are at a disadvantage: they are injected only once, and probably do not produce enough avalanche. I wonder whether the two loads (at p+0 and p+1) are any better than a single load at p0 and a cheap permutation such as pshufd aka shuffle_epi32.

    Funnily enough, J. Andrew Rogers, the guy who designed MetroHash, recently came up with a new AES-NI hash function, AquaHash, whose main loop is essentially the same as in the early versions of Meow hash (as listed above). The old man's back again, it's a shame the construction's been debunked! That's why you shouldn't believe anyone who claims best speed and high quality, even someone with a good record, unless they disclose their design methodology.

  2. Thanks:

    Shelwien (21st July 2019)

  3. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    1. So what's the actual processing speed? Microcode instructions are not necessarily faster than equivalent plain code. Also intel vs amd?

    2. Is there code to compute the same hash without AES instructions? Intel AES-NI pdf only contains pseudocode, not actual C.

    3. I also think that shuffle would be better.

  4. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    http://nishi.dreamhosters.com/u/meow_v0.rar
    Code:
    7280X @ 4.5Ghz
    
    1,000,000,000 enwik9 0C765EFC-731F3B07-B5E65937-685B8D4F
    
    0.484s // gcc82 core-avx2
    0.484s // gcc82 corei7-avx
    0.484s // gcc82 westmere (SSE4.2?)
    
    0.234s // memmap gcc82 westmere
    
    0.469s // memmap gcc82 westmere for https://gist.github.com/mmozeiko/9ddd6e9c0ea3a41728eb4f53034f31f6
    
    1000000000/0.234/(2**30) = 3.98GB/s

  5. #4
    Member
    Join Date
    Jan 2017
    Location
    Selo Bliny-S'edeny
    Posts
    24
    Thanks
    7
    Thanked 10 Times in 8 Posts
    I revamped meowfile.c for Linux. On Xeon E5-2660 (Sandy Bridge ca. 2012, host gcc122.fsffrance.org, gcc 4.8 ) I get this:
    Code:
    $ gcc -O2 -msse4 -maes meowfile.c -o meowfile
    $ time ./meowfile <../enwik9
    0c765efc731f3b07b5e65937685b8d4f
    4.1 bytes per cycle
    
    real    0m0.301s
    user    0m0.113s
    sys     0m0.188s
    So it's not all that impressive (on older machines), but you can already see that system time (handling page faults etc.) dominates over user time. If we take into account only user time, it must be around 10^9/2^30/0.113=8.2 Gbps. Benchmarking is hard, Meow hash had better be integrated into smhasher and compared with other hashes on equal terms.

  6. Thanks:

    Shelwien (21st July 2019)

  7. #5
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    899
    Thanks
    84
    Thanked 325 Times in 227 Posts
    i9 results:

    time ./meowfile < enwik9
    0c765efc731f3b07b5e65937685b8d4f
    5.5 bytes per cycle

    real 0m0.118s
    user 0m0.059s
    sys 0m0.059s

  8. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    meow_test.cpp has other hashes too (xxhash etc)

  9. #7
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    767
    Thanks
    217
    Thanked 286 Times in 168 Posts
    Me and Jan looked at this in the design of HighwayHash and our conclusion was that muls and shuffles give more entropy diffusion and more throughput. The AES instruction seemed only useful for backward compatibility.

  10. #8
    Member
    Join Date
    Jan 2017
    Location
    Selo Bliny-S'edeny
    Posts
    24
    Thanks
    7
    Thanked 10 Times in 8 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    Me and Jan looked at this in the design of HighwayHash and our conclusion was that muls and shuffles give more entropy diffusion and more throughput. The AES instruction seemed only useful for backward compatibility.
    The peril of multiplication is that high bits do not provide enough avalanche to the internal state (and may cancel too easily with the next portion of input). If we further take the view that a hash function is only as good as its "weakest link in a chain", then a single multiplication is never enough to provide decent diffusion, and moreover it is wasted in a sense that the "weak link" is still there, as if there were no multiplication at all! This is the problem with most fast design with single multiplication - such as MetroHash. I suspect that Bob Jenkins came to the same conclusion, that's why he didn't use multiplication in SpookyHash, but instead studied the constructions which demonstrably maximize avalanche.

  11. #9
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    488
    Thanks
    173
    Thanked 171 Times in 116 Posts
    However if you base the hash too much on pure shift and add/sub then you end up stalling due to insufficient execution ports. In some algorithms the multiply can be free because it operates in parallel, in which case it's still beneficial surely? A mix of the two is perhaps optimal.

  12. #10
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    767
    Thanks
    217
    Thanked 286 Times in 168 Posts
    Quote Originally Posted by svpv View Post
    This is the problem with most fast design with single multiplication
    In HighwayHash we fold the middle bits (best past dispersion) of the multiplication result to the low bits (best future dispersion) of the next multiplicant in the ZipperMerge function. See ZipperMergeAndAdd and Update functions in

    https://github.com/google/highwayhas.../highwayhash.c

    Even in a single update there are two multiplications and bits are mixed across the lanes, too.

  13. #11
    Member
    Join Date
    Sep 2019
    Location
    Earth
    Posts
    7
    Thanks
    0
    Thanked 4 Times in 4 Posts
    Hello, co-designer of Meow 0.5 here.

    Quote Originally Posted by svpv View Post
    It looks plausible enough that this thing can deliver a decent 64-bit hash. Although what bothers me is that they do not disclose their design methodology, and specifically provide no estimates on the avalanche of the internal state. And I am increasingly coming to believe that all further attempts to design a better hash function without a sound methodology should be simply discarded. (I.e. if someone is just tweaking the code to pass the tests narrowly, that's not an achievement.)
    I have tried to analyse every candidate construction manually, and with brute force code wherever I could find some way of using it meaningfully. The last few iterations before this one didn't have the alternate alignment loads, that means systematically changing the input in only one byte column would produce inner state with changes in only 4 out of 16 columns (after one round of AES), so I wrote a brute forcer to find patterns where these changes might cancel out before avalanching to the rest of the columns. That brute forcer did not fail to find candidate patterns for any function I threw at it, that is why we changed the load alignment, these patterns simply became impossible. I haven't found a computationally feasible way of testing for similar patterns in this version as the search space is simply too large. Good news is that that large space also means that it is unlikely that any pattern simple enough to be exploitable exist.

    Regarding avalanching, it doesn't really matter how quickly the avalanching happens, as long as there is no way for changes to cancel out before they avalanche to a state where it is unlikely, so that is what I have been focusing on.

    I know the Smhasher-designed hashes, have broken a few

    Quote Originally Posted by svpv View Post
    One weakness in the MIX construction can already be seen: the 32 input bytes come in four overlapping loads: p+0, p+1, p+15, and p+16. So they basically try to inject the data twice, but they have to overlap inward, so bytes #0 and #31 are at a disadvantage: they are injected only once, and probably do not produce enough avalanche. I wonder whether the two loads (at p+0 and p+1) are any better than a single load at p0 and a cheap permutation such as pshufd aka shuffle_epi32.
    It does look like an odd choice, but it should be clear that there is no shortcut to collisions through altering only the singleton bytes. Any alteration to one of the end bytes will be AESed before being mixed with other input, so it becomes a difference in 4 bytes, dealing with that difference requires introducing changes to other bytes, and at that point the double ingestion comes into play as intended. The reason not to use a shuffle instruction is that it costs an extra instruction. The loads get fused with add and xor, so each half-round consists of just 3 instructions, which some processors can process in one clock cycle. What surprises many people is that unaligned memory access doesn't really bother modern X86 processors. In this construction, where we don't cross cache line bounds, it seems to have no adverse impact on performance.

    Benchmarking Meow comes with some caveats. First of all, save for some game consoles, all modern X86 computers have main memory that is too slow to keep up with the CPU when running Meow, so the algorithm is fast enough to do close to 16 bytes per cycle on data in cache, memory just can't keep up. Second, while we don't have an assembly version for this pre-release, it is the plan to have an assembly version for the final release. What compilers do with the C code is very much hit and miss, typical issues include not combining the loads with add/xor, and reordering the instructions so dependencies are too close (there are limits to the amount of mess out-of-order execution can deal with).

    @Shelwien:

    We did have plain C code in the past, but current focus has been on designing the function cryptographically rather than making a zoo of implementations that would all need to change if we change the function further. Release should have a bunch of different versions. If you just want to see the AESDEC operation in code there is an implementation in the old Meow 0.4 codebase: https://github.com/cmuratori/meow_ha...re/meow_more.h
    Last edited by NohatCoder; 29th September 2019 at 08:56.

  14. Thanks:

    Shelwien (29th September 2019)

Similar Threads

  1. Hash / Checksum algorithm considerations
    By Cyan in forum Data Compression
    Replies: 61
    Last Post: 16th June 2017, 00:28
  2. Perfect Hash Function to Hash Strings
    By joey in forum Data Compression
    Replies: 18
    Last Post: 22nd March 2016, 10:59
  3. Extremely fast hash
    By Bulat Ziganshin in forum Data Compression
    Replies: 36
    Last Post: 23rd August 2013, 21:25
  4. Directory hash as one string
    By FatBit in forum The Off-Topic Lounge
    Replies: 6
    Last Post: 16th January 2012, 23:29
  5. Hash Zip
    By Black_Fox1 in forum Forum Archive
    Replies: 6
    Last Post: 4th March 2007, 17:12

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •