
Thread: Meow hash 0.5 - an improved AES-NI hash

#1 - svpv (Member, Selo Bliny-S'edeny)

    Meow hash aims to provide very fast hashing by harnessing AES-NI instructions. It was first released in November 2018. The main loop was as simple as:

    S0 = Meow128_AESDEC_Mem(S0, Source);
    S1 = Meow128_AESDEC_Mem(S1, Source + 16);
    S2 = Meow128_AESDEC_Mem(S2, Source + 32);
    S3 = Meow128_AESDEC_Mem(S3, Source + 48);
    Source += 64;

It was soon discovered that this construction produces too many collisions. The main reason, as explained by Leo Yuriev further down the thread, is that a single round of AES does not produce enough avalanche: only about 20 output bits respond to a single-bit flip in the input, while roughly half the bits (32 of 64, or 64 of 128) should respond to make a decent 64- or 128-bit hash.

    Recently they pushed another version, v0.5, which allegedly solves the problem of collisions, and it looks so different that it's hard to believe it was written by the same guy (or the same lady). It goes like this:

    MEOW_MIX(xmm0,xmm4,xmm6,xmm1,xmm2, rax + 0x00);
    MEOW_MIX(xmm1,xmm5,xmm7,xmm2,xmm3, rax + 0x20);
    MEOW_MIX(xmm2,xmm6,xmm0,xmm3,xmm4, rax + 0x40);
    MEOW_MIX(xmm3,xmm7,xmm1,xmm4,xmm5, rax + 0x60);
    MEOW_MIX(xmm4,xmm0,xmm2,xmm5,xmm6, rax + 0x80);
    MEOW_MIX(xmm5,xmm1,xmm3,xmm6,xmm7, rax + 0xa0);
    MEOW_MIX(xmm6,xmm2,xmm4,xmm7,xmm0, rax + 0xc0);
    MEOW_MIX(xmm7,xmm3,xmm5,xmm0,xmm1, rax + 0xe0);
    rax += 0x100;

So the state is xmm0-xmm7 (128 bytes), and they consume 256 bytes per iteration, 32 bytes per MEOW_MIX. If you look at the columns of arguments to MEOW_MIX, the registers are cycled - the construction is similar to the trickle-feed design used by Bob Jenkins in SpookyHash. The 32 bytes for each MIX further come in four overlapping loads. The details are as follows:

/* aesdec/paddq/pxor here stand for _mm_aesdec_si128 / _mm_add_epi64 /
   _mm_xor_si128, updating the first argument in place. */
#define MEOW_MIX_REG(r1, r2, r3, r4, r5, i1, i2, i3, i4) \
    aesdec(r1, r2);  /* r1 = AESDEC round of r1, keyed by r2 */ \
    paddq(r3, i1);   /* r3 += load(ptr + 15) */ \
    pxor(r2, i2);    /* r2 ^= load(ptr + 0)  */ \
    aesdec(r2, r4);  /* r2 = AESDEC round of r2, keyed by r4 */ \
    paddq(r5, i3);   /* r5 += load(ptr + 1)  */ \
    pxor(r4, i4)     /* r4 ^= load(ptr + 16) */

#define MEOW_MIX(r1, r2, r3, r4, r5, ptr) \
    MEOW_MIX_REG(r1, r2, r3, r4, r5, \
        _mm_loadu_si128( (__m128i *) ((ptr) + 15) ), \
        _mm_loadu_si128( (__m128i *) ((ptr) +  0) ), \
        _mm_loadu_si128( (__m128i *) ((ptr) +  1) ), \
        _mm_loadu_si128( (__m128i *) ((ptr) + 16) ))

It looks plausible enough that this thing can deliver a decent 64-bit hash. What bothers me, though, is that they do not disclose their design methodology, and specifically provide no estimates of the avalanche of the internal state. And I am increasingly coming to believe that all further attempts to design a better hash function without a sound methodology should simply be discarded. (I.e. if someone is just tweaking the code to pass the tests narrowly, that's not an achievement.)

One weakness in the MIX construction can already be seen: the 32 input bytes come in four overlapping loads, at p+0, p+1, p+15, and p+16. So they basically try to inject the data twice, but the loads have to overlap inward, so bytes #0 and #31 are at a disadvantage: they are injected only once, and probably do not produce enough avalanche. I wonder whether the two loads (at p+0 and p+1) are any better than a single load at p+0 plus a cheap permutation such as pshufd, aka _mm_shuffle_epi32.

Funnily enough, J. Andrew Rogers, the guy who designed MetroHash, recently came up with a new AES-NI hash function, AquaHash, whose main loop is essentially the same as in the early versions of Meow hash (as listed above). The old man's back again, and it's a shame the construction has already been debunked! That's why you shouldn't believe anyone who claims best speed and high quality, even someone with a good track record, unless they disclose their design methodology.

  2. The Following User Says Thank You to svpv For This Useful Post:

    Shelwien (21st July 2019)

#2 - Shelwien (Administrator, Kharkov, Ukraine)
1. So what's the actual processing speed? Microcoded instructions are not necessarily faster than equivalent plain code. Also, Intel vs AMD?

2. Is there code to compute the same hash without AES instructions? The Intel AES-NI pdf only contains pseudocode, not actual C.

3. I also think that a shuffle would be better.

#3 - Shelwien (Administrator, Kharkov, Ukraine)
    http://nishi.dreamhosters.com/u/meow_v0.rar
    Code:
    7280X @ 4.5Ghz
    
    1,000,000,000 enwik9 0C765EFC-731F3B07-B5E65937-685B8D4F
    
    0.484s // gcc82 core-avx2
    0.484s // gcc82 corei7-avx
    0.484s // gcc82 westmere (SSE4.2?)
    
    0.234s // memmap gcc82 westmere
    
    0.469s // memmap gcc82 westmere for https://gist.github.com/mmozeiko/9ddd6e9c0ea3a41728eb4f53034f31f6
    
    1000000000/0.234/(2**30) = 3.98GB/s

#4 - svpv (Member, Selo Bliny-S'edeny)
I revamped meowfile.c for Linux. On a Xeon E5-2660 (Sandy Bridge, ca. 2012, host gcc122.fsffrance.org, gcc 4.8) I get this:
    Code:
    $ gcc -O2 -msse4 -maes meowfile.c -o meowfile
    $ time ./meowfile <../enwik9
    0c765efc731f3b07b5e65937685b8d4f
    4.1 bytes per cycle
    
    real    0m0.301s
    user    0m0.113s
    sys     0m0.188s
So it's not all that impressive (on older machines), but you can already see that system time (handling page faults etc.) dominates over user time. If we count only user time, it comes to about 10^9/0.113/2^30 ≈ 8.2 GB/s. Benchmarking is hard; Meow hash had better be integrated into smhasher and compared with other hashes on equal terms.

  6. The Following User Says Thank You to svpv For This Useful Post:

    Shelwien (21st July 2019)

#5 - Member (Planet Earth)
    i9 results:

    time ./meowfile < enwik9
    0c765efc731f3b07b5e65937685b8d4f
    5.5 bytes per cycle

    real 0m0.118s
    user 0m0.059s
    sys 0m0.059s

#6 - Shelwien (Administrator, Kharkov, Ukraine)
    meow_test.cpp has other hashes too (xxhash etc)

#7 - Jyrki Alakuijala (Member, Switzerland)
Jan and I looked at this during the design of HighwayHash, and our conclusion was that multiplications and shuffles give more entropy diffusion and more throughput. The AES instruction seemed useful only for backward compatibility.

#8 - svpv (Member, Selo Bliny-S'edeny)
    Quote Originally Posted by Jyrki Alakuijala View Post
Jan and I looked at this during the design of HighwayHash, and our conclusion was that multiplications and shuffles give more entropy diffusion and more throughput. The AES instruction seemed useful only for backward compatibility.
The peril of multiplication is that the high bits do not provide enough avalanche to the internal state (and may cancel too easily with the next portion of input). If we further take the view that a hash function is only as good as its weakest link, then a single multiplication is never enough to provide decent diffusion; moreover, it is wasted in the sense that the weak link is still there, as if there were no multiplication at all! This is the problem with most fast designs built around a single multiplication, such as MetroHash. I suspect Bob Jenkins came to the same conclusion; that's why he didn't use multiplication in SpookyHash, but instead studied constructions which demonstrably maximize avalanche.

#9 - Member (Cambridge, UK)
However, if you base the hash too much on pure shifts and add/sub, you end up stalling due to insufficient execution ports. In some algorithms the multiply can be free because it runs in parallel with the other operations, in which case surely it's still beneficial? A mix of the two is perhaps optimal.

#10 - Jyrki Alakuijala (Member, Switzerland)
    Quote Originally Posted by svpv View Post
    This is the problem with most fast design with single multiplication
In HighwayHash we fold the middle bits (best past dispersion) of the multiplication result onto the low bits (best future dispersion) of the next multiplicand in the ZipperMerge function. See the ZipperMergeAndAdd and Update functions in

    https://github.com/google/highwayhas.../highwayhash.c

    Even in a single update there are two multiplications and bits are mixed across the lanes, too.
