
Thread: 64 GB/s aka 16 bytes/cycle universal hashing

  1. #1
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Thanked 691 Times in 374 Posts

    64 GB/s aka 16 bytes/cycle universal hashing

    "Universal hash" is a mathematical term that, on practical grounds, guarantees ideal hashing behavior. It's impossible to verify the statistical properties of a single fixed hash in a mathematically sound way, so a universal hash is by definition a family of hashes (i.e. a parameterized hash). Kaser & Lemire proved that the sum-of-(32*32->64)-products string hash is universal (here "string" means variable-sized input data, as opposed to a fixed-width M-bit value). In C++, their hashing algorithm looks as follows:

    uint32_t data[N], key[N];
    uint64_t hash = 0;
    for (size_t i = 0; i < N; ++i)
        hash += uint64_t(data[i]) * key[i];
    uint32_t result = uint32_t(hash >> 32);
    Here key[] is the algorithm's parameter (pseudo-random numbers) with the same size as the input data. But we can limit key[] to, say, 128 bytes and use a hash tree to combine partial results (as implemented by the VMAC and UMAC algorithms):

    result[0] = hash (data[0..31], key[0..31])
    result[1] = hash (data[32..63], key[0..31])
    result = hash(result[0..31], key[0..31])
    Alternatively, the last step may employ SHA-256 or any other good hash, since it has much less data to process.
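    To make the scheme above concrete, here is a minimal compilable C++ sketch of the two-level tree. The names mulsum_hash and tree_hash are mine (not from VMAC/UMAC), and the sketch assumes the input is a whole number of 32-word blocks:

    ```cpp
    #include <cstdint>
    #include <cstddef>

    // Multiply-sum universal hash of n 32-bit words (Kaser & Lemire):
    // sum the 64-bit products data[i]*key[i], return the high 32 bits.
    static uint32_t mulsum_hash(const uint32_t* data, const uint32_t* key, size_t n) {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; ++i)
            sum += uint64_t(data[i]) * key[i];
        return uint32_t(sum >> 32);
    }

    // Two-level tree: hash each 32-word (128-byte) block with the same key,
    // then hash the per-block results with a second, independent key.
    static uint32_t tree_hash(const uint32_t* data, size_t nblocks,
                              const uint32_t* key, const uint32_t* key2) {
        uint32_t partial[32];  // sketch: at most 32 blocks per tree level
        for (size_t b = 0; b < nblocks; ++b)
            partial[b] = mulsum_hash(data + b * 32, key, 32);
        return mulsum_hash(partial, key2, nblocks);
    }
    ```

    A real implementation would iterate the second level (or switch to SHA-256 there) so that inputs longer than 32 blocks are also covered.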


    So, the only thing that determines speed is the inner loop summing the results of the 32*32->64 multiplications:

    mov eax, data[i]
    mul dword ptr key[i]   ; edx:eax = eax * key[i]
    add ebx, eax
    adc ecx, edx

    It should run at 2 cycles/dword on anything since the Pentium Pro, i.e. 8 GB/s on a 4 GHz CPU. The loop speed is limited by the two load operations (data[i] and key[i]), so Haswell, which can perform two loads per cycle, can run it at 1 cycle/dword.


    Employing SSE2/AVX2 allows computing 2/4 products simultaneously. Combined with Haswell's ability to perform two loads per cycle, it should run at 1 cycle per 8/16 bytes (for SSE2/AVX2 respectively), providing the abovementioned 64 GB/s hashing speed with the following AVX2 code:

    vmovdqu  ymm0, [data+i*32]
    vpmuludq ymm0, ymm0, [key+i*32]      ; products of even dwords 0,2,4,6
    vpaddq   ymm1, ymm1, ymm0

    vmovdqu  ymm0, [data+i*32+4]         ; shifted by one dword: odd dwords 1,3,5,7
    vpmuludq ymm0, ymm0, [key+i*32+4]
    vpaddq   ymm1, ymm1, ymm0

    ; ...and so on for the next 32-byte block


    On pre-Haswell CPUs we are again limited by load throughput - two load operations (data and key) per 16-byte SSE element means 2 cycles. We can reduce the number of loads by processing six 128-byte blocks in parallel (since they share the same key values):

    movdqu xmm7, [key+i*16]        ; load the key once

    movdqu xmm0, [data+i*16]       ; block 1
    pmuludq xmm0, xmm7
    paddq  xmm1, xmm0

    movdqu xmm0, [data+128+i*16]   ; block 2
    pmuludq xmm0, xmm7
    paddq  xmm2, xmm0
    ; ...blocks 3-5 likewise, accumulating into xmm3-xmm5...
    movdqu xmm0, [data+128*5+i*16] ; block 6
    pmuludq xmm0, xmm7
    paddq  xmm6, xmm0

    movdqu xmm7, [key+i*16+16]     ; load the next key...

    This code requires 7 loads (6 data values and one key value) per 6*8 bytes of input data, so a 4 GHz Core2+ CPU will run at 4*8*6/7 ≈ 27 GB/s using SSE2, saturating its memory bandwidth.
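    The inner loop above can also be written with SSE2 intrinsics. This is a portable sketch (function name mulsum_hash_sse2 is mine), not the tuned multi-block version: _mm_mul_epu32 multiplies only the even 32-bit lanes, so a second pass with both vectors shifted right by one dword covers the odd lanes, matching the scalar sum exactly:

    ```cpp
    #include <emmintrin.h>  // SSE2, baseline on x86-64
    #include <cstdint>
    #include <cstddef>

    // Multiply-sum hash of n 32-bit words using SSE2 (assumes n % 4 == 0).
    static uint32_t mulsum_hash_sse2(const uint32_t* data, const uint32_t* key, size_t n) {
        __m128i acc = _mm_setzero_si128();  // two 64-bit partial sums
        for (size_t i = 0; i + 4 <= n; i += 4) {
            __m128i d = _mm_loadu_si128((const __m128i*)(data + i));
            __m128i k = _mm_loadu_si128((const __m128i*)(key + i));
            acc = _mm_add_epi64(acc, _mm_mul_epu32(d, k));  // dwords 0 and 2
            acc = _mm_add_epi64(acc, _mm_mul_epu32(_mm_srli_epi64(d, 32),
                                                   _mm_srli_epi64(k, 32)));  // dwords 1 and 3
        }
        uint64_t lanes[2];
        _mm_storeu_si128((__m128i*)lanes, acc);
        return uint32_t((lanes[0] + lanes[1]) >> 32);  // fold lanes, take high 32 bits
    }
    ```

    Since 64-bit lane additions wrap modulo 2^64 just like the scalar accumulator, folding the two lanes at the end reproduces the scalar result bit-for-bit.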


    It's important to note that all these code snippets work in x86 (32-bit) mode as well as x64. So, for practical usage we only need a way to generate somewhere around 128..4096 bytes of pseudo-random data from the 32..256-bit key provided by the library caller. It can be implemented by applying AES or SHA-256 to the values key, key+1, ... (i.e. using AES in CTR mode), as already implemented by the VMAC algorithm.
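    A sketch of that key-expansion step, with splitmix64 standing in for AES-CTR purely for illustration (splitmix64 is NOT cryptographic and gives none of the guarantees AES would; the function names are mine):

    ```cpp
    #include <cstdint>
    #include <cstddef>

    // Non-cryptographic stand-in for AES-CTR: splitmix64 applied to a counter.
    static uint64_t splitmix64(uint64_t& state) {
        uint64_t z = (state += 0x9E3779B97F4A7C15ull);
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
        return z ^ (z >> 31);
    }

    // Expand the caller's small seed into the key[] array of 32-bit words.
    static void expand_key(uint64_t seed, uint32_t* key, size_t nwords) {
        for (size_t i = 0; i < nwords; i += 2) {
            uint64_t r = splitmix64(seed);       // 64 pseudo-random bits per call
            key[i] = uint32_t(r);
            if (i + 1 < nwords) key[i + 1] = uint32_t(r >> 32);
        }
    }
    ```

    A production version would substitute AES(key, counter) or SHA-256 here, exactly as the post suggests.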

    Strictly speaking, this algorithm returns a 32-bit hash value. For a wider result, we can either calculate and later combine multiple 32-bit hashes (the same data[i] multiplied by several key[K][i] values), with proportional speed degradation, or use some form of mixing of intermediate data, losing the universal-hash property but keeping the same high speed. I expect that a properly designed mixing step would give a good practical hash, as is done in CityHashCrc256, for example.
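    The first option (multiple independent keys) is straightforward; here is a sketch of a 64-bit result built from two passes with unrelated key arrays, at roughly twice the cost of one pass (names mulsum32/hash64 are mine):

    ```cpp
    #include <cstdint>
    #include <cstddef>

    // One pass of the 32-bit multiply-sum hash.
    static uint32_t mulsum32(const uint32_t* data, const uint32_t* key, size_t n) {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; ++i)
            sum += uint64_t(data[i]) * key[i];
        return uint32_t(sum >> 32);
    }

    // 64-bit result: two independent 32-bit hashes over the same data,
    // concatenated. Each pass keeps its own universality guarantee.
    static uint64_t hash64(const uint32_t* data, size_t n,
                           const uint32_t* keyA, const uint32_t* keyB) {
        return (uint64_t(mulsum32(data, keyA, n)) << 32) | mulsum32(data, keyB, n);
    }
    ```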


    Conclusion: these code snippets may be used to implement an alternative to xxHash/CityHash with the following properties:
    • fast in both 32-bit and 64-bit code:
    • __ 2 bytes/cycle at least on any CPU since Pentium Pro
    • __ 7 bytes/cycle at least on any Intel CPU since Core2
    • __ 16 bytes/cycle on Haswell, 32 bytes/cycle on Skylake with AVX-512
    • 32-bit result by default, wider results with proportional speed degradation
    • almost math-proven guarantees of hash quality (real mathematical guarantees require a truly random key)
    • wider results at the same speed as the 32-bit one - needs further development, and no math guarantees anymore
    • key generation and result combining need some good-quality algorithms, but I believe AES and SHA-256 can fill this duty
    • the algorithm is parameterized, meaning you can calculate as many independent hashes as you need

    So, comparing it to the best non-crypto hashes I know, xxHash/CityHash, I expect this hash to have better speed but the following drawbacks:
    • much more compiled code, since we need to employ AES and SHA-256 implementations
    • higher cache occupancy (those 128..4096 bytes of key[])
    • wide results (>32 bits) without speed degradation need further development
    Last edited by Bulat Ziganshin; 9th April 2014 at 19:22.

  2. Thanks (3):

    Cyan (9th April 2014),Garen (10th April 2014),Paul W. (10th April 2014)

  3. #2
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Melbourne, Florida, USA
    Thanked 798 Times in 489 Posts
    Doesn't limiting the key destroy the universal property of the hash? For example, if you swap data[i] with data[i+32], you have a collision. Likewise if you have data[i]+C, data[i+32]-C.

  4. #3
    Join Date
    Feb 2013
    San Diego
    Thanked 72 Times in 56 Posts
    I have a feeling that if the paper calls for the same number of random key bits as input bits, that's probably critical to making it work.

    Why don't you email Lemire and describe the situation? Maybe he will give you a recommendation.
    Last edited by nburns; 10th April 2014 at 02:51.

