How to select proper hash algo
How to select proper hash algo
UMAC and VMAC (site) are the fastest MAC algorithms available. They include UHASH/VHASH as part of their algorithms and, to the best of my knowledge, U/VHASH can be used as fast & reliable hash algorithms - more reliable than any non-cryptographic ones (crc,xxhash,cityhash) and even cryptograhically secure as far as attacker doesn't know the hash key. Moreover, since U/VHASH are keyed hashes, you can calculate hash of any K*N size, where N is the size of single hash, by calling them K times with different key values.
Main loop of both hashes is based on very fast family of universal hashes:
sum = (data[0]+key[0])*(data[1]+key[1]) + (data[2]+key[2])*(data[3]+key[3]) + ...
result = sum >> (sizeof(*data)*8)
UHASH is 32-bit hash, i.e. data[i], key[i] and result are 32-bit values. data[i]+key[i] is 32-bit operation, while multiplication is 32*32->64 and sum is 64-bit. VHASH is 64-bit hash, i.e. everything is twice as wide. Both have "double versions" that return double-wide result by performing the calculation twice (with different keys, of course). The "double version" reduces number of memory loads, though, so it's ~25% faster than two runs of ordinary version.
By the definition of univeral hashing, every data[i] should have its own, independent key[i]. So U/VHASH employs this univeral hashing loop only as the base building block. Only 8..8192 (VMAC_NHBYTES) bytes are generated for key[] and therefore separate univeral hash value is computed for each VMAC_NHBYTES bytes of input data. Then those block hashes are combined using much slower but cryptographically-strong algorithms.
Since VHASH performs 64*64->128 multiplication and requires register memory enough to storing two 128-bit values, it turns very slow on plain x86 register-poor architecture. With VMAC_NHBYTES=8192, i got 5 cpb for pure x86, 1 cpb for x86-SSE2 (hand-optimized algorithm storing data in MMX registers) and 0.4 cpb for plain x64 - all these for VHASH-128 algo. VHASH-64 was 0.25 cpb.
UMAC should be much faster on x86, especially with SSE2/AVX2 - 2*UMAC-64 should be 0.5 cpb with SSE2, 0.25 cpb with AVX2. AVX512 (expected in Skylake CPUs) includes 64*64->128 SIMD multiplication, so with AVX512 VHASH should be faster than UHASH once again.
VMAC is currently used by SREP as block checksum and for deduplication. Forthcoming FA'Next uses VMAC for deduplication if chunk checksums aren't stored to the archive. Crypto++ includes VMAC implementation.
Last edited by Bulat Ziganshin; 19th May 2015 at 11:54.
Matt Mahoney (19th May 2015),ne0n (19th May 2015)