
Thread: Improving xxHash

#1 - Bulat Ziganshin

    Improving xxHash

    Over the last few weeks I tried to wrap my head around hashing algorithms like xxHash, MurmurHash and SpookyHash. Now I'm ready to offer some criticism of xxHash and propose ways to improve it.


    1. xxHash's main loop updates 4 32-bit variables. Each variable depends only on the data in the corresponding 32-bit lane of each 128-bit word, so essentially we run 4 independent hashes, each processing its own data stream.

    Since the state of each individual stream fits in a 32-bit variable, the upper limit of entropy it can carry is 32 bits, and in practice it will be lower. So a 32-bit state is too small for producing a high-quality 32-bit hash. And combining 4 hashes of independent data streams can't help either - all the entropy of a particular input may be condensed into just one of these streams (imagine a table with 16-byte rows).

    The solution to this problem is to mix the state variables as we go through the data, as implemented in SpookyHash and MurmurHash3_x86_128. The variables should be combined in a way that ensures the state is fully mixed, i.e. any input bit may change any state bit after some number of loop cycles, and we should try to minimize this number of cycles as well as ensure good data mixing.
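    For illustration, here is a toy round in C that updates 4 lanes and then folds each lane into its neighbor. This shows only the idea, with a placeholder constant - it is not xxHash, SpookyHash or zzHash code:
    Code:
    #include <stdint.h>
    
    static uint32_t rol32(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }
    
    /* Per-lane update followed by a cross-lane ADD, so no lane stays
       independent: any input bit can reach any state bit after a few rounds. */
    static void round4(uint32_t v[4], const uint32_t in[4])
    {
        for (int i = 0; i < 4; i++) {
            v[i] = rol32(v[i] + in[i] * 2654435761u, 13);
            v[i] += v[(i + 1) & 3];     /* mix in the neighboring lane */
        }
    }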


    2. The hashes of these 4 streams are combined in a very minimalistic way. This was already fixed in xxHash64, so I just employed its code in my hash. BTW, XXH32 with XXH64 mixing successfully passed SMHasher as a 32-bit hash (both the upper and lower 32 bits of the result) and failed only one test as a 64-bit hash.


    3. The state variables are mixed down to a single 32-bit variable before processing the last 1-15 bytes of data. This condenses the entropy too early. Also, after processing the last data byte, xxHash performs only a few mixing operations to finalize the hash. This leaves the last bytes less thoroughly mixed into the hash result.

    The solution I propose is to expand the last chunk of data to 16 bytes with zeros and process it the same way as the previous blocks, i.e. updating all 4 state variables. Only then should we proceed to the final mixing of state variables to compute the hash result.
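    A minimal sketch of this tail handling, assuming a hypothetical ROUND16() that performs the normal 16-byte block update (the input length is assumed to be mixed in elsewhere, as xxHash already does, so the zero padding doesn't create trivial collisions):
    Code:
    #include <stdint.h>
    #include <string.h>
    
    void ROUND16(uint32_t v[4], const uint8_t block[16]);  /* normal block update, assumed */
    
    /* Expand the last 1-15 bytes to a full block with zeros and process
       it exactly like any other block, updating all 4 state vars. */
    static void process_tail(uint32_t v[4], const uint8_t *p, size_t len)
    {
        uint8_t block[16] = {0};
        memcpy(block, p, len);          /* len < 16 */
        ROUND16(v, block);
    }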


    4. xxHash performs a multiplication on the input word before combining it with the hash state variable. This construction increases ILP, but the multiplication is "wasted" in the sense that it doesn't mix the hash state itself. It's a question of taste, but I prefer to execute all operations against hash state variables in order to make input bits "diffuse" around the state faster.

    So I constructed two update functions. The first one, employed by SlowZZH, executes the same set of operations as xxHash (2 MUL, ADD, ROL/BSWAP) plus one ADD to mix state vars. It has the same bulk speed as xxHash (2 bytes/cycle), since the performance of both functions (as well as xxHash64) is strictly limited by the single port capable of MUL on any CPU I know. (Note: SMHasher incorrectly computes speed on modern CPUs due to changes in RDTSC behavior - it shows results that are 10-15% better than the real ones.)

    But unlike xxHash, it executes both MUL operations against state variables, making input bits "diffuse" faster. In theory the second MUL should be followed by another ROL. I tried this change, but found no improvement in SMHasher stats (32-bit, 64-bit), only decreased speed.

    Computations on the input value prior to combining it with the hash state require allocating one register for intermediate data. Since my hashes avoid such computations, they free up one register compared to xxHash. I spend this extra register on increasing the number of state variables to 5, which decreases the pressure of delays between chained operations. The net result is that on Haswell, SlowZZH has the same speed as xxHash.

    A second consequence of the enlarged internal state is that 160 bits of state allow computing hashes up to 128 bits wide with some confidence.

    The first ROL was replaced with BSWAP, available on Intel CPUs since the i486. I expected that it might slightly improve bit mixing and therefore SMHasher stats, but instead it improved performance - most likely due to suboptimal scheduling of the ROL code by GCC 4.9.2.
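    Putting the pieces together, one SlowZZH-style step might look like this. This is my reconstruction from the description above - the constants and exact operation order are guesses, not the actual SlowZZH code:
    Code:
    #include <stdint.h>
    
    #define K1 2654435761u   /* placeholder constants */
    #define K2 2246822519u
    
    /* 2 MUL + ADD + BSWAP, all on the state itself, plus one ADD
       that mixes in a neighboring state variable. */
    static uint32_t slow_step(uint32_t v, uint32_t vnext, uint32_t input)
    {
        v += input;                   /* combine input directly, no premixing */
        v *= K1;                      /* first MUL on the state */
        v  = __builtin_bswap32(v);    /* BSWAP instead of the first ROL (GCC/Clang builtin) */
        v *= K2;                      /* second MUL on the state */
        return v + vnext;             /* extra ADD mixing state vars */
    }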

    So, SlowZZH provides:
    - the same speed as xxHash (at least on Haswell)
    - better state mixing (two MUL operations on a state variable every loop cycle instead of one)
    - a 160-bit internal state; a new input word influences the entire state after processing the next 64 input bytes
    - results up to 128 bits (currently only 32- and 64-bit results are implemented)
    - passes SMHasher both as a 32-bit and as a 64-bit hash


    5. Along with the SlowZZH development, I tried to simplify the xxHash hashing formula, assuming that bit mixing across 5 state variables would compensate for this weakness. I ended up with an update procedure that drops the second MUL, opening the door to going faster than 2 bytes/cycle.

    Compared to xxHash (2 MUL + ADD + ROL), it employs one more ADD to combine state vars but drops the MUL on the input word, so it performs MUL + 2 ADD + ROL per input word. In theory it should run as fast as 1 cycle/word on Haswell, but I haven't reached that speed (even in the x64-optimized version with 12 state vars to avoid delays from instruction chains).
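    By the same reading, a zzHash-style step drops the second MUL - again a sketch with a placeholder constant and rotation, not the actual code:
    Code:
    #include <stdint.h>
    
    /* MUL + 2 ADD + ROL per input word. */
    static uint32_t zz_step(uint32_t v, uint32_t vnext, uint32_t input)
    {
        v += input;                      /* ADD #1: combine input word */
        v *= 2246822519u;                /* the only MUL, on the state */
        v  = (v << 13) | (v >> 19);      /* ROL */
        return v + vnext;                /* ADD #2: mix neighboring state var */
    }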

    Nevertheless, zzHash passes all SMHasher tests and runs at 2.5 bytes/cycle in the x86 version (25% faster than xxHash) and 5.6 bytes/cycle in the x64 version (40% faster than xxHash64 and SpookyHash).

    So, zzHash provides:
    - 25% faster than xxHash; the x64 version is 40% faster than xxHash64/SpookyHash (on Haswell)
    - the same state mixing on each step as xxHash, but with the input word pre-mixing dropped
    - a 160-bit internal state; a new input word influences the entire state after processing the next 64 input bytes
    - results up to 128 bits (currently only 32- and 64-bit results are implemented)
    - passes SMHasher both as a 32-bit and as a 64-bit hash

    So it may be considered a lightweight xxHash version. Instead of premixing the input word before combining it with a state variable, we mix the state var with another state var on every step. It's an open question how this replacement may improve or degrade hash quality.


    6. Finally, I added x64 versions of both hashes (not compatible with the x86-optimized ones) that read input data in 64-bit words and hold the state in 12 64-bit registers. This makes the internal state 768 bits wide, so hashes up to 512 bits can be calculated. Except for the much bigger state, the following algorithms are 1:1 translations of the x86-optimized hashes.

    SlowWideZZH:
    - 4 bytes/cycle, 5% slower than xxHash64/SpookyHash (on Haswell)
    - 0.7 bytes/cycle on x86, 5% slower than xxHash64, 20% slower than SpookyHash (on Haswell)
    - the same algorithm as SlowZZH
    - a 768-bit internal state; a new input word influences the entire state after processing the next 968 input bytes
    - results up to 512 bits (currently only 32- and 64-bit results are implemented)
    - passes SMHasher both as a 32-bit and as a 64-bit hash

    WideZZH:
    - 5.6 bytes/cycle, 40% faster than xxHash64/SpookyHash (on Haswell)
    - 1.3 bytes/cycle on x86, 80% faster than xxHash64, 60% faster than SpookyHash (on Haswell)
    - the same algorithm as zzHash
    - a 768-bit internal state; a new input word influences the entire state after processing the next 968 input bytes
    - results up to 512 bits (currently only 32- and 64-bit results are implemented)
    - passes SMHasher both as a 32-bit and as a 64-bit hash

    The wide hashes definitely need more work. In particular, I will try to make WideZZH run at 8 bytes/cycle. I also need to modify the state mixing so that input data influences the entire state faster than in 968 bytes (for comparison, SpookyHash does it in 256 bytes). This may improve the "Differential Tests" results, which are now a bit worse than in the non-wide versions.

    Nevertheless, zzHash (for x86) and WideZZH (for x64) seem to be the fastest scalar hashes around that can pass SMHasher.




    Now, about features that may be implemented in the future:

    1. I plan to split hashing into two stages: a main-loop function processing all input data and returning the final 160-bit or 768-bit state, and final-mix functions processing the data returned by the first function and returning 32/64/128... bits as the final hash result.


    2. Algorithms like xxHash/zzHash may be considered "sequential hashing", since they cannot be multi-threaded. I plan to also develop a family of "chunked hashing" algorithms. They will split input data into fixed-size blocks (4 KB or so), compute a sequential hash over each block and then combine the individual hashes. This construction, already implemented in FARSH, allows multi-core CPUs and especially GPUs to employ all their resources to hash a single data buffer.

    Moreover, the combining of final hash results should be replaced with combining of final internal states, in order to improve hashing speed and quality. This is an additional argument for the idea of splitting the hash algorithm into two stages.

    Finally, I imagine an algorithm that collects the 160-bit results of processing each 4 KB chunk into a temporary buffer and runs the same algo on this buffer, yielding a 160-bit total result. The same approach may be repeated on multiple levels, and the final 160-bit "super-result" is then processed by the standard 160-to-32 bit mixing procedure (or 160-to-64 and so on).
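    A sketch of the two-level scheme, assuming hypothetical helpers zzh_state() (main loop returning the 160-bit state) and zzh_finalize32() (the 160-to-32 bit mixing) - neither exists yet:
    Code:
    #include <stdint.h>
    #include <stdlib.h>
    
    #define CHUNK 4096   /* bytes per independently hashed chunk */
    #define STATE 20     /* 160-bit internal state, in bytes */
    
    void     zzh_state(const void *buf, size_t len, uint8_t out[STATE]);  /* assumed */
    uint32_t zzh_finalize32(const uint8_t state[STATE]);                  /* assumed */
    
    /* Hash each chunk independently (each iteration can go to its own
       core or GPU lane), collect the 160-bit states, then hash the
       buffer of states to get the total result. */
    uint32_t zzh_chunked32(const uint8_t *buf, size_t len)
    {
        size_t nchunks = (len + CHUNK - 1) / CHUNK;
        uint8_t *states = malloc(nchunks * STATE);
        for (size_t i = 0; i < nchunks; i++) {
            size_t pos = i * CHUNK;
            size_t n   = len - pos < CHUNK ? len - pos : CHUNK;
            zzh_state(buf + pos, n, states + i * STATE);
        }
        uint8_t top[STATE];
        zzh_state(states, nchunks * STATE, top);   /* second level */
        free(states);
        return zzh_finalize32(top);
    }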


    3. And one last idea. By pushing hash speeds up, we allow less speed-critical applications to run multi-pass hashing, which may significantly improve hash quality. Consider: the main source of undesired collisions is bit canceling between successive input words. One way to fix it is to process the data in an arbitrary order. But that breaks hardware cache prefetching (which can be replaced with software prefetching), and it still allows bit canceling between successively processed words, although with much lower probability in practice.

    Another approach is to split the data into chunks of ~1-4 KB and process each chunk 2 or more times, in different orders. The first pass may go through the data sequentially, and the second pass in an arbitrary order. The small chunk size ensures the data stays in L1 cache for the second pass. This way, each input word has different neighbors on the first and second passes, making bit canceling less probable, if possible at all.

    In order to maximize the distance, in one pass, between words processed sequentially in the other pass, we should select the smallest K > sqrt(N) that is coprime to N (N is the chunk size in words), and run the second pass in the following order:
    Code:
    for (int i = 0; i < N; i++)
        process(data[(i*K) % N]);
    i.e. process the data in steps of K, modulo N. Of course, for maximum speed the loop should be unrolled and all indexes made constants.
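    A self-contained version of the two-pass idea, including the coprime stride selection (process() stands in for the real state update):
    Code:
    #include <math.h>
    #include <stddef.h>
    
    static unsigned gcd(unsigned a, unsigned b)
    {
        while (b) { unsigned t = a % b; a = b; b = t; }
        return a;
    }
    
    /* Smallest K > sqrt(N) with gcd(K, N) == 1. */
    static unsigned pick_stride(unsigned n)
    {
        unsigned k = (unsigned)sqrt((double)n) + 1;
        while (gcd(k, n) != 1) k++;
        return k;
    }
    
    static void two_pass(const unsigned *data, unsigned n, void (*process)(unsigned))
    {
        for (unsigned i = 0; i < n; i++)              /* pass 1: sequential */
            process(data[i]);
        unsigned k = pick_stride(n);
        for (unsigned i = 0; i < n; i++)              /* pass 2: stride K mod N */
            process(data[(size_t)i * k % n]);
    }
    Since gcd(K, N) = 1, the second pass visits every index exactly once.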


    4. As you can see, we can build a huge family of hashes:
    - sequential x86-optimized hashes with 32/64/128-bit results
    - the same, but x64-optimized, with 32..512-bit results
    - any of the above, performed in 2, 3, 4... passes
    - any of the above, used as a base for the chunked algo

    And we haven't even counted the fast/slow zzHash variations, or SIMD-optimized algorithms like xxFarsh/HighwayHash (which could also be targets for multi-pass/chunked algorithms)!

    This calls for a Hash Constructor library providing a solid API to build higher-level hash algorithms (such as a 128-bit 2-pass chunked algo) from lower-level ones (such as a function processing an input buffer and returning the 160-bit internal state).



    And finally, I will be glad to see your answers to the following questions:

    - How should it be packaged? I prefer to reuse the excellent existing infrastructure of xxHash and provide the new hashes as an "extension module" to xxHash. The hashes may end up as part of the xxHash library, if Yann prefers. Do you have better ideas?

    - Final mixing. As a quick-and-dirty approach, I have just stolen the corresponding xxHash64 code for now. It needs some development; in particular, the x86-optimized hashes should use 32-bit operations. But I have no idea how to build a good algorithm, so I call for your ideas or code. Ignoring the x64 hashes for a while, we need x86-optimized code mixing five 32-bit variables into a 32..128-bit final hash result.


#2 - Cyan
    There is a slot up for grabs in the xxHash library for generating hashes >= 128 bits.
    If you want it, it's yours.

#3 - Bulat Ziganshin
    Quote Originally Posted by Cyan View Post
    There is a slot up for grabs in the xxHash library for generating hashes >= 128 bits.
    If you want it, it's yours.
    You mean I can extend the library with XXH128, somewhat like MurmurHash_x86_128 - essentially just add XXH128 code to every instance of XXH32+XXH64 code in the existing xxhash.[ch]?

    OK

    I still need to find good finalization code. SpookyHash has an interesting End() function, but it has limited ILP. So I wrote 3 variants of high-ILP code exploiting the Jenkins approach. I assume that at this stage it is more important to have MORE mixing between state vars than to have GOOD mixing. So only operations with minimal delay are used, and every operation's result is used in the next cycle (on Haswell+).
    Code:
    #define ROL(v)  ((v) = XXH_rotl32((v),13))   /* rotate in place, as the statements below require */
    
        // first version
        ROL(v4);  ROL(v5);  v1 += v2;  v2 ^= v3;
        ROL(v1);  ROL(v2);  v3 += v4;  v4 ^= v5;
        ROL(v3);  ROL(v4);  v5 += v1;  v1 ^= v2;
        ROL(v5);  ROL(v1);  v2 += v3;  v3 ^= v4;
        ROL(v2);  ROL(v3);  v4 += v5;  v5 ^= v1;
    
        // second version
        v6 = v1;
        ROL(v5);  ROL(v6);  v1 += v3;  v2 ^= v4;
        ROL(v1);  ROL(v2);  v3 += v6;  v4 ^= v5;
        ROL(v3);  ROL(v4);  v5 ^= v1;  v6 += v2;
        ROL(v5);  ROL(v6);  v1 += v4;  v2 ^= v3;
        ROL(v1);  ROL(v2);  v3 += v5;  v4 ^= v6;
        ROL(v3);  ROL(v4);  v5 ^= v2;  v6 += v1;
    
        // third version
        v6 = v1;
        ROL(v2);  v3 += v1;  v4 ^= v1;  v5 -= v1;
        ROL(v3);  v4 += v2;  v5 ^= v2;  v6 -= v2;
        ROL(v4);  v5 += v3;  v6 ^= v3;  v1 -= v3;
        ROL(v5);  v6 += v4;  v1 ^= v4;  v2 -= v4;
        ROL(v6);  v1 += v5;  v2 ^= v5;  v3 -= v5;
        ROL(v1);  v2 += v6;  v3 ^= v6;  v4 -= v6;
    Here each line of code can be executed in one CPU cycle on Haswell+. In all versions, we just return the 4 vars updated in the last line as the hash result. The versions with 6 variables may be preferable, since v6 "backs up" entropy that may be lost when state vars are combined. We have one more register free, since at this point we no longer need the pointer to the input data.

    Of course, the code should be repeated several times to mix the data well.

    The XXH64 finalization code is mostly serial and executes in 64 cycles. The Spooky128 finalization code requires 72 cycles, performing 72 updates (of the form x ?= y), each depending on the previous one. It updates each of its 12 variables 6 times.

    Targeting, say, 36 CPU cycles for finalization, I can run the code above 6-7 times. In this case it will perform 70-108 variable updates, i.e. each variable will be updated 12-18 times. It looks superior to Spooky in every aspect, unless its update order is smarter than mine.

#4 - Cyan
    You mean I can extend the library with XXH128, somewhat like MurmurHash_x86_128 - essentially just add XXH128 code to every instance of XXH32+XXH64 code in the existing xxhash.[ch]?

    Don't feel limited.
    XXH128 is essentially a good reason to add a new component.
    But it can deliver more.
    In particular, I like your idea of a hash that can provide multiple hash lengths.


    And yes, it's important to create a good and fast (i.e. short) finalization stage.
    It's very important to deal with short keys in a minimum number of cycles.
    The SMHasher module testing short-key speed is good enough to evaluate this objective.

#5 - Bulat Ziganshin
    First, a few notes about SMHasher.

    It measures execution times using the RDTSC instruction, which is very precise, counting the exact number of ticks between two RDTSC commands. Unfortunately, it has severe drawbacks.

    When RDTSC was first implemented, CPU frequencies were fixed, and RDTSC really measured the number of CPU cycles. Later, TurboBoost was introduced, increasing the frequency under load, but RDTSC continued to count the actual number of CPU cycles executed, i.e. its frequency changed over time. On modern Intel CPUs, however, RDTSC counts cycles at the base frequency, i.e. it has returned to measuring with a fixed frequency.

    E.g. on my i7-4770, the base frequency is 3.4 GHz, and RDTSC returns time*3.4e9, irrespective of the real CPU frequency at the time of measurement. And since the real frequency may go up to 3.9 GHz, RDTSC-based SMHasher measurements converted to bytes/cycle may be up to 15% more optimistic than the real value. For example, XXH32 can't be faster than 2 bytes/cycle due to its 2 MUL operations per 4 input bytes, but SMHasher measures bulk speed up to ~2.2 bytes/cycle.

    Another problem is that the RDTSC instruction is executed out-of-order, like most other instructions. This isn't a problem when measuring longer intervals, but a single call to a hash like XXH32, with 1-32 bytes of input data, executes only a few dozen instructions, while the CPU's instruction reorder buffer holds more than 100 instructions. This means that an attempt to measure the time of a single small-input hash operation with RDTSC may give absolutely meaningless results - just compare the results produced by two successive runs.

    There is a standard solution to this problem - serializing RDTSC with CPUID - but for our needs it's much better to just measure the time of 1000 hashing calls. I made this change to SMHasher, and now the "Small key speed test" results are reproducible with an accuracy of 5%. Of course, they are still 10-15% more optimistic than real speeds due to the RDTSC behavior described above.

    Finally, we can measure 2 small-input speeds: latency (the time to get a result) and throughput (how many hashes can be computed per second). Latency is more interesting for use cases where the hash value is needed immediately, for example to index a hash table or find a value in a tree. Since the next operation depends on the concrete hash value, we have to wait until it is computed.

    OTOH, when all we need is to store the hash value somewhere in memory, the next operation doesn't depend on the concrete value of the hash and may be overlapped with the hash computation - for such scenarios, throughput is more representative. So I measured both values, but optimized for latency, believing it to be the more common use case for small-key requests. A tiny sample of the new speed reports:

    Code:
    Small key speed test..........
       0-byte keys - latency    16.60 cycles/hash,  throughput    15.00 cycles/hash
       1-byte keys - latency    20.95 cycles/hash,  throughput    15.25 cycles/hash
       2-byte keys - latency    25.64 cycles/hash,  throughput    17.63 cycles/hash
       3-byte keys - latency    29.67 cycles/hash,  throughput    19.61 cycles/hash
       4-byte keys - latency    21.03 cycles/hash,  throughput    17.00 cycles/hash
    You can find new speed reports for XXH, Spooky, MurMur3 and multiple ZZH variants at https://github.com/Bulat-Ziganshin/F...Hasher/reports
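    The two measurement loops differ only in whether each call depends on the previous result. A sketch of the idea, assuming a simple rdtsc() wrapper (this is an illustration, not the actual SMHasher patch):
    Code:
    #include <stddef.h>
    #include <stdint.h>
    
    uint64_t rdtsc(void);                                    /* assumed TSC wrapper */
    uint32_t hash(const void *p, size_t n, uint32_t seed);   /* hash under test */
    
    /* Latency: chain each result into the next seed, so calls can't overlap. */
    double latency_cycles(const void *p, size_t n)
    {
        uint32_t seed = 0;
        uint64_t t0 = rdtsc();
        for (int i = 0; i < 1000; i++)
            seed = hash(p, n, seed);
        return (double)(rdtsc() - t0) / 1000;
    }
    
    /* Throughput: independent calls may overlap in the out-of-order core. */
    double throughput_cycles(const void *p, size_t n)
    {
        volatile uint32_t sink = 0;    /* keeps the calls from being optimized out */
        uint64_t t0 = rdtsc();
        for (int i = 0; i < 1000; i++)
            sink += hash(p, n, (uint32_t)i);
        return (double)(rdtsc() - t0) / 1000;
    }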


#6 - Bulat Ziganshin
    Comparing XXH/MurMur3 and ZZH performance on short inputs, I've found two parts I need to improve - remainder (last incomplete block) processing and final mixing of the hash state. Let's start with remainder processing.

    My initial design was as simple as
    Code:
        U32 remainder[5] = {0};
        memcpy(remainder, p, bEnd-p);
        p = (const BYTE*)remainder;
        ROUND();
    The main idea behind it was to have branch-less code, copying a variable amount of data with a single REP MOVS operation. But I forgot that modern compilers translate memcpy to anything but REP MOVS! So the generated code still had branches, only in a less controlled way. Finally, I replaced it with a MurMur-inspired switch statement. It has the advantage of using only one computed/conditional jump for the entire remainder processing, and this jump is still highly predictable when the remainder size repeats across calls.
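    For reference, the shape of that switch - this fragment is the classic MurmurHash3 tail for one 4-byte word, shown only to illustrate the single-jump, fall-through structure; len, tail, state and UPDATE are placeholders, not the zzHash code:
    Code:
    uint32_t k = 0;
    switch (len & 3) {                        /* single computed jump */
    case 3: k ^= (uint32_t)tail[2] << 16;     /* fall through */
    case 2: k ^= (uint32_t)tail[1] << 8;      /* fall through */
    case 1: k ^= (uint32_t)tail[0];
            state = UPDATE(state, k);
    case 0: break;
    }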

    Another approach I tried is equivalent to the two tail loops of XXH32, and SMHasher even reports it as faster by 1-3 cycles (here it's on the left, with the switch approach on the right). But this approach includes up to 7 conditional jumps. While SMHasher measures speed by performing millions of calls with the same input size, these jumps appear highly predictable and therefore almost free. But in real applications the input size may be highly variable, so in the worst case we will spend up to 7*14/2 = 49 extra cycles per call on average in mispredicted jumps, while with the switch approach we have only one computed jump and spend only 14 extra cycles.

    So I believe the switch approach is superior to the XXH one, although the advantage cannot be measured by the SMHasher benchmark.

#7 - willy
    Hello! (newcomer here).

    Quote Originally Posted by Bulat Ziganshin View Post
    When RDTSC was first implemented, CPU frequencies were fixed, and RDTSC really measured the number of CPU cycles. Later, TurboBoost was introduced, increasing the frequency under load, but RDTSC continued to count the actual number of CPU cycles executed, i.e. its frequency changed over time. On modern Intel CPUs, however, RDTSC counts cycles at the base frequency, i.e. it has returned to measuring with a fixed frequency.

    E.g. on my i7-4770, the base frequency is 3.4 GHz, and RDTSC returns time*3.4e9, irrespective of the real CPU frequency at the time of measurement. And since the real frequency may go up to 3.9 GHz,
    Assuming you're not seeking the highest speed but reproducible results, you should limit your max frequency to your nominal frequency. That's what I do on my machines during various speed tests, and it's especially important on a laptop, where you don't want to hit thermal throttling, nor a situation where not all cores can run at max frequency at the same time and they spend their time alternating. For example:

    Code:
    # grep '' /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
    cpu0/cpufreq/scaling_max_freq:3300000
    cpu1/cpufreq/scaling_max_freq:3300000
    cpu2/cpufreq/scaling_max_freq:3300000
    cpu3/cpufreq/scaling_max_freq:3300000
    
    # cd /sys/devices/system/cpu
    # for i in */cpufreq/scaling_max_freq; do echo 3010000 > $i; done
    And now my CPU doesn't go beyond its nominal frequency. Note that you may have to add 1-3%, as I did above, to cover for some inaccurate internal calculations; otherwise you'll never reach the target frequency. Similarly, if your CPU spends its idle time at a low frequency, you can either force it to run fast, or, if you're on the intel pstate driver where that's harder, do what I do all the time: "pre-heat" the machine by running a short loop (~100 ms) before the program. That's enough for the system to switch to the max frequency and to ensure a stable frequency throughout the measurement period.


#8 - Bulat Ziganshin
    You are right, except that I'm on Windows. I can set it through the BIOS, though, and maybe via programs too. At some point I will try it, at least to check my guess that XXH32 is actually limited to 2 bytes/cycle.

    AFAIK, a warm-up time of about a million CPU cycles, i.e. around a millisecond, should be enough. But it should exercise the SIMD extensions used in the measured loop; e.g. by default AVX commands are executed with halved throughput, and you need to run thousands of AVX instructions before the full-throughput engine is enabled. AFAIR, I read these details in Agner's tutorials.

#9 - willy
    Quote Originally Posted by Bulat Ziganshin View Post
    You are right, except that I'm on Windows. I can set it through the BIOS, though, and maybe via programs too.
    When I was using Windows, 18 years ago, motherboards were shipped with a CD to manipulate a lot of stuff without having to reboot into the BIOS. Some of them even started to provide overclocking utilities which let you precisely control the upper and lower frequencies. I'm sure these programs still exist. It may be enough for you to simply start the utility and reload your benchmark profile.

    AFAIK, a warm-up time of about a million CPU cycles, i.e. around a millisecond, should be enough. But it should exercise the SIMD extensions used in the measured loop; e.g. by default AVX commands are executed with halved throughput, and you need to run thousands of AVX instructions before the full-throughput engine is enabled. AFAIR, I read these details in Agner's tutorials.
    It depends a lot on the operating system and its tuning. On Linux, the cpufreq subsystem takes into account the latency of switching between frequencies to decide when to switch, and I'm pretty sure Windows proceeds similarly; that's important to preserve laptop batteries. A PLL takes time to stabilize, so it's never very fast, and 1 ms seems quite short to me. The latency I see is generally around 20-50 ms on Linux, which is why 100 ms is enough in my tests.

    Hoping this helps!


#10 - Member (Switzerland)
    https://github.com/google/highwayhas...anobenchmark.h is an attempt at highly accurate and repeatable benchmarking of small and fast functions when they are being used in their actual usage environment, not just in special benchmarks.


#11 - Jan Wassenberg
    Thanks for mentioning it! Some more details: we use fences to solve the RDTSC reordering problem. It's important to avoid the highly variable CPUID instruction within the region being measured, but LFENCE+RDTSCP works very well.

    To avoid unrealistic branch prediction, we pass a random input from a given distribution to every iteration.

    Result: 30 independent measurements of memcpy(d, s, 3) have a median absolute deviation of around 0.3 cycles (<1%).


#12 - Bulat Ziganshin
    We can split all hashing algorithms into 2 categories:

    The first category is hashes whose result has (almost) the same size as the internal state. This includes all MurMur3 hashes.

    The second category is hashes whose result is (much) shorter than the internal state. XXH32 compiles a 32-bit result from a 128-bit (16-byte) internal state, XXH64 makes a 64-bit result from a 256-bit (32-byte) state, and Spooky makes a 128-bit result from 768 bits (96 bytes) of state.

    And these hashes tend to use shorter computations for smaller inputs. XXH32 uses a 32-bit internal state when hashing less than 16 bytes, XXH64 uses a 64-bit state for inputs shorter than 32 bytes, and Spooky employs a 256-bit state for inputs shorter than 96 bytes.

    The reason is obvious - with a larger internal state you need more time to initialize all the vars, and then again more time to combine them into the final result. For example, XXH64 spends 12 cycles in the short finalization procedure, which deals with a single 64-bit state variable and is applied only to short inputs, while for longer inputs the final mixing spends 36 cycles dealing with four 64-bit state vars.

    It should also be obvious why no MurMur3 hash variation has shorter final mixing for small inputs - their hash result already has the same size as the internal state, so there is no way to make things faster by using a smaller internal state for short inputs.



    ZZH has a 160-bit internal state and a 128-bit result, so it plays in the same league as the MurMur3 hashes. This means I can't add simplified processing for small inputs the way XXH32/64 do, or at least can't do it in the same manner. As a net result, I can't match XXH32/64 speed on short inputs.

    For final mixing, I evaluated the approaches implemented by hashes in the XXH, MurMur3 and Spooky families. XXH32 has very weak final mixing, as clearly demonstrated by its SMHasher results. OTOH, an experimental hash combining the main XXH32 algo with XXH64 final mixing has the same ideal SMHasher quality as other good hashes. XXH64 final mixing is a bit too slow, performing most of its computations sequentially on a single var. At the same time, I failed to apply the Spooky approach, as well as my own ideas. As a result, ZZH employs the Murmur3c_x86_128 final mixing code, with a somewhat more flexible choice of constants/operations, which improved the resulting hash quality.

    I also applied many small tricks here and there. The most notable one is the switch-first idea from the Golang hash, further decreasing the number of conditional branches. There are lots of other high-performance hashes available, and I plan to study at least Google's farm/city/highway family as a probable source of new ideas.



    In the end, I managed to outperform the only direct competitor among the XXH/MurMur3/Spooky families, namely Murmur3c_x86_128, on every input length, both for latency and throughput, in addition to being 2x faster on long inputs. ZZH may be slower than other hashes on short inputs (i.e. 1-32 bytes long), but in all such cases there are reasons - the other hashes either compute shorter results or are x64-optimized.

    Since 128 bits = 16 bytes, a 10-20% speed difference on 1-32 byte inputs between ZZH and other hashes shouldn't be a real issue - I believe that in most use cases 128-bit hashes are computed over larger inputs.

    So, current ZZH features:
    - quality: ideal, to the extent that SMHasher can check. It measures the same quality level for SHA1, MD5, SpookyV2, Murmur3c_x86_128 and XXH64, to name a few. I will make a separate post about it.
    - bulk speed: 2.5 bytes/cycle, fastest among all pure x86 hashes passing SMHasher
    - speed on 1-32 byte inputs: always faster than the only direct competitor, Murmur3c_x86_128
    Yann, does it now seem to meet all your requirements for XXH128?



    32/64-bit hashes can be derived from ZZH. The key point is that on 40+ byte inputs ZZH is already on par with XXH32 and strictly outperforms the other hashes listed. So we just need to add a prologue switch statement, processing shorter inputs with a smaller internal state (recall "the second category" above):
    Code:
    uint32_t ZZH32(const void *buf, size_t len, uint32_t seed)
    {
      switch (len)
      {
        case 0:  ...; break;
        ...
        case 40: ...; break;
    
        default: { uint32_t result[4];
                   ZZH128(buf, len, seed, result);
                   return result[0]; }
      }
    }


#13 - Cyan
    It certainly looks like a fairly strong start.

    Let's finalize something to start tests and code review.

#14 - Bulat Ziganshin
    I'm almost out of ideas, so it will be finalized soon.

    I think the best way to start the review is to fork the xxHash repository and add a portable single-call XXH128 implementation in a separate file, using all the XXH infrastructure (alignment, ROL and other macros). What do you think?

    And it will be very interesting to see a NanoBenchmark of ZZH against XXH/MurMur3a - there is a possibility that ZZH will be better even on small keys due to its minimized number of conditional jumps.

#15 - Cyan
    Yes, that's a good idea.

    I've created the branch xxh128 on GitHub for you:
    https://github.com/Cyan4973/xxHash/tree/xxh128

    You can PR into it whenever you're ready.

#16 - Jan Wassenberg
    Quote Originally Posted by Bulat Ziganshin View Post
    The first category is hashes whose result has (almost) the same size as internal state.
    Those are vulnerable to length-extension attacks: given a message and hash (the complete state), you can easily append something, initialize the hash with the given state, and update it with the appended data to get a valid signature.
    Not sure about your intended application, but that would be bad news for file integrity checks and many others.

    People may end up using hash functions without realizing the security implications, or that (malicious) collisions can lead to terrible overall performance.
    I strongly feel that general-purpose hash functions should provide at least MAC-level security.
    The added cost (much less than prior MACs such as SipHash) is totally worthwhile if it prevents O(N^2) behavior in hash tables (flooding), poor load-balancing, or data loss.

    One implication is that the hash should have internal state at least twice as large as the output.

    Quote Originally Posted by Bulat Ziganshin View Post
    The most notable one is the switch-first idea from the Golang hash, further decreasing the number of conditional branches. There are lots of other high-performance hashes available, and I plan to study at least Google's farm/city/highway family as a probable source of new ideas.
    FYI, we're currently thinking about a short-string hash for inputs much smaller than HighwayHash's 32-byte blocks. Another little trick can be applied: when the input is 4..8 bytes (e.g. after switch-first), you can simply load two unaligned 32-bit words; repeating the middle bytes doesn't appear to cause harm.
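    The trick, spelled out (my illustration, not HighwayHash code): for 4 <= len <= 8, two unaligned 32-bit loads cover the whole input, and the middle bytes simply appear in both words.
    Code:
    #include <stdint.h>
    #include <string.h>
    
    static void load_4to8(const uint8_t *p, size_t len, uint32_t *lo, uint32_t *hi)
    {
        memcpy(lo, p, 4);             /* first 4 bytes */
        memcpy(hi, p + len - 4, 4);   /* last 4 bytes, overlapping when len < 8 */
    }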

#17 - Bulat Ziganshin
    Those are vulnerable to length-extension attacks: given a message and hash (the complete state), you can easily append something, initialize the hash with the given state, and update it with the appended data to get a valid signature.
    These hashes compute the result from the internal state, usually with non-reversible operations such as h += h >> 16, so an attacker may need to choose among multiple variants - even if he has enough computational resources to reverse the computation at all!

    ZZH compresses a 160-bit state into a 128-bit output, so there are 2^32 variants on average. Unfortunately, I can't use more than 5-6 registers for the internal state without losing speed on the x86 platform. Yeah, that 8080 legacy still bites us!


    That said, I believe that cryptographic algorithms may be developed in two ways:
    1) you have a math proof that your algorithm is correct, and world-class cryptographers have checked the proof - you can see an example of that in the UMAC/VMAC papers, whose security is proven to be equivalent to AES security
    2) you have constructed an ad-hoc algorithm, and world-class cryptographers have unsuccessfully tried to attack it for several years

    Both ways are definitely outside my skills. It's interesting to note, however, that SipHash got widespread usage not because it was proven secure, but because it hasn't yet been disproven, and SipHash users can't afford really proven MAC algorithms due to their slowness on short inputs.


    Going back: my initial goal was to displace XXH32 with a better-quality hash, but not a MAC/PRF. Now I have a 128-bit hash with a 160-bit state, and I plan to add 32/64-bit backends later. Checking a hash algo with SMHasher is work within my skills, and ZZH has better quality than XXH32 and MurMur3a_x86_32. So this project targets a fast "SMHasher-ideal" pure-x86 hash with 32-128 bits of output. It should be useful in situations where attacks are impossible (or ignored) and pure-x86 speed (or the minimal speed across all platforms) is the priority.


    Another little trick can be applied: when the input is 4..8 bytes (e.g. after switch-first), you can simply load two unaligned 32-bit words; repeating the middle bytes doesn't appear to cause harm.


    That's almost how things are already implemented in ZZH. The XXH framework on non-x86 first checks whether the input is aligned, and this check, naturally, may be skipped for small inputs. It's a problem of the XXH framework, however.


    FYI we're currently thinking about a short-string hash for inputs much smaller than HighwayHash's 32 byte blocks
    I had the idea of proposing that you use SipHash for small inputs and 64-bit output. The key point for you is to ensure that the small algo is also a PRF.

#18 - Jan Wassenberg
    Quote Originally Posted by Bulat Ziganshin View Post
    These hashes compute the result from the internal state, usually with non-reversible operations such as h += h >> 16
    That's OK, but when the hash result is the entire state, no reversing is needed - we can continue from there, as if the hash had never stopped.

    Quote Originally Posted by Bulat Ziganshin View Post
    ZZH compresses a 160-bit state into a 128-bit output, so there are 2^32 variants on average. Unfortunately, I can't use more than 5-6 registers for the internal state without losing speed on the x86 platform. Yeah, that 8080 legacy still bites us!
    I understand - 2^32 is within reach of brute force; 2^64 would be nice, if we could just spare one more register.

    Quote Originally Posted by Bulat Ziganshin View Post
    2) you have constructed an ad-hoc algorithm, and world-class cryptographers have unsuccessfully tried to attack it for several years
    Both ways are definitely outside my skills. It's interesting to note, however, that SipHash got widespread usage not because it was proven secure, but because it hasn't yet been disproven, and SipHash users can't afford really proven MAC algorithms due to their slowness on short inputs.
    Yes. We are having to go that route as well, because the structure is new and we haven't been able to reduce it to some provably secure primitive.

    Quote Originally Posted by Bulat Ziganshin View Post
    It should be useful in situations where attacks are impossible (or ignored) and pure-x86 speed (or the minimal speed across all platforms) is the priority.
    That's fair. I just wish such hash functions carried a comment so people would know not to use them when attacks are possible.

    Quote Originally Posted by Bulat Ziganshin View Post
    I had the idea of proposing that you use SipHash for small inputs and 64-bit output. The key point for you is to ensure that the small algo is also a PRF.
    Yes, thanks - we'll need to carefully verify that one as well. The prototype is 2-3x as fast as SipHash for 3..10 bytes, but let's see if it withstands scrutiny.

#19 - Bulat Ziganshin
    Quote Originally Posted by willy View Post
    It depends a lot on the operating system and its tuning. On Linux, the cpufreq subsystem takes into account the latency of switching between frequencies to decide when to switch, and I'm pretty sure Windows proceeds similarly; that's important to preserve laptop batteries. A PLL takes time to stabilize, so it's never very fast, and 1 ms seems quite short to me. The latency I see is generally around 20-50 ms on Linux, which is why 100 ms is enough in my tests.
    Probably 1 ms is the time required to warm up the AVX engine. So, combining both suggestions together, the CPU should be warmed up for 100 ms, including at least 1 ms of the actual hash code (to enable AVX if required).

    Quote Originally Posted by willy View Post
    When I was using Windows, 18 years ago, motherboards were shipped with a CD to manipulate a lot of stuff without having to reboot into the BIOS. Some of them even started to provide overclocking utilities which let you precisely control the upper and lower frequencies. I'm sure these programs still exist. It may be enough for you to simply start the utility and reload your benchmark profile.
    You are right again! My ASRock Z87 mobo has the overclocking software, and I just tried setting a 34x multiplier - equal to the RDTSC frequency. This immediately gave me the results I expected:
    Code:
    --- Testing XXH32 (xxHash, 32-bit result)
    
    Bulk speed test - 262144-byte keys
    Alignment  0 -  1.999 bytes/cycle - 5719.12 MiB/sec @ 3 ghz
    Alignment  1 -  1.995 bytes/cycle - 5708.84 MiB/sec @ 3 ghz
    Alignment  2 -  1.995 bytes/cycle - 5708.91 MiB/sec @ 3 ghz
    Alignment  3 -  1.995 bytes/cycle - 5708.85 MiB/sec @ 3 ghz
    Alignment  4 -  1.999 bytes/cycle - 5717.97 MiB/sec @ 3 ghz
    Alignment  5 -  1.995 bytes/cycle - 5708.34 MiB/sec @ 3 ghz
    Alignment  6 -  1.995 bytes/cycle - 5708.22 MiB/sec @ 3 ghz
    Alignment  7 -  1.995 bytes/cycle - 5708.35 MiB/sec @ 3 ghz
    
    -------------------------------------------------------------------------------
    --- Testing ZZH32 (zzHash, 32-bit result)
    
    Bulk speed test - 262144-byte keys
    Alignment  0 -  2.534 bytes/cycle - 7248.66 MiB/sec @ 3 ghz
    Alignment  1 -  2.467 bytes/cycle - 7058.31 MiB/sec @ 3 ghz
    Alignment  2 -  2.467 bytes/cycle - 7058.78 MiB/sec @ 3 ghz
    Alignment  3 -  2.467 bytes/cycle - 7057.85 MiB/sec @ 3 ghz
    Alignment  4 -  2.533 bytes/cycle - 7247.02 MiB/sec @ 3 ghz
    Alignment  5 -  2.467 bytes/cycle - 7057.50 MiB/sec @ 3 ghz
    Alignment  6 -  2.465 bytes/cycle - 7051.30 MiB/sec @ 3 ghz
    Alignment  7 -  2.467 bytes/cycle - 7057.45 MiB/sec @ 3 ghz
    
    -------------------------------------------------------------------------------
    --- Testing XXH64 (xxHash, 64-bit result)
    
    Bulk speed test - 262144-byte keys
    Alignment  0 -  3.995 bytes/cycle - 11430.85 MiB/sec @ 3 ghz
    Alignment  1 -  3.980 bytes/cycle - 11387.13 MiB/sec @ 3 ghz
    Alignment  2 -  3.980 bytes/cycle - 11387.94 MiB/sec @ 3 ghz
    Alignment  3 -  3.980 bytes/cycle - 11387.96 MiB/sec @ 3 ghz
    Alignment  4 -  3.980 bytes/cycle - 11386.61 MiB/sec @ 3 ghz
    Alignment  5 -  3.980 bytes/cycle - 11386.65 MiB/sec @ 3 ghz
    Alignment  6 -  3.975 bytes/cycle - 11373.75 MiB/sec @ 3 ghz
    Alignment  7 -  3.979 bytes/cycle - 11385.06 MiB/sec @ 3 ghz
    
    -------------------------------------------------------------------------------
    --- Testing Spooky128 (SpookyHash V2, 128-bit result)
    
    Bulk speed test - 262144-byte keys
    Alignment  0 -  4.172 bytes/cycle - 11935.92 MiB/sec @ 3 ghz
    Alignment  1 -  3.984 bytes/cycle - 11397.93 MiB/sec @ 3 ghz
    Alignment  2 -  3.984 bytes/cycle - 11399.41 MiB/sec @ 3 ghz
    Alignment  3 -  3.983 bytes/cycle - 11396.52 MiB/sec @ 3 ghz
    Alignment  4 -  3.983 bytes/cycle - 11396.41 MiB/sec @ 3 ghz
    Alignment  5 -  3.983 bytes/cycle - 11396.68 MiB/sec @ 3 ghz
    Alignment  6 -  3.982 bytes/cycle - 11393.81 MiB/sec @ 3 ghz
    Alignment  7 -  3.984 bytes/cycle - 11397.44 MiB/sec @ 3 ghz
    
    -------------------------------------------------------------------------------
    --- Testing Wide64 (x64-optimized zzHash, 64-bit result)
    
    Bulk speed test - 262144-byte keys
    Alignment  0 -  5.651 bytes/cycle - 16166.86 MiB/sec @ 3 ghz
    Alignment  1 -  5.381 bytes/cycle - 15396.20 MiB/sec @ 3 ghz
    Alignment  2 -  5.382 bytes/cycle - 15398.21 MiB/sec @ 3 ghz
    Alignment  3 -  5.382 bytes/cycle - 15398.61 MiB/sec @ 3 ghz
    Alignment  4 -  5.379 bytes/cycle - 15389.72 MiB/sec @ 3 ghz
    Alignment  5 -  5.382 bytes/cycle - 15396.69 MiB/sec @ 3 ghz
    Alignment  6 -  5.375 bytes/cycle - 15379.21 MiB/sec @ 3 ghz
    Alignment  7 -  5.382 bytes/cycle - 15397.25 MiB/sec @ 3 ghz
    For comparison, the default 39x setting shows an unrealistic XXH32 speed:
    Code:
    --- Testing XXH32 (xxHash, 32-bit result)
    
    Bulk speed test - 262144-byte keys
    Alignment  0 -  2.226 bytes/cycle - 6367.41 MiB/sec @ 3 ghz
    Alignment  1 -  2.224 bytes/cycle - 6362.33 MiB/sec @ 3 ghz
    Alignment  2 -  2.222 bytes/cycle - 6357.77 MiB/sec @ 3 ghz
    Alignment  3 -  2.225 bytes/cycle - 6365.48 MiB/sec @ 3 ghz
    Alignment  4 -  2.233 bytes/cycle - 6389.56 MiB/sec @ 3 ghz
    Alignment  5 -  2.229 bytes/cycle - 6377.40 MiB/sec @ 3 ghz
    Alignment  6 -  2.227 bytes/cycle - 6371.37 MiB/sec @ 3 ghz
    Alignment  7 -  2.230 bytes/cycle - 6381.31 MiB/sec @ 3 ghz
    
    -------------------------------------------------------------------------------
    --- Testing ZZH32 (zzHash, 32-bit result)
    
    Bulk speed test - 262144-byte keys
    Alignment  0 -  2.811 bytes/cycle - 8043.41 MiB/sec @ 3 ghz
    Alignment  1 -  2.757 bytes/cycle - 7888.47 MiB/sec @ 3 ghz
    Alignment  2 -  2.751 bytes/cycle - 7870.88 MiB/sec @ 3 ghz
    Alignment  3 -  2.756 bytes/cycle - 7885.55 MiB/sec @ 3 ghz
    Alignment  4 -  2.823 bytes/cycle - 8076.45 MiB/sec @ 3 ghz
    Alignment  5 -  2.750 bytes/cycle - 7868.08 MiB/sec @ 3 ghz
    Alignment  6 -  2.751 bytes/cycle - 7870.35 MiB/sec @ 3 ghz
    Alignment  7 -  2.747 bytes/cycle - 7859.87 MiB/sec @ 3 ghz


#20 - Member (Canada)
    Quote Originally Posted by Bulat Ziganshin View Post
    This immediately gave me the results I expected:
    The results are very impressive. I think they should be integrated into xxHash as xxHash32 v2, xxHash64 v2 and xxHash128. I would use them in my future projects for sure.

#21 - Bulat Ziganshin
    That was my initial proposal, sort of. Now we have an x86-optimized hash with 128-bit output, and plans to add 32/64-bit backends. Using the MurMur naming convention, they may be called:
    • ZZH_x86_128
    • ZZH_x86_64
    • ZZH_x86_32

    Note that ZZH_x86_64 can't replace XXH64, since XXH64 is faster on 64-bit CPUs.



    The second part of my initial plan was to develop a family of ZZH_x64_* hashes, but as Yann pointed out:
    • every x64 CPU has at least SSE2 support; some even have AVX2 (twice as fast as SSE2), and newer ones will get AVX512
    • most x86 CPUs (or x64 CPUs in 32-bit mode) support the same SSE2/AVX2/AVX512, although with only 8 SIMD registers
    • ARM: every 64-bit CPU, as well as high-end 32-bit CPUs, supports NEON
    • other RISC CPUs (Power, Sparc, MIPS..) have their own SIMD extensions too
    • modern GPUs are essentially 32-bit processors - they are much slower on 64-bit operations than on 32-bit ones

    All in all, this means we should look into developing a hash built on 32-bit operations supported by SSE/NEON instead of 64-bit operations. There is an initial version of such a hash in my repository, which has already passed the SMHasher tests (although with somewhat weaker results than the "ideal hashes" mentioned above).

    A single hash round processing one SIMD word consists of 2 MUL32_64 + 2 SHUFFLE + 2 ADD operations. I expect it to require ~3 CPU cycles, i.e. to run at 5-6 bytes/cycle on SSE2 and twice as fast on AVX2, but I haven't checked this assumption yet. 3 SIMD registers should be reserved to store the two multiplication constants and an intermediate result (although we may try to read the multiplication constants from memory, or use a single constant for both multiplications).
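    A guess at what such a round could look like in SSE2 intrinsics, reconstructed purely from the operation counts above (not the actual repository code; the shuffle patterns and constants are placeholders):
    Code:
    #include <emmintrin.h>
    
    /* One round per 16 input bytes: 2 MUL32_64 + 2 SHUFFLE + 2 ADD.
       k1/k2 hold the multiplication constants in registers. */
    static __m128i round_sse2(__m128i state, __m128i input, __m128i k1, __m128i k2)
    {
        state = _mm_add_epi64(state, input);    /* ADD: combine input */
        __m128i lo = _mm_mul_epu32(state, k1);  /* 32x32->64 on lanes 0,2 */
        __m128i hi = _mm_mul_epu32(_mm_shuffle_epi32(state, _MM_SHUFFLE(2,3,0,1)), k2);
        return _mm_add_epi64(lo, _mm_shuffle_epi32(hi, _MM_SHUFFLE(1,0,3,2)));
    }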

    Note that my best x64-optimized hash runs at 5.6 bytes/cycle, so it seems the SIMD hash will be no slower than the x64 hash, and faster in many situations (32-bit CPU with SIMD, AVX2 CPU, GPU).



    A SIMD hash with a 128-bit internal state may be employed to produce 32/64-bit results and a somewhat weak 128-bit result. Since 128 bits fit entirely into x86 registers, it runs pretty fast on pure x86, at 0.9 bytes/cycle. So the first batch of SIMD hashes may consist of:
    • ZZH_simd128_128
    • ZZH_simd128_64
    • ZZH_simd128_32

    Of course, in order to optimize the SIMD implementation we should exploit ILP and process at least 4 SIMD registers simultaneously. And for fast AVX2/AVX512 implementations we should process 2-4 times more data, i.e. split the data stream into as many as 16 lanes, each 128 bits wide. OTOH, fast execution on shorter inputs will require a small-state "prologue", the same as for ZZH32.



    Without increasing the computation time, it can be turned into an algorithm with any N*128-bit internal state - we just need to add together the results computed from different SIMD registers. Increasing the internal state will slow down the pure x86 implementation, but allows producing 128..512-bit results, calling for the next batch of hashes:
    • ZZH_simd256_128
    • ZZH_simd512_256
    • ZZH_simd768_512

#22 - Member (Canada)
    It would be nice if the SIMD and scalar implementations were compatible with each other and gave the same hash result for the same data. I don't know if that's possible while maintaining the same level of performance.

    Maybe they should also always operate in chunked mode by default and provide the same hash result whether used with multithreading or not. That way, whatever optimized back-end is used, the new functions would return the same hash result for a selected hash width.

#23 - Jan Wassenberg
    Statistical analysis and preliminary cryptanalysis of HighwayHash (mentioned above): https://arxiv.org/abs/1612.06257

    Interestingly, the avalanche effect of SipHash13 and HighwayHash is comparable to that of the Blake2bp cryptographic hash.

    We welcome any comments you might have!


#24 - Member (Switzerland)
    Quote Originally Posted by Jan Wassenberg View Post
    Interestingly, the avalanche effect of SipHash13 and HighwayHash is comparable to that of the Blake2bp cryptographic hash.
    It is difficult to explain why SipHash13 has better avalanche bias than SipHash24. It is hard to believe that more mixing leads to a less mixed result, unless the mixing function has some strange cyclic characteristics. Perhaps it would be instructive to run different variations of SipHash - SipHash23, SipHash24, SipHash25, SipHash26, etc., and similarly SipHash13, SipHash14, ... - and see if there is a pattern where certain repetition counts lead to worse bias.

    Another possible theory is that SipHash mixing is somehow "overly flat", with some determinism in the mixing function that makes it show (when observed this way) statistically good mixing. When more rounds are added this property starts to disappear, and the avalanche bias gets worse while the true mixing gets better.

    HighwayHash is '14' by default. It would be interesting to observe the avalanche bias while reducing it to '13' and '12' - and also possibly increasing it, just to see if it holds a similar mystery to SipHash13 vs SipHash24.

    Disclaimer: I'm one of the authors of the paper posted by Jan.

#25 - Jan Wassenberg
    Interesting idea, thanks for suggesting it!
    We can either compare the geometric mean of the ratios of biases, or the average difference in bias relative to the average bias.
    Going from Sip13 to Sip14 with <=16-byte inputs increases the geomean and also the average difference by 4%.
    Going from Sip13 to Sip23 has less of an effect: +2% geomean, +2% average.


    HighwayHash requires at least 4 finalization rounds, but we can compare the normal '14' variant vs '15'.
    The experiment is still in progress, but for <=11-byte inputs we see a 0.98 geomean and -1% bias.


    More HighwayHash rounds decrease bias as expected, but it is unclear why the opposite happens with SipHash, and why finalization rounds have more of an effect than update rounds.

#26 - Member (Śląsk, PL)

    SeaHash is making a splash

    There's another supposedly high-quality hash out there:
    https://ticki.github.io/blog/seahash-explained/
    https://en.wikipedia.org/wiki/SeaHash
    I have no idea if it's really good.

#27 - Jan Wassenberg
    SeaHash looks reasonable; it also relies on multiplication, and the PCG variable-shift trick is cool.

    We find it to be slower (6.9 GB/s) than xxHash (10.74 GB/s) for 4K inputs, rather than 3-20% faster.
    Perhaps the difference is due to the translation between C++ and Rust, or to different input sizes, but the intended application (filesystems) would seem to involve large inputs.

    SeaHash relies on 64-bit multiplications, which means it cannot be vectorized using AVX2. This limits the throughput, and HighwayHash is faster for large inputs.


#28 - Jan Wassenberg
    If you've never used Intel's IACA, it might be worth a try; it enabled a 10% speedup in the AVX2 HighwayHash and 16-36% in the SSE4.1 version.

    This technique might also be useful elsewhere - we can emulate _mm_maskload_epi32 on sse4 using only 3 branches.

#29 - Bulat Ziganshin
    SeaHash has the same problem as xxHash - it manages 4 independent 64-bit states, each collecting entropy from its own channel. This is not enough for a good 64-bit hash.

    The mixing step isn't really bijective, as they claim. This line
    Code:
    a ← a ⊕ ((a ≫ 32) ≫ (a ≫ 60))
    produces the same result for 1<<32 and (2<<32)-1

    Their benchmark omits the modern 64-bit hashes: SpookyHash, XXH64, MurMur3c - at least the first two are no slower than SeaHash.

    That said, they have an interesting mixing step, and they could probably use more states to get 4 bytes/cycle even with worse compilers - SeaHash speed is limited by its 2 MUL operations per 8 bytes processed, since all CPUs I know of have only one MUL port.

    The really interesting question is what sort of guarantees that PCG heritage may give for constructing a hash function.

    We find it to be slower (6.9 GB/s) than xxHash (10.74 GB/s) for 4K inputs,
    They compare to XXH32 (~2 bytes/cycle); you seem to compare to XXH64 (~4 bytes/cycle).

#30 - Member (Śląsk, PL)
    Quote Originally Posted by Bulat Ziganshin View Post
    The mixing step isn't really bijective, as they claim. This line
    Code:
    a ← a ⊕ ((a ≫ 32) ≫ (a ≫ 60))
    produces the same result for 1<<32 and (2<<32)-1
    No?
    Code:
    #include <stdio.h>
    #include <stdint.h>
    
    uint64_t f(uint64_t x) {
            return x ^ ((x >> 32) >> (x >> 60));
    }
    int main() {
            uint64_t a = 1ull << 32;
            uint64_t b = (2ull << 32) - 1;
            printf("%016llx->%016llx\n", a, f(a));
            printf("%016llx->%016llx\n", b, f(b));
            return 0;
    }
    Output:
    Code:
    0000000100000000->0000000100000001
    00000001ffffffff->00000001fffffffe
    You can treat the input as 2 32-bit words, a (the high word) and b (the low word).
    Then this mix step is:
    a := a
    b := b ^ f(a)
    where f is some function (to be precise, f(x) := x >> (x >> 28)).
    If two numbers differ in word a, the outputs will differ in word a.
    If two numbers don't differ in a but differ in b, xoring the same constant f(a) into both won't make them equal.
    Therefore this function is an injection. Since the input and output sets are the same size (in fact they are the same set), an injective function between them must be a bijection.

    The other mix steps are bijective too, so the whole mix is bijective.



