i would like to see any ideas for checking hash quality once it has reached the 10/10 SMHasher score. The only idea i have so far was mentioned above: try to fool SMHasher with CRC+postprocessing. If we can fool SMHasher this way, it will mean that we can't seriously trust its results.
For xxHash vs FARSH: in FARSH, every word pair has the same influence on the intermediate 64-bit sum, so pairs are interchangeable in some sense (if key[2*i]+data[2*i] == key[2*j]+data[2*j], then you can easily exchange bits between data[2*i+1] and data[2*j+1]).
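A minimal sketch of the NH-style compression that FARSH builds on (as in UMAC; an illustration, not the actual FARSH source), showing why such pairs are interchangeable:

    #include <stdint.h>
    #include <stddef.h>

    /* NH-style compression: each word pair contributes one 64-bit product.
       If data[i]+key[i] == data[j]+key[j] (mod 2^32), swapping data[i+1]
       and data[j+1] leaves the sum unchanged. Assumes an even word count. */
    uint64_t nh_block(const uint32_t *data, const uint32_t *key, size_t words)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i + 1 < words; i += 2)
            sum += (uint64_t)(uint32_t)(data[i] + key[i])
                 * (uint32_t)(data[i + 1] + key[i + 1]);
        return sum;
    }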
In xxHash the 128-bit sum continues to be mixed up on every step, so you don't have such obvious patterns for fooling the hash. Of course, the probability of the above-mentioned equality is only 2^-32, so it may not be a real problem. I'm looking for others' opinions, especially if you have a way to prove your reasoning.
i would like to see any ideas for checking hash quality once it has reached the 10/10 SMHasher score. The only idea i have so far was mentioned above: try to fool SMHasher with CRC+postprocessing. If we can fool SMHasher this way, it will mean that we can't seriously trust its results.
SMHasher just checks for a number of common issues. IMHO passing it is not an argument that a hash is a quality one.
Ideas?
Split it into elements and evaluate their statistical properties. For a non-crypto hash, a strong compression function is *the* part that matters most.
Originally Posted by Bulat Ziganshin
For xxHash vs FARSH: in FARSH, every word pair has the same influence on the intermediate 64-bit sum, so pairs are interchangeable in some sense (if key[2*i]+data[2*i] == key[2*j]+data[2*j], then you can easily exchange bits between data[2*i+1] and data[2*j+1]).
Indeed, not good.
Originally Posted by Bulat Ziganshin
In xxHash the 128-bit sum continues to be mixed up on every step, so you don't have such obvious patterns for fooling the hash. Of course, the probability of the above-mentioned equality is only 2^-32, so it may not be a real problem. I'm looking for others' opinions, especially if you have a way to prove your reasoning.
The way I see it: https://github.com/Cyan4973/lz4/blob...r/lib/xxhash.c
xxHash splits the input into 4 independent streams with no intermixing. This is not bad by itself, but I think it is only slightly stronger than a single stream.
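For reference, the per-lane round from the linked xxhash.c (constants copied from the public source; simplified here): each of the four accumulators v1..v4 consumes one 32-bit word per 16-byte stripe, and the lanes only meet in the final merge.

    #include <stdint.h>

    #define PRIME32_1 2654435761u
    #define PRIME32_2 2246822519u

    /* xxHash32 round: each accumulator advances independently */
    static uint32_t xxh32_round(uint32_t acc, uint32_t lane)
    {
        acc += lane * PRIME32_2;
        acc = (acc << 13) | (acc >> 19);   /* rotate left by 13 */
        return acc * PRIME32_1;
    }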
Bulat, I have found minor issues in the docs:
1. The link to the source code is outdated.
2. This is not just old, but outdated.
3. Is the bitstream stable already?
4. Also, it would be good to compare the algorithm with its competitors.
i fixed the first two items. i will look into making the high-level algo faster, so the bitstream will change in 0.3+. Recorded item 4 as an issue. I will massage the readme a bit more and release 0.2 soon.
SMHasher just checks for a number of common issues. IMHO passing it is not an argument that a hash is a quality one.
Ideas?
Split it into elements and evaluate their statistical properties. For a non-crypto hash, a strong compression function is *the* part that matters most.
well, can you give a more detailed idea? do statistical properties guarantee a good hash (or at least provide better guarantees than SMHasher)?
Originally Posted by m^2
https://github.com/Cyan4973/lz4/blob...r/lib/xxhash.c
xxHash splits the input into 4 independent streams with no intermixing. This is not bad by itself, but I think it is only slightly stronger than a single stream.
i had just completely forgotten the xxHash internals. It manages 4 independent streams, each with 32-bit input elements and a 32-bit intermediate result. This looks inferior to the cycle i proposed above, both for speed and quality.
I'll give you only a partial answer now, as it's late and I'd rather write too little than write nonsense.
I suggest that you read http://burtleburtle.net/bob/hash/evahash.html
It gives some idea about the properties of a good hash.
If my memory serves me well, Bob sacrificed some formal purity for efficiency in his SpookyHash. He believes the hash is no-funnel but is unable to prove it, and when I was into it my gut told me he was wrong, but I didn't prove it either.
And yes, statistical analysis is *the* way to do hash QA.
1. SMHasher is a black box: it does the same kind of tests that you want to do, but over the entire function. Practically, you want your compression function alone to pass most of SMHasher. It doesn't need stuff like good distribution, but avalanching is key (see the sketch after this list).
2. Some hashes were designed specifically to pass SMHasher, finding ways to pass it that don't make them stronger.
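A minimal avalanche probe of the kind meant in item 1, written against a hypothetical 32-bit step function f (the names and trial count are illustrative): flip each input bit over many random inputs and check that every output bit flips about half the time.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    typedef uint32_t (*step_fn)(uint32_t);

    /* For every input bit, measure how far each output bit's flip
       probability deviates from the ideal 0.5 */
    void avalanche(step_fn f, int trials)
    {
        for (int in = 0; in < 32; in++) {
            int flips[32] = {0};
            for (int t = 0; t < trials; t++) {
                uint32_t x = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
                uint32_t d = f(x) ^ f(x ^ (1u << in));  /* flip one input bit */
                for (int out = 0; out < 32; out++)
                    flips[out] += (d >> out) & 1;
            }
            double worst = 0;
            for (int out = 0; out < 32; out++) {
                double bias = fabs(flips[out] / (double)trials - 0.5);
                if (bias > worst) worst = bias;
            }
            printf("input bit %2d: worst output bias %.4f\n", in, worst);
        }
    }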
The problem with hashes is that everyone can write one, but not everyone can QA one. I was in this position with xxhash256: I knew how to write it but not how to test it. And when I learned how to test, I realized that my creation was just terrible.
The problem with SMHasher is that it is believed to be a test of hash quality, while really it's not. A good hash doesn't need to pass SMHasher; for example, distribution is unnecessary when collision resistance is your goal. OTOH, a bad hash can pass it. SMHasher is *a* test suite, not *the definitive* test suite.
HighwayHash is simple and pretty fast, at around 11 GB/s, and gets wonderful statistics on SMHasher, better than any other fast hash, including SipHash. I like the "ZipperMerge" mixing in it -- it reorders the bytes, placing the bits having the most entropy where they will be mixed most efficiently by the next multiply. Because of this permutation phase, HighwayHash gets more entropy mixing out of multiplying than other multiply-based hashes.
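To illustrate the mechanism only (the real zipper-merge uses a specific, carefully chosen permutation -- see the HighwayHash source; the index table below is hypothetical):

    #include <tmmintrin.h>   /* SSSE3 */

    /* A PSHUFB-style byte permutation: move the high (entropy-rich) bytes
       of the 64-bit products into the low 32-bit halves that feed the next
       PMULUDQ. Hypothetical index table, not HighwayHash's. */
    static inline __m128i zipper_like(__m128i v)
    {
        const __m128i idx = _mm_setr_epi8(7, 6, 5, 4, 3, 2, 1, 0,
                                          15, 14, 13, 12, 11, 10, 9, 8);
        return _mm_shuffle_epi8(v, idx);
    }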
Disclaimer: I co-authored HighwayHash.
Last edited by Jyrki Alakuijala; 30th June 2016 at 02:41.
I reread my previous post and I see that I actually said most of what I wanted to say. I'll just add that here you have another few comments on how pros do hash QA. The interesting section is "Rationale", especially "Choice of rotation counts".
Originally Posted by Jyrki Alakuijala
HighwayHash is simple and pretty fast, at around 11 GB/s, and gets wonderful statistics on SMHasher, better than any other fast hash, including SipHash. I like the "ZipperMerge" mixing in it -- it reorders the bytes, placing the bits having the most entropy where they will be mixed most efficiently by the next multiply. Because of this permutation phase, HighwayHash gets more entropy mixing out of multiplying than other multiply-based hashes.
Disclaimer: I co-authored HighwayHash.
That's interesting. I wish I had more time to study this code.
Let's give it some names! I will call my initial design xxFarsh v0, and now i'm going to describe xxFarsh v1:
Once i realised that PALIGNR requires SSSE3, i started to look into 16-bit and 32-bit rotates. Unfortunately, there is no PSHUFW, so fast 16-bit rotates also need SSSE3. Then i realized that we can combine the shuffle preparing data for PMUL and the shuffle exchanging data after PMUL into a single operation:
PADDD xmm0, data
PSHUFD xmm1,xmm0,$13
PMULUDQ xmm0,xmm1
i.e. only 3 operations per 128/256 input bits!!! Despite its simplicity, any input/hash128 bit may affect any output hash128 bit after just 3 hashing steps, and any 32-bit word may affect any output bit in just 2 steps. So bits mix up like crazy! I wonder whether this construction can pass SMHasher, maybe even with a 64/128-bit hash result.
Those 3 operations should be able to execute in just a single CPU cycle, providing record-breaking hash speeds. Of course, we need to process at least 7 streams simultaneously to cover the 7-cycle latency of PADDD+PSHUFD+PMULUDQ.
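The same step rendered as SSE2 intrinsics (a sketch; assuming the $13 immediate above is hex 0x13):

    #include <emmintrin.h>   /* SSE2 */

    /* One xxFarsh v1 step: PADDD, PSHUFD, PMULUDQ. PMULUDQ multiplies
       the even 32-bit lanes, producing two 64-bit products. */
    static inline __m128i xxfarsh_v1_step(__m128i state, __m128i data)
    {
        state = _mm_add_epi32(state, data);           /* PADDD xmm0, data       */
        __m128i t = _mm_shuffle_epi32(state, 0x13);   /* PSHUFD xmm1, xmm0, $13 */
        return _mm_mul_epu32(state, t);               /* PMULUDQ xmm0, xmm1     */
    }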
It should also be pretty fast on 32-bit scalar architectures with 16-32 registers. x86 will be slower due to the register shortage and the requirement to use only EDX:EAX for MUL32_64, so we get a lot of register moves waiting (on Pentium 1 and older architectures) for the MUL result.
I think this design is interesting by itself, at least for its unprecedented speed. But the next question is how to improve its weak bit mixing in order to provide better quality while keeping it fast on older hardware (i.e. without the HighwayHash techniques of byte shuffling and a 1024-bit state).
A side note: in some sense, xxFarsh v1 provides the minimum possible data shuffling required to mix up all 128 data bits. Its shuffling can be extended to shuffle data among two 128-bit states:
so instead of 2 independent 128-bit states, we will have a single 256-bit one.
This idea can be further extended to mix hashA with hashB and hashC with hashD on even steps, then hashA with hashC and hashB with hashD on odd steps - so we will have a single 1024-bit state with any input/state bit influencing any output bit after just 5 steps - and it will still be as fast as plain xxFarsh v1! The drawback, of course, is that scalar CPUs and even x86 pre-AVX CPUs don't have enough registers to hold such a large state.
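The exact cross-shuffle isn't spelled out above; one hypothetical way to couple two 128-bit states at the same cost per state (an illustration, not the scheme from the post) is to build each state's PSHUFD operand from the other state:

    #include <emmintrin.h>

    /* Hypothetical 256-bit step: same 3 ops per state as xxFarsh v1, but
       the shuffle crosses states, so lanes migrate between the halves. */
    static inline void step256(__m128i *a, __m128i *b, __m128i da, __m128i db)
    {
        *a = _mm_add_epi32(*a, da);
        *b = _mm_add_epi32(*b, db);
        __m128i ta = _mm_shuffle_epi32(*b, 0x13);   /* a multiplies b's lanes */
        __m128i tb = _mm_shuffle_epi32(*a, 0x13);   /* b multiplies a's lanes */
        *a = _mm_mul_epu32(*a, ta);
        *b = _mm_mul_epu32(*b, tb);
    }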
3) i like your architecture! The internal state is 1024 bits that are completely mixed up on every step, together with 256 input bits. And the entropy-based permutation is brilliant! All in all, this justifies your decision to return up to a 256-bit hashing result.
4) it doesn't look like a cryptographic hash (like SHA-2), but it may be a cryptographically strong keyed hash like SipHash, UMAC and VMAC.
5) while SipHash and HighwayHash are strongly optimized for small inputs of a few dozen bytes (i.e. hashing database keys), the other hashes i consider (UMAC/VMAC, xxHash, FARSH/xxFarsh) are optimized for hashing at least a few kilobytes (i.e. checksumming data streams). Also, xxHash/*farsh don't consider user attacks.
This leads to an architecture difference - your hashes do a lot of mixing on every step to guarantee fooling an attacker. Therefore they can be checked more formally, with the full range of cryptanalysis applicable to MACs. I know that Murmur3 has failed under such cryptanalysis, and i expect that xxHash/xxFarsh will fail too. AFAIR UMAC/VMAC were proven to be secure.
Another consequence of the DB-oriented HighwayHash design is that it targets only the actual Google hardware, with a significant slowdown on pre-SSSE3 CPUs (and maybe ARM/MIPS too). It isn't a problem for a DB, since you can run another hash on another box, but it is more important for stream checksums. This involves two design decisions - first, a byte permutation that's a bit hard to emulate with bit shifts; second, an internal state of 1024 bits, i.e. 8 SSE registers (i suppose you will see a slowdown even on 32-bit SSSE3 hardware), or 32 32-bit registers. These two ideas are the soul of your algo, so i bet we have a choice - either cryptographic security with HighwayHash, or fast execution on a wide hardware range with xxHash and friends.
The third consequence is the lack of multithreading support (useful for checksumming multi-megabyte inputs), but that's easy to fix by splitting data into fixed-size blocks (e.g. 4 KB) and repeating your algo on 2-3 hash levels, as sketched below.
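A minimal sketch of that two-level scheme (leaf_hash() is a placeholder standing in for the real hash call, and the sizes are illustrative):

    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK 4096u

    /* Placeholder leaf hash (FNV-1a) standing in for FARSH/HighwayHash */
    static uint32_t leaf_hash(const void *p, size_t n)
    {
        const uint8_t *b = (const uint8_t *)p;
        uint32_t h = 2166136261u;
        while (n--) h = (h ^ *b++) * 16777619u;
        return h;
    }

    /* Level 1: hash each 4 KB block independently (parallelizable).
       Level 2: hash the concatenated block digests. */
    uint32_t two_level_hash(const uint8_t *p, size_t n)
    {
        uint32_t leaves[1024];               /* enough for 4 MB in this sketch */
        size_t m = 0;
        for (size_t off = 0; off < n; off += BLOCK)
            leaves[m++] = leaf_hash(p + off, n - off < BLOCK ? n - off : BLOCK);
        return leaf_hash(leaves, m * sizeof leaves[0]);
    }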
Let's compare the 3 algos:
- xxHash: permutation based on MUL32+ROT32. 4 independent streams, each with a 32-bit internal state updated on every step with 32 input bits using 4 scalar operations. SMHasher-level quality.
- xxFarsh: permutation based on MUL32_64 and PSHUFD. 4*128-bit streams with 128-bit updates, 3 SSE2/AVX2 operations per 128/256 input bits. Quality as yet unknown.
- HighwayHash: permutation based on MUL32_64 and entropy-directed PSHUFB. 1024-bit internal state, 256-bit updates using 13 AVX2 operations. Cryptographic quality.
Last edited by Bulat Ziganshin; 4th July 2016 at 01:28.
Going toward the release of FARSH 0.2, i built a toolbox to check its performance on YOUR hardware. Just unpack the attached archive, run runme.cmd and publish the results here. I'm especially interested in results on old/rare CPUs. Please describe your CPU if the provided program hasn't found its name. On my box, runme.cmd prints the following (described here):
An interesting regression there in gcc 5. I clearly picked the wrong one for my other tests! Also a HUGE difference between gcc and icc. Trying gcc 6.1 on my i5 system gave:
aligned-farsh-x64-avx2 | 53.715 GB/s = 50.026 GiB/s | 59.133 GB/s = 55.072 GiB/s
So not much difference in that case.
Last edited by JamesB; 13th July 2016 at 17:32.
Reason: Fixed clang test
I've been building the binaries on one system (the Xeon) and then running them on other systems. This is because the Xeon is one of our servers and has a lot of software installed, including all the various compilers/versions. So there shouldn't be build differences between platforms.
The Pentium(R) M result though wasn't mine, so perhaps the difference is in the compiler and not the system. It appears this can make a radical difference, although, as noted, the non-SSE variant is clearly lying for me.
2. x86 results are 2x faster than my gcc 4.9.2 results, but x64 is the same
Either your compiler enables SSE2 by default (try to compile aligned-farsh-x86 with -msse) or it has much better optimization.
That is quite possible, given that pretty much every CPU for years has had SSE2 capability. It looks to be gcc version dependent: older gccs run aligned-farsh-x86 much slower than newer ones, changing around 5.x. I'm seeing SSE2 (movdqa) instructions in the gcc-5.3.0 output, even with an explicit -msse. I had to add -march=pentium3 to explicitly deny SSE2 onwards. I guess this is simply something modern gcc does, on the assumption that if you're running a modern compiler then the CPU almost certainly isn't THAT ancient. With -march=pentium3 I see:
It wasn't icl but icc - perhaps this is Windows vs Linux naming?
Anyway, yes, I used identical options. In fact, I just edited compile.cmd to use $CC instead of gcc and then rebuilt after changing environment variables. I think it's the same issue we saw before. The aligned-farsh-x64 binary is built using "$CC -O3 -funroll-loops -s -static -m64 -msse2 -DFARSH_SSE2 main.cpp -oaligned-farsh-x64 -DFARSH_ALIGNED_INPUT"; however, there are no 64-bit systems that don't also support sse3 instructions, so specifying -msse2 doesn't mean ONLY sse2 in this situation. Maybe this means it's gone off and decided to do something totally different.
However, even adding -march=core-avx-i to icc gives me only 11/13 GiB/s; more or less half of what gcc is doing (even with gcc -march=core2 -mtune=core2). I can send you the assembler output if you wish.
It wasn't icl but icc - perhaps this is Windows vs Linux naming?
sure
The Pentium(R) M result though wasn't mine, so perhaps the difference is in the compiler and not the system.
Just above your answer, FatBit posted Core2 results with my executable that show the same - compared to the Pentium M, the SSE2 version is almost 3x faster, while the non-SSE version is about 1.5x faster. Also, the Core2 SSE2 version gets a 2x-3x performance hit from unaligned loads, which isn't the case for earlier CPUs (probably because the Pentium M had only 64-bit SSE engines), nor for modern ones (probably since Nehalem, which got fast shifters).
I had to add -march=pentium3
good idea! i have added it to compile.cmd. Can you please recompile with it and provide updated runme.cmd results on your Pentium?
However, even adding -march=core-avx-i to icc gives me only 11/13 GiB/s
the key option here is -DFARSH_SSE2, which enables the hand-vectorized SSE2 code. The only difference between those 2 compilation lines is -DFARSH_SSE2:
Fresh git pull and rebuild. Compiled on the Xeon system, but using the new opts (i.e. -march=pentium3 for some cases). As expected, the non-SSE2 x86 ones are now slower.
I'm not sure what the last column is trying to measure, but it's pretty consistent on those numbers.
Regarding the Intel compilers, they do have some issues with building the x86 code (it complains massively about -m32 and missing libraries) - I expect we simply didn't install the necessary bits. It also complains about -mavx2 not being supported and suggests -march=core-avx2 instead. I tried adding an auto target too, with -march=core-avx2 and no explicit SIMD define (SSE2 or AVX), to let the compiler figure it out itself and see how its own code generation compared against the hand-crafted form. It did OK (better than not specifying any -msse2 or -mavx), but nowhere near as good as the hand-tuned SSE2, let alone AVX2.
Put simply, gcc is just thrashing icc in every test here. I don't know why. Do you have any validation that the answer is correct? Looking at some of these numbers, I suspect the results actually differ. Gcc6 at times is reporting 196 GB/s now. No way... It could also be that the compiler is working out some of the answers and optimising the results away. Perhaps one solution is to gcc -c farsh.c, g++ -c main.cpp, and then link main.o farsh.o. This means it has to build farsh.c with no knowledge of the input data - such as how large it is. I tried this, and I do indeed get VERY different answers, so I think there is optimisation going on that you haven't managed to defeat. (I see some attempts, but I suspect the compilers have got smarter.)
Examples with Intel C++ 15 and GNU gcc 6.1 on one file (with #include ../farsh.c), two files in one compile line (#include ../farsh.h and copies of the KEYS copied over), and two files built separately and linked.
These are equivalent to the "auto" builds I mentioned before, where it just uses the non-SIMD code but with the full SIMD instruction set available for the compiler to work with. It worked a bit TOO well in some cases, clearly.
It's a bit manual to test, but I wonder if similar things are happening with some of the other variants. There is less scope for this in the hand-tuned SIMD code, as it pretty much has to do what you told it.
External loop time is measured as a simple difference between the first 2 times, so it seems that in your first report something went wrong - the overall time was less than the internal loop time. If you update the sources again, you will see that mode "1" now displays the same table as the original, and the new mode "2" also prints external loop speeds. I think that last measurement is just not interesting for most users.
Put simply, gcc is just thrashing icc in every test here. I don't know why. Do you have any validation that the answer is correct?
about ICC - every compiler has its own set of optimizations, so it's not impossible to see a multi-fold difference when measuring really small loops. In this case, ICC should just generate code from the intrinsics, so it looks a bit strange, but i don't want to investigate compiler quirks now.
The first loop (the one measuring overall speed) checks result correctness: if (h != 0xd300ddd8) {printf("\nWrong hash value at iteration %d: %08x !!!\n", i, h);}
Gcc6 at times is reporting 196 GB/s now. No way...
one possibility to get such speeds is to compute hashes in multiple threads. GCC has auto-parallelization support, so in theory it can pull that trick. I have added one more anti-optimization trick that may prevent it.
ICC also complains about -mavx2 not being supported and suggests -march=core-avx2 instead.
well, ICC just supports all the options of the latest base compiler (MSVC on Windows, GCC on Linux) at the time of its release. So ICC 15 just caught up with the options of some older GCC version that had no -mavx2 support. Then ICC adds a bunch of its own options, so -march=core-avx2 may be one of them.
It's not multi-threading, as the real time is the same as the user+system time. It's just somehow figuring out a way to precompute part of the answer, I think. I can't say what without studying the assembly, though.
Anyway, that's only noticeable when I don't use -DFARSH_AVX2 but do specify -march=core-avx2. Otherwise it's running fine and as expected.