# Thread: FARSH: hashing 30 GB/s!

1. I would like to see ideas for checking hash quality once a hash reaches the 10/10 SMHasher score. The only idea I have so far was mentioned above: try to fool SMHasher with CRC+postprocessing. If we can fool SMHasher this way, it will mean that we can't seriously trust its results.

For xxHash vs FARSH: in FARSH every word pair has the same influence on the intermediate 64-bit sum, so the pairs are interchangeable in some sense (if key[2*i]+data[2*i] == key[2*j]+data[2*j], then you can easily exchange bits between data[2*i+1] and data[2*j+1]).

In xxHash the 128-bit sum continues to be mixed on every step, so you don't have such obvious patterns for fooling the hash. Of course, the probability of the abovementioned equality is only 2^-32, so it may not be a real problem. I'm looking for others' opinions, especially if you have a way to prove your reasoning.

2. Originally Posted by Bulat Ziganshin
I would like to see ideas for checking hash quality once a hash reaches the 10/10 SMHasher score. The only idea I have so far was mentioned above: try to fool SMHasher with CRC+postprocessing. If we can fool SMHasher this way, it will mean that we can't seriously trust its results.
SMHasher just checks for a number of common issues. IMHO passing it is not an argument that a hash is a quality one.
Ideas?
Split it into elements and evaluate statistical properties. For a non-crypto hash, strong compression function is *the* part that matters the most.

Originally Posted by Bulat Ziganshin
For xxHash vs FARSH: in FARSH every word pair has the same influence on the intermediate 64-bit sum, so the pairs are interchangeable in some sense (if key[2*i]+data[2*i] == key[2*j]+data[2*j], then you can easily exchange bits between data[2*i+1] and data[2*j+1]).
Indeed, not good.

Originally Posted by Bulat Ziganshin
In xxHash the 128-bit sum continues to be mixed on every step, so you don't have such obvious patterns for fooling the hash. Of course, the probability of the abovementioned equality is only 2^-32, so it may not be a real problem. I'm looking for others' opinions, especially if you have a way to prove your reasoning.
The way I see it:
https://github.com/Cyan4973/lz4/blob...r/lib/xxhash.c
xxHash splits the input into 4 independent streams with no intermixing. This is not bad by itself, but I think it is only slightly stronger than a single stream.

3. Originally Posted by m^2
Bulat, I have found minor issues in the docs:
1. The link to the source code is outdated.
2. This is not just old, but outdated.
3. Is the bitstream stable already?
4. Also, it would be good to compare the algorithm with its competitors.
I fixed the first two items. I will look into making the high-level algo faster, so the bitstream will change in 0.3+. Recorded item 4 as an issue. I will massage the readme a bit more and release 0.2 soon.

4. Originally Posted by m^2
SMHasher just checks for a number of common issues. IMHO passing it is not an argument that a hash is a quality one.
Ideas?
Split it into elements and evaluate statistical properties. For a non-crypto hash, strong compression function is *the* part that matters the most.
Well, can you give a more detailed idea? Do statistical properties guarantee a good hash (or at least provide better guarantees than SMHasher)?

Originally Posted by m^2
https://github.com/Cyan4973/lz4/blob...r/lib/xxhash.c
xxHash splits the input into 4 independent streams with no intermixing. This is not bad by itself, but I think it is only slightly stronger than a single stream.
I had completely forgotten the xxHash internals. It manages 4 independent streams, each with 32-bit input elements and a 32-bit intermediate result. This looks inferior to the cycle I proposed above, both for speed and for quality.

5. I'll give you only a partial answer now, as it's late and I'd rather write too little than write nonsense.
I suggest that you read http://burtleburtle.net/bob/hash/evahash.html
This gives some idea about properties of a good hash.
If my memory serves me well, Bob sacrificed some formal purity for efficiency in his SpookyHash. He believes the hash is no-funnel but is unable to prove it; when I was into it, my gut told me he was wrong, but I didn't prove it either.

And yes, statistical analysis is *the* way to do Hash QA.
1. SMHasher is a black box: it does the same kind of tests that you want to do, but over the entire function. Practically, you want your compression function alone to pass most of SMHasher. It doesn't need stuff like good distribution, but avalanching is key.
2. Some hashes were designed specifically to pass SMHasher and find ways to pass it that don't make them stronger.

The problem with hashes is that everyone can write one, but not everyone can QA one. I was in this position with xxhash256: I knew how to write one but not how to test one. And when I learned how to test, I knew that my creation was just terrible.
The problem with SMHasher is that it is believed to be a test of hash quality while really it's not. A good hash doesn't need to pass SMHasher; for example, distribution is unnecessary when collision resistance is your goal. OTOH a bad hash can pass it. SMHasher is *a* test suite, not *the definitive* test suite.

6. Originally Posted by m^2
The problem with SMHasher is that it is believed to be a test of hash quality while really it's not. A good hash doesn't need to pass SMHasher; for example, distribution is unnecessary when collision resistance is your goal. OTOH a bad hash can pass it. SMHasher is *a* test suite, not *the definitive* test suite.
HighwayHash is simple and pretty fast at around 11 GB/s, and it gets wonderful statistics on SMHasher, better than any other fast hash including SipHash. I like the "ZipperMerge" mixing in it: it reorders the bytes, moving the bits with the most entropy to the places where their entropy is mixed most efficiently by the next multiply. Because of this permutation phase, HighwayHash gets more entropy mixing out of multiplication than other multiply-based hashes.

Disclaimer: I co-authored HighwayHash.

7. I reread my previous post and I see that I actually told most of what I wanted to tell. I'll just add that here you have another few comments on how pros do hash QA. The interesting section is "Rationale", especially "Choice of rotation counts".
Originally Posted by Jyrki Alakuijala
HighwayHash is simple and pretty fast at around 11 GB/s, and it gets wonderful statistics on SMHasher, better than any other fast hash including SipHash. I like the "ZipperMerge" mixing in it: it reorders the bytes, moving the bits with the most entropy to the places where their entropy is mixed most efficiently by the next multiply. Because of this permutation phase, HighwayHash gets more entropy mixing out of multiplication than other multiply-based hashes.

Disclaimer: I co-authored HighwayHash.
That's interesting. I wish I had more time to study this code.

8. Let's give these designs some names! I will call my initial design xxFarsh v0, and now I'm going to describe xxFarsh v1:

Once I realized that PALIGNR requires SSSE3, I started looking into 16-bit and 32-bit rotates. Unfortunately, there is no PSHUFW for XMM registers, so fast 16-bit rotates also need SSSE3. Then I realized that we can combine the shuffle preparing data for PMUL and the shuffle exchanging data after PMUL into a single operation:

Code:
```uint32 data[4]
uint32 hash[4]   // aliased as uint64 hash64[2], i.e. as a single uint128 hash128

hash64[0] = (data[0]+hash[0]) * (data[3]+hash[3])
hash64[1] = (data[2]+hash[2]) * (data[1]+hash[1])```
or

PSHUFD xmm1,xmm0,$13
PMULUDQ xmm0,xmm1

i.e. only 3 operations per 128/256 input bits! Despite its simplicity, any input/hash128 bit may affect any output hash128 bit after just 3 hashing steps, and any 32-bit word may affect any output bit in just 2 steps. So the bits mix up like crazy! I wonder whether this construction can pass SMHasher, maybe even with a 64/128-bit hash result.

Those 3 operations should be able to execute in just a single CPU cycle, providing record-breaking hash speeds. Of course, we need to process at least 7 streams simultaneously to cover the 7-cycle latency of PADDD+PSHUFD+PMULUDQ.

It should also be pretty fast on 32-bit scalar architectures with 16-32 registers. x86 will be slower due to the register shortage and the requirement to use only EDX:EAX for MUL32_64, so we get a lot of register moves waiting (on Pentium 1 and older architectures) for the MUL result.

I think that this design is interesting by itself, at least for its unprecedented speed. But the next question is how to improve its weak bit mixing in order to provide better quality while keeping it fast on older hardware (i.e. without the HighwayHash techniques of byte shuffling and a 1024-bit state).

A side note: in some sense, xxFarsh v1 provides the minimum possible data shuffling required to mix up all 128 data bits. Its shuffling can be extended to shuffle data between two 128-bit states:

PSHUFD xmm1,xmm0,$13
PSHUFD xmm5,xmm4,$13

PMULUDQ xmm0,xmm5
PMULUDQ xmm4,xmm1

i.e.

hash64A[0] = (data[0]+hashA[0]) * (data[7]+hashB[3])
hash64A[1] = (data[2]+hashA[2]) * (data[5]+hashB[1])

hash64B[0] = (data[4]+hashB[0]) * (data[3]+hashA[3])
hash64B[1] = (data[6]+hashB[2]) * (data[1]+hashA[1])

so instead of 2 independent 128-bit states we will have a single 256-bit one.

This idea can be further extended to mix hashA with hashB and hashC with hashD on even steps, then hashA with hashC and hashB with hashD on odd steps - so we will have a single 1024-bit state, with any input/state bit influencing any output bit after just 5 steps - and it will still be as fast as plain xxFarsh v1! The drawback, of course, is that scalar CPUs, and even pre-AVX x86 CPUs, don't have enough registers to hold such a large state.

9. Jyrki Alakuijala, thanks for posting it. So:

1) thanks for the vector classes! BTW, vec.h is almost SSE2-compatible, except for the single _mm_cmpeq_epi64 operation

2) I think that sse41_highway_tree_hash.h in fact needs only SSSE3, since the only non-SSE2 operation used there is _mm_shuffle_epi8. BTW, look at http://www-db.in.tum.de/~finis/x86-i...sheet-v2.1.pdf

3) I like your architecture! The internal state is 1024 bits that are completely mixed on every step, together with 256 input bits. And the entropy-based permutation is brilliant! All in all, this justifies your decision to return up to a 256-bit hash result.

4) it doesn't look like a cryptographic hash (like SHA-2), but it may be a cryptographically strong keyed hash like SipHash, UMAC and VMAC

5) while SipHash and HighwayHash are strongly optimized for small inputs of a few dozen bytes (i.e. hashing database keys), the other hashes I consider (UMAC/VMAC, xxHash, FARSH/xxFarsh) are optimized for hashing at least a few kilobytes (i.e. checksumming data streams). Also, xxHash/*farsh don't consider adversarial attacks.

This leads to an architectural difference: your hash does a lot of mixing on every step to guarantee foiling an attacker. Therefore it can be checked more formally, with the full range of cryptanalysis applicable to MACs. I know that murmur3 has failed under such cryptanalysis, and I expect that xxHash/xxFarsh will fail too. AFAIR, UMAC/VMAC were proven to be secure.

Another consequence of the DB-oriented HighwayHash design is that it targets only actual Google hardware, with a significant slowdown on pre-SSSE3 CPUs (and maybe ARM/MIPS too). That isn't a problem for a DB, since you can run another hash on another box, but it matters more for stream checksums. Two design decisions are involved: first, a byte permutation that's a bit hard to emulate with bit shifts; second, an internal state of 1024 bits, i.e. 8 SSE registers (I suppose you will see a slowdown even on 32-bit SSSE3 hardware) or 32 32-bit registers. These two ideas are the soul of your algo, so I bet we have a choice: either cryptographic security with HighwayHash, or fast execution on a wide range of hardware with xxHash and friends.

The third consequence is the lack of multithreading support (useful for checksumming multi-megabyte inputs), but that's easy to fix by splitting the data into fixed-size blocks (e.g. 4 KB) and applying your algo over 2-3 hash levels.

Let's compare the 3 algos:

- xxHash: permutation based on MUL32+ROT32. 4 independent streams with 32-bit internal state, updated on each step with 32 input bits using 4 scalar operations. SMHasher quality

- xxFarsh: permutation based on MUL32_64 and PSHUFD. 4*128-bit streams with 128-bit updates, 3 SSE2/AVX2 operations per 128/256 input bits. Yet unknown quality

- HighwayHash: permutation based on MUL32_64 and entropy-directed PSHUFB. 1024-bit internal state, 256-bit updates using 13 AVX2 operations. Cryptographic quality

10. Going toward the release of FARSH 0.2, I built a toolbox to check its performance on YOUR hardware. Just unpack the attached archive, run runme.cmd, and publish the results here. I'm especially interested in results on old/rare CPUs. Please describe your CPU if the provided program hasn't found its name. On my box runme.cmd prints the following (described here):
Code:
```C:\FARSH\benchmark>for %e in (*.exe) do @start /b /wait  /realtime %e 1
aligned-farsh-x64-avx2    |  54.595 GB/s = 50.846 GiB/s  |  65.582 GB/s = 61.078 GiB/s
aligned-farsh-x64-nosimd  |  14.425 GB/s = 13.434 GiB/s  |  14.497 GB/s = 13.501 GiB/s
aligned-farsh-x64         |  31.339 GB/s = 29.187 GiB/s  |  36.503 GB/s = 33.996 GiB/s
aligned-farsh-x86-avx2    |  40.679 GB/s = 37.885 GiB/s  |  62.859 GB/s = 58.542 GiB/s
aligned-farsh-x86-sse2    |  24.920 GB/s = 23.209 GiB/s  |  34.138 GB/s = 31.793 GiB/s
aligned-farsh-x86         |   6.321 GB/s =  5.887 GiB/s  |   6.410 GB/s =  5.970 GiB/s
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz: AVX2
farsh-x64-avx2            |  46.024 GB/s = 42.863 GiB/s  |  64.967 GB/s = 60.505 GiB/s
farsh-x64-nosimd          |  14.112 GB/s = 13.143 GiB/s  |  14.375 GB/s = 13.387 GiB/s
farsh-x64                 |  30.335 GB/s = 28.252 GiB/s  |  34.891 GB/s = 32.495 GiB/s
farsh-x86-avx2            |  35.273 GB/s = 32.851 GiB/s  |  57.252 GB/s = 53.320 GiB/s
farsh-x86-sse2            |  24.502 GB/s = 22.820 GiB/s  |  33.325 GB/s = 31.037 GiB/s
farsh-x86                 |   6.283 GB/s =  5.852 GiB/s  |   6.763 GB/s =  6.299 GiB/s```
If you want to compile the code yourself, it's available here

11. Code:
```aligned-farsh-x64-nosimd  |   7.995 GB/s =  7.446 GiB/s  |   8.472 GB/s =  7.891 GiB/s
aligned-farsh-x64         |  11.300 GB/s = 10.524 GiB/s  |  14.446 GB/s = 13.454 GiB/s
aligned-farsh-x86-sse2    |  10.899 GB/s = 10.151 GiB/s  |  13.280 GB/s = 12.368 GiB/s
aligned-farsh-x86         |   3.805 GB/s =  3.544 GiB/s  |   5.089 GB/s =  4.740 GiB/s
AMD Athlon(tm) II X2 220 Processor: SSE3
farsh-x64-nosimd          |   7.471 GB/s =  6.958 GiB/s  |   8.351 GB/s =  7.777 GiB/s
farsh-x64                 |  12.823 GB/s = 11.943 GiB/s  |  14.187 GB/s = 13.212 GiB/s
farsh-x86-sse2            |  10.933 GB/s = 10.182 GiB/s  |  12.389 GB/s = 11.538 GiB/s
farsh-x86                 |   3.786 GB/s =  3.526 GiB/s  |   5.825 GB/s =  5.425 GiB/s```

12. 1. AMD Turion II P540 2400 MHz
2. i5-6500 (x86 better than i7-4770)

Code:
```C:\teszt\farsh02-benchmark>for %e in (*.exe) do @start /b /wait  /realtime %e 1
aligned-farsh-x64-nosimd  |   6.687 GB/s =  6.228 GiB/s  |   7.112 GB/s =  6.623 GiB/s
aligned-farsh-x64         |   9.473 GB/s =  8.823 GiB/s  |  12.069 GB/s = 11.240 GiB/s
aligned-farsh-x86-sse2    |   9.131 GB/s =  8.504 GiB/s  |  11.212 GB/s = 10.442 GiB/s
aligned-farsh-x86         |   3.192 GB/s =  2.973 GiB/s  |   4.262 GB/s =  3.969 GiB/s
AMD Turion(tm) II P540 Dual-Core Processor: SSE3
farsh-x64-nosimd          |   6.281 GB/s =  5.850 GiB/s  |   6.990 GB/s =  6.510 GiB/s
farsh-x64                 |  10.906 GB/s = 10.157 GiB/s  |  11.958 GB/s = 11.136 GiB/s
farsh-x86-sse2            |   9.253 GB/s =  8.617 GiB/s  |  10.455 GB/s =  9.737 GiB/s
farsh-x86                 |   3.183 GB/s =  2.964 GiB/s  |   4.887 GB/s =  4.552 GiB/s

C:\teszt\farsh02-benchmark>for %e in (*.exe) do @start /b /wait  /realtime %e 1
aligned-farsh-x64-avx2    |  49.237 GB/s = 45.856 GiB/s  |  58.531 GB/s = 54.511 GiB/s
aligned-farsh-x64-nosimd  |  13.532 GB/s = 12.602 GiB/s  |  13.544 GB/s = 12.614 GiB/s
aligned-farsh-x64         |  29.642 GB/s = 27.606 GiB/s  |  33.941 GB/s = 31.610 GiB/s
aligned-farsh-x86-avx2    |  39.200 GB/s = 36.508 GiB/s  |  55.925 GB/s = 52.085 GiB/s
aligned-farsh-x86-sse2    |  23.380 GB/s = 21.774 GiB/s  |  32.344 GB/s = 30.122 GiB/s
aligned-farsh-x86         |   7.680 GB/s =  7.153 GiB/s  |   9.465 GB/s =  8.815 GiB/s
Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz: AVX2
farsh-x64-avx2            |  46.947 GB/s = 43.722 GiB/s  |  56.768 GB/s = 52.869 GiB/s
farsh-x64-nosimd          |  13.522 GB/s = 12.593 GiB/s  |  13.564 GB/s = 12.632 GiB/s
farsh-x64                 |  29.579 GB/s = 27.547 GiB/s  |  32.992 GB/s = 30.726 GiB/s
farsh-x86-avx2            |  37.770 GB/s = 35.176 GiB/s  |  54.926 GB/s = 51.154 GiB/s
farsh-x86-sse2            |  23.812 GB/s = 22.176 GiB/s  |  32.173 GB/s = 29.964 GiB/s
farsh-x86                 |   7.652 GB/s =  7.127 GiB/s  |   8.693 GB/s =  8.096 GiB/s```

13. I did benchmarks on my laptop...

here you go:
Code:
```C:\Downloads\farsh02-benchmark>runme

aligned-farsh-x64-nosimd  |  14.972 GB/s = 13.944 GiB/s  |  15.985 GB/s = 14.887 GiB/s
aligned-farsh-x64         |  23.622 GB/s = 22.000 GiB/s  |  28.797 GB/s = 26.820 GiB/s
aligned-farsh-x86-sse2    |  22.504 GB/s = 20.958 GiB/s  |  28.551 GB/s = 26.590 GiB/s
aligned-farsh-x86         |   5.245 GB/s =  4.885 GiB/s  |   5.927 GB/s =  5.520 GiB/s
Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz: AVX
farsh-x64-nosimd          |  14.655 GB/s = 13.648 GiB/s  |  16.029 GB/s = 14.928 GiB/s
farsh-x64                 |  23.957 GB/s = 22.312 GiB/s  |  27.862 GB/s = 25.949 GiB/s
farsh-x86-sse2            |  21.693 GB/s = 20.203 GiB/s  |  27.913 GB/s = 25.996 GiB/s
farsh-x86                 |   5.182 GB/s =  4.826 GiB/s  |   6.209 GB/s =  5.783 GiB/s```

14. Notebook:
Code:
```aligned-farsh-x64-nosimd  |  11.580 GB/s = 10.785 GiB/s  |  12.020 GB/s = 11.194 GiB/s
aligned-farsh-x64         |  18.941 GB/s = 17.641 GiB/s  |  21.905 GB/s = 20.401 GiB/s
aligned-farsh-x86-sse2    |  17.844 GB/s = 16.619 GiB/s  |  21.395 GB/s = 19.926 GiB/s
aligned-farsh-x86         |   4.144 GB/s =  3.860 GiB/s  |   4.531 GB/s =  4.220 GiB/s
Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz: AVX
farsh-x64-nosimd          |  11.486 GB/s = 10.697 GiB/s  |  12.345 GB/s = 11.497 GiB/s
farsh-x64                 |  18.221 GB/s = 16.970 GiB/s  |  21.383 GB/s = 19.914 GiB/s
farsh-x86-sse2            |  17.149 GB/s = 15.971 GiB/s  |  21.561 GB/s = 20.080 GiB/s
farsh-x86                 |   4.019 GB/s =  3.743 GiB/s  |   4.575 GB/s =  4.260 GiB/s```
Desktop (i7-3770K CPU @ 3.50GHz, OC @ 4.40GHz):
Code:
```aligned-farsh-x64-nosimd  |  20.272 GB/s = 18.880 GiB/s  |  21.542 GB/s = 20.062 GiB/s
aligned-farsh-x64         |  31.860 GB/s = 29.672 GiB/s  |  38.475 GB/s = 35.833 GiB/s
aligned-farsh-x86-sse2    |  30.327 GB/s = 28.244 GiB/s  |  38.157 GB/s = 35.537 GiB/s
aligned-farsh-x86         |   7.052 GB/s =  6.567 GiB/s  |   7.959 GB/s =  7.412 GiB/s
Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz: AVX
farsh-x64-nosimd          |  19.752 GB/s = 18.396 GiB/s  |  21.528 GB/s = 20.050 GiB/s
farsh-x64                 |  32.272 GB/s = 30.056 GiB/s  |  37.499 GB/s = 34.923 GiB/s
farsh-x86-sse2            |  29.234 GB/s = 27.226 GiB/s  |  37.739 GB/s = 35.148 GiB/s
farsh-x86                 |   6.961 GB/s =  6.483 GiB/s  |   8.335 GB/s =  7.763 GiB/s```

Unfortunately nothing rare; I might test some crappy AMD A8 later, and the old Athlon system is out of order right now.

15. Intel(R) Pentium(R) M processor 1500MHz: SSE2

Code:
```aligned-farsh-x86-sse2    |   2.625 GB/s =  2.444 GiB/s  |   2.791 GB/s =  2.5 GiB/s
aligned-farsh-x86         |   1.664 GB/s =  1.550 GiB/s  |   1.946 GB/s =  1.8 GiB/s
farsh-x86-sse2            |   2.025 GB/s =  1.886 GiB/s  |   2.302 GB/s =  2.1 GiB/s
farsh-x86                 |   1.471 GB/s =  1.370 GiB/s  |   1.715 GB/s =  1.5 GiB/s```

16. Old computer:
```aligned-farsh-x64-nosimd  |  10.721 GB/s =  9.985 GiB/s  |  11.577 GB/s = 10.782 GiB/s
aligned-farsh-x64         |  17.130 GB/s = 15.953 GiB/s  |  21.394 GB/s = 19.924 GiB/s
aligned-farsh-x86-sse2    |  13.790 GB/s = 12.843 GiB/s  |  20.830 GB/s = 19.400 GiB/s
aligned-farsh-x86         |   3.872 GB/s =  3.606 GiB/s  |   5.457 GB/s =  5.082 GiB/s
AMD A8-5500 APU with Radeon(tm) HD Graphics    : AVX
farsh-x64-nosimd          |  10.104 GB/s =  9.410 GiB/s  |  10.845 GB/s = 10.100 GiB/s
farsh-x64                 |  15.313 GB/s = 14.262 GiB/s  |  19.659 GB/s = 18.309 GiB/s
farsh-x86-sse2            |  13.812 GB/s = 12.863 GiB/s  |  18.977 GB/s = 17.674 GiB/s
farsh-x86                 |   3.959 GB/s =  3.687 GiB/s  |   5.056 GB/s =  4.709 GiB/s```

```aligned-farsh-x64-avx2    |  66.601 GB/s = 62.027 GiB/s  |  80.120 GB/s = 74.618 GiB/s
aligned-farsh-x64-nosimd  |  17.426 GB/s = 16.229 GiB/s  |  17.482 GB/s = 16.281 GiB/s
aligned-farsh-x64         |  37.954 GB/s = 35.348 GiB/s  |  43.990 GB/s = 40.969 GiB/s
aligned-farsh-x86-avx2    |  49.332 GB/s = 45.944 GiB/s  |  75.817 GB/s = 70.611 GiB/s
aligned-farsh-x86-sse2    |  30.765 GB/s = 28.652 GiB/s  |  41.403 GB/s = 38.559 GiB/s
aligned-farsh-x86         |   7.648 GB/s =  7.122 GiB/s  |   7.747 GB/s =  7.215 GiB/s
Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz: AVX2
farsh-x64-avx2            |  54.909 GB/s = 51.138 GiB/s  |  78.279 GB/s = 72.903 GiB/s
farsh-x64-nosimd          |  17.170 GB/s = 15.990 GiB/s  |  17.480 GB/s = 16.280 GiB/s
farsh-x64                 |  37.114 GB/s = 34.565 GiB/s  |  42.617 GB/s = 39.690 GiB/s
farsh-x86-avx2            |  42.906 GB/s = 39.960 GiB/s  |  70.969 GB/s = 66.095 GiB/s
farsh-x86-sse2            |  30.112 GB/s = 28.044 GiB/s  |  40.689 GB/s = 37.894 GiB/s
farsh-x86                 |   7.552 GB/s =  7.034 GiB/s  |   8.123 GB/s =  7.565 GiB/s```

17. Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz: AVX

Code:
```aligned-farsh-x86-sse2    |  21.879 GB/s = 20.377 GiB/s  |  27.674 GB/s = 25.774 GiB/s
aligned-farsh-x86         |   5.090 GB/s =  4.740 GiB/s  |   5.779 GB/s =  5.382 GiB/s
farsh-x86-sse2            |  21.020 GB/s = 19.577 GiB/s  |  27.400 GB/s = 25.518 GiB/s
farsh-x86                 |   5.035 GB/s =  4.689 GiB/s  |   6.053 GB/s =  5.638 GiB/s```

18. Kennon Conrad, your 80 GB/s result looks too high for a 4790. Maybe it's overclocked?

So far, we have Skylake, Haswell, Ivy Bridge, Pentium M, AMD K10 and Piledriver results. What about other cores?

19. Linux, gcc 5.3.0

Code:
```Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
aligned-farsh-x64         |  32.077 GB/s = 29.874 GiB/s  |  34.669 GB/s = 32.288 GiB/s
aligned-farsh-x64-avx2    |  54.000 GB/s = 50.291 GiB/s  |  58.815 GB/s = 54.775 GiB/s
aligned-farsh-x64-nosimd  |  14.248 GB/s = 13.270 GiB/s  |  14.008 GB/s = 13.046 GiB/s
aligned-farsh-x86         |  12.848 GB/s = 11.965 GiB/s  |  14.220 GB/s = 13.244 GiB/s
aligned-farsh-x86-avx2    |  36.763 GB/s = 34.239 GiB/s  |  55.701 GB/s = 51.875 GiB/s
aligned-farsh-x86-sse2    |  24.429 GB/s = 22.751 GiB/s  |  33.133 GB/s = 30.857 GiB/s
farsh-x64                 |  29.453 GB/s = 27.430 GiB/s  |  30.308 GB/s = 28.227 GiB/s
farsh-x64-avx2            |  48.622 GB/s = 45.283 GiB/s  |  54.491 GB/s = 50.749 GiB/s
farsh-x64-nosimd          |  14.077 GB/s = 13.110 GiB/s  |  13.600 GB/s = 12.666 GiB/s
farsh-x86                 |  12.587 GB/s = 11.723 GiB/s  |  13.599 GB/s = 12.665 GiB/s
farsh-x86-avx2            |  34.730 GB/s = 32.345 GiB/s  |  52.544 GB/s = 48.935 GiB/s
farsh-x86-sse2            |  24.815 GB/s = 23.111 GiB/s  |  25.301 GB/s = 23.563 GiB/s

Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
aligned-farsh-x64         |  19.796 GB/s = 18.436 GiB/s  |  23.169 GB/s = 21.578 GiB/s
aligned-farsh-x64-nosimd  |  10.963 GB/s = 10.210 GiB/s  |  11.879 GB/s = 11.064 GiB/s
aligned-farsh-x86         |  10.034 GB/s =  9.345 GiB/s  |  11.791 GB/s = 10.981 GiB/s
aligned-farsh-x86-sse2    |  19.205 GB/s = 17.886 GiB/s  |  22.935 GB/s = 21.360 GiB/s
farsh-x64                 |  17.690 GB/s = 16.476 GiB/s  |  17.248 GB/s = 16.064 GiB/s
farsh-x64-nosimd          |  10.650 GB/s =  9.918 GiB/s  |  10.829 GB/s = 10.086 GiB/s
farsh-x86                 |   9.470 GB/s =  8.820 GiB/s  |  10.483 GB/s =  9.763 GiB/s
farsh-x86-sse2            |  18.155 GB/s = 16.909 GiB/s  |  17.577 GB/s = 16.370 GiB/s

Pentium(R) Dual-Core  CPU      E5300  @ 2.60GHz
aligned-farsh-x64         |  12.286 GB/s = 11.442 GiB/s  |  14.683 GB/s = 13.675 GiB/s
aligned-farsh-x64-nosimd  |   7.156 GB/s =  6.664 GiB/s  |   5.509 GB/s =  5.131 GiB/s
aligned-farsh-x86         |   6.695 GB/s =  6.235 GiB/s  |   5.578 GB/s =  5.195 GiB/s
aligned-farsh-x86-sse2    |  12.487 GB/s = 11.629 GiB/s  |  14.750 GB/s = 13.737 GiB/s
farsh-x64                 |   5.163 GB/s =  4.809 GiB/s  |   6.064 GB/s =  5.647 GiB/s
farsh-x64-nosimd          |   4.268 GB/s =  3.975 GiB/s  |   3.633 GB/s =  3.383 GiB/s
farsh-x86                 |   3.768 GB/s =  3.510 GiB/s  |   3.193 GB/s =  2.974 GiB/s
farsh-x86-sse2            |   5.000 GB/s =  4.656 GiB/s  |   4.260 GB/s =  3.967 GiB/s

AMD Opteron(TM) Processor 6272 (2.1GHz)
aligned-farsh-x64         |  14.869 GB/s = 13.848 GiB/s  |  19.903 GB/s = 18.537 GiB/s
aligned-farsh-x64-nosimd  |   9.296 GB/s =  8.658 GiB/s  |  10.490 GB/s =  9.770 GiB/s
aligned-farsh-x86         |   7.981 GB/s =  7.433 GiB/s  |  10.016 GB/s =  9.328 GiB/s
aligned-farsh-x86-sse2    |  11.760 GB/s = 10.953 GiB/s  |  17.773 GB/s = 16.553 GiB/s
farsh-x64                 |  14.251 GB/s = 13.272 GiB/s  |  16.918 GB/s = 15.756 GiB/s
farsh-x64-nosimd          |   9.272 GB/s =  8.635 GiB/s  |  10.289 GB/s =  9.582 GiB/s
farsh-x86                 |   6.944 GB/s =  6.467 GiB/s  |   9.781 GB/s =  9.109 GiB/s
farsh-x86-sse2            |  11.219 GB/s = 10.449 GiB/s  |  15.467 GB/s = 14.405 GiB/s```
Trying different compilers on one system (Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz, as it has the most choice installed) with just aligned-farsh-x64:

Code:
```icc 2013:        aligned-farsh-x64         |  10.594 GB/s =  9.866 GiB/s  |  12.335 GB/s = 11.488 GiB/s
icc 2015:        aligned-farsh-x64         |  10.522 GB/s =  9.799 GiB/s  |  12.232 GB/s = 11.392 GiB/s
gcc 4.6.3:       aligned-farsh-x64         |  22.289 GB/s = 20.758 GiB/s  |  26.129 GB/s = 24.334 GiB/s
gcc 5.3.0:       aligned-farsh-x64         |  19.797 GB/s = 18.438 GiB/s  |  23.125 GB/s = 21.537 GiB/s
gcc 6.1.0:       aligned-farsh-x64         |  25.283 GB/s = 23.546 GiB/s  |  26.965 GB/s = 25.114 GiB/s
clang 3.7.0:     aligned-farsh-x64         |  21.817 GB/s = 20.319 GiB/s  |  22.587 GB/s = 21.036 GiB/s```
An interesting regression there in gcc 5. I clearly picked the wrong one for my other tests! Also a HUGE difference between gcc and icc. Trying gcc 6.1 on my i5 system gave
aligned-farsh-x64-avx2 | 53.715 GB/s = 50.026 GiB/s | 59.133 GB/s = 55.072 GiB/s

So not much difference in that case.

20. Some Linux perf output, which uses the hardware counters to track various events. On an Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz:

Code:
```@ gen1c[FARSH/benchmark]; perf stat ./farsh-x64 1
farsh-x64                 |  18.494 GB/s = 17.224 GiB/s  |  18.639 GB/s = 17.359 GiB/s

Performance counter stats for './farsh-x64 1':

11596.273600 task-clock                #    0.998 CPUs utilized
995 context-switches          #    0.000 M/sec
6 CPU-migrations            #    0.000 M/sec
75 page-faults               #    0.000 M/sec
34435862730 cycles                    #    2.970 GHz
6363180382 stalled-cycles-frontend   #   18.48% frontend cycles idle
1188835421 stalled-cycles-backend    #    3.45% backend  cycles idle
91395146357 instructions              #    2.65  insns per cycle
#    0.07  stalled cycles per insn
7112207820 branches                  #  613.318 M/sec
122874085 branch-misses             #    1.73% of all branches

11.619984961 seconds time elapsed

@ gen1c[FARSH/benchmark]; perf stat ./aligned-farsh-x64 1
aligned-farsh-x64         |  24.774 GB/s = 23.073 GiB/s  |  26.983 GB/s = 25.130 GiB/s

Performance counter stats for './aligned-farsh-x64 1':

8308.627953 task-clock                #    0.998 CPUs utilized
704 context-switches          #    0.000 M/sec
5 CPU-migrations            #    0.000 M/sec
81 page-faults               #    0.000 M/sec
24559400575 cycles                    #    2.956 GHz
3586456951 stalled-cycles-frontend   #   14.60% frontend cycles idle
105345425 stalled-cycles-backend    #    0.43% backend  cycles idle
75523838148 instructions              #    3.08  insns per cycle
#    0.05  stalled cycles per insn
2022960769 branches                  #  243.477 M/sec
42350 branch-misses             #    0.00% of all branches

8.324662378 seconds time elapsed```
On my Core i5 system with AVX2:
Code:
```@ deskpro107386[tmp/farsh]; perf stat ./aligned-farsh-x64-avx2 1
aligned-farsh-x64-avx2    |  52.800 GB/s = 49.174 GiB/s  |  58.985 GB/s = 54.934 GiB/s

Performance counter stats for './aligned-farsh-x64-avx2 1':

3860.592586 task-clock (msec)         #    1.000 CPUs utilized
376 context-switches          #    0.097 K/sec
10 cpu-migrations            #    0.003 K/sec
78 page-faults               #    0.020 K/sec
13668406450 cycles                    #    3.540 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
47616993740 instructions              #    3.48  insns per cycle
2020012893 branches                  #  523.239 M/sec
24869 branch-misses             #    0.00% of all branches

3.860103317 seconds time elapsed```
Alas, it lacks the right version of perf or kernel to get the stalled-cycle information.

Anyway, it's closing in on 4 instructions per cycle, which I think is probably the port limit for this CPU.

21. Originally Posted by Bulat Ziganshin
Kennon Conrad, your 80 GB/s result looks too much for 4790. May be it's overclocked?
4.32 GHz.

22. 1. It's hard to believe in such a huge jump in just a single CPU generation:
Code:
```Intel(R) Pentium(R) M processor 1500MHz: SSE2
aligned-farsh-x86-sse2    |   2.625 GB/s =  2.444 GiB/s  |   2.791 GB/s =  2.5 GiB/s
aligned-farsh-x86         |   1.664 GB/s =  1.550 GiB/s  |   1.946 GB/s =  1.8 GiB/s

Pentium(R) Dual-Core  CPU      E5300  @ 2.60GHz
aligned-farsh-x86-sse2    |  12.487 GB/s = 11.629 GiB/s  |  14.750 GB/s = 13.737 GiB/s
aligned-farsh-x86         |   6.695 GB/s =  6.235 GiB/s  |   5.578 GB/s =  5.195 GiB/s```

2. x86 results are 2x faster than my gcc 4.9.2 results, but x64 is the same
Code:
```Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
aligned-farsh-x64-nosimd  |  14.248 GB/s = 13.270 GiB/s  |  14.008 GB/s = 13.046 GiB/s
aligned-farsh-x86         |  12.848 GB/s = 11.965 GiB/s  |  14.220 GB/s = 13.244 GiB/s```
Either your compiler enables SSE2 by default (try to compile aligned-farsh-x86 with -msse) or it has much better optimization

3. The most likely source of this difference is difference in compilation options:
Code:
```icc 2013:        aligned-farsh-x64         |  10.594 GB/s =  9.866 GiB/s  |  12.335 GB/s = 11.488 GiB/s
icc 2015:        aligned-farsh-x64         |  10.522 GB/s =  9.799 GiB/s  |  12.232 GB/s = 11.392 GiB/s
gcc 4.6.3:       aligned-farsh-x64         |  22.289 GB/s = 20.758 GiB/s  |  26.129 GB/s = 24.334 GiB/s```
Did you pass "-msse2 -DFARSH_SSE2" to ICL?

23. Intel(R) Core(TM)2 Duo CPU T7100 @ 1.80GHz: Supplemental SSE3

Code:
```aligned-farsh-x86-sse2    |   7.799 GB/s =  7.263 GiB/s  |   9.272 GB/s =  8.635 GiB/s
aligned-farsh-x86         |   2.562 GB/s =  2.386 GiB/s  |   3.005 GB/s =  2.799 GiB/s
farsh-x86-sse2            |   3.233 GB/s =  3.011 GiB/s  |   4.154 GB/s =  3.869 GiB/s
farsh-x86                 |   2.069 GB/s =  1.927 GiB/s  |   2.644 GB/s =  2.462 GiB/s```

24. Originally Posted by Bulat Ziganshin
1. It's hard to believe in such a huge jump in just a single CPU generation:
Code:
```Intel(R) Pentium(R) M processor 1500MHz: SSE2
aligned-farsh-x86-sse2    |   2.625 GB/s =  2.444 GiB/s  |   2.791 GB/s =  2.5 GiB/s
aligned-farsh-x86         |   1.664 GB/s =  1.550 GiB/s  |   1.946 GB/s =  1.8 GiB/s

Pentium(R) Dual-Core  CPU      E5300  @ 2.60GHz
aligned-farsh-x86-sse2    |  12.487 GB/s = 11.629 GiB/s  |  14.750 GB/s = 13.737 GiB/s
aligned-farsh-x86         |   6.695 GB/s =  6.235 GiB/s  |   5.578 GB/s =  5.195 GiB/s```
I've been building the binaries on one system (the Xeon) and then running the binaries on other systems. This is because the Xeon is one of our servers and has a lot of software installed, including all the various compilers/versions. So there shouldn't be build differences between platforms.

The Pentium(R) M result though wasn't mine, so perhaps the difference is in the compiler and not the system. It appears this can have a radical difference, although as noted the non-sse variant is clearly lying for me.

2. x86 results are 2x faster than my gcc 4.9.2 results, but x64 is the same
Code:
```Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
aligned-farsh-x64-nosimd  |  14.248 GB/s = 13.270 GiB/s  |  14.008 GB/s = 13.046 GiB/s
aligned-farsh-x86         |  12.848 GB/s = 11.965 GiB/s  |  14.220 GB/s = 13.244 GiB/s```
Either your compiler enables SSE2 by default (try to compile aligned-farsh-x86 with -msse) or it got much better optimization
That is quite possible given pretty much every CPU for years has had SSE2 capability. It looks to be gcc version dependent. Older gccs run aligned-farsh-x86 much slower than new ones, changing around 5.x. I'm seeing SSE2 (movdqa) instructions in the gcc-5.3.0 output, even with an explicit -msse. I had to add -march=pentium3 to explicitly deny SSE2 onwards. I guess this is simply something modern gcc does, on the assumption that anyone running a modern compiler almost certainly isn't targeting a CPU THAT ancient. With -march=pentium3 I see:

aligned-farsh-x86 | 6.377 GB/s = 5.939 GiB/s | 5.831 GB/s = 5.430 GiB/s

With -march=pentium4 (which has SSE2):

aligned-farsh-x86 | 13.184 GB/s = 12.279 GiB/s | 14.176 GB/s = 13.202 GiB/s

3. The most likely source of this difference is difference in compilation options:
Code:
```icc 2013:        aligned-farsh-x64         |  10.594 GB/s =  9.866 GiB/s  |  12.335 GB/s = 11.488 GiB/s
icc 2015:        aligned-farsh-x64         |  10.522 GB/s =  9.799 GiB/s  |  12.232 GB/s = 11.392 GiB/s
gcc 4.6.3:       aligned-farsh-x64         |  22.289 GB/s = 20.758 GiB/s  |  26.129 GB/s = 24.334 GiB/s```
did you pass "-msse2 -DFARSH_SSE2" to ICL?
It wasn't icl but icc - perhaps this is windows vs linux naming?

Anyway, yes, I used identical options. In fact I just edited the compile.cmd to use $CC instead of gcc and then rebuilt after changing environment variables. I think it's the same issue we saw before. The aligned-farsh-x64 binary is built using "$CC -O3 -funroll-loops -s -static -m64 -msse2 -DFARSH_SSE2 main.cpp -oaligned-farsh-x64 -DFARSH_ALIGNED_INPUT", however there are no 64-bit systems that don't also support sse3 instructions, so specifying -msse2 doesn't mean ONLY sse2 in this situation. Maybe this means it's gone off and decided to do something totally different.

However even adding -march=core-avx-i to icc gives me only 11/13GiB/s; more or less half what gcc is doing (even with gcc -march=core2 -mtune=core2). I can send you assembler output if you wish.

25. Originally Posted by m^2
Also, it would be good to compare the algorithm with its competitors.
while i haven't provided a comparison yet, i've collected a lot of links to fast hashes and their QA testing: https://github.com/Bulat-Ziganshin/FARSH#competition

It wasn't icl but icc - perhaps this is windows vs linux naming?
sure

The Pentium(R) M result though wasn't mine, so perhaps the difference is in the compiler and not the system.
Just above your answer FatBit posted Core2 results with my executable that show the same - compared to the Pentium M, the SSE2 version is almost 3x faster, while the non-SSE version is about 1.5x faster. Also, the Core2 SSE2 version takes a 2x-3x performance hit from unaligned loads, which isn't the case for earlier CPUs (probably because the Pentium had only 64-bit SSE engines), nor for modern ones (probably since Nehalem, which got fast shifters)
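For readers unfamiliar with the two load flavours being discussed: a sketch (not FARSH's actual code) of the aligned vs unaligned SSE2 loads. `_mm_load_si128` requires a 16-byte-aligned pointer, while `_mm_loadu_si128` accepts any address but, on Core2-class CPUs, was several times slower. The function returns the first lane of the sum just so the result is checkable; the scalar fallback covers non-SSE2 targets.

```c
#include <stdint.h>
#include <stdalign.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

uint32_t load_demo(void)
{
    alignas(16) uint32_t buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
#ifdef __SSE2__
    __m128i a = _mm_load_si128((const __m128i *)buf);        /* aligned load */
    __m128i u = _mm_loadu_si128((const __m128i *)(buf + 1)); /* 4-byte offset: unaligned, still OK */
    uint32_t out[4];
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(a, u));
    return out[0];                       /* lane 0: 1 + 2 = 3 */
#else
    return buf[0] + buf[1];              /* scalar fallback: also 3 */
#endif
}
```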

good idea! i have added it to compile.cmd. Can you please recompile with it and provide updated results of runme.cmd on your Pentium?

However even adding -march=core-avx-i to icc gives me only 11/13GiB/s
the key option here is -DFARSH_SSE2 that provides hand-vectorized SSE2 code. The only difference between those 2 compilation lines is -DFARSH_SSE2:
Code:
```gcc -O3 -funroll-loops -s -static -m64                     main.cpp -oaligned-farsh-x64-nosimd  -DFARSH_ALIGNED_INPUT
gcc -O3 -funroll-loops -s -static -m64 -msse2 -DFARSH_SSE2 main.cpp -oaligned-farsh-x64         -DFARSH_ALIGNED_INPUT```
And it results in the same 2x speed drop:
Code:
```Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
aligned-farsh-x64         |  19.796 GB/s = 18.436 GiB/s  |  23.169 GB/s = 21.578 GiB/s
aligned-farsh-x64-nosimd  |  10.963 GB/s = 10.210 GiB/s  |  11.879 GB/s = 11.064 GiB/s```
So my best guess is that you have compiled with ICC without -DFARSH_SSE2. This define enables my hand-optimized SSE2 code.
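For reference, the scalar path that -DFARSH_SSE2 replaces is essentially a UMAC-style multiply-accumulate over (data+key) word pairs, as described earlier in the thread. A simplified sketch (not the exact FARSH source, which also handles stripe boundaries and finalization):

```c
#include <stdint.h>

/* Simplified scalar FARSH-style compression loop: each pair of 32-bit
 * words is offset by the key and multiplied into a 64-bit sum.
 * n is the number of 32-bit words and must be even here. */
uint64_t compress_scalar(const uint32_t *data, const uint32_t *key, int n)
{
    uint64_t sum = 0;
    for (int i = 0; i < n; i += 2)
        sum += (uint64_t)(data[i] + key[i]) * (data[i + 1] + key[i + 1]);
    return sum;
}
```

This is also the structure behind the pair-interchangeability observation quoted at the top of the thread: if `data[i] + key[i]` collides for two pairs, the odd-indexed words of those pairs become exchangeable.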

there are no 64-bit systems that don't also support sse3 instructions
Original K8 (first Athlon64 chips) back in 2003 had no SSE3 (2004) support: https://en.wikipedia.org/wiki/SSE3

26. Fresh git pull and rebuild. Compiled on the Xeon system, but using the new opts (i.e. -march=pentium3 for some cases). As expected, the non-SSE2 x86 ones are now slower.

Code:
```Pentium(R) Dual-Core  CPU      E5300  @ 2.60GHz, gcc 5.3.0
aligned-farsh-x64         |  12.561 GB/s  |  14.712 GB/s  |  85.883 GB/s
aligned-farsh-x64-nosimd  |   7.228 GB/s  |   5.439 GB/s  | -21.972 GB/s
aligned-farsh-x86         |   3.941 GB/s  |   2.972 GB/s  | -12.090 GB/s
aligned-farsh-x86-sse2    |  12.431 GB/s  |  14.898 GB/s  |  75.061 GB/s
farsh-x64                 |   5.232 GB/s  |   6.298 GB/s  |  30.900 GB/s
farsh-x64-nosimd          |   4.365 GB/s  |   3.674 GB/s  | -23.221 GB/s
farsh-x86                 |   3.075 GB/s  |   2.250 GB/s  |  -8.386 GB/s
farsh-x86-sse2            |   5.064 GB/s  |   4.336 GB/s  | -30.150 GB/s```
I'm not sure what the last column is trying to measure, but it's pretty consistent on those numbers.

Regarding intel compilers, they do have some issues with building the x86 code (it complains massively about -m32 and missing libraries) - I expect we simply didn't install the necessary bits. It also complains about -mavx2 not being supported and suggests -march=core-avx2 instead. I tried adding an auto target too with -march=core-avx2 and no explicit SIMD define (SSE2 or AVX) to let the compiler figure it out itself, and to see how its own code generation compared against the hand-crafted form. It did OK (better than not specifying any -msse2 or -mavx), but nowhere near as good as the hand-tuned sse2, let alone avx2.

Put simply, gcc is just thrashing icc at every test here. I don't know why. Do you have any validation that the answer is correct? Looking at some of these numbers I suspect the results are actually differing. Gcc6 at times is reporting 196 GB/s now. No way... It could also be that the compiler is working out some of the answers and optimising the results away. Perhaps one solution is to gcc -c farsh.c, g++ -c main.cpp, and then link main.o farsh.o. This means it has to build farsh.c with no knowledge of the input data - such as how large it is. I tried this and I do indeed get VERY different answers, so I think there is optimisation going on that you haven't managed to defeat. (I see some attempts, but I suspect the compilers got smarter.)

27. Examples with intel C++ 15 and Gnu gcc 6.1 on one file (with #include ../farsh.c), two files in one compile line (#include ../farsh.h and copies of the KEYS copied over) and two files built separately and linked.

@ gen1b[FARSH/benchmark]; g++6 -O3 -funroll-loops -march=core-avx2 -s -static main.cpp
@ gen1b[FARSH/benchmark]; ssh deskpro107386 tmp/FARSH/benchmark/a.out 1
farsh-x64-nosimd | 17.016 GB/s | 196.757 GB/s | 18.626 GB/s
@ gen1b[FARSH/benchmark]; icpc -O3 -funroll-loops -march=core-avx2 -s -static main.cpp
@ gen1b[FARSH/benchmark]; ssh deskpro107386 tmp/FARSH/benchmark/a.out 1
farsh-x64-nosimd | 14.590 GB/s | 15.209 GB/s | 358.527 GB/s

@ gen1b[FARSH/benchmark]; g++6 -O3 -funroll-loops -march=core-avx2 -s -static main.tmp.cpp ../farsh.c
@ gen1b[FARSH/benchmark]; ssh deskpro107386 tmp/FARSH/benchmark/a.out 1
farsh-x64-nosimd | 16.430 GB/s | 11.896 GB/s | -43.112 GB/s
@ gen1b[FARSH/benchmark]; icpc -O3 -funroll-loops -march=core-avx2 -s -static main.tmp.cpp ../farsh.c
@ gen1b[FARSH/benchmark]; ssh deskpro107386 tmp/FARSH/benchmark/a.out 1
farsh-x64-nosimd | 14.384 GB/s | 14.632 GB/s | 847.829 GB/s

@ gen1b[FARSH/benchmark]; g++6 -O3 -funroll-loops -march=core-avx2 -c main.tmp.cpp
@ gen1b[FARSH/benchmark]; g++6 -O3 -funroll-loops -march=core-avx2 -c ../farsh.c
@ gen1b[FARSH/benchmark]; g++6 -O3 -funroll-loops -march=core-avx2 -s -static main.tmp.o farsh.o
@ gen1b[FARSH/benchmark]; ssh deskpro107386 tmp/FARSH/benchmark/a.out 1
farsh-x64-nosimd | 16.435 GB/s | 11.393 GB/s | -37.135 GB/s
@ gen1b[FARSH/benchmark]; icpc -O3 -funroll-loops -march=core-avx2 -c main.tmp.cpp
@ gen1b[FARSH/benchmark]; icpc -O3 -funroll-loops -march=core-avx2 -c ../farsh.c
@ gen1b[FARSH/benchmark]; icpc -O3 -funroll-loops -march=core-avx2 -s -static main.tmp.o farsh.o
@ gen1b[FARSH/benchmark]; ssh deskpro107386 tmp/FARSH/benchmark/a.out 1
farsh-x64-nosimd | 14.387 GB/s | 14.656 GB/s | 784.813 GB/s

These are equivalent to the "auto" builds I mentioned before, where the non-SIMD code path is used but the full SIMD instruction set is available for the compiler to work with. Clearly it worked a bit TOO well in some cases.

It's a bit manual to test, but I wonder if similar things are happening with some of the other variants. There is less scope for this in the hand tuned simd code as it pretty much has to do what you told it.

28. I'm not sure what the last column is trying to measure, but it's pretty consistent on those numbers.
well, you just caught me in the middle of changes. Nevertheless, you can see it by running any program without parameters:
Code:
```C:\FARSH\benchmark>aligned-farsh-x64-avx2.exe
FARSH 0.2 Benchmark. See https://github.com/Bulat-Ziganshin/FARSH
Usage: farsh [1|2] - choose display format
Hashing 100 GiB: 2033.903 milliseconds = 52.792 GB/s = 49.167 GiB/s
Internal loop:   1655.593 milliseconds = 64.855 GB/s = 60.401 GiB/s
External loop:   378.311 milliseconds = 283.825 GB/s = 264.333 GiB/s```
External loop time is measured as a simple difference between the first 2 times, so it seems that in your first report something went wrong - the overall time was less than the internal loop time. If you update the sources again, you will see that mode "1" now displays the same table as the original, and the new mode "2" also prints the external loop speed. I think that the last measurement is just not interesting for most users
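Working through the sample output above: 2033.903 ms - 1655.593 ms = 378.310 ms for the external loop, and hashing 100 GiB in that time gives the reported ~283.8 GB/s. A sketch of the arithmetic (helper names are illustrative, not from the benchmark source):

```c
/* Speed in GB/s given bytes processed and elapsed milliseconds:
 * bytes/ms is bytes per millisecond = KB/s, so divide by 1e6 for GB/s. */
double gbps(double bytes, double ms) { return bytes / ms / 1e6; }

/* External loop time = overall hashing time minus internal loop time,
 * e.g. external_ms(2033.903, 1655.593) = 378.310 ms, and
 * gbps(100 GiB, 378.310) ~= 283.8 GB/s, matching the "External loop" line. */
double external_ms(double total_ms, double internal_ms)
{
    return total_ms - internal_ms;
}
```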

Put simply, gcc is just thrashing icc at every test here. I don't know why. Do you have any validation that the answer is correct?
about ICC - every compiler has its own set of optimizations, so it's not impossible to see a multi-fold difference when measuring really small loops. In this case, ICC should just generate code from the intrinsics, so it looks a bit strange, but i don't want to investigate compiler quirks now.

The first loop (the one measuring the overall speed) checks the result correctness: if (h != 0xd300ddd8) {printf("\nWrong hash value at iteration %d: %08x !!!\n", i, h);}

Gcc6 at times is reporting 196GB/s now. No way...
one possibility to get such speeds is to compute hashes in multiple threads. GCC has auto-parallelization support, so in theory it can pull that trick. I added one more anti-optimization trick that may prevent that.

ICC also complains about -mavx2 not being supported and suggests -march=core-avx2 instead.
well, ICC just supports all options of the latest base compiler (MSVC on Windows, GCC on Linux) at the time of its release. So, ICC15 just caught the options of some older GCC version that had no -mavx2 support. Then ICC adds a bunch of its own options, so "-march=core-avx2" may be one of them.

29. Intel(R) Core(TM) i5-3470S CPU @ 2.90GHz: AVX

Code:
```aligned-farsh-x86-sse2   | 21.901 GB/s = 20.397 GiB/s | 27.296 GB/s = 25.421 GiB/s
aligned-farsh-x86        |  5.074 GB/s =  4.726 GiB/s |  5.719 GB/s =  5.326 GiB/s
farsh-x86-sse2           | 21.418 GB/s = 19.947 GiB/s | 27.872 GB/s = 25.958 GiB/s
farsh-x86                |  5.082 GB/s =  4.733 GiB/s |  6.042 GB/s =  5.627 GiB/s

aligned-farsh-x64        | 23.226 GB/s = 21.631 GiB/s | 28.276 GB/s = 26.334 GiB/s
aligned-farsh-x64-nosimd | 14.429 GB/s = 13.438 GiB/s | 15.290 GB/s = 14.240 GiB/s
farsh-x64                | 23.041 GB/s = 21.458 GiB/s | 26.871 GB/s = 25.025 GiB/s
farsh-x64-nosimd         | 14.311 GB/s = 13.329 GiB/s | 15.784 GB/s = 14.700 GiB/s```

30. It's not multi-threading as the real time is the same as the user+system time. It's just somehow figuring out a way to precompute part of the answer I think. I can't say what without studying the assembly though.

Anyway, that's only noticeable when I don't use -DFARSH_AVX2 but do specify -march=core-avx2. Otherwise it's running fine and as expected.
