FARSH stands for fast and reliable (but not secure) hash, and I just released version 0.1
I'm interested in results of running benchmark executables (farsh*.exe) on your computers and, of course, in your versions with better bit mixing.
I'm working on my side on an AVX2 version of xxHash.
It's not published yet because, like FARSH 0.1, its quality is not yet good enough.
xxh_avx is quite fast, but not as fast as the FARSH numbers.
Plus, I like its uhash-like ability to generate any key size up to 1024 bits.
So it's a pretty good start.
Universal hashing mentions shifting the result to the right. The universal hashing formula used here (derived from UMAC) is as simple as:
for (i=0; i < elements; i+=2)
    sum += (data[i] + key[i]) * (ULONG)(data[i+1] + key[i+1]);
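As a self-contained sketch (my illustration with explicit types, not the actual FARSH source): the additions wrap mod 2^32 and the multiplication is a full 32x32 -> 64-bit product, accumulated into a 64-bit sum.

```c
#include <stddef.h>
#include <stdint.h>

/* UMAC-style NH inner loop: pairwise (data+key) sums multiplied together,
   products accumulated into a 64-bit sum. Helper name is mine. */
uint64_t umac_nh_block(const uint32_t *data, const uint32_t *key, size_t elements)
{
    uint64_t sum = 0;
    for (size_t i = 0; i + 1 < elements; i += 2)
        sum += (uint64_t)(uint32_t)(data[i] + key[i])      /* full 32x32 -> 64-bit multiply */
             * (uint32_t)(data[i + 1] + key[i + 1]);
    return sum;
}
```

For example, with data {1,2} and key {3,4} this yields (1+3)*(2+4) = 24.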
I ran an experiment and made histograms of the multiplication results after shifting to the right. Here's the Scala code:
Code:
object Main {
  val factorSize = 8
  val factorsNum = 1 << factorSize
  val factors = 0 until factorsNum
  val shifts = 0 to factorSize
  val resultSize = factorSize
  val resultsNum = 1 << resultSize
  val resultMask = (1 << resultSize) - 1
  val countsForShifts = Array.ofDim[Int](resultsNum, shifts.length)
  val ideal = factorsNum * factorsNum / resultsNum

  def tabS(s: String): String = f"$s%12s "
  def tabI(i: Int): String = tabS(i.toString)

  def countErrorsSum(shift: Int): Int = {
    countsForShifts.map(row => math.abs(ideal - row(shift))).sum
  }

  def main(args: Array[String]): Unit = {
    for (factor1 <- factors; factor2 <- factors; shift <- shifts) {
      val result = ((factor1 * factor2) >> shift) & resultMask
      countsForShifts(result)(shift) += 1
    }
    println(tabS("Value") + tabS("Ideal") + shifts.map(tabI).mkString)
    println("Rows:")
    for ((countsRow, i) <- countsForShifts.zipWithIndex) {
      println(tabI(i) + tabI(ideal) + countsRow.map(tabI).mkString)
    }
    println("Total error sums:")
    println(tabS("") + tabI(0) + shifts.map(shift => tabI(countErrorsSum(shift))).mkString)
    val bestShift = shifts.minBy(countErrorsSum)
    val timesBetter = countErrorsSum(0).toDouble / countErrorsSum(bestShift)
    println(s"Best shift is $bestShift by $timesBetter times than default one")
  }
}
The output is:
Each column is for different shift. Shift 0 is the default one and it's the shift that Bulat used (shift 0 is no shift).Code:Value Ideal 0 1 2 3 4 5 6 7 8 Rows: 0 256 1280 894 765 730 729 767 905 1226 1968 1 256 128 208 224 249 276 366 528 910 1301 2 256 256 268 279 288 318 366 549 713 1173 3 256 128 186 226 251 305 388 566 757 1100 4 256 384 334 291 268 271 364 441 659 1032 5 256 128 204 244 282 328 421 495 693 976 6 256 256 272 277 287 315 377 458 622 937 7 256 128 206 240 284 330 420 489 640 909 8 256 512 362 303 283 308 336 440 593 870 9 256 128 184 218 233 277 329 441 614 841 10 256 256 268 279 309 339 355 437 562 821 11 256 128 190 248 277 330 360 448 575 784 12 256 384 330 299 276 299 348 412 551 772 13 256 128 196 250 282 308 339 411 544 750 14 256 256 256 249 271 318 355 421 530 734 15 256 128 186 246 289 357 377 445 552 732 16 256 640 458 365 350 296 328 377 499 696 17 256 128 184 204 214 271 320 396 522 682 18 256 256 260 263 252 249 318 407 504 674 19 256 128 214 248 295 321 357 417 499 654 20 256 384 334 305 288 285 326 366 472 646 21 256 128 204 256 307 322 337 397 510 636 22 256 256 264 271 284 302 361 396 463 623 23 256 128 206 232 266 290 303 366 466 602 24 256 512 378 319 303 313 338 393 481 609 25 256 128 184 204 252 273 298 361 449 580 26 256 256 252 257 276 278 322 375 444 584 27 256 128 214 250 274 289 320 367 457 565 28 256 384 306 273 296 331 329 375 445 562 29 256 128 188 226 232 242 306 346 434 549 30 256 256 264 271 314 336 353 392 439 550 31 256 128 214 266 307 292 322 361 447 549 32 256 768 518 397 292 303 307 340 404 513 33 256 128 200 256 260 266 283 348 431 528 34 256 256 252 253 281 305 318 356 407 497 35 256 128 190 218 228 246 288 352 420 517 36 256 384 326 301 304 337 337 349 405 496 37 256 128 196 210 216 236 312 355 413 493 38 256 256 256 265 273 278 301 336 388 475 39 256 128 198 240 253 297 326 349 407 478 40 256 512 390 315 287 283 294 334 391 473 41 256 128 192 236 251 272 284 323 388 467 42 256 256 252 265 267 265 290 339 
387 455 43 256 128 194 244 276 294 320 359 385 456 44 256 384 318 293 292 306 293 303 361 437 45 256 128 212 268 289 326 336 354 405 457 46 256 256 256 263 273 268 294 313 359 429 47 256 128 178 226 252 262 291 337 370 433 48 256 640 450 369 310 280 290 328 379 425 49 256 128 192 210 229 259 283 319 362 420 50 256 256 252 267 276 269 276 306 354 415 51 256 128 194 238 256 286 312 335 359 405 52 256 384 314 301 290 299 316 314 351 405 53 256 128 188 232 248 258 287 306 360 399 54 256 256 264 271 270 274 292 338 352 399 55 256 128 198 232 251 254 273 298 339 381 56 256 512 382 323 315 301 297 311 340 387 57 256 128 200 250 265 271 293 318 351 383 58 256 256 268 267 268 295 292 303 328 375 59 256 128 186 212 212 249 265 303 344 373 60 256 384 334 317 294 305 318 311 331 371 61 256 128 196 242 278 280 290 310 335 367 62 256 256 264 277 254 254 274 305 335 365 63 256 128 198 262 258 278 281 311 335 365 64 256 896 582 347 269 250 264 284 315 339 65 256 128 200 246 303 308 306 309 321 345 66 256 256 260 249 239 252 278 301 315 345 67 256 128 206 212 236 250 264 288 323 351 68 256 384 314 289 282 277 292 280 308 323 69 256 128 204 248 275 272 278 302 306 327 70 256 256 264 251 243 267 288 300 323 339 71 256 128 190 220 237 242 266 282 304 331 72 256 512 390 329 314 302 273 290 300 311 73 256 128 200 242 270 268 280 293 317 327 74 256 256 252 241 240 264 280 283 296 313 75 256 128 186 210 239 270 286 298 306 313 76 256 384 322 301 282 265 250 276 285 305 77 256 128 188 214 246 269 292 279 298 301 78 256 256 256 267 275 289 279 295 302 305 79 256 128 194 208 245 264 270 274 279 299 80 256 640 434 327 289 250 253 277 292 294 81 256 128 192 220 244 262 281 294 301 289 82 256 256 276 269 278 268 266 263 280 295 83 256 128 190 226 231 240 272 293 286 281 84 256 384 322 295 282 265 263 256 277 295 85 256 128 188 214 236 253 257 271 275 267 86 256 256 256 249 257 269 282 291 291 285 87 256 128 198 240 242 269 285 266 265 271 88 256 512 390 333 310 259 250 265 275 268 89 256 128 
192 238 250 253 234 256 267 265 90 256 256 260 263 287 303 300 290 272 267 91 256 128 198 228 259 269 282 282 286 271 92 256 384 318 283 268 260 259 247 253 253 93 256 128 188 220 227 240 235 260 270 257 94 256 256 264 257 278 283 283 278 265 260 95 256 128 198 232 235 226 249 256 261 249 96 256 768 534 399 283 264 256 260 267 247 97 256 128 200 220 239 252 272 273 252 247 98 256 256 252 243 257 273 270 256 258 241 99 256 128 178 220 245 250 251 260 257 244 100 256 384 342 299 263 250 247 247 246 231 101 256 128 188 228 234 246 254 266 264 249 102 256 256 248 251 254 275 271 256 238 225 103 256 128 186 204 226 238 255 250 253 231 104 256 512 398 349 320 295 263 250 238 224 105 256 128 200 216 229 240 255 266 256 233 106 256 256 268 279 280 281 260 269 259 231 107 256 128 206 234 237 243 232 231 222 209 108 256 384 342 319 287 256 270 256 258 227 109 256 128 196 228 255 276 275 256 230 206 110 256 256 248 253 253 235 238 237 233 215 111 256 128 202 220 227 252 245 252 239 211 112 256 640 442 339 277 242 243 241 233 223 113 256 128 192 248 259 255 264 245 232 198 114 256 256 260 253 250 248 240 245 235 207 115 256 128 202 238 260 264 259 249 227 197 116 256 384 318 279 265 246 270 258 237 205 117 256 128 204 236 255 269 231 217 218 192 118 256 256 248 225 223 235 251 249 236 207 119 256 128 170 196 214 222 232 242 216 181 120 256 512 398 349 306 289 263 252 237 204 121 256 128 208 216 248 262 244 229 212 181 122 256 256 268 269 241 239 251 244 218 189 123 256 128 206 268 276 275 250 237 230 187 124 256 384 314 259 255 252 259 238 211 182 125 256 128 204 218 248 251 242 239 227 187 126 256 256 272 263 270 256 238 235 207 179 127 256 128 190 192 212 207 243 242 228 187 128 256 1024 514 337 259 250 238 228 207 168 129 256 128 176 230 256 263 257 226 205 169 130 256 256 244 259 247 232 233 236 223 179 131 256 128 198 252 266 266 257 231 190 162 132 256 384 306 255 253 245 221 224 222 175 133 256 128 180 214 245 258 248 220 192 161 134 256 256 240 249 262 243 230 229 201 
158 135 256 128 178 202 211 230 255 243 226 173 136 256 512 406 339 286 256 231 208 180 151 137 256 128 200 256 278 270 241 234 210 162 138 256 256 244 259 252 256 252 226 201 161 139 256 128 194 222 226 220 232 216 187 145 140 256 384 310 263 247 256 238 224 203 158 141 256 128 188 216 225 242 228 231 198 151 142 256 256 256 245 228 210 218 214 191 145 143 256 128 198 232 248 264 251 230 201 154 144 256 640 438 353 303 256 241 213 189 139 145 256 128 200 238 257 238 231 210 186 145 146 256 256 252 259 259 271 255 242 200 144 147 256 128 170 206 210 205 218 210 186 141 148 256 384 306 277 263 253 229 212 185 133 149 256 128 180 190 202 221 222 213 188 136 150 256 256 248 259 277 279 271 233 194 143 151 256 128 178 202 223 235 227 219 184 127 152 256 512 390 323 276 226 201 180 166 132 153 256 128 200 242 253 259 257 227 192 133 154 256 256 260 245 257 260 235 228 176 127 155 256 128 170 212 229 232 208 198 179 126 156 256 384 334 309 277 249 243 215 184 129 157 256 128 196 228 247 238 238 211 172 123 158 256 256 248 259 245 236 232 216 182 116 159 256 128 170 196 232 220 210 194 169 121 160 256 768 506 385 301 269 251 230 189 123 161 256 128 184 210 201 209 219 194 158 110 162 256 256 260 265 256 252 231 214 175 117 163 256 128 194 236 243 234 221 202 162 110 164 256 384 314 289 267 247 233 204 178 121 165 256 128 188 228 235 243 222 211 168 109 166 256 256 256 245 242 250 233 206 167 108 167 256 128 186 210 222 223 227 200 162 103 168 256 512 378 311 260 250 218 200 180 120 169 256 128 192 230 244 237 224 215 168 99 170 256 256 260 245 242 253 228 181 140 97 171 256 128 190 210 213 205 218 208 172 100 172 256 384 322 289 279 254 219 181 160 105 173 256 128 172 190 202 220 236 228 167 101 174 256 256 256 243 230 227 206 181 148 94 175 256 128 206 236 237 251 218 195 164 93 176 256 640 446 365 309 242 226 200 154 94 177 256 128 192 220 234 243 240 202 154 95 178 256 256 260 263 251 228 220 195 159 95 179 256 128 190 196 199 194 201 191 147 86 180 256 384 326 297 275 
250 219 204 165 89 181 256 128 196 242 265 259 237 186 140 82 182 256 256 248 267 277 268 240 207 160 89 183 256 128 186 218 222 226 214 185 150 91 184 256 512 386 311 272 235 200 196 146 78 185 256 128 184 212 226 239 219 181 143 79 186 256 256 244 255 247 225 223 198 147 80 187 256 128 198 206 209 188 192 180 147 79 188 256 384 306 289 271 249 233 196 150 78 189 256 128 188 232 259 237 224 191 146 81 190 256 256 248 241 223 249 229 196 142 73 191 256 128 186 200 197 204 182 170 142 70 192 256 896 570 343 298 265 249 211 139 73 193 256 128 184 196 208 205 192 181 146 68 194 256 256 252 237 230 246 225 168 129 69 195 256 128 178 206 215 223 209 197 149 76 196 256 384 326 317 313 287 234 192 134 67 197 256 128 180 190 206 206 198 182 136 61 198 256 256 248 247 236 205 210 166 132 62 199 256 128 194 234 240 234 208 187 140 61 200 256 512 378 309 277 258 221 182 123 64 201 256 128 184 180 195 201 196 180 135 59 202 256 256 260 245 227 201 194 168 129 60 203 256 128 198 216 222 230 228 200 146 65 204 256 384 318 289 283 264 235 182 122 53 205 256 128 196 216 211 217 200 164 128 54 206 256 256 256 263 262 235 208 186 124 51 207 256 128 190 210 210 213 186 168 131 52 208 256 640 462 363 306 252 225 193 132 53 209 256 128 192 234 257 237 205 158 115 48 210 256 256 236 233 239 234 224 192 135 51 211 256 128 194 216 224 224 201 167 120 48 212 256 384 318 275 259 239 224 186 138 53 213 256 128 196 236 219 202 201 175 114 45 214 256 256 256 245 222 212 192 164 118 42 215 256 128 186 222 237 198 193 158 111 41 216 256 512 378 305 275 284 235 201 127 40 217 256 128 192 212 209 197 188 165 119 41 218 256 256 252 259 252 237 209 166 118 38 219 256 128 186 206 226 230 212 172 106 39 220 256 384 322 287 255 238 215 172 124 36 221 256 128 196 222 234 216 178 155 108 37 222 256 256 248 237 243 247 203 179 122 36 223 256 128 186 202 206 194 204 155 105 35 224 256 768 490 355 256 242 214 181 125 38 225 256 128 184 210 204 193 197 161 113 31 226 256 256 260 263 250 246 223 176 116 29 
227 256 128 206 222 236 226 188 158 96 28 228 256 384 298 263 226 222 212 170 116 27 229 256 128 196 230 227 203 177 156 104 26 230 256 256 264 263 251 219 210 179 108 25 231 256 128 198 242 247 234 193 144 101 24 232 256 512 370 305 277 268 226 180 113 23 233 256 128 184 214 220 223 203 159 103 22 234 256 256 244 235 231 197 181 156 102 21 235 256 128 178 208 230 216 178 154 100 20 236 256 384 298 251 230 228 211 167 108 19 237 256 128 188 210 210 202 192 165 108 18 238 256 256 264 265 256 254 216 156 94 17 239 256 128 182 214 212 197 196 155 95 16 240 256 640 454 335 288 218 192 170 109 15 241 256 128 192 218 226 231 208 166 102 14 242 256 256 252 241 243 205 185 144 91 13 243 256 128 182 224 245 234 194 154 96 12 244 256 384 322 275 254 251 221 171 102 11 245 256 128 180 186 188 192 182 142 92 10 246 256 256 264 261 238 205 179 154 96 9 247 256 128 214 250 265 246 219 170 95 8 248 256 512 370 297 247 225 186 148 99 7 249 256 128 176 218 217 226 195 150 86 6 250 256 256 244 233 220 203 192 159 95 5 251 256 128 178 210 235 228 201 161 94 4 252 256 384 326 287 260 212 192 143 87 3 253 256 128 180 204 195 204 189 154 93 2 254 256 256 240 243 221 199 178 146 90 1 255 256 128 194 242 273 287 243 179 97 0 Total error sums: 0 32768 16832 8970 6498 7128 10694 17826 29832 48550 Best shift is 3 by 5.042782394582948 times than default one
If I change factorSize and comment out the printing of rows, i.e.:
Code:
object Main {
  val factorSize = 15
  val factorsNum = 1 << factorSize
  val factors = 0 until factorsNum
  val shifts = 0 to factorSize
  val resultSize = factorSize
  val resultsNum = 1 << resultSize
  val resultMask = (1 << resultSize) - 1
  val countsForShifts = Array.ofDim[Int](resultsNum, shifts.length)
  val ideal = factorsNum * factorsNum / resultsNum

  def tabS(s: String): String = f"$s%12s "
  def tabI(i: Int): String = tabS(i.toString)

  def countErrorsSum(shift: Int): Int = {
    countsForShifts.map(row => math.abs(ideal - row(shift))).sum
  }

  def main(args: Array[String]): Unit = {
    for (factor1 <- factors; factor2 <- factors; shift <- shifts) {
      val result = ((factor1 * factor2) >> shift) & resultMask
      countsForShifts(result)(shift) += 1
    }
    println(tabS("Value") + tabS("Ideal") + shifts.map(tabI).mkString)
    // println("Rows:")
    // for ((countsRow, i) <- countsForShifts.zipWithIndex) {
    //   println(tabI(i) + tabI(ideal) + countsRow.map(tabI).mkString)
    // }
    println("Total error sums:")
    println(tabS("") + tabI(0) + shifts.map(shift => tabI(countErrorsSum(shift))).mkString)
    val bestShift = shifts.minBy(countErrorsSum)
    val timesBetter = countErrorsSum(0).toDouble / countErrorsSum(bestShift)
    println(s"Best shift is $bestShift by $timesBetter times than default one")
  }
}
Then I get the following:
With growing factor size, the differences grow. They will certainly cause problems when there is only one pair of inputs to hash and the shift is bad. I don't know whether the spread would be bad if there are many pairs to be hashed, but there's SMHasher and its reports.
Code:
Value Ideal 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Total error sums:
 0 536870912 269083040 135153166 68433600 35241912 19274406 13546862 13617772 18553286 30496378 53392724 94561504 166028768 286628836 483389300 790058434
Best shift is 6 by 39.63064745178625 times than default one
Piotr: yes, the hash value produced directly by this formula should be shifted right by 32 for a correct result, since the lower bits aren't 100% random. But since we use this value only as an entropy source, saving the entire 64 bits can't make the results worse. On practical grounds, my first version saved only the higher 32 bits, and I quickly discovered that similar short source data (such as {1} and {2}) leads to similar hashes.
UHASH saves the entire 64-bit results for 1024-byte blocks, concatenates them, and then runs more complex polynomial hashing over this data. I just tried calculating CRC32 of the same data (the sequence of 64-bit block hashes) and it gave much better SMHasher results: now only the Avalanche and Zeroes tests fail.
i7-4770/16Gb/W7C x64
farsh-x86.exe | Hashing 100 GiB (b004ee2e): 16754.775 milliseconds = 6.409 GB/s = 5.968 GiB/s
farsh-x86-sse2.exe | Hashing 100 GiB (b004ee2e): 3448.710 milliseconds = 31.135 GB/s = 28.996 GiB/s
farsh-x86-avx2.exe | Hashing 100 GiB (b004ee2e): 1853.249 milliseconds = 57.938 GB/s = 53.959 GiB/s
farsh-x64.exe | Hashing 100 GiB (b004ee2e): 3960.820 milliseconds = 27.109 GB/s = 25.247 GiB/s
farsh-x64-avx2.exe | Hashing 100 GiB (b004ee2e): 1891.681 milliseconds = 56.761 GB/s = 52.863 GiB/s
Hey, 404, I have exactly the same CPU! I would like to see other CPU results, but it seems that only Haswellers are brave enough to run these executables.
How about old school Conroe?
C2D E6550 @2.33 GHz, 6GB DDR2, W7x64
Code:
farsh-x86.exe      Hashing 100 GiB (b004ee2e): 32914.037 milliseconds = 3.262 GB/s = 3.038 GiB/s
farsh-x86-sse2.exe Hashing 100 GiB (b004ee2e): 15302.369 milliseconds = 7.017 GB/s = 6.535 GiB/s
farsh-x64.exe      Hashing 100 GiB (b004ee2e): 15822.045 milliseconds = 6.786 GB/s = 6.320 GiB/s
PAQer, thanks, you are reading my thoughts! I expected good performance on Core2, since it has 128-bit SIMD engines, but your results are discouraging. I looked into the problem and it seems the CPU spends half of its time on unaligned loads. So I prepared executables with all data aligned and, optionally, aligned loads. Can you please run at least all 4 farsh-x86-sse2.exe programs from the attached archive? Even better, run all non-AVX executables and give me the console output (I can format it myself).
These executables should tell a lot for any pre-Haswell CPU, since only Haswell finally got rid of the unaligned-load penalty. If they show a difference, I will probably need to provide 2-4 variants of the SSE2 code with dynamic selection based on the alignment of key and data.
Last edited by Bulat Ziganshin; 30th May 2015 at 02:08.
Phenom II 945, 3 GHz, W7x64, 4 GB RAM
farsh0.1\benchmark>farsh-x86.exe
Hashing 100 GiB (b004ee2e): 24995.086 milliseconds = 4.296 GB/s = 4.001 GiB/s
farsh0.1\benchmark>farsh-x64.exe
Hashing 100 GiB (b004ee2e): 7903.949 milliseconds = 13.585 GB/s = 12.652 GiB/s
farsh0.1\benchmark>farsh-x86-sse2.exe
Hashing 100 GiB (b004ee2e): 7244.024 milliseconds = 14.822 GB/s = 13.804 GiB/s
---
farsh-alignment\align-both>farsh-x86.exe
Hashing 100 GiB (fe82e924): 30476.906 milliseconds = 3.523 GB/s = 3.281 GiB/s
farsh-alignment\align-both>farsh-x64.exe
Hashing 100 GiB (fe82e924): 8754.846 milliseconds = 12.265 GB/s = 11.422 GiB/s
farsh-alignment\align-both>farsh-x86-sse2.exe
Hashing 100 GiB (fe82e924): 7384.217 milliseconds = 14.541 GB/s = 13.542 GiB/s
---
farsh-alignment\align-data>farsh-x86.exe
Hashing 100 GiB (fe82e924): 30467.067 milliseconds = 3.524 GB/s = 3.282 GiB/s
farsh-alignment\align-data>farsh-x64.exe
Hashing 100 GiB (fe82e924): 8760.540 milliseconds = 12.257 GB/s = 11.415 GiB/s
farsh-alignment\align-data>farsh-x86-sse2.exe
Hashing 100 GiB (fe82e924): 7394.061 milliseconds = 14.522 GB/s = 13.524 GiB/s
---
farsh-alignment\align-key>farsh-x86.exe
Hashing 100 GiB (fe82e924): 30472.935 milliseconds = 3.524 GB/s = 3.282 GiB/s
farsh-alignment\align-key>farsh-x64.exe
Hashing 100 GiB (fe82e924): 8758.215 milliseconds = 12.260 GB/s = 11.418 GiB/s
farsh-alignment\align-key>farsh-x86-sse2.exe
Hashing 100 GiB (fe82e924): 7382.260 milliseconds = 14.545 GB/s = 13.546 GiB/s
---
farsh-alignment\no-align>farsh-x86.exe
Hashing 100 GiB (fe82e924): 30452.065 milliseconds = 3.526 GB/s = 3.284 GiB/s
farsh-alignment\no-align>farsh-x64.exe
Hashing 100 GiB (fe82e924): 8966.305 milliseconds = 11.975 GB/s = 11.153 GiB/s
farsh-alignment\no-align>farsh-x86-sse2.exe
Hashing 100 GiB (fe82e924): 7545.100 milliseconds = 14.231 GB/s = 13.254 GiB/s
Code:
align-both.farsh-x64-avx2.exe | Hashing 100 GiB (fe82e924): 2037.786 milliseconds = 52.692 GB/s = 49.073 GiB/s
align-both.farsh-x64.exe      | Hashing 100 GiB (fe82e924): 3565.271 milliseconds = 30.117 GB/s = 28.048 GiB/s
align-both.farsh-x86-avx2.exe | Hashing 100 GiB (fe82e924): 2227.479 milliseconds = 48.204 GB/s = 44.894 GiB/s
align-both.farsh-x86-sse2.exe | Hashing 100 GiB (fe82e924): 3522.907 milliseconds = 30.479 GB/s = 28.386 GiB/s
align-both.farsh-x86.exe      | Hashing 100 GiB (fe82e924): 18804.064 milliseconds = 5.710 GB/s = 5.318 GiB/s
align-data.farsh-x64-avx2.exe | Hashing 100 GiB (fe82e924): 2040.561 milliseconds = 52.620 GB/s = 49.006 GiB/s
align-data.farsh-x64.exe      | Hashing 100 GiB (fe82e924): 3577.289 milliseconds = 30.016 GB/s = 27.954 GiB/s
align-data.farsh-x86-avx2.exe | Hashing 100 GiB (fe82e924): 2245.868 milliseconds = 47.810 GB/s = 44.526 GiB/s
align-data.farsh-x86-sse2.exe | Hashing 100 GiB (fe82e924): 3559.210 milliseconds = 30.168 GB/s = 28.096 GiB/s
align-data.farsh-x86.exe      | Hashing 100 GiB (fe82e924): 18721.620 milliseconds = 5.735 GB/s = 5.341 GiB/s
align-key.farsh-x64-avx2.exe  | Hashing 100 GiB (fe82e924): 2030.425 milliseconds = 52.883 GB/s = 49.251 GiB/s
align-key.farsh-x64.exe       | Hashing 100 GiB (fe82e924): 3581.340 milliseconds = 29.982 GB/s = 27.923 GiB/s
align-key.farsh-x86-avx2.exe  | Hashing 100 GiB (fe82e924): 2230.221 milliseconds = 48.145 GB/s = 44.839 GiB/s
align-key.farsh-x86-sse2.exe  | Hashing 100 GiB (fe82e924): 3510.804 milliseconds = 30.584 GB/s = 28.484 GiB/s
align-key.farsh-x86.exe       | Hashing 100 GiB (fe82e924): 18792.595 milliseconds = 5.714 GB/s = 5.321 GiB/s
no-align.farsh-x64-avx2.exe   | Hashing 100 GiB (fe82e924): 2034.917 milliseconds = 52.766 GB/s = 49.142 GiB/s
no-align.farsh-x64.exe        | Hashing 100 GiB (fe82e924): 3997.399 milliseconds = 26.861 GB/s = 25.016 GiB/s
no-align.farsh-x86-avx2.exe   | Hashing 100 GiB (fe82e924): 2217.646 milliseconds = 48.418 GB/s = 45.093 GiB/s
no-align.farsh-x86-sse2.exe   | Hashing 100 GiB (fe82e924): 3588.777 milliseconds = 29.919 GB/s = 27.865 GiB/s
no-align.farsh-x86.exe        | Hashing 100 GiB (fe82e924): 18802.944 milliseconds = 5.710 GB/s = 5.318 GiB/s
Intel i7 4960X OC 4.5GHz, 32GB DDR3 2000MHz, Win 8.1 64bit:
farsh-x64.exe
Hashing 100 GiB (b004ee2e): 3311.381 milliseconds = 32.426 GB/s = 30.199 GiB/s
benchmark>farsh-x64-avx2.exe
Crash
farsh-x86.exe
Hashing 100 GiB (b004ee2e): 14734.266 milliseconds = 7.287 GB/s = 6.787 GiB/s
farsh-x86-sse2.exe
Hashing 100 GiB (b004ee2e): 3302.627 milliseconds = 32.512 GB/s = 30.279 GiB/s
farsh-x86-avx2.exe
Crash
The Intel i7 4960X is an Ivy Bridge and doesn't support AVX2, so the crash is rather expected, although I think printing an error message would be a lot better than just plainly crashing :]
Intel Pentium M 1,5 GHz + Win XP SP3 32 bit Czech + 512 MB RAM
align-both\farsh-x32.exe Hashing 100 GiB (fe82e924): 74240.206 milliseconds = 1.446 GB/s = 1.347 GiB/s
align-both\farsh-x32-avx2.exe Not supported
align-both\farsh-x32-sse2.exe Hashing 100 GiB (fe82e924): 40271.890 milliseconds = 2.666 GB/s = 2.483 GiB/s
align-data\farsh-x32.exe Hashing 100 GiB (fe82e924): 73956.563 milliseconds = 1.452 GB/s = 1.352 GiB/s
align-data\farsh-x32-avx2.exe Not supported
align-data\farsh-x32-sse2.exe Hashing 100 GiB (fe82e924): 46726.115 milliseconds = 2.298 GB/s = 2.140 GiB/s
align-key\farsh-x32.exe Hashing 100 GiB (fe82e924): 74018.630 milliseconds = 1.451 GB/s = 1.351 GiB/s
align-key\farsh-x32-avx2.exe Not supported
align-key\farsh-x32-sse2.exe Hashing 100 GiB (fe82e924): 46018.345 milliseconds = 2.333 GB/s = 2.173 GiB/s
no-align\farsh-x32.exe Hashing 100 GiB (fe82e924): 73991.978 milliseconds = 1.451 GB/s = 1.351 GiB/s
no-align\farsh-x32-avx2.exe Not supported
no-align\farsh-x32-sse2.exe Hashing 100 GiB (fe82e924): 51969.406 milliseconds = 2.066 GB/s = 1.924 GiB/s
Thanks. The Pentium M was the last Intel CPU with a 64-bit SSE engine, so it shows lower speed and less difference between the aligned and unaligned versions. But aligned is still 30% faster.
I feel that on older CPUs xxHash may be faster than any FARSH variant, simply because SSE was too slow there and the pure x86 version is faster in xxHash.
Yes, aligned loads almost 1.6 times faster here compared to unaligned.
BTW: Conroe has a slow shuffle unit (Penryn doubled shuffle performance).
Code:
T:\farsh-alignment\align-both>farsh-x86.exe
Hashing 100 GiB (fe82e924): 43673.399 milliseconds = 2.459 GB/s = 2.290 GiB/s
T:\farsh-alignment\align-both>farsh-x86-sse2.exe
Hashing 100 GiB (fe82e924): 10475.474 milliseconds = 10.250 GB/s = 9.546 GiB/s
T:\farsh-alignment\align-both>farsh-x64.exe
Hashing 100 GiB (fe82e924): 12140.137 milliseconds = 8.845 GB/s = 8.237 GiB/s
T:\farsh-alignment\align-data>farsh-x86.exe
Hashing 100 GiB (fe82e924): 43610.016 milliseconds = 2.462 GB/s = 2.293 GiB/s
T:\farsh-alignment\align-data>farsh-x86-sse2.exe
Hashing 100 GiB (fe82e924): 13461.910 milliseconds = 7.976 GB/s = 7.428 GiB/s
T:\farsh-alignment\align-data>farsh-x64.exe
Hashing 100 GiB (fe82e924): 15137.458 milliseconds = 7.093 GB/s = 6.606 GiB/s
T:\farsh-alignment\align-key>farsh-x86.exe
Hashing 100 GiB (fe82e924): 43140.574 milliseconds = 2.489 GB/s = 2.318 GiB/s
T:\farsh-alignment\align-key>farsh-x86-sse2.exe
Hashing 100 GiB (fe82e924): 14095.930 milliseconds = 7.617 GB/s = 7.094 GiB/s
T:\farsh-alignment\align-key>farsh-x64.exe
Hashing 100 GiB (fe82e924): 15810.221 milliseconds = 6.791 GB/s = 6.325 GiB/s
T:\farsh-alignment\no-align>farsh-x86.exe
Hashing 100 GiB (fe82e924): 43529.906 milliseconds = 2.467 GB/s = 2.297 GiB/s
T:\farsh-alignment\no-align>farsh-x86-sse2.exe
Hashing 100 GiB (fe82e924): 17283.675 milliseconds = 6.212 GB/s = 5.786 GiB/s
T:\farsh-alignment\no-align>farsh-x64.exe
Hashing 100 GiB (fe82e924): 17802.779 milliseconds = 6.031 GB/s = 5.617 GiB/s
UMAC employs PSRLDQ instead of PSHUFD, which is 2x faster on early CPUs. On newer ones, though, it uses the same port as PMUL, so it may limit performance to 2 cycles per SIMD word. Probably, interleaving them will give better performance on Sandy Bridge+ CPUs (which can load 2 SIMD words per cycle), while a PSRLDQ-only loop is optimal for older Intel CPUs.
Although in the case of Core2, the speed may be limited by code size: Core2 can't decode more than 16 instruction bytes per cycle and has only a 64-byte loop buffer. At least, that may be why the x64 version is slower than the x86-sse2 one. And it may be why the aligned version is faster: it should have smaller code.
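A rough sketch of how one SSE2 step of the NH loop can use PSRLDQ (_mm_srli_si128) to line up the odd lanes for PMULUDQ (_mm_mul_epu32). This is my illustration of the technique discussed above, not UMAC's or FARSH's actual code:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* One step: add key to data lanewise, multiply lane pairs (0*1 and 2*3)
   as full 32x32 -> 64-bit products, and accumulate two 64-bit sums. */
__m128i nh_step(__m128i sum, __m128i data, __m128i key)
{
    __m128i t   = _mm_add_epi32(data, key);  /* data[i] + key[i], 4 lanes        */
    __m128i odd = _mm_srli_si128(t, 4);      /* PSRLDQ: odd lanes -> even slots  */
    return _mm_add_epi64(sum, _mm_mul_epu32(t, odd));  /* PMULUDQ + PADDQ        */
}
```

With data lanes (1,2,3,4) and a zero key this accumulates the products 1*2 and 3*4 into the two 64-bit halves.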
Last edited by Bulat Ziganshin; 30th May 2015 at 19:41.
Impressive work!
I've run all tests on my i7-4700MQ (running @ ~3.2 GHz).
Order is
farsh-x64
farsh-x64-avx2
farsh-x86
farsh-x86-avx2
farsh-x86-sse2
for all tests.
I could run tests on an i7-4500U ULV processor as well. Sadly, I don't have a non-Haswell CPU anymore.
Code:
align-both
Hashing 100 GiB (fe82e924): 4081.477 milliseconds = 26.308 GB/s = 24.501 GiB/s
Hashing 100 GiB (fe82e924): 2304.709 milliseconds = 46.589 GB/s = 43.389 GiB/s
Hashing 100 GiB (fe82e924): 21166.664 milliseconds = 5.073 GB/s = 4.724 GiB/s
Hashing 100 GiB (fe82e924): 2501.702 milliseconds = 42.920 GB/s = 39.973 GiB/s
Hashing 100 GiB (fe82e924): 3968.424 milliseconds = 27.057 GB/s = 25.199 GiB/s
align-data
Hashing 100 GiB (fe82e924): 4070.672 milliseconds = 26.378 GB/s = 24.566 GiB/s
Hashing 100 GiB (fe82e924): 2298.015 milliseconds = 46.725 GB/s = 43.516 GiB/s
Hashing 100 GiB (fe82e924): 21197.086 milliseconds = 5.066 GB/s = 4.718 GiB/s
Hashing 100 GiB (fe82e924): 2515.694 milliseconds = 42.682 GB/s = 39.750 GiB/s
Hashing 100 GiB (fe82e924): 3994.229 milliseconds = 26.882 GB/s = 25.036 GiB/s
align-key
Hashing 100 GiB (fe82e924): 4070.239 milliseconds = 26.380 GB/s = 24.569 GiB/s
Hashing 100 GiB (fe82e924): 2300.552 milliseconds = 46.673 GB/s = 43.468 GiB/s
Hashing 100 GiB (fe82e924): 21193.883 milliseconds = 5.066 GB/s = 4.718 GiB/s
Hashing 100 GiB (fe82e924): 2499.901 milliseconds = 42.951 GB/s = 40.002 GiB/s
Hashing 100 GiB (fe82e924): 3969.799 milliseconds = 27.048 GB/s = 25.190 GiB/s
no-align
Hashing 100 GiB (fe82e924): 4563.363 milliseconds = 23.530 GB/s = 21.914 GiB/s
Hashing 100 GiB (fe82e924): 2315.089 milliseconds = 46.380 GB/s = 43.195 GiB/s
Hashing 100 GiB (fe82e924): 21247.594 milliseconds = 5.053 GB/s = 4.706 GiB/s
Hashing 100 GiB (fe82e924): 2503.921 milliseconds = 42.882 GB/s = 39.937 GiB/s
Hashing 100 GiB (fe82e924): 4055.867 milliseconds = 26.474 GB/s = 24.656 GiB/s
Last edited by zody; 30th May 2015 at 20:58.
Intel Core2 Duo CPU T7100 1,8 GHz + Win XP SP3 32 bit Czech + 1024 MB RAM
C:\Data\x>.\align-both\farsh-x32.exe Hashing 100 GiB (fe82e924): 48604.204 milliseconds = 2.209 GB/s = 2.057 GiB/s
C:\Data\x>Rem .\align-both\farsh-x32-avx2.exe Not supported
C:\Data\x>.\align-both\farsh-x32-sse2.exe Hashing 100 GiB (fe82e924): 11917.074 milliseconds = 9.010 GB/s = 8.391 GiB/s
C:\Data\x>.\align-data\farsh-x32.exe Hashing 100 GiB (fe82e924): 48586.506 milliseconds = 2.210 GB/s = 2.058 GiB/s
C:\Data\x>Rem .\align-data\farsh-x32-avx2.exe Not supported
C:\Data\x>.\align-data\farsh-x32-sse2.exe Hashing 100 GiB (fe82e924): 15099.395 milliseconds = 7.111 GB/s = 6.623 GiB/s
C:\Data\x>.\align-key\farsh-x32.exe Hashing 100 GiB (fe82e924): 48682.345 milliseconds = 2.206 GB/s = 2.054 GiB/s
C:\Data\x>Rem .\align-key\farsh-x32-avx2.exe Not supported
C:\Data\x>.\align-key\farsh-x32-sse2.exe Hashing 100 GiB (fe82e924): 15657.168 milliseconds = 6.858 GB/s = 6.387 GiB/s
C:\Data\x>.\no-align\farsh-x32.exe Hashing 100 GiB (fe82e924): 48682.790 milliseconds = 2.206 GB/s = 2.054 GiB/s
C:\Data\x>Rem .\no-align\farsh-x32-avx2.exe Not supported
C:\Data\x>.\no-align\farsh-x32-sse2.exe Hashing 100 GiB (fe82e924): 19191.652 milliseconds = 5.595 GB/s = 5.211 GiB/s
1. I've found a way to make keys always aligned on a 16-byte boundary: FARSH_KEYS itself is aligned, and each next (32-bit) hash will use key material at a +16 offset compared to the previous one. This means that the overall key size for 4*n bytes of hash becomes 1024+(n-1)*16 bytes instead of the 1024+(n-1)*4 in the first version. In particular, a 1024-bit hash result will require 1520 bytes of key material instead of 1148 bytes.
2. The alignment of the data being hashed will be checked in farsh_fast(), so the price is one always-taken branch, which is very cheap on any CPU with branch prediction.
3. Now I'm translating the UHASH higher-level hashes (i.e. bit mixers) into my code, so the whole FARSH becomes UHASH with prebuilt key material. The UHASH mixers are ~100 lines of code, so that may be a small price for guaranteed quality.
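The key-material arithmetic from point 1 can be sketched as follows (the helper names are mine, just restating the formulas above):

```c
/* Key material needed for n 32-bit hash words, per the formulas above.
   Helper names are illustrative, not from the FARSH sources. */
unsigned key_bytes_v01(unsigned n) { return 1024 + (n - 1) * 4; }   /* first version: +4 bytes/word   */
unsigned key_bytes_new(unsigned n) { return 1024 + (n - 1) * 16; }  /* aligned scheme: +16 bytes/word */
```

For a 1024-bit result (n = 32) these give 1148 and 1520 bytes respectively, matching the numbers quoted above.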
UPDATE: I copied the UHASH bit-mixing algorithm for blocks up to 1024 bytes: code & report. But either my copy is wrong or it doesn't make SMHasher happy. Now I need to add vanilla UHASH to my SMHasher tests to find out on which side the problem is.
Last edited by Bulat Ziganshin; 31st May 2015 at 16:00.
Intel Core i3-3220 CPU 3,3 GHz + Win 7 32 bit Czech + 4 GB RAM
.\align-both\farsh-x32.exe Hashing 100 GiB (fe82e924): 25963.731 milliseconds = 4.136 GB/s = 3.852 GiB/s
.\align-both\farsh-x32-avx2.exe Not supported
.\align-both\farsh-x32-sse2.exe Hashing 100 GiB (fe82e924): 4696.721 milliseconds = 22.862 GB/s = 21.291 GiB/s
.\align-data\farsh-x32.exe Hashing 100 GiB (fe82e924): 25863.286 milliseconds = 4.152 GB/s = 3.866 GiB/s
.\align-data\farsh-x32-avx2.exe Not supported
.\align-data\farsh-x32-sse2.exe Hashing 100 GiB (fe82e924): 4941.686 milliseconds = 21.728 GB/s = 20.236 GiB/s
.\align-key\farsh-x32.exe Hashing 100 GiB (fe82e924): 25962.321 milliseconds = 4.136 GB/s = 3.852 GiB/s
.\align-key\farsh-x32-avx2.exe Not supported
.\align-key\farsh-x32-sse2.exe Hashing 100 GiB (fe82e924): 4883.320 milliseconds = 21.988 GB/s = 20.478 GiB/s
.\no-align\farsh-x32.exe Hashing 100 GiB (fe82e924): 25708.534 milliseconds = 4.177 GB/s = 3.890 GiB/s
.\no-align\farsh-x32-avx2.exe Not supported
.\no-align\farsh-x32-sse2.exe Hashing 100 GiB (fe82e924): 4999.023 milliseconds = 21.479 GB/s = 20.004 GiB/s
Meanwhile, I've found an interesting suite of PRNGs. Among other things, their xorshift128+ may be used to initialize FARSH_KEYS[].
I added the UHASH, VHASH and Poly1305 algorithms to my SMHasher suite. Of those, only VHASH successfully passed all tests.
I also extended the information printed by the AvalancheTest and concluded that the block hashes in FARSH return enough entropy; only their mixing needs to be improved. So I tried the mixing from xxHash64, which finally made all SMHasher tests happy. I also added a 64-bit 'seed' in the same way as it is implemented in xxHash64, in order to satisfy both SMHasher and the most demanding users.
The mixing algorithm borrowed from xxHash64 made computations up to 40% slower, so I will look further into balancing speed and quality. But at least the FARSH idea works: it now passes all SMHasher tests while being faster than any other hash, although the numeric quality measures reported by SMHasher are slightly worse than those for murmur3a, and probably xxHash too.
Last edited by Bulat Ziganshin; 7th June 2015 at 10:57.
Timings from my old netbook. It may seem a bit derogatory, but I am surprised it got so fast there!
(Netbook Aspire One Intel(R) Atom(TM) CPU N2600 @ 1.60GHz, 2 cores Ubuntu 64-bit )
Code:
~/farsh/farsh-alignment/no-align $ wine farsh-x64.exe
Hashing 100 GiB (fe82e924): 99102.602 milliseconds = 1.083 GB/s = 1.009 GiB/s
~/farsh/farsh-alignment/no-align $ wine farsh-x86.exe
Hashing 100 GiB (fe82e924): 132066.573 milliseconds = 0.813 GB/s = 0.757 GiB/s
~/farsh/farsh-alignment/no-align $ wine farsh-x86-sse2.exe
Hashing 100 GiB (fe82e924): 87109.251 milliseconds = 1.233 GB/s = 1.148 GiB/s
~/farsh/farsh-alignment/align-key $ wine farsh-x64.exe
Hashing 100 GiB (fe82e924): 65907.654 milliseconds = 1.629 GB/s = 1.517 GiB/s
~/farsh/farsh-alignment/align-key $ wine farsh-x86.exe
Hashing 100 GiB (fe82e924): 131712.031 milliseconds = 0.815 GB/s = 0.759 GiB/s
~/farsh/farsh-alignment/align-key $ wine farsh-x86-sse2.exe
Hashing 100 GiB (fe82e924): 61781.707 milliseconds = 1.738 GB/s = 1.619 GiB/s
~/farsh/farsh-alignment/align-data $ wine farsh-x64.exe
Hashing 100 GiB (fe82e924): 65907.564 milliseconds = 1.629 GB/s = 1.517 GiB/s
~/farsh/farsh-alignment/align-data $ wine farsh-x86.exe
Hashing 100 GiB (fe82e924): 131792.882 milliseconds = 0.815 GB/s = 0.759 GiB/s
~/farsh/farsh-alignment/align-data $ wine farsh-x86-sse2.exe
Hashing 100 GiB (fe82e924): 61750.769 milliseconds = 1.739 GB/s = 1.619 GiB/s
~/farsh/farsh-alignment/align-both $ wine farsh-x64.exe
Hashing 100 GiB (fe82e924): 43861.363 milliseconds = 2.448 GB/s = 2.280 GiB/s
~/farsh/farsh-alignment/align-both $ wine farsh-x86.exe
Hashing 100 GiB (fe82e924): 131727.429 milliseconds = 0.815 GB/s = 0.759 GiB/s
~/farsh/farsh-alignment/align-both $ wine farsh-x86-sse2.exe
Hashing 100 GiB (fe82e924): 46215.390 milliseconds = 2.323 GB/s = 2.164 GiB/s
Its speed is close to a Pentium M. The effect of alignment is up to a 2.3x speedup, which is an absolute record. Thanks!
Bulat, I have found minor issues in the docs:
1. https://github.com/Bulat-Ziganshin/FARSH#the-main-loop
The link to the source code is outdated
2. This note is not just old, it's outdated. I think it's best to remove it; otherwise, please make clear that it no longer applies:
OLD NOTE: FARSH isn't yet ready for practical use since SMHasher shows a lot of problems in the current implementation. But the FARSH main loop implements a universal hashing scheme that's mathematically proven to guarantee ideal hashing (as long as the key material is random), and it employs the formula successfully used in the cryptographic UMAC/VMAC algorithms.
The following paragraph (maybe two) is outdated too, and this isn't indicated at all.
3. Is the bitstream stable already? I don't see it mentioned and it's important for the person I'm trying to convert to FARSH.
Also, it would be good to compare the algorithm with its competitors.
Well, it started as an experiment: I wanted to check whether the UHASH construction can be used to build a non-crypto hash that is both very fast and of high quality.
FARSH 0.1 reached the high-speed goal but didn't pass some SMHasher tests, so I added extra bit-mixing code from xxHash and produced FARSH 0.2. It passes all SMHasher tests and is only ~10% slower than version 0.1. So the code for 0.2 is finished, but I still have to finish updating the docs in order to release it.
Overall, this project wasn't finished, so I can hardly talk about bitstream stability. Even if I update the docs and release 0.2, I still don't have an answer to the question: is it really as high-quality as, for example, xxHash? It's possible that by adding the extra mixing in 0.2 I have just swept the dust under the carpet.
I had a plan to check the same extra-mixing approach on other lower-quality hashes such as CRC: if their SMHasher rank can also be improved this way, it would be a strong signal (against the quality of FARSH 0.2 and the usability of SMHasher overall). But I never implemented that plan either.
Anyway, little can be done to improve the quality of this construction. Regarding speed, it may be improved by 10-20%, but no more. Overall, you can employ the current code as a hash of unknown quality, but definitely better than CRC32 and faster than CRC32 in most configurations. xxHash64 is a faster way to produce a 64-bit hash result on a 64-bit pre-AVX2 CPU; in all remaining situations FARSH should be faster. The bitstream will most probably change in future versions.
Analysis of the strong and weak points of FARSH and xxHash led me to a hash construction that should be the best of both worlds: xxHash quality at FARSH speed.
What is the FARSH weak point? Each 4-byte word inside a 4 KB block is processed independently of the other words, and the results are just summed up. So the data from different words is mixed in a rather simple and predictable way, which potentially makes it easier for bit flips to remain undetected by the hash. xxHash looks stronger here: each input word is mixed into a 128-bit intermediate value whose bits continue to be mixed on every computation step. I feel more confident with such a scheme since it better differentiates the influence of different input words on the final hash value.
Also, the FARSH intermediate value is only 64 bits, and I feel that for a 32-bit final value a 128-bit intermediate value would be preferable.
The xxHash weak point is that you need to finish processing the current 128 bits before going to the next chunk. It's strongly optimized for x86 CPUs but doesn't leave room for further SIMD optimizations (aside from calculating multiple hashes simultaneously). The FARSH scheme is ideal here: you can run everything in parallel, and the overall speed is limited only by the serial (and rather slow) process of combining the 8-byte intermediate values.
So the ideal hash should split input data into fixed-size blocks (say 4 KB) and process each block independently, returning 128 bits of intermediate hash. On the next level, those 128-bit values from each block are combined with secondary hashing to produce the final result. The secondary hashing may be stronger (up to cryptohashes), since it needs to process much less data.
The overall hash speed will be limited by the speed of the secondary hashing (let's conservatively say 1 GB/s) multiplied by the primary hash reduction coefficient (4 KB / 128 bits = 256), i.e. 256 GB/s even for this conservative estimate. And by adding one more level of secondary hashing, we raise the limit to 256*256 GB/s = 64 TB/s. Of course, it's just the upper limit for infinite computation resources, but e.g. GPUs will reach this limit within the next 5 years.
Overall, the idea of multi-level hashing is well known in cryptography as tree hashing, and http://eprint.iacr.org/2009/210.pdf shows how to build a cryptohash this way. While we don't have such high goals, obeying these rules may be useful to improve our hash quality.
While 128-bit blocks are an ideal fit for SSE registers, they are sub-optimal for AVX2/AVX-512 and especially GPUs (nvidia/amd have 1024/2048-bit words, while Intel GPUs support multiple widths but 512 bits is the optimum). I think that 512-bit blocks would be the best choice, summing up all platforms.
Inside a 512-bit chunk, we should have four 128-bit streams, processed independently of each other. This means that older platforms like the i386 will process only one stream at a time and then switch to the next stream. SSE2-capable CPUs will process all 4 streams simultaneously, which provides enough ILP to hide delays. AVX2 CPUs will process 2 streams per AVX register, i.e. 8 streams from 2 blocks simultaneously in order to hide delays. And AVX-512 devices will process 4 blocks simultaneously. So any device gets enough ILP to hide operation latencies and fill all of its computation resources.
nvidia/amd GPUs will need to process 2/4 blocks in a single register, but their load engines should easily handle such non-coalescing access (or an extra warp shuffle operation). Overall, they can reach multi-terabyte/s speeds, much faster than their on-board DRAM speeds.
We haven't discussed the inner hashing loop yet, but it can be any function squeezing two 128-bit values into one: hash128 = update(hash128, data128). One possibility is the xxHash code. Another one is slightly modified FARSH code:
Code:
uint32 data[4]
uint32 hash[4] === uint64 hash64[2] === uint128 hash128
hash64[0] = (data[0]+hash[0]) * (data[1]+hash[1])
hash64[1] = (data[2]+hash[2]) * (data[3]+hash[3])
ROTATE hash128,1   // rotate the entire 128-bit register by 1 byte
This algorithm mixes the input data into all 128 hash bits, yet it's still simpler than the xxHash inner cycle - just 4 SIMD operations per 16 input bytes:
PADDD   xmm0, data       ; hash += data (four 32-bit adds)
PSHUFD  xmm1, xmm0, xxx  ; swap word pairs to line up the multiplicands
PMULUDQ xmm0, xmm1       ; two 32x32 -> 64-bit multiplies
PALIGNR xmm0, xmm0, 1    ; rotate the 128-bit register by 1 byte
So we may expect the same speed as FARSH on CPUs with SSSE3. Since AVX operations process each 128-bit lane of an AVX register independently, the same sequence will work with AVX2 too, doubling the speed compared to SSE2. AVX-512 will probably require more instructions to rotate each 128-bit lane independently. CPUs with SSE2 but no SSSE3 (Pentium 4, Pentium M, Core 1, K8..K10) require 3 operations instead of PALIGNR: PSLLDQ+PSRLDQ+POR. Pre-SSE2 CPUs will need 4 SHRD operations.
And finally, like FARSH, we can use multiple initial hash128 values to produce multiple independent 32-bit hashes (and combine them into 64+ bit hashes), and XOR the initvals with a user-supplied SEED value in order to give control to the user.
Last edited by Bulat Ziganshin; 28th June 2016 at 14:53.
I haven't done QA for FARSH, but its core looks more trustworthy than the permutation+mul+rol of xxHash, which I definitely don't consider to be of good quality.