As for locality: I'm upset about that most benchmarks in papers and presentations about sorting study bare integers sorting. Quicksort obviously shines in such benchmarks, because of the way memory prefetchers work and because radix-sort is not included in such benchmarks - why not include radix-sort where it can easily work?

What about multipartition algorithms? Most used quicksorts, mergesorts, heapsorts, etc are binary ie quicksorts have one pivot and partitions are splitted into two parts, mergesort divides arrays by two until reaching singletons and then merges exactly two merged subarrays, heapsorts always uses binary heaps, etc That's not good for cache locality as it requires more passes over input data and/ or more random memory accesses.

I was thinking about SIMD enriched integer sorting heapsort, but then I've switched to another experiments. Why use heapsort when you can use radix-sort?

Anyways, I couldn't find SIMD heapsort even though it would be easily doable - eg use 8-ary heapsort and at each step in siftDown use SSE2/ AVX/ whatever to select index of maximum element from the 8 element vector.

If you're aware of similar project to mine then please tell me. I would be extremely happy to know that. I would be even more happy if you could find a paper with similar tricks and observations as mine.

BTW:

Does anyone know some super effective way to select index of maximum element in a short vector using SSE2/ AVX/ etc? If someone gives me a complete solution then I can quickly build a complete primitive-sorting heapsort from that.