Their description of the permutation inverse algorithm is a little vague. I tried to implement it, but most of the time it was slower than the standard algorithm. Any thoughts?

I assume that the simple version of permutation inversion algorithm is:

for (i = 0; i < ARRAY SIZE; i++) Z[X[i]] = i;

And the cache-aware version is the version they posted.

You are telling that the simple version is faster. They are telling that their version should be fast on systems with high memory latency to processor speed ratio.

I think they only proved that their algorithm works pretty well on Pentium 4. The reason that on your processor the purportedly cache-aware algorithm is slower is that in modern high end desktop processors there is a write queue which can handle multiple writes in parallel, hiding memory latency.

I assume that the simple version of permutation inversion algorithm is:

That's right.

This was the guess I made at what it should look like, late last night. It's not clear how it should speed things up. Surprisingly, it was faster in one case. But it was much slower in every other case.

#define CACHE_SIZE (1024*1024)
#define BLOCK_LENGTH (CACHE_SIZE/2/sizeof(int))
void perm_inverse(int *perm, int *inv, int n)
{
int number_of_blocks = (n / BLOCK_LENGTH) + 1;
int *D = (int *)malloc(sizeof(int) * n);
int *D_prime = (int *)malloc(sizeof(int) * n);
int *D_ptr = (int *)malloc(sizeof(int) * number_of_blocks);

The second paper describes the inverse algorithm in a bit more detail.

Based on what I know about cpus, the part of the standard permutation inversion that would hurt performance is doing random stores of partial cache lines. The cpu has to read the old cache line, update it, then write it back out.

It turns out that modern CPUs are doing much better on random writes than on random reads. Unsuprisingly, random reads with added prefetching are even faster (much faster).

The simple permutation inversion algorithm they posted is a variant that stresses the random writes performance. You can modify it to make it stressing the random read performance and then add prefetch queue.

(Or maybe I messed up something and it cannot be modified as such)

It turns out that modern CPUs are doing much better on random writes than on random reads. Unsuprisingly, random reads with added prefetching are even faster (much faster).

The simple permutation inversion algorithm they posted is a variant that stresses the random writes performance. You can modify it to make it stressing the random read performance and then add prefetch queue.

(Or maybe I messed up something and it cannot be modified as such)

As far as I know, that is the unique simple algorithm for inverting a permutation.

I didn't have my hopes too high for finding a faster way, but there are always surprises.

It's possible that playing with the parameters in the code I posted could make a difference. They didn't give very specific guidance about how to set CACHE_SIZE, at least not for a modern cpu, so I picked a round number that's in between the L2 and L3 size for my cpu.

I tried to devise a random read oriented permutation inversion but didn't come up with anything. Probably you're right and there's no simple alternative.

Here's my Java version of proposed algorithm with some additional tweaks. Non-zero gap is crucial for performance, at least for random inputs.

Code:

package p;
import java.util.Arrays;
import java.util.Random;
public class Main6 {
public static void main(String[] args) {
Random random = new Random(0);
final int Size = 34567890; // problem size
final int Gap = 333; // tune that
final int BucketSizeLog = 16; // and that
final int BucketSize = 1 << BucketSizeLog;
final int Buckets = 1 + Size / BucketSize;
final int GapsSum = Gap * Buckets;
final int SizeWithGaps = Size + GapsSum;
final int[] X = new int[Size];
final int[] Y = new int[Size];
final int[] B = new int[SizeWithGaps * 2];
final int[] P = new int[Buckets];
for (int i = 0; i < Size; i++) {
X[i] = i;
}
// generating permutation
for (int i = 0; i < Size; i++) {
final int j = i + random.nextInt(Size - i);
final int old = X[j];
X[j] = X[i];
X[i] = old;
}
// simple algo
{
final long time = System.currentTimeMillis();
for (int i = 0; i < Size; i++) {
Y[X[i]] = i;
}
System.out.println("Time was: "
+ (System.currentTimeMillis() - time));
}
// fancy algo
{
Arrays.fill(Y, 0);
final long time = System.currentTimeMillis();
for (int i = 0; i < Buckets; i++) {
P[i] = i * (BucketSize + Gap) * 2;
}
for (int i = 0; i < Size; i++) {
final int bucket = X[i] >> BucketSizeLog;
B[P[bucket]] = i;
B[P[bucket] + 1] = X[i];
P[bucket] += 2;
}
for (int b = 0; b < Buckets; b++) {
for (int i = b * (BucketSize + Gap) * 2; i < P[b]; i += 2) {
Y[B[i + 1]] = B[i];
}
}
System.out.println("Time was: "
+ (System.currentTimeMillis() - time));
boolean good = true;
for (int i = 0; i < Size; i++) {
good &= Y[X[i]] == i;
}
System.out.println(good ? "Correct" : "Broken");
}
}
}

The output on my work laptop is:

Code:

run:
Time was: 566
Time was: 455
Correct
BUILD SUCCESSFUL (total time: 5 seconds)

Ie the fancy algorithm wins with that settings under installed JVM.

Difference under C/ C++ should be higher as they don't check for array boundaries.

Here's output from Sun Java 6 Update 24 with -server switch:

Code:

Time was: 547
Time was: 288
Correct

I'll try to test further when I get home.

Last edited by Piotr Tarsa; 17th July 2013 at 20:57.

Nice! I transplanted the algorithm into mk_bwts, and it's consistently faster on real data. I haven't yet gotten around to giving much thought to what it's doing. I just translated and timed.

Thanks.

Here is the test code you posted, translated into C: