Thread: Faster permutation inverse algorithm?

  1. #1
    nburns (Member, San Diego)

    Faster permutation inverse algorithm?

    I'd like to find a faster algorithm to get the inverse suffix array from the suffix array. I found these papers:

    http://www.sigsam.org/bulletin/artic.../cooperman.pdf
    http://www.ccs.neu.edu/home/kunkle/p...ci-issac10.pdf

    Their description of the permutation inverse algorithm is a little vague. I tried to implement it, but most of the time it was slower than the standard algorithm. Any thoughts?

  2. #2
    Piotr Tarsa (Member, Kraków, Poland)
    Regarding the first paper.

    I assume that the simple version of the permutation inversion algorithm is:
    for (i = 0; i < ARRAY_SIZE; i++) Z[X[i]] = i;
    and the cache-aware version is the one they posted.

    You say that the simple version is faster. They claim their version should be fast on systems with a high ratio of memory latency to processor speed.

    I think they only proved that their algorithm works pretty well on a Pentium 4. The reason the purportedly cache-aware algorithm is slower on your processor is that modern high-end desktop processors have a write queue that can handle multiple writes in parallel, hiding memory latency.

    BTW what are your results?

  3. #3
    nburns (Member, San Diego)
    Quote Originally Posted by Piotr Tarsa View Post
    I assume that the simple version of the permutation inversion algorithm is:
    That's right.

    This was the guess I made late last night at what it should look like. It's not clear how it's supposed to speed things up. Surprisingly, it was faster in one case, but it was much slower in every other case.


    Code:
    #include <stdio.h>
    #include <stdlib.h>

    #define CACHE_SIZE (1024*1024)
    #define BLOCK_LENGTH (CACHE_SIZE/2/sizeof(int))

    void perm_inverse(int *perm, int *inv, int n)
    {
        int number_of_blocks = (n / BLOCK_LENGTH) + 1;
        int *D = (int *)malloc(sizeof(int) * n);
        int *D_prime = (int *)malloc(sizeof(int) * n);
        int *D_ptr = (int *)malloc(sizeof(int) * number_of_blocks);
        int i;

        //printf("n = %d, BLOCK_LENGTH = %lu, number_of_blocks = %d\n", n, BLOCK_LENGTH, number_of_blocks);

        if (!D || !D_prime || !D_ptr) {
            fprintf(stderr, "Malloc failed!\n");
            exit(1);
        }

        for (i = 0; i < number_of_blocks; i++) {
            D_ptr[i] = i * BLOCK_LENGTH;
        }
        // scatter each (perm[i], i) pair into its block; writes within a
        // block are sequential, so they should be cache-friendly
        for (i = 0; i < n; i++) {
            int block_num = perm[i] / BLOCK_LENGTH;
            D[D_ptr[block_num]] = perm[i];
            D_prime[D_ptr[block_num]] = i;
            D_ptr[block_num]++;
        }
        // sequential read of D/D_prime; each write to inv lands inside one
        // cache-sized block
        for (i = 0; i < n; i++) {
            inv[D[i]] = D_prime[i];
        }
        free(D);
        free(D_prime);
        free(D_ptr);
    }


    The second paper describes the inverse algorithm in a bit more detail.

    Based on what I know about CPUs, the part of the standard permutation inversion that hurts performance is the random stores to partial cache lines: the CPU has to read in the old cache line, update it, then write it back out.
    Last edited by nburns; 16th July 2013 at 20:37.

  4. #4
    Piotr Tarsa (Member, Kraków, Poland)
    I once measured the speed of random reads and random writes. The results are here:
    http://encode.su/threads/1285-Random...-random-writes

    It turns out that modern CPUs do much better on random writes than on random reads. Unsurprisingly, random reads with added prefetching are faster still (much faster).

    The simple permutation inversion algorithm stresses random-write performance. You could modify it to stress random-read performance instead, and then add a prefetch queue.

    (Or maybe I messed something up and it cannot be modified that way.)
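
    To illustrate the read-plus-prefetch pattern from that measurement thread, here is a minimal C sketch (a rough benchmark skeleton, not the inversion itself; it assumes GCC/Clang's __builtin_prefetch, and the array size and prefetch distance are illustrative, not tuned):

    ```c
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 20)
    #define PREFETCH_DISTANCE 16  /* how far ahead to prefetch; tune per machine */

    int main(void)
    {
        int *idx = (int *)malloc(sizeof(int) * N);   /* random read order */
        int *data = (int *)malloc(sizeof(int) * N);
        long long plain = 0, prefetched = 0;
        int i;

        if (!idx || !data) return 1;
        srandom(0);
        for (i = 0; i < N; i++) {
            idx[i] = i;
            data[i] = i;
        }
        /* Fisher-Yates shuffle, as in the test code, to randomize the read order */
        for (i = 0; i < N; i++) {
            int j = i + (int)(random() % (N - i));
            int t = idx[j]; idx[j] = idx[i]; idx[i] = t;
        }

        /* plain random reads */
        for (i = 0; i < N; i++)
            plain += data[idx[i]];

        /* same reads, but prefetch the line we'll need a few iterations ahead */
        for (i = 0; i < N; i++) {
            if (i + PREFETCH_DISTANCE < N)
                __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 0);
            prefetched += data[idx[i]];
        }

        puts(plain == prefetched ? "sums match" : "mismatch");
        free(idx);
        free(data);
        return 0;
    }
    ```

    Timing each loop separately (e.g. with clock()) on a large enough N should show the prefetch effect, assuming the hardware prefetcher can't already predict the access pattern.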

  5. #5
    nburns (Member, San Diego)
    Quote Originally Posted by Piotr Tarsa View Post
    I once measured the speed of random reads and random writes. The results are here:
    http://encode.su/threads/1285-Random...-random-writes

    It turns out that modern CPUs do much better on random writes than on random reads. Unsurprisingly, random reads with added prefetching are faster still (much faster).

    The simple permutation inversion algorithm stresses random-write performance. You could modify it to stress random-read performance instead, and then add a prefetch queue.

    (Or maybe I messed something up and it cannot be modified that way.)
    As far as I know, that is the only simple algorithm for inverting a permutation.

    I didn't have my hopes too high for finding a faster way, but there are always surprises.

    It's possible that playing with the parameters in the code I posted could make a difference. They didn't give very specific guidance on how to set CACHE_SIZE, at least not for a modern CPU, so I picked a round number between my CPU's L2 and L3 sizes.
    Last edited by nburns; 17th July 2013 at 07:21.

  6. #6
    Piotr Tarsa (Member, Kraków, Poland)
    I tried to devise a random-read-oriented permutation inversion but didn't come up with anything. You're probably right that there's no simple alternative.

    Here's my Java version of the proposed algorithm with some additional tweaks. A non-zero gap is crucial for performance, at least for random inputs.

    Code:
    package p;
    
    import java.util.Arrays;
    import java.util.Random;
    
    public class Main6 {
    
        public static void main(String[] args) {
            Random random = new Random(0);
            final int Size = 34567890; // problem size
            final int Gap = 333; // tune that
            final int BucketSizeLog = 16; // and that
            final int BucketSize = 1 << BucketSizeLog;
            final int Buckets = 1 + Size / BucketSize;
            final int GapsSum = Gap * Buckets;
            final int SizeWithGaps = Size + GapsSum;
            final int[] X = new int[Size];
            final int[] Y = new int[Size];
            final int[] B = new int[SizeWithGaps * 2];
            final int[] P = new int[Buckets];
            for (int i = 0; i < Size; i++) {
                X[i] = i;
            }
            // generating permutation
            for (int i = 0; i < Size; i++) {
                final int j = i + random.nextInt(Size - i);
                final int old = X[j];
                X[j] = X[i];
                X[i] = old;
            }
            // simple algo
            {
                final long time = System.currentTimeMillis();
                for (int i = 0; i < Size; i++) {
                    Y[X[i]] = i;
                }
                System.out.println("Time was: "
                        + (System.currentTimeMillis() - time));
            }
            // fancy algo
            {
                Arrays.fill(Y, 0);
                final long time = System.currentTimeMillis();
                for (int i = 0; i < Buckets; i++) {
                    P[i] = i * (BucketSize + Gap) * 2;
                }
                for (int i = 0; i < Size; i++) {
                    final int bucket = X[i] >> BucketSizeLog;
                    B[P[bucket]] = i;
                    B[P[bucket] + 1] = X[i];
                    P[bucket] += 2;
                }
                for (int b = 0; b < Buckets; b++) {
                    for (int i = b * (BucketSize + Gap) * 2; i < P[b]; i += 2) {
                        Y[B[i + 1]] = B[i];
                    }
                }
                System.out.println("Time was: "
                        + (System.currentTimeMillis() - time));
                boolean good = true;
                for (int i = 0; i < Size; i++) {
                    good &= Y[X[i]] == i;
                }
                System.out.println(good ? "Correct" : "Broken");
            }
        }
    }
    The output on my work laptop is:
    Code:
    run:
    Time was: 566
    Time was: 455
    Correct
    BUILD SUCCESSFUL (total time: 5 seconds)
    I.e. the fancy algorithm wins with these settings under the installed JVM.

    The difference under C/C++ should be larger, since those languages don't bounds-check array accesses.

    Here's output from Sun Java 6 Update 24 with -server switch:
    Code:
    Time was: 547
    Time was: 288
    Correct
    I'll try to test further when I get home.
    Last edited by Piotr Tarsa; 17th July 2013 at 20:57.

  7. Thanks:

    nburns (21st July 2013)

  8. #7
    nburns (Member, San Diego)
    Thanks. I'll give that a try. It's probably worth testing on actual suffix arrays, as they could have different properties than random permutations.
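
    To test on something suffix-array-shaped without pulling in a real suffix sorter, a toy sketch along these lines might do (a naive qsort-over-suffixes construction, only sensible for tiny inputs; the "mississippi" example is just for illustration):

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static const char *text;  /* shared with the qsort comparator */

    /* compare two suffixes of text by their starting positions */
    static int cmp_suffix(const void *a, const void *b)
    {
        return strcmp(text + *(const int *)a, text + *(const int *)b);
    }

    int main(void)
    {
        text = "mississippi";
        int n = (int)strlen(text);
        int sa[64], isa[64];
        int i, ok = 1;

        /* naive O(n^2 log n) suffix array: sort suffix start positions */
        for (i = 0; i < n; i++) sa[i] = i;
        qsort(sa, n, sizeof(int), cmp_suffix);

        /* the simple inversion from this thread */
        for (i = 0; i < n; i++) isa[sa[i]] = i;

        /* sanity check: isa really is the inverse permutation of sa */
        for (i = 0; i < n; i++) ok &= sa[isa[i]] == i;
        puts(ok ? "inverse ok" : "broken");
        return 0;
    }
    ```

    Swapping the simple inversion loop for the bucketed one (and using a large real input) would give the actual comparison; the naive construction here is only to produce a genuine suffix array to invert.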

  9. #8
    nburns (Member, San Diego)
    Nice! I transplanted the algorithm into mk_bwts, and it's consistently faster on real data. I haven't yet gotten around to giving much thought to what it's doing. I just translated and timed.

    Thanks.

    Here is the test code you posted, translated into C:



    Code:
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <time.h>

    int main(int argc, char *argv[])
    {
        srandom(time(NULL));
        const int Size = 34567890; // problem size
        const int Gap = 333; // tune that
        const int BucketSizeLog = 16; // and that
        const int BucketSize = 1 << BucketSizeLog;
        const int Buckets = 1 + Size / BucketSize;
        const int GapsSum = Gap * Buckets;
        const int SizeWithGaps = Size + GapsSum;
        int *X = (int *)calloc(Size, sizeof(int));
        int *Y = (int *)calloc(Size, sizeof(int));
        int *B = (int *)calloc(SizeWithGaps * 2, sizeof(int));
        int *P = (int *)calloc(Buckets, sizeof(int));
        int i, b;
        for (i = 0; i < Size; i++) {
            X[i] = i;
        }
        // generating permutation
        for (i = 0; i < Size; i++) {
            int j = i + (random() % (Size - i));
            int old = X[j];
            X[j] = X[i];
            X[i] = old;
        }
        // simple algo
        {
            clock_t time = clock();
            for (i = 0; i < Size; i++) {
                Y[X[i]] = i;
            }
            printf("Basic algorithm time was: %0.2f ms\n", (double)(clock() - time) * 1000.0 / (double)CLOCKS_PER_SEC);
        }
        // fancy algo
        {
            memset(Y, 0, sizeof(int) * Size);
            clock_t time = clock();
            for (i = 0; i < Buckets; i++) {
                P[i] = i * (BucketSize + Gap) * 2;
            }
            for (i = 0; i < Size; i++) {
                int bucket = X[i] >> BucketSizeLog;
                B[P[bucket]] = i;
                B[P[bucket] + 1] = X[i];
                P[bucket] += 2;
            }
            for (b = 0; b < Buckets; b++) {
                for (i = b * (BucketSize + Gap) * 2; i < P[b]; i += 2) {
                    Y[B[i + 1]] = B[i];
                }
            }
            printf("Cache-optimized algorithm time was: %0.2f ms\n", (double)(clock() - time) * 1000.0 / (double)CLOCKS_PER_SEC);
            int good = 1;
            for (i = 0; i < Size; i++) {
                good &= Y[X[i]] == i;
            }
            puts(good ? "Correct" : "Broken");
        }
        return 0;
    }

