
Thread: How to micro-optimize a compressor

  1. #1
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts

    How to micro-optimize a compressor

    The really major optimisations are macro: choosing the right algorithms and implementing them correctly.

    For example, I implemented my amateur PPM using a proper tree and got thoroughly confused, because most of the descriptions of 'partial updates' and 'exclusion' implicitly assume hash-maps for high-order contexts.

    However, it seems that once you've committed yourself to an algorithm and chosen the data structures and flow to achieve it, you end up wanting it to go as fast as it can, so you have to get into micro-optimisation.

    So I thought a thread where people can share generally applicable micro-optimisation advice, or stories of particular cases, would be useful.

    Doubtless the first rule is to profile. What tools do people use to profile?

    So what advice and stories do you have to share?

  2. #2
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts

    New profiling tools for Linux

    (I'll seed the thread with some advice of my own.)

    The newest kernels (including the one in the very newest Ubuntu 10.x) ship the 'perf' profiler. It can produce nice sample-based profiles but also excellent whole-system profiles.

    The perf timechart and schedule-profiling tools are absolutely awesome if you have multiple threads or IO.
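    A typical session might look like this (the program and file names are just placeholders):

    Code:
    perf record -g ./mycompressor c input.dat out.bin   # sample with call graphs
    perf report                                         # browse the hot functions
    perf annotate                                       # drill down to hot instructions
    perf timechart record ./mycompressor c input.dat out.bin
    perf timechart                                      # writes an SVG timechart of CPU/IO/scheduling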


  3. #3
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts

    GCC 4.4

    The 4.x series of the GCC compiler is fast. Really fast. Traditionally, GCC has been labelled a slouch, but with the 4.x series it got a new vectorising engine, and this is beginning to see some serious payoffs.

    With the 4.x series, my compute-intensive applications run faster than when compiled with the Intel Compiler.

    Additionally, GCC has always had a relatively unknown trick up its sleeve. You can compile with -fprofile-generate and gather runtime statistics. Then you compile again with -fprofile-use and it'll use those statistics to produce a much faster binary. I've seen anywhere between 10 and 20% speed-up in compute-heavy code.

    Of course you might want to compile generic binaries that anyone can run. But in some cases you're happy for the code to run on the local machine only. In those cases, compile with "-march=native" and, if appropriate, the relevant -msse flags. That really makes a massive difference.
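    The whole PGO cycle is just this (binary and test-file names are placeholders; ideally the training runs cover both compression and decompression):

    Code:
    gcc -O3 -march=native -fprofile-generate -o coder coder.c
    ./coder c testfile out.bin        # training runs write *.gcda statistics
    ./coder d out.bin testfile.chk
    gcc -O3 -march=native -fprofile-use -o coder coder.c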

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    > For example, I implemented my amateur PPM using a proper
    > tree and got most confused because most of the
    > descriptions of the 'partial updates' and 'exclusion'
    > implicitly assume hash-maps for high-order contexts.

    Now that's surprising - which "most descriptions" assume that?
    Afaik at the time when PPM was in fashion there just wasn't
    enough memory for good compression with hashtables.
    Also, afair, you just somehow misunderstood what
    contexts/orders are. When you think in terms of the actual
    coding loop (try highest order - escape - try next order)
    it all imho comes naturally. Well, "partial updates" also
    imply frequency counts for the statistics, but that was also
    common for some time.

    > Doubtless the first rule is to profile.
    > What tools do people use to profile?

    Code:
    typedef unsigned long long qword;
    typedef qword (*rd)(void);
    const rd rdtsc = rd((void*)"\x0F\x31\xC3");
    That's the RDTSC cpu instruction (well, rdtsc+ret).
    I've tried using some profiling tools too
    (windows-only though, like VTune and its AMD counterpart),
    but they don't make much sense in the end with modern compilers -
    one has to avoid inlining and templates to even see comprehensible
    function names in the output.
    But rdtsc is always useful, as one can add timing to one specific
    block without messing up the whole program.
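    A minimal usage sketch, building on the definitions above (compress_block and its
    arguments are made up for the example; also note that on systems with non-executable
    data pages the string-literal trick won't run, and the compiler's __rdtsc() intrinsic
    or a bit of inline asm is the usual substitute):
    Code:
    qword t0 = rdtsc();
    compress_block( buf, len );              // the block being timed (hypothetical call)
    qword dt = rdtsc() - t0;                 // elapsed TSC ticks (~CPU clock cycles)
    printf( "compress_block: %llu cycles\n", dt );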

    > The 4.x series of the GCC compiler is fast. Really fast.
    > Traditionally, GCC has been labelled a slouch but with the
    > 4.x series it got a new vectorising engine and this is
    > beginning to see some serious payoffs.

    Well, considering vectorization, IntelC still wins afaik.
    But it's true that gcc 4.3+ generates comparable
    (sometimes slightly better, sometimes worse) code with the plain
    pentium instruction set.

    > With the 4.x series, my compute-intensive applications run
    > faster than when compiled with the Intel Compiler.

    And why do I have the impression that you didn't pay much
    attention to intelc actually, and just discarded it when
    it didn't beat gcc with default options? :)

    > Additionally, GCC has always had a relatively-unknown
    > trick up its sleeve. You can compile with
    > -fprofile-generate and gather runtime statistics. Then you
    > compile again with -fprofile-use and it'll use those
    > statistics to produce a much faster binary. I've seen
    > anywhere between 10 and 20% speed-up in compute-heavy
    > code.

    Actually it's IntelC which has "always" had it :)
    And IntelC's implementation is still more convenient to
    use because it allows collecting many samples per executable,
    doesn't go crazy with profiles from previous compiler versions,
    and doesn't require recomputing the profiles after changing
    a symbol in the source.
    Note also that the point about multiple samples is very important
    for compressors specifically, because naturally one would need
    at least two profiles there - for encoding and decoding.
    Still, I have to say that although IC's PGO is much more
    convenient to use, I've seen a much better effect from gcc's
    PGO recently - with gcc it commonly improves the performance,
    while with IC it's a matter of luck - half of the time it makes
    things worse (and gcc with PGO still doesn't necessarily win vs IC without PGO).

    Well, from my observations, it all comes down to programming style
    somehow. IC is much more aggressive with code unrolling/inlining,
    and generates different code patterns for stack frames and class fields.
    So I've noticed multiple times that programs developed (and optimized)
    for IntelC run faster with IntelC, and the same applies to gcc :)

    But why not - it's always better to have more tools :)
    Last edited by Shelwien; 12th April 2010 at 16:30.

  6. #6
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Well, I used to use a paid copy of the Intel compiler until recently, so I don't think I'd characterise myself as dismissive of it. It's just that nowadays you can get that kind of thing for free. A long time ago it was the Metrowerks compiler that rocked - around 2001 or so, iirc. Using mingw for win32 compilation instead of VC++ is, if you can live with the toolchain, a boon.

    GCC needs you to take care of alignment and use intrinsics to make much use of vectorisation, and then it works great. You have to design it in. You're basically telling it what to do, of course - it's barely more abstract than assembly. I haven't seen any great examples of auto-vectorisation in my playing with other compilers, so I'm inclined to think auto-vectorisation is a myth.

    I haven't seen good results from LLVM, speed- or size-wise - which surprises me, but maybe others have had more luck.

    An excellent site, I was going to recommend it myself: http://www.agner.org/optimize/

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    I have the same experience with LLVM - it didn't appear useful.

    And as an example of intelc auto-vectorization, I'd suggest this:
    Code:
    //  for( i=0; i<10; i++ ) result.b[i] %= 62;
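      // vecmod(a,b) subtracts b from a iff a>=b (a branchless conditional subtract),
      // so the three passes below reduce each byte mod 62 without any division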
      #define vecmod(a,b) (a-((__min(1,byte(__min(a,b)-b))-1)&b))
      for( i=0; i<16; i++ ) { 
        result.b[i] = vecmod( result.b[i], 4*62 );
        result.b[i] = vecmod( result.b[i], 2*62 );
        result.b[i] = vecmod( result.b[i], 1*62 );
      }
    It's from my cracker for Winny "trip" codes
    (there's stuff like qwerpoiu -> Iws0WmvT3N).
    I won't say that this example itself is optimal or anything -
    I just made it specifically to test IntelC -
    and here's what it generates:

    Code:
    ;;;     result.b[i] = vecmod( result.b[i], 4*62 );
    ;;;     result.b[i] = vecmod( result.b[i], 2*62 );
    ;;;     result.b[i] = vecmod( result.b[i], 1*62 );
    
            movdqa    xmm6, XMMWORD PTR _2il0floatpacket$3      
            movdqa    xmm5, XMMWORD PTR _2il0floatpacket$4      
            movdqa    xmm3, XMMWORD PTR _2il0floatpacket$5      
            movdqa    xmm0, XMMWORD PTR _2il0floatpacket$6      
    [...]
            movdqa    xmm7, xmm6                                
    
            movdqa    xmm4, XMMWORD PTR _2il0floatpacket$8      
            movdqa    xmm1, XMMWORD PTR ?result@@3UHashValue@@A 
            pminub    xmm7, xmm1                                
            paddb     xmm7, xmm5                                
            movdqa    xmm5, xmm0                                
            pminub    xmm7, xmm3                                
            paddb     xmm7, xmm2                                
            pand      xmm7, xmm6                                
            psubb     xmm1, xmm7                                
            pminub    xmm5, xmm1                                
            paddb     xmm5, XMMWORD PTR _2il0floatpacket$7      
            pminub    xmm5, xmm3                                
            paddb     xmm5, xmm2                                
            pand      xmm5, xmm0                                
            psubb     xmm1, xmm5                                
            movdqa    xmm0, XMMWORD PTR _2il0floatpacket$8      
            pminub    xmm0, xmm1                                
            paddb     xmm0, XMMWORD PTR _2il0floatpacket$9      
            pminub    xmm0, xmm3                                
            paddb     xmm0, xmm2                                
            pand      xmm0, xmm4                                
            psubb     xmm1, xmm0                                
            movdqa    XMMWORD PTR ?result@@3UHashValue@@A, xmm1
    and gcc is still lacking that afaik.
    Last edited by Shelwien; 12th April 2010 at 22:01.

  8. #8
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    g++ 4.4.0 does a bit of vector optimization. For example, the following calculation is done 4 elements at a time using SSE2 instructions:

    Code:
    static void f(int *pa, int *pb, int n) {
      for (int i=0; i<100; ++i)
        pa[i]+=pb[i]+i+n;
    }
    
    int a[100], b[100];
    
    extern "C" void g() {
      f(a, b, 3);
    }
    g++ -c a.cpp -S -O3 -march=pentium4 -fomit-frame-pointer

    Code:
            .file   "a.cpp"
            .text
    .globl _g
            .def    _g;     .scl    2;      .type   32;     .endef
    _g:
            movdqa  LC0, %xmm1
            xorl    %eax, %eax
            movdqa  LC1, %xmm3
            movdqa  LC2, %xmm2
    L2:
            movdqa  %xmm1, %xmm0
            paddd   %xmm2, %xmm0
            paddd   _a(%eax), %xmm0
            paddd   _b(%eax), %xmm0
            movdqa  %xmm0, _a(%eax)
            addl    $16, %eax
            paddd   %xmm3, %xmm1
            cmpl    $400, %eax
            jne     L2
            ret
    .globl _a
            .bss
            .align 32
    _a:
            .space 400
    .globl _b
            .align 32
    _b:
            .space 400
            .section .rdata,"dr"
            .align 16
    LC0:
            .long   0
            .long   1
            .long   2
            .long   3
            .align 16
    LC1:
            .long   4
            .long   4
            .long   4
            .long   4
            .align 16
    LC2:
            .long   3
            .long   3
            .long   3
            .long   3
    The xmm instructions require 16 byte aligned addresses. If the arrays a and b are dynamically allocated (so the compiler can't prove their alignment), it adds a lot of extra code to take care of the possibly-unaligned ends of the arrays.
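    One common workaround (just a sketch, not the actual PAQ code) is to over-allocate and round the pointer up to the next 16 byte boundary, keeping the raw pointer for free():

    Code:
    #include <stdlib.h>
    #include <stdint.h>
    
    // Returns a 16-byte-aligned pointer into a block of at least 'size' bytes.
    // '*base' receives the raw malloc() pointer, which is what must be free()d.
    void* malloc_aligned16(size_t size, void** base) {
      *base = malloc(size+15);
      if (!*base) return 0;
      return (void*)(((uintptr_t)*base+15) & ~(uintptr_t)15);
    }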
    Last edited by Matt Mahoney; 13th April 2010 at 18:29.

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    Still, there are many interesting cases which gcc doesn't handle yet.

    As to alignment, intelc has pragmas for that:
    http://www.intel.com/software/produc...gma_vector.htm
    dunno whether gcc supports these.

    Update: also http://www.intel.com/software/produc...l_nounroll.htm
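    For illustration, those pragmas sit right in front of a loop, roughly like this
    (a sketch; other compilers will just warn about the unknown pragmas, so in portable
    code they'd go behind an #ifdef __INTEL_COMPILER):
    Code:
    #pragma vector aligned     // promise that the accesses below are 16-byte aligned
    #pragma unroll(4)          // and request 4x unrolling
    for( int i=0; i<n; i++ )
      a[i] += b[i];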
    Last edited by Shelwien; 13th April 2010 at 18:49.

  10. #10
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    GCC uses attributes on the declaration; here's some SAD code:

    Code:
    #include <stdio.h>
    #include <emmintrin.h>
    #include <assert.h>
    
    typedef unsigned char block4x4_t[4*4] __attribute__((__aligned__(16)));
    typedef unsigned char block8x8_t[8*8] __attribute__((__aligned__(16)));
    typedef unsigned char block16x16_t[16*16] __attribute__((__aligned__(16)));
    
    template<typename block_t> unsigned sad(const block_t& a,const block_t& b) {
        enum { N = sizeof(block_t)/16 };
        assert(sizeof(block_t)==sizeof(__m128i[N]));
        __m128i *a_sse = (__m128i*)&a;
        __m128i *b_sse = (__m128i*)&b;
        __m128i x = _mm_set1_epi8(0);
        for(int n=0; n<N; n++)
            x += _mm_sad_epu8(a_sse[n],b_sse[n]);
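        // _mm_sad_epu8 leaves one 16-bit partial sum per 64-bit half of x;
        // shuffle the upper half's sum down and add it to the lower one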
        __m128i y = _mm_shuffle_epi32(x,_MM_SHUFFLE(1,1,1,2));
        return _mm_cvtsi128_si32(_mm_add_epi16(x,y));
    }
    
    int main() {
            block8x8_t a = {10,20,30};
            block8x8_t b = {0,10,20};
            a[30] = 10;
            a[sizeof(a)-1] = 10;
            printf("sad=%u\n",sad(a,b));
            return 1;
    }

  11. #11
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    I did some more testing to see if g++ 4.4.0 would vectorize dot_product() and train() in paq8. (options g++ -c -fomit-frame-pointer -march=pentium4 -O3 -S)

    Code:
    int dot_product(short *t, short *w, int n) {
      int sum=0;
      n=(n+7)&-8;
      for (int i=0; i<n; i+=2)
        sum+=(t[i]*w[i]+t[i+1]*w[i+1]) >> 8;
      return sum;
    }
    
    void train(short *t, short *w, int n, int err) {
      n=(n+7)&-8;
      for (int i=0; i<n; ++i) {
        int wt=w[i]+((t[i]*err*2>>16)+1>>1);
        if (wt<-32768) wt=-32768;
        if (wt>32767) wt=32767;
        w[i]=wt;
      }
    }
    It does not, but it will vectorize the following variations:

    Code:
    int dot_product(int *t, int *w, int n) {
      int sum=0;
      for (int i=0; i<n; ++i)
        sum+=t[i]*w[i]>>8;
      return sum;
    }
    
    void train(int *t, int *w, int n, int err) {
      for (int i=0; i<n; ++i)
        w[i]+=(t[i]*err*2>>16)+1>>1;
    }
    The resulting code has 15 instructions in the inner loop for dot_product() and 18 for train() if t and w are declared static or local and therefore known to be 16 byte aligned. Otherwise it is 16 and 21 instructions, plus some stuff before and after to take care of the ends. PAQ manually aligns the allocated memory but the compiler doesn't know that.

    I did not test, but this is surely slower because paq7asmsse.asm uses 6 and 10 instructions respectively, and operates on 8 16-bit elements at a time instead of 4 32-bit elements. If I leave the arrays as short, then dot_product() is vectorized but train() is not.

    These changes are not compatible with the original code. In dot_product() there is a shift after every multiplication instead of after every other one. In train() I removed the bounds checks on wt because g++ is not able to parallelize them. I'm not sure what effect this will have on compression; it might even be better.
    Switching the arrays from short to int will double the memory used by the mixer, but that is not very much. It will waste cache, however.

    This might be a bad example because the original code was designed specifically to be efficient in MMX/SSE2. For example, the bounds check in train() is built into the paddsw instruction (parallel add with 16 bit signed saturation).
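    For comparison, a hand-written intrinsics version in the spirit of paq7asmsse.asm might look roughly like this (an untested sketch; it assumes n is already a multiple of 8 and that t and w are 16 byte aligned, and it keeps the saturating clamp via paddsw):

    Code:
    #include <emmintrin.h>
    
    // sum of (t[i]*w[i]+t[i+1]*w[i+1])>>8 over n elements, 8 shorts per iteration
    static int dot_product_sse2(const short *t, const short *w, int n) {
      __m128i sum=_mm_setzero_si128();
      for (int i=0; i<n; i+=8) {
        __m128i p=_mm_madd_epi16(*(const __m128i*)&t[i], *(const __m128i*)&w[i]); // pmaddwd
        sum=_mm_add_epi32(sum, _mm_srai_epi32(p, 8));
      }
      sum=_mm_add_epi32(sum, _mm_srli_si128(sum, 8));  // fold 4 partial sums to 2
      sum=_mm_add_epi32(sum, _mm_srli_si128(sum, 4));  // and 2 to 1
      return _mm_cvtsi128_si32(sum);
    }
    
    // w[i] += (t[i]*err*2>>16)+1>>1, clamped to [-32768,32767] by the saturating add
    static void train_sse2(const short *t, short *w, int n, int err) {
      const __m128i e=_mm_set1_epi16((short)err);
      const __m128i one=_mm_set1_epi16(1);
      for (int i=0; i<n; i+=8) {
        __m128i d=_mm_adds_epi16(*(const __m128i*)&t[i], *(const __m128i*)&t[i]); // t*2, saturated
        d=_mm_mulhi_epi16(d, e);                       // *err, keep the high 16 bits (>>16)
        d=_mm_srai_epi16(_mm_adds_epi16(d, one), 1);   // +1 >> 1 (round)
        *(__m128i*)&w[i]=_mm_adds_epi16(*(const __m128i*)&w[i], d); // saturating add = clamp
      }
    }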

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    One interesting point here is that IntelC won't vectorize
    Code:
    if (wt>32767) wt=32767;
    but it would vectorize something like
    Code:
    wt = wt>32767 ? 32767 : wt;

  13. #13
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    So, generally, auto-vectorisation is turning up in GCC as well as Intel, but if it's the tightest inner loop in your app you can't really rely on the compiler yet?

    I've taken the intrinsics route.
