
Thread: Code Optimisation

  1. #1
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts

    Code Optimisation

    Hello

    I was looking yesterday at a simple piece of code, trying to improve its speed where I could.
    Because I was getting some pretty weird results, I finally tracked the strange behavior down to the following line of C code:

    Code:
    while (in < iend) Stats[*in++]++;
    Looks simple enough.
    It simply goes through the data buffer, gathering statistics on characters. You end up with a simple table containing the number of occurrences of each character.

    But from a speed perspective, there is a strange, unexpected behavior:
    the speed varies depending on the source distribution.

    Using this code on an erratic source, such as literals (or any binary file), I get a speed beyond 1100 MB/s (up to 1200 MB/s for already compressed files). This sounds nice.

    Using this code on pre-sorted or nearly sorted data (such as match lengths, for example), the speed goes down to as little as 650 MB/s. That's twice as slow! So it is definitely noticeable.

    It *seems* that if one character is much more frequent than the others, the counting becomes slower.
    This is especially true with Shelwien's "loong" test file, which is full of space characters. In this specific instance, the scanning speed goes below the 500 MB/s mark.

    But there is nothing in this line of code that should make any difference, whatever the distribution of characters.

    So, what could the issue be then?
    Could it be related to some hardware CPU cache behavior? (For example, is updating the same memory address all the time slower than updating different ones?)
    FYI, I'm using a Core 2 Duo E8300 on the test system.
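
    For reference, here is a minimal standalone harness that reproduces the two cases (a sketch only, with illustrative buffer size and timing, not the exact code I'm measuring):

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        size_t len = 64u << 20;                  /* 64 MB test buffer */
        unsigned char* in = (unsigned char*)malloc(len);
        unsigned Stats[256] = {0};

        /* erratic source: random bytes; for the skewed case, fill with one value instead */
        for (size_t i = 0; i < len; i++)
            in[i] = (unsigned char)rand();       /* or: in[i] = ' '; */

        int passes = 10;
        clock_t t0 = clock();
        for (int r = 0; r < passes; r++) {
            const unsigned char* p    = in;
            const unsigned char* iend = in + len;
            while (p < iend) Stats[*p++]++;
        }
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("%.0f MB/s (Stats[32]=%u)\n", passes * (len / 1e6) / sec, Stats[32]);
        free(in);
        return 0;
    }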

    Quite a difficult question...

    Regards

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    1. It really sounds like a cache problem, but
    it's not certain unless we look at the actual code generated for that loop -
    who knows what the compiler produces there.

    But there are some optimizations applicable even without asm:

    Code:
    while( in<iend ) Stats[*in++]++;
    Code:
    // assumes iend is the buffer length here (not an end pointer),
    // and that i is wide enough to index the whole buffer
    byte* restrict p = &in[iend];
    uint i = -iend-1;
    while( ++i ) ++Stats[p[i]];
    Post-increments are sometimes bad for optimization because
    they imply the allocation of a temp variable:

    Code:
    Stats[*in++]++;
    
    =
    
    c = *in; in = in+1;
    d = Stats[c]; Stats[c] = d+1;
    http://en.wikipedia.org/wiki/Restrict
    Also, pointers without the "restrict" (or __restrict in gcc's case)
    keyword are even worse, because the compiler doesn't normally know
    where they point, so it has to take into account the chance
    that in[] overlaps Stats[] here (maybe not for static arrays,
    but that can certainly happen with pointers).

    And the last point is that testing for zero doesn't require any
    additional instructions, while in<iend does.
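
    For reference, a complete little function along these lines (a sketch only; it assumes the caller passes the buffer length rather than an end pointer, and names are illustrative):

    Code:
    #include <stdint.h>
    #include <stddef.h>

    void count_stats( const uint8_t* restrict in, size_t len,
                      uint32_t* restrict Stats )
    {
        const uint8_t* restrict p = in + len;   // one past the end
        ptrdiff_t i = -(ptrdiff_t)len;          // negative offset into p
        for( ; i != 0; ++i )                    // loop test is "reached zero"
            ++Stats[ p[i] ];
    }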

    2. It's still possible that my suggestions won't remove the weird
    behavior you described. Well, it's plausible that reading a cached
    value immediately after writing it could be bad - the CPU has
    to wait at least for the L1 to complete its work, whereas when
    the addresses are different, it can overlap some of the operations.

    Then, it can be a good idea to try to unroll it a bit:

    Code:
    // Stats becomes a [256][4] table here whose four columns get summed
    // at the end; assumes i starts at -(length+4) and the length is a
    // multiple of 4
    while( i+=4 ) {
      ++Stats[p[i+0]][0];
      ++Stats[p[i+1]][1];
      ++Stats[p[i+2]][2];
      ++Stats[p[i+3]][3];
    }

    IntelC might even vectorize it automatically in this case.
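
    A self-contained version of the same idea, with the final merge included (a sketch; it handles only lengths that are a multiple of 4, and the names are illustrative):

    Code:
    #include <stdint.h>
    #include <stddef.h>

    void count_stats4( const uint8_t* restrict in, size_t len,
                       uint32_t Stats[256] )
    {
        uint32_t tmp[256][4] = {{0}};           // four interleaved counters per symbol
        const uint8_t* restrict p = in + len;
        ptrdiff_t i = -(ptrdiff_t)len;          // assumes len % 4 == 0
        for( ; i != 0; i += 4 ) {
            ++tmp[p[i+0]][0];
            ++tmp[p[i+1]][1];
            ++tmp[p[i+2]][2];
            ++tmp[p[i+3]][3];
        }
        for( int s = 0; s < 256; s++ )          // merge the four columns
            Stats[s] = tmp[s][0] + tmp[s][1] + tmp[s][2] + tmp[s][3];
    }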

  3. #3
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Shelwien:
    Better to use four different arrays for counting and, at the end, compute one final table of frequencies. This way there will be fewer cache collisions.
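
    Roughly like this (a sketch with assumed names; unlike the interleaved [256][4] layout above, each counter array here is its own 1 KB block, and leftover tail bytes go into the first table):

    Code:
    #include <stdint.h>
    #include <stddef.h>

    void count_stats_4tables( const uint8_t* in, size_t len, uint32_t Stats[256] )
    {
        uint32_t s0[256] = {0}, s1[256] = {0}, s2[256] = {0}, s3[256] = {0};
        size_t i = 0;
        for( ; i + 4 <= len; i += 4 ) {         // four independent tables
            s0[in[i+0]]++;
            s1[in[i+1]]++;
            s2[in[i+2]]++;
            s3[in[i+3]]++;
        }
        for( ; i < len; i++ )                   // leftover tail bytes
            s0[in[i]]++;
        for( int s = 0; s < 256; s++ )          // final table of frequencies
            Stats[s] = s0[s] + s1[s] + s2[s] + s3[s];
    }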

  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    Maybe you're right; I guess I was too scared of cache set collisions,
    though that would probably only be relevant with more elements -
    there's a very limited number of cache lines available for the
    same page offset (4k pages on 32-bit systems), like 2 sometimes.
    So normally, having multiple structures with the same layout
    and alignment is really bad.

    But there would also be other cache effects (like, in my example,
    a long run of the same symbol always working with the same cache line),
    so we won't really know whether you're right until a practical test :)

  5. #5
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts
    Thanks for the ideas, I will try some of them tonight.

    For the time being, I've just tried filling 2 different tables and summing them at the end.

    Code:
    // note: this consumes two bytes per loop test, so it assumes an even-length buffer
    while( in<iend ) { Stats1[*in++]++; Stats2[*in++]++; }
    It works well, although there is still some weird performance behavior to tackle.

    Some figures:
    1 table as a global variable: 650 MB/s
    1 table as a local (heap) variable: 650 MB/s
    2 tables as global variables: 350 MB/s
    2 tables as local (heap) variables: 900 MB/s
    1 global + 1 local: 1100 MB/s (at last...)

    So I've got a workaround, but this is nonetheless pretty interesting behavior. I wouldn't have expected such differences before testing, and they are large enough to deserve some attention.

    Maybe a good exercise for some cache structure behavior explanations...
    Last edited by Cyan; 11th January 2010 at 19:45.

  6. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    It's probably an effect of the table alignment changing.
    You can use syntax like
    __declspec(align(32)) uint Stats[256]; // MSC / IntelC
    or
    __attribute__((aligned(32))) uint Stats[256]; // gcc
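
    If the same source has to build with both compilers, the two spellings can be hidden behind a small macro (a sketch; the _MSC_VER check is just the usual convention):

    Code:
    #if defined(_MSC_VER)
      #define ALIGNED(n) __declspec(align(n))
    #else
      #define ALIGNED(n) __attribute__((aligned(n)))
    #endif

    ALIGNED(32) static unsigned Stats1[256];   // both tables get the same alignment
    ALIGNED(32) static unsigned Stats2[256];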

  7. #7
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts
    Some feedback on the various optimizations suggested in this thread:

    1) Restrict & pre-increment
    I tested the proposed code variation but it did not change anything. So I guess the compiler is already doing a pretty fine job of translating the initial line of code, and makes all these optimisations by itself.

    2) Unroll into a 2-dimensional table
    This looked like a good idea. In the end, I got the same result as with 2 tables as local variables, around ~900 MB/s, even when the 2-dimensional table was declared as a global variable. This is still less optimal than 1 local table and 1 global one, though.

    3) Aligned declaration
    This helped improve the performance of 2 tables as global variables from 350 MB/s to 900 MB/s, i.e. the same speed as local variables.
    Once again, less optimal than 2 different tables.

    So I still don't know why having 1 table as local & 1 table as global provides the best results so far.
    One thing that comes to mind is that the 2 different tables are probably separated by a much larger distance than 2 tables declared together, either global or local.
    But it's unclear why that would provide such a 20% improvement.
    Maybe the "restrict" property is automatically applied in such circumstances?

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    > 1) Restrict & pre-increment
    > I tested the proposed code variation but it did not change
    > anything. So I guess the compiler is already doing a
    > pretty fine job of translating the initial line of code,
    > and makes all these optimisations by itself.

    It's still uncertain... why don't you post the asm code for that loop?
    For gcc, "g++ -S -masm=intel 1.cpp" would produce 1.s with your code.

    > 3) Aligned declaration
    > This helped improve the performance of 2 tables as
    > global variables from 350 MB/s to 900 MB/s, i.e. the same speed
    > as local variables.
    > Once again, less optimal than 2 different tables.

    They are not really any different... just print the addresses
    from your "Local and Global" version, and define global tables
    with the same alignment.

    To be specific, if you get:
    pa=0x403010
    pb=0x23C8C0
    it can be done like this:

    Code:
    // C++ example: put two tables inside one big buffer so that they get
    // the same page offsets as the addresses printed by the fastest version
    #include <stdio.h>
    enum {
      pa=0x403010,  // address printed for the first table
      pb=0x23C8C0   // address printed for the second table
    };
    //__declspec(align(4096))
    //__attribute__(( aligned(4096) ))
    char data[0x10000];
    int main( void ) {
      // round "data" up to the next 4K page boundary
      char* p = &data[ 0xFFF - (unsigned(data+0xFFF) & 0xFFF) ];
      // reuse the page offsets of pa and pb, two pages apart
      int (&stats0)[256] = (int(&)[256])p[pa&0xFFF];
      int (&stats1)[256] = (int(&)[256])p[0x2000+(pb&0xFFF)];
      printf( "p=%08X\n", p );
      printf( "%08X -> %08X\n", pa, stats0 );
      printf( "%08X -> %08X\n", pb, stats1 );
    }
    ...Unfortunately it's troublesome to set a fixed page alignment in gcc...

    > So I still don't know why having 1 table as local & 1
    > table as global provides the best results so far.

    It's all about alignment. The page (4k/8k) offset and the line (16/32/64 byte) offset
    both matter for caching.
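
    For example, something like this prints both offsets for a given table (4 KB pages and 64-byte lines are assumptions for this CPU):

    Code:
    #include <stdio.h>
    #include <stdint.h>

    static unsigned Stats[256];

    int main( void ) {
        uintptr_t a = (uintptr_t)Stats;
        printf( "page offset %03X, line offset %02X\n",
                (unsigned)(a & 0xFFF), (unsigned)(a & 0x3F) );
        return 0;
    }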

    > But it's unclear why that would provide such a 20% improvement.
    > Maybe the "restrict" property is automatically applied
    > in such circumstances?

    It's possible, I guess. To make sure, we can use the quoted alignment example
    (with the addresses from your fastest version) and only use "restricted"
    pointers in the loop, instead of arrays.

  9. #9
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Shelwien:
    Take note that the cache can be used to store a single contiguous block. In that case we can have tables of any form that fit in a 16 KB contiguous block and have no cache collisions, i.e. the entire cache would be fully and optimally utilized.

    BTW:
    The following Java code displays values around 1200 (milliseconds):
    Code:
    package Test;
    
    import java.util.Arrays;
    import java.util.Random;
    
    public class Test {
    
        byte[] input = new byte[1024*1024*1024];
        long[] stats = new long[256];
        long time;
    
        void test(boolean testRandom) {
            if (testRandom) {
                Random random = new Random();
                random.nextBytes(input);
            } else {
                Arrays.fill(input, (byte)67);
            }
            time = System.currentTimeMillis();
            Arrays.fill(stats, 0);
            for (byte b : input) {
                stats[((int)b) & 0xFF]++;
            }
            System.out.println(System.currentTimeMillis() - time);
        }
    
        public static void main(String[] args) {
            Test test = new Test();
            test.test(true);
        }
    }
    Ran on 64-bit Ubuntu Karmic. When the array is filled with a single value, it displays around 2250.
    Last edited by Piotr Tarsa; 13th January 2010 at 19:21.

  10. #10
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts
    btw, Piotr, is Shindlet still downloadable from anywhere?
    Shindlet seems superior to other depth-0 entropy coders in both compression ratio and speed, and I'm willing to test it.
    But the link from Matt's LTCB seems dead.

  11. #11
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Cyan:
    It was written in 100% pure assembly code, so the speed mainly comes from the language choice.

    http://board.flatassembler.net/topic.php?t=3766 - here I've got some improvement ideas. The last attachment is probably the same version that Matt tested. Don't hotlink it, mirror it.

    http://asembler.republika.pl/programy.html - here are some other programs written by me; among others, there is a graphical Huffman coder with selectable block size.

  12. #12
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts
    Thanks, Piotr!
    Indeed, the linked discussion needs some serious ASM skills to follow properly.

    I've updated Shindlet results on my entropy coder page.

  13. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    What about
    http://ctxmodel.net/files/fpaq0pv4b3.rar
    http://ctxmodel.net/files/newbrc/newbrc_0.rar

    Edit: I forgot, the second one is an order-(-1) coder %)
    Last edited by Shelwien; 16th January 2010 at 18:30.

  14. #14
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts
    Hi Eugene

    Regarding fpaq0:
    I'm just a bit puzzled by Matt's description of fpaq0; I wonder if it really is "order-0".
    It is described as a bit-oriented range coder with an 8-bit history, which would provide a context advantage over true order-0 byte-oriented encoders (while in the middle of a character, the context encompasses part of the previous character).
    But maybe I've not fully understood.

    The second question is about versions: fpaq0pv4b3 has so many compiled versions that I'm a bit lost as to which one to select.
    In the end, helped by the table provided, I settled on fpaq0pv4Bcls_O3_xi_ipo. Don't hesitate to point to a preferred version should you wish.

    fpaq0pv4b3 is fast (in the range of 30 MB/s) and compresses better than other order-0 range coders (enwik8: 1.631 / enwk8-lz4-ml: 2.965).


    Regarding newbrc0:
    It is reasonably fast (20 MB/s), but hardly compresses anything (enwik8: 1.004 / enwk8-lz4-ml: 1.073).
    Its results are comparable to some other "true order-0" (i.e. no context at all) bit-oriented range coders, so if this assumption is correct,
    the issue is that I'm not using test files suitable for such an encoder (a sequence of flags would be much more appropriate), hence a disastrous comparison.
    Here again, correct me if I've not properly understood.

    Edit: just noticed your edit. Actually, I'm not sure I understand what an order-(-1) encoder is.

    Regards
    Last edited by Cyan; 16th January 2010 at 19:15.

  15. #15
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    > I'm just a bit puzzled by Matt's description of fpaq0; I
    > wonder if it really is "order-0".

    There are some that are not order-0, mainly fpaq0f*.
    But the fpaq0p derivatives, including fpaq0pv4B, are bitwise coders
    equivalent to a bytewise order-0.

    > In the end, helped by the table provided, I settled on
    > fpaq0pv4Bcls_O3_xi_ipo. Don't hesitate to point to a
    > preferred version should you wish.

    It's ok, the differences are really small anyway.

    > Regarding newbrc0:
    [...]
    > Edit: just noticed your edit. Actually, I'm not sure I
    > understand what an order-(-1) encoder is.

    It's the common PPM notation for the context to which PPM
    "escapes" from order-0, i.e. some static probability distribution.

    Well, here I patched it to use the same model as fpaq0p:
    http://www.ctxmodel.net/files/newbrc/newbrc_1.rar

  16. #16
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    IIRC, order -1 is an order where you assign equal probabilities to all symbols except the escape symbol. So it can be very useful, e.g. for DNA encoding.

    Cyan:
    Which variation of Shindlet was the fastest? I've looked into the sources and noticed that I maintained symbol counts over a fixed-size window, i.e. when encoding a new symbol I increased the count of the current symbol and decreased the count of the symbol that was leaving the window.
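
    Roughly like this (a sketch with assumed names; the window size is just illustrative):

    Code:
    #include <stddef.h>

    #define WINDOW 4096   // illustrative window size

    void window_counts( const unsigned char* in, size_t len )
    {
        unsigned count[256] = {0};   // always reflects the last WINDOW bytes
        for( size_t i = 0; i < len; i++ ) {
            if( i >= WINDOW )
                count[in[i - WINDOW]]--;   // symbol leaving the window
            count[in[i]]++;                // symbol entering the window
            // ...the coder would use `count` to model in[i] here...
        }
    }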

  17. #17
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts
    Thanks for the explanations, Eugene and Piotr.

    > Well, here I patched it to use the same model as fpaq0p:

    This completely changes the picture.
    Now newbrc1 compresses better than any other order-0 algorithm tested so far, including fpaq0pv4b3 (albeit by a small margin).
    Speed-wise, this has a cost though, as newbrc1 is approximately twice as slow.

    Good job!

    > Which variation of Shindlet was the fastest?

    I only tested the bt version, because it was described as the fastest both by Matt and by your comment in the forum.

  18. #18
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Cyan:
    Version sl should be the fastest on BWT+MTF data from at least moderately compressible files, fs should be the fastest on files with a small number of frequent symbols (so also on BWT+MTF data), and bt has a more or less constant speed.

  19. #19
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Quote Originally Posted by Cyan View Post
    btw, Piotr, is Shindlet still downloadable from anywhere?
    Shindlet seems superior to other depth-0 entropy coders in both compression ratio and speed, and I'm willing to test it.
    But the link from Matt's LTCB seems dead.
    Thanks. I added a new link. http://mattmahoney.net/dc/text.html#6373

