
Thread: IntelC 11.1.065 vs gcc 4.5

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts

    IntelC 11.1.065 vs gcc 4.5

    I'm writing an archiver (with ppmd as the only codec for now)
    and made an interesting observation: it seems that with gcc 4.5,
    intelc has lost its superiority, at least on scalar integer apps
    (which is what most compressors are).

    Code:
    39.782s 42.500s 39.750s 42.594s // gcc345
    37.750s 40.032s 37.719s 40.000s // gcc441 [1]
    34.890s 36.750s 34.813s 36.687s // gcc450
    
    39.156s 36.297s 39.125s 36.250s // IC91040_defSJ
    38.828s 36.297s 38.922s 36.328s // ICA1025_defSJ
    37.578s 38.109s 37.687s 38.078s // ICB0074_PGO [2]
    39.484s 37.922s 39.531s 37.969s // ICA1025_defSJ_PGO
    39.390s 38.766s 39.390s 38.828s // ICA1025_defSJ_QxN_PGO
    (full table: http://nishi.dreamhosters.com/u/paf02b.txt)

    As we can see in this quote, until gcc 4.4.1, IC was still
    leading (compare [1] and [2]). It may seem wrong to compare
    intelc with PGO to gcc without it, but gcc's PGO is still frequently
    hard to employ without major code changes, so afaik it's OK.

    But anyway, it looks like gcc made a breakthrough.
    I still don't know what the reason is, though.
    My initial idea was CMOVcc instruction usage
    (Intel's /arch:ia32 doesn't allow it for some reason,
    and they've discarded /Qxi), but now I don't see any CMOVs
    in either executable.
    Other candidates are PUSHes before function calls
    (IC's version has a lot of these, gcc's doesn't)
    and branch replacement with logic.
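
    For illustration, this is the kind of branch-to-logic replacement meant here - a hypothetical sketch, not code from ppmd (it assumes arithmetic right shift of negative ints, which both compilers use on x86):
    Code:
    // Branchy version: the compiler may emit a conditional jump here.
    static inline int clamp0_branchy( int x ) {
      if( x<0 ) x = 0;
      return x;
    }

    // Branchless version: the sign bit is spread into a mask
    // (x>>31 is -1 for negative x, 0 otherwise), so there is no
    // jump to mispredict.
    static inline int clamp0_branchless( int x ) {
      return x & ~(x>>31);
    }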

    Now, if only I could find a way to deal with gcc's PGO...
    I already found -fprofile-correction, which helps
    with MT profiling bugs, but got stuck again right after
    that because gcc doesn't understand setjmp().
    (In fact I had to write a replacement setjmp library for gcc,
    because even without PGO it started having problems with
    the one in msvcrt.dll, but neither setjmp library helps
    with producing a working executable with PGO enabled.)

    My current gcc options:
    -s -mtune=pentium2 -O3 -fno-exceptions -fno-rtti -fomit-frame-pointer
    -fwhole-program -fweb -fstrict-aliasing -ffast-math
    Some more which occasionally help but are currently disabled:
    -floop-strip-mine -funsafe-loop-optimizations
    -falign-loops=16 -funroll-loops -frerun-cse-after-loop

  2. #2
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hi!

    You're actually generating i386 code:

    -mtune selects the processor to use optimization heuristics for,
    -march selects actual instructions to use.

    For more recent CPUs I found that:

    -mtune=core2 -march=i686

    produces the fastest code when GCC doesn't need to vectorize anything. In the opposite case you may want to enable vectorization and SSE2:

    -ftree-vectorize (or -O3) and -msse2 (or -march=pentium4 or above)

    Why don't you use -O3?

    To improve optimization:

    -fwhole-program -combine

    The first switch assumes you're compiling everything at once (not on a per-module basis), so the whole code gets optimized more aggressively by automatically making a lot of stuff static. -combine causes gcc to compile multiple .cpp files passed on the command line as a single compilation unit.

    Cheers

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    Thanks for answering.
    But,

    > -mtune selects the processor to use optimization heuristics for,
    > -march selects actual instructions to use.
    > You're actually generating i386 code:

    Yeah, I know.
    But check the results - http://nishi.dreamhosters.com/u/paf02b.txt
    These "gcc450_all-xxx" lines mean both march and mtune set to xxx.
    And other "gcc450_xxx" mean only mtune=xxx.

    I do want (very much) to use PGO with gcc though; it could allow a higher march too.
    But afaik it requires:
    1) Splitting the ppmd codec into a separately compiled object module
    2) Writing a single-threaded test utility to gather PGO stats for the ppmd module (see the sketch below)
    3) Recompiling ppmd.o with PGO enabled and linking it into the MT archiver
    However, that's a bit time-consuming, and because of the overhead added at (1)
    there's less chance of getting a noticeable improvement.
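
    A minimal sketch of what the step-(2) driver could look like; the ppmd_init()/ppmd_compress() names and parameters are hypothetical placeholders for whatever the split-out module would actually export:
    Code:
    // Single-threaded PGO driver (sketch): feeds one sample file through the
    // ppmd module so that -fprofile-generate collects representative counters.
    #include <stdio.h>
    #include <stdlib.h>

    // hypothetical interface of the separately compiled ppmd.o
    void ppmd_init( int order, int memMB );
    int  ppmd_compress( const unsigned char* in, int inlen,
                        unsigned char* out, int outmax );

    int main( int argc, char** argv ) {
      if( argc<2 ) return 1;
      FILE* f = fopen( argv[1], "rb" );
      if( f==0 ) return 1;
      fseek( f, 0, SEEK_END ); int len = (int)ftell( f ); fseek( f, 0, SEEK_SET );
      unsigned char* in  = (unsigned char*)malloc( len );
      unsigned char* out = (unsigned char*)malloc( len + len/2 + 1024 );
      fread( in, 1, len, f ); fclose( f );
      ppmd_init( 8, 256 );  // order / memory are made-up values
      int outlen = ppmd_compress( in, len, out, len + len/2 + 1024 );
      printf( "%d -> %d\n", len, outlen );
      return 0;
    }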

    Here you can see some more (just retested):
    Code:
    33.219s 35.235s 33.265s 35.219s // gcc450 march=i686 mtune=core2
    33.187s 35.219s 33.234s 35.250s // gcc450 march=i686 mtune=pentium2
    38.218s 35.454s 38.203s 35.453s // gcc450 march=core2 mtune=core2
    38.219s 35.485s 38.172s 35.438s // gcc450 march=pentium2 mtune=core2
    I don't know what happens, but it either inlines too much,
    or chokes on the larger new instruction encodings.
    Afair this is not the first such case, either.

    > For more recent CPUs i found that:
    > -mtune=core2 -march=i686
    > produces the fastest code, when GCC doesn't need to
    > vectorize anything. In the opposite case you may want to
    > enable vectorization and SSE2.
    > ftree-vectorize or -O3 and -msse2 or -march=pentium4 or above

    Well, unfortunately (for intelc) there's nothing to vectorize in ppmd_sh8.

    > Why don't you use -O3?

    I do, look again.

    > To improve optimization:
    > -fwhole-program -combine

    I do use "-fwhole-program" too, and I don't have to add "-combine",
    because I always did it manually even before gcc got that switch

  4. #4
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Looks like I missed the -O3, didn't read carefully enough. I tried out gcc 4.5 a week ago and got slower code. I wonder why; normally GCC likes my coding style.

    You can look at the optimizations turned on when specifying -fprofile-use, which might help in your case too (like -fweb). -fstrict-aliasing is automatically enabled at -O2 and above.

    For M1 i came up with:

    -fno-exceptions -fno-rtti -fwhole-program -combine -mtune=core2 -march=i686 -ffast-math -fomit-frame-pointer (and -fprofile-generate/-fprofile-use)

    Combine doesn't make sense there, since I compile everything as a single source, too. I just keep it in to avoid forgetting about it.

    No more hints atm, good luck!

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    > tried out gcc 4.5 a week ago and got slower code. Wonder why, normally GCC likes my coding style.

    I don't know much, but judging from this: http://tdm-gcc.tdragon.net/development
    you might have tried a different gcc 4.5.
    And too much inlining alone is enough to get slower code (if the inner loop grows beyond the code cache).

  6. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Quick test: g++ 4.5.0 seems about 1% faster than 4.4.0 running zp with options -O2 -march=pentiumpro -fomit-frame-pointer -s (2 GHz T3200, 3 GB, Vista 32 bit).

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    Got some more interesting timings:

    Code:
    33.797s 35.875s 33.797s 35.797s // baseline gcc450
    42.515s 40.344s 42.578s 40.359s // icc111 paf.cpp paf_a.cpp paf_x.cpp
    42.437s 40.375s 42.500s 40.375s // icc111 with #includes
    35.078s 37.250s 35.031s 37.203s // gcc450 -combine paf.cpp paf_a.cpp paf_x.cpp
    35.015s 37.172s 35.016s 37.187s // gcc450, no -combine
    33.657s 35.563s 33.703s 35.579s // gcc450 with #includes
    To be specific, there were separate utilities for archive creation (paf_a)
    and extraction (paf_x), and here I integrated them with two different
    methods.
    The first is basically renaming the main() functions in paf_a.cpp and paf_x.cpp
    and compiling like "gcc paf.cpp paf_a.cpp paf_x.cpp"; the other is
    adding #include "paf_a.cpp" etc. to paf.cpp.
    And so, as you can see, for IntelC it didn't really matter, but the
    gcc build from separate files suddenly picked up a ~2s delay from somewhere.
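
    Roughly, the #include variant looks like this - just a sketch; the paf_a_main/paf_x_main names and the argument handling are made up for illustration:
    Code:
    // paf.cpp (sketch of the #include integration)
    // main() in paf_a.cpp / paf_x.cpp is assumed to be renamed first,
    // e.g. to paf_a_main() / paf_x_main().
    #include <string.h>

    #include "paf_a.cpp"
    #include "paf_x.cpp"

    int main( int argc, char** argv ) {
      // "a" = create archive, anything else = extract
      if( argc>1 && strcmp( argv[1], "a" )==0 ) return paf_a_main( argc-1, argv+1 );
      return paf_x_main( argc-1, argv+1 );
    }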

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    And another compiler comparison.
    Here I tried to compile paf with exe size optimization.
    As expected, MSC won, and as expected, the size-optimized executables are slower.
    Also I separately benchmarked the upx-compressed binaries, and it seems that the delay
    added by upx is visible:

    Code:
    33.657s 35.563s 33.703s 35.579s // gcc450   exe=111616 
    33.984s 35.875s 34.032s 35.891s // gcc450    upx=48640
    
    51.718s 48.813s 51.813s 48.875s // gcc441_Os exe=48640 
    45.125s 42.000s 45.140s 42.032s // ic111_Os  exe=46080 
    60.062s 61.500s 60.110s 61.500s // vc8_Os    exe=23552 
    
    51.812s 48.891s 51.828s 48.906s // gcc441_os upx=29184
    45.187s 42.047s 45.219s 42.078s // ic111_os  upx=25088
    60.141s 61.593s 60.172s 61.578s // vc8_os    upx=21504

  9. #9
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Obviously it depends on your codebase.

    I found, on my code, that gcc 4.x was always superior to ICC. I'm sure I've said that before.

    Not doing much multi-threaded code (being on Linux I tend to prefer forking), I've never had a problem with -fprofile-generate/-fprofile-use, and it typically adds about a 10% improvement.

    I tend to use -O9 -fomit-frame-pointer -march=native -msse4.1 (from memory)

    I also sometimes use templates to explicitly unroll recursion.

    Most speed of course comes from cache awareness and reorganising data structures for that.

    There was an interesting program here: http://www.coyotegulch.com/products/acovea/ but it seems to be abandonware and I couldn't get it to run out-of-the-box and I didn't try very hard...

  10. #10
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    We got working implementations of genetic algorithms here. I can provide my M1 genetic optimizer framework with a simple example on how to use the optimizer for other tasks. One needs to declare the variables (e.g. compiler options) like that:

    Code:
    ...
    BITSTR(FMERGE_CONSTANTS, uint, 1, NO_FLAGS) MORE // a single bit
    INTEGERL(FINLINE_LIMIT, uint, 0, 9999, NO_FLAGS) MORE // values from 0 to 9999
    ...
    I think a perl script could create such a dump based on the gcc manual HTML. Together with the optimizer framework it'll create and assign the variables based on the currently evaluated individual of the population. You need to write a function which maps the (un)set variables to a compiler settings string:

    Code:
    ...
    if (FMERGE_CONSTANTS) { arg += "-fmerge-constants "; }
    char tmp[64]; sprintf( tmp, "-finline-limit=%u ", FINLINE_LIMIT ); arg += tmp; // numeric option needs formatting
    ...
    In addition you need to "evaluate" the cost function and return its value (which is being minimized):
    Code:
    ...
    system("g++ " + args + ...); // compile
    clock_t diff=clock();
    system("a.out"); // run
    diff = clock()-diff; // benchmark
    ...
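
    A somewhat more self-contained sketch of such a cost function - the file names, the timing method and the rejection values are assumptions, not part of the M1 framework:
    Code:
    // Cost function sketch: compile with the candidate options, run the
    // benchmark, and return the elapsed time; failures get "infinite" cost.
    #include <cstdlib>
    #include <ctime>
    #include <string>

    double evaluate( const std::string& options ) {
      std::string cmd = "g++ " + options + " -o bench.exe bench.cpp";
      if( std::system( cmd.c_str() ) != 0 )
        return 1e30;                         // compiler failed -> reject

      std::time_t t0 = std::time( 0 );
      int rc = std::system( "bench.exe testfile" );
      std::time_t t1 = std::time( 0 );

      if( rc != 0 )
        return 1e30;                         // crash or bad exit code -> reject

      return double( t1 - t0 );              // coarse wall-clock seconds
    }
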
    BTW -O9 doesn't make any sense, -O3 is the max. level. Read the docs.

  11. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    > I found, on my code, that gcc 4.x was always superior to ICC.
    > I'm sure I've said that before.

    Well, 4.4 was already on par with IC in some cases, and
    somehow I doubt that you've really compared IC 11.1 vs gcc 4.3 or less.
    As I've seen it, early 4.x were weird and didn't necessarily win even vs gcc 3.4.

    Also, intelc on linux is significantly different. That is, they use
    the EDG frontend there instead of their own, so practically it's
    a different compiler.

    > I've never had a problem with -profile-generate/use, and
    > it typically adds about 10% improvement.

    Good, then maybe you'd explain how to properly use it?
    I mean, they have a lot of -fprofile-xxx options now and
    maybe I'm missing something.
    But IC accumulates profiles (generates a new random name
    on each program run) and merges stats in prof-use compile,
    and doesn't complain much about modified sources (says
    something like "computing a static profile" for a function).
    While with gcc I only have a single profile, which gets
    discarded if even a single source line is modified anywhere.
    It's a major bother, especially for compressors, because
    usually they have at least 2 different processing modes.
    Also gcc's PGO is very unstable - I use a profile collected
    on encoding, encoding goes faster, decoding crashes - like that.

    > I tend to use -O9 -fomit-frame-pointer -march=native -msse4.1 (from memory)

    Well, if you have some table transformations all over, that might be right.
    But for me these extended instruction sets just slow things down
    (because the code becomes more bloated).
    I think the biggest concern for my stuff is mostly whether the internal loop
    fits into code cache.

    > I also sometimes use templates to explicitly unroll recursion.

    I guess I just don't use "automatic" recursion anymore.
    Instead, it's commonly better to work explicitly with stacks and such.
    And I think I usually unroll _all_ the code. Most of the functions
    in my code exist for editing convenience (modular design etc),
    but they're rarely called multiple times in the same loop,
    so I tend to only "uninline" some specific functions which "deserve" that.

    > Most speed of course comes from cache awareness and
    > reorganising data structures for that.

    Yeah... Unfortunately I'm not sure how to do that for ppmd :)
    btw, it's http://ctxmodel.net/files/PPMd/ppmd_Jr1_sh8.rar :)

    > I can provide my M1 genetic optimizer framework with a
    > simple example on how to use the optimizer for other tasks.

    It's kinda too straightforward imho.
    I mean, the main problem is that optimizations can be applied
    on a much finer level (i.e. PGO and attributes), and it requires
    a full C++ parser to automatically add these things to a program.
    Also there'd be dozens of KBs of optimizer profile data, which
    is a problem (because of speed) even if the optimizer can handle it.

    And there's another problem - you just can't simply enumerate
    all the gcc option combinations like that - there are high chances
    that with some combinations the program would either crash or
    generate undecodable files... and the compiler can crash or
    generate garbage as well.
    For example, I have a "mapper" script which patches fragments
    of the executable with 0xCCs and runs a test to verify whether that
    part was necessary. Obviously, I have exception handlers in place
    there, and it mostly works as intended. But it still requires
    manual intervention, because apparently there are still kinds of
    exceptions about which the system especially wants to inform
    the user - specifically, when the system loader crashes while
    trying to load an executable with butchered imports.
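
    Roughly, one iteration of that mapper works like this - a sketch only; the file names and the test command line here are made up:
    Code:
    // One mapper iteration (sketch): overwrite a byte range of a copy of the
    // target exe with 0xCC and check whether the usual test still passes.
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    bool try_patch( const char* exe, long ofs, long len ) {
      FILE* f = fopen( exe, "rb" );
      if( !f ) return false;
      fseek( f, 0, SEEK_END ); long size = ftell( f ); fseek( f, 0, SEEK_SET );
      std::vector<unsigned char> buf( size );
      fread( &buf[0], 1, size, f ); fclose( f );

      for( long i=ofs; i<ofs+len && i<size; i++ ) buf[i] = 0xCC; // int3 filler

      FILE* g = fopen( "patched.exe", "wb" );
      if( !g ) return false;
      fwrite( &buf[0], 1, size, g ); fclose( g );

      // run the usual encode+verify test; exit code 0 means the patched
      // region was apparently unnecessary for this test input
      return std::system( "patched.exe c book1 book1.paf" ) == 0;
    }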

    > In addition you need to "evaluate" the cost function and
    > return its value (which is beeing minimized):

    There's still that - http://ctuning.org/wiki/index.php/CTools:CTuningCC
    Though I still wasn't able to evaluate it either :)

  12. #12
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    > It's kinda too straightforward imho.
    > I mean, the main problem is that optimizations can be applied
    > on a much finer level (i.e. PGO and attributes), and it requires
    > a full C++ parser to automatically add these things to a program.

    Well, what people currently do (including me and you) is tune compiler options manually like that. I don't see why automating this is a bad idea. There's no demand for overall completeness as a first step, so if one just wants to automate code-generation optimization via compiler switches, that's fine. Acovea does nothing different.

    Building a program around a proven optimizer implementation is not hard work.

    BTW Didn't black_fox provide some "optimized" compiles which were generated by some script some time ago?!

    > Also there'd be dozens of KBs of optimizer profile data, which
    > is a problem (because of speed) even if the optimizer can handle it.

    Why? Simply delete the old profile, run the program to generate a profile, recompile with the specified options using the just-generated profile, and measure its execution time.

    > And there's another problem - you just can't simply enumerate
    > all the gcc option combinations like that - there are high chances
    > that with some combinations the program would either crash or
    > generate undecodable files... and the compiler can crash or
    > generate garbage as well.

    Both cases can be detected (verify decompression and program exit codes), and rejection can easily be integrated by giving the corresponding individual a cost function value of infinity (when minimizing).

  13. #13
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Sadly, sometimes even the dumb Visual C++ 2008 outperforms ICL. The new PPMX has many loops that can be easily vectorized. If I compare compiles with SSE2 disabled, VC wins here; if SSE2 is enabled, the VC compile becomes a little bit slower, while ICL nicely gains some compression speed (all the most performance-critical loops get vectorized). I.e. ICL with its vectorization wins, but what about the new GCC? Which release of GCC for Windows is preferable, and where can I get it?

  14. #14
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Download the following packages from the MinGW repository at sourceforge.net:

    binutils-2.20.51-1-mingw32-bin
    gcc-c++-4.5.0-1-mingw32-bin
    gcc-core-4.5.0-1-mingw32-bin
    gdb-7.1-2-mingw32-bin
    libexpat-2.0.1-1-mingw32-dll-1
    libgmp-5.0.1-1-mingw32-dll-10
    libmpc-0.8.1-1-mingw32-dll-2
    libmpfr-2.4.1-1-mingw32-dll-1
    libpthread-2.8.0-3-mingw32-dll-2
    mingwrt-3.18-mingw32-dev
    w32api-3.14-mingw32-dev

    And unpack everything into the same directory. That will give you GCC 4.5.0 with C and C++ support and gdb. Somehow, for M1, gcc 4.4.0 produces faster code due to different inlining policies - but I didn't grab that TDM stuff. You may also want to get pthread and/or OpenMP support.

    BTW: I noticed that, if one doesn't rely on standard conformance (e.g. IEEE fpu math), one can compile with -Ofast instead of -O3. Additional non-standard switches, like -ffast-math and -fomit-frame-pointer, are turned on (in addition to -O3) that way.
    Last edited by toffer; 20th July 2010 at 13:45.

  15. #15
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    @encode:

    > Sadly, but sometimes, even dummy Visual C++ 2008
    > outperforms ICL.

    That's hardly believable.
    Can you post timings for your VC2008 compiles
    comparing to my IC rip? (from http://encode.su/threads/605-Slow-Vi...ll=1#post12151)
    You can either test it with paq8 sources included there,
    or your own source compiled with make_paq.bat script.

    VS GUI builds don't mean anything in this case, because
    to do it right you have to be very careful about all
    the corners of the project option dialogs, and make sure
    that it's a release build and that IC is actually enabled.

    > ICL with its vectorization wins, but what about new GCC?

    Well, I really don't see a way for VS2008 to win over IC
    when it's used right (it'd be another matter if you were talking
    about VS2010, as I haven't tested it yet, but I use VS2008 too).

    > What release of GCC for Windows preferable and where I can get it?

    I use this: http://tdm-gcc.tdragon.net/
    There's also the official one on http://mingw.org though

  16. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    > Well what people currently do (including me and you) is to
    > tune compiler options manually like that. I don't see why
    > automating this is a bad idea.

    Not that it's really a bad idea, but I don't see how you
    can get anything from automation this way.
    I mean, to automate some option, you'd normally have to
    read the manual for that option at least, because such an
    optimization would be a relatively slow process, with
    20s per iteration if you're lucky, so you'd want to
    reduce the parameter space.
    And if you'd read the manual, you'd likely end up with
    a list of applicable options, all of which would improve speed :)
    Well, I guess some bruteforce would still make sense for
    a "blind" optimization, where you take somebody's source
    and want to build the fastest exe from it.
    But it just doesn't make sense to try that for my own stuff,
    because I either already know what's applicable, or can
    more easily bruteforce a few things with a shell script.
    Otherwise I'd have been using that approach myself long ago :)

    > BTW Didn't black_fox provide some "optimized" compiles
    > which were generated by some script some time ago?!

    I think that was just a bruteforce shell script, with a
    list of all combinations of (a reduced set of) compiler options.

    > > Also there'd be dozens of KBs of optimizer profile data, which
    > > is a problem (because of speed) even if optimizer can handle it,
    >
    > Why? Simply delete the old profile, run the program to
    > generate a profile, recompile with the specified options
    > using the just generated profile and measure its execution
    > time.

    I was talking about a different profile there - the optimizer's parameter
    profile. And only in the case where setting options for specific
    functions is considered.

    > > that with some combinations either program would crash, or it
    > > would generate undecodable files... and compiler can crash or
    > > generate garbage as well.

    > Both cases can be detected (verify decompression and
    > program exit codes) and rejection can be easily be
    > integrated by evaluating the corresponding individual to
    > have a cost function value of infinity (when minimizing).

    I don't know, maybe it's really easier on Linux.
    But on Windows, I have to prepare an exception handler first
    (for that I had to load the target exe with the "debug" option set
    and handle system notifications); otherwise, when the target exe
    crashes, the system shows an exception window and
    waits for user input, which defeats the whole idea of automated
    optimization. And as I said, even with a debugger-like wrapper,
    there are still cases which I don't know how to "catch".
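
    For reference, a simpler alternative to the debugger-style wrapper is to let the child inherit a quiet error mode - a sketch, assuming that suppressing the crash dialog is enough (it doesn't cover the loader-failure case mentioned above):
    Code:
    // Run a test exe without the "program has stopped working" dialog:
    // the child inherits our error mode, so a crash just becomes an
    // exception-style exit code that we can check.
    #include <windows.h>

    int run_target( const char* cmdline ) {
      SetErrorMode( SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX );

      STARTUPINFOA si = { sizeof(si) };
      PROCESS_INFORMATION pi;
      char cmd[1024];
      lstrcpynA( cmd, cmdline, sizeof(cmd) ); // CreateProcess may modify the buffer

      if( !CreateProcessA( 0, cmd, 0, 0, FALSE, 0, 0, 0, &si, &pi ) ) return -1;
      WaitForSingleObject( pi.hProcess, INFINITE );

      DWORD exitcode = (DWORD)-1;
      GetExitCodeProcess( pi.hProcess, &exitcode );
      CloseHandle( pi.hProcess ); CloseHandle( pi.hThread );

      // a crash shows up here as an exception code, e.g. 0xC0000005
      return (int)exitcode;
    }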

    And another important point is verification speed.
    It would be >2x slower if you had to verify the decoding,
    and 10x slower if there were a PGO profiling pass.
    And even the minimal possible iteration (compiler call + encoding run)
    would already be slow compared to e.g. a usual M1 iteration.
    For example, the gcc compile time for paf is 3.266s - on a fast Q9450.

    Also don't forget that optimizing for speed is different from
    our usual compression tuning. I mean, it's hard to measure
    speed precisely, and it takes even more time.

  17. #17
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Shelwien's ICL bundle has the same performance as my latest ICL - ~25 sec (dummy PPMX on ENWIK). Latest GCC - ~24 sec. VC - ~23 sec. ICL and GCC with SSE2 enabled (Pentium 4) have about the same performance - ~20 sec. As a note, I've tuned all the options with VC, while with the other compilers I've tested just a few of the most important and expected combinations.

  18. #18
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    Well, maybe there's really a specific programming style optimal for each compiler...
    I didn't expect it for MSC though... I know that it can generate the most compact exes,
    but never got anything close in speed to IC/gcc4.3+ from it...

  19. #19
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by Shelwien View Post
    Well, maybe there's really a specific programming style optimal for each compiler...
    I didn't expect it for MSC though... I know that it can generate the most compact exes,
    but never got anything close in speed to IC/gcc4.3+ from it...
    Actually, the Visual C++ executables are, in my case, notably bigger - ICL - 89 KB, VC - 102 KB. I tested Visual C++ 2010. It is slower. I really liked its interface though. Anyway, it is possible to change the "Platform Toolset" from v100 to v90 and use the Visual C++ 2008 compiler tools and libraries.
    Well, this PPMX is not that computationally expensive. Also, VC produces slightly faster code for my LZ-based compressors, i.e. the main performance hit is the memory access. Gonna test Visual C++ 2005...

    EDIT: Tested Visual C++ 2005... ...and it is notably slower than Visual Studio 2008. As a bonus, 2005 has a bug (_fseeki64).

  20. #20
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Anyway, VC is a dumb compiler. For example
    (tot is a class member):
    Code:
    			int sum=0;
    			for (int i=0; i<256; ++i)
    				sum+=(n[i]>>=1);
    			tot=sum;
    is faster than
    Code:
    			tot=0;
    			for (int i=0; i<256; ++i)
    				tot+=(n[i]>>=1);
    With ICL there is no difference - i.e. the compiler is smart enough.

  21. #21
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    > Well, 4.4 was already on par with IC in some cases, and
    > somehow I doubt that you've really compared IC 11.1 vs gcc 4.3 or less.
    > As I've seen it, early 4.x were weird and didn't necessarily win even vs gcc 3.4.

    Well, really I did. Every time I enter a competition, which is about every six months or so, I write a solver in Python and then one in C++, and I've tried CUDA a couple of times but without it really helping the problems I've been tackling yet. And then I play around with all the compilers, including paid-for copies of the IC that I have at work, and run the fastest on the computers I borrow. So whilst I have difficulty citing any particular versions and statistics, I am pretty sure I have tried 4.3 versus a proper IC at some point in time.

    I've been doing this for years, and I remember that in 2002 or so the fastest I had was the Metrowerks C++ compiler for win32.

    Regarding -O9: yes, I grasp that the current settings only go up to 3, but it's an intent thing and it makes my intent clearer (to me).

    Regarding SSE, this is the kind of code I put in SSE: http://sites.google.com/site/william...te-differences

    Finally the memory structure thing for PPMD; hmm, looked at the varnish cache stuff? http://queue.acm.org/detail.cfm?id=1814327 and especially figure 6 http://phk.freebsd.dk/B-Heap/fig6.png

    I've been playing a bit with caching recently, you see: http://sites.google.com/site/william...cacher_proto_1

  22. #22
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    Thanks for the links, but page-level optimizations are not usually
    applicable to compression - in fact the speed can be improved
    by using "large pages" (thus effectively disabling the paging) -
    e.g. see http://www.7-max.com/.
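
    For illustration, a minimal sketch of a large-page allocation on Windows - it assumes the "Lock pages in memory" privilege is already granted and enabled (the AdjustTokenPrivileges part is omitted):
    Code:
    // Allocate the model with large pages if possible, falling back to
    // normal pages otherwise.
    #include <windows.h>

    void* alloc_large( SIZE_T size ) {
      SIZE_T large = GetLargePageMinimum();     // 0 if large pages unsupported
      if( large ) {
        SIZE_T rounded = (size + large - 1) & ~(large - 1); // round up
        void* p = VirtualAlloc( 0, rounded,
                                MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                                PAGE_READWRITE );
        if( p ) return p;
      }
      return VirtualAlloc( 0, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE );
    }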

    And as to cache line packing, we're well aware of that... but
    somehow weren't able to invent any impressive way to make
    use of it (packing a subtree into a line/page is pretty trivial imho).

    The task spec is fairly compact btw - for example, we want to
    optimize the hashtable layout for statistics from a fixed context.
    Then we can collect a sequence of context transitions from a sample,
    design some kind of a generalized hash function or state mapping,
    add an access speed evaluation function (preferably based on a
    memory system simulation - actual profiling in a modern OS would
    be rather imprecise) and tune the parameters.
    I guess there are no good ideas for that "generalized function" or
    something, but I don't think anybody has actually implemented it either.

    Also, as to ppmd, it's even harder to optimize it that way
    (unlike some hashtables), because it's relatively cache-friendly
    as it is - statistics for bytes in a context are stored as plain
    tables, and symbols are even dynamically reordered to allow faster
    access to more "popular" ones.
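
    As a generic illustration of that reordering idea (not ppmd's actual code, just a sketch): after a symbol is coded, it gets bubbled towards the front of the context's symbol table, so frequent symbols are found - and touched in cache - earlier during the linear search.
    Code:
    struct Sym { unsigned char c; unsigned short freq; };

    // find symbol c in the table and promote it one slot if it overtook
    // its neighbour; returns the (new) index, or -1 for escape
    int find_and_promote( Sym* tab, int n, unsigned char c ) {
      for( int i=0; i<n; i++ ) {
        if( tab[i].c!=c ) continue;
        tab[i].freq++;
        if( i>0 && tab[i].freq>tab[i-1].freq ) {
          Sym t=tab[i]; tab[i]=tab[i-1]; tab[i-1]=t; // bubble towards the front
          return i-1;
        }
        return i;
      }
      return -1; // not present in this context -> escape
    }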

  23. #23
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts

    Lightbulb

    Visual C++ 2008 is quite sensitive to variable declaration and placement. For example, just by changing where a variable is declared (within the same function), moving it a few lines up or down, we may get a compression time of ~20 sec instead of ~23 sec on enwik8. A huge difference. The catch is memory access - it should be sequential. For example, if we have an array of structs:

    Code:
    struct E {
      int a;
      int b;
      int c;
    } t[N];

    we should read it sequentially:

    Code:
    const int a=t[i].a;
    const int b=t[i].b;
    const int c=t[i].c;

    Instead of, say:

    Code:
    const int a=t[i].a;
    const int c=t[i].c;
    const int b=t[i].b;

  24. #24
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    It shouldn't do any reordering in such a case.
    1. It shouldn't reorder the structure elements, because the standard prohibits it.
    2. It shouldn't reorder the instructions, because either of them can throw an exception (like an access violation) and reordering would change the code semantics.

    But it's good to know it's so important.

  25. #25
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    I guess MSC just doesn't care about standards that much.
    As a different example, I had a problem like this:
    Code:
    void func1( void ) {
      if( func2() ) {
        char buf[1000000];
        func3( buf );
      }
    }
    This worked fine with gcc and IntelC, but crashed at func2 with MSC.
    The reason being that MSC allocated the space for buf
    at the start of func1, so func2 didn't have enough stack space.
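
    One possible workaround - not from the thread, just a sketch - is to keep the big buffer out of func1's stack frame entirely:
    Code:
    void func1( void ) {
      if( func2() ) {
        // heap allocation instead of a huge local, so func1's frame stays
        // small and func2() still has its stack space
        char* buf = new char[1000000];
        func3( buf );
        delete[] buf;
      }
    }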

  26. #26
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by Shelwien View Post
    I guess MSC just doesn't care about standards that much.
    As a different example, I had a problem like this:
    Code:
    void func1( void ) {
      if( func2() ) {
        char buf[1000000];
        func3( buf );
      }
    }
    This worked fine with gcc and IntelC, but crashed at func2 with MSC.
    The reason being that MSC allocated the space for buf
    at the start of func1, so func2 didn't have enough stack space.
    Which version of MSC do you use?
    I know that Visual C++ 2005 has many well-known and serious bugs. Visual C++ 2008 SP1 looks good to me...

  27. #27
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    Naturally, I have a few.
    I've been looking for VS 97 (=5.0) recently, in the hope that its libs are more compact, but no luck so far.
    And anyway, many of gcc's features are more annoying than other compilers' bugs.

  28. #28
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    My thought was more concerned with how you (typically) have to have the same amount of memory for the dictionary for encoding and decoding; command-line switches that affect compression usually come down to the amount of memory.

    Of course this is an implementation limitation: green demonstrates how you can do it all with very little RAM if you trade RAM for runtime.

    Now imagine you kept your data in discardable pages, and did a full recalculation on miss?

    It's just an interesting thought experiment.

    Also, a nice link that seems topical: http://ridiculousfish.com/blog/archi...sh_made_a_mess

    Quote Originally Posted by Shelwien View Post
    Thanks for the links, but page-level optimizations are not usually
    applicable for compression - ...
    And as to cache line packing, we're well aware of that... but
    somehow weren't able to invent any impressive way to make
    use of it (packing a subtree into a line/page is pretty trivial imho).
    ...
    Also, as to ppmd, its even harder to optimize it that way
    (unlike some hashtables), because its relatively cache-friendly
    as it is - statistics for bytes in context are stored as plain
    tables, and symbols are even dynamically reordered to allow faster
    access to more "popular" ones.
    Last edited by willvarfar; 24th July 2010 at 02:49.

  29. #29
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts
    > Now imagine you kept your data in discardable pages, and did a full recalculation on miss?

    That was the initial idea in fact (and still is).
    In that sense, I'm more interested in compression of intermediate structures though - like http://encode.su/threads/379-BWT-wit...ull=1#post7540

    But neither of these ideas has anything to do with hardware considerations (well, except for the general idea that memory access is slow).

  30. #30
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    http://encode.su/threads/257-PPMX-v0...ll=1#post22033 just illustrates that, whilst we all knew there was no ultimate compiler, there isn't even a generally best one!

    I remember sourceforge had a 'compile farm' that compiled your programs on different architectures and with different switches. Is there any similar public service that compiles your code with different compilers and benchmarks the results? Or better yet, goes so far as to use some fancy genetic algorithm to find very good compiler options too....

    Here's to wishful thinking!
