
Thread: Out-of-Order Execution: how to optimize an ASM program?

  1. #1
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    176
    Thanks
    60
    Thanked 16 Times in 12 Posts

    Out-of-Order Execution: how to optimize an ASM program?

What is the distance in bytes between instructions that can be executed in parallel? How many ALUs are there for this, and which instructions can each of them execute? I mean the Intel Core i7-8700K CPU.

  2. #2
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    735
    Thanks
    424
    Thanked 483 Times in 257 Posts
    That is a really difficult question.
The topic is very deep. Just look at Computer Architecture 2013 – Out-of-Order Execution.ppt. And it's not even about today's CPUs; current CPUs are more complex than the ones covered on those slides.

There are so many nuances. Even if you are certain that your CPU works this way or that way, there are surprises all over the place.
Once I developed a CPU-heavy algorithm. Since it was a very small algo, I could easily try different versions: data/code alignments of different sizes, padding with NOPs of different lengths, calculating the same thing with different instructions in a different order to utilize the u/v-pipes differently, interleaving parallel sets of instructions to minimize stalls. I tried to see and understand which works best; I thought I could at least find some patterns. Well…

Many of the results were surprising, even counterintuitive. The best algo, which worked really well on my Intel CPU, did not perform well at all on my AMD CPU, for no obvious reason. Another version worked well on my AMD, but not that well on my Intel.
So my different approaches performed differently on different CPUs, and the difference was significant (around 10%).

The good news was that I managed to speed up the calculation by 30% (compared to a C++ version); the bad news was that I had no idea which algo works best on which CPU.
So I ended up with the following: at app startup I ran a quick benchmark (taking warm-up and caching into account) to see which algo works best on the current CPU, and used that one on subsequent calls.
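Roughly like this, as a minimal C sketch of that startup dispatch; calc_v1/calc_v2, pick_best_calc, and the 16-iteration measurement loop are made-up placeholders for the real variants and the real measurement, not my actual code:

Code:
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Two trivial stand-ins for the real hand-tuned variants of the same calculation. */
static uint64_t calc_v1(const uint8_t *p, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) s += p[i];
    return s;
}

static uint64_t calc_v2(const uint8_t *p, size_t n)
{
    uint64_t s = 0;
    for (size_t i = n; i-- > 0; ) s += p[i];   /* same result, different access pattern */
    return s;
}

typedef uint64_t (*calc_fn)(const uint8_t *, size_t);
static calc_fn best_calc;                      /* chosen once at startup */

static double bench(calc_fn f, const uint8_t *buf, size_t n)
{
    volatile uint64_t sink = f(buf, n);        /* warm-up pass: prime caches and predictors */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 16; i++)
        sink = f(buf, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

/* Run once at startup; all later calls go through best_calc(). */
void pick_best_calc(const uint8_t *sample, size_t n)
{
    best_calc = bench(calc_v1, sample, n) <= bench(calc_v2, sample, n)
              ? calc_v1 : calc_v2;
}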

    Final word: you have to experiment. That's the best advice I can give.

  3. Thanks (2):

    lz77 (18th February 2021),Mike (18th February 2021)

  4. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,135
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    Modern cpu architectures are very complex and mostly undocumented.
You can read https://www.agner.org/optimize/#manuals but it's not fully up to date.
    I don't think there's such a thing as "distance in bytes between instructions that can be executed in parallel" -
    we can only assume that anything not in L1C also won't be in microcode cache.
But mostly you can view a modern CPU as an optimizing compiler (from x86/x64 to VLIW microcode).

    So afaik there's no need to explicitly interleave anything - just write as much as possible without branches.
    Even not-taken branches spawn speculative execution, which takes resources from the "main thread".
    Although there's this interesting new extension - https://stackoverflow.com/questions/...on-actually-do
In the manuals it's described like that and doesn't seem useful, but IntelC seems to place it at branch targets when __builtin_expect(0) is used,
so I have a strong suspicion that it actually cuts off speculative execution (also judging by the mnemonic).
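As an illustration of the branch-free style (a generic example of mine, not tied to the extension discussed above): replace data-dependent jumps with mask arithmetic, or write code simple enough that the compiler emits CMOV/SETcc instead of jumps. For example, a saturating clamp to [0,255]:

Code:
#include <stdint.h>

/* Branchy version: an unpredictable condition risks a costly misprediction. */
static inline uint32_t clamp_branchy(int32_t v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint32_t)v;
}

/* Branch-free version: the clamping is done with masks, so the predictor has
   nothing to guess (relies on arithmetic >> for negative values, which
   gcc/clang/msvc all provide). Compilers often emit CMOV for the branchy
   form anyway, which is why inspecting the generated assembly matters. */
static inline uint32_t clamp_branchless(int32_t v)
{
    v &= ~(v >> 31);          /* v < 0   -> 0            */
    v |= (255 - v) >> 31;     /* v > 255 -> all bits set */
    return (uint32_t)(v & 255);
}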

Also there's this: http://nishi.dreamhosters.com/u/instlat.html
"L" values are instruction latency (CPU clocks until the result is available and the next instruction that uses it can start),
"T" values are throughput (CPU clocks until the next independent instruction can enter the pipeline).
For simple instructions like "ADD reg,reg" it may be possible to execute 3-4 per clock (4 only on newer AMD CPUs).
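As a C-level illustration of the latency/throughput difference (a generic sketch of mine, not taken from that table): summing into a single accumulator is bound by the ADD latency of the dependency chain, while several independent accumulators let the out-of-order core keep multiple ALUs busy.

Code:
#include <stddef.h>
#include <stdint.h>

/* Latency-bound: each ADD depends on the previous one, so the loop runs at
   roughly one ADD per cycle (the ADD latency), no matter how many ALUs are free. */
uint64_t sum_one_chain(const uint64_t *a, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Throughput-bound: four independent chains let the core keep several
   additions in flight per cycle (until loads or other resources limit it);
   the chains are only joined at the end. */
uint64_t sum_four_chains(const uint64_t *a, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}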

  5. Thanks (2):

    lz77 (18th February 2021),Mike (18th February 2021)

  6. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
lz77, the third Agner manual (microarchitecture) has all the details you need. The grand scheme hasn't changed since the Pentium Pro, but Skylake lifted most of the silly restrictions of earlier Intel CPUs. You may find my video useful: https://www.youtube.com/watch?v=XKnXq52OBb8 . Its description contains a lot of useful links, in particular a high-level overview of all Intel architectures from Core 2 to Ice Lake.

Short answer to your question: a modern CPU preloads hundreds of instructions-to-execute (224 for Skylake) in the order of predicted branches, and then executes out of order up to 4 instructions on each cycle, as soon as their operands are ready (i.e. each cycle, the first 4 instructions in the stream whose operands are ready). When the CPU mispredicts a branch, all instructions after the misprediction have to be discarded and reloaded, which adds a ~14-cycle delay before the first instruction of the correct branch starts to execute.
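To make the misprediction cost concrete, here is a small C illustration (my own generic example, not from the video or its links): counting bytes above a threshold with a data-dependent branch mispredicts roughly half the time on random input, while the arithmetic form leaves only the well-predicted loop branch.

Code:
#include <stddef.h>
#include <stdint.h>

/* Branchy count: on random data the 'if' is mispredicted about half the time,
   and each miss pays the pipeline-refill delay described above. */
size_t count_ge_branchy(const uint8_t *p, size_t n, uint8_t t)
{
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        if (p[i] >= t)
            c++;
    return c;
}

/* Branch-free count: the comparison result (0 or 1) is added directly, so the
   only branch left is the easily predicted loop branch. (Compilers may already
   do this transformation or vectorize the loop; check the generated code.) */
size_t count_ge_branchless(const uint8_t *p, size_t n, uint8_t t)
{
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        c += (p[i] >= t);
    return c;
}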


Eugene, AMD CPUs starting with Zen and Intel ones starting with Haswell can execute 4 ADDs per clock. Moreover, it seemed that Zen 3 (as well as Apple A14) can do 6 ADDs per clock: https://www.agner.org/forum/viewtopic.php?f=1&t=56 . EDIT: no, it's 4 ADDs plus 2 jumps.

Also, here is a very interesting Zen 2 improvement that may speed up our programs: https://www.agner.org/forum/viewtopic.php?f=1&t=41


Quote Originally Posted by Shelwien View Post
Even not-taken branches spawn speculative execution, which takes resources from the "main thread".

Not sure what you mean. A correctly predicted branch (both taken and not taken) is just one regular instruction, executed similarly to arithmetic ones.
    Last edited by Bulat Ziganshin; 19th February 2021 at 14:29.

  7. Thanks (3):

    lz77 (19th February 2021),Mike (19th February 2021),Shelwien (19th February 2021)

  8. #5
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    176
    Thanks
    60
    Thanked 16 Times in 12 Posts
Three years ago I ported my decompressor implementing the fast algorithm to asm, with all the variables kept in registers rax-r15 to reduce the number of memory accesses. It decompresses enwik8 in 0.4 sec. My new decompressor in C, which implements a slower algorithm, decompresses enwik8 in 0.328 sec...

  9. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
I bet it's because you need to learn CPU microarchitecture in order to optimize for it. It's not enough to just write ASM code.

  10. #7
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    183
    Thanks
    30
    Thanked 80 Times in 47 Posts
    Quote Originally Posted by lz77 View Post
Three years ago I ported my decompressor implementing the fast algorithm to asm, with all the variables kept in registers rax-r15 to reduce the number of memory accesses. It decompresses enwik8 in 0.4 sec. My new decompressor in C, which implements a slower algorithm, decompresses enwik8 in 0.328 sec...
0.328 seconds to decode is roughly 305 MB/s; however, there are codecs in the wild which can do 1-3 GB/s decode. LZ4 is a good example, and it's not written in assembly at all: it's written in C, and the compiler does the rest of the optimization for you.

The trick to writing fast C is to keep it simple and be aware of which instructions the compiler will emit, since this will produce the best assembly code possible.
There are edge cases where the compiler cannot optimize certain code paths, but they are quite rare. The best example that comes to mind is rANS renormalization, which is why static rANS exists.
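For reference, the decoder-side renormalization in question is essentially this byte-wise refill loop (a sketch following the widely used ryg_rans layout; the constant and names are illustrative, not anyone's actual code here). The data-dependent trip count is exactly the part compilers cannot turn into straight-line code:

Code:
#include <stdint.h>

#define RANS_L (1u << 23)   /* lower bound of the normalized state interval */

/* After decoding a symbol, the state x may drop below RANS_L; pull bytes from
   the input until it is back in range. How many bytes are needed depends on
   the data, so the loop length is unpredictable. */
static inline uint32_t rans_renorm(uint32_t x, const uint8_t **pptr)
{
    if (x < RANS_L) {
        do {
            x = (x << 8) | *(*pptr)++;
        } while (x < RANS_L);
    }
    return x;
}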

A word of advice: try using Godbolt (Compiler Explorer) to inspect the assembly output of your C/C++; it really helps you understand what the compiler thinks you're trying to do in your functions.

  11. #8
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    176
    Thanks
    60
    Thanked 16 Times in 12 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    I bet it's because you need to learn cpu microarchitecture in order to optimize for it. It's not enough just to write ASM code
    It may have happened because there are no registers left for OoO execution.

    Quote Originally Posted by Lucas View Post
    0.328 seconds to decode is roughly 305MB/s, however there are codecs in the wild which can do 1-3GB/s decode.
LZ4 is about two times faster in decompression than mine, on an i3-5005U (Broadwell) CPU.

  12. #9
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
Quote Originally Posted by lz77 View Post
It may have happened because there are no registers left for OoO execution.
A modern OoO CPU has 100-200 physical registers precisely for this reason (register renaming maps the 16 architectural registers onto them).

BTW, here is a series of posts exactly about efficient bit reading: https://fgiesen.wordpress.com/2018/0...y-ways-part-1/
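For context, the baseline such bit readers start from looks roughly like this (a generic LSB-first 64-bit bit-buffer sketch of mine, not code copied from the linked series; end-of-input handling omitted for brevity):

Code:
#include <stdint.h>

typedef struct {
    uint64_t       bits;   /* buffered bits, LSB-first */
    unsigned       count;  /* number of valid bits in 'bits' */
    const uint8_t *ptr;    /* next input byte */
} bitreader;

/* Refill so that at least 56 bits are available; keeping refill separate from
   consumption lets the out-of-order core overlap it with the rest of the decoder. */
static inline void br_refill(bitreader *br)
{
    while (br->count <= 56) {
        br->bits |= (uint64_t)(*br->ptr++) << br->count;
        br->count += 8;
    }
}

/* Consume n <= 56 bits; assumes br_refill() guaranteed enough bits are present. */
static inline uint32_t br_read(bitreader *br, unsigned n)
{
    uint32_t v = (uint32_t)(br->bits & ((1ull << n) - 1));
    br->bits  >>= n;
    br->count -= n;
    return v;
}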
