What is the distance in bytes between instructions that can be executed in parallel? How many ALUs are there for this, and what instructions can they follow? I mean CPU I8700K.
That is a really difficult question.
The topic is very deep. Just look at "Computer Architecture 2013 - Out-of-Order Execution.ppt". And it's not even about today's CPUs; current CPUs are more complex than the ones covered on those slides.
There are so many nuances. Even if you are certain that your cpu works this way or that way, there are surprises all over the place.
Once I developed a cpu-heavy algorithm. As it was a very small algo, I could easily try different versions: data/code alignments of different sizes, padding with nops of different sizes, calculating the same thing with different instructions in different orders to utilize the u/v-pipes differently, interleaving parallel sets of instructions to minimize stalls. I tried to see and understand which works best, and thought I could at least find some patterns. Well…
Many of the results were surprising, even counterintuitive. The algo that worked really well on my intel cpu did not perform well at all on my amd cpu, and the difference was significant for no obvious reason. There was another version that worked well on my amd, but not that well on my intel.
So my different approaches performed differently on different cpus, and the differences were significant (around 10%).
The good news was that I managed to speed up the calculation by 30% (compared to a c++ version); the bad news was that I had no idea which algo would work best on which cpu.
So I ended up with the following: at app startup I ran a quick benchmark (taking warmup and caching into account) to see which algo works best on the current cpu, and used that one on subsequent calls.
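In pseudo-form, the dispatch looked roughly like this (a minimal sketch under my assumptions; transform_v1/transform_v2 are hypothetical stand-ins for the real kernels, not anything from the post):

```c
/* Sketch of picking the fastest kernel at startup; transform_v1/v2
   are hypothetical stand-ins, not the actual algorithm versions. */
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef void (*kernel_fn)(uint8_t *buf, size_t n);

static void transform_v1(uint8_t *buf, size_t n) { /* real kernel #1 here */ }
static void transform_v2(uint8_t *buf, size_t n) { /* real kernel #2 here */ }

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* One warmup run (fills caches, trains predictors), then best-of-16. */
static uint64_t time_kernel(kernel_fn f, uint8_t *buf, size_t n) {
    f(buf, n);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < 16; i++) {
        uint64_t t0 = now_ns();
        f(buf, n);
        uint64_t dt = now_ns() - t0;
        if (dt < best) best = dt;
    }
    return best;
}

kernel_fn pick_fastest(uint8_t *buf, size_t n) {
    return time_kernel(transform_v1, buf, n) <= time_kernel(transform_v2, buf, n)
           ? transform_v1 : transform_v2;
}
```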
Final word: you have to experiment. That's the best advice I can give.
Modern cpu architectures are very complex and mostly undocumented.
You can read this https://www.agner.org/optimize/#manuals but it's not really up to date.
I don't think there's such a thing as a "distance in bytes between instructions that can be executed in parallel" - we can only assume that anything not in the L1 code cache also won't be in the microcode (µop) cache.
But mostly you can view modern cpu as an optimizing compiler (from x86/x64 to VLIW microcode).
So afaik there's no need to explicitly interleave anything - just write as much as possible without branches.
Even not-taken branches spawn speculative execution, which takes resources from the "main thread".
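For example (my sketch, not code from the thread): the same reduction written with a data-dependent branch and written branchless. Compilers typically turn the second form into cmov or a vectorized masked add, which leaves the out-of-order core a straight-line stream with nothing to mispredict:

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy version: one hard-to-predict branch per element. */
int64_t sum_above_branchy(const int32_t *a, size_t n, int32_t limit) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] > limit) s += a[i];
    return s;
}

/* Branchless version: the condition becomes a data dependency;
   gcc/clang usually emit cmov here, or vectorize the loop outright. */
int64_t sum_above_branchless(const int32_t *a, size_t n, int32_t limit) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += (a[i] > limit) ? a[i] : 0;
    return s;
}
```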
Although there's this interesting new extension - https://stackoverflow.com/questions/...on-actually-do
In the manuals it's described like that and doesn't seem useful, but IntelC seems to place it at branch targets when __builtin_expect(..., 0) is used,
so I have a strong suspicion that it actually cuts speculative execution (also based on the mnemonic).
Also there's this: http://nishi.dreamhosters.com/u/instlat.html
"L" values are instruction latency (cpu clocks until result is available and next instruction using it can be started)
"T" values are "throughput" (cpu clocks until next independent instruction gets into pipeline).
For simple instructions like "ADD reg,reg" it may be possible to execute 3-4 per clock (4 only on newer AMD cpus).
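The L/T distinction is easy to poke at with a toy loop (a sketch of mine, not from the instlat page): a single dependency chain is limited by latency, several independent chains by throughput:

```c
#include <stdint.h>

/* One dependency chain: every ADD waits for the previous result,
   so this runs at roughly one add per L cycles. */
uint64_t one_chain(uint64_t x, long iters) {
    for (long i = 0; i < iters; i++)
        x += 1;               /* the next iteration needs this value */
    return x;
}

/* Four independent chains: the OoO core overlaps them, so the limit
   is throughput T - up to 3-4 adds per clock on the cpus above.
   (Check the generated asm: an optimizer may fold these loops away.) */
uint64_t four_chains(uint64_t a, uint64_t b, uint64_t c, uint64_t d, long iters) {
    for (long i = 0; i < iters; i++) {
        a += 1; b += 1; c += 1; d += 1;   /* no cross-chain dependencies */
    }
    return a + b + c + d;
}
```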
lz77, the third Agner manual (microarchitecture) has all the details you need. The grand scheme hasn't changed since the Pentium Pro, but Skylake lifted most of the stupid restrictions of earlier Intel cpus. You may find my video useful: https://www.youtube.com/watch?v=XKnXq52OBb8 . Its description contains a lot of useful links, in particular a high-level overview of all Intel architectures from Core2 to IceLake.
Short answer to your question: a modern CPU preloads hundreds of instructions-to-execute (224 for Skylake) in the order of the predicted branches, and then executes up to 4 of them out of order on each cycle as their operands become ready (i.e. each cycle, the first 4 instructions in the stream whose operands are ready). When the CPU mispredicts a branch, all instructions after the misprediction have to be reloaded, which adds a 14-cycle delay before the first instruction of the correct path starts to execute.
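That misprediction cost is easy to demonstrate (a sketch of mine, not from the post): run the loop below over random bytes, then over the same bytes sorted. The code is identical, but the sorted pass is several times faster because the branch becomes predictable:

```c
#include <stddef.h>
#include <stdint.h>

/* On random input this branch is taken ~50% of the time and mispredicts
   constantly, paying the ~14-cycle pipeline refill on each miss; on
   sorted input the predictor is almost always right. (Check the asm:
   some compilers replace the branch with cmov, hiding the effect.) */
int64_t sum_ge_128(const uint8_t *a, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] >= 128) s += a[i];
    return s;
}
```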
Eugene, AMD cpus starting with Zen and Intel ones starting with Haswell can execute 4 ADDs per clock. Moreover, it seems that Zen3 (as well as Apple A14) can do 6 ADDs per clock: https://www.agner.org/forum/viewtopic.php?f=1&t=56 . EDIT: No, it's 4 adds plus 2 jumps
Also, very interesting Zen2 improvement that may speed up our programs: https://www.agner.org/forum/viewtopic.php?f=1&t=41
"Even not-taken branches spawn speculative execution, which takes resources from the 'main thread'."
Not sure what you mean. A correctly predicted branch (whether taken or not) is just one regular instruction, executed much like an arithmetic one.
3 years ago I ported my decompressor implementing the fast algorithm to asm, keeping all the variables in registers rax-r15 to reduce the number of memory accesses. It decompresses enwik8 in 0.4 sec. My new decompressor in C, which implements a slower algorithm, decompresses enwik8 in 0.328 sec...
I bet it's because you need to learn cpu microarchitecture in order to optimize for it. It's not enough just to write ASM code.
0.328 seconds to decode is roughly 305 MB/s, but there are codecs in the wild that can do 1-3 GB/s decode. LZ4 is a good example, and it's not written in assembly at all; it's written in C, and it's the compiler that does the rest of the optimizations for you.
The trick to writing fast C is to keep it simple and to be aware of which instructions the compiler will emit; that's what lets it produce the best assembly possible.
Though there are edge cases where the compiler cannot optimize certain code paths, they are quite rare. The best example that comes to mind is rANS renormalization, which is why rANS static exists.
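For reference, the decoder-side renormalization loop in question looks roughly like this (following the usual byte-wise formulation from ryg's rans_byte.h; the constant and names here are the conventional ones, not something from this thread). The data-dependent trip count is exactly what compilers struggle with:

```c
#include <stdint.h>

#define RANS_L (1u << 23)   /* lower bound of the normalized state interval */

/* Pull bytes from the stream until the state is back in
   [RANS_L, RANS_L * 256). How many iterations this takes depends on
   the data itself, which is why the compiler can't do much with it. */
static inline uint32_t rans_renorm(uint32_t x, const uint8_t **pptr) {
    while (x < RANS_L)
        x = (x << 8) | *(*pptr)++;
    return x;
}
```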
Word of advice; try using godbolt to inspect the assembly output of your C/C++, it really helps you understand what the compiler thinks you're trying to do with your functions.
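Even a function this small tells you a lot when pasted into godbolt (my example, nothing from the thread): with optimizations on, gcc/clang turn it into SIMD loads and adds, and if you see scalar code instead you know something (aliasing, target flags) is blocking vectorization:

```c
#include <stddef.h>

/* restrict promises the compiler the buffers don't overlap, so it can
   vectorize without emitting runtime overlap checks. Try -O2/-O3. */
void add_arrays(float *restrict dst, const float *restrict src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}
```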
"It may have happened because there are no registers left for OoO execution."
A modern OoO CPU has 100-200 real registers exactly for this reason.
btw, here's a series of posts exactly about efficient bit reading: https://fgiesen.wordpress.com/2018/0...y-ways-part-1/
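The core trick from that series is a word-at-a-time refill, so the hot path never branches per bit. A condensed sketch (my paraphrase of the "variant 4" style reader, not code from the posts; assumes little-endian input padded with at least 8 spare bytes):

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    const uint8_t *ptr;   /* next byte to load from */
    uint64_t buf;         /* pending bits, LSB-first */
    unsigned count;       /* number of valid bits in buf */
} BitReader;              /* init: ptr = input, buf = 0, count = 0 */

/* Refill with one unaligned 8-byte load; afterwards at least 56 bits
   are valid, so reads of up to 56 bits never need a loop. */
static void br_refill(BitReader *br) {
    uint64_t w;
    memcpy(&w, br->ptr, 8);            /* compilers turn this into one mov */
    br->buf |= w << br->count;         /* append above the bits we still hold */
    br->ptr += (63 - br->count) >> 3;  /* advance by the whole bytes consumed */
    br->count |= 56;                   /* count is now in 56..63 */
}

static uint64_t br_read(BitReader *br, unsigned nbits) {  /* 1 <= nbits <= 56 */
    if (br->count < nbits) br_refill(br);
    uint64_t v = br->buf & ((1ull << nbits) - 1);
    br->buf >>= nbits;
    br->count -= nbits;
    return v;
}
```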