Hello,
I would like to know how big the penalty is when someone mixes data types (integers and floating-point values) when executing SIMD instructions on the x86 architecture. For this purpose I have prepared a test program that executes a few SIMD instructions and counts the number of machine cycles taken.
I have only two computers, and from what I have read their processors have no penalty in this situation, whereas other processors do. For this reason I would appreciate it if you folks could run the attached test software on your processor and report the measurement results together with the name of your processor.
Test results (output of the program):
Code:
what this program does
----------------------
This program runs several pieces of code with SIMD instructions and measures the execution time in machine cycles.
The code pieces are designed to check whether there are processors that incur a penalty when executing SIMD instructions with mixed data types.
The program runs several tests. Before executing a piece of test code, the program shows the code. During a test, the program executes the code 4,096 times in a loop. The execution time of the loop footer (the loop-control instructions) is included in the measurement result. The measurement result shows the execution time of the whole loop (several tens of thousands of machine cycles) plus a few extra instructions that keep track of the measurement results (fewer than 100 machine cycles). The loop is executed many times and only the lowest execution time is shown, because it contains the least interference from other programs and the operating system.
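The "run the loop many times and keep the lowest time" idea can be sketched as follows. This is only an illustration of the measurement strategy, not the test program itself: Python cannot read the time-stamp counter, so `time.perf_counter_ns` stands in for "rdtsc", and the names `REPEATS`, `time_loop_once`, `best_time` and `empty_body` are made up for this sketch.

```python
import time

ITERATIONS = 4096   # loop count stated in the description above
REPEATS = 200       # hypothetical value; the description only says "many times"

def time_loop_once(body):
    """Time one full loop of ITERATIONS calls to body, in nanoseconds.
    The loop overhead itself is included, as in the description."""
    start = time.perf_counter_ns()
    for _ in range(ITERATIONS):
        body()
    return time.perf_counter_ns() - start

def best_time(body, repeats=REPEATS):
    """Repeat the measurement and keep the lowest value, i.e. the run that
    was disturbed least by other programs and the operating system."""
    return min(time_loop_once(body) for _ in range(repeats))

def empty_body():
    """Stand-in for test -1 (empty loop body)."""
    pass

baseline = best_time(empty_body)
print("empty loop, best of", REPEATS, "runs:", baseline, "ns")
```

Taking the minimum rather than the average is the point of the design: interference can only add time, so the smallest value is the closest to the undisturbed cost.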
possible crash warning
----------------------
The program executes
- the instruction "read time-stamp counter" ("rdtsc"), which has existed since 1993, when the Pentium processor was introduced,
- instructions from the instruction set "streaming single instruction multiple data extensions" ("SSE"), which has existed since 26th February 1999, when the Pentium 3 processor was introduced, and
- instructions from the instruction set "streaming single instruction multiple data extensions 2" ("SSE2"), which has existed since 20th November 2000, when the Pentium 4 processor was introduced.
Even though the requirements for running this program are probably met by most processors, there is still a slight chance that your processor does not support one of the instructions, even if your processor is relatively new. This simple program does not check whether your processor meets the execution requirements and therefore might crash instead of printing the measurement results.
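If you want to check the requirements yourself before running the test, something like the following could do it on Linux. This is a rough sketch under assumptions: it relies on the "flags" line of /proc/cpuinfo (which lists "tsc", "sse" and "sse2" on x86 Linux systems), it does nothing useful on other operating systems, and the function names are made up for this example.

```python
REQUIRED_FLAGS = {"tsc", "sse", "sse2"}  # rdtsc, SSE and SSE2 as named above

def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()

def missing_features(cpuinfo_text, required=REQUIRED_FLAGS):
    """Return the required flags that are not listed, sorted by name."""
    return sorted(set(required) - parse_cpu_flags(cpuinfo_text))

def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file; return None if it is not available."""
    try:
        with open(path, encoding="utf-8", errors="replace") as f:
            return f.read()
    except OSError:
        return None

text = read_cpuinfo()
if text is None:
    print("cannot read /proc/cpuinfo; feature check skipped")
else:
    missing = missing_features(text)
    print("missing features:", ", ".join(missing) if missing else "none")
```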
test -1 (empty loop body)
-------------------------
data movement: none
arithmetic: none
code:
none
execution time in machine cycles: 8218
test 0
------
data movement: integer/general purpose
arithmetic: integer
code:
# move aligned double quadword (integer/general purpose; SSE2)
# movdqa xmm0, RAM[eax]
66 0F 6F 00
# subtract packed integers (integer; SSE2)
# psub xmm0, xmm1
66 0F FA C1
# subtract packed integers (integer; SSE2)
# psub xmm0, xmm2
66 0F FA C2
# subtract packed integers (integer; SSE2)
# psub xmm0, xmm3
66 0F FA C3
# subtract packed integers (integer; SSE2)
# psub xmm0, xmm4
66 0F FA C4
# move aligned double quadword (integer/general purpose; SSE2)
# movdqa RAM[eax], xmm0
66 0F 7F 00
execution time in machine cycles: 53284
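For anyone who wants to verify the hex bytes against the comments: the last byte of each instruction shown is the ModRM byte, which selects the operands. Below is a small sketch that splits it into its three fields. It only covers the two addressing forms used in these tests and assumes 32-bit addressing (so mod 00 with r/m 000 means [eax]); the function names are made up. As a side note, Intel's instruction reference lists the opcode 66 0F FA as PSUBD (subtract packed doubleword integers), which the output writes as "psub".

```python
def decode_modrm(byte):
    """Split a ModRM byte into its mod (2 bits), reg (3 bits) and r/m (3 bits) fields."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111

def describe(modrm):
    """Name the reg and r/m operands for the two forms used in the tests.
    Which one is source and which is destination depends on the opcode."""
    mod, reg, rm = decode_modrm(modrm)
    if mod == 0b11:
        return f"reg=xmm{reg}, r/m=xmm{rm}"      # register-to-register form
    if mod == 0b00 and rm == 0b000:
        return f"reg=xmm{reg}, r/m=[eax]"        # memory operand addressed by eax
    raise ValueError("addressing form not covered by this sketch")

for encoding in ("66 0F 6F 00", "66 0F FA C1", "66 0F FA C4", "66 0F 7F 00"):
    modrm = int(encoding.split()[-1], 16)
    print(encoding, "->", describe(modrm))
```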
test 1
------
data movement: floating point
arithmetic: integer
code:
# move aligned packed single-precision floating-point values (floating point; SSE)
# movaps xmm0, RAM[eax]
0F 28 00
# subtract packed integers (integer; SSE2)
# psub xmm0, xmm1
66 0F FA C1
# subtract packed integers (integer; SSE2)
# psub xmm0, xmm2
66 0F FA C2
# subtract packed integers (integer; SSE2)
# psub xmm0, xmm3
66 0F FA C3
# subtract packed integers (integer; SSE2)
# psub xmm0, xmm4
66 0F FA C4
# move aligned packed single-precision floating-point values (floating point; SSE)
# movaps RAM[eax], xmm0
0F 29 00
execution time in machine cycles: 53284
test 2
------
data movement: floating point
arithmetic: floating point
code:
# move aligned packed single-precision floating-point values (floating point; SSE)
# movaps xmm0, RAM[eax]
0F 28 00
# subtract packed single-precision floating-point values (floating point; SSE)
# subps xmm0, xmm1
0F 5C C1
# subtract packed single-precision floating-point values (floating point; SSE)
# subps xmm0, xmm2
0F 5C C2
# subtract packed single-precision floating-point values (floating point; SSE)
# subps xmm0, xmm3
0F 5C C3
# subtract packed single-precision floating-point values (floating point; SSE)
# subps xmm0, xmm4
0F 5C C4
# move aligned packed single-precision floating-point values (floating point; SSE)
# movaps RAM[eax], xmm0
0F 29 00
execution time in machine cycles: 5435423
test 3
------
data movement: floating point and integer
arithmetic: floating point and integer
code:
# move aligned packed single-precision floating-point values (floating point; SSE)
# movaps xmm0, RAM[eax]
0F 28 00
# subtract packed single-precision floating-point values (floating point; SSE)
# subps xmm0, xmm1
0F 5C C1
# subtract packed integers (integer; SSE2)
# psub xmm0, xmm2
66 0F FA C2
# subtract packed single-precision floating-point values (floating point; SSE)
# subps xmm0, xmm3
0F 5C C3
# subtract packed integers (integer; SSE2)
# psub xmm0, xmm4
66 0F FA C4
# move aligned double quadword (integer/general purpose; SSE2)
# movdqa RAM[eax], xmm0
66 0F 7F 00
execution time in machine cycles: 2968889
I have no idea why the execution times of tests 2 and 3 are so high.
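One possible cause, and this is only a guess that the measurements here do not prove: many processors handle subnormal (denormal) floating-point values much more slowly than normal ones. If the registers or the memory operand hold small integer bit patterns, "subps" interprets them as single-precision numbers with a zero exponent field, which are subnormal. The sketch below shows which 32-bit patterns fall into that range; the sample values are arbitrary and not taken from the test program.

```python
import struct

SMALLEST_NORMAL = 2.0 ** -126  # smallest positive normal single-precision value

def bits_to_float(bits):
    """Reinterpret a 32-bit integer bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def is_subnormal(bits):
    """True if the exponent field is zero and the mantissa is non-zero."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return exponent == 0 and mantissa != 0

# Arbitrary sample patterns: small integers, the subnormal/normal boundary, and 1.0
for bits in (1, 4096, 0x007FFFFF, 0x00800000, 0x3F800000):
    print(f"0x{bits:08X} -> {bits_to_float(bits)!r}, subnormal: {is_subnormal(bits)}")
```

If this is what happens, filling the operands with ordinary values such as 1.0 before the floating-point tests should bring the times down; whether it does would have to be measured.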
I would appreciate your help in testing. Thank you.
Test results so far:
Code:
with wrap-around:
┌─────────────────────────────────────────────────┬───────────────────────────────────────────────────────────────┬─────────────┐
│processor                                        │processing time for the test                                   │thanks to    │
├───────────────────────────────────┬─────────────┼───────────┬──────────┬──────────┬──────────┬──────────────────┤             │
│manufacturer                       │model        │-1         │0         │1         │2         │3                 │             │
│                                   │             │move: none │move: int │move: flo │move: flo │move: flo and int │             │
│                                   │             │arith: none│arith: int│arith: int│arith: flo│arith: flo and int│             │
├───────────────────────────────────┼─────────────┼───────────┼──────────┼──────────┼──────────┼──────────────────┼─────────────┤
│Intel Corporation                  │Atom D2550   │       8218│     53284│     53284│   5435423│           2968889│just a worm  │
└───────────────────────────────────┴─────────────┴───────────┴──────────┴──────────┴──────────┴──────────────────┴─────────────┘
without wrap-around:
┌─────────────────────────────────────────────────┬───────────────────────────────────────────────────────────────┬─────────────┐
│processor                                        │processing time for the test                                   │thanks to    │
├───────────────────────────────────┬─────────────┼───────────┬──────────┬──────────┬──────────┬──────────────────┤             │
│manufacturer                       │model        │-1         │0         │1         │2         │3                 │             │
│                                   │             │move: none │move: int │move: flo │move: flo │move: flo and int │             │
│                                   │             │arith: none│arith: int│arith: int│arith: flo│arith: flo and int│             │
├───────────────────────────────────┼─────────────┼───────────┼──────────┼──────────┼──────────┼──────────────────┼─────────────┤
│Advanced Micro Devices Incorporated│A8-5500      │       7541│     71059│     71060│    119603│            112137│Kennon Conrad│
│Intel Corporation                  │Atom D2550   │       8218│     53291│     53291│    118832│             86058│just a worm  │
│Intel Corporation                  │Atom E3815   │       4136│     36905│     36905│     68310│             49192│just a worm  │
│Intel Corporation                  │Core i7-4790K│       3747│     37270│     37270│     67061│             63338│Kennon Conrad│
└───────────────────────────────────┴─────────────┴───────────┴──────────┴──────────┴──────────┴──────────────────┴─────────────┘
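To make the numbers easier to compare, the totals can be turned into approximate cycles per loop iteration by subtracting the empty-loop time (test -1) and dividing by the 4,096 iterations. The sketch below does that for the results without wrap-around. Subtracting the empty loop is only a rough correction, since a processor may overlap the loop overhead with the test code; the function name is made up for this example.

```python
ITERATIONS = 4096  # loop count used by the test program

# (model, test -1, test 0, test 1, test 2, test 3), copied from the
# table of results without wrap-around
RESULTS = [
    ("A8-5500",       7541, 71059, 71060, 119603, 112137),
    ("Atom D2550",    8218, 53291, 53291, 118832,  86058),
    ("Atom E3815",    4136, 36905, 36905,  68310,  49192),
    ("Core i7-4790K", 3747, 37270, 37270,  67061,  63338),
]

def cycles_per_iteration(total, empty_loop):
    """Approximate cost of one loop body in machine cycles."""
    return (total - empty_loop) / ITERATIONS

print(f"{'model':14s}  test 0  test 1  test 2  test 3")
for model, empty, *tests in RESULTS:
    per_iter = [cycles_per_iteration(t, empty) for t in tests]
    print(f"{model:14s}", "  ".join(f"{c:6.2f}" for c in per_iter))
```

On all four processors, tests 0 and 1 differ by at most one cycle over the whole loop, so these measurements show no measurable penalty for loading with "movaps" before integer arithmetic.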