Floating-point instructions are specified with different accuracy guarantees, but for many operations the result should be exact in the IEEE 754 sense, i.e. correctly rounded. I think that's the case for at least addition (i.e. addition should behave identically on every processor, provided that it's done with the same precision). Here's a quote from Intel's paper:

> 2.2.14 IEEE 754 Compliance
>
> The six SSE4.1 instructions that perform floating-point arithmetic are:
>
> - DPPS
> - DPPD
> - ROUNDPS
> - ROUNDPD
> - ROUNDSS
> - ROUNDSD
>
> Dot Product operations are not specified in IEEE-754. When neither FTZ nor DAZ are enabled, the dot product instructions resemble sequences of IEEE-754 multiplies and adds (with rounding at each stage), except that the treatment of input NaN's is implementation specific (there will be at least one NaN in the output). The input select fields (bits imm8[4:7]) force input elements to +0.0f prior to the first multiply and will suppress input exceptions that would otherwise have been generated.

I'm not 100% sure, but that could mean that the result of DPPS can be reproduced exactly on processors without SSE4, using ordinary multiplies and adds. The only caveat I can think of is optimization. I would strongly recommend using volatile asm statements or maybe intrinsics to keep full control over the generated code.
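To make "sequences of IEEE-754 multiplies and adds (with rounding at each stage)" concrete, here is a minimal sketch in plain C. The function names `dot4_pairwise` and `dot4_sequential` are made up for illustration and are not Intel APIs. The pairwise summation order is an assumption on my part; check the DPPS pseudo-code in Intel's instruction reference before relying on it. Storing every intermediate in a `volatile float` is one way to force rounding to single precision at each step and to stop the compiler from fusing a multiply and an add.

```c
/* Hypothetical helper, not an Intel API: a 4-element float dot product
   written as separate multiplies and adds. Each intermediate goes through
   a volatile float, so it is rounded to single precision and the compiler
   cannot contract a multiply and an add into one fused operation. */
float dot4_pairwise(const float a[4], const float b[4])
{
    volatile float p0 = a[0] * b[0];
    volatile float p1 = a[1] * b[1];
    volatile float p2 = a[2] * b[2];
    volatile float p3 = a[3] * b[3];

    /* Assumed summation order: (p0 + p1) + (p2 + p3). */
    volatile float lo = p0 + p1;
    volatile float hi = p2 + p3;
    volatile float sum = lo + hi;
    return sum;
}

/* Same products, summed left to right, for comparison. */
float dot4_sequential(const float a[4], const float b[4])
{
    volatile float p0 = a[0] * b[0];
    volatile float p1 = a[1] * b[1];
    volatile float p2 = a[2] * b[2];
    volatile float p3 = a[3] * b[3];

    volatile float s = p0 + p1;
    s = s + p2;
    s = s + p3;
    return s;
}
```

The order of the additions matters for reproducibility. With a = {1e8f, 1.0f, -1e8f, 1.0f} and b = {1.0f, 1.0f, 1.0f, 1.0f}, `dot4_pairwise` returns 0.0f while `dot4_sequential` returns 1.0f, because 1e8f + 1.0f rounds back to 1e8f in single precision. An emulation only matches the hardware if it uses the same order.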

But then you would have to avoid instructions whose results aren't exactly specified by the IEEE 754 standard, and that could negate the performance advantages or make the code too complex to maintain.

Here's a quote from NVIDIA's paper:

> 2.2 Operations and Accuracy
>
> The IEEE 754 standard requires support for a handful of operations. These include the arithmetic operations add, subtract, multiply, divide, square root, fused-multiply-add, remainder, conversion operations, scaling, sign operations, and comparisons. The results of these operations are guaranteed to be the same for all implementations of the standard, for a given format and rounding mode.

Overall, having reproducible results with floating point is tricky, but not impossible, I think.
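If you want to check the "same for all implementations" guarantee on your own machines, compare results bit for bit instead of printing decimals. Below is a minimal sketch; `float_bits` and `add_f32` are hypothetical helper names, and the code assumes that `float` is the 32-bit IEEE 754 binary32 format on your platform.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the raw encoding of x, assuming float is
   a 32-bit IEEE 754 binary32 value. memcpy avoids aliasing problems. */
uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Adds two floats; the volatile store forces the sum to be rounded to
   single precision even if the compiler evaluates it in a wider format. */
float add_f32(float a, float b)
{
    volatile float r = a + b;
    return r;
}
```

For example, `float_bits(add_f32(0.5f, 0.25f))` should be 0x3F400000 on every conforming implementation in the default rounding mode. Logging such bit patterns on two machines and diffing them is a simple reproducibility test, and unlike `==` it also distinguishes +0.0f from -0.0f.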