Okay, I messed with paq8f a bit and got a DJGPP build going that loads a couple of routines from .DXE files. I know it's "old" and "obsolete", as are DJGPP and DOS, but IMHO those are the best tools for "real" benchmarking against raw hardware. (Plus, it's all I know, and I'm comfortable with it.) DXE is DJGPP's loadable module format, basically a simplistic DLL subset.
This compile isn't really a full replacement, so technically my silly fastpaq2.asm/.s would be better for real use.
But at least this way you can change the dot_product() or train() routine without recompiling the entire main .EXE. It falls back to the NOASM code if no .DXE is found. Ready-to-use NASM/YASM code is included for building MMX or SSE2 versions of dotprod.dxe and train.dxe. There are no runtime CPU checks, so make sure your CPU supports whichever version you load!!
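For anyone wondering what "defaults to NOASM" amounts to, here is a rough sketch of the idea, not the actual code from the release. The function signatures are assumed to match paq8f's dot_product() and train(), and load_module(), init_kernels(), dot_product_c() and train_c() are made-up names for illustration. The only DJGPP-specific call is _dxe_load() from <sys/dxe.h>, which, as far as I know, returns the module's single entry point or NULL on failure.

```cpp
#include <cstddef>

#ifdef __DJGPP__
#include <sys/dxe.h>   // DXE1 loader, provides _dxe_load()
#endif

// Assumed signatures (modelled on paq8f's mixer helpers).
typedef int  (*dot_product_fn)(short *t, short *w, int n);
typedef void (*train_fn)(short *t, short *w, int n, int err);

// Plain C++ fallback, NOASM style. n is assumed to be even.
// Relies on >> being an arithmetic shift for negative ints (true for GCC).
static int dot_product_c(short *t, short *w, int n) {
  int sum = 0;
  for (int i = 0; i < n; i += 2)
    sum += (t[i] * w[i] + t[i + 1] * w[i + 1]) >> 8;
  return sum;
}

// Nudge each weight by input * error, rounded, then clamp to 16 bits.
static void train_c(short *t, short *w, int n, int err) {
  for (int i = 0; i < n; ++i) {
    int wt = w[i] + ((((t[i] * err * 2) >> 16) + 1) >> 1);
    if (wt < -32768) wt = -32768;
    if (wt > 32767) wt = 32767;
    w[i] = (short)wt;
  }
}

// Hypothetical helper: entry point of a .DXE, or NULL if it can't be loaded.
// Outside DJGPP it always returns NULL, so the fallback is used.
static void *load_module(const char *name) {
#ifdef __DJGPP__
  return _dxe_load(const_cast<char *>(name));
#else
  (void)name;
  return NULL;
#endif
}

// The rest of the program calls through these pointers.
static dot_product_fn dot_product = dot_product_c;
static train_fn train = train_c;

// Swap in the .DXE versions where available, one module per routine.
static void init_kernels() {
  if (void *p = load_module("dotprod.dxe")) dot_product = (dot_product_fn)p;
  if (void *p = load_module("train.dxe")) train = (train_fn)p;
}
```

Call init_kernels() once at startup; after that dot_product(t, w, n) and train(t, w, n, err) go to whichever version got picked. Since nothing checks the CPU, a module built for SSE2 will still be loaded happily on a CPU without it.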
I just barely got DXE1 working. I'm sure DXE3 (DJGPP 2.04 only) would probably be better, since dlopen/dlsym would let me use only one .DXE instead of two, but at least this way (_dxe_load) also works with 2.03p2. (Not entirely sure, but it seems you can't mix .DXEs from 2.03p2 and 2.04. Not a big deal, esp. since this is a source-only release.) I didn't bother with the unofficial DJELF fork either: it supports .so files, but its output won't UPX anyway, so ....
In short, this isn't a perfect example, but at least this way I can easily/quickly test on my old P166 (or a P4 or AMD64x2 booted from a FreeDOS bootdisk) with various compiles of the NOASM stuff (GCC 3.4.4 with -march=pentium or -mtune=i686, GCC 3.2.3, 4.0.1, 4.2.3, 4.3.3, 4.4.2, etc.) to see which is fastest.
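If you want to time just the routine instead of a whole compression run, something like the following sketch would do. It is only an illustration: bench_dot(), BenchResult and dot_product_plain() are names I made up, the kernel signature is assumed, and it uses plain std::clock() so it builds anywhere.

```cpp
#include <ctime>

// Assumed kernel signature for the routine under test.
typedef int (*dot_fn)(const short *t, const short *w, int n);

// Stand-in kernel; swap in whichever compile you want to measure.
static int dot_product_plain(const short *t, const short *w, int n) {
  int sum = 0;
  for (int i = 0; i < n; i += 2)
    sum += (t[i] * w[i] + t[i + 1] * w[i + 1]) >> 8;
  return sum;
}

struct BenchResult {
  double seconds;          // CPU time spent in the timed loop
  unsigned long checksum;  // sum of results, to confirm the work was done
};

// Runs kernel over n elements (capped at 1024, rounded down to even)
// reps times and reports the elapsed time plus a checksum.
static BenchResult bench_dot(dot_fn kernel, int n, int reps) {
  static short t[1024], w[1024];
  if (n > 1024) n = 1024;
  n &= ~1;
  for (int i = 0; i < n; ++i) {
    t[i] = 256;
    w[i] = (short)(i + 1);
  }

  // Calling through a volatile pointer keeps the compiler from
  // inlining the kernel and hoisting it out of the loop.
  dot_fn volatile call = kernel;

  BenchResult r;
  r.seconds = 0.0;
  r.checksum = 0;
  std::clock_t start = std::clock();
  for (int k = 0; k < reps; ++k)
    r.checksum += (unsigned long)call(t, w, n);
  std::clock_t stop = std::clock();
  r.seconds = double(stop - start) / CLOCKS_PER_SEC;
  return r;
}
```

Usage would be along the lines of bench_dot(dot_product_plain, 1024, 1000000), run once per compiler/flag combination. The timer under DOS is coarse, so use enough reps that a run takes at least a few seconds, and compare checksums between builds to be sure they computed the same thing.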
Testers welcome!