I'm looking for a good, fast, free, and understandable SIMD library or two for x86 with SSE & ARM with NEON. (Support for other platforms would be nice, since I want to give away anything good I come up with, but those two platforms are the only two I personally need.)

I need to be able to do mostly simple things, like deinterleave three or four interleaved streams of bytes or shorts and reinterleave them.
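To make that concrete, here is a plain scalar C sketch of the operation I mean, assuming byte-sized fields and a planar output layout (one contiguous run per stream). The names `deinterleave_u8` and `interleave_u8` are made up for illustration and don't come from any library; what I'm after is a vectorized equivalent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from any library): split nrecords records of
   nstreams bytes each into nstreams contiguous planes of nrecords bytes. */
void deinterleave_u8(const uint8_t *in, size_t nrecords,
                     size_t nstreams, uint8_t *out)
{
    assert(nstreams > 0);
    for (size_t i = 0; i < nrecords; i++)
        for (size_t s = 0; s < nstreams; s++)
            out[s * nrecords + i] = in[i * nstreams + s];
}

/* Inverse of the above: merge the planes back into interleaved records. */
void interleave_u8(const uint8_t *in, size_t nrecords,
                   size_t nstreams, uint8_t *out)
{
    assert(nstreams > 0);
    for (size_t i = 0; i < nrecords; i++)
        for (size_t s = 0; s < nstreams; s++)
            out[i * nstreams + s] = in[s * nrecords + i];
}
```

With three streams, the input `a0 b0 c0 a1 b1 c1 ...` becomes `a0 a1 ... b0 b1 ... c0 c1 ...`, and the second function undoes it.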

Support for floats and miscellaneous simple media/numerical transforms would be nice, e.g., SIMD delta coding or linear prediction for audio & graphics streams. Elegant coding & raw speed are more important than fancy features, though, and I'm mainly looking for a sound foundation on which to start writing custom code. (I'm a SIMD noob, and a good framework might reduce the stupid design mistakes I'll make.) Pre-built stuff specifically for graphics (e.g., the LOCO-I predictors used in JPEG-LS) or audio would be nice, but is optional. I will mainly be punting to specific higher-level libraries for very common media formats like 24-bit RGB or 16-bit stereo audio, and focusing on detecting and exploiting arrays of structured records. (The sao star catalog from the Silesia corpus is a good example.)
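By delta coding I mean something like the following scalar sketch, assuming unsigned bytes with wraparound (mod 256) arithmetic. The function names are hypothetical and only for illustration; the decoder is a running sum, which is the part I'd expect a SIMD library to help with.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: replace each byte with its difference from the
   previous byte (mod 256). The first byte is stored unchanged. */
void delta_encode_u8(const uint8_t *in, size_t n, uint8_t *out)
{
    assert(in != out); /* a forward in-place pass would clobber in[i - 1] */
    if (n == 0)
        return;
    out[0] = in[0];
    for (size_t i = 1; i < n; i++)
        out[i] = (uint8_t)(in[i] - in[i - 1]);
}

/* Inverse: a running (prefix) sum mod 256 restores the original bytes. */
void delta_decode_u8(const uint8_t *in, size_t n, uint8_t *out)
{
    if (n == 0)
        return;
    out[0] = in[0];
    for (size_t i = 1; i < n; i++)
        out[i] = (uint8_t)(out[i - 1] + in[i]);
}
```

Applied after deinterleaving, each field of a record array gets its own delta stream, which is the structure I want to exploit.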

I will be dealing with only small power-of-two-sized blocks of data (4KB or 16KB) and want to do things in chunks that will typically stay in modest-sized caches.

The ideal package would be a popular one that's a foundation for other good packages, e.g., something that others have already built image- or audio-processing libraries on top of, so that it's likely to support new stuff like AVX on a reasonable timescale.

Any recommendations?