Using carryless multiplication to calculcate CRCs is not new, per se --- my code more or less follows this 2009 paper from Intel: http://www.intel.com/content/dam/www...lqdq-paper.pdf. And there are already some CRC implementations that use this method. It may not be well known just how much faster it is than the table-driven methods, though, and it may have gotten faster as new Intel/AMD CPUs have been released in the past few years.
The inner loop processes 64 bytes (512 bits) per iteration, essentially like this:
Code:
const __v2di multipliers_4 = (__v2di){ 0x8F352D95, 0x1D9513D7 };
__m128i x0 = p[0], x1 = p[1], x2 = p[2], x3 = p[3];
p += 4;
for (; p != end512; p += 4) {
__m128i y0 = p[0], y1 = p[1], y2 = p[2], y3 = p[3];
y0 ^= _mm_clmulepi64_si128(x0, multipliers_4, 0x00);
y1 ^= _mm_clmulepi64_si128(x1, multipliers_4, 0x00);
y2 ^= _mm_clmulepi64_si128(x2, multipliers_4, 0x00);
y3 ^= _mm_clmulepi64_si128(x3, multipliers_4, 0x00);
y0 ^= _mm_clmulepi64_si128(x0, multipliers_4, 0x11);
y1 ^= _mm_clmulepi64_si128(x1, multipliers_4, 0x11);
y2 ^= _mm_clmulepi64_si128(x2, multipliers_4, 0x11);
y3 ^= _mm_clmulepi64_si128(x3, multipliers_4, 0x11);
x0 = y0, x1 = y1, x2 = y2, x3 = y3;
}
Each carryless multiplication and XOR "folds" 64 bits into 95 bits that are 543 (= 512 + 95 - 64) or 479 (= 512 + 95 - 128 ) bits later in the message. The precomputed constants in 'multipliers_4' are x^543 mod G(x) and x^479 mod G(x) where G(x) is the degree-32 CRC-32 generator polynomial. Note that this is *not* using the SSE4.2 "crc32l" instruction which assumes a different generator polynomial, the "Castagnoli polynomial".
Essentially the same method can be used on any other CPU that supports carryless multiplication --- e.g. ARMv8, I think, but I don't have an ARMv8 CPU to test right now.