Damn.
Suppose that we have some simplified LZ code:
Code:
struct {
word offset;
byte length;
byte literal;
} lzdata[n_lzrecords];
Then there's the common way to decode it, the way
most LZ decoders work - parsing and interpreting
the "instructions" in a single pass:
Code:
byte* p = output_buffer;
for( i=0; i<n_lzrecords; i++ ) {
  // copy byte by byte: memcpy returns a pointer, not a length,
  // and a match may overlap its own output (offset < length)
  for( j=0; j<lzdata[i].length; j++ ) p[j] = p[j-lzdata[i].offset];
  p += lzdata[i].length;
  *p++ = lzdata[i].literal;
}
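For clarity, here's a self-contained, compilable version of that single-pass decoder (the `LzRecord` struct mirrors the lzdata layout above; the function name is mine):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// same layout as the lzdata struct above
struct LzRecord { uint16_t offset; uint8_t length; uint8_t literal; };

// single-pass decode: copy the match, then append the literal
std::string lz_decode(const std::vector<LzRecord>& lzdata) {
    std::string out;
    for (const LzRecord& r : lzdata) {
        size_t start = out.size() - r.offset;  // match source position
        // byte-by-byte, since a match may overlap the bytes it produces
        for (size_t j = 0; j < r.length; j++)
            out += out[start + j];
        out += (char)r.literal;
    }
    return out;
}
```

Note the byte-by-byte match copy: with offset < length the match reads bytes it has just written, which a plain memcpy doesn't guarantee.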
But it can also be implemented in a different way,
with 2 passes - parsing/compiling and actual execution:
Code:
// code generation (32-bit cdecl; esi/edi are callee-saved, so preserve them)
byte* q = code_buffer;
*q++ = 0x57;                            // push edi
*q++ = 0x56;                            // push esi
*q++ = 0x8B; *q++ = 0x7C;
*q++ = 0x24; *q++ = 0x0C;               // mov edi,[esp+12] ; output_buffer arg
*q++ = 0x33; *q++ = 0xC9;               // xor ecx,ecx
for( i=0; i<n_lzrecords; i++ ) {
  *q++ = 0x8D; *q++ = 0xB7;             // lea esi,[edi-offset]
  (uint&)*q = -lzdata[i].offset; q += 4;
  *q++ = 0xB1; *q++ = lzdata[i].length; // mov cl,length
  *q++ = 0xF3; *q++ = 0xA4;             // rep movsb
  *q++ = 0xB0; *q++ = lzdata[i].literal; // mov al,literal
  *q++ = 0xAA;                          // stosb
}
*q++ = 0x5E;                            // pop esi
*q++ = 0x5F;                            // pop edi
*q++ = 0xC3;                            // ret
// uncompressed data generation (code_buffer has to be executable memory)
((void(*)(byte*))(void*)code_buffer)( output_buffer );
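The raw x86 bytes are hard to test portably, so here's a sketch of the same two-pass split in plain C++, "compiling" the records into a flat command stream instead of machine code (the 4-byte command format and the function names are my own, for illustration). Pass 1 reads only lzdata and writes only the code buffer; pass 2 reads only the code buffer and writes only the output:

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct LzRecord { uint16_t offset; uint8_t length; uint8_t literal; };

// pass 1: "compile" - serialize each record as 4 bytes:
// length, offset low, offset high, literal
std::vector<uint8_t> lz_compile(const std::vector<LzRecord>& lzdata) {
    std::vector<uint8_t> code;
    for (const LzRecord& r : lzdata) {
        code.push_back(r.length);
        code.push_back(r.offset & 0xFF);
        code.push_back(r.offset >> 8);
        code.push_back(r.literal);
    }
    return code;
}

// pass 2: "execute" - one tight loop over the command stream
std::string lz_execute(const std::vector<uint8_t>& code) {
    std::string out;
    for (size_t i = 0; i + 4 <= code.size(); i += 4) {
        uint8_t  len = code[i];
        uint16_t off = code[i+1] | (uint16_t)(code[i+2] << 8);
        size_t start = out.size() - off;
        for (size_t j = 0; j < len; j++)
            out += out[start + j];   // overlap-safe match copy
        out += (char)code[i+3];
    }
    return out;
}
```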
The idea is that there could be a speed improvement: this way
the parsing stage has no inner loop, so the C++ compiler can
generate more efficient code for it, and each stage only touches
2 data locations instead of the 3 in a normal LZ decoder, so the
L1 data cache can be used more efficiently.
Also, while it may be a bit hard to observe a real speedup from
this trick in a single-threaded decoder (the requirements would
be something like lzdata size > 1M and uncompressed data size > 100M),
it certainly makes some sense on a multicore system, as it allows
running the parsing and the uncompressed data generation in parallel.
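A minimal sketch of that parallel split (illustrative, not tuned): a parser thread compiles fixed-size batches of records into flat command blocks, where each record becomes 4 bytes (length, offset low, offset high, literal), and feeds them through a queue to the calling thread, which executes the blocks in order as they arrive. All names and the block format are mine:

```cpp
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct LzRecord { uint16_t offset; uint8_t length; uint8_t literal; };

std::string lz_decode_parallel(const std::vector<LzRecord>& lzdata,
                               size_t batch) {
    std::mutex m;
    std::condition_variable cv;
    std::deque<std::vector<uint8_t>> queue;
    bool done = false;

    // producer: compile batches of records into command blocks
    std::thread parser([&]{
        for (size_t i = 0; i < lzdata.size(); i += batch) {
            std::vector<uint8_t> code;
            for (size_t k = i; k < lzdata.size() && k < i + batch; k++) {
                code.push_back(lzdata[k].length);
                code.push_back(lzdata[k].offset & 0xFF);
                code.push_back(lzdata[k].offset >> 8);
                code.push_back(lzdata[k].literal);
            }
            std::lock_guard<std::mutex> lk(m);
            queue.push_back(std::move(code));
            cv.notify_one();
        }
        std::lock_guard<std::mutex> lk(m);
        done = true;
        cv.notify_one();
    });

    // consumer: execute blocks in order as they arrive
    std::string out;
    for (;;) {
        std::vector<uint8_t> code;
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&]{ return !queue.empty() || done; });
            if (queue.empty()) break;   // done and nothing left
            code = std::move(queue.front());
            queue.pop_front();
        }
        for (size_t i = 0; i + 4 <= code.size(); i += 4) {
            size_t start = out.size() -
                (code[i+1] | (size_t)code[i+2] << 8);
            for (size_t j = 0; j < code[i]; j++)
                out += out[start + j];   // overlap-safe match copy
            out += (char)code[i+3];
        }
    }
    parser.join();
    return out;
}
```

Execution stays sequential (each block depends on previous output), but it overlaps with the parsing of later blocks, which is where the multicore win would come from.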