This is how zlib does it:
Code:
uint getbits( uint need ) {
  uint val = bitbuf;                      // bitbuf/bitpos = bit cache and its fill level
  while( bitpos<need ) {
    val |= (uint)(*inpptr++) << bitpos;   // refill: new byte goes above the cached bits
    bitpos += 8;
  }
  bitbuf = val >> need;                   // keep the leftover bits
  bitpos -= need;
  return val & ((1<<need)-1);             // oldest bits come out from the bottom (lsb-to-msb)
}
void putbits( uint val, uint len ) {
  if( bi_valid+len > 16 ) {                  // the 16-bit cache would overflow
    bi_buf |= val << bi_valid;
    pending_buf[pending++] = bi_buf & 0xff;  // flush the full 16-bit word, lsb first
    pending_buf[pending++] = bi_buf >> 8;
    bi_buf = val >> (16-bi_valid);           // keep the bits that didn't fit
    bi_valid += len - 16;
  } else {
    bi_buf |= val << bi_valid;               // just append above the pending bits
    bi_valid += len;
  }
}
I don't like the implementation (though in fact, I didn't see _anything_ implemented properly in zlib),
but the bit layout there (lsb-to-msb) is different from what I normally use, and it could be a decoding speed
optimization: shifting the new byte into place can be faster than shifting the whole cache and then adding the new value.
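To illustrate what I mean (just a sketch, not zlib's code; `in`/`bits`/`count` are made-up decoder state, and it only handles need<=24):
Code:
#include <stdint.h>

static const uint8_t *in;  // input pointer
static uint32_t bits;      // bit cache
static uint32_t count;     // number of valid bits in the cache

// lsb-to-msb (zlib style): refill shifts the *new* byte, the cache stays put
static uint32_t getbits_lsb( uint32_t need ) {
  while( count<need ) {
    bits |= (uint32_t)(*in++) << count;    // new byte lands above the old bits
    count += 8;
  }
  uint32_t val = bits & ((1u<<need)-1);    // oldest bits come out from the bottom
  bits >>= need;
  count -= need;
  return val;
}

// msb-to-lsb: refill shifts the whole cache, then adds the new byte
static uint32_t getbits_msb( uint32_t need ) {
  while( count<need ) {
    bits = (bits<<8) | *in++;              // old contents have to move on every refill
    count += 8;
  }
  count -= need;
  return (bits>>count) & ((1u<<need)-1);   // oldest bits come out from the top
}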
Anyway, afaik, bitfield i/o is normally very similar to rangecoder i/o (bitfields are accumulated in a cache
register, which is flushed when it fills up). Of course, it's also possible to read/write bitfields directly in the data,
but that would be _much_ slower.
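I.e. roughly like this (a sketch of the cache-register idea only; `out`/`cache`/`used` are made-up names, and it flushes a byte at a time like a rangecoder renorm loop):
Code:
#include <stdint.h>

static uint8_t *out;     // output pointer
static uint64_t cache;   // bit accumulator, lsb-to-msb
static uint32_t used;    // number of pending bits in the cache

static void putbits( uint32_t val, uint32_t len ) {
  cache |= (uint64_t)val << used;  // append above the pending bits
  used  += len;
  while( used>=8 ) {               // flush whole bytes, like rc renormalization
    *out++ = (uint8_t)cache;
    cache >>= 8;
    used  -= 8;
  }
}

static void flushbits( void ) {    // pad and flush the tail at the end of the stream
  if( used>0 ) {
    *out++ = (uint8_t)cache;
    cache = 0;
    used  = 0;
  }
}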
As to possible optimizations, I guess we can try msb-to-lsb vs lsb-to-msb layouts, aligned reads/writes
(see how zlib writes 16 bits at a time? we can do 32 bits now), and replacing branches with binary logic.
But overall it's very simple anyway, and I really wonder whether any of it would matter on modern cpus...
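For example, the 32-bit write could look something like this (again just a sketch, not tested; assumes a little-endian cpu, unaligned stores, and len<=32; names are made up):
Code:
#include <stdint.h>
#include <string.h>

static uint8_t *out;     // output pointer
static uint64_t cache;   // up to 63 pending bits, lsb-to-msb
static uint32_t used;    // number of pending bits

static void putbits32( uint32_t val, uint32_t len ) {
  cache |= (uint64_t)val << used;  // append above the pending bits
  used  += len;
  if( used>=32 ) {                 // single branch, one 32-bit store per flush
    uint32_t w = (uint32_t)cache;
    memcpy( out, &w, 4 );          // on little-endian this is the next 4 stream bytes
    out   += 4;
    cache >>= 32;
    used  -= 32;
  }
}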