I’ve been lurking on this forum for a while, and I finally got around to reading some papers and articles on arithmetic coding to get started. This past week I implemented my first arithmetic coder, which I tested with a fixed order-0 model for ASCII, an adaptive order-0 model, an adaptive order-1 model, and an adaptive order-2 model. The heart of the encoder/decoder is largely based on Mark Nelson’s example, which operates on 8-bit symbols and includes an end-of-stream symbol in the model.
Running the coder with some verbose logging added shows that the decoder needs to read a few more bits than the encoder outputs. Assuming unsigned 64-bit integral types, the decoder must read 33 bits (precision/2 + 1) at the very beginning of the file before it can decode even the first symbol, yet the encoder may produce fewer than 33 bits of output in total.
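To make the start-up read concrete, here is a minimal sketch in the spirit of Nelson’s code (the constants and the BitReader interface are my own naming, not his exact API):

```cpp
#include <cstdint>

// With a 64-bit code register, Nelson-style constants give
// CODE_VALUE_BITS = 64/2 + 1 = 33, and the decoder has to fill the
// register before it can decode anything at all.
constexpr int PRECISION       = 64;
constexpr int CODE_VALUE_BITS = PRECISION / 2 + 1;  // 33

template <typename BitReader>  // BitReader::get_bit() is an assumed interface
uint64_t init_code_value(BitReader& in) {
    uint64_t value = 0;
    for (int i = 0; i < CODE_VALUE_BITS; ++i)
        value = (value << 1) | in.get_bit();  // 33 reads before the first symbol
    return value;
}
```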
At the end of the stream, where the end-of-stream symbol acts as a terminator, the encoder can output just a few bits and terminate the encoding process, but the decoder keeps reading until all of the input bits have been shifted into the high positions. To handle both the start-of-stream and end-of-stream cases, I modified the decoder to keep reading past the end of the file, pretending that any non-existent bits are 0.
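Concretely, the workaround amounts to a bit reader along these lines (the class and its names are just for illustration):

```cpp
#include <istream>

// Returns the encoder's bits MSB-first, then an endless run of 0 bits
// once the real input is exhausted, so the decoder can shift in the
// "missing" bits at both ends of the stream.
class PaddedBitReader {
    std::istream& in;
    int buffer = 0, bits_left = 0;
public:
    explicit PaddedBitReader(std::istream& s) : in(s) {}
    int get_bit() {
        if (bits_left == 0) {
            int c = in.get();
            if (c == std::char_traits<char>::eof())
                return 0;          // pretend bits past EOF are all 0
            buffer = c;
            bits_left = 8;
        }
        --bits_left;
        return (buffer >> bits_left) & 1;  // MSB-first, matching the encoder
    }
};
```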
This is a problem because the encoded stream may be embedded in other streams, with the end of the encoded stream directly adjacent to another stream. The decoder may end up consuming bits meant for another purpose, and depending on how the IO is implemented, the consumed bits may be lost, leaving them unavailable for that other purpose. The HP Laboratories paper suggests either padding the end of the encoded stream with extra bits or prefixing the stream with the length of the data. The former is what I’m trying to avoid, as it is essentially the opposite of compression (adding spurious zero-entropy bits to the encoded data). The latter is not possible when the length of the data is not known in advance, and even when it is, I suspect it would not avoid the problem of reading extra bits: a very good model could compress so well that the encoded bits are few enough to disrupt our estimate of when we should stop reading.
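For reference, here is how I read the length-prefix suggestion (Decoder, Model, and Sink are stand-ins for whatever interfaces the coder exposes, not my actual code):

```cpp
#include <cstdint>

// The encoder writes the source-symbol count up front, and the decoder
// loops that many times, so no end-of-stream symbol is needed.
template <typename Decoder, typename Model, typename Sink>
void decode_with_count(Decoder& dec, Model& model, Sink& out, uint64_t count) {
    for (uint64_t i = 0; i < count; ++i)
        out.put(dec.decode_symbol(model));  // assumed interface
    // Note: this bounds the number of decoded *symbols*, not encoded
    // *bits*, so the decoder can still read past the end of its own stream.
}
```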
I do have some ideas for making the decoder read only the bits it needs instead of as many as it can to fill up the bit buffer, but they involve a lot of extra CPU cycles and force me to peel away a lot of the nice abstractions between the model and the decoder.