# Thread: Finite Precision Arithmetic Coder: How to know the right interval

1. ## Finite Precision Arithmetic Coder: How to know the right interval

Hi,

I am currently trying to figure out how to determine the correct binary value in the finite precision implementation of arithmetic coding (AC). I'd like to implement it with the side information about the number of symbols encoded so that I do not have to use an end of stream symbol.
I was not able to find such an implementation anywhere, so I'd like to implement it myself.

Lets go by example:
I have an alphabet with two symbols {a, b} and would like to encode the message "abab" with symbol probabilities P(a) = 0.8, P(b) = 0.2. Now going by infinite precision arithmetic coding I end up at the interval [0.7424, 0.768], use 7 bits and yield the codeword "1100000" (by using using the binary representation of the lower bound truncated after the 7th bit + adding 2^-7 to make it lie within the interval), corresponding to the probability 0.75.
Now using the usual E1-E3 scaling operations, for finite precision AC I get the following sequence of intervals:

Message to be encoded:
"abab":
i) [0, 1)
ii) [0, 0.8 ) (Encoded: "a")
iii) [0.64, 0.8 ) (Encoded: "b")
iv) [0.28, 0.6) (Scaling: E1) => Output "1"
v) [0.06, 0.7) (Scaling: E3)
vi) [0.06, 0.572) (Encoded: "a")
v) [0.4696, 0.572) (Encoded: "b")
vi) [0.3784, 0.788 ) (Scaling: 2x E3)

where the step vi) is not really required.

Now what do we have? We outputted a single bit "1" and are left with three unresolved E3 scalings. Which bit-sequence do we add to the output? It has to be "100000" to get the same codeword as above. But I am struggling with finding the rule to get the correct codeword.
I figured that the correct codeword length can be found by
Code:
`ceil(-log2(HIGH - LOW)) + 1 + numel(bin_seq) + E3_count;`
where numel(bin_seq) gives the number of bits already outputted (here: one) and E3_count is the number of E3 scalings unresolved and ceil is the round up function.

But I am still unsure about finding the correct codeword from the interval I ended up at. Anyone got an idea? I mean the codeword has to correspond to a probability within [0.64, 0.8), this can be deduced from ii).

Before I write unnecessary text, does anyone know how to correctly find the final codeword? That would be a great help, thank you :)  Reply With Quote

2. If its done only once, at the coder flush, can't you just write a loop to find the shortest possible codeword that fits within the interval?
I don't think its possible to just calculate the length, for example LOW=0111, HIGH=1111, -log(HIGH-LOW)+1=4,
but the shortest code would be "10" or something (can be also "1" or "0" depending on implementation).  Reply With Quote

3. Originally Posted by Shelwien If its done only once, at the coder flush, can't you just write a loop to find the shortest possible codeword that fits within the interval?
I don't think its possible to just calculate the length, for example LOW=0111, HIGH=1111, -log(HIGH-LOW)+1=4,
but the shortest code would be 10 or something.
Thank you for your reply! I dont want to find the length, I already know how to calculate it. I need to find a way to actually derive the correct codeword that lies in the desired interval without using an EOF symbol.

but the shortest code would be 10 or something
only in cases where you already know what you are going to code in the future. Else you need the sufficient condition of codewordlength = ceil(-log(HIGH-LOW))+1.

can't you just write a loop to find the shortest possible codeword that fits within the interval?
Well, I could write a loop where I test whether the decoder would decode the correct message, because the problem only arises at the end which keeps the complexity low in most cases. But this does not seem like an unnecessary workaround. To give context, I want to encode binary messages that are rather short (but too long for efficient huffmancode) and the overhead of an EOF symbol is crucial in this situation.  Reply With Quote

4. >> but the shortest code would be 10 or something

> only in cases where you already know what you are going to code in the future.
> Else you need the sufficient condition of codewordlength = ceil(-log(HIGH-LOW))+1.

Not quite.
If its actually EOF, we can tweak i/o to return 0s or 1s when reading after EOF.
Eg. "0" flush with padding of 1s would turn into "0111(1)" and would fit into interval.

As to "10", any "10xx(x)" fits, and we'd be able to return unnecessary bits after decoding.

>> can't you just write a loop to find the shortest possible codeword that fits within the interval?

> Well, I could write a loop where I test whether the decoder would decode
> the correct message, because the problem only arises at the end which keeps
> the complexity low in most cases.

Yes. But its not necessary to test full decoding - you just need to find the right sub-interval.
I meant something like this:
```x = HIGH;
n = ceil(-log2(HIGH));
for( l=0; l<n; l++ ) {
if( (xl>=LOW) && (xh<HIGH) ) break;
}```

> To give context, I want to encode binary messages that are rather short
> (but too long for efficient huffmancode) and the overhead of an EOF symbol
> is crucial in this situation.

I posted some similar coders here:

Although these are byte-aligned.  Reply With Quote

5. ## Thanks:

InfiniteAC (13th August 2019)

6. Maybe it is still unclear what I mean, so here is another take on it, at least I believe you are doing something else:
At the end of my encoding process I have my interval [LOW, HIGH]. Now I have another implementation of AC where I simply use doubles for simplicity and do not use any scaling operations (naive AC). This works for my current problem, because I can restrict myself to short messages and no extreme probabilities. But for some future development I need an AC that does not need to encode an EOF symbol, because I always know how many symbols will be encoded (it is a fixed number known prior both by the encoder and the decoder).  Reply With Quote

7. > I need an AC that does not need to encode an EOF symbol, because I always know how many symbols will be encoded

The thread I linked contains working implementations of various coders that do exactly that.
Most recent version even implements self-termination (AC EOF is created by adjusting flush code to cause decoding error at file EOF).
Modifying it to work with bit alignment should be simple enough, I can make a demo if necessary.
Basically just have to remove byte alignment loop from rangecoder_v0e/sh_v2f.cpp:
`      while( byte(bits>>24)!=0xFF ) outbit(1);`

(rangecoder is the same thing as AC, just optimized for bytewise i/o).  Reply With Quote

8. ## Thanks:

InfiniteAC (14th August 2019)

9. Actually here it is: http://nishi.dreamhosters.com/u/bitalign_v0.rar

I made some small changes to sh_v2f_static (like adding a flush at LF and bitoffset index in frq file).
Surprisingly, tail cutting actually reduced the compressed data size, so compressed file size for book1 is 433033 (but there's also the index).

So a script like this:
Code:
```coder c book1 1 1_frq
coder d 1 2 1_frq
md5sum 2 book1

echo line 0:
coder d0 1 con 1_frq

echo line 1:
coder d1 1 con 1_frq

echo line 16621:
coder d16621 1 con 1_frq```
works like this:
Code:
```C:\9A6R-bitalign_ari\!arc\bitalign_v0>test.bat
0a0fdbaf0589c9713bde9120cbb20199 *2
0a0fdbaf0589c9713bde9120cbb20199 *book1
line 0:
<Y 1874>

line 1:
<A T. HARDY>

line 16621:
THE END```  Reply With Quote

10. ## Thanks:

InfiniteAC (14th August 2019)

11. Thank you very much, I will have a look at it later   Reply With Quote

12. Alternatively you could just continue scaling (renormalization) until reaching [0,1) ...
Implementing AC/RC you should also remember about possibility of staying around renormalization boundary e.g. [0.499999,0.500001) - there is used readjustment to handle it (loosing ~0.5 bits each time) - see nice article https://en.wikipedia.org/wiki/Range_encoding

Also there are lots of available implementations, see e.g. references in https://sites.google.com/site/powturbo/entropy-coder  Reply With Quote

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•