# Thread: combining prediction with entropy coder

1. ## combining prediction with entropy coder

Hi all, here is my question:

There is a sequence of integers ( for simplicity, they are all 16-bit integers ) and there is a prediction model for the sequence, e.g., a_{n}^{~} = k_1 * a_{n-1} + k_2 * a_{n-2} + k_3. I can of course compute residues e_{n} = a_{n}^{~} - a_{n}, and then compress e_[n} with an entropy coder ( conventional arithmetic coder and ANS ), but better compression might be achieved if prediction sequence is fed to the entropy coder to compute adaptive probability distribution. While I understand the principle, I don't know how to do it in practice, since how do I map prediction values (integers ) to probabilities? Any concrete examples ( write-ups, source codes, existing packages ) to follow? Thanks in advance for any pointers.

2. If you compute e_{n} you already have enough information to pack it with ANS, you have an upper bound of 2^16, that's your max range, and if your residuals are low (hovering around 0) it should be feasible to use create a probability mapping for rANS simply by doing this: prob = (2^16 - e_{n}), this has the effect of creating a larger probability for lower values, even though there is no probability model involved, it's just a mapping from residual to a pseudo frequency.

Then just feed the prob and the max range into something like the public domain rANS implementation by Fabian Giesen. However the prob can never be zero, so just add +1 to it so the ANS state at least grows a little bit and remains fully decodable.

3. ## Thanks:

smjohn1 (13th November 2018)

4. Originally Posted by smjohn1
Hi all, here is my question:

There is a sequence of integers ( for simplicity, they are all 16-bit integers ) and there is a prediction model for the sequence, e.g., a_{n}^{~} = k_1 * a_{n-1} + k_2 * a_{n-2} + k_3. I can of course compute residues e_{n} = a_{n}^{~} - a_{n}, and then compress e_[n} with an entropy coder ( conventional arithmetic coder and ANS ), but better compression might be achieved if prediction sequence is fed to the entropy coder to compute adaptive probability distribution. While I understand the principle, I don't know how to do it in practice, since how do I map prediction values (integers ) to probabilities? Any concrete examples ( write-ups, source codes, existing packages ) to follow? Thanks in advance for any pointers.
https://github.com/webmproject/libwe...ctor_enc.c#L46

I use Shannon entropy with an exponential negative cost of -0.1 * exp(-A |x|) bits. The negative cost favors small values -- which are more likely to lead to good histogram clustering results later, even if Shannon entropy at this stage would be better otherwise.

5. ## Thanks:

smjohn1 (13th November 2018)

6. Thanks. But if I do this, what is the difference between just input e_{n} directly to rANS or FSE? FSE of course is different from rANS, but wouldn't rANS itself take care of prob. of e_{n}?

Originally Posted by Lucas
If you compute e_{n} you already have enough information to pack it with ANS, you have an upper bound of 2^16, that's your max range, and if your residuals are low (hovering around 0) it should be feasible to use create a probability mapping for rANS simply by doing this: prob = (2^16 - e_{n}), this has the effect of creating a larger probability for lower values, even though there is no probability model involved, it's just a mapping from residual to a pseudo frequency.

Then just feed the prob and the max range into something like the public domain rANS implementation by Fabian Giesen. However the prob can never be zero, so just add +1 to it so the ANS state at least grows a little bit and remains fully decodable.

7. Feeding those pseudo frequencies into rANS is equivalent to optimally bit-packing your error codes, FSE would use a static model to attempt to compress your data, so it may be better than bit packing if your data has weird distributions.

Since you're dealing with 16-bit data you could also just use an order-1 transpose and FSE for the highest ratio without having to develop a complex and wide probability model for 16-bit data:
```
void compute_freqs(uint32_t * histo, uint8_t * src, size_t size) {
uint32_t freqs_unroll[4][256] = {{0}};
size_t i = 0;
while((i + 4) < size) {
freqs_unroll[0][src[i + 0]]++;
freqs_unroll[1][src[i + 1]]++;
freqs_unroll[2][src[i + 2]]++;
freqs_unroll[3][src[i + 3]]++;
i += 4;
}

while(i < size) {
freqs_unroll[0][src[i]]++;
i++;
}

for(i = 0; i < 256; i++)
histo[i] = freqs_unroll[0][i] + freqs_unroll[1][i] + freqs_unroll[2][i] + freqs_unroll[3][i];
}

uint8_t order_1_transpose(void * src_v, void * dst_v, size_t size) {
uint8_t * src = (uint8_t*)src_v;
uint8_t * dst = (uint8_t*)dst_v;

size_t i = 0, pos = 0, start = 0;
uint32_t bucket[256], freqs[256];
memset(bucket, 0, 256 * sizeof(uint32_t));
memset(freqs, 0, 256 * sizeof(uint32_t));
compute_freqs(freqs, src, size);

for(i = 0; i < 256; i++) {
bucket[i] = start;
start += freqs[i];
}

uint8_t sentinel= src[size - 1];
pos = bucket[sentinal]++;
for(i = 0; i < size; i++) {
dst[pos] = src[i];
pos = bucket[src[i]]++;
}

return sentinel;
}

void order_1_untranspose(void * src_v, void * dst_v, size_t size, uint8_t sentinel) {
uint8_t * src = (uint8_t*)src_v;
uint8_t * dst = (uint8_t*)dst_v;

size_t i = 0, pos = 0, start = 0;
uint32_t bucket[256], freqs[256];
memset(bucket, 0, 256 * sizeof(uint32_t));
memset(freqs, 0, 256 * sizeof(uint32_t));
compute_freqs(freqs, src, size);

for(i = 0; i < 256; i++) {
bucket[i] = start;
start += freqs[i];
}

pos = bucket[sentinel]++;
for(i = 0; i < size; i++) {
dst[i] = src[pos];
pos = bucket[dst[i]]++;
}
}
```

This only has 1 byte of overhead since you must transmit the sentinel, but it'll break apart those 16-bit inputs into a bunch of 8-bit context sorted buckets so you can use a traditional byte-wise entropy coder like FSE or rANS.

8. ## Thanks:

smjohn1 (14th November 2018)

9. Here a hint on how to do it with a context model so that the preditions are learned to be the maximal possible probability for an arithmetic coder: https://encode.su/threads/1211-Compr...ll=1#post50274

10. ## Thanks:

smjohn1 (16th November 2018)

11. Originally Posted by smjohn1
How do I map prediction values (integers ) to probabilities?
This is typically done by "contexts", That is, instead of having a single statistical model from which you derive the probabilities (e.g. by counting symbols and then using relative frequency as an estimate of the probability), you have a statistical model that is indexed by the context. Or, to put it simple, instead of a 1-dimensional frequency table freq[symbol], you use a two-dimensional table freq[symbol][context] that is indexed by both the symbol and its context. Contexts can be as simple as the preceeding symbol, but typically you want to avoid such simple contexts as you will likely not collect enough samples (in a 2d table) to arrive at a robust probability estimate. Hence, one would typically not use the preceeding symbol directly, but some function derived from it. For example, for text compression, the context could say whether the preceeding symbol (=letter) was a consonant or a vowel.

12. Originally Posted by thorfdbg
This is typically done by "contexts"
Both are used. Context modeling is more noise resistant, but often simple prediction is more efficient.