# Thread: cmix compression of quintets independently sampled from non-random distribution

1. ## cmix compression of quintets independently sampled from non-random distribution

I ran cmix on the file packed.bin in the attached tar file, achieving about 4% compression (ratio 0.95895948855 = 1,798,050 / 1,875,001 bytes). To generate the file I ran the accompanying perl program, token_gen.pl, which assigns each of the 32 quintets a random frequency from 0 to 7 and then samples quintets independently from the resulting distribution. The frequency distribution of quintets for packed.bin is in the accompanying sdist.txt, reproduced below for illustration.

Given such a distribution, how can one calculate the ideal compression ratio?

7:11010
7:11000
7:10111
7:10101
7:01001
7:01000
7:00001
6:11110
6:10100
5:11001
5:01110
5:00010
4:11011
4:10010
4:10001
4:01111
3:11111
3:01100
3:01010
3:00110
3:00101
3:00100
3:00011
2:11101
2:11100
2:10011
2:00111
2:00000
1:01101
1:01011
0:10110
0:10000
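The attached script isn't reproduced here, so as a rough illustration, here is a Python sketch of what token_gen.pl is described as doing. The weight-drawing, seed, and packing details are my assumptions, not the actual Perl code:

```python
import random

# Hypothetical re-creation of token_gen.pl's described behavior: assign each
# of the 32 quintets a random frequency 0..7, then sample i.i.d. from the
# resulting distribution. Seed and structure are assumptions for this sketch.
random.seed(1)
weights = [random.randint(0, 7) for _ in range(32)]
quintets = [q for q, w in enumerate(weights) for _ in range(w)]

# draw 3,000,000 quintets independently and pack them 5 bits at a time
bits = 0
nbits = 0
out = bytearray()
for _ in range(3_000_000):
    bits = (bits << 5) | random.choice(quintets)
    nbits += 5
    while nbits >= 8:
        nbits -= 8
        out.append((bits >> nbits) & 0xFF)
        bits &= (1 << nbits) - 1  # keep only the unwritten low bits

print(len(out), "bytes")
```

3,000,000 five-bit symbols pack into exactly 1,875,000 bytes, which matches packed.bin's 1,875,001 bytes to within one byte.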

2. The order of the quintets is based on a random function (0 to 7); there's not much cmix or any other codec can do with this type of distribution.
Replace the random function with something linear, like this:
```quintet = counter % 7;
counter++;```

Then you'll see higher compression since it introduces a repeating order of quintets.

Local ordering and frequency are more important for compression than the global frequency distribution.
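To see this point numerically, a small Python sketch (variable names are mine): the counter-based stream above has about log2(7) ≈ 2.81 bits of order-0 entropy per symbol, yet zero conditional entropy once a model conditions on the previous symbol:

```python
import math
from collections import Counter

seq = [i % 7 for i in range(10_000)]  # the "linear" generator above

# order-0 entropy: symbol frequencies alone, no ordering information
c0 = Counter(seq)
h0 = -sum(n / len(seq) * math.log2(n / len(seq)) for n in c0.values())

# order-1 conditional entropy: frequencies given the previous symbol
c1 = Counter(zip(seq, seq[1:]))
h1 = 0.0
for prev in set(seq):
    pairs = {nxt: n for (p, nxt), n in c1.items() if p == prev}
    tot = sum(pairs.values())
    h1 += tot / (len(seq) - 1) * -sum(n / tot * math.log2(n / tot)
                                      for n in pairs.values())

print(f"order-0: {h0:.3f} bits/symbol")   # ~2.807, looks incompressible
print(f"order-1: {h1:.3f} bits/symbol")   # 0.000, fully predictable
```

An order-0 coder sees a near-uniform 7-symbol alphabet; any coder with one symbol of context sees a deterministic sequence.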

3. The entropy of this distribution is 4.73624 bits per symbol, so you can expect about 0.9472 as the compression ratio (4.73624 / 5) for something that's optimized for this 5-bit alphabet.

You can easily compute this with Excel, but there is also a handy online calculator, just paste the frequencies "7 7 7 7 7 7 7 6 6 5 5 5 4 4 4 4 3 3 3 3 3 3 3 2 2 2 2 2 1 1 0 0" in the event probabilities box: https://planetcalc.com/2476/
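If Excel or the online calculator isn't handy, the same number drops out of a few lines of Python (a sketch using the frequency list quoted above):

```python
import math

# frequencies from sdist.txt (counts out of 125 for the 32 quintets)
freqs = [7, 7, 7, 7, 7, 7, 7, 6, 6, 5, 5, 5, 4, 4, 4, 4,
         3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 0, 0]
total = sum(freqs)  # 125

# Shannon entropy in bits per quintet; zero-frequency symbols contribute nothing
entropy = -sum(f / total * math.log2(f / total) for f in freqs if f > 0)

print(f"entropy     = {entropy:.5f} bits/quintet")  # 4.73624
print(f"ideal ratio = {entropy / 5:.4f}")           # 0.9472
```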

4. Welcome! I don't see an attach though.

> Given such a distribution, how can one calculate the ideal compression ratio?

By enumeration, as a combinatoric problem: choose the first quintet, then the 2nd using the last 4 symbols of the first, etc.
I doubt there's a universal closed-form solution, but we can implement the number of instances as a recursive function
of the frequency table, string length, and 4-symbol suffix index, and cache already-calculated results.
Not sure about quintets though; just 32! is already ~2^118. Maybe some approximations and symmetries would work.
At least the suffix index can be implemented as a permutation of the frequency table.
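If the quintets really are sampled independently (no suffix dependence), the base case of this count is just a multinomial coefficient: the number of strings realizing a frequency table (f1, …, fk) of length N = Σ fi is N! / (f1!···fk!), and log2 of that count is the enumerative code length. A toy sketch (the function name is mine):

```python
from math import factorial, log2

def count_strings(freqs):
    """Number of distinct strings realizing a given frequency table."""
    n = sum(freqs)
    count = factorial(n)
    for f in freqs:
        count //= factorial(f)
    return count

# toy example: length-5 strings with counts (2, 2, 1) -> 5!/(2!*2!*1!) = 30
c = count_strings([2, 2, 1])
print(c, "strings,", round(log2(c), 3), "bits to index one")  # 30 strings
```

For large N this code length approaches N times the order-0 entropy; the contextual version described above would add the 4-symbol suffix index as an extra argument to the same memoized recursion.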

5. > there's not much cmix or any codec can do with this type of distribution.

I think the idea is to evaluate the cmix result on data with known entropy.
Of course, we could use a number like pi, which has small fixed complexity (in the Kolmogorov sense),
but whose digits cmix would still compress to an ever-growing number of bytes.
However, this specific problem of correctly coding data with a known high-order frequency table
also has some practical uses.

6. ## Re: cmix compression of quintets independently sampled from non-random distribution

> Originally Posted by Shelwien: Welcome! I don't see an attach though.

Find the attached tgz.

Apparently tar is disallowed.

7. ## Sub-symbol bit pattern entropy?

> Originally Posted by Stefan Atev: The entropy of this is 4.73624, so you can expect 0.9472 as the compression ratio (4.73624 / 5) for something that's optimized for this 5-bit alphabet. You can easily compute this with Excel, but there is also a handy online calculator, just paste the frequencies "7 7 7 7 7 7 7 6 6 5 5 5 4 4 4 4 3 3 3 3 3 3 3 2 2 2 2 2 1 1 0 0" in the event probabilities box: https://planetcalc.com/2476/

What if the distribution were instead:

7:00000
7:00001
0:*

In other words, all quintets but 2 have zero occurrences and those two comprise bit patterns with very low entropy?

8. Ah, this is completely different: token_gen.pl doesn't have any contextual dependency, so the result should be calculated from order-0 entropy.
cdm result for packed.bin is 1,857,708, nzcc result is 1,862,161 (it has a bitwise model).

This is generated by my script (in the attach) from your sdist.txt:
Code:
```00000: p=2/125 cl=5.966
00001: p=7/125 cl=4.158
00010: p=5/125 cl=4.644
00011: p=3/125 cl=5.381
00100: p=3/125 cl=5.381
00101: p=3/125 cl=5.381
00110: p=3/125 cl=5.381
00111: p=2/125 cl=5.966
01000: p=7/125 cl=4.158
01001: p=7/125 cl=4.158
01010: p=3/125 cl=5.381
01011: p=1/125 cl=6.966
01100: p=3/125 cl=5.381
01101: p=1/125 cl=6.966
01110: p=5/125 cl=4.644
01111: p=4/125 cl=4.966
10001: p=4/125 cl=4.966
10010: p=4/125 cl=4.966
10011: p=2/125 cl=5.966
10100: p=6/125 cl=4.381
10101: p=7/125 cl=4.158
10111: p=7/125 cl=4.158
11000: p=7/125 cl=4.158
11001: p=5/125 cl=4.644
11010: p=7/125 cl=4.158
11011: p=4/125 cl=4.966
11100: p=2/125 cl=5.966
11101: p=2/125 cl=5.966
11110: p=6/125 cl=4.381
11111: p=3/125 cl=5.381
total: 1776090.5 bytes```
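The cl column here is just -log2(f/125), and the total is the sum of code lengths over the file's roughly 3,000,000 quintets. A quick Python check reproduces the numbers approximately (the exact 1,776,090.5 comes from the file's empirical symbol counts, which only the attachment has):

```python
import math

# frequency table from sdist.txt (f out of 125); zero-frequency quintets omitted
freqs = [7, 7, 7, 7, 7, 7, 7, 6, 6, 5, 5, 5, 4, 4, 4, 4,
         3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1]

# per-symbol ideal code lengths match the cl column in the table above
cl7 = -math.log2(7 / 125)   # quintets with f=7 -> ~4.158 bits
cl2 = -math.log2(2 / 125)   # quintets with f=2 -> ~5.966 bits

# packed.bin's 1,875,001 bytes hold ~3,000,000 five-bit symbols
entropy = sum(f / 125 * -math.log2(f / 125) for f in freqs)
print(f"expected size ~ {3_000_000 * entropy / 8:,.0f} bytes")  # ~1,776,090
```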

9. Code:
```1,875,001 packed.bin
3,000,002 packed.chr // 5to8.pl output
1,782,634 1          // nzcc
1,782,377 2          // freqtable_v2
1,776,698 3          // sh_v2f from rangecoders_v0e (precise AC)
1,776,090            // entropy target from my previous post```

You have to understand that most of cmix's models are byte-aligned.

10. > Originally Posted by jabowery:
> 7:00000
> 7:00001
> 0:*
>
> In other words, all quintets but 2 have zero occurrences and those two comprise bit patterns with very low entropy?

That's kind of trivial: the entropy is 1 bit per symbol (two equally probable events), so you need one bit per symbol to encode, and the compression ratio would be 1/5 = 20%. What Shelwien said about compressors often being byte-oriented just means that an input of packed 5-bit symbols makes it a bit tricky for general-purpose coders to achieve the best possible ratio. A purpose-built arithmetic coder will get very close to the theoretical compression ratio, and I think a bitwise coder with at least 4 bits of context (O4+ bitwise) should also achieve near-optimal results on this data (even if packed). Shelwien's coder gets within 0.034% of the theoretical ratio.
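That last figure follows directly from the sizes in the results post above: sh_v2f's 1,776,698 bytes versus the 1,776,090-byte entropy target:

```python
# relative overhead of the precise AC result over the entropy target
actual, ideal = 1_776_698, 1_776_090
print(f"{(actual / ideal - 1) * 100:.3f}% above the entropy target")  # 0.034%
```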
