
Thread: cmix compression of quintets independently sampled from non-random distribution

  1. #1
    Member
    Join Date
    Apr 2020
    Location
    Shenandoah, IA
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts

    cmix compression of quintets independently sampled from non-random distribution

    I ran cmix on the file packed.bin in the attached tar file, achieving about 4% compression (0.95896 = 1798050/1875001). To generate the file I ran the accompanying Perl program, token_gen.pl, which assigns each of the 32 possible quintets (5-bit symbols) a random frequency weight from 0 to 7. The resulting frequency distribution of quintets for packed.bin is in the accompanying sdist.txt, which appears below for illustration.

    Given such a distribution, how can one calculate the ideal compression ratio?

    7:11010
    7:11000
    7:10111
    7:10101
    7:01001
    7:01000
    7:00001
    6:11110
    6:10100
    5:11001
    5:01110
    5:00010
    4:11011
    4:10010
    4:10001
    4:01111
    3:11111
    3:01100
    3:01010
    3:00110
    3:00101
    3:00100
    3:00011
    2:11101
    2:11100
    2:10011
    2:00111
    2:00000
    1:01101
    1:01011
    0:10110
    0:10000

  2. #2
    Member
    Join Date
    Feb 2015
    Location
    United Kingdom
    Posts
    174
    Thanks
    28
    Thanked 73 Times in 43 Posts
    The order of the quintets is based on a random function (0 to 7); there's not much cmix or any other codec can do with this type of distribution.
    Replace the random function with something linear, like this:
    quintet = counter % 7;
    counter++;

    Then you'll see higher compression, since it introduces a repeating order of quintets.

    Local ordering and frequency matter more for compression than the global frequency distribution.
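    To illustrate the point (an editorial sketch in Python rather than the thread's Perl): a generic byte-oriented codec like zlib squeezes the periodic counter%7 sequence down to almost nothing, while an independently sampled sequence over the same alphabet keeps only its order-0 redundancy.

    ```python
    import random
    import zlib

    N = 100_000
    # Periodic sequence: quintet = counter % 7 -- strong local ordering.
    periodic = bytes(i % 7 for i in range(N))
    # Independently sampled symbols from the same alphabet -- no local ordering.
    random.seed(1)
    sampled = bytes(random.randrange(7) for _ in range(N))

    print(len(zlib.compress(periodic)))  # tiny: the repetition is easy to model
    print(len(zlib.compress(sampled)))   # large: only order-0 redundancy remains
    ```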

  3. #3
    Member
    Join Date
    Aug 2016
    Location
    USA
    Posts
    54
    Thanks
    13
    Thanked 20 Times in 15 Posts
    The entropy of this distribution is 4.73624 bits per symbol, so you can expect 0.9472 as the compression ratio (4.73624 / 5) from something that's optimized for this 5-bit alphabet.


    You can easily compute this with Excel, but there is also a handy online calculator, just paste the frequencies "7 7 7 7 7 7 7 6 6 5 5 5 4 4 4 4 3 3 3 3 3 3 3 2 2 2 2 2 1 1 0 0" in the event probabilities box: https://planetcalc.com/2476/
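    For reference, a short Python sketch (editorial addition) reproducing that number from the posted frequencies:

    ```python
    from math import log2

    freqs = [7, 7, 7, 7, 7, 7, 7, 6, 6, 5, 5, 5, 4, 4, 4, 4,
             3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 0, 0]
    total = sum(freqs)  # 125
    # Shannon entropy in bits per 5-bit symbol; zero-frequency symbols contribute nothing.
    H = -sum(f / total * log2(f / total) for f in freqs if f > 0)
    print(round(H, 5))      # 4.73624 bits per symbol
    print(round(H / 5, 4))  # 0.9472 expected compression ratio
    ```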



  4. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,943
    Thanks
    293
    Thanked 1,286 Times in 728 Posts
    Welcome! I don't see an attach though.

    > Given such a distribution, how can one calculate the ideal compression ratio?

    By enumeration, as a combinatoric problem: choose the first quintet, then the 2nd using the last 4 symbols of the first, etc.
    I doubt there's a closed-form solution, but we can implement the number of instances as a recursive function
    of the frequency table, string length, and 4-symbol suffix index, and cache already-calculated results.
    Not sure about quintets, though: 32! alone is already ~2^118, so maybe some approximations and symmetries would work.
    At least the suffix index can be implemented as a permutation of the frequency table.
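    A hedged sketch of the enumerative idea, ignoring the suffix-index complication (so it covers only the order-0 case): the number of distinct strings sharing a given frequency table is the multinomial coefficient N!/(f1!·f2!·...), and log2 of that count is the enumerative code length, which approaches N·H for large N.

    ```python
    from math import lgamma, log, log2

    def log2_multinomial(counts):
        """log2 of N! / prod(f_i!) -- bits needed to index one arrangement
        among all strings that share this exact frequency table."""
        n = sum(counts)
        return (lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)) / log(2)

    # Tiny check: 4 symbols, two of each kind -> 4!/(2!*2!) = 6 arrangements.
    print(round(log2_multinomial([2, 2]), 3))  # log2(6) ~ 2.585

    # For larger tables the enumerative length approaches N * H (order-0 entropy).
    counts = [700, 300]
    n = sum(counts)
    H = -sum(c / n * log2(c / n) for c in counts)
    print(round(log2_multinomial(counts), 1), round(n * H, 1))
    ```

    The enumerative length is always slightly below N·H, since the exact frequency table itself carries a little information.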

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,943
    Thanks
    293
    Thanked 1,286 Times in 728 Posts
    > there's not much cmix or any codec can do with this type of distribution.

    I think the idea is to evaluate the cmix result on data with known entropy.
    Of course, we could use a number like pi, which has small fixed information content (in the Kolmogorov-complexity sense)
    but whose cmix-compressed size grows without bound as we feed it more digits.
    However, this specific problem, correct coding of data with a known high-order frequency table,
    also has some practical uses.

  6. #6
    Member
    Join Date
    Apr 2020
    Location
    Shenandoah, IA
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: cmix compression of quintets independently sampled from non-random distribution

    Quote Originally Posted by Shelwien View Post
    Welcome! I don't see an attach though.
    Find the attached tgz.

    Apparently tar is disallowed.
    Attached Files

  7. #7
    Member
    Join Date
    Apr 2020
    Location
    Shenandoah, IA
    Posts
    3
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Sub-symbol bit pattern entropy?

    Quote Originally Posted by Stefan Atev View Post
    The entropy of this is 4.73624, so you can expect 0.9472 as the compression ratio (4.73624 / 5) for something that's optimized for this 5-bit alphabet.


    You can easily compute this with Excel, but there is also a handy online calculator, just paste the frequencies "7 7 7 7 7 7 7 6 6 5 5 5 4 4 4 4 3 3 3 3 3 3 3 2 2 2 2 2 1 1 0 0" in the event probabilities box: https://planetcalc.com/2476/
    What about a distribution like:

    7:00000
    7:00001
    0:*

    In other words, all quintets but 2 have zero occurrences and those two comprise bit patterns with very low entropy?

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,943
    Thanks
    293
    Thanked 1,286 Times in 728 Posts
    Ah, this is completely different: token_gen.pl doesn't have any contextual dependency, so the result should be calculated from the order-0 entropy.
    The cdm result for packed.bin is 1,857,708; the nzcc result is 1,862,161 (it has a bitwise model).

    This was generated from your sdist.txt by my script in the attach:
    Code:
    00000: p=2/125 cl=5.966
    00001: p=7/125 cl=4.158
    00010: p=5/125 cl=4.644
    00011: p=3/125 cl=5.381
    00100: p=3/125 cl=5.381
    00101: p=3/125 cl=5.381
    00110: p=3/125 cl=5.381
    00111: p=2/125 cl=5.966
    01000: p=7/125 cl=4.158
    01001: p=7/125 cl=4.158
    01010: p=3/125 cl=5.381
    01011: p=1/125 cl=6.966
    01100: p=3/125 cl=5.381
    01101: p=1/125 cl=6.966
    01110: p=5/125 cl=4.644
    01111: p=4/125 cl=4.966
    10001: p=4/125 cl=4.966
    10010: p=4/125 cl=4.966
    10011: p=2/125 cl=5.966
    10100: p=6/125 cl=4.381
    10101: p=7/125 cl=4.158
    10111: p=7/125 cl=4.158
    11000: p=7/125 cl=4.158
    11001: p=5/125 cl=4.644
    11010: p=7/125 cl=4.158
    11011: p=4/125 cl=4.966
    11100: p=2/125 cl=5.966
    11101: p=2/125 cl=5.966
    11110: p=6/125 cl=4.381
    11111: p=3/125 cl=5.381
    total: 1776090.5 bytes
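    The cl column above is just the ideal code length, cl = -log2(p); a quick check (editorial sketch) reproduces the listed values for every frequency that occurs:

    ```python
    from math import log2

    # (frequency, listed code length) pairs from the table above, p = f/125.
    rows = [(7, 4.158), (6, 4.381), (5, 4.644), (4, 4.966),
            (3, 5.381), (2, 5.966), (1, 6.966)]
    for f, cl in rows:
        assert round(-log2(f / 125), 3) == cl
    print("all code lengths match -log2(f/125)")
    ```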
    Attached Files

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,943
    Thanks
    293
    Thanked 1,286 Times in 728 Posts
    1,875,001 packed.bin
    3,000,002 packed.chr // 5to8.pl output
    1,782,634 1 // nzcc
    1,782,377 2 // freqtable_v2
    1,776,698 3 // sh_v2f from rangecoders_v0e (precise AC)
    1,776,090 // entropy target from my previous post

    You have to understand that most of cmix's models are byte-aligned.
    Attached Files

  10. #10
    Member
    Join Date
    Aug 2016
    Location
    USA
    Posts
    54
    Thanks
    13
    Thanked 20 Times in 15 Posts
    Quote Originally Posted by jabowery View Post
    What about a distribution like:

    7:00000
    7:00001
    0:*

    In other words, all quintets but 2 have zero occurrences and those two comprise bit patterns with very low entropy?
    That's kind of trivial: the entropy is 1 bit (two equally probable events), so you need one bit per symbol to encode, and the compression ratio would be 1/5 = 20%. What Shelwien said about compressors often being byte-oriented just means that an input of packed 5-bit symbols makes it a bit tricky for general-purpose coders to achieve the best possible ratio. A purpose-built arithmetic coder will get very close to the theoretical compression ratio, and I think a bitwise coder with at least 4 bits of context (order-4+ bitwise) should also achieve near-optimal results on this data, even packed. Shelwien's coder gets within 0.034% of the theoretical ratio.
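    The two-symbol case can be checked with the same entropy formula (editorial sketch):

    ```python
    from math import log2

    freqs = [7, 7]  # only 00000 and 00001 occur, equally often
    total = sum(freqs)
    H = -sum(f / total * log2(f / total) for f in freqs)
    print(H)      # 1.0 bit per symbol
    print(H / 5)  # 0.2 compression ratio for 5-bit symbols
    ```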

