
Thread: UIQ2 statistics

  1. #1
     schnaader (Programmer)

    UIQ2 statistics

    After reading Jan's post about the UIQ2 transformation, I decided to run some tests myself. I gathered statistics
    about the bit probabilities for each of the 256 possible bit positions (not counting the terminating NUL byte) of the
    Turing machine "strings".

    The main part - creating the Turing strings and initializing the RNG - was taken over completely from uiq.cpp; the only
    change is that the strings are written to a 33-byte buffer instead of to a file or stdout. The statistics part currently
    gathers two types of information; the counting loop is sketched below.
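
    Roughly, the counting could look like this (a minimal sketch with hypothetical names; generate_string() stands in
    for the parts taken over from uiq.cpp and is stubbed with random bytes here only so the sketch is self-contained -
    the real version runs the Turing machine):

    Code:
    #include <cstdio>
    #include <cstdint>
    #include <cstdlib>

    static uint64_t zero_count[33][256]; // [string length][bit position]: "bit is 0" counts
    static uint64_t length_count[33];    // strings seen per length

    // Hypothetical stub: the real version (taken over from uiq.cpp) runs the
    // Turing machine; here just 1..32 random non-NUL bytes plus the final NUL.
    static int generate_string(unsigned char buf[33]) {
        int len = 1 + rand() % 32;
        for (int i = 0; i < len; ++i) buf[i] = 1 + rand() % 255;
        buf[len] = 0;
        return len;
    }

    int main() {
        unsigned char buf[33];
        // 620 million strings as in the post; reduce for a quick test
        for (long long n = 0; n < 620000000LL; ++n) {
            int len = generate_string(buf);
            ++length_count[len];
            for (int bit = 0; bit < len * 8; ++bit) {
                // bits ordered MSB to LSB within each byte
                if (((buf[bit / 8] >> (7 - bit % 8)) & 1) == 0)
                    ++zero_count[len][bit];
            }
        }
        for (int len = 1; len <= 32; ++len) {
            if (length_count[len] == 0) continue;
            printf("Length %d:", len);
            for (int bit = 0; bit < len * 8; ++bit)
                printf(" %d", (int)(100 * zero_count[len][bit] / length_count[len]));
            printf("\n");
        }
        return 0;
    }
    The first type of statistics (totals over all lengths) falls out of the same counts by summing over the lengths.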

    First, the overall probabilities for each of the 256 bit positions are collected. This is a very basic test and not very
    useful: many of the probabilities are between 45 and 55 percent for "bit is 0", so I guess you can't really exploit this,
    except that a "saw tooth" pattern is visible - looking at bytes, the probability of 0 increases towards the LSB (more on this later).

    Second, the same probabilities are collected, but split by the byte length of the strings (1-32). I'll post a
    part of the results here; a full result set with exact numbers can be found at http://schnaader.info/uiq2_stats.txt.

    I can provide the source code and binaries if somebody is interested, but at the moment I think the program isn't that useful.

    Numbers are integers from 0 to 100 giving the probability (in percent) of each bit being 0; bits are ordered from MSB to LSB.
    The statistics are based on 620 million strings (about 5 hours of CPU time at 800 MHz):

    Code:
    Length 1: 43 49 56 62 69 76 82 88
    Length 2: 49 49 49 49 49 49 49 49 41 50 59 67 74 81 86 91
    Length 3: 49 48 47 46 46 45 45 45 45 45 45 44 45 44 45 45 36 47 57 65 73 79 85 91
    Length 4: 48 46 45 42 41 40 39 38 39 36 36 36 36 34 36 34 35 35 35 34 36 34 34 35 31 38 48 54 63 71 78 85
    
    Length 16:
     47 45 43 41 40 38 37 37 36 35 34 35 33 34 34 33 33 34 33 34 33 33 33 35 32 33 34 34 34 34 32 34 33 34 34 34 32 34 34 34 33 34 33 34 
     34 33 33 34 32 34 33 33 33 34 33 33 34 33 34 35 32 33 34 34 33 34 32 34 34 34 33 34 32 34 34 33 33 34 33 34 33 33 33 35 32 33 34 33 
     33 34 32 34 33 33 33 34 32 33 34 33 33 34 33 34 34 33 33 34 33 34 34 33 33 34 33 34 34 33 34 35 31 38 46 53 62 71 81 91
    For most strings, the first bits are roughly uniformly distributed, but the last byte always starts at low probabilities (30-40%)
    and grows steadily to very high ones (85-90%). This is because most programs halt or time out at the start or in the middle of a
    byte, and the remaining bits are filled up with 0 bits (e.g. a machine that halts after writing three bits of its last byte leaves
    the other five bits as 0).

    Also, lengths 4 and 16 somehow favor 1 bits; the other lengths don't show this.
    At length 22 (not shown here), every second bit seems to be 0 only ~33% of the time, while the other bits are 0 about 45-50%
    of the time. However, there were only 161,851 strings of length 22, and the counts drop fast with increasing length (length 29
    was seen only 38 times; lengths 30-32 didn't occur at all), so the probabilities for the higher lengths should be used with caution.

    After gathering this data, the next step I thought about was to collect statistics on the probability of each bit
    being 0 or 1, depending on both the byte length of the string and the previous 8 bits. But before doing this, I wanted
    to get some opinions on whether it could be useful for improving the compression ratio on UIQ.
    My first thought was to flip each bit that has a high probability of being 1, so that more of the bits would be 0, but I
    think this would only pay off for much more skewed distributions. Perhaps the next step reveals some of those; at the moment
    the lowest probabilities are around 30%, and they're not that common. Another disadvantage of this method is that it can
    create NUL bytes inside the strings, so there would be some overhead from switching to another format (e.g. "5 bits for the
    length of the encoded string + encoded string" instead of "string + NUL") - see the sketch below.
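
    For illustration, the flip-and-reframe idea might look like this (a hypothetical sketch; flip_mask would be filled
    from the gathered statistics, and all names are made up):

    Code:
    #include <vector>
    #include <cstdint>

    // To be filled from the gathered statistics: a set bit means "this
    // position is 1 more than half of the time, so flip it".
    static uint8_t flip_mask[33][32];

    void encode(const uint8_t* s, int len, std::vector<bool>& out) {
        // 5-bit length prefix: len is 1..32, so store len-1 (0..31), MSB first.
        for (int i = 4; i >= 0; --i)
            out.push_back(((len - 1) >> i) & 1);
        // Flip the biased positions, then emit the bits MSB to LSB.
        for (int i = 0; i < len; ++i) {
            uint8_t b = s[i] ^ flip_mask[len][i];
            for (int j = 7; j >= 0; --j)
                out.push_back((b >> j) & 1);
        }
    }
    Decoding is symmetric: read the 5-bit length, then XOR the bytes with the same mask.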
    Another question is: how much of this is already exploited (and therefore worthless)? I guess at least PAQ "sees" that the
    last bits before a 0 byte tend to be 0, so that part probably isn't worth it, is it? The same goes for the statistics gathered
    so far - I don't think the ratio can be improved much using this information. The next step (probabilities depending on the
    previous 8 bits) might be worth a try, but even that could be useless. What are the experts' opinions?

  2. #2
     Matt Mahoney (Expert)
    Very interesting. I think the low probability of 0 bits is due to eliminating machines that output all zeros. Otherwise the probabilities should be 50%.

    The length 22 string statistics are probably due to machines that output 2 bits every 3 steps and reach the time limit.

    It seems that with the uiq2_t transform there should be some correlation between the two files that could be exploited.

    I am still waiting for someone to write a compressor that pre-trains the model using the UIQ2 generator. As a simple example, I created two test files:

    uiq2 u1 1
    uiq2 u2 2

    Compressing separately with zpaq c (default compression)

    u1 6517929 -> 3110433
    u2 6519743 -> 3111981

    Compressing both into one archive:

    u1 6517929 -> 3110433
    u2 6519743 -> 3006612
    -> 6117045

  3. #3
     schnaader (Programmer)
    Quote Originally Posted by Matt Mahoney View Post
    The length 22 string statistics are probably due to machines that output 2 bits every 3 steps and reach the time limit.
    Ah yes, 256/3*2/8 ≈ 21.33 bytes, that makes sense. Still, this would mean that machines which repeatedly output one of the patterns "01" or "10" are more probable, which is another interesting thing.

    Quote Originally Posted by Matt Mahoney View Post
    I am still waiting for someone to write a compressor that pre-trains the model using the UIQ2 generator. As a simple example, I created two test files:

    uiq2 u1 1
    uiq2 u2 2

    Compressing separately with zpaq c (default compression)

    u1 6517929 -> 3110433
    u2 6519743 -> 3111981

    Compressing both into one archive:

    u1 6517929 -> 3110433
    u2 6519743 -> 3006612
    -> 6117045
    Ah, interesting. I wanted to test this (compressing two input files) but held off because I thought most compressors wouldn't see the similarities and the output space (theoretically there could be about 2^256 different outputs) would be too large. I guess this is another example of how wrong that assumption can be (another good example - where this is even more true - is of course enwik8/enwik9).

  4. #4
     Matt Mahoney (Expert)
    There's no correlation between the strings, but with pre-training the context model tables are already set to near-optimal values when the real input arrives. Pre-training should be most helpful on compressors that use a lot of memory.
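
    A toy sketch of what pre-training means here (hypothetical names; the model is a simple order-0 bit predictor, not
    ZPAQ's actual one, and generate_string() stands in for the UIQ2 generator):

    Code:
    #include <cstdint>

    struct BitModel {
        uint32_t counts[256][2]; // counts[context][bit], context = previous bits of the byte
        BitModel() { for (auto& c : counts) c[0] = c[1] = 1; }
        void update(int ctx, int bit) { ++counts[ctx][bit]; }
        double p1(int ctx) const { // predicted probability that the next bit is 1
            return (double)counts[ctx][1] / (counts[ctx][0] + counts[ctx][1]);
        }
    };

    // Hypothetical stand-in for the UIQ2 generator: fills buf with one
    // NUL-terminated string and returns its length in bytes.
    int generate_string(unsigned char buf[33]);

    // Warm up the counters on generator output before coding the real input,
    // so the tables are no longer flat when compression starts.
    void pretrain(BitModel& m, long long n_strings) {
        unsigned char buf[33];
        while (n_strings-- > 0) {
            int len = generate_string(buf);
            for (int i = 0; i <= len; ++i) {   // include the terminating NUL
                int ctx = 1;                   // leading 1 bit marks how many bits seen
                for (int j = 7; j >= 0; --j) {
                    int bit = (buf[i] >> j) & 1;
                    m.update(ctx & 255, bit);
                    ctx = ctx * 2 + bit;
                }
            }
        }
    }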

  5. #5
     Member (USA)
    I once wrote a version of PAQ that pre-trained by generating a UIQ file at runtime, but it didn't help. I don't remember the numbers, but they were disappointing, and only valuable when compressing very small files. If a file is just a few kilobytes then yes, it saved a few hundred bytes every time. But for normal-size files the result would be about the same size, if not slightly worse.

  6. #6
     Matt Mahoney (Expert)
    You're probably running into memory limits. Don't you have something like 1200 models with around 1 MB each? Those can fill up pretty quick on 6 MB of input.

Similar Threads

  1. Transformation for UIQ2
    By Jan Ondrus in forum Data Compression
    Replies: 49
    Last Post: 4th October 2009, 18:30
  2. Data Structures for Context Model Statistics
    By Shelwien in forum Data Compression
    Replies: 2
    Last Post: 8th November 2008, 11:14
