1. ## UIQ2 statistics

After reading the post from Jan about the UIQ2 transformation, I decided to run some tests myself. I gathered statistics
about the bit probabilities for each of the 256 possible bits (not counting the terminating NUL byte) of the Turing machine
"strings".

The main part, creating the Turing strings and initializing the RNG, was taken directly from uiq.cpp; the only change is that
the strings are written to a 33-byte buffer instead of a file or stdout. The statistics part currently gathers two types of information.

First, the total probabilities for each of the 256 bits are collected. This is a very basic test and not very useful. Many of the
probabilities are between 45 and 55 percent for "bit is 0", so I guess you can't really take advantage of this, except that
you can see a "sawtooth" characteristic - looking at bytes, the probabilities for 0 get higher towards the LSB (more on this later).

Second, the probabilities for each bit are collected, but split depending on the byte length of the strings (1-32). I'll post
part of the results here; a full result set with exact numbers can be found at http://schnaader.info/uiq2_stats.txt.

I can provide the source code and binaries if somebody is interested, but at the moment I don't think the program is that useful.
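The counting itself is simple; here is a minimal sketch of how such per-length, per-bit statistics could be gathered (function and variable names are my own illustration, not taken from the actual program), assuming the NUL-terminated strings in the 33-byte buffer described above:

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// zeroCount[len][i] counts how often bit i was 0 among all strings
// of byte length len (1-32); total[len] counts those strings.
static uint64_t zeroCount[33][256];
static uint64_t total[33];

// buf: one NUL-terminated Turing machine "string" of at most 32 bytes.
void addString(const unsigned char* buf) {
    size_t len = strlen(reinterpret_cast<const char*>(buf));
    if (len == 0 || len > 32) return;
    ++total[len];
    for (size_t i = 0; i < len * 8; ++i) {
        // Bits ordered MSB to LSB, as in the tables below.
        int bit = (buf[i / 8] >> (7 - i % 8)) & 1;
        if (bit == 0) ++zeroCount[len][i];
    }
}

// Integer percentage (0-100) that bit i of length-len strings is 0.
int zeroPercent(size_t len, size_t i) {
    return total[len] ? int(100 * zeroCount[len][i] / total[len]) : 0;
}
```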

Numbers are integers ranging from 0 to 100 giving the probability of each bit being 0, bits are ordered from MSB to LSB.
The number of strings examined for this is 620 million (about 5 hours of 800 MHz CPU time):

Code:
```Length 1: 43 49 56 62 69 76 82 88
Length 2: 49 49 49 49 49 49 49 49 41 50 59 67 74 81 86 91
Length 3: 49 48 47 46 46 45 45 45 45 45 45 44 45 44 45 45 36 47 57 65 73 79 85 91
Length 4: 48 46 45 42 41 40 39 38 39 36 36 36 36 34 36 34 35 35 35 34 36 34 34 35 31 38 48 54 63 71 78 85

Length 16:
47 45 43 41 40 38 37 37 36 35 34 35 33 34 34 33 33 34 33 34 33 33 33 35 32 33 34 34 34 34 32 34 33 34 34 34 32 34 34 34 33 34 33 34
34 33 33 34 32 34 33 33 33 34 33 33 34 33 34 35 32 33 34 34 33 34 32 34 34 34 33 34 32 34 34 33 33 34 33 34 33 33 33 35 32 33 34 33
33 34 32 34 33 33 33 34 32 33 34 33 33 34 33 34 34 33 33 34 33 34 34 33 33 34 33 34 34 33 34 35 31 38 46 53 62 71 81 91```
For most of the strings, the first bits are uniformly distributed, but the last byte always starts at low probabilities (30-40%) and
steadily climbs to very high probabilities (85-90%). This is because most of the programs halt or time out at the start or in the
middle of a byte, and the remaining bits are filled up with 0 bits.
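This fill-up effect can be seen with a tiny sketch (my own illustration, not code from the generator): a program that halts after outputting only the first few bits of its last byte leaves the remaining low bits at 0.

```cpp
#include <cstdint>

// Pack up to 8 bits MSB-first into one byte; bits never written stay 0,
// which is exactly the fill-up seen in the last byte of the strings.
uint8_t packByte(const int* bits, int n) {  // n = bits actually output (0-8)
    uint8_t b = 0;
    for (int i = 0; i < n; ++i)
        b |= (bits[i] & 1) << (7 - i);
    return b;  // the remaining low bits stay 0
}
```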

Also, lengths 4 and 16 somehow favor 1 bits; other lengths don't show this.
At length 22 (not shown here), it seems that every second bit is 0 only ~33% of the time while the other bits are 0 about 45-50%
of the time. There were only 161,851 strings of length 22, and the count decreases fast with length (length 29 was seen only
38 times, lengths 30-32 didn't occur at all), so the probabilities for higher lengths should be used with caution.

After gathering this data, the next step I thought about would be to gather statistics about the probability of each bit
being 0 or 1, depending on the byte length of the string and the previous 8 bits. But before doing this, I wanted to get
some opinions on whether this could be useful to increase the compression ratio on UIQ.
My first thought was to flip each bit that has a high probability of being 1, so more of the bits would be 0, but I think this
would only pay off for more skewed distributions. Perhaps the next step reveals some of these, but at the moment the lowest
probabilities are around 30%, and they're not that common. Another disadvantage of this method is that it creates some NUL
bytes inside the strings, so there would be some overhead from switching to another format (e.g. "5 bits for the length of the
encoded string + encoded string" instead of "string + NUL").
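The alternative framing mentioned above ("5 bits for the length + encoded string") could look roughly like this; the `BitWriter` helper and function names are my own sketch of the idea, not code from the actual program:

```cpp
#include <vector>
#include <cstdint>

// MSB-first bit writer: accumulates bits into a byte vector.
struct BitWriter {
    std::vector<uint8_t> out;
    int bitPos = 0;
    void putBit(int b) {
        if (bitPos == 0) out.push_back(0);
        out.back() |= (b & 1) << (7 - bitPos);
        bitPos = (bitPos + 1) % 8;
    }
    void putBits(uint32_t v, int n) {  // write n bits, MSB first
        for (int i = n - 1; i >= 0; --i) putBit((v >> i) & 1);
    }
};

// Store "5-bit length + string" instead of "string + NUL", so a bit-flip
// transform may produce NUL bytes inside a string without breaking framing.
void encodeString(BitWriter& w, const uint8_t* s, int len) {
    w.putBits(len - 1, 5);                 // 0-31 encodes lengths 1-32
    for (int i = 0; i < len; ++i) w.putBits(s[i], 8);
}
```

The overhead trade-off is visible directly: 5 bits of length prefix per string versus the 8-bit NUL terminator, so the framing itself even gets slightly cheaper; the cost is that strings no longer end on byte boundaries.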
Another question is: how much of this is already exploited (and therefore worthless)? I guess at least PAQ "sees" that the last
bits before a 0 byte tend to be 0, so this part isn't really worth it, is it? The same goes for the statistics gathered so far - I
don't think the ratio can be improved much using this information. The next step (probabilities depending on the last 8 bits)
might be worth a try, but even this could be useless. What are the experts' opinions?

2. Very interesting. I think the low probability of 0 bits is due to eliminating machines that output all zeros. Otherwise the probabilities should be 50%.

The length 22 string statistics are probably due to machines that output 2 bits every 3 steps and reach the time limit.

It seems like in the uiq2_t transform that there should be some correlation between the 2 files that could be exploited.

I am still waiting for someone to write a compressor that pre-trains the model using the UIQ2 generator. As a simple example I create 2 test files:

uiq2 u1 1
uiq2 u2 2

Compressing separately with zpaq c (default compression)

u1 6517929 -> 3110433
u2 6519743 -> 3111981

Compressing both into one archive:

u1 6517929 -> 3110433
u2 6519743 -> 3006612
-> 6117045
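To put the numbers above in perspective, the gain from putting u2 in the same archive as u1 can be computed directly (a trivial sketch using the quoted zpaq sizes; function names are mine):

```cpp
// Sizes quoted above: u2 compressed alone vs. after u1 in one archive.
long separateSize() { return 3111981L; }
long combinedSize() { return 3006612L; }
long gain()         { return separateSize() - combinedSize(); }
double gainPercent(){ return 100.0 * gain() / separateSize(); }  // roughly 3.4%
```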

3. Originally Posted by Matt Mahoney
The length 22 string statistics are probably due to machines that output 2 bits every 3 steps and reach the time limit.
Ah yes, 256/3*2/8 = 21.33 bytes, that makes sense. Still, this would mean that machines outputting one of the patterns "01" and "10" would be more probable, which is another interesting thing.

Originally Posted by Matt Mahoney
I am still waiting for someone to write a compressor that pre-trains the model using the UIQ2 generator. As a simple example I create 2 test files:

uiq2 u1 1
uiq2 u2 2

Compressing separately with zpaq c (default compression)

u1 6517929 -> 3110433
u2 6519743 -> 3111981

Compressing both into one archive:

u1 6517929 -> 3110433
u2 6519743 -> 3006612
-> 6117045
Ah, interesting. I wanted to test this (compressing two input files) but held off because I thought most compressors wouldn't see the similarities and the output space would be too large (theoretically there could be about 2^256 different outputs). I guess this is another example of how wrong that assumption is; another good example, where this is even more true, is of course enwik8/enwik9.

4. There's no correlation between strings, but by pre-training the context model tables are already set to optimal values. Pre-training should be most helpful on compressors that use a lot of memory.
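The pre-training idea can be illustrated with a toy order-0 bit model (my own sketch, not code from zpaq or PAQ): feed bits from the UIQ2 generator through the model's update function before compressing, so the counters start near the true bit statistics instead of at 50/50.

```cpp
#include <cstdint>
#include <cstddef>

// Toy adaptive bit model: predicts p(bit=1) from smoothed counts.
struct BitModel {
    uint32_t n0 = 1, n1 = 1;  // Laplace-smoothed counts
    double p1() const { return double(n1) / (n0 + n1); }
    void update(int bit) { bit ? ++n1 : ++n0; }
};

// "Pre-train" the model on generator output before seeing real input,
// so its tables already hold near-optimal values at the first real bit.
void pretrain(BitModel& m, const uint8_t* gen, size_t nbytes) {
    for (size_t i = 0; i < nbytes * 8; ++i)
        m.update((gen[i / 8] >> (7 - i % 8)) & 1);
}
```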

5. I wrote a version of PAQ once that pre-trained by writing a UIQ file at runtime, but it didn't help. I don't remember the numbers, but they were disappointing, and only valuable when compressing very small files. If a file is just a few kilobytes, then yes, it saved a few hundred bytes every time. But for normal-size files, the result would be about the same size, if not slightly worse.

6. You're probably running into memory limits. Don't you have something like 1200 models with around 1 MB each? Those can fill up pretty quickly on 6 MB of input.
