After reading the post from Jan about the UIQ2 transformation, I decided to run some tests myself. I gathered statistics
about the bit probabilities for each of the 256 possible bit positions (not counting the terminating NUL byte) of the Turing
machine "strings".
The main part, creating the Turing strings and initializing the RNG, was taken directly from uiq.cpp; the only change is
that the strings are written to a 33-byte buffer instead of a file or stdout. The statistics part gathers two types of
information at the moment.
First, the total probabilities for each of the 256 bit positions are collected. This is a very basic test and not very useful.
Many of the probabilities for "bit is 0" are between 45 and 55 percent, so I guess you can't really take advantage of this,
except that you can see a "sawtooth" characteristic: looking at bytes, the probabilities for 0 get higher towards the LSB
(more on this later).
Second, the probabilities for each bit are collected again, but split depending on the byte length of the strings (1-32).
I'll post part of the results here; a full result set with exact numbers can be found at http://schnaader.info/uiq2_stats.txt.
I can provide the source code and binaries if somebody is interested, but at the moment I think the program isn't that useful.
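For reference, the counting part boils down to something like this (a minimal sketch, not the actual program - buffer and counter names are made up):
Code:
#include <cstdint>

// Zero-bit counters: [string length 1..32][bit position 0..255],
// plus the number of strings seen per length.
static std::uint64_t zeroCount[33][256];
static std::uint64_t strCount[33];

// Called once per generated Turing string; buf holds the string's
// bytes (without the terminating NUL), len is its byte length (1..32).
void countBits(const unsigned char* buf, int len) {
    ++strCount[len];
    for (int i = 0; i < len * 8; ++i)
        if (((buf[i / 8] >> (7 - i % 8)) & 1) == 0)   // bits MSB first
            ++zeroCount[len][i];
}

// Probability (0..100) that bit i of length-len strings is 0.
int zeroPercent(int len, int i) {
    return strCount[len] ? int(100 * zeroCount[len][i] / strCount[len]) : 0;
}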
Numbers are integers ranging from 0 to 100 giving the probability of each bit being 0; bits are ordered from MSB to LSB.
The number of strings examined for this is 620 million (about 5 hours of 800 MHz CPU time):
Code:
Length 1: 43 49 56 62 69 76 82 88
Length 2: 49 49 49 49 49 49 49 49 41 50 59 67 74 81 86 91
Length 3: 49 48 47 46 46 45 45 45 45 45 45 44 45 44 45 45 36 47 57 65 73 79 85 91
Length 4: 48 46 45 42 41 40 39 38 39 36 36 36 36 34 36 34 35 35 35 34 36 34 34 35 31 38 48 54 63 71 78 85
Length 16:
47 45 43 41 40 38 37 37 36 35 34 35 33 34 34 33 33 34 33 34 33 33 33 35 32 33 34 34 34 34 32 34 33 34 34 34 32 34 34 34 33 34 33 34
34 33 33 34 32 34 33 33 33 34 33 33 34 33 34 35 32 33 34 34 33 34 32 34 34 34 33 34 32 34 34 33 33 34 33 34 33 33 33 35 32 33 34 33
33 34 32 34 33 33 33 34 32 33 34 33 33 34 33 34 34 33 33 34 33 34 34 33 33 34 33 34 34 33 34 35 31 38 46 53 62 71 81 91
For most of the strings, the first bits are uniformly distributed, but the last byte always starts at low probabilities (30-40%)
and steadily grows to very high probabilities (85-90%). This is because most of the programs halt or time out at the start
or in the middle of a byte, and the remaining bits are filled up with 0 bits.
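A quick back-of-envelope model supports this: assuming the halting point is uniform over the eight bit positions of the last byte and the "real" bits are 50/50, bit k (counting from the MSB) is a forced padding zero with probability k/8, so P(bit k = 0) = 0.5 + k/16:
Code:
#include <cstdio>

// Toy model for the last byte: padded bits are always 0,
// real bits are 0 half of the time.
int main() {
    for (int k = 0; k < 8; ++k)
        std::printf("%d ", int(100 * (0.5 + k / 16.0)));
    std::printf("\n");   // prints: 50 56 62 68 75 81 87 93
}
This predicted ramp has roughly the slope of the measured one for length 1 (43 ... 88), so the simple padding explanation seems to cover most of the effect.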
Also, lengths 4 and 16 somehow favor 1 bits; the other lengths do not show this.
At length 22 (not shown here), it seems that every second bit is 0 only ~33% of the time, while the other bits are 0 about
45-50% of the time. There were only 161851 strings of length 22, and the count drops fast with length (length 29 was seen
only 38 times, lengths 30-32 didn't occur at all), so the probabilities for higher lengths should be used with caution.
After gathering this data, the next step I thought about would be to gather statistics about the probability of each bit
being 0 or 1, depending on the byte length of the string and on the previous 8 bits. But before doing this, I wanted to get
some opinions on whether this could be useful to increase the compression ratio on UIQ.
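In code, that next step would just add an 8-bit context index to the counters, roughly like this (again a sketch with made-up names; about 35 MB of counters):
Code:
#include <cstdint>

// Extended counters: [byte length][bit position][previous 8 bits].
static std::uint64_t ctxZero[33][256][256], ctxTotal[33][256][256];

void countBitsCtx(const unsigned char* buf, int len) {
    unsigned ctx = 0;                        // last 8 bits seen, 0 at start
    for (int i = 0; i < len * 8; ++i) {
        int bit = (buf[i / 8] >> (7 - i % 8)) & 1;
        ++ctxTotal[len][i][ctx];
        if (bit == 0) ++ctxZero[len][i][ctx];
        ctx = ((ctx << 1) | bit) & 0xFF;     // slide the 8-bit window
    }
}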
My first thought was to flip each bit that has a high probability of being 1, so that more of the bits would be 0, but I think
this would only pay off for more skewed distributions. Perhaps the next step will reveal some of those, but at the moment
the lowest probabilities are at about 30% and they're not that common. Another disadvantage of this method is that it
creates some NUL bytes inside the strings, so there would be some overhead from switching to another format (e.g.
"5 bits for the length of the encoded string + encoded string" instead of "string + NUL").
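The format switch itself would be simple; a sketch of the idea (encodeString and writeBits are made-up helpers, not from uiq.cpp):
Code:
#include <vector>

// Re-encode as "5-bit length + payload" instead of "payload + NUL",
// so that bit-flipping may safely create NUL bytes inside the string.
void encodeString(std::vector<bool>& out, const unsigned char* s, int len) {
    auto writeBits = [&](unsigned v, int n) {
        for (int b = n - 1; b >= 0; --b)
            out.push_back(((v >> b) & 1) != 0);
    };
    writeBits(len - 1, 5);                   // lengths 1..32 fit in 5 bits
    for (int i = 0; i < len; ++i)
        writeBits(s[i], 8);                  // payload, possibly bit-flipped
}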
Another question is: how much of this is already exploited (and therefore worthless)? I guess at least PAQ "sees" that the
last bits before a 0 byte tend to be 0, so this isn't really worth it, is it? The same goes for the statistics gathered so far -
I don't think the ratio can be improved much using this information. The next step (probabilities depending on the previous
8 bits) might be worth a try, but even this could be useless. What are the experts' opinions?