
Originally Posted by Matt Mahoney
The vast majority of strings are random, meaning that for a chosen programming language there is no program shorter than the string that outputs it. This is easy to see by counting: for any length N there are 2^N possible strings, but only 2^0 + 2^1 + ... + 2^(N-1) = 2^N - 1 programs shorter than N bits, and each program outputs at most one string. More sharply, fewer than 2^(N-k) programs are shorter than N-k bits, so less than one string in 2^k can be compressed by more than k bits.
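Putting numbers on that bound (a quick sketch; the loop just prints the 1-in-2^k fraction from the counting argument):

```python
# Counting argument in numbers: fewer than 2^(N-k) programs are shorter
# than N-k bits, so fewer than one string in 2^k of the 2^N strings of
# length N can be compressed by more than k bits.
for k in range(1, 11):
    print("compressible by more than %2d bits: fewer than 1 in %d strings"
          % (k, 2 ** k))
```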
There is also no algorithm that tells you in general which of those 2^N strings are non-random. If such an algorithm existed, you could write a program that enumerates strings of length N in order and outputs the first one the algorithm says is random. That program only has to encode N, so its length grows like log N plus a constant; pick N large enough and the program is shorter than the random string it outputs, which contradicts the assumption that the algorithm exists.
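Here is the contradiction as code, a minimal sketch assuming a hypothetical oracle is_random() (the whole point of the argument is that no computable version of it can exist):

```python
def is_random(s: str) -> bool:
    # Hypothetical oracle that decides whether a string is Kolmogorov
    # random.  No computable function can do this; that is what the
    # argument proves.
    raise NotImplementedError

def first_random_string(N: int) -> str:
    # Enumerate all N-bit strings in order and return the first one the
    # oracle calls random.  This whole program can be described in
    # O(log N) bits (the digits of N plus a constant), so for large N it
    # is shorter than the N-bit string it returns -- a contradiction.
    for i in range(2 ** N):
        s = format(i, "0%db" % N)
        if is_random(s):
            return s
    raise ValueError("unreachable: most N-bit strings are random")
```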
A practical example would be an encrypted string of all zero bits. It would look random, and the only way you could tell the difference would be to guess the key.
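You can see this with a toy experiment. The sketch below uses SHA-256 in counter mode as a stand-in stream cipher (an assumption for self-containment, not a real cipher recommendation) and compares how well zlib compresses the plaintext zeros versus the ciphertext:

```python
import hashlib, zlib

key = b"secret key"
zeros = bytes(1 << 16)  # 64 KiB of zero bytes

# Toy stream cipher: SHA-256 in counter mode, XORed into the plaintext.
stream = b"".join(hashlib.sha256(key + i.to_bytes(8, "big")).digest()
                  for i in range(len(zeros) // 32))
cipher = bytes(a ^ b for a, b in zip(zeros, stream))

print(len(zlib.compress(zeros)))   # tiny: all-zero input compresses away
print(len(zlib.compress(cipher)))  # ~64 KiB: ciphertext will not compress
```

Without the key, no statistical test will tell the ciphertext apart from truly random bytes, even though a few lines of code plus the key are enough to generate it.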
In practice, a lot of the strings we care about, such as text and video, are not random because they are generated by (in theory) computable processes. That is why compression works at all: almost every string is incompressible, but the data we actually store comes from a tiny, highly non-random subset.
It is sometimes still useful to have fast tests that usually detect random data. zpaq uses such a test to store random-looking data as-is rather than wasting time trying to compress it: it looks at the order-1 statistics left over from the fragmentation algorithm and checks whether they differ from what random data would produce. The test is not perfect (it can't be), but it is usually an acceptable tradeoff of compression for speed.
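The idea looks roughly like this (a sketch of the principle, not zpaq's actual code; the 7.9 bits-per-byte threshold and 1 MiB minimum are made-up illustrative values):

```python
import math
from collections import Counter

def order1_bits_per_byte(data: bytes) -> float:
    # Estimated order-1 entropy: average bits per byte given the previous
    # byte.  Random bytes score close to 8; text and most uncompressed
    # media score well below.  (The estimate is biased low on short
    # inputs, since each of the 256 contexts needs enough samples.)
    pairs = Counter(zip(data, data[1:]))
    contexts = Counter(data[:-1])
    n = len(data) - 1
    return -sum((count / n) * math.log2(count / contexts[prev])
                for (prev, cur), count in pairs.items())

def looks_random(data: bytes) -> bool:
    # Cheap filter: if the order-1 statistics are already near uniform,
    # store the block instead of trying to compress it.
    return len(data) >= 1 << 20 and order1_bits_per_byte(data) > 7.9
```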