Find the shortest Turing machine description of the input ----> gzip
Find the shortest Turing machine description of the input ----> gzip
Last edited by nburns; 2nd September 2013 at 06:57.
gzip + Turing machine = another Turing machine
I'm not sure I follow you. I'd gzip the Turing machine description. Are you saying you could treat the output of gzip as a Turing machine description?
I've always gotten the impression that a Turing machine description would have to look like a list of states (since a TM is a FSM + tape to write on). I don't envision it as being super-compact, and there would have to be a bunch of boilerplate just to interact with the tape and stuff. So it would probably help to compress it.
I know there are other formulations of Kolmogorov compressors that allow you to substitute whatever compact language you prefer, along with an interpreter. I don't put a lot of stock in what I know about the theory, since it comes from the web, not textbooks or the original research papers (which are probably in Russian). But I can't quite seem to appreciate why it must be the optimal form of compression. It seems to be in conflict with Shannon's model of information as the output of a random process. If the message is essentially random, then it's not an uncomputable problem. Of course, no input is actually random, but even if human beings are basically Turing machines, recreating the person that created the message doesn't sound efficient. It still might be better to model it as random.
Last edited by nburns; 5th September 2013 at 03:50.
You can assume that even the Turing machine description is as compact as possible and gzipping will lead to no major improvements.
To take away one layer of abstraction, look at the example of the shortest x86 code that leads to your data when executed. Like with the Turing machine description, you can try to gzip this, but this wouldn't lead to any improvent. Why not? Because you'll have to include the gzip decompression code to the new size, so you'd basically have this:
gzip decompression stub code (x86 code)
shortest x86 code (gzipped data)
If this is smaller than the "shortest x86 code" you found, you did something wrong! It is pure x86 code again, so if you did a true Kolmogorov compression before, you already found that self-decompressing program above as "shortest x86 code". In other words, there's no sense in trying to compress Kolmogorov representations of data because they will already contain their own compression layers.
Of course, this is different when working "cross-representational", e.g. gzipping Turing machine descriptions as you originally intended, but it doesn't change the argument that much and you still can assume that the Turing machine description will be a very compact representation of your data and gzip won't achieve major improvements. Besides, true Kolmogorov representations of anything you try to compress will be that compact and much much better than anything else you could try that you won't struggle to shave some more bytes off
http://schnaader.info
Damn kids. They're all alike.
gzip wouldn't be built into the archive. The result isn't meant to be a Kolmogorov Turing machine, it's a Kolmogorov Turing machine that's been gzipped.
In general, I don't see why the output must be a program in order to be minimal in space. There are arguments that it only would add constant or a small amount of overhead. That's a perfectly good argument for ignoring it if it's an infinite function and you're talking about the asymptotic behavior, but one single piece of data is constant to begin with. So you're adding a constant to a constant. That constant depends on what language you choose. So you're back to having an infinite number of choices that give an infinite number of sizes. I don't quite follow the concept of the perfect minimal description language that's in the theory. I'm assuming that even though I don't understand it, it must make sense under some assumptions, because the people that came up with it weren't dumb. I'm just not sure that the assumptions they're using (all math depends on assumptions) are relevant to the problem of compression that we actually deal with.
Shannon's model is that information is basically the output of a random process, i.e. a random number that's been transformed by some function. Or multiple random numbers, transformed by multiple functions and then combined -- it's probably equivalent on some level. Depending on what you assume, you could say that the functions are Turing machines, and therefore you can't compute the shortest versions of them. But the functions are not random for every message. And, obviously, since all you have access to is samples of their output, it's probably impossible to know what they are. But you find approximations, and since they generate infinitely-many messages and are assumed to be relatively fixed and not part of the message itself, you might as well build them into the compressor. Then, what you'd transmit as the compressed message is just the random inputs to these functions, which you can later use to regenerate the message.
This theory captures what we all know about compression, which is that no compressor compresses everything, therefore a compressor is a trade-off that involves a choice of what to compress.
Shannon entropy tells you the shortest possible encoding, assuming you know the probability distribution. In general you don't. Kolmogorov complexity is a "universal" way to estimate probability: p(x) = 2^-|M| where M is the shortest program that outputs x. But it depends on the choice of language used to encode M. Kolmogorov proved that all languages are equivalent up to a constant that does not depend on x. You can always translate from M1 to M2 by writing a fixed length translator in one language or the other and appending it to x. In practice, the translator can be big, like possibly 1 GB to translate English to C. It also does not help to use extremely simple Turing machines like https://en.wikipedia.org/wiki/Wolfra...Turing_machine because even very short strings would require huge programs (which we hardly know how to write).
That agrees with my formulation. I described the probability distribution as a function that accepts a random number and produces the message corresponding to that number. (The function could be composed of multiple smaller functions + a combining function, and you could imagine the small functions as taking a random number each.) Assume that the function can produce any message. The probability of a given message is the probability that the random number is equal to the number or numbers that produce the message when passed to the function. This definition is general enough to incorporate Shannon or Kolmogorov.
If the function is a universal Turing machine, and the input n corresponds to the nth Turing machine, then you basically have a Kolmogorov decompressor. The compressor, in general, depends on the inverse of the function. The universal Turing machine function isn't one-to-one, so it doesn't have an inverse, but if you require n to be the number of a minimal Turing machine, then the inverse is a Kolmogorov compressor.
That all makes sense. Assuming that you can solve uncomputable problems, that's a well-defined way to measure complexity (relative to some language, as long as you stay away from paradoxes). Of course, complexity itself is not well defined, so you could choose another definition and another way of measuring it.Kolmogorov complexity is a "universal" way to estimate probability: p(x) = 2^-|M| where M is the shortest program that outputs x. But it depends on the choice of language used to encode M. Kolmogorov proved that all languages are equivalent up to a constant that does not depend on x. You can always translate from M1 to M2 by writing a fixed length translator in one language or the other and appending it to x. In practice, the translator can be big, like possibly 1 GB to translate English to C. It also does not help to use extremely simple Turing machines like https://en.wikipedia.org/wiki/Wolfra...Turing_machine because even very short strings would require huge programs (which we hardly know how to write).
If you have a way of measuring complexity, then you can define a compressor that optimizes for it. All you need to do is rank all possible messages by their complexity, and then for a given message output its rank. You could find one that optimally compresses images, given a measure of image complexity. If you assume that the input is an image, then you can certainly do better than the Kolmogorov compressor.
I guess the question is what information you assign to be part of the message and what information you assign to its external context. I don't think this can be solved exactly.
Some things have crystallized in my mind. The Kolmogorov complexity is something approaching the most general unrestricted measure of information content, given the least assumptions about where the data came from.
The Kolmogorov complexity is not assumption-free. This is trivial: if data is random and uniformly distributed, then the information is irreducible. The assumption behind Kolmogorov complexity is that the data came from a Turing computable function, with probability inverse to machine size.
As such, Kolmogorov is like an *upper bound* on the best compression you could achieve for *any particular* Turing-computable model. It's trivially true that by optimizing for all computable functions simultaneously, you can't achieve the optimal for any restricted subset of functions.
I don't think anyone actually believes that real information comes from a uniform distribution of minimal Turing machines. If it did, then there's a good chance that compression in general would be impossible, and real compressors wouldn't work as well as they do.
My feeling is that compression requires you to make assumptions about the source of the data. Assumptions, by nature, are just choices. They can't be proven.
If your assumption is that the distribution induced by the Kolmogorov complexity is the true distribution, then compression is probably an impossible problem, and we might as well give up and go home.
The success of compression is evidence to the contrary. Luckily, some assumptions about where data comes from give exact computable solutions to compression, and others give good approximate solutions. Choosing one of these assumed distributions gives you a tractable problem to solve.
Matt -- what do you think? Am I making sense?
Actually, random Turing machines can easily produce compressible data. They must, if they produce more output than their own size.
Also, we can prove that every possible probability distribution over an infinite number of Turing machines must favor the smaller ones. For any program of length n with probability p > 0, there must be an infinite number of programs longer than n with probability less than p, but only a finite number of programs longer than n with probability greater than p.
I think these two facts prove the validity of a universal distribution but are not enough to prove how useful it is in machine learning and just about every branch of science because there still is the language dependent constant that can be quite large. Probabilities are mathematical models of belief, something our own brains compute. I think there is a bias toward using natural language to get good agreement with our intuitions. The models that have the shortest descriptions when expressed in English are the ones that make the most accurate predictions, because such models predict things that are important to us.
True. When I said impossible to compress, I meant because it's uncomputable.
True, but don't you think some Turing machines are more likely than others? I mean, in reality. And since the source of real data need not be small Turing machines, but very large and complex ones (like people), one could ask why minimal Turing machine size is the best measure of probability. A single message wouldn't be the complete output of the T.M. source, but just a fragment of it.Also, we can prove that every possible probability distribution over an infinite number of Turing machines must favor the smaller ones. For any program of length n with probability p > 0, there must be an infinite number of programs longer than n with probability less than p, but only a finite number of programs longer than n with probability greater than p.
And the usefulness rests on the usefulness of its assumptions.I think these two facts prove the validity of a universal distribution but are not enough to prove how useful it is in machine learning and just about every branch of science because there still is the language dependent constant that can be quite large.
I think I see your point. But there is probably a trade-off between accuracy and description length. The output of a Kolmogorov compressor would represent an optimal trade-off, but not necessarily the most accurate model.The models that have the shortest descriptions when expressed in English are the ones that make the most accurate predictions, because such models predict things that are important to us.
Last edited by nburns; 10th September 2013 at 03:58.
That's a good question. Probability is a mathematical model we use to describe what we don't know. In reality, the universe could be deterministic, maybe the output of a program only a few hundred bits long to describe the laws of physics and its state as it evolved from the big bang. We don't know what this program is because we don't have a complete model of physics (quantum mechanics conflicts with general relativity) and there might be additional information to describe an initial state that allowed intelligent life to evolve. Anything we say about one program being more likely than another is just a guess.
Occam's Razor (short programs are more likely than longer ones) is required for all possible probability distributions over infinite sets of programs in any possible language. All computers (including human brains) must also have probabilistic models of the universe that contains them because two computers cannot both mutually simulate each other, and the universe "simulates" us. No computer within the universe can have a memory large enough to describe the state of the universe (about 10^120 bits) that it is part of.
But I am not sure these two facts completely answers the question of why short programs are better at predicting the output of the program that describes our universe, assuming such a program exists (and there is no way to know for sure, because such a program would also encode our beliefs). Is it possible to create any "universe" program where shorter programs are not better predictors?
To put it in terms of propositional logic-But I am not sure these two facts completely answers the question of why short programs are better at predicting the output of the program that describes our universe, assuming such a program exists (and there is no way to know for sure, because such a program would also encode our beliefs). Is it possible to create any "universe" program where shorter programs are not better predictors?
this is probably true:
good predictor -> short program
but the converse is not implied:
short program -/> good predictor
Actually the other way around. If two programs are both consistent with past evidence and that is all you know, then the shorter one will predict better. But the best predictor will be exactly equivalent to the source.
That's not a safe assumption, and I didn't imply that.
I sense that either I am missing some nuance of the argument you're making, or you are missing some nuance of mine (or both). Part of why I think the Kolmogorov compressor is not useful is that it's so slippery and hard to argue against. I think that slipperiness is why we seem to be going in circles.
I mean that if you want to predict a sequence of bits, then you find the shortest program that produces the bits seen so far, then let it run and predict the next bit that it outputs.
Short programs are only useful if they predict past data.
Clearly. And if they have not yet predicted past data, but will do so at some future time, then that also makes them useful to assign short codes to when designing a compressor. Obviously, predicting the future will always be impossible to do consistently well, especially when making predictions that are expected to be equally valid at all future points through the end of eternity.
I don't really question your assertion that small programs are the best predictors -- although, I don't see a way to actually prove this; I think it's something that has to be taken on faith.
But I think much stronger claims are routinely made about Kolmogorov compression -- i.e., that it represents the best possible form of compression, for all data, for all time, period. That's what I take issue with. I don't see any reason why nature should be obeying the distribution that's implied by the Kolmogorov compressor (to the extent that it can be said to imply a particular distribution at all -- a lot of important details are left unspecified, e.g. the language). No one has proved that nature doesn't obey this distribution -- it might be impossible to answer that question, but I have a feeling that almost nobody has even really tried -- maybe, some progress would turn out (surprisingly) to be possible if enough people actually tried. But even if it's impossible to disprove, and it, too, is something that has to be taken on faith -- taking things on faith is completely useless in math. Something accepted on faith can't teach anybody anything -- it's nothing but prejudice. So there would forever be just as good a reason to dismiss Kolmogorov compression as worthless nonsense than to take it seriously.
I said that I think Kolmogorov compression seems like it would represent the best possible way to design a compressor for data that is coming from a uniform distribution of Turing machines weighted by length (with a few more little caveats added). Actually, that pretty much says nothing, because that happens to be the distribution that Kolmogorov compression is designed for. So it's circular. Also, I think it could be interpreted as an upper bound on the best compression achievable for any proper subset of all Turing machines, weighted by length. But those are very limited claims -- much weaker than being the best possible, anytime, anywhere.
The evidence for a universal distribution is mostly empirical. In all branches of science, the shortest or most "elegant" theory consistent with the data is usually the best predictor. A complex theory can predict anything.
Here is an example of its application to data compression. Suppose you want to predict the next bit in these two sequences.
1000000000
0000000001
Simply counting bits, you would say p(1) = 0.1. But we don't actually write compressors that way. In the second sequence, you would give p(1) a higher value. The reason is the relative lengths of programs that output these sequences. In the first case, you have a short description that predicts 0 and a longer one that predicts 1.
a. Output 1, then 0 forever.
b. Output 1, then 0 9 times, then 1 forever.
In the second case, the 2 choices are closer in length.
a. Output 0 forever except bit 10.
b. Output 0 9 times, then 1 forever.
This also is true if you replace the English descriptions with programs in most languages.
That is a good rule of thumb for people, because people easily get off track. I don't really see why it should formalize into hard math and apply to all programs in general, which in this case are not chosen by people but are chosen to be exactly optimal. We have no experience of this kind of process, because it is not computable and therefore absolutely impossible to implement.
...
Again, that only deals with one part of my argument, the part I've chosen not to stress. Even if shortness tends to be essential to good predictions, that doesn't explain why all programs should have probabilities determined solely by their length.
Last edited by nburns; 19th September 2013 at 07:13.
What other probability distribution could you have? For any given program, description, or theory, there are an infinite number of longer once with lower probability but only a finite number with higher probability.
Yes! What other choice is there? It makes some sense, and the resulting distribution would not be completely irrational -- messages with similar Turing machine descriptions would tend to be put in the right order. But I don't see any logic in using that metric on dissimilar Turing machines. I think the ordering there would be a meaningless artifact.
It's about the best stab you could make at a universal distribution. But it seems like a lot of hand-waving. The idea of a mathematically-based, universal best compressor is kind of dubious in my view.
If I had access to something like an oracle for creating optimal Turing machines, here's how I would use it:
I would collect a large sample of actual data, something like the collection of cached web pages that Google has on its servers. I would feed this sample to the oracle, like training data, and ask it to create the best compressor for data like that. The result would be a general-purpose compressor. Rather than outputting a brand new Turing machine for each new input, it might, for instance, start by outputting a single number that selected one of the pregenerated models, then follow with parameters to that model that describe the specific input file. If it was possible to ever resolve, I'd be willing to bet that this method would beat the Kolmogorov compressor on real data virtually 100% of the time. The only time it would routinely lose would be for things like the first billion digits of pi, which artificially favors the Kolmogorov approach (what would be the point of compressing pi, anyway?).
Using a Turing machine as an encoding would incur substantial overhead. It would be useful only when the savings from having the full power of a Turing machine were significant enough to offset all of the overhead and then some. You could definitely contrive situations like that, but for real data, like English text, I doubt that the power of a unique Turing machine for each input could be put to good use. By placing the model logic in the (de)compressor and not inside the message, the model could be made arbitrarily complex, and you could do things like include a sophisticated statistical model of English with dictionaries of tens of thousands of words and rules for grammatical structure. The output would be limited to the parameters to feed to the model. With Kolmogorov, the model would be stripped to the bare minimum, because every opcode is overhead and needs to be precisely balanced with the "parameters" to the model (to the extent that a distinction could be made). It would be an extraordinarily difficult thing to beat the global best general-purpose model for English, with no limit on resources available, in a self-decompressing program that needs to fit a full model into a tight overhead budget. The only premise I imagine that supports the self-contained program being optimal is that, in some profound and inexplicable way, text documents are all chosen, mysteriously, from a list of short computable functions that happen to use English as something like a transmission format. If this was true, then the ultimate best compression might involve unraveling this secret code and the rules by which it translates back into its English representation, encoding it with help from the Kolmogorov oracle. The overhead incurred by the Turing machine encoding could be reduced, perhaps, by replacing the actual machine description with an index into the list of all minimal Turing machines -- making decoding also uncomputable, but so what? But there would still remain an impact on the ordering due to the overhead of the English language model.
Last edited by nburns; 20th September 2013 at 07:25.
If you are comparing to Kolmogorov compressors, then you have to include the size of your decompression program. The size will depend on what language it is written in, C or x86 or whatever. I would choose a language specifically designed for representing compressed data, like ZPAQL
Could you name any compressor that generates Turing machines that isn't the Kolmogorov compressor? Sure, you can have an already-made TM and attach it to the end. And you could write your own in ZPAQ and have it put inside. But isn't it impossible (uncomputable) for a machine to improve compression by outputting a TM? I think the Kolmogorov is the sole member of its class. So it's the best, but, also the worst, and the only.
An alternative concept of the "best compressor" would be one that optimizes for real data by sampling what's really out there. It would have to change over time to reflect the current mix. I imagined some kind of committee with infinite amounts of time watching all the data in the world go by and judging its importance. But lately, it occurred to me that there's actually a real analog to this type of system: the economy. The market sets prices on everything you can buy or sell, and the price is a measure of how scarce or common it is. If you wanted to create a universal model of every good and service in the economy, you could just sort by price on the grounds that less costly commodities are the most common. Determining the right number of bits to represent some chunk of data is like adding up the prices of all the individual bits of information and figuring out a total price. There would probably be discounts when some bit of information is used repeatedly, or when related items appear as a unit, similar to how things work in the real economy, with wholesale, etc.
Last edited by nburns; 3rd October 2013 at 00:15.
For the right price you can compress the Calgary corpus files to 1 byte each (or 0 bytes using recursion).
http://mattmahoney.net/dc/barf.html
What if you create a file that's zero bytes called CalgaryCorpus.compressed? That's unambiguous, and the person receiving it could easily download the Calgary Corpus and fill it in as the result.
Yep, that's the general idea. You can compress any file to 0 bytes by putting the data in the file name. (Well, you may have to split it into smaller files).