
Originally Posted by Paul W.
It seems to me that low-level compression-type estimates of genomic information content are probably way too high.
You have to realize that you're looking at inefficiently coded instructions in a programming language, and understand what kind of programming language it is, before you can tell which information is interesting and which is mostly arbitrary stuff that could be replaced with other arbitrary stuff with no loss of function.
As I understand it, the vast majority of functional genes are essentially productions in a fuzzy propositional production (rule) system.
A typical gene is just a rule with a boolean-like conditional and a consequent that may encode several propositions, like
If A and B and not C then E and F and G
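As a programmer's sketch of that kind of rule (the proposition names and the min/max fuzzy semantics here are my own illustrative choices, not anything specific to real regulatory networks), in Python:

    # Hypothetical sketch: one gene as a fuzzy production rule.
    # Proposition values are signal concentrations scaled to [0, 1].

    def fuzzy_and(*xs):
        return min(xs)          # fuzzy AND: the weakest signal limits the result

    def fuzzy_not(x):
        return 1.0 - x          # fuzzy NOT: a strong signal makes "not" weak

    def rule_strength(conc):
        # "If A and B and not C then E and F and G"
        return fuzzy_and(conc["A"], conc["B"], fuzzy_not(conc["C"]))

    concentrations = {"A": 0.9, "B": 0.7, "C": 0.2}
    strength = rule_strength(concentrations)             # 0.7 in this example
    products = {"E": strength, "F": strength, "G": strength}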
The (fuzzy) values of propositions A, B, and C are implemented as concentrations of molecules with particular-shaped binding sites on them, which can bind to the inhibitory or promoting regions of the gene (the Left Hand Side of a rule) to encourage or discourage it from "firing" (in rule-based system parlance), i.e., being "transcribed" (in molecular genetics parlance).
When a gene is transcribed, the base sequence in the coding region directs the construction of a corresponding RNA molecule, which may itself be a signaling molecule, or may be translated to create a corresponding protein, which is then the signaling molecule.
Whichever happens, the usual thing is that all that matters about that sequence of bases or amino acids is the SHAPE it naturally folds up into (given molecular kinetics in cytoplasm/nucleoplasm), and all that matters about that shape is certain importantly active REGIONS of that shape---areas on the surface of the molecule that geometrically fit (more or less) into the promoting or inhibiting binding sites in the control region of other genes (and/or the same gene).
At a short timescale, what you have is basically a discrete but stochastic rule-based system, where the probabilities of the firing of different rules depend on the concentrations of molecules with suitably-shaped regions exposed. The more promoting molecules you have bouncing around in the plasm, the more often one will dock for a while to the promoter region of a gene; likewise, the more inhibiting molecules you have, the more often one will dock to the inhibiting region.
Whether a "rule" actually fires at a given moment depends on those concentrations, the chance bouncing around of signalling molecules, and whether the gene transcription machinery is around at the time and ready to grab it and go.
The upshot of this is that the information content of genes must be a whole lot lower than it looks---astonishingly low.
Most sequences of coded RNA or protein are just structural stuff to affect how the signalling molecule folds, and any number of sequences would do the same job---a small minority of all possible sequences, but still quite a large number. Those sequences are just there to ensure that the actually important regions of the molecule---the ones whose shape determines its activity in promoting or inhibiting gene firing---end up exposed and not interfering with each other.
Changes to the RNA or protein sequence that don't affect the shapes of promoting or inhibiting regions, or affect whether they're exposed appropriately, do not matter for normal gene function.
You see this in patterns of variation of highly conserved genes---genes that have been around for a zillion years, because they're important. You get a lot of random mutations in some regions that don't matter much, because they don't affect the function of the gene. You also get some regions that mutate interestingly, where most mutations are bad and go away, but others hang around because they're equivalent, and a gene may mutate back and forth between various equivalent forms over time.
As I understand it, the actual information content of the genome must therefore be shockingly low. Most parts of most coding sequences are just there for spacing, not actually encoding interesting information.
You could probably tighten a guessed-at upper bound on the information content of the genome by taking that into account, and noticing which regions of genes seem to vary phylogenetically and within species without much effect (i.e., noticing stretches that have a lot of randomish variants).
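To make that concrete, here's a toy version of such an estimate: take aligned variants of one stretch of a gene and discount each position by how freely it varies, so positions that tolerate lots of randomish variants contribute almost nothing. The alignment below is made up, and a real estimate would need real data and a real model, but the arithmetic is the point:

    from collections import Counter
    from math import log2

    # Made-up aligned variants of one short coding stretch.
    variants = [
        "ATGGCATTAGCC",
        "ATGGCTTTAGCA",
        "ATGGCGTTAGCT",
        "ATGGCCTTAGCG",
    ]

    def column_entropy(column):
        counts = Counter(column)
        n = len(column)
        return -sum((c / n) * log2(c / n) for c in counts.values())

    total_bits = 0.0
    for col in zip(*variants):
        # 2 bits is the naive per-base content; subtract the observed variability
        total_bits += max(0.0, 2.0 - column_entropy(col))

    print(f"rough upper bound: {total_bits:.1f} bits, "
          f"naive estimate: {2 * len(variants[0])} bits")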
Such an estimate would still be way too high, though, because it wouldn't take into account the fact that of all the equivalent sequences for some uninteresting section of coding DNA, evolution's only going to find a few, because its search is very greedy and favors isolated, locally harmless changes. Anything that requires a combination of two compensating mutations to generate a functionally equivalent molecule is much less likely to be found than a single harmless mutation, and sets of three or four are very unlikely to be found.
For example, if one mutation shortens the folded molecule in between two crucial active sites such that they interfere, and another lengthens it such that they don't, evolution may only find that combination if it finds the lengthening one first---assuming lengthening doesn't hurt much---and then compacts things again with the shortening one. RNA- and protein-folding effects tend to be bizarrely nonlinear, so harmless combinations of two or three mutations are at a huge disadvantage compared to singleton harmless mutations.
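A toy model makes the greediness point vivid (the "exactly three G's" fitness rule below is a deliberately silly stand-in for "folds into a working shape"; the only thing that matters is that harmful intermediates get vetoed):

    from itertools import product

    BASES = "ACGT"
    START = "GGGAAA"

    def functional(seq):
        # crude stand-in for "folds into a working shape"
        return seq.count("G") == 3

    def mutate(seq, pos, base):
        return seq[:pos] + base + seq[pos + 1:]

    equivalent_doubles = 0
    greedily_reachable = 0
    for i in range(len(START)):
        for j in range(i + 1, len(START)):
            for bi, bj in product(BASES, BASES):
                if bi == START[i] or bj == START[j]:
                    continue                      # require a real change at both positions
                double = mutate(mutate(START, i, bi), j, bj)
                if functional(double):
                    equivalent_doubles += 1
                    # greedy search needs a harmless single-mutation stepping stone
                    if functional(mutate(START, i, bi)) or functional(mutate(START, j, bj)):
                        greedily_reachable += 1

    print(equivalent_doubles, "functionally equivalent double mutants")
    print(greedily_reachable, "reachable without passing through a harmful intermediate")

In this toy, most of the jointly harmless double mutants require a harmful intermediate, so a search that only keeps locally harmless steps never samples them.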
That means that if evolution preserves sequence information, that may only be because it doesn't discover the vast majority of (more or less) equivalent sequences. Most of the information in the genome at the codon level isn't there because it "really matters" to what the gene is actually for, but because evolution doesn't know any better in the short run---it could be stripped out and replaced by something much simpler, but evolution doesn't know how.
To understand the actual information content of the genome, we need to get a handle on a couple of basic things:
1. What do the genes look like when viewed as productions in a fuzzy propositional production system---how complicated are the boolean expressions on the left hand side (regulatory region) and the consequents on the right hand side (coding region)? To figure that out, we need to know what binds with what.
2. What else is going on, like conditional transcription, and with what effects. Genes often don't code for a single molecular product, but for a family of products, with odd editing things going on before the final molecule is produced. (AIUI, a gene that at first glance seems to produce a protein with one sequence may under different conditions produce 20 variant versions.)
Depending on how that conditional transcription works and how it's effectively used, that could change things a lot. It may essentially act as a macro preprocessor that allows genes to encode significantly more information than they might naively seem to.
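For a sense of scale, suppose (purely hypothetically) a gene has a fixed core plus a handful of optional segments whose inclusion depends on conditions; the arithmetic of how many products one sequence can stand for looks like this:

    from math import log2

    # Hypothetical: 5 optional segments that can be independently included or
    # skipped in the final product, on top of a fixed core.
    optional_segments = 5
    distinct_products = 2 ** optional_segments      # 32 variants from one gene
    selection_bits = log2(distinct_products)        # info carried by the inclusion pattern

    print(f"{distinct_products} possible products from one coding sequence "
          f"(~{selection_bits:.0f} bits of 'which variant' selection)")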