
Thread: Information content of human genome

  1. #31 (Paul W.)
    "We don't know which genetic information is important. But we do know there is a wide range of genome sizes due to differences in levels of duplication and differences in genetic pressure to keep the genome small to enable rapid growth and reproduction. The best we can do is consider the smallest genomes among large groups of species."

    Yes. We're pretty sure at this point that for many species with large genomes, the bulk of the differences in size are due to different loads of parasitic, nonfunctional genes like LINEs and SINEs, or to relatively recent doublings of large sections of the genome, much of which is useless.

    Doubling genomes is actually a crucially important kind of mutation in evolutionary biology---e.g., some of our Hox genes that control basic embryonic development have been doubled and doubled again (quadrupled), with extra copies getting specialized for new purposes. But it's also clear that it's brutally inefficient---most copies of duplicated genes eventually get weeded out, or just hang on for a free ride without too much negative effect.

    Compared to duplicating only the genes you're actually going to end up reusing---which evolution doesn't know how to do in any general way---it's a very, very noisy process.

  2. #32 (nburns)
    The genetic code is a programming language in the sense that it's a sequence of "opcodes" which "program" the cell's protein-synthesizing machinery to produce various proteins. That part of the programming appears relatively straightforward and predictable.

    The proteins themselves, once synthesized, go on to lead very complex lives, governed first by chemistry, and then by higher-level emergent systems of unlimited complexity. Even the lowest level, the chemistry, is too complex to be predictable by equations; the only way to get decent predictions is with supercomputer simulations.

    A programming language based on proteins would be basically unusable for anything recognizable as programming by humans. Each protein, defined by its amino acid sequence, would be kind of like a randomly chosen Turing machine: there would be no way to fully discern its behavior. To do programming with proteins, about the most effective technique would be to construct reductionistic models and then test various protein combinations guided by the models to see if you got anywhere close. That's pretty much how drug discovery works. But drug molecules are typically much simpler (and less powerful) than proteins.
    Last edited by nburns; 6th December 2013 at 22:58.

  3. #33 (Paul W.)
    nburns,

    The fact that it's messy doesn't mean it isn't exactly and literally programming. Programming a stochastic rule-based system is a bitch, for sure, as I can tell you from experience, but it's very much programming.

    "The proteins themselves, once synthesized, though, go on to lead very complex lives, governed first by chemistry, and then by higher-level emergent systems of unlimited complexity."

    Most genes are purely computational, serving to switch other genes on or off, or to regulate graded function in a more or less fuzzy-logic kind of way. Most transcription products are purely informational, serving only as the binary or fuzzy variable values.

    Most of the genome is nothing but a computer program.

    Not all of it, for sure, because a cell is basically a reactive robot, with structural proteins also being generated that become part of the robot, and some informational products serving to control various actuators outside the computer program proper.

    That should not be surprising, and doesn't mean that the bulk of the genome isn't just the computer program that orchestrates it all. It's a robot programming language, so it does i/o that controls the robot.

    And in fact, gene expression viewed as the working of an asynchronous production system for a reactive robot looks a whole lot like some robot programming languages invented by Rod Brooks and his students at MIT for programming reactive mobile ("bug") robots.

    I'm pretty sure that's not just a coincidence. Nature had to solve many of the same problems of programming reactive robots that Brooks and Chapman and Agre did in the 1980s. Chapman and Agre limited themselves to stuff that could be implemented by certain kinds of simple, incredibly fast neural networks, for scientific reasons, but the problems are very similar to those of using gene expression to control a cell---you have a production system with no general variable binding mechanism, and you have a lot of rules that fire in asynchronous parallel by default, operating on fuzzy-valued "variables" that are essentially atomic propositions, not variables in the sense of logical variables.

    It's kind of eerie how similar gene expression looks to "programming" stuff I've seen before in other contexts that are only similar in a very abstract way.
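
    To make that concrete, here's a tiny, purely illustrative sketch (in Python) of the kind of system I mean---an asynchronous, stochastic production system with no variable binding, whose "variables" are fuzzy-valued atomic propositions standing in for concentrations of signalling molecules. The rules, names, and numbers are all invented; it's not a model of any real regulatory network.

        import random

        # Toy asynchronous, stochastic production system (illustration only).
        # State: fuzzy-valued atomic propositions (stand-ins for concentrations).
        # Rules: thresholded conjunctions that probabilistically boost consequents.
        state = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}

        rules = [
            ([("A", 0.5)], ["B"]),              # if A, then B
            ([("A", 0.5), ("B", 0.5)], ["C"]),  # if A and B, then C
            ([("C", 0.5)], ["D"]),              # if C, then D
        ]

        def step(state, rules, fire_prob=0.5, boost=0.25, decay=0.05):
            """One tick: every rule may fire, in no fixed order, with noise."""
            random.shuffle(rules)                           # asynchronous: no global ordering
            for preconds, consequents in rules:
                if all(state[p] > t for p, t in preconds):  # crude fuzzy conjunction
                    if random.random() < fire_prob:         # stochastic rule firing
                        for c in consequents:
                            state[c] = min(1.0, state[c] + boost)
            for p in state:                                 # products decay over time
                state[p] = max(0.0, state[p] - decay)
            state["A"] = 1.0                                # treat A as an external input held on

        for _ in range(200):
            step(state, rules)
        print({p: round(v, 2) for p, v in state.items()})

    Run it a few times: despite all the noise, the downstream propositions typically settle into being reliably "on," which is the kind of robustness the real system needs.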

  4. #34 (Paul W.)
    nburns:

    "Even the lowest level, the chemistry, is too complex to be predictable by equations."

    I'd say that this depends a lot on exactly what you're talking about, and what you mean by "predictable," and that it's mostly false.

    Lots of things are too complicated to be predictable (or understandable) in the general case, but are sufficiently understandable and predictable in specific cases that we can use them.

    Consider von Neumann machines and their potentially self-modifying code. It's too complicated to be predictable in the general case, but that doesn't mean there aren't plenty of use cases where humans can use von Neumann machines.

    Likewise for stochastic, fuzzy production systems. I've actually known people who did in fact program such systems, so you can't tell me it can't be done.

    "The only way to approach getting decent predictions is with supercomputer simulations. A programming language based on proteins would be basically unusable for anything recognizable as programming by humans."

    No, you're operating at the wrong level of abstraction. This is somewhat like saying you can't build computers out of analog devices because it would be unusable by humans. But that is exactly how we do build all of our computers---we use analog devices in constrained and stereotyped ways that allow us to regard them as digital for most purposes.

    The same thing applies to stochastic rule firing and fuzzy variable values implemented as the concentrations of signalling molecules. We don't program the general case. We program specific cases that we do understand.

    That is what geneticists do, too. They do not sit around and simulate the operation of the genome in molecular detail. They can't. And if they could, they wouldn't understand what they were looking at---it would all be a blooming and buzzing confusion, just as complicated as the real molecular thing.

    So what they do is identify genes that seem to operate as binary rules, and signalling molecules whose values seem to be effectively binary, and they identify the feedbacks that tell them whether they can count on that. And they identify other rules that seem to operate in a graded fashion, and whose output concentrations vary continuously and regulate the graded (or binary) operation of other rules.

    And they map out "genetic regulatory networks," which are exactly stochastic production systems.

    They can only do this because most of the time, nature programs gene expression in the same basic handful of ways that people I know program production systems. (Some of which I've done myself.)

    Nature has to do things that way, because the general case of stochastic production systems isn't tractable for evolution, either. Nature usually finds simple-ish hacks that ensure that the system is mostly comprehensible at a high level of abstraction.

    It has to do that. That is what life is---it is mainly self-regulating, self-preserving, and self-reproducing computation.

    The problem of growing, maintaining and reproducing an organism is a reactive computational one, and the solutions available to nature are the same ones available to us, because any computer has to be made out of some kind of matter that you have lying around.

    The problem of evolving such organisms over time is essentially a software engineering problem---how can you organize things such that there's a way of changing and replicating the crucial control information.

    Nature hit upon programming a mostly-simple production system. (With a lot of cruft, admittedly, but the basic operation is simple and programmatic at a low level, and stereotyped and tractable at a high level, and that is why it works.)
    Last edited by Paul W.; 6th December 2013 at 23:25. Reason: deleted last para, an edit-o

  5. #35 (Paul W.)
    nburns:

    "Each protein, defined by its amino acid sequence, would be kind of like a randomly-chosen Turing machine."

    I'm not sure what point you're making here. Proteins are not Turing machines in any interesting sense, so far as I know, if you're talking about a sequence of amino acids being like a sequence of marks on a tape. But if you're just saying that a given protein can interact in complicated ways with its "inputs," and is difficult to model in detail, I'd agree.

    But for most purposes, we don't need to model proteins in detail. For signalling proteins, we usually only care about a few things that affect gene expression---the availability and spacing and specificity of the binding sites on its surface that make it act as a logical conjunction, like A and B and D and Q.

    "There would be no way to determine the complete behavior of most of them."

    I'm not sure what you mean by "complete behavior," or why you care. We mostly don't care about the complete behavior, because it doesn't matter much. We care about signalling function, and it's usually pretty robust---either binary or graded, but with nothing depending much on fine gradations. The concentration of a typical signalling protein typically conveys one bit of information---yes, the conjunction is true---or just a few bits; the way the rules respond is fairly resilient in the face of noise.

    If those things were not true, more or less, most of the time, the system would not work. There's a lot of noise intrinsic in the system, and the system evolves rule sets that are fairly resistant to effects of noise. (Lots of binary rules, for example, and signalling molecules whose concentrations are generally either high or low.)

    "You would have to run them and see what they did under experimental conditions, and try to extrapolate and guess what they'd do in combination under real circumstances."

    Well, yes, in the general case, but not usually. The rapid progress we've made in mapping genetic regulatory networks is only possible because we don't usually encounter the general case. We find a lot of instances of stereotyped patterns, with binary or crudely graded gene expression implementing binary or fuzzy logic with only a few significant bits to the values.

    Further progress---lots and lots of it---would be possible if we mapped out 100 billion chemical shapes and noticed which promoting and inhibiting sites they bound to, and with what strength, and similarly categorized the signalling molecules in terms of which of those shapes occur on their surfaces.

    For the most part, we don't actually need to understand protein folding---we can synthesize proteins by brute force from generated sequences, and see how they actually fold. (There will be some errors, though, because some proteins can fold different ways, and there are subtle mechanisms to ensure that they usually fold the right way in the nucleoplasm or cytoplasm.)

    That would allow us to automatically construct a good first draft of the genetic regulatory mechanisms of the entire genome, and we wouldn't have to do it with supercomputer simulations---we could do it by brute force generation of actual protein sequences, seeing how they actually fold, and matching them to binding sites on arrays of gene chips. It'd be a big project, but feasible, and would tell us what we most need to know---which transcription products represent which propositions that the rules implemented by genes operate on.

    You seem to think that it's essential that we are able to completely specify the behavior of the system, and know that our specification is correct, for it to be a programmed computer system.

    I don't think so. I think it's more understandable than you make it out to be, at a higher level of abstraction than the one you're talking about, even if we never fully debug it because we miss some possible interactions, don't understand why certain apparent race conditions don't matter in practice, etc.
    Last edited by Paul W.; 7th December 2013 at 15:04. Reason: moved last para up 2 para

  6. #36 (nburns)
    I broke the problem into two parts:

    - the logic for controlling transcription and gene expression, which is encoded in DNA
    - the amino acid sequences that give proteins their shapes and functions, which is also encoded in DNA

    You seem to be focusing almost entirely on the first part, which I admitted is fairly computer-like and programmable. The second part is how DNA actually produces an organism, and that part is far, far more complex and harder to understand.

    Proteins can absolutely behave like Turing machines: The machinery of transcription, which looks very much like a Turing machine, is entirely orchestrated by proteins. But proteins also operate in ways that look nothing like Turing machines. Proteins have three-dimensional interactions with all of the chemistry that surrounds them, and they tend not to do things sequentially like (deterministic) Turing machines do. From the standpoint of computation theory, this may not make them fundamentally more powerful than Turing machines, but it makes them very complex in a quantitative sense.

    I saw a result on Wikipedia indicating that it's been proven that chemical reaction networks, which are obviously intended to model chemistry, quite quickly lead to uncomputable behavior.

    If you hold the view that every system is basically a Turing machine, then it's trivially true that any type of system is fundamentally equivalent to a computer program. But the kinds of programming that people actually do are far simpler than this. If programming languages were this complicated, programming would be almost impossible. That's my point.

  7. #37 (Paul W.)
    Quote Originally Posted by nburns View Post
    I broke the problem into two parts:

    - the logic for controlling transcription and gene expression, which is encoded in DNA
    - the amino acid sequences that give proteins their shapes and functions, which is also encoded in DNA

    You seem to be focusing almost entirely on the first part, which I admitted is fairly computer-like and programmable. The second part is how DNA actually produces an organism, and that part is far, far more complex and harder to understand.
    I think the first part is sufficient for my point---it's an asynchronous, stochastic rule computer. Subtleties of RNA transcription and protein folding are mostly irrelevant, because from the point of view of the rule computer, they either happen "under the hood" (in the case of signalling molecules) or implement I/O operations (in the case of structural proteins). The former is just how the computation is implemented, and the latter are an interface between the main computational system and the other mechanical stuff in the cell. Once you recognize that, you can recognize the computer, its implementation, and its outputs.

    (The other parts of the cell may have information processing going on, and even genetically programmed computation in the case of mitochondria, but that sort of thing is true in any complicated robot, factory, or computer---some of the peripherals have little computers in them, too.)

    All that really matters about most signalling proteins is the logical conjunction they implement by exposing areas with the right shapes to promote or inhibit the firing of the right genes.

    A coding gene sequence typically implements that by generating an RNA with a base sequence that in turn generates a protein with a corresponding sequence that happens to fold up a particular way... but all that's mainly an implementation detail about how the rule computer works. (Like pipelining, instruction scheduling, and branch prediction in a "von Neumann" computer, all made of a gazillion analog devices laid out in such a way as to function as digital devices.)

    The fact that the implementation is hairy and looks very different from the much simpler higher-level abstraction it implements isn't an argument against it being a computer---it's a strong argument for it being a computer. If you can find a conceptual lens like that to look at something complicated and functional, and elegantly explain why it actually works in terms of it being a computer, what you're looking at actually is a computer. That's all it means to be a computer.

    Genetics and Darwinian evolution work precisely because nature stumbled on a way to implement a fairly simple kind of programmable computer, and hacked up some complicated programs that work.

  8. #38 (Paul W.)
    nburns:

    Proteins can absolutely behave like Turing machines: The machinery of transcription, which looks very much like a Turing machine, is entirely orchestrated by proteins.
    There are two very different issues there.

    (1) Sure, the machinery of transcription looks sorta like a Turing machine (or is a very simple, boring kind of Turing machine) because it operates on a sequence---we can see the usual coding DNA as a tape, which encodes information to construct another tape (RNA), which (usually) encodes information to construct a protein sequence. But a machine that just transcribes a tape isn't a very interesting Turing machine.

    (Conditional transcription can make it somewhat more interesting---especially if there's any interestingly general pattern-matching going on---but I don't think that's what we're talking about or really turns out to be important for this discussion. We can view any interesting computation going on during RNA/protein transcription as the operation of a coprocessor. The main CPU core is the asynchronous rule system of basic gene expression, and the conditional transcription facility gives you conditional string-processing instructions, too---that's the sort of thing we might expect if we think it's an evolved computer.)

    But proteins also operate in ways that look nothing like Turing machines. Proteins have three-dimensional interactions with all of the chemistry that surrounds them, and they tend not to do things sequentially like (deterministic) Turing machines do. From the standpoint of computation theory, this may not make them fundamentally more powerful than Turing machines, but it makes them very complex in a quantitative sense.
    (2) It doesn't matter that they do nondeterministic, nonsequential stuff. That's to be expected in a stochastic computer that operates in parallel by default. IMO, it has nothing to do with the issue of whether it IS a computer, or whether it's specifically a rule-based programmable computer with string-processing extensions.

    That's just saying what KIND of computer it is---that it's not a deterministic, sequential, binary digital computer. But I said that up front, and I don't think it makes a bit of difference as to whether it is a computer, and a programmed computer.

    I'm not making a metaphor to the von Neumann machine on your desktop, or to naive people's idea of what makes a computer "a computer" or a program "a program", where the less it looks like FORTRAN or BASIC, the less it counts.

    I'm saying that there's nothing very special about von Neumann machines or mostly sequential programming in languages that are mostly sequential by default, in terms of what counts as a computer or a program---they're just special cases of programmed computing we find particularly handy right now.

    Consider this. Real computers are made out of stochastic devices---quantum gadgets---and we're getting close to being unable to hide that, as the devices get smaller.

    Imagine that in 30 years, we are programming computers stochastically because they don't behave deterministically anymore, and if you want programs to execute efficiently, you have to accept that in your programming model---you can't afford layers of redundancy that give you confidence that very nearly always, every instruction will give you exactly the right result.

    In that scenario, we might end up programming stuff that happens within a CPU the way people program fault-tolerant distributed systems and network protocols---always taking noise and the possibility of error and failure into account, but trying not to bog the whole thing down with double- and triple-checking everything all the time at several levels. So we end up coming up with noise-resilient, parallel-by-default, asynchronous, nondeterministic algorithms.

    Would you say that's not programming a computer anymore? I wouldn't. I'd say it's programming a computer that's a real bitch to program.

    I don't think it matters much that nature came up with and programmed that kind of computer before we had to---if we'd come up with it, it would be absolutely clearly computers and programming, and we shouldn't be so biased against naturally-occurring computation.

    I saw a result on wikipedia that indicates that it's proven that a chemical reaction network, which is obviously intended to model chemistry, leads quite quickly to uncomputable behavior
    I don't know what this means or why it's relevant. Stochastic or analog computers can get lost in undefined- and/or unanalyzable-behavior space if you're not careful, but that doesn't mean they're not computers.

    If you hold the view that every system is basically a Turing machine, then it's trivially true that any type of system is fundamentally equivalent to a computer program.
    I don't usually talk that way, and I think it's usually a red herring. If it's not doing conditional operations on "marks" on a sequential "tape", I don't usually see the point of talking about it as a Turing machine. (Although it can be useful to talk about what kind of Turing machine something is equivalent to.)

    But the kinds of programming that people actually do is far simpler than this. If programming languages were this complicated, programming would be almost impossible. That's my point.
    Sure, but my point is that nature doesn't program that way either, because it doesn't work. It programs genes as an asynchronous rule-based system in mostly the way a human would, and for the same fundamental reason a human would---because it's relatively simple and workable to do it that way. It

    (0) uses most rules and variables in a binary way, avoiding middle states
    (1) introduces data dependencies into rule preconditions to force strict sequencing and phased behavior, etc.,
    (2) constructs clock synchronization variables to allow things to avoid race conditions by checking synchronization and timing flags
    (3) constructs clock circuits and timing variables that allow more precise and flexible sequencing.
    (4) does some more analogish stuff but usually keeps it tractable with crude values and noise-resilient programming that generally avoids chaotic cases that matter (though it allows them for things that don't matter, like fingerprint patterns, and may sometimes exploit them for PRNG-like stuff), and
    (5) relies on the law of large numbers to get reliability or acceptable precision by using redundancy to minimize the effects of noise, at multiple scales. (E.g., many randomized rule firings can be used as an analog firing rate at a larger scale, and at a still larger scale, many cells doing the same thing can yield fairly precise values of morphogen concentrations.)
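
    That last point, (5), is easy to see numerically. A purely illustrative sketch, with arbitrary numbers and nothing biological about it: a single stochastic rule firing is one noisy bit, but the firing rate over many trials---or over many cells doing the same thing---becomes a fairly precise analog value.

        import random

        # Illustration of point (5): individual firings are noisy, but the
        # observed firing rate sharpens as the number of trials grows.
        def observed_rate(p, n_trials):
            """Fraction of n_trials independent stochastic firings that fired."""
            return sum(random.random() < p for _ in range(n_trials)) / n_trials

        p = 0.3  # "true" firing probability set by the rule's inputs (arbitrary)
        for n in (10, 1_000, 100_000):
            est = observed_rate(p, n)
            print(f"{n:>7} trials: observed rate {est:.3f}, error {abs(est - p):.3f}")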

    Hox genes do all these things, for example, and they're crucial for development and evolution.

    Maybe there's something much subtler and un-program-like going on in the genome somewhere, but as far as I know geneticists haven't found it yet, and don't expect to. Everything we do know looks very much like asynchronous parallel rule-based computing, more or less like humans would do it, faced with that kind of stochastic rule computer and a job to do.

  9. #39 (nburns)
    I don't think much of that is relevant to the original issue, which was your contention that the genome probably contains much less information than it appears to.

    Maybe there's something much subtler and un-program-like going on in the genome somewhere, but as far as I know geneticists haven't found it yet, and don't expect to. Everything we do know looks very much like asynchronous parallel rule-based computing, more or less like humans would do it, faced with that kind of stochastic rule computer and a job to do.
    What we know at this point is probably no more than 1% of all there is to know. We've probably found lots of systems that look like computers, because computers are something we can understand and can relate to. The stuff we can't understand is obviously going to be in the not-yet-understood part.

    By focusing on one particular function played by a gene/protein in a particular system, you're ignoring a lot of the information that would be in that gene. Just because the wild type of a gene performs function A in system X and otherwise doesn't interfere with anything else, that doesn't mean that the bare mechanics of function A represent the only important information in that gene. If you tried to replace that gene with something else that you predict to be equivalent, you would probably quickly find that a lot of parts of the gene that you assumed to be deadweight or completely arbitrary actually have important consequences. The consequences of changing those things could crop up anywhere. Unlike a computer program, a bug in a gene could affect anything, not just the system and task it was meant to act on.

    Like I said already, the bottom line issue is that it's still very, very premature to try to calculate the amount of important information in genes and DNA. The problem is that we don't know what we don't know. And absence of evidence is not evidence of absence.

    The positions we're arguing over come down to speculation piled on top of speculation. At root, you're arguing that we already know about enough to make a well-founded guess as to the information content of the genome, and you're predicting that the number would come out low, much lower than previously expected. At root, I'm arguing that we most likely don't know nearly enough to make a meaningful guess about the information content of the genome, and, furthermore, I'm challenging the assumptions that led to your low prediction, mostly because I favor a more conservative style of accounting. These are our positions, and I don't think there's much more that can usefully be said about it.
    Last edited by nburns; 7th December 2013 at 22:27.

  10. #40 (Paul W.)
    I don't think much of that is relevant to the original issue, which was your contention that the genome probably contains much less information than it appears to.
    I think the discussion of what genes basically do (or don't do) is entirely relevant---vastly more relevant than counts of base pairs.

    Once you realize that genes function as rules that have discrete propositional preconditions on the left hand side and consequents that are conjunctions of discrete propositions on the right, as the current understanding of most gene function says, the big questions about significant information content are about

    (1) how many promoter and repressor sites there are in the control region of the average gene,
    (2) how many propositions are conjoined in the average consequent, and
    (3) the size of the vocabulary of propositions---how many shapes there are that match any promoters and repressors and could thus regulate gene function.

    The number of possible binding site shapes is large but not infinite, because the size of docking sites is not generally very big, and the matches don't have to be exact. (Actually you can get graded degrees of probability of docking, but it appears that only a few bits of information generally matter. The system is just too noisy to depend much on fine details of docking probabilities.)

    So suppose we have a genetic rule of the form

    if P1 and P2 and P3 ... and PN     // the function of the control region of the gene
    then C1 and C2 and C3 ... and CM   // the function of the coding region of the gene

    Its functional information content as a rule is mainly in
    (1) the number of preconditions, multiplied by the number of bits necessary to specify any proposition from the set of propositions plus a very few bits to specify whether it's a promoter or repressor and of what binding strength, plus
    (2) the number of consequent propositions encoded by the transcription product, likewise multiplied by the number of bits necessary to specify a proposition out of the number of propositions in use, plus
    (3) a much smaller number of bits for various overheads

    We don't have a good handle on this, but I think at this point it's pretty clear that genes generally have tens or maybe hundreds of preconditions involving any of thousands or tens of thousands of propositions, and gene products encode from a few to maybe hundreds of consequent propositions, but probably not thousands or millions.

    Suppose a big multifunctional gene has several hundred propositions on its left hand side and several hundred on its right hand side---which would be a lot---and there are a few tens of thousands of propositions in use.

    In such a representation, each occurrence of a proposition could be encoded in about 20 bits---say 16 bits to specify a proposition plus 4 bits for significant variations in specificity (docking probability). Multiply that by a few hundred LHS preconditions plus another few hundred RHS conjuncts, and even that big gene comes to only a couple of kilobytes; a typical gene, with tens of preconditions and a handful to tens of consequents, comes to a few hundred bytes. Multiply that by 20,000 functional genes and you get a few megabytes---not tens of megabytes, and certainly not the hundreds of megabytes you'd get by counting base pairs and multiplying by two bits.
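
    To make the arithmetic concrete, here's the same back-of-envelope estimate as a few lines of Python. Every count in it is an illustrative assumption drawn from the ranges above, not a measured value:

        # Back-of-envelope estimate only; all counts are illustrative assumptions.
        bits_per_occurrence = 16 + 4        # ~16 bits to name a proposition + ~4 for specificity

        typical_lhs, typical_rhs = 50, 20   # tens of preconditions, a few to tens of consequents
        big_lhs, big_rhs = 300, 300         # a big multifunctional gene: several hundred each
        n_genes = 20_000                    # functional genes

        typical_gene_bytes = (typical_lhs + typical_rhs) * bits_per_occurrence / 8
        big_gene_bytes = (big_lhs + big_rhs) * bits_per_occurrence / 8
        total_megabytes = typical_gene_bytes * n_genes / 1e6

        print(f"typical gene: ~{typical_gene_bytes:.0f} bytes")   # a few hundred bytes
        print(f"big gene:     ~{big_gene_bytes:.0f} bytes")       # a couple of kilobytes
        print(f"whole genome: ~{total_megabytes:.1f} MB")         # a few megabytes

    Even doubling every one of those counts only pushes the total into the tens of megabytes, still well below what you get from counting base pairs at two bits each.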

    And that's in a very horizontal representation, which could presumably be compressed significantly, so the actual information content is probably a few megabytes at most. It's 20,000 "lines of code," and the lines of code are not enormously complex.

    There are very good basic physical reasons to think that this is a much better estimate than counting base pairs and multiplying by two, or running that representation through PAQ as a big string.

    One is that we are pretty sure at this point that most short sequences of base pairs do not matter much---they don't directly affect the shape of the docking sites on the surface of the folded signalling molecule, because they end up inside the folded molecule or on the surface in a place that doesn't stick out such that it's available for docking. They become part of the hidden, nonsignalling structure of the signalling molecule, and could be replaced by any number of functionally equivalent subsequences with equivalent effect---so long as the right-shaped parts stick out someplace, we're good. For any given big signalling molecule, there are a zillion equivalent signalling molecules you could use instead, with equivalent effects modulo the kind of minor variation that's masked by the intrinsic noise in the system.

    It's clearly a very noisy representation, where most stuff is just random filler.

    Given that, and the fact that much of the genome is pretty clearly nonfunctional parasitic retrotransposon crap in the first place, it would be very surprising if the actual information content of the genome weren't at least an order of magnitude less than the number of base pairs times two, and more likely two orders of magnitude, or even more. Nobody thinks it's an efficient representation in terms of how many base pairs are used to implement a functional difference in signalling; we're pretty sure it's far from that.

    It's possible that there are interesting things going on in terms of the instruction set architecture that make the language more interesting, e.g., if conditional transcription has cool pattern matching and acts as a fairly powerful macro preprocessor, you could in principle have significant information we don't recognize encoded in a funny way in DNA/RNA sequences.

    That appears to either not be the case, or not get used much, such that the program isn't actually that much more complicated. Some proteins do have several conditionally transcribed variants, but so far as we can tell, not usually a large number. Apparently an alternative language for specifying how those things are conditionally transcribed would not add a lot to the above estimate of information content---it would still be dominated by the number of preconditions and consequent conjuncts multiplied by the number of propositions in use.

  11. #41 (Paul W.)
    nburns:

    What we know at this point is probably no more than 1% of all there is to know.
    At this point we actually have a decent handle on a lot of important stuff, from the very small scale of the instruction set architecture---the stochastic rule-firing system implemented as gene expression---to crucial high-level software, like Hox genes controlling the basic growth and development of the organism, and various other genes regulating metabolism and so on.

    True, we only have a decent handle on about 1 percent of the code, but we have a decent general grasp of how the most important 1 percent basically works, and that's actually a whole lot.

    Suppose that you were trying to understand the Linux kernel and only understood 1 percent of the code. Would that mean you just don't understand Linux? It depends very much on which code you do understand.

    Suppose you have a general grasp of the fact that it runs on a von Neumann machine, more or less, and you have a general understanding of how process switching and virtual memory are implemented in terms of von Neumann instructions.

    I'd say you understand the most crucial mechanisms of Linux, even if you only have a somewhat fuzzy understanding of less than a tenth of one percent of the 17 million lines of Linux kernel code. You get basically what an operating system kernel is, and you get how it basically works in terms of the underlying instruction set architecture. You're doing pretty good, like a good student in my intro to OS class.

    Even if you don't understand the hardware in detail---you don't understand instruction scheduling or register renaming or branch prediction, and you don't even realize there's a graphics coprocessor---you still basically get how Linux works. And even if you don't understand the network protocol stack, but have a general sense that it exists and what it does, and how such a thing might plausibly be implemented on a von Neumann machine, you're really doing quite well in terms of a basic scientific understanding of what Linux is and how it works, even if the protocol stack has twice as many levels as you think, and you don't understand any level correctly enough to program to it yet.

    I think we're in that kind of situation with respect to genetics. We've already answered many of the most important questions we've always had about how this stuff works, and the answers don't suggest that there are equally big things we haven't discovered yet---we have a workable first draft of the organization and functioning of the system from very near the bottom (basic gene expression and genetic regulatory networks) to very near the top (e.g., Hox and evolutionary developmental biology).

    Sure, we have a lot of fairly big questions left and may stumble into fairly big interesting new stuff, but I think we pretty well get the basics of the architecture and the basics of how very important high-level stuff maps onto very low-level stuff.

    We've probably found lots of systems that look like computers, because computers are something we can understand and can relate to.
    No, I don't think so. I think we're discovering systems that not only look like computers but literally are computers, because the problems of making living things grow and maintain themselves and adapt and reproduce are mostly computational problems. They are complicated problems of internal self-regulation, adaptation to external conditions, and coordinating a bunch of complicated processes. It's like a robot control system and an industrial process control system for a factory at the same time, because a cell is a little robotic factory that can duplicate itself.

    Unless you believe it's literally magic, there's a whole lot of information about complicated things to be processed in complicated ways, to monitor internal and external events and decide what to do about them, and the only way to do that is with some kind of computer, whether it's digital, analog, or hybrid, deterministic or stochastic, serial or parallel. Any mechanism that can process information with that kind of complexity and reliability is some kind of computer, doing fancy online reactive computation of some kind.

    And for it to evolve in a Darwinian fashion, it's really handy to have it be a programmed and reprogrammable computer, with discrete instructions that can be recombined. Nothing else can do the job, if you think about it. You have to have a control program that can be transmitted from one generation to the next, with parts that can be changed and mixed-and-matched.

    The simplest way to do that is with discrete instructions or subroutines at some granularity (genes) that come in tuples with different versions (alleles), which you mix and match from multiple parents.

    The simplest version of that simple way is to have exactly two parents, and exactly two alleles per gene, with each allele being one line of code in the ISA. So it shouldn't be very surprising if we do have that very simple setup---nature discovered something very simple and workable, and stuck with it.

    (It's not the only way to do evolution, or even specifically Darwinian evolution, but it's about the simplest way to do more or less Darwinian evolution, which is what nature seems to have discovered and kept doing with fair success.)

    The stuff we can't understand is obviously going to be in the not-yet-understood part.
    Well, sure, but in any science, when you get a basic handle on things there are good reasons to expect really big surprises, and bad reasons to expect them.

    You should expect big, revolutionary surprises when there are really mysterious things you can't even approximately explain in the current paradigm. You should be skeptical of the likelihood of such shocking revelations when you can already answer most of the big questions within the current paradigm, and there's no indication that the others aren't similarly tractable.

    What do you think are the big mysteries that the current understanding of gene expression leaves unaddressed?

    In the absence of anything that genes are able to accomplish that can't be explained in terms of the current basic understanding of gene expression, Occam's razor suggests that nature has used the same basic ISA and same basic software tricks for a lot of the stuff we don't understand yet as it does for a lot of the important stuff we do understand.

    So far as I know, geneticists are pretty confident that they're on the most credible track, if not the right one, in thinking that they'll continue to be able to map out more and more genetic regulatory networks in more or less the same way they've had so much success in doing already. There are plenty of things we don't have well-mapped-out yet, like how the brain gets wired up such that it does all the amazing things it does, but there's no reason to think that requires a different low-level architecture for gene expression.

    By focusing on one particular function played by a gene/protein in a particular system, you're ignoring a lot of the information that would be in that gene. Just because the wild type of a gene performs function A in system X and otherwise doesn't interfere with anything else, that doesn't mean that the bare mechanics of function A represent the only important information in that gene.
    Of course not. I never said that it did, and I don't think it does, in the general case. I do think we have good reasons to think that such things don't make the system overwhelmingly more complicated or impossible to analyze in the same general terms, at least in principle.

    We have had enough success in mapping out crucial genetic regulatory networks that it's clear there's a useful degree of modularity in the genome---we can identify and study and mess with more or less modular sub-networks, like the Hox complexes and their several sub-complexes, similarly more-or-less hierarchical stuff regulating metabolism, etc. We're always overlooking some interactions that we missed, but we would not have gotten as far as we've gotten if we were mostly missing the forest for the trees.

    Some of this stuff we may never figure out in practice, or only when we have vastly better technology, but I think that's more plausibly because there's mind-bendingly awful crufty spaghetti code in there (that works and has been conserved) than because the basic instruction set architecture needs drastic revision.

    (Which is not to say we won't discover interesting and important coprocessors that extend the architecture in very useful ways, and that will be fascinating, but I think it's pretty clear already that for the most part, most of the genes function mainly as stochastically firing rules in a production system.)

    If you tried to replace that gene with something else that you predict to be equivalent, you would probably quickly find that a lot of parts of the gene that you assumed to be deadweight or completely arbitrary actually have important consequences.
    People have actually done some experiments sorta like this, at least with respect to parasitic retrotransposons. I forget what little organism it was, but they actually stripped out most of the LINEs and SINEs---a big fraction of the organism's total DNA---and the organism still lived and reproduced, albeit not as well. So apparently none of that "junk" DNA had any obviously necessary function. It only messed things up a little bit, which was a big surprise to people who thought a lot of that junk had to be in there for some very good reasons.

    Certainly, taking out junk or random filler and replacing it with other junk or random filler is tricky, if the organism has evolved to adapt to its presence.

    For example, LINEs and SINEs affect the spacing of genes along a chromosome, and that can subtly affect rates of gene transcription. If there are enough of those changes, you may tweak race conditions that are implicitly avoided by slowing down the rate of gene transcription with junk DNA spacers, which act like big NOPs. (So it's not surprising if stripping out all the LINE and SINE junk DNA can have some deleterious effect. It wouldn't even be surprising if it was fatal, if there were even a few bits of crucial information in all that junk.)

    The fact that such junk and filler has some such effects does not mean that it contains much useful information, and it's pretty clear that it doesn't. A LINE or SINE that "functions" in that minor way is like a big series of consecutive NOPs, which only contains a very few bits of information, but represented as a big series of useless codons containing a lot of garbage instead of obviously compressible repetition of NOPs.
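
    The compression analogy is easy to demonstrate, by the way. A purely illustrative sketch (bytes instead of codons, arbitrary sizes): a long run of identical NOPs is obviously compressible, while the same amount of random filler is not---even though, functionally, both are just inert spacer carrying a handful of bits.

        import random
        import zlib

        # Illustration only: a run of identical "NOPs" compresses to almost
        # nothing, while random filler of the same length does not---even if,
        # functionally, both are just spacer.
        n = 100_000
        nops = b"\x90" * n                                       # repeated NOP byte
        junk = bytes(random.randrange(256) for _ in range(n))    # incompressible filler

        print(len(zlib.compress(nops)), "compressed bytes for", n, "repeated NOPs")
        print(len(zlib.compress(junk)), "compressed bytes for", n, "random filler bytes")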

    The consequences of changing those things could crop up anywhere.
    Sure. That doesn't mean that the information content of such things is high, and it's pretty clear it's generally not.

    What it means is that it's hard to analyze and debug, like any other complicated program where you don't understand all of the interactions.

    Unlike a computer program, a bug in a gene could affect anything, not just the system and task it was meant to act on.


    I'd say that's just like a computer program.

    Suppose I mutate a RETurn instruction into a NOP, or a NOP into a RETurn, or an indirect jump into a different indirect jump. Nobody knows what will happen.

    If I can mutate a jump instruction to jump anywhere at random in memory, even into data bits that weren't meant to be executed as code, pretty much anything could happen.

    None of that means that the information content of those instructions is particularly high, or that von Neumann computation is "incomputable." It's just the usual hairiness that we expect when we mess around stupidly with programs.

    Genetics is hairy just like computation, because most gene function is just computation.

    But back to the issue of replacing filler with different filler. Surely, we can't just modify a piece of filler in the coding sequence of a gene without unpredictable things happening to the gene product shape, if we do it stupidly.

    Suppose for example, the subsequence that we want to replace makes a bump under the surface of the folded molecule, which in turn makes some important shape on the surface stick out where it can dock to a repressor or promoter region of a gene. Whatever we replace it with has to have a comparable spatial effect, as a support for that protruding bit, but lots of sequences will do---there's some information there that matters, but not much.

    There's even less information in the coding region of the gene than it might seem from that, because what matters is not the equivalence of short subsequences, but the equivalence of entire coding regions. We could not only replace the subsequence with a different subsequence with a similar protrusion-causing effect, we could often delete it entirely if we made a small compensating change somewhere else in the coding sequence---e.g., making the part that actually matters stick out a bit more of its own accord, once it's all folded, so that it doesn't need that filler under it to make it stick out. Or we could replace the entire coding sequence with a completely different coding sequence that generates a radically different overall molecular shape, with equivalent regions that do matter sticking out somewhere on the resulting surface.

    In general, there's a vast number of equivalent molecules with entirely different amino acid sequences that generate the same protruding shapes that matter, but in different ways and in different places. It's a very inefficient way to represent a conjunction of propositions.

    The actual functional information in the coding sequence of a gene is thus much, much lower than it appears. You can use thousands and thousands of amino acids in a particular sequence, just to generate any lumpy ball that happens to have about the right shapes at the tips of the dozen or so lumps that matter. And there may be a whole bunch of other lumps on the surface that don't matter because their shapes don't dock to any repressors or promoters with any significant probability. They're just the usual noise that doesn't matter much.

    The evolved shapes of signalling molecules are determined by random mutations of sequences with unpredictable effects on three-dimensional shapes, so it shouldn't be surprising if even the subsequences that do create lumps on the surface that could dock to something don't actually do anything. Even most mutations that affect the shapes of the tips of protrusions on the surface of the molecule are probably mostly neutral, because most of the lumps are just lumps. These lumpy balls are not intelligently designed.

  12. #42 (nburns)
    Ok, here's the issue in simple compression terms:

    There's lossless and there's lossy compression. What you're talking about is lossy compression. When you're trying to estimate information content, it's based on lossless compression, because otherwise, you can't say precisely what you're measuring: lossy is an approximation, and the approximation you find acceptable is subjective, so lossy sizes are subjective.

  13. #43 (nburns)
    Quote Originally Posted by Paul W. View Post
    Certainly, taking out junk or random filler and replacing it with other junk or random filler is tricky, if the organism has evolved to adapt to its presence.
    I think that's pretty much my whole point right there.

  14. #44 (Paul W.)
    Quote Originally Posted by nburns View Post
    Ok, here's the issue in simple compression terms:

    There's lossless and there's lossy compression. What you're talking about is lossy compression. When you're trying to estimate information content, it's based on lossless compression, because otherwise, you can't say precisely what you're measuring: lossy is an approximation, and the approximation you find acceptable is subjective, so lossy sizes are subjective.
    Ah, okay, I think I see why we've been kinda talking past each other. We may actually entirely agree.

    I was addressing the question of what the actual information content of the genome really is, which I took to be what Matt's OP was really about, e.g., in the bit about estimating what it would take to create something similar in complexity to the human brain at birth.

    (That's a subject of great interest to me as well. I think the apparently surprisingly low information content of the genome has a lot of fascinating and important scientific implications, especially for neuroscience. It puts a bound on how complicated the genetic program for constructing brains can be---the design of the brain is apparently not nearly as complicated as I always thought before learning about molecular genetics. Our main problem in figuring it out is not likely to be that the program is just too big.)

    For that kind of exploration, the ability to compress a representation of the genome is mainly useful for putting an upper bound on the minimal size of a program that could build a computer of comparable complexity to the brain. If we know that an estimate from applying PAQ to a bit-pair string representing a genome sequence is an order of magnitude or more too high, that's very relevant.
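
    (For anyone who wants to see what "a bit-pair string representing a genome sequence" means concretely, here's a minimal sketch of the naive two-bits-per-base packing such byte counts start from, before a compressor like PAQ ever sees the data. Illustration only; it ignores real-world details like N bases and masking.)

        # Naive 2-bits-per-base packing (illustration only; ignores N bases, masking, etc.).
        CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

        def pack(seq: str) -> bytes:
            """Pack an ACGT string into 2 bits per base; the final byte is zero-padded."""
            out, acc, nbits = bytearray(), 0, 0
            for base in seq:
                acc = (acc << 2) | CODE[base]
                nbits += 2
                if nbits == 8:
                    out.append(acc)
                    acc, nbits = 0, 0
            if nbits:
                out.append(acc << (8 - nbits))
            return bytes(out)

        print(pack("ACGTACGTAC").hex())   # 10 bases -> 3 bytes: '1b1b10'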

    For the more practical question of how to store gene sequences compactly in practice, though, it's pretty much irrelevant. We can't get anywhere near that kind of compression performance without modeling all kinds of crazy stuff like protein folding and even hairier stuff after that, to find "equivalent" representations that are different 3D shapes. Not worth it.

    (Some of my non-compression software is used in genome databases at the Whitehead Institute, so that is of practical interest to me; I just didn't think that was what was being discussed.)

  15. #45 (nburns)
    Quote Originally Posted by Paul W. View Post
    Ah, okay, I think I see why we've been kinda talking past each other. We may actually entirely agree.

    I was addressing the question of what the actual information content of the genome really is, which I took to be what Matt's OP was really about, e.g., in the bit about estimating what it would take to create something similar in complexity to the human brain at birth.
    The paper linked to talks about Kolmogorov complexity, so I think that's the standard Matt uses.

    (That's a subject of great interest to me as well. I think the apparently surprisingly low information content of the genome has a lot of fascinating and important scientific implications, especially for neuroscience. It puts a bound on how complicated the genetic program for constructing brains can be---the design of the brain are apparently not nearly as complicated as I always thought before learning about molecular genetics. Our main problem in figuring it out is not likely to be that the program is just too big.)
    I know that kind of enthusiasm -- it's what you feel before you dig in to a project and learn how miserably complicated and full of details it is.

    For that kind of exploration, the ability to compress a representation of the genome is mainly to putting an upper bound on the minimal size of a program that could build a computer of comparable complexity to the brain. If we know that an estimate from applying PAQ to a bit-pair string representing a genome sequence is an order of magnitude or more high, that's very relevant.
    Nitpick: it can't just be as complex as the brain. It also has to function like the brain.

    For the more practical question of how to store gene sequences compactly in practice, though, it's pretty much irrelevant. We can't get anywhere near that kind of compression performance without modeling all kinds of crazy stuff like protein folding and even hairier stuff after that, to find "equivalent" representations that are different 3D shapes. Not worth it.

    (Some of my non-compression software is used in genome databases at the Whitehead Institute, so that is of practical interest to me; I just didn't think that was what was being discussed.)

  16. #46 (Paul W.)
    Quote Originally Posted by nburns View Post
    The paper linked to talks about Kolmogorov complexity, so I think that's the standard Matt uses.
    Straightforward Kolmogorov complexity is the wrong measure for an inefficiently and noisily represented program in a stochastic computer.

    I know that kind of enthusiasm -- it's what you feel before you dig in to a project and learn how miserably complicated and full of details it is.
    I know that kind of condescension, and you might want to rein it in until you demonstrate that you actually get what I'm saying.

    Nitpick: it can't just be as complex as the brain. It also has to function like the brain.
    Of course. I thought it was obvious that I was talking about the information content of the human genome, including the parts that direct the wiring up of the human brain.

    I'm talking about how complicated the genetic program that generates the brain, etc. is or isn't.

  17. #47
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Paul W. View Post
    Straightforward Kolmogorov complexity is the wrong measure for an inefficiently and noisily represented program in a stochastic computer.



    I know that kind of condescension, and you might want to rein it in until you demonstrate that you actually get what I'm saying.

    I didn't mean for you to take that personally. I think an artificial human is likely to be impossible to engineer, and it's doubtful that any person will ever do it. I was joking by making it sound like some mundane project that's frustrating in a routine, ordinary way. It was meant as absurd understatement.


    Of course. I thought it was obvious that I was talking about the information content of the human genome, including the parts that direct the wiring up of the human brain.

    I'm talking about how complicated the genetic program that generates the brain, etc. is or isn't.
    I'm not sure what you're saying. The challenge isn't to create something that's extremely complicated. I don't see any roadblocks to making something as complicated as the brain, however you would measure it. I don't expect to ever see a working artificial brain, however.
    Last edited by nburns; 12th December 2013 at 11:26.

  18. #48
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    I admit that sometimes I don't really know what you're driving at:

    What do you think are the big mysteries that the current understanding of gene expression leaves unaddressed?
    Are you talking about solving the mystery of gene expression, or solving the mystery of how DNA leads to a fully-functioning human? I know that there are processes related to DNA that we seem to have gotten a pretty good grip on. But that doesn't mean that we are anywhere near solving the problem of life itself. Has anyone ever designed a useful protein from scratch?

    Unless you believe it's literally magic, there's a whole lot of information about complicated things to be processed in complicated ways, to monitor internal and external events and decide what to do about them, and the only way to do that is with some kind of computer, whether it's digital, analog, or hybrid, deterministic or stochastic, serial or parallel. Any mechanism that can process information with that kind of complexity and reliability is some kind of computer, doing fancy online reactive computation of some kind.
    If you adopt a broad enough definition of airplane, you could say that birds are airplanes. So what?

    You're asserting that a lot of things are easy, but until you've actually done those things, you can't know that.
    Last edited by nburns; 12th December 2013 at 21:38.

  19. #49
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Paul W. View Post
    I've always thought that advanced intelligences are likely to be artificial and digital, and propagate themselves by transmitting their plans at the speed of light to anybody willing to reconstruct them---just transmit a blueprint and memory dump and say, "build this and load this data into it".

    It's by far the most efficient way to travel, and any civilization willing to propagate itself that way would have a gigantic advantage over any civilization that wasn't.
    That's very apropos to a story that came up on Hacker News, to the effect that most jobs are likely to eventually be performed by computers. You could interpret this as software-based organisms becoming the dominant life form on earth. They seem to have better evolutionary and survival characteristics than we do, and they propagate themselves just as you describe. Maybe it's the destiny of all advanced civilizations to eventually phase out the flesh-bound beings that built them and hand over control to more-efficient digital successors.

  20. #50
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Paul W. View Post
    It seems to me that low-level compression-type estimates of genomic information content are probably way too high.

    You have to realize that you're looking at inefficiently coded instructions in a programming language (and realize what kind of programming language it is) to see which information is interesting and which is mostly arbitrary stuff that could be replaced with other arbitrary stuff with no loss of function.

    As I understand it, the vast majority of functional genes seem to be mostly productions in a fuzzy propositional production (rule) system.

    A typical gene is just a rule with a boolean-like conditional and a consequent that may encode several propositions, like

    If A and B and not C then E and F and G

    The (fuzzy) values of propositions A, B, and C are implemented as concentrations of molecules with particular-shaped binding sites on them, which can bind to the inhibitory or promoting regions of the gene (the Left Hand Side of a rule) to encourage or discourage it from "firing" (in rule-based system parlance), i.e., being "transcribed" (in molecular genetics parlance).

    When a gene is transcribed, the base sequence in the coding region directs the construction of a corresponding RNA molecule, which may itself be a signaling molecule, or may be translated to create a corresponding protein, which is the signaling molecule.

    Whichever happens, the usual thing is that all that matters about that transcribed sequence of bases or amino acids is the SHAPE it naturally folds up into (given molecular kinetics in cytoplasm/nucleoplasm), and all that matters about that shape is certain importantly active REGIONS of that shape---areas on the surface of the molecule that geometrically fit (more or less) into the promoting or inhibiting binding sites in the control region of other genes (and/or the same gene).

    At a short timescale, what you have is basically a discrete but stochastic rule-based system, where the probabilities of the firing of different rules depend on the concentrations of molecules with suitably-shaped regions exposed. The more promoting molecules you have bouncing around in the plasm, the more often one will dock for a while to the promoter region of a gene, and the more inhibiting molecules you have, the more often one of them will dock to the inhibiting region of a gene.

    Whether a "rule" actually fires at a given moment depend on those concentrations, chance bouncing around of signalling molecules, and whether the gene transcription machinery is around there at the time and ready to grab it and go.

    The upshot of this is that the information content of genes must be a whole lot lower than it looks---astonishingly low.

    Most sequences of coded RNA or protein are just structural stuff to affect how the signalling molecule folds, and any number of sequences would do the same job---a small minority of all possible sequences, but still quite a large number in absolute terms. Those sequences are just there to ensure that the actually important regions of the molecule---the ones whose shape determines its activity in promoting or inhibiting gene firing---end up exposed and not interfering with each other.

    Changes to the RNA or protein sequence that don't affect the shapes of promoting or inhibiting regions, or affect whether they're exposed appropriately, do not matter for normal gene function.

    You see this in patterns of variation of highly conserved genes---genes that have been around for a zillion years, because they're important. You get a lot of random mutations in some that don't matter much, because they don't affect the function of the gene. You also get some regions that mutate interestingly, where most mutations are bad and go away, but others hang around because they're equivalent, and a gene may mutate back and forth between various equivalent forms over time.

    As I understand it, the actual information content of the genome must therefore be shockingly low. Most parts of most coding sequences are just there for spacing, not actually encoding interesting information.

    You could probably tighten a guessed-at upper bound on the information content of the gene by taking that into account, and noticing which regions of genes seem to vary phylogenetically and within species, without much effect. (I.e., noticing stretches that have a lot of randomish variants.)

    Such an estimate would still be way too high, though, because it wouldn't take into account the fact that of all the equivalent sequences of some uninteresting section of coding DNA, evolution's only going to find a few of them, because its search is very greedy, and favors isolated, locally harmless changes. Anything that requires a combination of two compensating mutations to generate a functionally equivalent molecule is much less likely to be found than a single harmless mutation, and sets of three or four are very unlikely to be found.

    For example, if one mutation shortens the folded molecule in between two crucial active sites, such that they interfere, and another lengthens it such that they don't, you might only get there by finding the lengthening mutation first---assuming lengthening doesn't hurt much---and then compacting it again with the shortening one. RNA- and protein-folding effects tend to be bizarrely nonlinear, so harmless combinations of two or three mutations are at a huge disadvantage to singleton harmless mutations.

    That means that if evolution preserves sequence information, that may only be because it doesn't discover the vast majority of (more or less) equivalent sequences. Most of the information in the genome at the codon level isn't there because it "really matters" to what the gene is actually for, but because evolution doesn't know any better in the short run---it could be stripped out and replaced by something much simpler, but evolution doesn't know how.

    To understand the actual information content of the genome, we need to get a handle on a couple of basic things:

    1. What do the genes look like when viewed as productions in a fuzzy propositional production system---how complicated are the boolean expressions on the left hand side (regulatory region) and the right hand side (coding region)? To figure that out, we need to know what binds with what.

    2. What else is going on, like conditional transcription, and with what effects. Genes often don't code for a single molecular product, but for a family of products, with odd editing things going on before the final molecule is produced. (AIUI, a gene that at first glance seems to produce a protein with one sequence may under different conditions produce 20 variant versions.)

    Depending on how that conditional transcription works and how it's effectively used, that could change things a lot. It may essentially act as a macro preprocessor that allows genes to encode significantly more information than they might naively seem to.
    So, I think what you're saying is this:

    1. Most genes are only involved in controlling transcription, and transcription is relatively simple
    2. The active sites of proteins and RNA are relatively small portions of the molecules, and the rest is essentially inactive filler
    3. Many nucleotide sequences end up coding for the same, or functionally-equivalent proteins
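    If I've got the production-system picture right, a toy version of a single "rule" might look like this (the gene names, concentrations, and rate constants are all invented for illustration, not taken from any real system):

    Code:
    # Toy sketch of one gene as a stochastic production rule:
    #   if A and B and not C then produce E
    # Fuzzy "truth values" are concentrations in [0, 1]; whether the rule fires
    # on a given tick is probabilistic, driven by those concentrations.
    import random

    concentration = {"A": 0.8, "B": 0.6, "C": 0.1, "E": 0.0}   # invented starting values

    def firing_probability(conc):
        # Promoters A and B raise the odds of transcription; repressor C lowers them.
        return max(0.0, conc["A"] * conc["B"] * (1.0 - conc["C"]))

    def tick(conc, boost=0.1, decay=0.05):
        if random.random() < firing_probability(conc):   # the "rule fires": gene transcribed
            conc["E"] = min(1.0, conc["E"] + boost)       # product E accumulates...
        conc["E"] -= decay * conc["E"]                    # ...and is always slowly degraded

    for _ in range(1000):
        tick(concentration)
    print("E settles around:", round(concentration["E"], 2))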

    Whichever happens, the usual thing is that all that matters about that transcribed sequence of bases or amino acids is the SHAPE it naturally folds up into (given molecular kinetics in cytoplasm/nucleoplasm), and all that matters about that shape is certain importantly active REGIONS of that shape---areas on the surface of the molecule that geometrically fit (more or less) into the promoting or inhibiting binding sites in the control region of other genes (and/or the same gene).
    Proteins don't just have static shapes. When they bind to a ligand, the electrostatic interactions cause the protein to change conformation. These conformational changes can expose other, transient properties of the molecule, such as new active sites, and the energy and kinetics of the transitions can, for example, affect enzyme activity. The changes set off other changes and can trigger intricate cascades. This creates new dimensions of complexity.

    For example, if one mutation shortens the folded molecule in between two crucial active sites, such that they interfere, and another lengthens it such that they don't, you might only get there by finding the lengthening mutation first---assuming lengthening doesn't hurt much---and then compacting it again with the shortening one. RNA- and protein-folding effects tend to be bizarrely nonlinear, so harmless combinations of two or three mutations are at a huge disadvantage to singleton harmless mutations.
    So you're viewing evolution as a mathematical optimization problem, and you're proposing that it tends to get stuck in local optimum states -- like simulated annealing. The nonlinearity you mention could make it impossible to model simply. For instance, there might be many paths between states, ensuring that there's almost always a way to transition between two states without passing through a lethal state. That could help explain the mysterious way that evolution seems to find solutions that appear to have no plausible evolutionary pathway connecting them.
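    To make the local-optimum point concrete, here's a toy two-locus landscape (the fitness numbers are invented): two equally good forms exist, but each single mutation between them is harmful, so a greedy one-mutation-at-a-time search never crosses over.

    Code:
    # Toy two-locus fitness landscape: '00' and '11' are equally fit, but both
    # single-step intermediates are worse, so a greedy search that only accepts
    # non-harmful single mutations never gets from '00' to the equivalent '11'.
    # Fitness values are invented for illustration.
    import random

    fitness = {"00": 1.0, "01": 0.2, "10": 0.2, "11": 1.0}

    def neighbors(g):
        return [g[:i] + ("1" if g[i] == "0" else "0") + g[i+1:] for i in range(len(g))]

    def greedy_walk(start, steps=10_000):
        g = start
        for _ in range(steps):
            candidate = random.choice(neighbors(g))
            if fitness[candidate] >= fitness[g]:   # reject anything harmful
                g = candidate
        return g

    print(greedy_walk("00"))   # prints '00' -- the compensated '11' form is never reached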

    Depending on how that conditional transcription works and how it's effectively used, that could change things a lot. It may essentially act as a macro preprocessor that allows genes to encode significantly more information than they might naively seem to.
    It's never safe to assume that complicated things are used sparingly by nature, like they would be by human engineers. Programmers tend not to go overboard with C++ templates, for instance, even though there are theoretically powerful things you can do with them, because it's not worth making your program that complex. But nature doesn't care, and it's solving a problem so hard that it might not admit any unsophisticated solutions.

    Here's something highly salient to this discussion that I just saw on Hacker News: http://www.washington.edu/news/2013/...-genetic-code/ I bet that there's enough going on in DNA that no model will hold up forever. There will always be some new twist that confounds any reductionistic model.
    Last edited by nburns; 12th December 2013 at 23:42.

  21. #51
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    Here's something highly salient to this discussion that I just saw on Hacker News: http://www.washington.edu/news/2013/...-genetic-code/
    Don't believe the hype. http://www.forbes.com/sites/emilywil...duon-dna-hype/

    The press release you linked to is full of bogosity. Everyone's known for decades that gene sequences do a whole lot more than just code for protein transcripts. Everyone knows there's a lot going on in the control region. (Pretty much everything I've said hinges on understanding that, which the press release makes sound like something shocking that the UW people just discovered.)

    Never believe University press releases about allegedly revolutionary discoveries in genetics. They're almost always full of shit.

    I bet that there's enough going on in DNA that no model will hold up forever. There will always be some new twist that confounds any reductionistic model.
    Why on Earth do you think this? (And what do you mean by "reductionistic" anyhow? It can mean several radically different things.)

    I'm not saying I think all this stuff will ultimately be easy, or even possible, to model very well. If it's not, I don't think that will be because the actual information content of the genome is so all-fired high. If we miss a few percent, but it's an important few percent and crucial to understanding what's really going on, we'll be hosed.

    I don't know or know of any geneticists who really think that the genome is efficiently coded. The parts we do understand are clearly very, very inefficiently coded, and they clearly do account for much of what's going on with genes.

    E.g., everyone agrees that understanding control regions mostly in terms of binding sites that promote or inhibit gene expression based on shapes of exposed hunks of signalling molecules is a huge, huge advance. It explains a whole lot of stuff we always wanted to know, and pretty clearly can explain a lot more. It's dead obvious at this point that a lot of the most important stuff going on in genetics is explicable in the "rule firing" terms I've been using.

    But look at the gene sequences for those binding sites---they're hundreds of codons long, and codons are 3 bases long, so we're talking on the order of kilobits per site, and it's clear the main function of most of those sites is to act as a symbol on the left hand side of a rule, plus a few bits for binding valence and strength. Even if there are a million symbols in use, and there are no regularities you could use to compress them, that's not more than a very few bytes of information per LHS symbol, and the useful information content is going to be very low. (My own guess would be 10-20 bits, because of modularity in genetic regulatory networks that could in principle let you compress the symbols a lot.) If that's mainly what's going on in control regions, the information content is a fraction of one percent of what you'd get by multiplying the number of codons by 6 bits.
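    To spell the arithmetic out (these are round guesses, not measurements):

    Code:
    # Back-of-the-envelope: raw bits in a control-region site vs. the information
    # it plausibly carries as one LHS symbol. All numbers are round guesses.
    codons_per_site = 500                        # "hundreds of codons long"
    raw_bits = codons_per_site * 3 * 2           # 3 bases/codon, 2 bits/base = 3000 bits
    useful_bits = 15                             # one LHS symbol plus valence/strength
    print(raw_bits, "raw bits vs", useful_bits, "useful bits ->",
          f"{useful_bits / raw_bits:.1%} of the naive estimate")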

    Even if there's a bunch more stuff going on in control regions, and we're missing 90 percent of the information being coded, that's still only a few percent useful information.

    Similar arguments apply to the coding regions. Even if most of the useful information in the coding region is not what I've been talking about, but is a subtle matter of conditional transcription and conditional folding and re-folding of the resulting protein, it would have to be a whole hell of a lot of that kind of information to actually amount to a whole lot of bits.

    There's no reason to think that there's a secret, shockingly efficient code that accounts for 99 percent of the information in the genome, such that we're looking at the tip of the tip of a vast iceberg, and there are good reasons to think that there isn't.

    If gene function were that mysterious and magically good at its job, we'd never have figured out the Hox complex---and there's nothing more important than Hox for understanding genetics and evolution. It's staggeringly important.

    Do you think we are only scratching the surface of how the Hox complexes work? I don't, and I don't think anybody who understands the science does. Sure, there's a lot of messy stuff going on that we can't model in fine detail, but it's not because we're missing the forest for the trees. We see the forest pretty clearly, and it's clear that what we don't see is less important undergrowth. There's no enormous mystery there, just the usual scientific mysteries.

    Or if you think Hox works pretty much the way I'm describing, as is the received view, what do you think genes are doing that requires a fabulously more efficient encoding, and leverages it to good effect?

    Hox and related genes are old, evolutionarily, and highly conserved, but they still get duplicated and culled and re-specialized. Evolution gets a huge amount of mileage out of them, and it's pretty clear it does it in the same dumb stochastic production system way, because it hasn't come up with anything better, or hasn't managed to adapt it to those kinds of purposes.

    Maybe that's because evolution got stuck using Hox a long time ago, and keeps doing it because it works, but has more recently invented better "programming language" technology, and uses it for other things which it does far more elegantly and with far more compact encodings.

    But what things? I'm stumped.

    I think that if genes weren't mostly productions in a production system, it would show, one way or another, and it doesn't. Absence of evidence is evidence of absence if the presence of the thing in question would likely have observable effects, and you don't see them.

    I'm not saying that the basic genetic assembly language is clean, or that it's only a matter of discrete rule firing with fuzzy-valued propositional symbols. There's a variety of things going on as elaborations of that programming framework, with conditional splicing, epigenetics, etc., but none of those things is a good candidate for vastly upping the estimate of the information content of the genome. They may make the programming language somewhat more expressive, but they don't seem to put us in a different ballpark. Even if the resulting representation is 10x as efficient as it looks right now, it's still very inefficient in our terms. The genome may contain some very special information encoded in some very special ways, doing very special things, but it's not likely to encode vast amounts of that kind of information.

  22. #52
    Member
    Join Date
    Oct 2013
    Location
    Filling a much-needed gap in the literature
    Posts
    350
    Thanks
    177
    Thanked 49 Times in 35 Posts
    So you're viewing evolution as a mathematical optimization problem, and you're proposing that it tends to get stuck in local optimum states -- like simulated annealing.
    Something like that, but I'm unclear on what you're making of it.

    So far as I know, that's the very-nearly-universal view among evolutionary biologists and population geneticists.

    Evolution does a kind of beam search, and clearly does tend to get stuck in local optima. It generally can't backtrack very far.


    The nonlinearity you mention could make it impossible to model simply. For instance, there might many paths between states, ensuring that there's almost always a way to transition between two states without passing through a lethal state.
    I'm not sure what you mean by that. I think there are usually a lot of ways of mutating gene sequences into functionally equivalent sequences, but that doesn't help much with the larger problem of finding workable intermediate phenotypes.

    Evolution's search is a fairly greedy beam search, and it's often infeasible for it to find the right intermediate forms to avoid passing through lethal states. That's why 99+ percent of all species are extinct.
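    If it helps, here's a cartoon of what I mean by a greedy beam search (the bit-string genome, fitness function, and parameters are all invented for illustration): keep the k fittest variants each generation, mutate them, and repeat. Nothing in the loop ever backtracks to a worse ancestral form that might have led somewhere better.

    Code:
    # Cartoon of evolution as a greedy beam search: keep the k fittest variants
    # each generation, mutate them, repeat. No backtracking to worse ancestors.
    # The bit-string "genome" and fitness function are invented for illustration.
    import random

    GENOME_LEN, BEAM_WIDTH, GENERATIONS = 20, 8, 200

    def fitness(g):                    # toy fitness: number of 1-bits
        return sum(g)

    def mutate(g):
        i = random.randrange(len(g))
        child = list(g)
        child[i] ^= 1                  # flip one bit
        return tuple(child)

    beam = [tuple(random.randint(0, 1) for _ in range(GENOME_LEN)) for _ in range(BEAM_WIDTH)]
    for _ in range(GENERATIONS):
        offspring = beam + [mutate(g) for g in beam for _ in range(4)]
        beam = sorted(set(offspring), key=fitness, reverse=True)[:BEAM_WIDTH]   # greedy culling

    print("best fitness found:", fitness(beam[0]), "out of a possible", GENOME_LEN)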

    That could help explain the mysterious way that evolution seems to find solutions that would seem to have no plausible evolutionary pathway connecting them.
    Huh? Like what? The more we know, the more we find that things are plausible that aren't immediately intuitive, and that evolution doesn't make big brilliant leaps. It just keeps grinding away at its beam searches and finds alternative easy routes we'd have missed.

  23. #53
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Paul W. View Post
    Something like that, but I'm unclear on what you're making of it.

    So far as I know, that's the very-nearly-universal view among evolutionary biologists and population geneticists.

    Evolution does a kind of beam search, and clearly does tend to get stuck in local optima. It generally can't backtrack very far.
    I'm not sure that you could liken it to a beam search. Mutations are totally random. There is no algorithm.

    Since genes respond nonlinearly to changes, multiple changes may interact in unpredictable, nonlinear ways. These may offer paths out of local optima.

    It doesn't go in a single direction, either. If mutations are harmless, they'll tend to build up over time in the population. Branches only get pruned if they're harmful.

    I'm not sure what you mean by that. I think there are usually a lot of ways of mutating gene sequences into functionally equivalent sequences, but that doesn't help much with the larger problem of finding workable intermediate phenotypes.

    Evolution's search is a fairly greedy beam search, and it's often infeasible for it to find the right intermediate forms to avoid passing through lethal states. That's why 99+ percent of all species are extinct.
    Species go extinct because the environment changes faster than they can evolve. It's slow, and it's hard to imagine it being much faster. I'm not sure that tells us much about DNA.

    Huh? Like what? The more we know, the more we find that things are plausible that aren't immediately intuitive, and that evolution doesn't make big brilliant leaps. It just keeps grinding away at its beam searches and finds alternative easy routes we'd have missed.
    I wouldn't say that it happened in one big leap, but evolution produced us, and we're here analyzing it. That's pretty brilliant.

    Evolution's search is massively parallel and exhaustive, so if there's one brilliant move out there, it will tend to find it.

    I'm willing to buy that, say, only 1% of the genome is absolutely necessary. But how would you find out which 1%?
    Last edited by nburns; 17th December 2013 at 09:21.
