Our favorite tweet about ENCODE went something like this: "I have no idea what ENCODE is but I know it's making everyone mad." There were valid reasons for all the to-do, but we want to go beyond that here and talk about the science.
The ENCODE project is a large-scale genomewide attempt to "build a comprehensive parts list of functional elements in the human genome." The idea is that once we know what does, and what doesn't, have function, we can effectively home in on those sites that affect human traits and, of course, disease susceptibility in particular. The new ENCODE web page will be very helpful and welcome.
Yes, the hoopla surrounding the release (or is it 'press release'?) of the ENCODE results was widely criticized, including by us, for its excesses and manifest, if not blatant, self-promotional advertising. The critiques included comments about possible over-acceptance of data that are less secure than the impression given, and other such issues.
These critiques are justified, and it will take an enormous, almost open-ended effort to address or resolve all of the points that have been raised--if that were even possible, given life's moving target. And, of course, this must already have triggered a host of me-too demands by the mouse, fly, nematode, Arabidopsis, maize, and other genome communities who will want to do all the same for their favorite species--and of course, they'll make the same claims about its urgent disease relevance. Adjudicating the relative importance of what to fund may be tricky if budgets are limited.
Just as importantly, a sober evaluation of what and where the real value is may be contentious. This is because if causation is complex and much or most of the genome is involved in traits of interest, exhaustive enumeration of countless trivially small or rare-variant effects may not have much point, even if each instance is really valid, which will be hard to prove. It would be like trying to understand a changing beach by enumerating sand grains. Our earlier post dealt with some of these points.
However, there is one major scientific point that, even in this sea of potential error or misinterpretation, seems rather safe--and particularly important.
Making life live
The traditional idea of genomes is that their purpose is to code for protein. We've long known that DNA has to replicate itself and that it strings along tens of thousands of protein-coding segments, and that these functions are fundamentally important to life. Life is about proteins and the variation among cells and among species is largely due to variation among proteins. Some DNA codes for RNA that is directly functional rather than being translated into protein. We've known all this for many decades.
We've also known for somewhat fewer decades that part of DNA is used to determine which of our thousands of genes are expressed (used) in which of our cells and under what conditions. This is called regulatory DNA. A body is made of many cell types (skin, nerve, muscle, stomach, immune, ...) and they are different because they use different sets of genes. Likewise, cells change the genes they use, or their level of activity, based on circumstances. That is why organisms can respond to changing environmental conditions and so on. This works because the regulatory DNA is recognized by proteins (transcription factors) that stick there, and that leads to other proteins transcribing nearby DNA into RNA.
We knew that regulatory DNA was just as important as coding DNA, so that there are two basic types of 'code'. In Mermaid's Tale we called coding DNA a correspondence code and regulatory DNA a recognition code. The first term refers to DNA whose sequence corresponds to the protein (or functional RNA) that is copied from it and used elsewhere in the cell. The second is a sequence code that is directly used--recognized by transcription factors--rather than only standing for something used elsewhere. The transcription factors themselves are proteins, and so must be coded for by their own correspondence codes somewhere in the genome.
The dogma for a long time was that protein variation is the cause of evolutionary fitness differences and trait variation. This followed essentially from Mendel's showing that what turned out to be protein variation--and hence variation in protein-coding genes--was responsible for trait differences in peas. That led to a century of focus on genes as protein correspondence codes, the discovery of how that worked and how those genes were arranged on chromosomes.
But a few decades ago recognition codes were discovered, and that explained in principle how DNA also coded for the usage of its genes. None of this was or is controversial. But gene mapping to find genes 'for' (causally associated with) traits, and especially (from a funding point of view) disease, initially was intended to find the protein variants that were responsible. This was the focus and the prior belief of most investigators. Genomewide association studies (GWAS) were conceptually designed for that purpose, but it was obvious that regulatory variation would also be picked up by the same method.
Although media and grant attention have focused on protein variation, even for years after we knew about the importance of regulation, it has turned out that many if not most well-documented, mapping-identified common disease variants are not in protein-coding regions. And the important documentation of the ENCODE project shows, rather persuasively and systematically, that the finding of regulatory rather than coding variation in the data has been a valid indicator of how things are.
The important impact of this is to make it clearer that most of our functional DNA is not about protein codes directly, but is about how they are used. It is the timing and regulation, among cells, of which genes are used and at what level of expression that accounts for the production of organisms from fertilized eggs (single cells). Likewise, variation in that regulation, and hence how it is seen by natural selection, is important to evolution. In a sense, overall, evolution is a regulatory phenomenon.
What needs rethinking?
The ENCODE project did not raise any new points in this regard, but the paper by Maurano et al. in the Sept 7 Science, "Systematic Localization of Common Disease-Associated Variation in Regulatory DNA," which presents the regulatory documentation from the project, is an important one to be aware of. And this paper deals only with classical 'regulatory' regions, not with other kinds of DNA-encoded RNA molecules of various sorts that serve to regulate gene usage and dosage in other ways (others of the ENCODE papers in various places deal with some of these issues).
The finding that much or most of DNA function is regulatory casts a somewhat different light on the nature of organisms and on their evolution, but it is not a fundamental new theory of any sort. Let's hear no talk of 'paradigm shifts' or any other such puffery! Though we continue to learn of new DNA function, and that more of the DNA has function than was thought at one time, the results are just a re-weighting of focus among things we've long known. It's an incremental accumulation of knowledge. Indeed, traits and their evolution and variation are still mainly about protein structure and variation, and that ultimately means genomic structure and variation.
Conceptually, however, by trying to understand regulation, we gain a better way to view ourselves as 'emergent' phenomena, that is, that an organism and its highly complex organization is more than just a pile of genes, but 'emerges' from the way those genes are used. This is similar to the idea that you cannot understand a building by enumerating the bricks, tiles, wires, and so on that it is built of--or sand grains and beaches.
And, importantly, the new project makes traits more, not less, complex and potentially problematic to deal with. This is because it simply adds to the breadth and depth of factors whose variation contributes to the nature and variation of our traits. Similarly, with every new factor that we find contributing in subtle, and subtly variable, ways to traits of interest, the harder it becomes to promise that we can enumerate risk gene by gene, or account for genome evolution by gene-focused selection models. Different ways of thinking will be called for.
So, filter out the blatant BS, and there is still importance in the findings.