The Mermaid's Tale: ENCODE


Thursday, March 7, 2013

What is genetic function? The ENCODE non-questions

The human genome is 3.1 billion nucleotides long. If only 1-2% of it codes for proteins, what does the rest do? And how do we figure that out? One way is to commit multi-millions of dollars to a project dedicated to doing just that, and that's exactly what's been done by the ENCODE project, which recently published, with much fanfare, the results of years of work by a consortium of over 400 people and numerous labs. The comment by the ENCODE PR spokespeople that got the most attention during all the hoopla when the papers were published was the idea that while 98-99% of the genome was once called 'junk DNA', now it looks like 80% of the genome is in fact functional.

There is a heated, and indeed vitriolic, debate about how misrepresentative, and even how wasteful, the ENCODE Megaproject was. ENCODE is a cute acronym (we need that in science, after all, since much of what we're about is marketing, so we need a brand or trade-mark) for ENCyclopedia Of DNA Elements. In a nutshell, the project was a large consortium whose objective was to identify as many of the functional elements in genomes as possible.

[Image: The interaction of tRNA and mRNA in protein synthesis. Source: Wikipedia]

We know that DNA codes for protein, but that is only about 1-3% of the genome. Another small fraction is transcribed into a variety of RNA molecules that do things on their own (that is, they don't just get translated into protein). Examples are transfer RNA, ribosomal RNA, and various others. Some of this, called microRNA, is used to affect gene usage, by interfering with messenger RNA and hence protein production. Protein coding and these other RNA processes and interactions are what in our Mermaid's Tale book we called 'correspondence' codes in the genome, because DNA contains the code for -- corresponds to -- the RNA which is then used elsewhere in the cell.

Then there are bits of DNA that contain what we called 'recognition' codes. The DNA sequence is directly recognized by other molecules, such as proteins called transcription factors, that physically bind to the DNA sequence elements and, among other things, cause nearby protein-coding genes to be transcribed into messengerRNA. These are codes, but they act locally on the DNA itself.

The middle and ends of chromosomes (centromeres and telomeres) contain DNA sequences used for protecting the integrity of the DNA molecule in the potentially hostile chemical environment of the cell, or in the process by which chromosomes are copied when the cell divides. There are other codes of various kinds in DNA that affect how it is wrapped around proteins so it can fit into the nucleus and so on.

But various studies had shown that much or even most DNA is actually transcribed into RNA molecules of unknown (if any) function. It is replicable--so not just chance or experimental trash. Since the function isn't known, it is debatable whether this is truly 'functional' or not.

Meanwhile, 40% or even more of our DNA consists of repeat elements, short sequences that are found scattered all over the genome, and (among other ways) are copied from one location and inserted more or less randomly in some other location. These arise through various processes, including errors in DNA replication (e.g., microsatellites) or some viral-related mechanism that operates only on rare occasions, but often enough over evolutionary time for the elements to proliferate into the hundreds of thousands.

By some accounts, especially since it seems to be transcribed into RNA, much or even most of the genome is 'functional'. Such claims challenge well-established ideas that most of the genome has very little function--what was called 'junk' DNA--and that therefore only a small fraction really matters.

But what is function?
A lively, funny, but quite sharp--some would say vicious--attack on the excited reports of ENCODE by Dan Graur and colleagues was published recently. First, even though the ENCODE authors, being good scientists, put lots of caveats in the original papers, they were not averse to the super-hyping given the report by the media. Instead of saying that ENCODE had provided a very useful and accessible data resource and some thought-provoking data, the usual hype about transformative new findings, mysteries uncovered, etc. was all over the media last year. Graur et al. blasted such reportage as culpable, or even scientifically naive hype (or, perhaps, bovine droppings). Indeed, the aspects of genome structure and use that were reported by the project were all to some extent or other already well-known, even if ENCODE provides a more systematic data resource and coverage of them than had been available before.

The controversy involves many different issues, some of them quite technical and methodological, but the core centered around ideas of 'function'. The project investigators used various methods to find biochemical activity of different kinds to identify aspects of the genome that were functional by that standard. Thus, for example, if a transcription factor protein stuck to a particular bit of DNA, that was activity and classified as function; it didn't have to be shown to affect a protein-coding gene's expression level.

From an evolutionary point of view, function only matters if it affects reproductive success--or 'fitness' in the Darwinian sense related to natural selection. Why is this? It's because if it doesn't affect fitness, then mutations will eventually disrupt the activity but with no loss to the organism's reproduction. The bit of DNA will, over time, accumulate variation among individuals and between species. By contrast, a bit of DNA that does have a fitness effect will have much less variation in the population, because mutational disruption will harm the individual, who won't reproduce, taking the variation out with it. We say that relatively limited variation, or sequence conservation among or within species, indicates evolutionarily important function. Indeed, even if the bit of DNA did have some function that affected a trait, say body shape, but not in a way that would be screened by natural selection--that is, not in a way that affected fitness--that function would sooner or later be erased by mutation.
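The link between fitness effects and sequence conservation can be put in rough numbers. What follows is a minimal sketch, not anything taken from the ENCODE papers or from Graur et al.: it uses Kimura's classical diffusion approximation for the chance that a new mutation eventually spreads through a diploid population, and the population size and selection coefficients are illustrative assumptions only.

```python
import math

def fixation_probability(s: float, n: int) -> float:
    """Kimura's approximation for a new mutation with additive selection
    coefficient s in an idealized diploid population of n individuals."""
    if s == 0:
        return 1 / (2 * n)  # neutral case: pure drift
    # expm1 keeps precision when s is very small
    return math.expm1(-2 * s) / math.expm1(-4 * n * s)

def relative_substitution_rate(s: float, n: int) -> float:
    """Substitution rate at a site relative to a strictly neutral site."""
    return fixation_probability(s, n) / fixation_probability(0.0, n)

# Assumed values, chosen only for illustration.
POPULATION = 10_000
for s in (0.0, -1e-6, -1e-4, -1e-3):
    rate = relative_substitution_rate(s, POPULATION)
    print(f"s = {s:>8}: relative substitution rate = {rate:.3g}")
```

Under these assumptions, a mildly harmful mutation almost never spreads, so the site stays conserved, while a mutation whose effect is much smaller than 1/n behaves almost exactly like a neutral one. That is the sense in which function with no fitness effect is eventually erased.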

In that sense the function might be real but evolutionarily unimportant or irrelevant. The idea that one could have function but not be affected by mutation in this way is tantamount, the critics argued, to saying that organized structures could arise just by chance, without being molded by natural selection. That is hard to justify (actually, there may be such reasons, but they're too much to go into here). But it's worth noting that Graur et al. do point out that such function could, under some circumstances, become relevant to natural selection, so that even highly variable bits of DNA may not be unrelated to evolutionary potential. But looking at it at any given time can't tell you that, and doesn't warrant assigning function in the evolutionary sense to it.

Wasted electrons--debates over angels on pin-heads
This is a debate about many things, but in part centers around orthodoxy. The discussion is over the question "What fraction of the human genome is actually 'functional' in these latter senses?" 10%? 80%? 17.654382234887%?

This is a thoroughly electron-wasting debate (using up electrons via the internet and airwaves), because we know very well, beyond any serious doubt, that the usefully used parts of the genome vary from person to person and, indeed, from cell to cell within each of us! And if we take a broader evolutionary view, different parts and regions and fractions of genomes will be used over time.

Among individuals in a species at any given time there are hundreds of dead or partly dead genes, regulatory regions with variable strength transcription factor binding, and so on, all across the genome. These vary from person to person, as we have clear evidence to prove. And, have you forgotten the hoopla over copy number variation, the hot recently-new finding that our numbers of genes and other parts of our genomes vary by the thousands among us and between the two copies of the genome each of us carries?

Since everyone differs, it is almost impossible in principle to ask this question of any single individual, because there just isn't enough information, and unique observations can't be tested with the statistical approaches needed to document function (needed for reasons that are not controversial). Alternatively, we might come to some sort of average functional fraction for a species, but that is rather vague and perhaps misleading--misleading about how DNA functions. For example, it's been estimated that a high fraction of our 'real' genes (protein-coding or regulatory regions) are individually dispensable if other well-working genes cover the same function. In that sense, a high fraction even of those 'real' genes are dispensable.

Whether something has a fitness effect is also a statistical question, since there is always a probabilistic aspect to reproductive success. The amount of conservation in a DNA region is in principle an indicator of past history of natural selection, but it also involves other factors (population size, mutation rate, and so on), and it is inherently a relative measure. Assessing what varies enough to be judged not to have a fitness-related function is also a statistical issue, not one with precise criteria. If the relatively limited variation in protein-coding regions reflects real function, how much more variation reflects no, or 'less', function? One might say that the evolutionary definition of function, based on relative variation, is not entirely free of the subjectivity issues that plague the ENCODE definition of function based on having some biochemical activity.

This wastes trillions of electrons, because it stimulates the media streams of hot air, capitalizing on the flap, even though the issues themselves are debates over non-questions or even subjective issues, as we have tried to suggest here, that are not clear cut--and of course there is the strong vested interest of the investigators vigorously to defend the over-selling of yet another over-priced mega-project, so its funding won't be cut. As usual, this electron stream misses much of the interesting and actually scientific aspect of the findings and their ambiguities.

For example, the issues rest on the tacit idea that genome functions can be enumerated at the nucleotide sequence level by essentially assuming that each function is independent of other functions, which is purely a fiction. But interdependence makes these kinds of issues, that are based on differing criteria of 'function' and relative variation, very tricky. And the evolutionary argument essentially assumes that non-conservation means no function which is also a misperception of the dynamic control and complexity of genetic mechanisms and evolutionary adaptation. This is not the place for me to outline my view on that, but I do think that a proper understanding of genomes and their evolution can answer the perceived differences of point of view in the current food fight.

Of course, thinking seriously about evolution is harder, won't please Big Story seeking journalists, and makes less dramatic material for grant applications. So the food fight is not at all surprising.

Tuesday, September 11, 2012

What makes the traits of life? More on ENCODE

By Ken Weiss

Our favorite tweet about ENCODE went something like this: "I have no idea what ENCODE is but I know it's making everyone mad." There were valid reasons for all the to-do, but we want to go beyond that here and talk about the science.

The ENCODE project is a large-scale genomewide attempt to "build a comprehensive parts list of functional elements in the human genome." The idea is that once we know what does, and what doesn't, have function, we can effectively home in on those sites that affect human traits and, of course, disease susceptibility in particular. The new ENCODE web page will be very helpful and welcome.

Yes, the hoopla surrounding the release (or is it 'press release'?) of the ENCODE results was widely criticized, including by us, for its excesses and manifest if not blatant self-promotional advertising. The critiques included comments about possible over-acceptance of data that are less secure than the impression given, and other such issues.

These critiques are justified, and it will take an enormous almost open-ended effort to address or resolve all of the points that have been raised--if that were even possible, given life's moving target. And, of course, this must already have triggered a host of me-too demands by the mouse, fly, nematode, arabidopsis, maize, and other genome communities who will want to do all the same for their favorite species--and of course, they'll make the same claims about its urgent disease relevance. Adjudicating the relative importance of what to fund may be tricky if budgets are limited.

As importantly, a sober evaluation of what and where the real value is may be contentious. This is because if causation is complex and much or most of the genome is involved in traits of interest, exhaustive enumeration of countless trivially-small or rare-variant effects may not have much point, even if each instance is really valid, which will be hard to prove. Like trying to understand a changing beach by enumerating sand grains. Our earlier post dealt with some of these points.

However, there is one major actual scientific point that even in this sea of potential error or misinterpretation, seems rather safe--and particularly important.

Making life live
The traditional idea of genomes is that their purpose is to code for protein. We've long known that DNA has to replicate itself and that it strings along tens of thousands of protein-coding segments, and that these functions are fundamentally important to life. Life is about proteins and the variation among cells and among species is largely due to variation among proteins. Some DNA codes for RNA that is directly functional rather than being translated into protein. We've known all this for many decades.

We've also known for somewhat fewer decades that part of DNA is used to determine which of our thousands of genes are expressed (used) in which of our cells and under what conditions. This is called regulatory DNA. A body is made of many cell types (skin, nerve, muscle, stomach, immune,....) and they are different because they use different sets of genes. Likewise, cells change the genes they use, or their level of activity, based on circumstances. That is why organisms can respond to changing environmental conditions and so on. This works because the regulatory DNA is recognized by proteins (transcription factors) that stick there, and that leads to other proteins transcribing nearby DNA into RNA.

We knew that regulatory DNA was just as important as coding DNA, so that there are two basic types of 'code'. In Mermaid's Tale we called these correspondence codes and recognition codes, respectively. The first term refers to DNA whose sequence corresponds to the protein (or functional RNA) that is copied from it and used elsewhere in the cell. The second is a sequence code that is directly used--recognized by transcription factors--rather than only standing for something used elsewhere. The transcription factors themselves are proteins, and so must be coded for by their own correspondence codes somewhere in the genome.

The dogma for a long time was that protein variation is the cause of evolutionary fitness differences and trait variation. This followed essentially from Mendel's showing that what turned out to be protein variation--and hence variation in protein-coding genes--was responsible for trait differences in peas. That led to a century of focus on genes as protein correspondence codes, the discovery of how that worked and how those genes were arranged on chromosomes.

But a few decades ago recognition codes were discovered, and that explained in principle how DNA also coded for the usage of its genes. None of this was or is controversial. But gene mapping to find genes 'for' (causally associated with) traits, and especially (from a funding point of view) disease, initially was intended to find the protein variants that were responsible. This was the focus and the prior belief of most investigators, but genomewide association studies, GWAS, were conceptually designed for that purpose, and it was obvious that regulatory variation would also be picked up by the same method.

Although the media and grant attention has focused on protein variation, even for years after we knew about the importance of regulation, it has turned out that many if not most well-documented mapping-identified common disease variants were not in protein coding regions. And the important documentation of the ENCODE project shows, rather persuasively and systematically, that the finding of regulatory rather than coding variation in the data has been a valid indicator of how things are.

The important impact of this is to make it clearer that most of our functional DNA is not about protein-codes directly, but is about how they are used. It is the timing among cells, and regulation of that among cells, of the genes and their level of expression, that accounts for the production of organisms from fertilized eggs (single cells), and likewise that their variation and hence how they are seen by natural selection is important to evolution. In a sense, overall, evolution is a regulatory phenomenon.

What needs rethinking?
The ENCODE project did not raise any new points in this regard, but the paper by Maurano et al. in the Sept 7 Science, "Systematic Localization of Common Disease-Associated Variation in Regulatory DNA," which shows the regulatory documentation from the project, is an important one to be aware of. And this paper only deals with classical 'regulatory' regions, but not with other kinds of DNA-encoded RNA molecules of various sorts that serve to regulate gene usage and dosage in other ways (others of the ENCODE papers in various places deal with some of these issues).

The finding that much or most of DNA function is regulatory casts a somewhat different light on the nature of organisms and on their evolution, but it is not a fundamental new theory of any sort. Let's hear no talk of 'paradigm shifts' or any other such puffery! Though we continue to learn of new DNA function and that more of the DNA has function than was thought at one time, the results are just a re-weighting of focus among things we've long known. It's an incremental accumulation of knowledge. Indeed, traits and their evolution and variation are still mainly about protein structure and variation, and that ultimately means genomic structure and variation.

Conceptually, however, by trying to understand regulation, we gain a better way to view ourselves as 'emergent' phenomena, that is, that an organism and its highly complex organization is more than just a pile of genes, but 'emerges' from the way those genes are used. This is similar to the idea that you cannot understand a building by enumerating the bricks, tiles, wires, and so on that it is built of--or sand grains and beaches.

And, importantly, the new project makes traits more, not less, complex and potentially problematic to deal with. This is because it simply adds to the breadth and depth of factors whose variation contributes to the nature and variation of our traits. Similarly, with every new factor that we find contributing in subtle, and subtly variable, ways to traits of interest, the harder it is to promise that we can enumerate risk gene by gene, or account for genome evolution by gene-focused selection models. Different ways of thinking will be called for.

So, filter out the blatant BS, and there is still importance in the findings.

Thursday, September 6, 2012

More advertising, yet a lesson to learn

By Ken Weiss

The HypeMachine is out in force behind the 30 new publications, accessible via Nature's "ENCODE explorer", featuring the ENCODE show. It's been 10 years, at least, since Nature, Science, and other journals started press-release, Big Splash, orchestrated displays of Surprise! at their latest magnificent special issues. Well, it's how business works--and this is business, after all, despite the attempt to make it appear simply the quest for knowledge.

The media pick up on it of course--you can't miss it today--and it's no different from the political convention gala celebration. Some journalists understand the issues better than others, but few of them are doing other than reporting the hype. American life has become one big Potemkin village: all show, but substance a lot harder to find. So what is this ENCODE report?

ENCODE is a mega-project designed to use various data and means to identify the functional usages (or not) of every part of selected segments of the genome. Data from sequencing, gene expression, RNA detection, experimental function, and so on are coalesced into a set of digested reports. But you can also play with various aspects on the ENCODE site.

So, besides the clear money-seeking hoopla, what is actually being found? First, we are told that the term 'junk DNA' is a relic of the past. That was when we 'knew' that DNA was a string of nucleotides, like a necklace, but along which there were spaced some protein-coding sections. The history of research was driven largely by exploiting Mendel's finding of discrete functional heritable units that were later called 'genes'.

Then we learned that some of the proteins coded by those units, by genes, served to control the use of other genes, depending on their cell context. How they did that was that the coded protein recognized short DNA sequences roughly nearby on the chromosome to some gene to be regulated. These proteins are called transcription factors (TFs). That showed how cells differentiate, but also that some DNA was 'regulatory': it didn't code for protein, but instead directly coded for recognition by these TFs. So it was functional, but not coding.

Step by step many other kinds of function have been found, over a period of 20 or so years. Read the report and summaries if you want to know what these functions are. But essentially, they further relate to gene usage, repression, or modification; or to chromosome copying and packaging, or to other such functions.

Not an idle nucleotide?
They confirm our idea that, at its base, DNA controls protein production and perpetuates itself, but through many subtle, complex functional roles. Many are still to be discovered, but a much larger fraction of the genome is found to have some function, and the protein coding parts have been shown to comprise only a few percent of the whole of our DNA. Whether ENCODE has actually determined the function of 80% of the genome, as they now claim, is debatable.

For example, introns are non-coding stretches of DNA that interrupt the coding parts of genes. A gene codes for a protein essentially in sections--stretches of DNA with amino acid codes, separated by stretches that don't code. All of this is transcribed to RNA but the non-coding parts are spliced out, and the coding parts joined together. ENCODE descriptions, in stretching for their 80% figure, include introns because there are some clearly shown functions of some of the nucleotides found in some introns. The important part is that introns are not all just inert DNA, but the excess is to say that all of it matters.
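The splicing step described above can be sketched in a few lines. This is a toy illustration only: the transcript and the exon coordinates are made up, and the case convention (uppercase for coding, lowercase for intron) is just a visual aid, not anything the cell uses.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon segments (0-based, end-exclusive coordinates)
    and drop everything between them."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Hypothetical toy transcript: two coding stretches around one intron.
pre_mrna = "AUGGCC" + "guaaguuuucag" + "UUUUAA"
exons = [(0, 6), (18, 24)]

mature = splice(pre_mrna, exons)
intron_fraction = 1 - len(mature) / len(pre_mrna)

print(mature)
print(f"fraction of the transcript removed as intron: {intron_fraction:.0%}")
```

Counting every intronic nucleotide as 'functional' because a few of them do something is the kind of stretching at issue here: in this toy case it would claim the removed half of the transcript wholesale.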

Around 40% or more of the genome is made of various types of repeat elements--short segments of DNA of which copies or near-copies are found adjacent to each other or splattered all over our chromosomes. They get inserted from time to time by various molecular means. Some few of these repeat elements have been shown to have function, and the location and size of various clusters of them may, as well. But most have no plausible function in the proper sense of the term. The cell may have to deal with them, but that's not generally clear and that doesn't mean that they actually do anything.

So there is a bit, or a lot, of exaggeration. Still, the point is valid that more is going on in genomes than had been thought. However, whatever the percent of functional DNA actually is, documenting that function is fine, but much of it has been gradually being discovered for 20 years or so. So the hoopla is very, very misplaced. The ENCODE database is going to be a useful, convenient web tool and so on. It's great to have new knowledge, and new websites comprising and summarizing such knowledge, but it's not all newly discovered. And, equally important, some of the functions are not very well understood and much guessing is still going on. And 'function' is being conveniently redefined to self-justify and puff up the importance. All this, too, would be fine if fully acknowledged and, especially, if it weren't for the self-importance being proclaimed. And the constant dropping of hints that this will lead to disease cures, like claims that Mars adventures are because there might be life out there, have to be viewed mainly as just advertisements--sales pitches.

A deeper issue: evolutionary considerations
Nobody wants to think about some of the deeper issues. One of them is that biological function basically arises because it is advantageous to your reproductive success (evolutionary fitness in the face of natural selection), or is removed if it's harmful. So if this all has function, it must be screened by natural selection. That is easy to say but very hard to understand. Why?

First, the more that has function that is screened by selection, the more excess reproduction is needed to overcome that selection so that a species--ours, or any other species--doesn't just go extinct because everybody's got bad bits of DNA somewhere in their genome! This is called genetic load. Obviously we don't hatch out thousands of babies so that each human couple on average produces two or more surviving children.

In turn this means that even if 80% of the nucleotides in our DNA (that's 2.5 billion nucleotides, by the way) are being screened by selection, the selection pressure is, on average, trivially small. For the vast majority of sites, only in very occasional, rare, bad-luck combinations with other sites will selection screen you up or out. And that in turn means that genetic drift--chance--will dominate in determining the variants each of us has, and hence the general nature of, genomic DNA sequences. And again in turn, this means that in essence most of this DNA doesn't, for any practical purposes, have function after all. At least not very specific or important function. Most species, certainly most large slow-reproducing species, simply couldn't support so much selective screening.
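The load argument can be given rough numbers. The sketch below is a back-of-envelope illustration, not a calculation from the ENCODE papers: it uses the classical approximation that, with independent (multiplicative) effects, mean fitness at mutation-selection balance is about e^(-U), where U is the number of new harmful mutations per zygote. The mutation rate and the share of mutations that are harmful are assumed values, picked only to show the shape of the argument.

```python
import math

GENOME_SIZE = 3.1e9  # nucleotides per haploid genome, the figure used in these posts

def functional_sites(fraction: float) -> float:
    """Number of nucleotides counted as functional for a given fraction."""
    return GENOME_SIZE * fraction

def offspring_needed_per_couple(functional_fraction: float,
                                mutation_rate: float = 1e-8,      # assumed, per site per generation
                                deleterious_share: float = 0.1):  # assumed share of harmful hits
    """Births per couple needed, on average, to leave two survivors
    under the e^(-U) approximation for mean fitness."""
    # New harmful mutations per diploid zygote (two genome copies).
    u = 2 * functional_sites(functional_fraction) * mutation_rate * deleterious_share
    mean_fitness = math.exp(-u)
    return 2 / mean_fitness

print(f"80% of the genome = {functional_sites(0.80):.2e} nucleotides")
for fraction in (0.05, 0.10, 0.80):
    births = offspring_needed_per_couple(fraction)
    print(f"functional fraction {fraction:.0%}: about {births:.1f} births per couple")
```

The exact figures depend entirely on the assumed parameters, but the direction does not: the required number of births grows exponentially with the functional fraction. So either the average selection per site is trivially small, or the species could not sustain itself.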

There are ways for DNA over the eons to have evolved a lot of distributed function, even if in any individual like you or me, the vast majority has essentially no function (or, that is, its variation has essentially no important effect on the variation between you and me). But this is far from the strong assertions of function that are being made.

Reporters and bloggers should take off their rose-colored hype-believing glasses, and understand this.

The important bottom line
Despite all these reservations, a deeper understanding of genomic content is important both for basic biology and its practical application. What it all really means is actually something that was not wanted and not really predicted, even though we had good reason to do so. It's that each new discovery shows that genetic control of what an organism is and how an organism works (and, hence, how organisms evolve) is more complex and less about single, deterministic causation than Mendel's work had led us to think. He showed us the extremes of simplicity, by carefully chosen experiments, and that set out a path we could follow for discovery.

But complexity always grows, never diminishes, by this work. It shows why promises of simple cures for disease--always the hope dangled before the drooling public to keep the till open--may be more elusive, at least elusive if approached from a genetic-causal viewpoint. Part of the problem is that all of these functions are not just additive: you can't get to net function--how you actually are as an organism--just by adding up all the little variant contributions in 80% of your genome.

Genes and the various genome functions interact. They form networks and complexes of contribution to final traits that are not just additive. Indeed, identifying how they are not additive is a major challenge. Some aspects of genome function are obviously strong and clearly understandable by usual scientific approaches. Some are close enough to being additive that we can assume that. But we know that much, probably most, is not so easy.
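A toy contrast may make 'not just additive' concrete. Everything in this sketch is hypothetical: the variant names, effect sizes, and the single pairwise interaction are invented for illustration and do not model any real trait.

```python
def additive_trait(genotype: set[str], effects: dict[str, float]) -> float:
    """Trait value if each variant simply adds its own contribution."""
    return sum(effects[variant] for variant in genotype)

def interacting_trait(genotype: set[str],
                      effects: dict[str, float],
                      interactions: dict[tuple[str, str], float]) -> float:
    """Trait value when some pairs of variants matter only in combination."""
    total = additive_trait(genotype, effects)
    for (a, b), extra in interactions.items():
        if a in genotype and b in genotype:
            total += extra
    return total

# Hypothetical variants and effect sizes.
effects = {"A": 1.0, "B": 1.0}
interactions = {("A", "B"): 3.0}

for genotype in ({"A"}, {"B"}, {"A", "B"}):
    print(sorted(genotype),
          additive_trait(genotype, effects),
          interacting_trait(genotype, effects, interactions))
```

Studying A or B alone would never reveal the combined effect, and with many variants the number of possible combinations quickly outruns anything that can be enumerated one by one.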

Secondly, because of the specific history of discovery, we have come to think of genome function as being linear--things along a line of nucleotides that make up a single chromosome. Action occurs in what is called cis--meaning along the chromosomes. But we are learning--and ENCODE presents but did not discover this--that there are trans interactions, which means direct interactions between bits of DNA on different chromosomes, as they arrange themselves in the nucleus of cells--and, somehow, do this differently depending on the cell type and its context at any given time. This must be important, but there are deep questions: genomes have rearranged their segments in many ways during evolution, and yet related species with very similar functions (like, say, all mammals), still manage to function even though their genomes are rather differently arranged (and they have different numbers of chromosomes). So how critical is all of this trans activity and organization?

And there are many other curious things going on that are not mentioned in the current ENCODE papers. All very interesting and at least some of it undoubtedly important.

There are many implications of this growing knowledge, and they all point to a need to come to grips in some better, or even some fundamentally newer, way than we have been doing. This is an important challenge, and the ball to keep your eye on. Like political conventions, simple stories and promises are spun to win your votes, but big issues are at stake.

Wednesday, July 20, 2011

Epistemology and genetics: does pervasive transcription happen?

By Anne Buchanan

Pervasive transcription

[Image: RNA basepairing]

It has been known for some time that only 1-5% or so of the mammalian genome actually codes for proteins. It does that by 'transcribing' a copy of a DNA region traditionally called a 'gene' into messenger RNA (mRNA) that is in turn 'translated' into an amino acid sequence (protein, in common parlance). According to this established theory known as the Central Dogma of Biology (the problems with which we blogged about here), a gene had transcription start and stop sites, specific sequence elements in DNA from which this canonical (regular) structure of a 'gene' could be identified from DNA sequence (but see below for problems with the definition of a gene). From that, we got the estimate that our genome contains roughly 25,000 genes.

Not so long ago, the remainder of the genome, the non-coding DNA, was called 'junk DNA' because it wasn't known what, if any function it had. Then, some of it, generally short regions near genes, was discovered to be sequence elements that, when bound by various proteins in a cell, cause the nearby gene to be transcribed. So some of that DNA had a function after all.

But then, with various projects such as the ENCODE project in 2007, a multi-institutional exploration of the function of all aspects of the genome in great detail, it was found that the majority of all the DNA in the genome is in fact transcribed into RNA in a process called 'pervasive transcription'. As the ENCODE project reported,

First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another.

What all this RNA did wasn't yet known, because it didn't have the structure that would be translated into protein, but that pervasive transcription happened seemed clear. Some functions were subsequently discovered, such as 'microRNA', short sequences complementary to mRNA that are used to inhibit the mRNA's translation into protein. There were also types of RNA that had their own functions within the classical idea of a gene, even if not translated into protein (these included ribosomal RNA and transfer RNA).
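
For readers who like to see 'complementary' made concrete, here is a toy sketch in Python. The sequences are invented for illustration, and the rule used (perfect pairing with the first few bases of the microRNA) is a deliberate simplification of how real target recognition works; the function names are ours, not from any library.

```python
from __future__ import annotations

# Toy illustration only: the sequences used with these functions are invented,
# and perfect pairing over a short 'seed' is a simplifying assumption.
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}  # Watson-Crick RNA base pairs


def reverse_complement(rna: str) -> str:
    """Return the strand that would base-pair with `rna`, written 5' to 3'."""
    return "".join(PAIR[base] for base in reversed(rna))


def find_target_sites(mirna: str, mrna: str, seed_len: int = 7) -> list[int]:
    """Return 0-based positions in `mrna` that pair perfectly with the
    first `seed_len` bases of `mirna` (hypothetical matching rule)."""
    site = reverse_complement(mirna[:seed_len])
    last_start = len(mrna) - len(site)
    return [i for i in range(last_start + 1) if mrna[i:i + len(site)] == site]


if __name__ == "__main__":
    # Made-up microRNA and messenger RNA fragments.
    print(find_target_sites("UGAGGUAGUAGG", "GGCAUACCUCAGGAA"))
```

The point is only that a short RNA can 'recognize' a message by base pairing alone, which is what lets it interfere with that message's translation.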

Or maybe not...
However, the pervasive transcription idea was challenged by van Bakel et al. in PLoS Biology, who said that most of the low-level transcription described by ENCODE was in fact experimental artifact or just meaningless noise, error without a function.

Oh, but it does!
Now, in a tit for tat, just last week a refutation of this refutation, a paper called "The Reality of Pervasive Transcription," appeared in PLoS Biology (on July 12). Clark et al. confirm that pervasive transcription does in fact happen, and take issue with the van Bakel et al. results.

...we present an evaluation of the analysis and conclusions of van Bakel et al. compared to those of others and show that (1) the existence of pervasive transcription is supported by multiple independent techniques; (2) re-analysis of the van Bakel et al. tiling arrays shows that their results are atypical compared to those of ENCODE and lack independent validation; and (3) the RNA sequencing dataset used by van Bakel et al. suffered from insufficient sequencing depth and poor transcript assembly, compromising their ability to detect the less abundant transcripts outside of protein-coding genes. We conclude that the totality of the evidence strongly supports pervasive transcription of mammalian genomes, although the biological significance of many novel coding and noncoding transcripts remains to be explored.

Clark et al. question van Bakel et al.'s molecular technique as well as their 'logic and analysis'.

These may be summarized as (1) insufficient sequencing depth and breadth and poor transcript assembly, together with the sampling problems that arise as a consequence of the domination of sequence data by highly expressed transcripts; compounded by (2) the dismissal of transcripts derived from introns; (3) a lack of consideration of non-polyadenylated transcripts; (4) an inability to discriminate antisense transcripts; and (5) the questionable assertion that rarer RNAs are not genuine and/or functional transcripts.

They go into detail in the paper about how and why these are serious problems, and conclude that van Bakel et al.'s results are 'atypical' for tiling array data, that their tissue samples were not sufficiently extensive, and that pervasive transcription is being detected by a variety of experimental methods. (Tiling refers to the design of the array: it carries many short probes whose sequences cover, or 'tile', a stretch of chromosome at regular and often overlapping intervals, so that the pattern of probes to which a sample's RNA binds shows which parts of that stretch were transcribed.)
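
The sequencing-depth objection is easy to put in numbers. The sketch below is ours, not Clark et al.'s, and it assumes something real experiments only approximate: that each read is drawn independently, with a transcript's chance of being picked equal to its share of all the RNA in the sample. Under that assumption, the chance of seeing a transcript at least once in N reads is 1 - (1 - f)^N, where f is its share.

```python
import math


def detection_probability(fraction: float, n_reads: int) -> float:
    """Chance that a transcript making up `fraction` of the RNA pool is hit
    by at least one of `n_reads` independent reads: 1 - (1 - fraction)**n_reads.
    Computed with log1p/expm1 to stay accurate for very small fractions."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return -math.expm1(n_reads * math.log1p(-fraction))


if __name__ == "__main__":
    rare = 1e-7  # hypothetical transcript: one read in ten million
    for depth in (10**6, 10**7, 10**8):
        print(depth, round(detection_probability(rare, depth), 4))
```

With these made-up numbers, a transcript accounting for one read in ten million is missed about nine times out of ten at a million reads, and almost never at a hundred million. Whether such a rare transcript means anything biologically is, of course, exactly what the two sides dispute.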

No, no, no!
And finally (to date), van Bakel et al. respond, also in the July 12 PLoS Biology.

Clark et al. criticize several aspects of our study, and specifically challenge our assertion that the degree of pervasive transcription has previously been overstated. We disagree with much of their reasoning and their interpretation of our work. For example, many of our conclusions are based on overall sequence read distributions, while Clark et al. focus on transcript units and seqfrags (sets of overlapping reads). A key point is that one can derive a robust estimate of the relative amounts of different transcript types without having a complete reconstruction of every single transcript.

So, they defend their methods and interpretation and conclude that "a compelling wealth of evidence now supports our statement that 'the genome is not as pervasively transcribed as previously reported.'"
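
Van Bakel et al.'s 'key point' about read distributions can also be sketched. The idea is that you can ask what share of reads lands in exons, introns, or between genes without first assembling any transcripts. The annotation and read positions below are invented, and real pipelines work from aligned reads and genome-wide annotation files; this is a minimal sketch under those simplifying assumptions, with names of our own choosing.

```python
from collections import Counter

# Hypothetical annotation for one small region: (start, end, category),
# with half-open intervals [start, end). Anything else counts as intergenic.
ANNOTATION = [(100, 200, "exon"), (200, 500, "intron"), (500, 600, "exon")]


def classify(position, annotation=ANNOTATION):
    """Return the category of the interval containing `position`."""
    for start, end, category in annotation:
        if start <= position < end:
            return category
    return "intergenic"


def read_fractions(read_positions, annotation=ANNOTATION):
    """Share of reads per category, with no transcript reconstruction."""
    counts = Counter(classify(p, annotation) for p in read_positions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


if __name__ == "__main__":
    reads = [110, 150, 199, 250, 550, 50, 700, 120]  # made-up read start sites
    print(read_fractions(reads))
```

Note what this does and doesn't settle: the fractions depend entirely on the annotation you start with and on which reads you managed to collect, which is where Clark et al. aim their criticism.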

An epistemological challenge -- what do we know and how do we know it?
What is going on here? Is most of the genome transcribed or isn't it? We are intrigued not so much by the details of the argument but by the epistemology, how these researchers know what they think they know. In the past, of course, a scientist had a hypothesis and set about testing it, and drew conclusions about the hypothesis based on his or her experimental results. The results were then replicated, or not, by other scientists and the hypothesis accepted or not. The theory of gravity allowed many kinds of predictions to be made because of its specificity and universality, for example, as do the theories of chemistry in relation to how atoms interact to form molecules. This is a simplified description of course, but it's more accurate than not.

Molecular genetics these days is by and large not hypothesis driven, but technology driven. There is no theory of what we should or must find in DNA. Indeed, there is hardly a rule that, when we look closely, is not routinely violated. Evolution assembles things in a haphazard, largely chance-driven way. Rather than testing a hypothesis, masses of data are collected in blanket coverage fashion, mined in the hopes that a meaning will somehow arise. That this is how much of genetics is now done is evident in this debate. As first reported by ENCODE, complete genome sequencing seemed to show that a lot of DNA outside protein-coding regions was being transcribed, and so they speculated as to what that could mean. Their speculation wasn't based on anything then known about DNA, or theory, but on results, results produced by then-current technology. That is the reason--and the only reason--that they were surprising.

And van Bakel et al. disagreed, again based on results they were getting and interpreting from their use of the technology. Then Clark et al. disagreed with van Bakel et al.'s use and interpretation of molecular methods, and described their results as 'atypical'. And both 'sides' of this debate attempt to strengthen their claim by stating that many others are confirming their findings. And this is surely not the end of this debate.

We blogged last week about 'technology-driven science', suggesting that often the technology isn't ready for prime time, and thus that many errors that will be hard to ferret out are laced throughout these huge genetic databases that everyone is mining for meaning. When interpretation of the findings is based on nothing more than whose sequencing methods are better, or whether or not a tissue or organism was sequenced enough times to be credible, or which sequencing platforms seem 'best' for a particular usage (by some criterion) -- meaning that the errors of the platform are least disturbing to its objective -- rather than on any basic biological theory or prior knowledge, we're left with the curious problem of having no real way to know who's right. If everyone's using the same methods, it can't be based on whose results are replicated most. If we haven't a clue what is there, even knowing what it means to be 'right' is a challenge!

But we don't even know what a gene is anymore!
These days, even the definition of a gene, our supposedly fundamental inherited functional unit, is in shreds and the suggested definitions in recent years have been almost laughably vague and nondescript. If you think we're kidding, try this one out for size, from a 2008 paper by G. Pesole in the journal Gene:

Gene: A discrete genomic region whose transcription is regulated by one or more promoters and distal regulatory elements and which contains the information for the synthesis of functional proteins or non-coding RNAs, related by the sharing of a portion of genetic information at the level of the ultimate products (proteins or RNAs).

Or how about this one, from 2007 (Gerstein et al., Genome Research):

Gene: A union of genomic sequences encoding a coherent set of potentially overlapping functional products.

Not very helpful!!

These definitions show how much trouble we're in. Ken attended a whole conference on the 'concept of the gene', at the Santa Fe Institute, in 2009, but there was no simple consensus except that, well, there is no consensus and essentially no definition!

The good old days
When microscopes and telescopes were invented, they opened up whole new previously unknown worlds to scientists. If anyone doubted the existence of paramecia they had only to peer through the eyepiece of this strange new instrument to be convinced. When cold fusion was announced by scientists in Utah some years ago, the reaction of disbelief was based on what was known about how atoms work. That is, there were criteria, and theoretical frameworks, for evaluating the evidence. Yes, there has always been a learning curve; when Galileo looked at the moon or planets, various optical aberrations actually were misleading, until they were worked out.

And now....
But those earlier scientists were working in an era when fitting observations to theory was the objective. In biology, we're in new territory here.
