Wednesday, July 20, 2011

Epistemology and genetics: does pervasive transcription happen?

Pervasive transcription
RNA basepairing
It has been known for some time that only 1-5% or so of the mammalian genome actually codes for proteins. It does that by 'transcribing' a copy of a DNA region traditionally called a 'gene' into messenger RNA (mRNA) that is in turn 'translated' into an amino acid sequence (protein, in common parlance).  According to this established theory, known as the Central Dogma of Biology (the problems with which we blogged about here), a gene had transcription start and stop sites, specific sequence elements from which the canonical (regular) structure of a 'gene' could be identified in the DNA sequence (but see below for problems with the definition of a gene).  From that, we got the estimate that our genome contains roughly 25,000 genes.

Not so long ago, the remainder of the genome, the non-coding DNA, was called 'junk DNA' because it wasn't known what function, if any, it had.  Then some of it, generally short regions near genes, was discovered to be sequence elements that, when bound by various proteins in a cell, cause the nearby gene to be transcribed.  So some of that DNA had a function after all.

But then various projects, such as the ENCODE project in 2007, a multi-institutional exploration of the function of all aspects of the genome in great detail, found that the majority of the DNA in the genome is in fact transcribed into RNA, a phenomenon called 'pervasive transcription'.  As the ENCODE project reported,
First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another.  
What all this RNA did wasn't yet known, because it didn't have the structure that would be translated into protein, but that pervasive transcription happened seemed clear.  Some functions were subsequently discovered, such as that of 'microRNA', whose sequences are complementary to mRNA and are used to inhibit the mRNA's translation into protein.  There were also types of RNA that had their own functions within the classical idea of a gene, even if not translated into protein (these included ribosomal RNA and transfer RNA).

Or maybe not...
However, the pervasive transcription idea was challenged by van Bakel et al. in PLoS Biology, who said that most of the low-level transcription described by ENCODE was in fact experimental artifact or just meaningless noise, error without a function.

Oh, but it does!
Now, in a tit for tat, just last week a refutation of this refutation, a paper called "The Reality of Pervasive Transcription," appeared in PLoS Biology (on July 12).  Clark et al. confirm that pervasive transcription does in fact happen, and take issue with the van Bakel et al. results.
...we present an evaluation of the analysis and conclusions of van Bakel et al. compared to those of others and show that (1) the existence of pervasive transcription is supported by multiple independent techniques; (2) re-analysis of the van Bakel et al. tiling arrays shows that their results are atypical compared to those of ENCODE and lack independent validation; and (3) the RNA sequencing dataset used by van Bakel et al. suffered from insufficient sequencing depth and poor transcript assembly, compromising their ability to detect the less abundant transcripts outside of protein-coding genes. We conclude that the totality of the evidence strongly supports pervasive transcription of mammalian genomes, although the biological significance of many novel coding and noncoding transcripts remains to be explored.
Clark et al. question van Bakel et al.'s molecular technique as well as their 'logic and analysis'.
These may be summarized as (1) insufficient sequencing depth and breadth and poor transcript assembly, together with the sampling problems that arise as a consequence of the domination of sequence data by highly expressed transcripts; compounded by (2) the dismissal of transcripts derived from introns; (3) a lack of consideration of non-polyadenylated transcripts; (4) an inability to discriminate antisense transcripts; and (5) the questionable assertion that rarer RNAs are not genuine and/or functional transcripts.
They go into detail in the paper about how and why these are serious problems, and conclude that van Bakel et al.'s results are 'atypical' for tiling array data, that their tissue samples were not sufficiently extensive, and that pervasive transcription is being detected by a variety of experimental methods.  (A tiling array is a microarray whose short probes are designed to cover, or 'tile', a stretch of the genome at regular, often overlapping, intervals; RNA that hybridizes to a probe indicates that the corresponding stretch of the chromosome was transcribed.)
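To see why sequencing depth matters so much to this argument, here is a minimal sketch of the sampling problem, under assumptions of our own rather than anything taken from either paper. It treats each read as an independent draw from the RNA pool, so a transcript that makes up a fraction p of the pool is seen at least once in N reads with probability 1 - (1 - p)^N. The abundances and read depths are hypothetical, chosen only for illustration.

```python
def detection_probability(relative_abundance: float, total_reads: int) -> float:
    """Chance that at least one of total_reads reads comes from a transcript
    making up relative_abundance of the RNA pool.

    Assumes every read is an independent draw, which real library
    preparation and sequencing only approximate.
    """
    return 1.0 - (1.0 - relative_abundance) ** total_reads


# Hypothetical abundances (fraction of all RNA molecules), for illustration only
abundances = {
    "highly expressed mRNA": 1e-3,
    "rare noncoding transcript": 1e-8,
}

for depth in (10**6, 10**8, 10**9):
    for name, fraction in abundances.items():
        p = detection_probability(fraction, depth)
        print(f"{depth:>13,} reads  {name:<26} P(detected) = {p:.3f}")
```

Under these made-up numbers, the abundant mRNA is all but certain to be seen at every depth, while the rare transcript is almost always missed at a million reads and almost always found at a billion. Whether a rare transcript that is found means anything biologically is, of course, a separate question, and the one the two groups are really arguing about.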

No, no, no! 
And finally (to date), van Bakel et al. respond, also in the July 12 PLoS Biology.
Clark et al. criticize several aspects of our study, and specifically challenge our assertion that the degree of pervasive transcription has previously been overstated. We disagree with much of their reasoning and their interpretation of our work. For example, many of our conclusions are based on overall sequence read distributions, while Clark et al. focus on transcript units and seqfrags (sets of overlapping reads). A key point is that one can derive a robust estimate of the relative amounts of different transcript types without having a complete reconstruction of every single transcript.
So, they defend their methods and interpretation and conclude that "a compelling wealth of evidence now supports our statement that 'the genome is not as pervasively transcribed as previously reported.'"
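The 'read distribution' point can be made concrete with a toy example. This is only a sketch of the general idea, not van Bakel et al.'s actual pipeline: the annotation, the coordinates and the read positions below are all invented. Each read is assigned to a category according to where it lands, and the fractions are tallied without ever assembling a transcript.

```python
from collections import Counter

# Hypothetical annotation of one chromosome: (start, end, label), end exclusive
ANNOTATION = [
    (100, 200, "exon"),
    (200, 500, "intron"),
    (500, 600, "exon"),
]


def classify(position: int) -> str:
    """Label a read by the annotated region containing its start position."""
    for start, end, label in ANNOTATION:
        if start <= position < end:
            return label
    return "intergenic"


def read_fractions(read_positions):
    """Fraction of reads in each category; no transcript assembly needed."""
    counts = Counter(classify(p) for p in read_positions)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Invented read start positions, for illustration only
reads = [110, 120, 150, 160, 510, 550, 590, 250, 700, 900]
print(read_fractions(reads))
```

Here seven of the ten reads fall in exons, one in the intron and two in unannotated sequence. Note that the answer depends entirely on the annotation and on how reads were sampled in the first place, which is exactly where the two groups part company.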

An epistemological challenge -- what do we know and how do we know it?
What is going on here?  Is most of the genome transcribed or isn't it?  We are intrigued not so much by the details of the argument as by the epistemology, how these researchers know what they think they know.  In the past, of course, a scientist had a hypothesis and set about testing it, and drew conclusions about the hypothesis based on his or her experimental results.  The results were then replicated, or not, by other scientists and the hypothesis accepted or not.  The theory of gravity allowed many kinds of predictions to be made because of its specificity and universality, for example, as do the theories of chemistry in relation to how atoms interact to form molecules.  This is a simplified description of course, but it's more accurate than not.

Molecular genetics these days is by and large not hypothesis driven, but technology driven.  There is no theory of what we should or must find in DNA.  Indeed, there is hardly a rule that, when we look closely, is not routinely violated.  Evolution assembles things in a haphazard, largely chance-driven way.  Rather than testing a hypothesis, researchers collect masses of data in blanket-coverage fashion and mine them in the hope that a meaning will somehow arise.  That this is how much of genetics is now done is evident in this debate.  As first reported by ENCODE, complete genome sequencing seemed to show that a lot of the DNA being transcribed lay outside protein-coding regions, and so they speculated as to what that could mean.  Their speculation wasn't based on anything then known about DNA, or on theory, but on results, results produced by then-current technology. That is the reason--and the only reason--that they were surprising.

And van Bakel et al. disagreed, again based on results they were getting and interpreting from their use of the technology.  Then Clark et al. disagreed with van Bakel et al.'s use and interpretation of molecular methods, and described their results as 'atypical'.  And both 'sides' of this debate attempt to strengthen their claim by stating that many others are confirming their findings.  And this is surely not the end of this debate.

We blogged last week about 'technology-driven science', suggesting that often the technology isn't ready for prime time, and thus that many errors that will be hard to ferret out are laced throughout these huge genetic databases that everyone is mining for meaning.  When interpretation of the findings is based on nothing more than whose sequencing methods are better, or whether or not a tissue or organism was sequenced enough times to be credible, or which sequencing platforms seem 'best' for a particular usage (by some criterion) -- meaning that the errors of the platform are least disturbing to its objective -- rather than on any basic biological theory or prior knowledge, we're left with the curious problem of having no real way to know who's right.  If everyone's using the same methods, it can't be based on whose results are replicated most. If we haven't a clue what is there, even knowing what it means to be 'right' is a challenge!

But we don't even know what a gene is anymore!
These days, even the definition of a gene, our supposedly fundamental inherited functional unit, is in shreds, and the definitions suggested in recent years have been almost laughably vague and nondescript.  If you think we're kidding, try this one out for size, from a 2008 paper by G. Pesole in the journal Gene:
Gene: A discrete genomic region whose transcription is regulated by one or more promoters and distal regulatory elements and which contains the information for the synthesis of functional proteins or non-coding RNAs, related by the sharing of a portion of genetic information at the level of the ultimate products (proteins or RNAs).
Or how about this one, from 2007 (Gerstein et al., Genome Research):
Gene: A union of genomic sequences encoding a coherent set of potentially overlapping functional products.
Not very helpful!!

These definitions show how much trouble we're in.  Ken attended a whole conference on the 'concept of the gene', at the Santa Fe Institute, in 2009, but there was no simple consensus except that, well, there is no consensus and essentially no definition!

The good old days
When microscopes and telescopes were invented, they opened up whole new, previously unknown worlds to scientists.  If anyone doubted the existence of paramecia, they had only to peer through the eyepiece of this strange new instrument to be convinced.  When cold fusion was announced by scientists in Utah some years ago, the reaction of disbelief was based on what was known about how atoms work.  That is, there were criteria, and theoretical frameworks, for evaluating the evidence.  Yes, there has always been a learning curve; when Galileo looked at the moon or planets, various optical aberrations actually were misleading, until they were worked out.

And now....
But those earlier scientists were building in an era when fitting theory was the objective.  In biology, we're in new territory here.


Anonymous said...

Thought-provoking article. I am a graduate student in molecular biology and am about to do another "technology driven" high-throughput experiment called ChIP-Seq. My concern with this approach is that it is very difficult to find decisive mechanisms this way. One becomes lost in a mountain of data, with no clear cues as to where the really interesting stuff is hidden. We need more intelligent computers so they can reveal to us hidden patterns or potentially interesting peaks in the data, in the hope that these can be verified through further experiment. Our brains are not built to understand this technology-driven data at a global level. Are we smart enough to build software capable of understanding our questions and finding the hidden gems in the genome?

Ken Weiss said...

Yes, well everybody's doing what you're doing. At least if you keep clear about what the issues are, you have a chance of thinking outside of the BigGear box. It may not be that more computers will do the job. Uncertain data can be smoothed into apparent certainty by computational and statistical techniques (tricks?).

Perhaps we need more intelligent thought about what the 'really interesting stuff' is or might be. I think too often we assume what it is and force results into the mold.

New technologies have clearly (well, at least certainly!) opened new sets and even kinds of facts. But whether we are over-interpreting them or giving them too much importance, or thinking they are conceptually more transformative than they actually are, is to me much less clear.