Wednesday, October 9, 2013

The functional genome expands....but nothing "hidden" is revealed

When genomewide mapping is done, such as by GWAS (genomewide association studies; see other posts that we've done on GWAS), many spots across the genome are usually found that contribute to variation in the trait in question (e.g., blood pressure, stature, diabetes risk).  But those sites together account for only a small fraction of the heritability, that is, of the total genetic contribution to the trait that various kinds of data, like shared family risk, suggest is at work.  This 'hidden heritability' has been a big problem because even massive studies find the same incomplete accounting for where in the genome the causative effects are.

One thing that has been found, with steadily increasing reliability and believability, is that the idea of the genome as a chain of protein-coding 'genes' separated by do-nothing random DNA sequence is outdated.  We now know from many studies that a major fraction of the genome is transcribed (copied) into RNA that is not the messenger RNA (mRNA) that specifies amino acid (protein) sequences.  This non-coding RNA (ncRNA -- there are a variety of types, most of whose functions are not known) is clearly being made, but could it just be molecular 'noise', a kind of non-specificity simply due to the fact that all sorts of transcription activity are going on in our cells' nuclei, with some sloppiness to be expected and perhaps too costly for natural selection to have eliminated?  Is this ncRNA just junk that floats around, harmlessly, in the cell the way, say, dust particles float around at home doing no harm or good (except needing to be swept from time to time)?

Stereotype of a typical gene; Genetics and the Logic of Evolution, Weiss and Buchanan, 2006

One way to look at this is to see whether the sites that are transcribed into ncRNA are transcribed because they have the requisite 'promoter' sequences such as those that are necessary for transcription of protein coding DNA sequences.  Classical transcription--that is, of protein-coding genes--involves 'canonical' DNA sequences nearby.  These are called promoters and they have rather widely shared short sequences and arrangements near the starting point in the DNA where the protein code begins.  They're canonical because the molecular machinery that transcribes mRNA is based on proteins that grab hold of the promoter sequence itself, enabling other proteins to bind there and start the transcription process.  The promoter has to be there, or the transcription machinery doesn't recognize the place and can't do its work.

Now a paper recently out in Nature reports a genome-wide search, in human DNA and that of other species, to find canonical promoter sequences by searching the entire genome sequence.  The authors, Bryan Venters and Frank Pugh (an acquaintance here at Penn State), report finding about 160,000 transcription sites, with regular promoter sequences, scattered all across the genome.  These are so clearly recognizable that they strongly suggest that these are, indeed, 'intentional' transcription sites, not just molecular noise.

Analyzing what actually gets transcribed and what it does will be a challenge.  One assumes that such regularity of transcription sites is not just trash from incomplete gene duplication events in the recent past.  But if there is a function, one might expect that not just the promoter sequence but also its downstream DNA code--the sequence that actually gets transcribed--is conserved, that is, similar among species, because conservation is at least a major criterion for inferring that something has a function.  If a sequence has a function, we typically expect natural selection to keep it from varying too much, because otherwise the resulting RNA would be too damaged to do its job.

But this leads to something of a dilemma!  How would we tell if the downstream sequence is conserved?  We would have to align the homologous (shared by descent from a common ancestor) sequences from a variety of species.  We assume the promoter will be similar, but what about what's nearby?  If we can't align the homologous DNA, we can't test it for sequence conservation....but how do we align sequences?  By their similarity, their conservation!  So we have a somewhat circular situation.  And similarity can arise simply from common descent, whether or not any function is involved.  The promoter itself is not long enough to base the alignments on, because there are so many tens of thousands of very similar promoter sequences--as the new paper reported--across the genome that we would have difficulty seeing which one corresponds with which.  So we can't easily even see which downstream sequences are not conserved, and the ones we can align we do so because they're conserved.  There will be statistical ways to address such questions, but we think they'll be challenging.

These findings, of clearly functional promoters all over the genome, may expand what we think of as the genome, since we're usually not very interested in inactive DNA sequence (except to use its variation to reconstruct history).  We already know that genomes are about much more than direct protein coding.  The RNA may relate to protein coding in various ways, but may also have independent functions in the cell.

But it doesn't explain hidden heritability!
The authors finish up this fine paper by saying that their work helps find the 'hidden' heritability.  That sounds nice, and important....except that it's not true!  Why?  Genomewide mapping is based on the use of polymorphic nucleotide sites (SNPs) densely spaced across the genome: some people may have, say, a T in a given position while others have a G.  These sites are tested in mapping studies not directly for having functional effects but just because they're known to vary. The question is whether those with a given trait, like some disease, are more likely to have the T compared to those without the trait.  This is a statistical association unrelated to whether or not the test SNP site itself has any function.

The idea of mapping is to find sites that indicate that some nucleotide in the vicinity of the marker, that is, in the nearby DNA sequence, affects the trait.  That causal site may be in a nearby protein coding gene, but it need not be.  The effect may just be statistical chance, but a case-control disparity that exceeds some chosen statistical cutoff level becomes a candidate for having an effect, and something one might wish to follow up experimentally or in some other way.
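To make the case-control comparison concrete, here is a minimal sketch in Python of the kind of allele-count test just described.  The function name, the made-up allele counts, and the 5e-8 cutoff (a widely used genomewide convention, not something taken from the paper) are our own illustrative assumptions; real mapping software does a great deal more than this.

```python
import math

# Assumption: a commonly used genomewide 'significance' cutoff.
GENOMEWIDE_CUTOFF = 5e-8

def allelic_chi_square(case_t, case_g, control_t, control_g):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table of
    allele counts: rows are cases/controls, columns are the T/G alleles."""
    table = [[case_t, case_g], [control_t, control_g]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_t + control_t, case_g + control_g]
    n = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Upper-tail probability of a 1-df chi-square: erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: the T allele is more common in cases than in controls.
chi2, p = allelic_chi_square(case_t=600, case_g=400, control_t=500, control_g=500)
print(f"chi-square = {chi2:.2f}, p = {p:.2e}")
print("passes genomewide cutoff:", p < GENOMEWIDE_CUTOFF)
```

Note that nothing in the test asks what the T or the G actually does.  It only asks whether the marker's alleles are distributed differently in cases and controls, which is why a hit points to the neighborhood rather than to the marker itself.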

This effect need not be related to a protein coding gene, but simply says that something nearby is responsible.  That could be one of the noncoding RNA sites.  What that means is that these sites, despite their undoubted intrinsic interest, have not been 'hidden' from mapping!  They certainly represent unexplained heritability, in the sense that examination of the DNA near the promoter sequence has not yet told us what they do.

A more likely explanation for actually hidden heritability is that the sites involved are very numerous, and/or their causal variants very rare in the population, so that they may contribute importantly in aggregate but individually not enough to be detected by statistical 'significance' cutoff criteria.  This applies equally to protein-coding or noncoding sites.  We've written about this before.
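A bit of back-of-the-envelope arithmetic shows why numerous small effects slip under the cutoff.  The sketch below is a rough approximation under our own assumptions, not anything from the paper: a quantitative trait, a heritability of 0.5 divided equally among the causal sites, a study of 100,000 people, and the rule of thumb that the expected 1-df test statistic for a site explaining a fraction q of the trait variance is about 1 + n*q.  The function names are ours.

```python
import math

def chi2_cutoff(p, lo=0.0, hi=200.0):
    """Find, by bisection, the 1-df chi-square value whose upper-tail
    probability equals p."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if math.erfc(math.sqrt(mid / 2)) > p:
            lo = mid   # tail still too heavy: cutoff lies higher
        else:
            hi = mid
    return (lo + hi) / 2

def expected_chi2(n_people, variance_explained):
    """Rough expected test statistic for one site (assumption: 1 + n*q,
    reasonable when q is small)."""
    return 1 + n_people * variance_explained

# All of these numbers are illustrative assumptions.
heritability = 0.5
n_people = 100_000
cutoff = chi2_cutoff(5e-8)

for n_sites in (10, 1_000, 10_000):
    q = heritability / n_sites  # share of trait variance per site
    stat = expected_chi2(n_people, q)
    verdict = "detectable" if stat > cutoff else "below cutoff"
    print(f"{n_sites:>6} sites: expected chi-square ~ {stat:.1f} ({verdict})")
```

The total genetic contribution is the same in every row; only the number of sites it is spread across changes.  Spread it thinly enough and no single site clears the bar, even in a very large study.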

In the end, this study points us to countless sites in genomes where something systematic seems to be going on, but we have yet to understand what the something is.  It may lead to an understanding of what has been mysterious before now, but it hasn't been hidden.
