Friday, August 3, 2012

Ooops! The human genome does not exist! Part III: a non-Platonic type specimen?

In the last couple of posts (here and here) we have been discussing the idea of 'type specimens', as that idea relates to genome sequences representative, in some way, of a species.  Stuffed robins and dodos seem to raise few scientific questions about the species they represent as type specimens, even though we know that each species varies.

Of course, we're engaging in a bit of tongue-in-cheek poetic license here, because there are lots of ways to view the notion of a type specimen (see, e.g., Wikipedia on this), and every biologist knows that species vary.  The point here is that a type specimen is one or a few examples chosen in some way to represent a species, assuming adequate evidence that it does represent a proper species (such discussions are off-topic here).  People in genetics should be clear about what a reference sequence, the analog of a morphological type specimen, is.  Not all of them are.

In the genomic age we now have whole genome sequences from one or more representative individuals, as DNA sequence type specimens of 'the' genome for that species.  Of prime interest of course is 'the' human genome sequence (HG), and we've pointed out that this must be viewed not as a real sequence carried by any individual, but as a sort of Platonic ideal: a reference or road map to the sequences we carry around, none of which is identical to it--regardless of whether it does or ever did exist in any individual.  The latter is the central point.


The reference sequence keeps changing as errors are found or problematic areas are re-sequenced to make the sequence clearer or more certain, yet it seems not to lose its status as 'the' reference. And now we have many sequences from different species--soon to be thousands for humans alone.  Since each differs, many in major ways but all in millions of minor ways, from any edition of the HG reference, we wonder if a single reference (or reference set) is the best way to characterize 'our' genome.  Since evolution works via populations, why not use a population-based way to represent our genome?

We suggested yesterday the idea of using a visual 'gene logo' to represent the variation found in existing sequences on file, augmenting that reference as more data become available. We show the example again here.  This is a 'population type' rather than a type specimen.  It would be 'the' reference, replacing a single sequence rather than supplementing it.  It shows the most common nucleotides as well as the variation, in every position along the genome.
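As a rough illustration of the data behind such a logo, here is a minimal Python sketch that tabulates nucleotide frequencies at each position. It assumes the sequences are already aligned to equal length; the function name `position_profile` and the toy sequences are made up for illustration and are not real genomic data. A real logo would be drawn from a table like this one, and real data would also need to handle insertions, deletions and rearrangements.

```python
from collections import Counter

def position_profile(aligned_seqs):
    """For each alignment column, return a dict of nucleotide -> frequency,
    ordered from most to least common."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for column in zip(*aligned_seqs):  # one tuple of bases per position
        counts = Counter(column)
        total = sum(counts.values())
        profile.append({base: n / total for base, n in counts.most_common()})
    return profile

# Toy, invented sequences (hypothetical example data)
toy = ["ACGT", "ACGA", "ATGT", "ACGT"]
for pos, freqs in enumerate(position_profile(toy), start=1):
    print(pos, freqs)
```

Each row of the output is one position: the first entry is the most common nucleotide, and the remaining entries are the variation seen at that position in the sample.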

This is not a Platonic ideal, because it does not represent a single anchor sequence, but instead portrays genomic variation as it exists in available data.  Since genomes got here by evolution, and evolution is about variation, this is a biologically more natural way to represent genomes than a single reference.  There are problems: for example, each instance of the HG has bits and pieces added, missing or rearranged, so not all the variation is in nucleotides at specific positions along the genome.  But ways to handle this could be found (there are, for example, graphical ways to show where segments of the genome have moved to and from).  And of course the underlying data are downloadable.  Still, could there be a better way?

An evolutionary Platonism?
Plato's idea of things like dogs or chairs or virtues is that there is an actual 'ideal' that really exists somehow but of which all actual ones we see are but imperfect instances.  Each new chair is an instance of 'chair', and one might liken this to being stamped out, never perfectly, by a chair-making machine.  Or perhaps every chair is a kind of real sibling produced by the same (ideal) parent.  In Plato's terms the parent may not even have physical existence.  That might give us an idea of how to handle biological types in a better, more physically real way.  The way is an evolutionary one.

[Image: Doll chair, The Children's Museum of Indianapolis]

The only reason we can even use a representative, reference HG sequence is that all humans are similar and the only reason that's true is that we evolved by descent with modification, Darwin's phrase, from common ancestry.  So could we represent a gene not by an arbitrary reference sequence of some person who happens to be in our lab, but in terms of ancestry?

If we look at large numbers of instances of a gene, we can make reasonable guesses as to what the common ancestral sequence was like.  If most people have an A in a given nucleotide position in the gene (even if, say, some have a T or G there), then A was likely the ancestral state of that position.   If we find that chimps also mainly have an A there, we can be even more confident.  Working site by site in this way, we can make an educated guess as to the ancestral sequence.  In fact, we do this all the time, and have routine, reasonably reliable software to do it on a large scale.
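The site-by-site reasoning above can be sketched in a few lines of Python. This is only a toy majority-rule version under stated assumptions: the sequences are already aligned, the function name `guess_ancestral` and the example sequences are invented, and a single chimp sequence stands in for 'chimps mainly have an A there'. The routine software mentioned above uses proper phylogenetic methods (such as parsimony or maximum likelihood on a tree), not this shortcut.

```python
from collections import Counter

def guess_ancestral(human_seqs, outgroup_seq=None):
    """Guess the ancestral base at each site as the most common human base.

    Returns a list of (base, supported_by_outgroup) pairs, where the flag
    is True when the outgroup (e.g. chimp) carries the same base there.
    """
    if not human_seqs:
        return []
    length = len(human_seqs[0])
    if any(len(s) != length for s in human_seqs):
        raise ValueError("sequences must be aligned to the same length")
    if outgroup_seq is not None and len(outgroup_seq) != length:
        raise ValueError("outgroup must be aligned to the same length")
    result = []
    for i, column in enumerate(zip(*human_seqs)):
        base, _count = Counter(column).most_common(1)[0]
        supported = outgroup_seq is not None and outgroup_seq[i] == base
        result.append((base, supported))
    return result

# Toy, invented data (hypothetical): three 'human' copies and one 'chimp'
humans = ["ACGT", "ACGA", "TCGT"]
chimp = "ACGA"
guess = guess_ancestral(humans, chimp)
print("".join(base for base, _ in guess))
print([supported for _, supported in guess])
```

Sites where the outgroup disagrees with the human majority are the ones where the guess deserves less confidence, which is the point of bringing the chimp into the comparison.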

So why not use the ancestral sequence--which actually did exist, unlike some of the composite and hence artificial sequences we use today--as our reference?  Then, while nobody has the same exact sequence today, we can landmark what we do carry in reference to that 'ideal', a kind of real Platonism.

In many ways we do this in daily lab work and drawing evolutionary phylogenies.  We routinely use the idea of the common ancestral sequence, called the coalescent because all today's variation seems to 'coalesce' to the single ancestor as we look backward in time.  So the coalescent sequence of a gene could be our type specimen for that gene.

That seems like a sound way to 'Platonize' our genomic reference map without the arbitrary use of a randomly picked sequence today, or a sequence that has no material existence, since the coalescent is 'the' ancestor rather than just 'an' instance of the sequence.  But while this is evolutionarily sensible, there are problems.

First, the ancestral gene may have been so different from today's that even if we reconstruct it accurately, using it as our reference would not work well from a functional, or perhaps even a structural point of view.

Second, while we can do this for a gene (even including inserted or deleted elements in the gene), what we need is the contextual arrangement of a whole genome sequence.  That means the sequence surrounding the genes (and that is about 95% of genomes!).  But when that sequence is non-functional, or even when it has gene-regulating sequence elements, these can move around and evolve so fast that we often cannot easily guess their sequence.  Genes coding for proteins are so highly conserved in sequence (because, evolutionarily, mutations were harmful and removed) that we can find and align them from other species.  But this doesn't work as well for surrounding sequence.

Third, the lack of such conservation in non-functional sequence might not matter so much if it weren't that the common ancestor of many genes existed long before there were humans and in some species often very distant from us--sometimes, even, an early vertebrate or even before that.  So the surrounding sequence can't even be guessed at.  And the ancestral gene sequence may have little interpretable relevance for human biology (or phylogeny).

Fourth, sequences are rearranged over evolutionary time, and the flanking bits aren't the same, so we can't really put together a very meaningful whole genome sequence.  This has been attempted for mammals or vertebrates, but the result is very speculative and approximate, never existed in an individual, and is essentially worthless as a reference to guide much of current work on genes and their functions.

Fifth, each gene's evolutionary-Platonic ancestral copy existed in a different individual, the ancestors for different genes living far apart and at very different times in history.  The animal (not even a person!) with 'the' original beta-globin gene would have been countless centuries and kilometers removed from whoever had the ancestral blue-light receptor gene, and so on.  Unlike Platonic chairs, there may have been single ancestral genes, but not a single ancestral genome.

So, while evolution is a better framework for discussing ideals than Plato's, because ancestors did actually exist, using ancestry to produce reference sequences--genome 'type specimens'--doesn't seem to lead to an obvious strategy.  This is strange: a random instance of a genome sequence works as a reasonable reference because we're all so closely related, yet that same evolutionary relatedness doesn't seem to give us a workable ancestral reference.

Our own view is that we should try some kind of variation-including method, such as the 'logo' approach described above and discussed in Part II of this series, to represent our genomic 'type'.  It's just like using a single stuffed robin, or a single Neanderthal skeleton: imperfect, easily misleading, not actually 'representative' as a representative... but, taking variation into account, perhaps the best we can do.

A few thoughts and updates about these topics will conclude this series, in Parts IV and V next week.
