Monday, August 6, 2012

Ooops! The human genome does not exist! Part IV. What do we want to represent by 'a' genome sequence?

In the first post in this series, we said that the current Human Genome sequence is a kind of Platonic stereotype of something that does not actually exist.  This is true at least in the sense of its incompleteness, but considering it raises several issues, some of them prompted by new things that are afoot in the human genomics community.

Any human varies at about 1 nucleotide per thousand of the simplest parts of 'the' genome, and much more than that for large subsets of it (e.g., 'repeat' regions, copy number variants, and others).  And we're diploid.  So listing only a single nucleotide per location is clearly misleading about any human person.  If the genome reference sequence is a composite of several donors, then of course nobody has the sequence.  How many people are represented has been shrouded in secrecy, to avoid misuse of the data, racist interpretations, and so on.  Even if it's only one person, a haploid version may not be accurate for a technical reason: in parts that vary, all variants on the same copy should be listed together, whereas my understanding is that in the past, when the individual was heterozygous (had a different nucleotide on each of his/her copies), one of them was arbitrarily chosen.  If we keep in mind that this is a reference sequence, a Platonic representation rather than a real sequence, this may be OK, but it would not be any actual sequence, even that of the donor/s.
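To make the heterozygote point concrete, here is a minimal sketch in Python.  The two short 'copies' are invented for illustration (they are not real sequence), and the function names are hypothetical.  It simply enumerates every haploid string you could get by picking one nucleotide per site, and counts how many of those match neither of the donor's actual copies.

```python
from itertools import product

# Hypothetical phased copies from one donor (toy data, not real sequence).
copy_a = "ACGTTAGC"
copy_b = "ACCTTGGT"


def het_sites(hap1, hap2):
    """Return the 0-based positions where the two copies differ."""
    return [i for i, (a, b) in enumerate(zip(hap1, hap2)) if a != b]


def all_collapses(hap1, hap2):
    """Every haploid string obtainable by choosing one allele per site."""
    choices = [(a,) if a == b else (a, b) for a, b in zip(hap1, hap2)]
    return ["".join(p) for p in product(*choices)]


collapsed = all_collapses(copy_a, copy_b)
# Mosaics: collapsed sequences that match neither real copy.
mosaics = [s for s in collapsed if s not in (copy_a, copy_b)]

print(f"heterozygous sites: {het_sites(copy_a, copy_b)}")
print(f"{len(mosaics)} of {len(collapsed)} possible haploid collapses "
      "exist on neither of the donor's copies")
```

With k heterozygous sites there are 2^k possible collapses and only 2 of them are real, so an arbitrary choice at each site almost always yields a sequence the donor never carried on either copy.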

Completely real salad.
What if, whoever or however many people were donors, sequence elements come from different parts of the world--different continental ancestry?  It is clearly a strange kind of biology to assemble bits and pieces from different places in that way. Even if from one person with mixed geographic ancestry, the person is real but the genome sequence would be difficult to interpret beyond identifying continents of origin (where that could even be done accurately).  That would be like taking a salad to represent vegetables, even if the salad is completely real.

Now, suppose it turns out that after all the DNA in the reference sequence was in fact entirely from one person, and not from 8 or so as many statements in the past have had it.  And suppose that the difficult-to-sequence regions were not in any way filled in with known elements from other individuals.  Then, we can post the diploid sequence of that individual (and that is in the plans) for the same DNA that has currently been used as 'the' human reference sequence.  Once we have that diploid sequence, then except that it will be inevitably incomplete for technical sequencing reasons, it will at least be a sequence that actually exists!
Stuffed robin

That would seem like progress, but it really only gets us closer to the stuffed robin type specimen in a museum near you.  No one person represents our species except if we understand, clearly and consistently, that it is not 'the' species, or 'normal', etc. in any global sense.  Or, not in a way other than as an arbitrary reference, the way a map of  Pennsylvania represents where I-80 passes by State College.  If so, by what criterion and who decides?

In that sense, the diploid sequence isn't literally Platonic, but that is really almost a trivial technical point, since nobody else has the same sequence.  These may be philosophical questions, but at least 'the' reference person is a reference!  Or is it?

In fact this is not so clear.  The instances of the genome in every cell in every individual are different.  The fertilized egg by which your life began had 2 instances of a human genome.  But each subsequent cell, of which you comprise many billions if not trillions, contains new mutations that arose when it was produced by a 'parent' cell dividing.  Mutations arising in the very first cell division will be shared by half your body, while the inherited instance will be in the other half.  But each subsequent time you make new cells even these split-genome personalities acquire new variation, which is faithfully inherited thereafter within your body (unless altered by subsequent mutation).
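The 'half your body' arithmetic can be sketched with a toy lineage.  This is only an illustration under strong simplifying assumptions that are mine, not biology's: every cell divides in step, no cell dies, and a single mutation (labelled with a made-up name) appears in one daughter at one chosen division.

```python
def grow(n_divisions, mutate_at):
    """Grow a toy cell lineage from one fertilized egg.

    Each cell is the set of somatic mutations it carries.  At division
    number `mutate_at` (1 = the first division) one daughter of the first
    cell gains a new mutation; every other daughter copies its parent.
    """
    cells = [frozenset()]
    for division in range(1, n_divisions + 1):
        next_cells = []
        for index, cell in enumerate(cells):
            next_cells.append(cell)  # first daughter: faithful copy
            if division == mutate_at and index == 0:
                next_cells.append(cell | {f"m{division}"})  # mutated daughter
            else:
                next_cells.append(cell)  # second daughter: faithful copy
        cells = next_cells
    return cells


def fraction_carrying(cells, mutation):
    """Fraction of cells in the body that carry the given mutation."""
    return sum(mutation in cell for cell in cells) / len(cells)


for d in (1, 2, 3):
    body = grow(n_divisions=10, mutate_at=d)
    print(f"mutation at division {d}: carried by "
          f"{fraction_carrying(body, f'm{d}'):.3f} of {len(body)} cells")
```

In this toy model a mutation arising at division d ends up in 1/2^d of the cells, so which cells happen to be sampled decides which 'version' of the person gets sequenced.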

A reference sequence is obtained, basically, by taking a blood or other tissue sample, extracting the DNA, and cloning it, then sequencing the clone.  So 'your' sequence, and 'the' sequence is really only a composite of bits of DNA from one, or each bit from a different, cell.  We're back to square one: 'the' HG sequence, even a diploid one, really is essentially a Platonic ideal that doesn't exist and never did (or even if it did, will some day not, when you or the donor of 'the' HG sequence passes away).

It is very difficult to escape these problems.

Now, we suggested earlier the use of a 'gene logo' approach, that would represent variation in each position along the genome, as seen so far in the available sequences.  There are problems involved in rearranged parts of the genome, which we all have some of.  But forgetting those, one could make a frequency-based representation that would at least represent humans as a population, which is how we evolved.  We just use the available sequences and for each nucleotide position represent the observed variation.  Great!   Or.....not so.
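A frequency-based representation of this kind is easy to sketch.  The example below is a minimal illustration, assuming diploid genotype calls are already in hand; the individuals, positions, and genotypes are all invented, and the function name is hypothetical.  It ignores rearrangements, as the paragraph above does.

```python
from collections import Counter

# Hypothetical diploid genotype calls: individual -> {position: two alleles}.
genotypes = {
    "ind1": {101: "AA", 102: "CT", 103: "GG"},
    "ind2": {101: "AT", 102: "CC", 103: "GG"},
    "ind3": {101: "TT", 102: "CT", 103: "GG"},
}


def site_frequencies(genotypes):
    """Pool all sampled chromosomes and return per-site allele frequencies."""
    counts = {}
    for calls in genotypes.values():
        for position, alleles in calls.items():
            # Each character is one allele, so a diploid call adds two counts.
            counts.setdefault(position, Counter()).update(alleles)
    return {
        position: {nt: n / sum(c.values()) for nt, n in sorted(c.items())}
        for position, c in sorted(counts.items())
    }


freqs = site_frequencies(genotypes)
for position, site in freqs.items():
    summary = "  ".join(f"{nt}={f:.2f}" for nt, f in site.items())
    note = "" if len(site) > 1 else "  (no variation seen in this sample)"
    print(f"{position}: {summary}{note}")
```

Each line of output is one 'column' of the logo: the proportions into which that site's bar would be split.  Note that the numbers describe only the individuals who happened to be sampled.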

The available samples of human genome sequences are not from a random sample of our species--whatever that would be.  They are from a few populations, selected for various usually arbitrary, political, or convenience reasons.  I know from direct experience that deep arguments were held to decide how to sample human genetic variation.  Haste and expedience (and a bit of cultural elitism and politics) ruled the day.  So, while we do have samples from various populations, they are not any sort of random sampling of our species.

Still, that's far better than nothing and, indeed, a major US genome browser (UCSC) that displays genome information (and from which you can download the data) does have an optional 'track' that shows variation in a way similar to a gene logo, for some available population samples.  Here is the 'genome variants' track from 'the' human genome, in a region of the beta-globin gene, showing essentially the same kind of frequency portrayal of each site that varies in this region, in the available samples:


This is itself currently only tentative, and is just for some selected individuals rather than populations or over all known sequences, but it begins to make clear which sites vary along the genome at least in this very limited data (each vertical bar is split into proportions reflecting the variant nucleotides' frequencies--here, at most into two halves, since these are individuals who have just 1 or 2 alleles, AA, AT, or TT, say).  The legend above each track identifies the individual portrayed.  Note something, however: that the same sites often vary in multiple individuals.   One can dig around in the genome browser to identify details and get known population frequencies, but the data are not yet in a nice single logo-like visual track.

Such a track, for the '1000 Genomes' project, is under development here at Penn State as part of the various collaborative genome consortia, but not yet publicly available.  A separate frequency bar will be available for each population in that sample in the way individual data are in the above figure.  I understand that it, or some other similar visualization tracks will soon be available. There already are databases that portray allelic variation as little site-specific pie charts, too. 

This is progress, but these are ad hoc examples, and even if put into a single logo-like genome track pooling all available samples, they will not be a completely good representation of human genetic variation, because the samples are a rather unsystematic representation.  That's in part just a temporary problem because new genome sequence data will rapidly accumulate, not just in population samples but also from countless unrelated individuals.  So perhaps the day is coming when we really could have a single logo-like presentation to give a more proper gestalt of 'the' human genome--if it is used as the reference representation.  It may be a long time before we have a 'representative' set of people, whatever that means, even if it will certainly be a conceptual improvement, reminding everyone, not just specialists, of the population relevance of genomic variation.

But even this will still be stereotypical (even if every living human were included!) for the subtle reason that we discussed above, that each individual represented is only being sequenced in some way that essentially gets information from a single cell in that individual, even though the sequence in each cell differs.  No individual cell is much better than a Platonic ideal even of that one person's genomic variation! 

No sequenced individuals will be random or in obvious ways representative of the human species.  Indeed, how should or could we get such sample(s)?  What is the human species?  Thousands are born or die every hour, so the collection of human genomes constantly changes.  Representing genetic variation in a species is akin to representing a river with a snapshot of the water and waves (and fish) that happen to be there at the time.

Even biomedically, variation that's relevant changes all the time.  Depending on whether you are involved in public health (whole population level epidemiology), or a local isolated group, or a family, you might represent genetic variation differently.

In the end, we are perhaps stuck with Plato:  as limited human beings, we cannot grasp everything in our heads, and representations and reference guidelines are immensely useful.  Perhaps only at the edges of variation, or with exceptional cases, does the use of type specimens lead to problems, if we know what we're doing when we choose how to represent the variation.

There have been centuries of debate about the nature of Platonic ideas (even whether they somehow exist).  This debate occurs today even in mathematics and cosmology: do the rules and findings of mathematics somehow exist in a universal Platonic sense, or is that just how we can describe our own particular universe?  Is it anti-scientific mysticism to think of abstractions (like cosines, algebraic solutions, stuffed robins, or genome sequences) as having some sort of non-physical reality?

These are interesting and I would say profoundly important points.  Even if we can't understand the 'truth' of such ideals, they prove to be immensely useful and important.  For those familiar with mathematics, perhaps an analogy would be i, the 'imaginary' number equal to the non-existent square root of -1.  It doesn't exist, in the real world in the usual sense, but is fundamental to physics and how we understand the universe and much besides that.

Representations are fundamental to science.  The danger is when we don't understand them clearly and they become misrepresentations.  We'll have a few more finishing thoughts about this subject tomorrow.
