Thursday, August 2, 2012

Ooops! The human genome does not exist! Part II. What kind of reference is a reference?

We posted yesterday about the nature of type specimens, one individual (or sometimes a few more) chosen by scientists to represent an entire species.  How that choosing was done, how the type specimens were sampled, or what they actually represent is variable but always a bit (or more than a bit) arbitrary.

When the age of genomics arrived, and we had large-scale sequencing, the Human Genome Project gave us something similar, 'the' Human Genome sequence (HG).  This, like previous type specimens--the stuffed carcases shipped back home by globe-trotting naturalists largely in the past 2 or so centuries--is an arbitrary instance of something we believe exists elsewhere in profusion.  But while the stuffed robin or Loch Ness Monster on display was an actual robin--or a grainy picture of an actual LNM--even this degree of specificity doesn't apply to the HG:  it has generally been said that it was not derived from a single individual, but 8 or so, whose identity or identities or ethnic origin etc.  are confidential.  In addition, it is under revision as errors are found and so on.

That means if you do a study today, for example, looking at some human genome region in a sample of people, you'll have to identify that region in terms of hg19, the current edition of the sequence. But if you want to compare your work to past work, or if those following up on your work want to compare theirs to yours, someone will have to go back (or forward) to hg19 coordinates.  Yesterday we noted that the starting position for the human beta-globin gene has shifted about 43,000 nucleotides between the current and previous versions of the genome.  This is not just due to error correction, but would happen if you compared any two instances of a human genome.

Thus not only do the sequence details of any given gene vary, but so does its location. The same is true for any other functional or even nonfunctional aspect of the genome.  The beta-globin or any other gene no more exists in reality than does 'the' HG itself.  What about other details? Just as every robin differs (we don't know about Nessies), so every instance of the HG differs.

Albino Robin, photo from
How many chromosomes do we have?  22 (two copies of each) plus X and Y?  Well, not everybody has that number.  People with Down syndrome have 3 copies of chromosome 21.  Well, you may say that's abnormal, but are they not human?  In fact, every instance of a human genome varies in the number and location of substantial chunks of DNA (called 'copy number variation'), and there are shorter bits that vary in their number of copies in millions of different places across the genome.  There are tens or hundreds of defunct genes, different sets of them in each 'normal' person.  Further, of course, what is 'normal' is a largely arbitrary notion, based on what is average (the 'norm') or some range that we think of as 'normal' in some vague and subjective way.

We've discussed when or how we decide that your behavior is not 'normal' (e.g. here), and how disease is sometimes an arbitrary classification that we choose to make.  Are people 'abnormally' tall or short not human?  Is that a 'disease'? 

Somehow we need to decide how to categorize species--their traits and their DNA.  What would be a better way to deal with a Platonic ideal, a single kind of reference genomic sequence--an ideal  that seems to make sense, while at the same time there is no actual instance of that ideal?

Reference sets?
Rather than a continually revised type sequence, one possibility is to use some set of DNA sequences to determine what 'the' human genome is like.  But then what would be included in that set, and how many different sequences?  If we want a road map, strictly as a reference for comparing sequences, how would a set be different from a single type specimen?

If we had a set of sequences, they would somehow capture both the typical elements and their variability, and these are vital to understanding both gene function and evolution.  One way might be to use 'gene logos' (more commonly called sequence logos), such as in this figure:

Here, each position along a sequence is represented by a color-coded letter whose size is proportional to the fraction of the sample that contains an A, C, G, or T. So in the first position, every sequence had a C, but in position 7, G predominates.  The heights of the letters at each position would sum to 1 (meaning 100%, unlike in this particular image).
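To make the idea concrete, here is a minimal sketch in Python of the tally that sits behind such a logo. The function name position_frequencies and the four short sequences are made up for illustration (they are not real human data), and the sketch assumes the sequences are already aligned and of equal length.

```python
from collections import Counter

BASES = "ACGT"

def position_frequencies(sequences):
    """For each aligned position, return the fraction of sequences carrying each base."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    table = []
    # zip(*sequences) walks the alignment column by column
    for column in zip(*sequences):
        counts = Counter(column)
        table.append({base: counts[base] / len(sequences) for base in BASES})
    return table

# Hypothetical sample: four invented, pre-aligned sequences
sample = ["CATGAAG", "CATGAAG", "CTTGAAG", "CATGATA"]
freqs = position_frequencies(sample)

for position, column in enumerate(freqs, start=1):
    print(position, column)
```

In this invented sample every sequence has a C at position 1, while at position 7 three of the four have a G, so G would be drawn three times as tall as A there. A drawing program would then scale each letter by these fractions.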

This would provide a quantitative 'stereotype' of the human genome sequence that is not a rigid Platonic ideal, but would still be useful.  A given sequence, like the beta-globin gene, varies but a computer search could easily find it in any whole genome sequence (a search and compare method called BLAST does just that).
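As a rough illustration of why variation need not stop a computer from finding a gene, here is a toy search in Python. It is not BLAST, which uses seeded, heuristic local alignment and copes with insertions and deletions; this sketch only slides the query along the sequence and counts mismatches. The function name best_match and both sequences are invented for the example.

```python
def best_match(genome, query):
    """Return (start, mismatches) for the window of genome that differs least from query."""
    if len(query) > len(genome):
        raise ValueError("query is longer than the sequence being searched")
    best_start, best_mismatches = 0, len(query) + 1
    for start in range(len(genome) - len(query) + 1):
        window = genome[start:start + len(query)]
        # Count positions where the window and the query disagree
        mismatches = sum(a != b for a, b in zip(window, query))
        if mismatches < best_mismatches:
            best_start, best_mismatches = start, mismatches
    return best_start, best_mismatches

# Hypothetical data: a short invented 'genome' and a query that differs
# from the embedded copy by a single nucleotide
genome = "GGTTCA" + "ACATTTGCTTCTGACA" + "CTGTGA"
query = "ACATTTGCATCTGACA"

start, mismatches = best_match(genome, query)
print(start, mismatches)
```

Even though the query is not an exact copy of anything in the invented sequence, the search still lands on the right stretch and reports how far it differs. That is the sense in which a variable gene can still be located in any given person's genome.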

We could also include missing nucleotides by adding a proportionally sized N (for None) at the relevant location.  Major copy variation or rearrangements would be harder to include, but probably ways could be found.

With computer technology, even a digest of all available sequences in a species could be stored in this way.  Since individuals will always die and be born, at any given time this would be a 'type species' rather than type or reference specimen, but that would perhaps be a much better way to think of things.

Then, it would not be necessary to use the concept of 'the human genome sequence' as a somewhat misleading guide--where we use something that doesn't exist to try to understand what does exist.  In fact, something like this approach is already available, though not standard, and will be discussed in Part IV of this series.

How we would do the same with robins (or Loch Ness monsters) is less clear.


Joachim Dagg said...

The idea behind type specimens is probably no longer that they truly represent the species, but that two specimens from two different species could more or less illustrate the gap that should exist between species (comparative approach).

Ken Weiss said...

Yes, though careful examination has always shown problems with the very concept of species, and these are most problematic at the 'edges' of variation.

Polyphenism, very different morphologies in a given species due to environmental differences, presents one problem.

Separating 'varieties' (as Darwin called what we may call 'subspecies') from 'real' species is not always easy, and depends on the definition of species.

Of course, nobody could confuse a human genome sequence even with a chimp's, our very close relative, so your comment certainly fits with how we properly use a 'reference' sequence.

Hopefully, we've properly discussed other nuances and thoughts in the later parts of this series.