That means if you do a study today, for example, looking at some human genome region in a sample of people, you'll have to identify that region in terms of hg19, the current edition of the sequence. But if you want to compare your work to past work, or if those following up on your work want to compare to yours, they'll have to go back (or forward) to hg19. Yesterday we noted that the starting position for the human beta-globin gene has shifted about 43,000 nucleotides between the current and previous versions of the genome. This is not just due to error correction, but would happen if you compared any two instances of a human genome.
Thus not only do the sequence details of any given gene vary, but so does its location. The same is true for any other functional or even nonfunctional aspect of the genome. The beta-globin or any other gene no more exists in reality than does 'the' HG itself. What about other details? Just as every robin differs (we don't know about Nessies), so every instance of the HG differs.
Albino Robin, photo from www.seed-solutions.com
We've discussed when or how we decide that your behavior is not 'normal' (e.g. here), and how disease is sometimes an arbitrary classification that we choose to make. Are people 'abnormally' tall or short not human? Is that a 'disease'?
Somehow we need to decide how to categorize species--their traits and their DNA. What would be a better way to deal with a Platonic ideal, a single kind of reference genomic sequence--an ideal that seems to make sense, while at the same time there is no actual instance of that ideal?
Rather than a continually revised type sequence, one possibility is to use some set of DNA sequences to determine what 'the' human genome is like. But then what would be included in that set, and how many different sequences? If we want a road map, strictly as a reference for comparing sequences, how would a set be different from a single type specimen?
If we had a set of sequences, they would somehow capture both the typical elements and their variability, and these are vital to understanding both gene function and evolution. One way might be to use 'gene logos', such as in this figure:
Here, each position along a sequence is represented by a color-coded letter whose size is proportional to the fraction of the sample that contains an A, C, G, or T. So in the first position, every sequence had a C, but in position 7, G predominates. The heights of the letters at each position sum to 1 (meaning 100%, unlike in this particular image).
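To make the idea concrete, here is a minimal sketch of how the per-position fractions behind such a logo could be tallied. The function name `position_frequencies` and the four toy sequences are made up for illustration (they are not real beta-globin data), and the sketch assumes the sequences have already been aligned to the same length.

```python
from collections import Counter

def position_frequencies(aligned_seqs, alphabet="ACGT"):
    """For each alignment column, return the fraction of sequences carrying each letter."""
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    n = len(aligned_seqs)
    profile = []
    # zip(*aligned_seqs) walks the alignment one column (position) at a time.
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        profile.append({base: counts.get(base, 0) / n for base in alphabet})
    return profile

# Hypothetical toy sample: four 'people', eight aligned positions.
sample = ["CATTAGGA", "CATTAGGA", "CACTAGTA", "CATTAGGA"]
profile = position_frequencies(sample)

for position, fractions in enumerate(profile, start=1):
    print(position, fractions)
```

In this toy sample every sequence has a C in position 1, while in position 7 three of the four have a G, so G would be drawn three times as tall as T there.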
This would provide a quantitative 'stereotype' of the human genome sequence that is not a rigid Platonic ideal, but would still be useful. A given sequence, like the beta-globin gene, varies from person to person, but a computer search could still easily find it in any whole genome sequence (a search-and-compare method called BLAST does just that).
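The flavor of such a search can be shown with a deliberately naive sketch. This is not BLAST, which uses much faster heuristics; it simply slides a query along a longer sequence and reports the window with the fewest mismatches. The function name `best_match` and both toy strings are hypothetical.

```python
def best_match(genome, query):
    """Slide query along genome; return (start, mismatches) of the closest window."""
    best_start, best_mismatches = -1, len(query) + 1
    for start in range(len(genome) - len(query) + 1):
        window = genome[start:start + len(query)]
        mismatches = sum(a != b for a, b in zip(window, query))
        # Strict '<' keeps the earliest window when several tie.
        if mismatches < best_mismatches:
            best_start, best_mismatches = start, mismatches
    return best_start, best_mismatches

# Hypothetical 'genome' carrying a one-letter variant of the query.
genome = "TTGACCATTAGTAGGCT"
hit = best_match(genome, "CATTAGGA")
print(hit)
```

Even though this 'genome' carries a variant copy (a T where the query has a G), the scan still locates it, which is the point: the gene can be found without demanding an exact match to a single ideal sequence.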
We could also represent missing nucleotides by adding a proportionally sized N (for None) at the location where they are absent. Major copy variation or rearrangements would be harder to include, but ways could probably be found.
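As a rough sketch of that extension, a gap in an aligned sequence can be counted as its own symbol. The code below follows this post's use of N for 'None' (in standard nucleotide notation N usually means 'any base', so this is a local convention). The function name `frequencies_with_gaps`, the `-` gap character, and the toy sequences are all assumptions for illustration.

```python
from collections import Counter

def frequencies_with_gaps(aligned_seqs, gap_char="-"):
    """Per-position fractions of A, C, G, T, plus N for a missing nucleotide."""
    n = len(aligned_seqs)
    profile = []
    for column in zip(*aligned_seqs):
        # Treat the alignment gap character as the symbol N ('None').
        counts = Counter("N" if ch == gap_char else ch for ch in column)
        profile.append({sym: counts.get(sym, 0) / n for sym in "ACGTN"})
    return profile

# Hypothetical toy sample: two of four sequences lack the fourth nucleotide.
gapped = ["CATTAGGA", "CAT-AGGA", "CATTAGGA", "CAT-AGTA"]
gap_profile = frequencies_with_gaps(gapped)
print(gap_profile[3])
```

Here position 4 would be drawn as a T and an N of equal height, since half the sample has the nucleotide and half does not.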
With computer technology, even a digest of all available sequences in a species could be stored in this way. Since individuals will always die and be born, at any given time this would be a 'type species' rather than a type or reference specimen, but that would perhaps be a much better way to think of things.
Then, it would not be necessary to use the concept of 'the human genome sequence' as a somewhat misleading guide--where we use something that doesn't exist to try to understand what does exist. In fact, something like this approach is already available, though not standard, and will be discussed in Part IV of this series.
How we would do the same with robins (or Loch Ness monsters) is less clear.