We started this series to explore some conceptual complexities and subtleties associated with the universally known, but perhaps less universally understood, notion of 'the' human genome. What seemed like a hard-won and very straightforward entity for human genetics turns out to be a lot less straightforward--even if it's exceedingly useful in many ways.
A type specimen in taxonomy can be a single individual. If 'the' human genome sequence in a future iteration is a diploid sequence from one person, it will be similar, as we noted in a previous part of this series.
|Helixanthera schizocalyx type specimen|
It is important to be aware that even as a reference, what it is a reference of is unclear. In prior posts of this series we discussed some of the issues. One general idea is that it's the sequence of the DNA from, at least, a 'normal' person. But as we noted that is misleading. The donor--when or if 'the' reference is from one person--is not in any way a particularly superior, nor even a particularly average, person.
A subtle point that seems very widely misperceived is that there is no 'normal' person. Nobody is at the exact average value for glucose levels, stature, blood pressure, and so on. Nobody has the 'average' DNA sequence (whatever that means). No person has ever died at the exact life expectancy in his or her population! An average, a 'norm', is a digest of variation. It is a Platonic concept.
If 37% of individuals in our species have an A, and 63% have a T at a given position in the genome, no person can be average. The closest s/he can get is 50% A, 50% T since we each (if we're 'normal'!) have two copies of the genome, so a single person's frequency of either allele can only be 0%, 50%, or 100%. So it's hopeless even to think of a digest such as 'average' in this context. The nucleotide frequencies at any given site will change, anyway, as samples increase (but we'll not know by how much, since people are leaving and entering the gene pool at every moment). And, as a reminder, there is the further problem that the genome instances in each cell of an individual differ: 'the' genome sequence obtained from that individual alone is a reference, not a true description of his/her genomic contents.
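The arithmetic here is simple enough to sketch in a few lines of Python. This is only an illustration of the point, not anything taken from a genetics library: the function names are our own inventions, and the 37% figure is the hypothetical one used above.

```python
from fractions import Fraction

def individual_allele_fractions(ploidy=2):
    """All fractions of allele A that one individual can carry at a single site."""
    return [Fraction(k, ploidy) for k in range(ploidy + 1)]

def closest_individual_fraction(population_freq, ploidy=2):
    """Attainable individual fraction nearest the population frequency, plus the gap."""
    options = individual_allele_fractions(ploidy)
    best = min(options, key=lambda f: abs(f - population_freq))
    return best, abs(best - population_freq)

# Hypothetical population frequency of allele A from the text: 37%.
pop_freq_A = Fraction(37, 100)
best, gap = closest_individual_fraction(pop_freq_A)
print(best, float(gap))  # prints: 1/2 0.13
```

A diploid person can only sit at 0, 1/2, or 1, so even the best case (a heterozygote) misses the population value by 13 percentage points at this one site alone.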
So we are not really using a reference sequence to establish an ideal. Yet, unless we wanted to make a point about something like political correctness, we would not use a person with, say, Tay-Sachs, or one who was congenitally blind, nor an indigenous Australian or an Inuit, as our reference. This shows in yet another way that terms like 'reference' or 'representative' are potentially quite misleading, or even political.
And yet, an arbitrary reference is very useful! Why is that?
'The' wildebeest; 'the' mongongo nut: Deep aspects of the mind?
However this works neurally, our ancestors knew what a wildebeest is, or a mongongo nut, even if no two of them are identical. We have always created mental Platonic ideals and within rather broad limits we instantly fit sensory input into these categories and act on the result.
We use language--terms like 'the chicken' or 'the human genome'--but the biological process involved in understanding the concept doesn't depend on language, and it certainly isn't anything particularly human! Lions recognize wildebeests, bees recognize flowers, and ants recognize enemies. The wiring that leads to the ability to characterize quickly has long been built into neural systems, however they work.
Using type specimens, including whatever we determine to use to represent genomes, is thus a manifestation of a deep mental characteristic. Where we have trouble is at the edges of our categories, in the range of objects that we include, and this is true not just of taxonomy, or genomics, but in all of life.
Even when it comes to disease and gene mapping!
It is thus no surprise that in human genetics we hunger for 'the' diabetes gene, or a gene 'for' diabetes--and have a definition of 'diabetes'. That is an underlying reason for the fervor for the enumerative approach to genome mapping of traits. Categorization can be useful, since categories of disease allow us to develop protocols consisting of categories of treatments, tests, and identification. For deep as well as pragmatic reasons, we are not easily accepting of complexity: we like to throw the word around perhaps, if it sounds fashionable and insightful, but in fact we do our damnedest to reduce quantitative complexity to qualitative simplicity. We simply can't stop wildebeesting.
Evolutionary typology, too
We've stressed the central relevance of evolutionary population thinking that should be an omnipresent counterweight to typological thinking. But we even do the very same Platonic thing in evolutionary biology, no matter how deeply we understand its quantitative nature. We want there to be 'the' adaptive reason for a trait that we are interested in. The adaptive 'fitness' of a trait is viewed in effect as an inherent, cosmically existent reality of which each individual's reproductive success is an imperfect reflection. Indeed, often we do the same thing in defining 'the' trait in the first place: we want 'the' selective explanation for 'bipedalism', for example, as if that were a single thing and evolved as such.
Categorical thinking is very hard to avoid, and population thinking devilishly hard to ingrain. A new paper in PLoS One, which a colleague, Charlie Sing at the University of Michigan, pointed out to us, looks at the current data from the '1000 Genomes' project, the effort to provide a large number of whole genome sequences. It raises--or rather, demonstrates--problems that will be faced by the desire (or dream) of predicting disease from individual genomes. The structure observed among known variants is used to identify causal variation by various mapping or association tests, as we've commented on many times (e.g., GWAS). But while some variants identified in what might be called disease-specific 'reference' data will have strong effects, this reference, even though based on many sequences rather than one, will not include many other relevant variants.
How we should deal with rare variants is currently a hot topic in many labs, but whether one should be thinking in terms of such single-cause prediction is--or, rather, should be--at best an open question. No composite of variation will be a perfect genome referent for the moving target that is a species.
Since the real world is at least as quantitative as it is qualitative, categorical type-specimen thinking may sometimes be helpful but also can lead to wasteful effort or even to harmful results. Even war between 'us' and 'them' is an instance. It's hard to be careful about how and when to use 'types' even as referents. In a sense, the macro world has duality the way the micro world does in physics, where particles come in types, and 'the' electron is sometimes a particle, sometimes a wave....something, and yet not something.
Typological thinking is very deeply ingrained. For example, in what is almost laughably misleading yet ineradicably pervasive, almost every geneticist refers to 'mutations' when they mean 'alleles': mutation is a change due to a DNA 'mistake' between one cell and another, as between parent and offspring. Every element of every sequence on earth today was, at one point, a 'mutation'. 'Allele' properly refers not to a change but to variant states present at any given time. But we exceptionalize variants in this way by calling them mutations if, for example, they lead to a disease or trait under study. This arises, conceptually, only because we think there is a 'normal'--a type--from which the variant is a 'mutant' (a word often with denigratory connotations). Yet we know that's at best inaccurate.
Worse, almost everyone in biology unstoppably refers to the 'wild type'--the wild type--when comparing to some variant ('mutant'). This has become ridiculously entrenched even though, for reasons we've been discussing, there is no single type. Even in mouse research it is absolutely routine to refer to a strain, say the C57BL/6 mouse, as the 'wild type' relative to an experimental manipulation such as inducing a transgenic 'mutation' in one of those mice. But there is absolutely nothing 'wild' about a C57BL/6 mouse--it is a laboratory-produced inbred strain that couldn't last the proverbial 5 minutes out in the 'wild'. It, like all inbred strains, has been very un-naturally produced.
So 'wild type' is a persistent, historically derived but obsolete term, a perversion even of the original type-specimen concept. In daily practice, of course, this is jargon and geneticists know what someone is talking about when contrasting a wild type with a mutant. But the habit reflects psychologically deep patterns of thinking that are far from what the real world is like and, as we have tried to suggest, can be misleading, sometimes when it's important not to mislead.
A simple idea like 'the' human genome intuitively seems to exist, and yet on close inspection doesn't exist. In a sense, one can say that humans are built for material Platonism. The ideal, the abstract entity, is a figment of our individual imaginations, but our power of imagination was built by evolution. Using ideals, based on material objects, comes naturally to us. We depend on it.
|Plato's Allegory of the Cave by Jan Saenredam, after Cornelis van Haarlem, 1604, Albertina, Vienna.|
In the end, 'the' human genome sequence is imaginable, but doesn't exist, no matter how we might contrive to produce something less arbitrary than the essentially fabricated referent we now use. The challenge is to understand the limits, as well as the strengths, of categories.