Wednesday, August 1, 2012

Ooops! The human genome does not exist! Part I. The notion of a type specimen

It's sultry summer, and of course the science hype machine motors on unabated, but we want to ignore it briefly and relax at least a little bit before Penn State resumes its serious academic activities--that is, football season.  Still, while you're basking in the sun getting your tan (and skin cancer), along with your mint juleps (and undoubtedly some other ugly diseases), here's something to think about:

The human genome doesn't exist
René Magritte: This is not a pipe. It's a painting.  But a painting of 'a' pipe--but not 'the' pipe, which doesn't actually exist in the real world.  Or does it?
Despite many claims that the Human Genome Project sequenced the human genome, and thus set in motion the most exciting era of fundamental new scientific discovery since Galileo, it has turned out that the HG doesn't exist after all.  It no more exists than does 'the chair' or 'the dog' of the kind Plato once asserted.  He said we have dogs and chairs but they are only imperfect instances of the real true dog and chair.

Now what everybody raves on about, the HG, is like that.  It is not from one person, not even one copy from one person's two. It's not clear whether this is even from one person or from several donors, or if from one, who that person is in terms of his origins (his, because there's a Y chromosome in the sequence). 

In any case, it's not a 'normal' genome, because while we believe the sequenced individual(s) was/were healthy at the time they bled for the cause, they won't be healthy forever.  Since we are told every day by the press that every disease without exception must be genetic, and hence we should GWAS it endlessly,  the donors will eventually become genetically abnormal.

Worse, 'the' genome keeps changing!  We are now in version 19, released in 2009. It either adds more details from the same donors, or now has bits that couldn't be sequenced easily from those donors' DNA and so were sequenced from additional donors (we don't know which is the case).  There will be future revisions.  Not only that, but it is a 'haploid' sequence, with only a single reference nucleotide per position, whereas any human has two copies (except for the X and Y chromosomes in males).  Yet in any one person 1% or more of sites actually vary.
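To make the haploid-versus-diploid point concrete, here is a minimal Python sketch. The three short sequences are invented toy data, not real genome sequence, and the function names are our own hypothetical choices. It simply shows that a person carries two copies, that those copies can differ from each other, and that each can differ from the single-base-per-position reference.

```python
# Toy sequences invented for illustration; they are NOT real genomic data.
REFERENCE = "ACGTACGTAC"  # haploid reference: exactly one base per position
MATERNAL = "ACGTACCTAC"   # one of a person's two copies (hypothetical)
PATERNAL = "ACGAACGTAC"   # the other copy (hypothetical)


def differing_sites(seq_a, seq_b):
    """Return the 0-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only compares sequences of equal length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Sites where the person's own two copies disagree (heterozygous sites).
heterozygous = differing_sites(MATERNAL, PATERNAL)

# Sites where each copy departs from the reference.
maternal_vs_ref = differing_sites(REFERENCE, MATERNAL)
paternal_vs_ref = differing_sites(REFERENCE, PATERNAL)

print("heterozygous sites:", heterozygous)
print("maternal copy differs from reference at:", maternal_vs_ref)
print("paternal copy differs from reference at:", paternal_vs_ref)
```

In this toy case neither copy matches the reference exactly, and a single reference string has no way of recording that the person's two copies also disagree with each other.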

This is not Van Gogh's kitchen chair
Now, this clearly means that nobody actually has 'the' currently posted human genome sequence, any more than your kitchen chair is 'the' chair.  Of course, biologists who are more savvy than most people who talk about it know that the HG is a reference sequence.  Nobody has, and nobody has ever had, this incomplete reference HG. All of us have some variants of that sequence.  It is wrong, but standard, to refer to what we each carry as 'copies' of the HG.  They are not copies!  There is no copy of the HG!  No one person ever had that sequence, so no person could copy (replicate) it to transmit to their offspring.

Probably, we should use a term like 'instance' rather than copy, in the same sense that each chair is an instance of the concept of chair, even if no Platonic actual ideal chair exists, except as a reference concept in our minds.

What is good for the goose is good for the gander.  If the HG doesn't actually exist, neither does a given gene, say 'the' beta-globin gene, which is expressed in red blood cells and of which some variants are involved in anemia (like sickle cell anemia).  Here is hg19's version of a very small part of that sequence:

[Figure: hg19's version of a very small part of the beta-globin gene sequence]

Any one instance of this in an actual person, like you, may or may not have this exact sequence.  But unless it's been deleted from your genome, you'll have something very similar, not because it is a copy of some Platonic idea, but because the instances are all descendant copies handed down through the generations since we shared a common ancestor, and that--evolution--is the crucial difference.  It's why we should be careful about treating a reference as being real, and about how we relate the existing instances of the 'same' (homologous) sequence.

We use 'the' HG sequence as a way of referring to a standardized set of nucleotide locations along the chromosomes in the individuals who were sequenced.  Each of us varies from that sequence in millions of places (in each copy that we carry). To find similar functional elements--genes and so on--in my instances of the human genome, we use 'the' HG as a kind of map.  If I had your DNA sequence, I could identify which is 'the' beta-globin gene by its similarity to that in the figure, even if you didn't have the exact same sequence.
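Using the reference as a map to find the 'same' element in someone else's sequence can be sketched in a few lines of Python. This is only a toy, under loud assumptions: the two sequences are invented, the function name `best_match` is hypothetical, and the method (slide the reference snippet along and count mismatches) ignores insertions and deletions, which real alignment tools must handle.

```python
def best_match(reference_snippet, individual_sequence):
    """Slide the reference snippet along an individual's sequence.

    Returns (start, mismatches) for the window that differs from the
    snippet at the fewest positions. Substitutions only; no gaps.
    """
    k = len(reference_snippet)
    if k == 0 or k > len(individual_sequence):
        raise ValueError("snippet must be non-empty and no longer than the sequence")
    best_start, best_mismatches = 0, k + 1
    for start in range(len(individual_sequence) - k + 1):
        window = individual_sequence[start:start + k]
        mismatches = sum(1 for r, w in zip(reference_snippet, window) if r != w)
        if mismatches < best_mismatches:
            best_start, best_mismatches = start, mismatches
    return best_start, best_mismatches


# Invented toy data: a 'reference' element and one person's longer sequence
# that carries a similar, but not identical, instance of it.
REFERENCE_SNIPPET = "GATTACAGG"
INDIVIDUAL = "TTCCGATTGCAGGAAT"

start, mismatches = best_match(REFERENCE_SNIPPET, INDIVIDUAL)
print(f"best window starts at {start} with {mismatches} mismatch(es)")
```

The element is found by similarity even though no window matches the reference exactly, which is the sense in which the reference serves as a map rather than as something anyone actually carries.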

This is very useful, but it's important to be aware that not only does nobody have the same sequence, but even the structural details--the locations of the 'same' elements in you and us and the donors of 'the' HG--differ among people.  So the coordinates (the numbers on the top line) will not be the same.  For example, the beta-globin gene starts at chromosome 11 position 5,246,696 in draft hg19, but at 5,203,400 in draft hg18, a difference of around 43,000 nucleotides!  Where did they come from?  Which draft should we believe--if any?  What will hg20 say? What is 'the' HG?
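The arithmetic behind that difference is easy to check. The short Python sketch below uses only the two start positions quoted above; the helper `hg18_to_hg19_local` is a hypothetical name of our own, and the idea that a single constant offset carries a position from one draft to the other is an assumption that can hold, at best, only in the immediate neighborhood of the gene. Real conversion between assemblies is done with dedicated tools (UCSC's liftOver is one) that rely on full alignments between the drafts.

```python
# Start positions of the beta-globin gene on chromosome 11, as quoted in the post.
HG19_START = 5_246_696
HG18_START = 5_203_400

# How far the 'same' gene moved between the two drafts.
OFFSET = HG19_START - HG18_START


def hg18_to_hg19_local(hg18_position, offset=OFFSET):
    """Shift an hg18 coordinate by a constant offset.

    ASSUMPTION: a constant shift is only a crude local approximation near
    this gene; it is not how assemblies are actually reconciled.
    """
    return hg18_position + offset


print(f"offset between drafts: {OFFSET:,} nucleotides")
print(f"hg18 start mapped forward: {hg18_to_hg19_local(HG18_START):,}")
```

The offset comes to 43,296 nucleotides, the 'around 43,000' mentioned above. Nothing in this calculation says where those nucleotides came from, which is exactly the question.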

Type specimens
American Robin
Think about this in another context.  For centuries we have had the concept of a type specimen.  There is 'the' robin, 'the' monarch butterfly (or even, perhaps, 'the' Loch Ness monster??).  'The' Neanderthal hominid is an arbitrary individual whose luck it was to die and not rot, so we could discover him around 30,000 years later in Germany.  The lucky old sod gets to represent all of his contemporaries and ancestors and descendants for hundreds of thousands of years.  And 'the' Neanderthal DNA sequence is partial, a composite of several individuals who lived hundreds of kilometers and thousands of years apart.  Think about that when you hear people pronouncing on how 'the' Neanderthal lived, or what color hair 'it' had, or whether it had religion!

Robin you'll see in the UK (image: RSPB)
The international organization of systematists--biologists who are concerned with cataloging, characterizing, and naming species and their relationships--has decided what counts as a reference specimen of an organism, in the same way we decide what counts as a reference specimen of the genome of a species.  Indeed, and interestingly, the genome of 'the' mouse or 'the' horse is likely rarely, if ever, from the same official physical type specimen on display at some authorized museum.  Thus some poor (former) robin sits rigidly on some display branch for all to see, representing all robin-hood individuals it never saw or dreamt of.

This leads to interesting questions about how we treat our subjects, biology and evolution.  Are there alternatives to type specimens?  So, while you nurse your mint julep in suspense, think about this--the whole concept of a reference specimen, be it genetic or physical--because tomorrow we'll give at least a bit of thought to this question.


Deanna said...

You can learn more about how the reference assembly is changing here:

Ken Weiss said...

Thanks for this link, which I had not known of. I'll look at it. Meanwhile, I was just talking with people here at Penn State who are involved, and learned something of what's new. This led me to modify the first post, which is now somewhat qualified relative to the original version posted this morning.

Also, I'm adding a Part IV to further update the concepts.

But I think the conceptual issues are not entirely changed, even with improvements, and part of the purpose of the posts is that some or even many biologists or even human geneticists aren't fully aware of the nature of a reference sequence etc. and what I would call its epistemological meaning, relative to the thorny problem of population vs stereotypical thinking.

Henk Poley said...

Minor nitpick on the intro. Though UV causes suntan and cancer, UV-B also causes the skin to create vitamin D, which is used to keep a lot of cancers at bay. 50-75% reduction for a lot of cancers when looking at the highest serum level cohorts.

Ken Weiss said...

Well, maybe; the vit D story is far from clear, at least in relation to sunlight exposure, but the skin cancer story is very clear. Whether Aussies have a net gain or loss of cancer overall, as a result of their infatuation with the sun (which, historically, gave them lots of skin cancer, though maybe they're being more protective these days) isn't clear (to us, but maybe to someone it is).

Anyway, that is a nit in the intro, which was just a glib intro, so we don't mind being picked!

Ken Weiss said...

We've just looked at this blog and it does provide information on many of the issues discussed in this series. It shows, among other things, that there are issues we don't mention, and that there are a number of people working on how best to represent our genome (and yours!).

Deanna said...

There is also a publication (should have included it the first time):

Many people aren't aware of how the genome was constructed, despite this being documented in the literature. We are working towards evolving the reference assembly into more of a pan-genome, but we clearly have to continue fixing errors as well. Additionally, there is a lot of room for additional reference sequences from individuals from different populations. The nature of the reference assembly you need depends largely on the questions you are trying to ask.

Ken Weiss said...

Thanks for this reference, another we did not know. It's impossible to keep up with everything. I think nothing we say is inconsistent with this.

I do believe that many if not most people are unaware of these things, even though they use 'the' genome regularly.

In future episodes of this brief series I think we will address some of these questions, at least in our own way.

Anonymous said...

Future project: synthesise and clone the reference human. To be stored under an airtight belljar in Sèvres, next to the reference kilogram.

Ken Weiss said...

Well, there is a problem. His/her/their genomes will have experienced mutations during the cloning. Even just en-jarring the original person(s) won't work, as it/they too will have experienced mutations since the DNA was collected.

We deal with some of the complications in later parts of this series.

It seems that having 'the' meter stick, or 'the' cesium clock is an easier task.

AD Johnson said...

Thanks for a nice blog entry. I've worked in genetics and bioinformatics for about a decade and collaborated with many from a variety of backgrounds. Anecdotally, these are issues that a significant fraction of geneticists, biologists and clinicians are relatively unaware of until they are explained to them (i.e., it is not covered in most courses and textbooks).

A few years ago I proposed a modified IUPAC code to try to provide single reference sequences more reflective of populations/organisms:

More recently Euan Ashley and colleagues explored the construction of synthetic reference genomes from different populations. This article provides some nice examples of the benefits of an improved reference or synthetic sequence:

P.S. - go Lions (proud Alum)!

Ken Weiss said...

We deal with other aspects, probably similar in spirit to your thoughts, in the later parts of this series...but also we raise some problems.

At this point, as far as "go Lions" is concerned, it's too bad what happened to the football team, since only one player, named Jerry Sandusky, has ever been implicated in wrongdoing, and of course the coach did the positive things he's been credited with, even if he had his human failings.

But that's what happened to 85 Lions. What about the other 44,915 Lions who are here to play with pen and computer, not footballs? I think we need to focus on academic reform, to take some leadership in addressing national problems in a long creep of weakening standards. Hopefully, it'll happen, but I'm not holding my breath.