Wednesday, February 22, 2012

How many genes can we live without?

It's well-known that 'the' human genome doesn't exist -- despite the hype about the complete sequencing of this thing (and despite the incompleteness of its sequencing).  In fact, each of us has a collection of DNA variants that, added together, means that our genome has never been seen before in the 3.8 billion year history of genomes, and will never be seen again.  And it's no mean collection; we all differ from each other at something like 3 million loci.

We need to understand first of all that 'the' human genome is not from one person. It is a reference sequence, useful for comparing other data, but not definitive of our species.  The donors of the DNA were healthy at the time of donation, but that's about it.  They were not particularly special in any way.

Human genome by functions; Wikimedia Commons
In fact, as it turns out, not only do we differ at single nucleotides -- you have a T where I have an A, I have a G where you have a C, and so on, times 3 million -- but we are each carrying around a not insignificant number of variants that result in loss of function (LoF) of some of our protein-coding genes.  And many of these are genes we think of as essential.

A paper published in last week's Science, "A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes," MacArthur et al., estimates that each of us has around 100 of these LoF variants, 20 of which result in complete loss of a gene. The authors looked at three pilot data sets from the 1000 Genomes Project (58 Yoruba from Nigeria, 60 European Americans from Utah, 30 Chinese individuals from Beijing and 30 Japanese individuals from Tokyo) and a European genome, and, after filtering a larger initial set of candidate LoF variants, finally analyzed 1285 variants that they found to be likely to cause protein-coding genes to lose function.

As MacArthur et al. point out, other recent studies of the complete DNA sequences of healthy individuals have also found many LoF variants -- from 200 to 800.  These are people walking around perfectly normal so far in their lives, but without the use of substantial numbers of their genes.  The specifics vary, and which non-working genes you can do without likely depends on which of your other genes are working.

Before the advent of complete genome sequencing, LoF variants were thought to be rare, largely associated with severe Mendelian disorders such as cystic fibrosis or Duchenne muscular dystrophy.  The finding that they aren't so rare after all suggests to MacArthur et al. "a previously unappreciated robustness of the human genome to gene-disrupting mutations and [this has] important implications for the clinical interpretation of human genome–sequencing data."
"LoF variants found in healthy individuals will fall into several overlapping categories: severe recessive disease alleles in the heterozygous state; alleles that are less deleterious but nonetheless have an impact on phenotype and disease risk; benign LoF variation in redundant genes; genuine variants that do not seriously disrupt gene function; and, finally, a wide variety of sequencing and annotation artifacts. Distinguishing between these categories will be crucial for the complete functional interpretation of human genome sequences."
After they weeded out false positives (which were due to sequencing errors; of course, false positives are a problem in their own right in a clinical setting), the variants included indels (insertions or deletions of 1 or more nucleotides) that shifted the gene's reading frame (and thus changed the amino acids that got strung together in the resulting protein), single nucleotide variants that disrupted a splice site or introduced a stop codon into the gene sequence (that is, that caused translation of the protein to halt prematurely), and large deletions that removed some of the coding sequence.  Some of the variants were found to affect all known protein-coding transcripts of the affected gene, and some affected only some of the coding transcripts.  That is, some transcripts were normal.
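The distinction between variants that hit every known coding transcript of a gene and those that hit only some can be illustrated with a small sketch. This is not the authors' pipeline; the gene names, transcript IDs, variant labels and the `lof_extent` helper are all hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LofVariant:
    gene: str                        # hypothetical gene name
    kind: str                        # e.g. "nonsense", "frameshift_indel", "large_deletion"
    affected_transcripts: frozenset  # transcript IDs this variant disrupts


def lof_extent(variant, coding_transcripts):
    """Label a variant 'full', 'partial' or 'none' by how many of the
    gene's known coding transcripts it disrupts."""
    known = coding_transcripts[variant.gene]
    hit = variant.affected_transcripts & known
    if not hit:
        return "none"
    return "full" if hit == known else "partial"


# Hypothetical annotation: gene -> set of known protein-coding transcripts.
coding_transcripts = {
    "GENE_A": frozenset({"A-1", "A-2"}),
    "GENE_B": frozenset({"B-1"}),
}

variants = [
    LofVariant("GENE_A", "nonsense", frozenset({"A-1"})),          # A-2 stays normal
    LofVariant("GENE_B", "frameshift_indel", frozenset({"B-1"})),  # only transcript is hit
]

for v in variants:
    print(v.gene, v.kind, lof_extent(v, coding_transcripts))
```

In this toy example the GENE_A variant is only a partial loss of function, because one of its two transcripts is still normal, while the GENE_B variant knocks out the gene's only transcript.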

The authors find that the common gene-disrupting variants described in this study are not a significant cause of complex disease.  Most of the LoF variants identified that are associated with complex diseases, all but one of which were heterozygous in these subjects (they had one functional copy, one not), are at low frequency, presumably due to purifying selection -- that is, selection against alleles that are severely harmful, thus preventing them from reaching high frequencies in any population. 

Individuals in this study have about 120 LoF variants, about 100 of these heterozygous and 20 of them homozygous -- meaning both of the person's copies are non-functional.  It's the homozygous LoFs that seem to have no effect that are of most interest.  The individuals in this study are healthy -- so either their LoFs truly have no effect, or they haven't yet had an effect.  If they are truly what MacArthur et al. are calling LoF-tolerant genes, and don't lead to disease, the authors suggest that they can be used to "define the functional and evolutionary characteristics that distinguish these genes from severe recessive disease genes."
"We examined the 253 genes containing validated LoF variants that were found to be homozygous in at least one individual. These LoF-tolerant genes are significantly less conserved and have fewer protein-protein interactions than the genome average. They are also enriched for functional categories related to chemosensation, largely explained by the enrichment of olfactory receptor genes in this class (13.0% versus 1.4% genome-wide), and depleted for genes involved in embryonic development and cellular metabolism."
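The criterion described here (a gene counts as LoF-tolerant if a validated LoF variant in it is homozygous in at least one individual) can be sketched in a few lines. This is a toy illustration and not the authors' method: the individuals, gene names and genotype counts are invented, and it assumes at most one LoF variant per gene, so compound heterozygotes are ignored.

```python
from collections import Counter


def summarize_lof(genotypes):
    """genotypes maps individual -> {gene: number of LoF copies (0, 1 or 2)}.

    Returns per-individual counts of heterozygous and homozygous LoF genes,
    plus the set of genes that are homozygous LoF in at least one individual.
    """
    per_person = {}
    tolerant_candidates = set()
    for person, calls in genotypes.items():
        counts = Counter()
        for gene, lof_copies in calls.items():
            if lof_copies == 1:
                counts["heterozygous"] += 1   # one working copy remains
            elif lof_copies == 2:
                counts["homozygous"] += 1     # no working copy remains
                tolerant_candidates.add(gene)
        per_person[person] = counts
    return per_person, tolerant_candidates


# Invented data for two healthy individuals.
genotypes = {
    "person_1": {"GENE_A": 1, "GENE_B": 2, "GENE_C": 0},
    "person_2": {"GENE_A": 1, "GENE_B": 0, "GENE_C": 1},
}

per_person, tolerant_candidates = summarize_lof(genotypes)
for person, counts in per_person.items():
    print(person, dict(counts))
print("homozygous in at least one individual:", sorted(tolerant_candidates))
```

Here only GENE_B ends up in the candidate set, because it is the only gene for which some healthy individual carries two non-functional copies.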
The finding that so many of the genes that are 'allowed' to vary are olfactory receptor (OR) genes isn't much of a surprise, as we all carry many OR genes that are pseudogenes, ORs that no longer function.  So, the researchers eliminated these from the set of genes that could lead to Mendelian disease, and then compared the remaining 213 LoF-tolerant genes with 858 known recessive disease genes.  They found these two categories were very different, and suggest that the characteristics of the recessive disease genes that are not shared with the LoF-tolerant genes could be used to prioritize candidate disease genes.

This can be important because we all have so many variants, and identifying which of them contribute to a disease we may have is often impossible without data from other affected family members.  If likely candidates can indeed be prioritized based on the results of this study, this could be very helpful.  But many gene disruptions are deleterious only in a given environment, or after years of exposure to environmental factors, and one person's LoF-tolerant gene might be another's disease gene.

To us, the fact that so many genes can apparently be disrupted with no discernible ill effect is further evidence of the adaptability that evolution has built in -- DNA replication errors are common, the timing and locale of gene expression are often imprecise, and environments frequently change.  We're redundant, buffered, pretty hardy creatures!  The fact that all this can happen and we can live on with no ill effect is a beautiful fact of life.


Holly Dunsworth said...

Beautiful ...and so annoying too.

So does this mean that current disease risk estimates, based on only one or a few SNPs rather than on those SNPs within your entire personal genomic context, could be more tenuous than we think?

Anne Buchanan said...

I think it means that we don't yet have a nearly complete picture of the kinds of genetic variation that cause disease. And that there's a lot of play, or imprecision, in the system.

Genomic background is a contributor, and, as this paper points out, the structural nature of the variation seems to be an issue. It also means, I think, that we don't really understand the old concepts, 'recessive' and 'dominant'.

It has been known for a while that some of us carry known disease mutations, but we don't have the disease. Hemochromatosis, an iron absorption disorder, is a classic example of this. And, it's been known that the same allele that is associated with disease in one inbred mouse line is carried by perfectly healthy mice in another line.

I think the bottom line is, it's complicated.

Ken Weiss said...

At the very least it shows why single-site predictions are perilous in most cases.