The Mermaid's Tale: Looking at everything in everybody: the 1000 Genomes Project report

The latest report of the '1000 Genomes Project,' still a work in progress, was published in Nature last week (no paywall). The paper suggests that 95% of common genetic variants, those that occur in more than 5% of the population (or, more accurately, of the samples that have been sequenced or genotyped), were already identified in the pilot phase of the project. Now, however, the authors are reporting variants that occur at lower frequencies, or in non-coding regions of the genome.

Low-frequency variants are likely to have arisen more recently than common variants, and are more likely to be functional mutations under weak purifying selection. The old variants we observe today are likely to be those that reached a high enough frequency to survive. For understandable reasons, such as the limited time recent mutations have had to spread, rare variants tend to differ among populations more than common variants do. Thus, the suggestion is, low-frequency variants may be those that are more likely to explain diseases in families. This is a highly questionable, even knowingly convenient, hope.

This paper is a report of 1,092 genomes from 14 populations from Europe, East Asia, sub-Saharan Africa and the Americas. For methodological reasons, the variants they concentrate on are SNPs (single-nucleotide differences), short indels (one or a few nucleotides missing or inserted) and large deletions. They "discovered and genotyped 38 million SNPs, 1.4 million bi-allelic indels and 14,000 large deletions."

The data, while still just a subset of the genomes, are characterized in detail and freely available, so there's no need for us to repeat the results here, except to say that, as expected, the overwhelming majority of variants found at a frequency of 5% or greater were previously identified, and most of these are found in all populations. Fewer low-frequency variants were known before this project, and far fewer very rare variants were previously identified; most of these rare variants are in only one ancestry group. This is just what would be expected from sampling considerations, but at least it confirms the relevance of earlier SNP data.
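To make the sampling point concrete, here is a minimal sketch of the chance that a variant shows up at all in a sample. It is not taken from the paper. It assumes diploid individuals, independent draws of chromosomes, and no sequencing error or population structure. The function name and the sample size of 100 are illustrative choices; 1,092 is the number of genomes in the report.

```python
def detection_probability(freq: float, n_individuals: int) -> float:
    """Chance that a variant at population frequency `freq` is seen at least
    once among 2 * n_individuals chromosomes (assumes independent draws)."""
    n_chromosomes = 2 * n_individuals
    # Probability of missing the variant on every chromosome, then complement.
    return 1.0 - (1.0 - freq) ** n_chromosomes


if __name__ == "__main__":
    # Sample sizes: 100 is a hypothetical small study, 1092 is from the report.
    for freq in (0.05, 0.01, 0.001, 0.0001):
        for n in (100, 1092):
            p = detection_probability(freq, n)
            print(f"frequency {freq:<7} n={n:<5} P(seen at least once) = {p:.3f}")
```

Under these assumptions a variant at 5% is all but certain to be caught even by a modest sample, while a variant at 0.01% is usually missed even with 1,092 genomes. That is why earlier, smaller efforts had already found most common variants and so few rare ones.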

And, also as previously reported, most 'healthy' people are found to have a number of deleterious loss-of-function (LOF, such as gene-inactivating) mutations, or non-synonymous variants which cause a change in the proteins coded for by a gene.

"We estimate that individuals carry an excess of 76–190 rare deleterious non-synonymous variants and up to 20 LOF and disease-associated variants. Interestingly, the overall excess of low-frequency variants is similar to that of rare variants."

Part of the justification for these data is to use them in searches to identify disease-causing mutations in people with rare diseases. This is a reasonable idea because everyone has hundreds if not thousands of potentially disease-causing mutations, and many more variants that are unlikely to cause disease, but how would you know which is which in someone with a rare disease? Determining which mutation is the cause of disease, or which will cause disease, is impossible without more information, either from an affected family member or another patient, or from people without the disease. If a mutation found in an affected person is not found in unaffected family members or any other unaffected people, this can be a clue that it might truly be causal.

If the mutation is found in someone currently healthy, that can mean it's not causal. But not necessarily. The person won't always be healthy, and what will eventually make him or her sick is not likely to be predictable, but could be every bit as 'genetic' as much of what all the genomic medical research has found. If you're looking for the cause of a very rare disease, the probability that an unrelated person will develop that disease is small, but many mutations are what is mysteriously called 'non-penetrant', meaning that a mutation that seems to be causal in one family member doesn't make another one sick. So, the mutation may or may not be ruled out as causal.

If the same trait and variant are found in a few family members, or if different mutations are found in the same gene in different patients, one can begin to get excited. Then one must confirm the functional relevance of the gene. And remember that 'gene' can mean known protein code, regulatory regions, or the countless other functional parts of the genome whose actions largely remain unknown.

This project is a historical outgrowth of the earlier HapMap and of smaller-scale whole genome sequencing, and also an unembarrassed retreat from the Common Variant hyperbole that justified those studies. We know that common variants just do not, generally, account for much of the risk of common disorders. But if you're convinced that genetics will account for those traits in the end (and/or you have a lab and your salary to maintain), you will find another reason.

The question is not whether genetics on any scale is worth doing, but how worthwhile it is relative to other ways to invest in health improvement or even health research. To the extent that this is just a means to keep employment and the gene industry moving (about which we've posted recently, including on The Ladies' Paradise), the justification is not a purely scientific one. Whether that's good or bad is a judgment call.

As Ken and I have said many times, our view is that funds for genetics should be preferentially used to understand, prevent or treat illnesses that truly are genetic, not just to feed the genomics industry. The 1000 Genomes Project began largely as an avowed bridge to tide labs over from the Human Genome Project to the time when whole genome sequencing would become routine, at a cost low enough for it to be done on one and all. The rationale was largely post hoc, or just that more 'information' is better than less. So the project is documenting the hell out of these genomes, in ways that will quickly be overwhelmed by even more whole genome sequence data. And much of what we've learned so far is that rare variants are rarer and more isolated than common variants. We knew this in principle from population genetics theory, and from a lot of data we already had. Also, we are seeing again that non-coding variants could be important. Well, that's not new, and no surprise.

There are various things one might do, at lower cost, to see what we should expect from such data and how or what more we should be generating at this stage. Computer simulation is one very inexpensive approach (that we are taking) to address some of these questions.
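As an illustration of how cheap such a simulation can be, here is a toy sketch of neutral drift. It is not the simulation we refer to above, and every name and parameter in it is an assumption made for the example. It follows a small Wright-Fisher population of 100 chromosomes, adds one new mutation per generation, and records the frequency and age of whatever is still segregating at regular snapshots. There is no selection, no recombination and no population structure.

```python
import random


def binomial(rng, n, p):
    """Draw from Binomial(n, p) by summing n Bernoulli trials (stdlib only)."""
    return sum(1 for _ in range(n) if rng.random() < p)


def simulate(n_chromosomes=100, generations=2000, burn_in=500,
             snapshot_every=25, seed=1):
    """Neutral Wright-Fisher drift with one new mutation per generation.

    Returns (frequency, age_in_generations) pairs for every variant that
    was segregating at each snapshot taken after the burn-in.
    """
    rng = random.Random(seed)
    variants = []       # each entry: [copy_count, birth_generation]
    observations = []
    for gen in range(generations):
        # Drift: resample the copy number of every segregating variant.
        for v in variants:
            v[0] = binomial(rng, n_chromosomes, v[0] / n_chromosomes)
        # Drop variants that have been lost or have gone to fixation.
        variants = [v for v in variants if 0 < v[0] < n_chromosomes]
        # A new mutation enters on a single chromosome.
        variants.append([1, gen])
        if gen >= burn_in and gen % snapshot_every == 0:
            observations.extend(
                (count / n_chromosomes, gen - born) for count, born in variants
            )
    return observations


def mean_age_by_class(observations, threshold=0.05):
    """Mean age of rare (< threshold) and common (>= threshold) variants."""
    rare = [age for freq, age in observations if freq < threshold]
    common = [age for freq, age in observations if freq >= threshold]
    return sum(rare) / len(rare), sum(common) / len(common)


if __name__ == "__main__":
    rare_age, common_age = mean_age_by_class(simulate())
    print(f"mean age of rare variants:   {rare_age:.1f} generations")
    print(f"mean age of common variants: {common_age:.1f} generations")
```

Even a toy like this should reproduce the basic expectation from population genetics theory that variants seen at low frequency are, on average, much younger than common ones. A realistic version would need demography, multiple populations and selection, all of which are left out here.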

Overall, however, if the piles of genetic data these efforts are contributing really can be useful in medicine, that's good news. But this is going to be for a subset of genetic diseases, with little if any clinical utility beyond that subset.


Friday, November 9, 2012
