Sunday, August 9, 2015

How many diseases does it take to map a SNP? Fifteen years on

Ken and I are here in Finland, preparing to teach a week of Logical Reasoning in Human Genetics with Joe Terwilliger and colleagues.  Not statistical methods, not laboratory techniques, not the latest way to analyze sequence data.  Concepts, logical reasoning.

Ken and Joe have been reasoning logically for a long time.  They've taught this course together in many places, and they wrote at least one logically reasoned paper 15 years ago.  That paper was published in Nature Genetics.  Shortly afterwards, that journal made an editorial policy decision to be the loudspeaker for genome-wide association studies (GWAS), and it would be unlikely in the extreme to publish such a view today.  But Joe often says that it could, and probably should, be published again with very few wording changes.  (He also says that if overhead projectors were still available, he'd give the same talks he gave in 1995, since the issues in human genetics haven't changed.  We have lots more data, but no fundamentally new concepts or insights regarding SNP associations and complex traits.  In fairness, though, he does update his slides -- he adds photos of the latest places he has traveled.  Looking forward to photos of Crimea this week.)

The 2000 paper was called, "How many diseases does it take to map a SNP?"  They began:
There are more than a few parallels between the California gold rush and today's frenetic drive towards linkage disequilibrium (LD) mapping based on single-nucleotide polymorphisms (SNPs). This is fuelled by a faith that the genetic determinants of complex traits are tractable, and that knowledge of genetic variation will materially improve the diagnosis, treatment or prevention of a substantial fraction of cases of the diseases that constitute the major public health burden of industrialized nations. Much of the enthusiasm is based on the hope that the marginal effects of common allelic variants account for a substantial proportion of the population risk for such diseases in a usefully predictive way. A main area of effort has been to develop better molecular and statistical technologies often evaluated by the question: how many SNPs (or other markers) do we need to map genes for complex diseases? We think the question is inappropriately posed, as the problem may be one primarily of biology rather than technology.
Today, the emphasis is more on ever larger sample sizes to find rare alleles, since common alleles turned out not to be the magical answer, but the issues are the same.  The problem is biological, rather than one of sample size.  And not only do we have at least as much causal complexity due to environmental factors, but comparably complex 'genetic' causal factors have been added to the mix, such as epigenetic modification of DNA affecting gene expression, and the potential contributions of the highly complex microbiome.

The idea of mapping diseases from SNPs is that markers will be near the disease allele.  But, there are problems with this, as GWAS are successfully showing.
If traits do not strongly predict underlying genotypes, that is, if P(GP|Ph) is small, linkage and LD mapping may have very low power or may not work at all. As an extreme example, one's genotype cannot be reliably determined by merely stepping on the bathroom scale! But even if this could be done, there is a widespread but invalid belief that because something can be mapped (that is, P(GP|Ph) is high), the causal predictive power of the genotype (P(Ph|GP); Fig 1, blue arrow) will also be high. In fact, we have surprisingly little data on this latter topic, which requires extensive sampling from the general population, rather than patients. Note that the opposite can also be untrue—that is, if P(Ph|GP) is high it does not mean P(GP|Ph) will be high, as in genetically heterogeneous mendelian disorders such as retinitis pigmentosa. It is important to note that when we speak of P(Ph|GP) in this context, we speak of the marginal mode of inheritance, which is only valid for consideration of singletons, and relatives will not have independent and identically distributed penetrances (even without assuming epistasis or gene-environment interactions) because the other genetic and environmental factors are also correlated among them! Similar arguments can be made about detectance, P(GP|Ph), which must always be a function of the ascertainment, something that is often overlooked in the literature when investigators make comparisons of power for different study designs
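
The asymmetry between detectance, P(GP|Ph), and penetrance, P(Ph|GP), can be made concrete with Bayes' theorem.  What follows is only a rough sketch, not anything taken from the paper: the function name, parameter names and all of the numbers are made up for illustration, and it assumes a single genotype class and affected singletons sampled at random from the population (the paper stresses that detectance always depends on ascertainment, so a different sampling scheme would give different answers).

```python
def detectance(penetrance, genotype_freq, background_risk):
    """Return P(GP|Ph) from P(Ph|GP) using Bayes' theorem.

    All three inputs are hypothetical illustration values:
      penetrance      -- P(Ph|GP), risk of the trait in carriers
      genotype_freq   -- P(GP), population frequency of the genotype
      background_risk -- P(Ph|not GP), risk in non-carriers
                         (other loci, environment, and so on)
    """
    # Total probability of the phenotype in the population.
    p_ph = penetrance * genotype_freq + background_risk * (1 - genotype_freq)
    # Bayes: P(GP|Ph) = P(Ph|GP) * P(GP) / P(Ph)
    return penetrance * genotype_freq / p_ph


# Invented scenario 1: a rare, highly penetrant genotype for a trait that
# many other causes also produce (a heterogeneous disorder).
rare_high_penetrance = detectance(
    penetrance=0.9, genotype_freq=0.0001, background_risk=0.002
)

# Invented scenario 2: a common genotype with weak penetrance, for a trait
# that is very rare in non-carriers.
common_low_penetrance = detectance(
    penetrance=0.05, genotype_freq=0.1, background_risk=0.0005
)

print(f"penetrance 0.90 -> detectance {rare_high_penetrance:.3f}")
print(f"penetrance 0.05 -> detectance {common_low_penetrance:.3f}")
```

With these invented numbers, the first scenario has a penetrance of 0.9 but a detectance of only about 0.04, while the second has a penetrance of 0.05 but a detectance of about 0.92.  Being able to map something and being able to predict from it really are different questions.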

 Figure 1. Schematic model of trait aetiology.
The phenotype under study, Ph, is influenced by diverse genetic, environmental and cultural factors (with interactions indicated in simplified form). Genetic factors may include many loci of small or large effect, GPi, and polygenic background. Marker genotypes, Gx, are near to (and hopefully correlated with) genetic factor, Gp, that affects the phenotype. Genetic epidemiology tries to correlate Gx with Ph to localize Gp. Above the diagram, the horizontal lines represent different copies of a chromosome; vertical hash marks show marker loci in and around the gene, Gp, affecting the trait. The red Pi are the chromosomal locations of aetiologically relevant variants, relative to Ph.
Other inconvenient biological issues mentioned in the paper include the following: linkage disequilibrium is stochastic, which has implications for the use of SNPs in disease mapping; regulatory rather than protein-coding sites often affect disease risk, and these are generally impossible to identify (see below); late-onset chronic diseases are much more complex than the clearly genetic pediatric diseases; the most effective disease mapping and association studies are done in "selective samples of individuals or families at high risk relative to the average risk in the population, and from populations with unusual histories" (hence, Joe's eclectic and interesting travelogue); etiology tends to be very heterogeneous; phenotype can't predict genotype and vice versa; environmental effects can be significant, but are unpredictable and often impossible to identify; and so on.

Whole genome sequencing will not be a general miracle cure.  Exome sequencing can sometimes find coding variants that have strong effects because we know how to identify exomes and how to read their code.  But many if not most mapped sites for complex traits, as might be expected, are in regulatory regions.  Yet we are still quite inept at identifying regulatory regions, for many reasons, not least their complexity and fluidity among individuals and populations.  So whole genome sequence data will likely have to be analyzed using markers, as in GWAS, and that will not automatically show us where key regulatory effects are located or how they work.  If these effects are too heterogeneous, they'll vary hugely, so that mapping will still face the complexity problem.  Time will tell what transpires.
The problems faced in treating complex diseases as if they were Mendel's peas show, without invoking the term in its faddish sense, that 'complexity' is a subject that needs its own operating framework, a new twenty-first rather than nineteenth—or even twentieth—century genetics.
So, although the data are better and less costly now than 15 years ago, the basic issues haven't changed.
