We interrupt our series on Probability and its meaning, for a post that we've been asked to write, related to a new paper on gene hunting that got some press last week, and will be stirring up controversy (and naturally, we can't resist including our usual editorializing, for better or worse):
Making complexity simple (again)?
Every age and every profession has its PT Barnums. They're the slick-talking, fast-moving guys who will say anything to draw customers into the show they're running. Truth, if it even exists other than ambiguously, is secondary in many ways to closing the deal.
One tactic that the pitchmen use in fields like politics is to change definitions so that problems never get 'solved' (that is, there is always something to keep you in office or to keep levying taxes, or to keep you afraid of some enemy or other). This is a way of making the same facts serve new interests.
Science is rife with these kinds of self-interested maneuvers. Changing the goalposts in genetics means redefining the objectives or the criteria for pushing ahead even in the face of contrary evidence. This is a system we've built, step by step, once government funding largesse started flowing some time during the Cold War. And today, in genetics, which plays on fear of death just as much as preachers do, we have to keep passing the plate. Yet science is supposed to be objective and 'evidence based'. So we have to change the goalposts to keep the game from ending. In this case, there is a recent paper, with promise of more to come, by Eric Lander and colleagues (Zuk et al.). Because of his prominence, skill, and salesmanship this will of course get a lot of attention.
The paper discusses at least one reason why GWAS have not been very good at accounting for heritability (something we ourselves have commented on in many posts). Some critics say that this paper finally shows that GWAS and related big-scale approaches are proliferating even though their own results show that they've reached diminishing returns. Here's an example. Of course, there's also the resistance, which says no and cries Bravo! to the new paper for showing that we're just getting started!
Naturally, everyone shares an interest in saying that supporting their favorite view requires mega-GWAS, Biobank, gobs of whole-genome sequence, or other similar open-ended approaches, whether or not this will lead to the ultimate small-scale objective of personalized 'genomic' medicine. Far too many vested interests are at play and, indeed, training in 'grantsmanship' and the whole research culture are about manipulating the system to get, keep, and increase funding.
Zuk et al. ask why GWAS have failed to account for the heritability of so many traits, that is, to explain the correlation in risk among family members that reflects genetic effects. The subject is complex, but here we can just say that if each gene adds a dose of effect to a trait, like nibbles on a given side of the caterpillar's mushroom added to Alice's stature, then even if the effect of nibbles is very small, if we sample enough mushrooms we can identify all of the effects. Then, knowing them, we can tell which nibbles a given person has made, and hence predict his stature: Voila! personalized medicine!
Yet it hasn't worked that way. Most heritability remains unaccounted for despite already very large studies. Hence the demand for ever larger studies. A glib commentary in Nature a few years ago coined the term 'hidden heritability' and made the search to find it akin to Sherlock Holmes' search for Moriarty. That was a fantastic, if anti-science, marketing ploy on Nature's part, since it fed an ever-increasing demand for funds for genomic scavenger hunting....and that's good for the science journal business!
But the search for this hidden treasure has been frustrating, and Zuk et al. claim they now know that the search is in vain, and they provide a very sophisticated mathematical account of the reason why. It is due to gene interactions, or 'epistasis'. That means that a large part of the correlation among relatives can't be found by looking only for additive effects. Here's roughly a basic underlying concept:
If trait T, such as stature or insulin levels (or their disease-risk consequences) is due to the effects of factor A plus those of factor B, then we can write
T = A + B
If your genotype includes a variant A-gene that gives you an additional level of factor A, then for you
T = A + A + B, or 2A + B.
But if the factors interact, say in a multiplicative way, then
T = AxB
and if that's what's going on, and your A-gene genotype adds a second dose of A, your trait is
T = (2A) x B = 2AB
So, let's say a normal person has A=3 trait units and B=4 trait units. In the additive case that person's trait would be A+B=7. And if you have a mutation that doubles your A-dose, then your genotype makes you 2A+B=10 for your trait value. But in the epistasis case, your trait would actually be 24. So we expect you to have trait value 10, and conclude that more than half your trait value is unexplained.
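The arithmetic in this example can be checked with a short Python sketch. The function names and the `a_dose` parameter are our own illustrative choices, not anything from Zuk et al.; the numbers (A=3, B=4, a doubled A-dose) are the ones used above.

```python
def additive_trait(a, b, a_dose=1):
    """Trait under the additive model: T = (a_dose * A) + B."""
    return a_dose * a + b


def multiplicative_trait(a, b, a_dose=1):
    """Trait under the interaction (epistasis) model: T = (a_dose * A) x B."""
    return (a_dose * a) * b


A, B = 3, 4  # trait units from the example in the text

baseline = additive_trait(A, B)                  # normal person, additive case
expected = additive_trait(A, B, a_dose=2)        # what an additive model predicts
actual = multiplicative_trait(A, B, a_dose=2)    # what the interaction produces
unexplained = (actual - expected) / actual       # share the additive model misses

print(baseline, expected, actual, round(unexplained, 3))
```

With these numbers the additive prediction is 10 against an actual value of 24, so 14 of the 24 trait units, a bit over half, go unexplained.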
Nobody disputes a central fact in the Zuk et al. argument. Life works by interactions among genes. These are functional interactions in systems called 'networks' and other sorts of molecular interactions. One molecule doesn't do anything by itself, nor does one gene as a rule. Here, each gene-related component is subject to variation by mutation, and that will be inherited (its effects contributing to heritability). So it is obvious from a cell biology point of view that single-factor explanations are not going to tell the whole story. But the fact of multiple factors doesn't tell the story we're interested in, which is predicting states among variable traits. That involves a different kind of interaction.
If the true fact is that biological traits are the result of functional interactions, what epidemiological risk estimation is about is not a list of pathways but the effects of variation in the pathways. When factors interact in a non-additive way, their net result is estimated from sample data using statistical techniques. The A and B example above showed conceptually how they work. The additive contributions of variants in factor A are estimated from samples that compare those with and those without the variant in question. You can estimate each factor's effects independently in this way and add up the estimates.
But if they interact, then in essence you have to have an additional estimate to make, of the average trait value in groups of people with each combination of the variants at the interacting factors.
Not only does this require more data, it won't show up in GWAS types of case-control or similar data. You need to look in other ways.
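As a rough illustration of that additional estimate, here is a minimal Python sketch using the A and B numbers from the example above. It assumes, purely for illustration, that carrying a variant doubles the level of its factor and that we already know the average trait value in each of the four carrier groups; the function names are hypothetical. The quantity computed is the standard two-by-two interaction contrast, which is zero when effects simply add.

```python
def cell_means(model, a=3, b=4):
    """Average trait for each (A-variant, B-variant) carrier combination.

    Assumption for illustration: carrying a variant doubles that factor's level.
    """
    return {
        (i, j): model(a * (1 + i), b * (1 + j))
        for i in (0, 1)
        for j in (0, 1)
    }


def interaction_contrast(means):
    """Difference between the double-carrier group and what the two
    single-carrier groups would predict if effects simply added."""
    return means[(1, 1)] - means[(1, 0)] - means[(0, 1)] + means[(0, 0)]


additive = cell_means(lambda x, y: x + y)
multiplicative = cell_means(lambda x, y: x * y)

print(interaction_contrast(additive))        # no interaction to estimate
print(interaction_contrast(multiplicative))  # nonzero: an extra term is needed
```

The point of the sketch is only that the contrast needs the mean of every combination group, which is exactly the extra data demand described above.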
Zuk et al. address this. They build on the idea that much of the observed heritability is estimated on a purely additive model, and yet at least some factors may have what in standard biochemical terms are known as 'rate limiting' effects. At some level of concentration of such factors, they or what they interact with no longer work the same way, if they work at all. The authors outline a model which, under various assumptions about how many steps are rate-limiting such that variants in those steps define measures of heritability, might begin to explain familial correlations not accounted for by current additive effects.
It is already nigh impossible to get stable estimates of hundreds of additive effects, mostly very small (see our current series of posts on Probability does not exist!). It's one thing to estimate the effects of, say, 100 additively contributing genes. Variants will be many in each, variable in their frequencies among samples. If purely additive, the context doesn't matter: in each population you can estimate each factor's effect and add 'em up to get a given person's risk. But if context matters, that is, if the effect of one factor depends on the specific variants in the rest of the genome (forgetting environmental effects!) then it's quite another to estimate those interactions. Roughly, if 100 genes interact in pairwise fashion (2-way interactions like AxB only), that means 4,950 distinct gene pairs, so about 5,000 interaction effects to estimate. Zuk et al. certainly acknowledge this problem clearly. But the authors suggest kinds of data on relatives of various degrees that might be practicably collected and could reveal discrepancies from expectations under additive models, and account for more or all of the heritability.
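The size of the estimation problem can be counted directly. In this small Python sketch (the helper name is ours) each unordered set of genes is counted once, so 100 genes give 4,950 pairs; a figure of 10,000 corresponds to 100 x 100, which counts each pair twice and pairs every gene with itself. The sketch treats each gene as a single unit, which understates things, since each gene carries many variants.

```python
from math import comb


def n_interaction_terms(n_genes, order):
    """Number of distinct gene sets of a given size (2 = pairs like AxB)."""
    return comb(n_genes, order)


pairs = n_interaction_terms(100, 2)    # 2-way interactions like AxB
triples = n_interaction_terms(100, 3)  # 3-way interactions like AxBxC

print(pairs, triples)
```

Going from pairs to three-way interactions multiplies the number of terms by more than thirty, which is why higher-order interactions get out of hand so quickly.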
Zuk et al. promise that if we study 'isolated' populations we'll have a shot at the answer! This is not new, and indeed studies in Finland and Iceland led a previous charge for similar reasons. Perhaps it is a good idea....and the authors provide tests that could be done if there were adequate data collected. It will be done and will expend more millions of research dollars in the process. But, if complexity is real, the most we'll get is a few hits and a few weak statistical signals. We're in that place already, so in that sense this is another way of changing the goalposts, because we did not reap a bonanza from those studies.
Like most such papers, Zuk et al. make gobs of assumptions about the model, the data, and the underlying basis of dichotomous disease (present or absent). Facing such complex problems it's hard or impossible not to make simplifying or exemplifying assumptions. As usual, if one probes these assumptions, there will be additional sources of variation and uncertainty that are being ignored or that must be estimated from data, so that even the rosy answers suggested by the authors, which are far from promising a complete understanding, will be overstated (and that is the clear-cut history of the senior author, and most of his peers in the profession, including Francis Collins).
The actual answer is well known, even if it's resisted because it's not very convenient: life is a mixture, or spectrum, of all of these kinds of components. We know that additive models work very well in many domains, such as agricultural and experimental breeding (artificial selection), and that as GWAS sample sizes have increased steadily they have steadily been identifying more contributing genes. That is, we have not plateaued at a point where bigger samples add no more genes and the remaining recalcitrant effect must be due to interaction. We think even the authors acknowledge that the interactions may comprise sets of individually weak contributors.
Further, because most variants in our population are in fact, and indisputably, rare to very rare, they form a large aggregate of potentially contributing genes that will vary from sample to sample and population to population. This is, really, a fact of life.
Networks, the sugar plum fairies of promised genomic medical miracles, involve tens of genes, many with multi-way interactions, like AxBxC. And what of higher-order interactions such as the square of one or more factor levels? One might as well count stars in heaven as attempt to collect accurate data from the entire human species in order to get enough data to estimate these effects. And that, of course, is foolish for many reasons, not least being that environmental factors change and vary and heritability is a measure of genetic relative to environmental effects.
Zuk et al. are promising a series of papers along this track, and have coined a new name for their idea (a standard marketing ploy). We'll be hearing a lot of me-tooism, users avidly diving into the LP (limiting pathway) model. That certainly doesn't make the model wrong, but it does affect where the goalposts are. We're hopefully not hypocrites: we have our own simulation program, called ForSim (freely available to others) by which some of these things could be simulated, and we may do that.
Nobody can seriously question that context is very important, and that includes various kinds of interactions. But the issue is not just what the mix of types of contribution is, how stable it is, how variable it is among samples, and so on. Even if Zuk et al. are materially correct, it doesn't erase the problematic nature of trying to estimate, much less generally to do much about, the joint effects of the many genes, in their tens or hundreds, whose variation contributes to a trait of interest.
"Discovery efforts should continue vigorously"
We ourselves are far from qualified to find technical fault with the new model, if there is any. We doubt there is. But the point is not that this new paper is flim-flam, even if it simplifies and makes many assumptions that could be viewed as perhaps-necessary legerdemain, given the situation's complexity. Or, perhaps more clearly, red-herrings to distract attention from the real point, which is whether changing the goalposts in this kind of way changes the game.
One can ask--should ask--whether regardless of interaction or additivity, it's worth trying to document them, or whether Francis Collins' insistence on luxury medicine (personalized prediction of weak effects with lifelong treatment mainly available to paying customers) is a realizable goal. The same funds could, after all, be spent in other ways on indisputably stronger and simpler genetic problems, that could be far more directly and sooner relevant to the society that's paying the bill.
Arguing either/or (additive or non-additive) and attempting to relate that to the desirability of keeping the Big Study funds flowing is a carnival barker activity. The authors at one point subtly make the anti-Collinsian but obvious point that personalized gene-based prediction is generally going to be a bust. But then they argue let's plow ahead to discover pathways. Let's have our goalposts everywhere at once! There are other, better, logistically easier and probably less costly ways to find networks and if there aren't already, there ought to be research invested in figuring them out (e.g., in cell cultures). Epidemiological studies are expensive and of low payoff in this kind of context.
As the authors clearly say, "discovery efforts should continue vigorously" despite their points, acknowledging in fact, and cleverly hedging all bets, that many variants remain to be discovered by current approaches. This is a fairground where PT Barnums thrive. It keeps their particular circus's seats filled. But that doesn't make it the best science in the public interest.