
Friday, December 19, 2014

Survivorship bias and genetics

I was a mathematics major as an undergraduate.  However, neither then nor since have I been anything one could call a mathematician.  At least, I hope I learned something about trying to think logically about life, even if I never do equations.  That interest led me to read a new book I was told of, called How Not to Be Wrong: The Power of Mathematical Thinking, by Jordan Ellenberg (2014, NY, Penguin Press).

This is a popular rather than technical book, but it shows in interesting and serious ways how mathematical thinking can lead to improved understanding of the real world.  I think it has relevance to an important area in current evolutionary and biomedical or agricultural genetics.  So I thought I'd write a post about it.

Survivorship bias
Ellenberg begins his book with an illustration of how abstract logical thinking can solve important real-world problems in subtle ways.  In WWII a mathematics research group was asked by the Army to help decide where to place armor plating on its fighter aircraft.  The planes were returning to base with scattered bullet holes from enemy fire, and the idea was to put some protective plating where it would do the most good without adding cumbersome, mileage-eating weight.  The mathematician suggested putting the plating where the bullet holes weren't.  This seemed strange until he explained that the bullet holes that were observed hadn't done much damage: bullets hitting elsewhere had brought the plane down, so those hits were never observed because the plane never returned to base.  The engine compartment was the case in point: a shot to the engine was fatal to the aircraft, but to the wings and body, much less so.
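The logic can be made concrete with a small simulation.  This is purely illustrative: the section names, loss probabilities, and sortie count are invented assumptions, not historical data.

```python
import random

random.seed(1)

# Toy version of the armor-plating story: each plane takes one hit, uniformly
# at random over sections, but engine hits usually bring the plane down.
# Section names and loss probabilities are invented for illustration.
SECTIONS = ["engine", "fuselage", "wings"]
P_DOWN = {"engine": 0.8, "fuselage": 0.1, "wings": 0.05}

observed = {s: 0 for s in SECTIONS}
for _ in range(10_000):                 # 10,000 sorties
    hit = random.choice(SECTIONS)       # enemy fire is uniform over sections
    if random.random() > P_DOWN[hit]:   # the plane survives and returns...
        observed[hit] += 1              # ...so its bullet hole gets counted

# Among returning planes, engine holes are the rarest, precisely because
# engine hits are the most lethal: the observable holes are the harmless ones.
print(observed)
```

The counting step only ever sees survivors, which is exactly the bias in the story: the distribution of holes on returning planes is the true distribution of hits filtered through survival.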

This is a case of survivorship bias.  It can apply widely, and evolution and genetic causation provide instances where it seems likely to be a useful principle.  As geneticists we ask which genes' variation causes variation in adaptive or biomedically interesting outcomes.  This is what genome mapping in its various forms is intended to identify.

Ironically, it seems, when we do experiments involving development, or test genetic mechanisms by, say, knocking out a gene, or when we observe the major gene-usage switches that occur as some part of an embryo's body is forming, we can identify specific genes that seem to be very important.

Several pieces of evidence can suggest they are important.  One is the finding that the same gene is used in similar roles in very distantly related species (often even between humans and flies, or more distant species still): its usage has been conserved.  Secondly, there is usually far less variation within or between species in such genes than in what we believe to be non-functional or marginally functional parts of genomes, which seems to suggest that variation hasn't been tolerated by natural selection.  Thirdly, many congenital diseases in plants and animals, including humans, have proven to be due to the effects of variants, often newly arisen mutations, in a specific gene.  Most cases of diseases like Cystic Fibrosis, Phenylketonuria, Muscular Dystrophy, or Tay-Sachs Disease are of this sort.  Some congenital traits, like eye or skin color, are also due to inherited variants in a relatively small number of genes.

Such findings at least indirectly fueled the fervor for mapping every trait one can define, with grand promises of discovering the genes 'for' the trait.  Conscientious investigators justified expensive mapping efforts by showing that their trait of interest had substantial heritability; for example, trait values were to a substantial extent correlated among close relatives in predicted patterns.  However, for most traits, like diabetes, cancer, heart disease, or behavioral characteristics, single-gene findings of that sort are few and far between.

Despite a welter of PR spin to the contrary, instead of dramatic findings of the expected (and promised) sort, what was found was that the traits were affected by variation in tens, hundreds, or even thousands of different parts of the genome. Even taking all these together, they typically only accounted for a fraction--usually a small fraction--of the estimated heritability.

What is this 'missing heritability'?

Evolutionary survivorship bias
A central theoretical idea is that a fundamental genomic criterion for showing biological function is sequence conservation.  Most evolution is purifying: what has been put together over billions of years is risky to change.  So most mutations in clearly functional areas of DNA are either neutral or deleterious.  As a result, more variation accumulates in non- or weakly functional DNA than in important genes.

What that means is that the variation we see today misses what existed heretofore, and hence is not a representative sample of all the variation that arises.  The variation that does survive in conserved genes then looks to be of major importance, and the tendency is to assume that non-conserved areas of the genome are non-functional.  This may be true, but it may also be that our belief that conservation equals function is a corollary of a belief in strong Darwinian natural selection as the molder of traits.  In fact, most genomic variation is not of the highly conserved sort, but our analysis and explanation of functional genomics is biased by our predilection for ignoring less-conserved variation.

This can be seen as a kind of survivorship bias in that we assume that variation in non-conserved genome areas just doesn't survive for very long--that it isn't conserved because it has no function.  That's a kind of circular reasoning, and it has been highly contentious, for example, in interpreting the ENCODE project's objective to identify all functional elements in the genome, and in questions about whether selectively neutral variation exists at all.  The same conceptual bias leads to reconstructions of evolutionary adaptive history that center on the conserved genes as if they were the only genes involved.  Finally, important genes that were involved in a trait's evolution to its current state may no longer be involved, and hence not be considered, because their role did not survive to be identified today.

Biomedical survivorship bias
The same sort of bias in ascertaining the spectrum of causal variation exists on the shorter time scale of biomedical genetics.  There is a big discrepancy between the clearly key role, and deeply conserved nature, of genes identified in experimental and developmental genetics, and the general lack of 'hits' in those same genes when genomewide mapping is done on the traits they affect.

How can a gene be central to the development of the basis of a trait, and yet not be found in mapping to identify variation that causes failures of the trait?  Indeed, the basic finding of GWAS and most other mapping approaches is that the tens or hundreds or thousands of genome 'hits' have individually trivial effects.

The answer may lie in survivorship bias.  Like the lethality of bullets to the engine of a fighter, most variation in the main genes, those whose sequence is most highly conserved, is lethal to the embryo or manifests in pathology so clear that it is never the subject of case-control or other sorts of Big Data mapping.  In other words, genome mapping may be inevitably constrained to find small effects!  That's exactly the opposite of what's been promised, and the reason is that the promises were, psychologically or strategically, based on extrapolation from the findings of strong, single-gene effects causing severe pediatric disease--a legacy of Mendel's carefully chosen two-state traits.
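A sketch of that constraint, with an invented effect-size distribution and an invented exclusion threshold: variants whose effects exceed the threshold are assumed to be lethal or clinically obvious, and so never enter a mapping sample.

```python
import random

random.seed(3)

# True effect sizes drawn from a long-tailed distribution (an assumption);
# any variant whose effect exceeds THRESHOLD is taken to be embryonic-lethal
# or clinically obvious, and therefore never appears in a GWAS sample.
effects = [random.expovariate(1.0) for _ in range(100_000)]
THRESHOLD = 3.0

sampled = [e for e in effects if e < THRESHOLD]

print(max(effects))                  # large effects do exist in the population
print(max(sampled))                  # but the study never sees one above the cutoff
print(sum(sampled) / len(sampled))   # so the mean observed effect is small
```

The study's effect-size distribution is the true distribution truncated from above, so the mapping can only ever report modest effects, no matter how large the sample.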

To the extent this is a correct understanding, then genomewide mapping as it's now being done is, from an evolutionary genomic perspective, necessarily rainbow-chasing.  Indeed, a possibility is that most adaptive evolution is itself also due to the effects of minor variants, not major ones.  Once the constraining interaction of the major genetic factors is in place, mostly what can nudge organisms in this direction or that, whether adaptively or in relation to complex, non-congenital disease, is based on assembled effects of individually very minor variants.  In turn, that could be why slow gradualism seemed so obviously the way evolution worked to Darwin, and why it generally still seems that way today.

Survivorship bias is a kind of misunderstanding of statistics and sampling that careful reasoning can illuminate.  It is so easy to collect biased samples, and so hard to do otherwise, and consequently so easy to make convenient but erroneous inferences.  Science is a complex business and it's an unending challenge to do it right--even to know when we are doing it right!

Monday, January 16, 2012

Changing the goal posts: heritability lost and found

We interrupt our series on Probability and its meaning for a post that we've been asked to write, related to a new paper on gene hunting that got some press last week and will be stirring up controversy (and naturally, we can't resist including our usual editorializing, for better or worse):

Making complexity simple (again)?
Every age and every profession has its PT Barnums.  They're the slick-talking, fast-moving guys who will say anything to draw customers into the show they're running.  Truth, if it even exists other than ambiguously, is secondary in many ways to closing the deal.

One tactic that the pitchmen use in fields like politics is to change definitions so that problems never get 'solved' (that is, there is always something to keep you in office or to keep levying taxes, or to keep you afraid of some enemy or other).  This is a way of making the same facts serve new interests.

Science is rife with these kinds of self-interested maneuvers.  Changing the goalposts in genetics means redefining the objectives or the criteria for pushing ahead even in the face of contrary evidence.  This is a system we've built, step by step, once government funding largesse started flowing some time during the Cold War.  And today, in genetics, which plays on fear of death just as much as preachers do, we have to keep passing the plate.  Yet science is supposed to be objective and 'evidence based'.  So we have to change the goalposts to keep the game from ending.  In this case, there is a recent paper, with promise of more to come, by Eric Lander and colleagues (Zuk et al.).  Because of his prominence, skill, and salesmanship this will of course get a lot of attention.

The paper discusses at least one reason why GWAS have not been very good at accounting for heritability (something we ourselves have commented on in many posts).  Some critics say that this paper finally shows that GWAS and related big-scale approaches are proliferating even though they have themselves demonstrated diminishing returns.  Here's an example.  And of course there's the opposite reaction, which says no, Bravo! to the new paper, which shows that we're just getting started!

Naturally, everyone shares an interest in saying that to settle their favorite view we need mega-GWAS, Biobank, gobs of whole-genome sequence, or other similarly open-ended approaches, whatever that implies for the ultimate small-scale objective of personalized 'genomic' medicine.  Far too many vested interests are at play; indeed, training in 'grantsmanship' and the whole research culture is about manipulating the system to get, keep, and increase funding.

Zuk et al. ask why GWAS have failed to account for the heritability of so many traits, that is, to explain the correlation in risk among family members that reflects genetic effects.  The subject is complex, but here we can just say that if each gene adds a dose of effect to a trait, like nibbles on a given side of the caterpillar's mushroom added to Alice's stature, then even if the effect of nibbles is very small, if we sample enough mushrooms we can identify all of the effects.  Then, knowing them, we can tell which nibbles a given person has made, and hence predict his stature:  Voila!  personalized medicine!

Yet it hasn't worked that way.  Most heritability remains unaccounted for despite already very large studies.  Hence the demand for ever larger studies.  A glib commentary in Nature a few years ago dubbed this the 'missing heritability' and made the search for it akin to Sherlock Holmes' pursuit of Moriarty.  That was a fantastic, if anti-science, marketing ploy on Nature's part, since it fed an ever-increasing demand for funds for genomic scavenger hunting....and that's good for the science journal business!

But the search for this hidden treasure has been frustrating, and Zuk et al. claim they now know why the search has been in vain, providing a very sophisticated mathematical account of the reason: gene interactions, or 'epistasis'.  That means that a large part of the correlation among relatives can't be found by looking only for additive effects.  Here, roughly, is a basic underlying concept:

If trait T, such as stature or insulin levels (or their disease-risk consequences) is due to the effects of factor A plus those of factor B, then we can write

T = A + B

If your genotype includes a variant A-gene that gives you an additional level of factor A, then for you

T = A + A + B,   or 2A + B.

But if the factors interact, say in a multiplicative way, then

T = AxB

and if that's what's going on, and your A-gene genotype adds a second dose of A, your trait is

T = (2A) x B =  2AB

So, let's say a normal person has A=3 trait units and B=4 trait units.  In the additive case that person's trait would be A+B=7.  And if you have a mutation that doubles your A-dose, then your genotype makes you 2A+B=10 for your trait value.  But in the epistasis case, your trait would actually be (2x3)x4=24.  So we expect you to have trait value 10, observe 24, and conclude that more than half your trait value is unexplained.
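The arithmetic can be checked directly; a few lines of Python, using exactly the numbers in the example:

```python
# The text's numbers: A = 3 trait units, B = 4, and a mutation doubling the A dose.
A, B = 3, 4

additive_normal = A + B          # 7: a 'normal' person under the additive model
additive_mutant = 2 * A + B      # 10: what the additive model predicts for the mutant
epistatic_mutant = (2 * A) * B   # 24: what the multiplicative interaction produces

unexplained = epistatic_mutant - additive_mutant
print(additive_mutant, epistatic_mutant)   # 10 24
print(unexplained / epistatic_mutant)      # about 0.58: over half is 'missing'
```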

Functional interaction
Nobody disputes a central fact in the Zuk et al. argument: life works by interactions among genes.  These are functional interactions, in systems called 'networks' and other sorts of molecular relationships.  One molecule doesn't do anything by itself, nor, as a rule, does one gene.  Each gene-related component is subject to variation by mutation, and that variation will be inherited (its effects contributing to heritability).  So it is obvious from a cell biology point of view that single-factor explanations are not going to tell the whole story.  But the fact of multiple factors doesn't by itself tell the story we're interested in: predicting states among variable traits.  That involves a different kind of interaction.

Quantitative interaction
If the true fact is that biological traits are the result of functional interactions, what epidemiological risk estimation is about is not a list of pathways but the effects of variation in those pathways.  When factors interact in a non-additive way, their net result is estimated from sample data using statistical techniques.  The A and B example above showed conceptually how these work.  The additive contribution of variants in factor A is estimated from samples that compare those with and those without the variant in question.  You can estimate each factor's effects independently in this way and add up the estimates.

But if they interact, then in essence you have an additional estimate to make: the average trait value in groups of people with each combination of variants at the interacting factors.

Not only does this require more data, it won't show up in GWAS types of case-control or similar data.  You need to look in other ways.

Zuk et al. address this.  They build on the idea that much of the observed heritability is estimated under a purely additive model, and yet at least some factors may have what in standard biochemical terms are known as 'rate limiting' effects.  At some concentration of such a factor, it, or whatever it interacts with, no longer works the same way, if it works at all.  The authors outline a model which, under various assumptions about how many steps are rate-limiting and how variants in those steps contribute to measured heritability, might begin to explain familial correlations not accounted for by current additive effects.
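A minimal sketch of the rate-limiting idea, taking the trait to be set by the minimum of several pathway levels; the pathway count and values here are invented for illustration, not Zuk et al.'s exact setup.

```python
# A trait set by its limiting pathway: only the lowest pathway level matters,
# so a variant's effect depends on whether its pathway is currently limiting.
# Pathway values are invented for illustration.
def trait(pathway_levels):
    return min(pathway_levels)

baseline = [5.0, 8.0, 9.0]
print(trait(baseline))            # 5.0: the first pathway is limiting

boosted = [5.0, 12.0, 9.0]        # a big variant effect in a NON-limiting pathway
print(trait(boosted))             # still 5.0: the variant is invisible to the trait

relieved = [7.0, 8.0, 9.0]        # a smaller variant effect in the limiting pathway
print(trait(relieved))            # 7.0: now the effect shows
```

This is why such a model generates non-additive familial correlations: the same variant can have a large effect in one genomic background and none in another, depending on which pathway is limiting.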

It is already nigh impossible to get stable estimates of hundreds of additive effects, mostly very small (see our current series of posts, Probability does not exist!).  It's one thing to estimate the effects of, say, 100 additively contributing genes.  Each will harbor many variants, variable in their frequencies among samples.  If effects are purely additive, the context doesn't matter: in each population you can estimate each factor's effect and add 'em up to get a given person's risk.  But if context matters, that is, if the effect of one factor depends on the specific variants in the rest of the genome (forgetting environmental effects!), then it's quite another thing to estimate those interactions.  If 100 genes interact in pairwise fashion (2-way interactions like AxB only), that means nearly 5,000 interaction effects to estimate--one for each of the 100x99/2 pairs of genes.  Zuk et al. certainly acknowledge this problem clearly.  But the authors suggest kinds of data on relatives of various degrees that might be practicably collected, could reveal discrepancies from expectations under additive models, and could account for more or all of the heritability.
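The bookkeeping on the estimation burden is simple combinatorics:

```python
import math

# Estimation burden for G genes: G main effects, plus one interaction term
# for every unordered pair of genes if all 2-way interactions are allowed.
G = 100
main_effects = G
pairwise_interactions = math.comb(G, 2)   # G * (G - 1) / 2

print(main_effects, pairwise_interactions)   # 100 4950
```

Allowing 3-way interactions (AxBxC) would add math.comb(100, 3), over 160,000 more terms, which is why the estimation problem explodes so quickly.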

Zuk et al. promise that if we study 'isolated' populations we'll have a shot at the answer.  This is not new; indeed, studies in Finland and Iceland led a previous charge for similar reasons.  Perhaps it is a good idea, and the authors provide tests that could be done if adequate data were collected.  It will be done, and will expend more millions of research dollars in the process.  But if complexity is real, the most we'll get is a few hits and a few weak statistical signals.  We're in that place already, so in that sense this is another way of changing the goalposts, because we did not reap a bonanza from those earlier studies.

The answer
Like most such papers, Zuk et al. make gobs of assumptions about the model, the data, and the underlying basis of dichotomous disease (present or absent).  Facing such complex problems it's hard or impossible not to make simplifying or exemplifying assumptions.  As usual, if one probes these assumptions, there will be additional sources of variation and uncertainty that are being ignored or that must be estimated from data, so that even the rosy answers suggested by the authors, which are far from promising a complete understanding, will be overstated (and that is the clear-cut history of the senior author, and most of his peers in the profession, including Francis Collins).

The actual answer is well known, even if it's resisted because it's not very convenient: life is a mixture, or spectrum, of all of these kinds of components.  We know that additive models work very well in many domains, such as agricultural and experimental breeding (artificial selection), and that as GWAS sample sizes have increased they have steadily identified more contributing genes.  That is, we have not plateaued at a point where bigger samples add no more genes, leaving only a recalcitrant remainder due to interaction.  We think even the authors acknowledge that the interactions may comprise sets of individually weak contributors.

Further, because most variants in our population are in fact, and indisputably, rare to very rare, they form a large aggregate of potentially contributing genes that will vary from sample to sample and population to population.  This is, really, a fact of life.

Networks, the sugar plum fairies of promised genomic medical miracles, involve tens of genes, many with multi-way interactions, like AxBxC.  And what of higher-order interactions, such as the square of one or more factor levels?  One might as well count stars in heaven as attempt to collect accurate data from the entire human species in order to estimate these effects.  And that, of course, is foolish for many reasons, not least because environmental factors change and vary, and heritability is a measure of genetic relative to environmental effects.

Zuk et al. are promising a series of papers along this track, and have coined a new name for their idea (a standard marketing ploy). We'll be hearing a lot of me-tooism, users avidly diving into the LP (limiting pathway) model.  That certainly doesn't make the model wrong, but it does affect where the goalposts are. We're hopefully not hypocrites: we have our own simulation program, called ForSim (freely available to others) by which some of these things could be simulated, and we may do that.

Nobody can seriously question that context is very important, and that includes various kinds of interactions.  But the issue is not just what the mix of types of contribution is, how stable it is, how variable among samples, and so on.  Even if Zuk et al. are materially correct, it doesn't erase the problematic nature of trying to estimate, much less generally to do much about, the joint effects of the many genes, in their tens or hundreds, whose variation contributes to a trait of interest.

"Discovery efforts should continue vigorously"
We ourselves are far from qualified to find technical fault with the new model, if there is any.  We doubt there is.  But the point is not that this new paper is flim-flam, even if it simplifies and makes many assumptions that could be viewed as perhaps-necessary legerdemain, given the situation's complexity.  Or, perhaps more clearly, as red herrings distracting attention from the real point, which is whether changing the goalposts in this kind of way changes the game.

One can ask--should ask--whether regardless of interaction or additivity, it's worth trying to document them, or whether Francis Collins' insistence on luxury medicine (personalized prediction of weak effects with lifelong treatment mainly available to paying customers) is a realizable goal.  The same funds could, after all, be spent in other ways on indisputably stronger and simpler genetic problems, that could be far more directly and sooner relevant to the society that's paying the bill.

Arguing either/or (additive or non-additive), and attempting to relate that to the desirability of keeping the Big Study funds flowing, is a carnival barker activity.  The authors at one point subtly make the anti-Collinsian but obvious point that personalized gene-based prediction is generally going to be a bust.  But then they argue: let's plow ahead to discover pathways.  Let's have our goalposts everywhere at once!  There are other, better, logistically easier, and probably less costly ways to find networks, and if there aren't already, there ought to be research invested in figuring them out (e.g., in cell cultures).  Epidemiological studies are expensive and of low payoff in this kind of context.

As the authors clearly say, "discovery efforts should continue vigorously" despite their points, thereby acknowledging, and cleverly hedging all bets on, the fact that many variants remain to be discovered by current approaches.  This is a fairgrounds where PT Barnums thrive.  It keeps their particular circus's seats filled.  But that doesn't make it the best science in the public interest.