Thursday, October 13, 2016

Genomic causation....or not

By Ken Weiss and Anne Buchanan

The Big Story in the latest Nature ("A radical revision of human genetics: Why many ‘deadly’ gene mutations are turning out to be harmless," by Erika Check Hayden) is that genes thought to be clearly causal of important diseases aren't always (the link is to the People magazine-like cover article in that issue).  This is a follow-up on an August Nature paper describing the database from which the results discussed in this week's Nature are drawn.  The apparent mismatch between a gene variant and a trait can be, according to the paper, the result of technical error, of a mis-call by a given piece of software, or of the assumption that identifying a given mutation in affected but not healthy individuals means the causal mutation has been found, without experimentally confirming the finding--which itself can be tricky for reasons we'll discuss.  Insufficient documentation of 'normal' sequence variation has meant that the population frequency of so-called causal mutations hasn't been available for comparison.  Again we'll mention below what 'insufficient' might mean, if anything.

People in general and researchers in particular need to be more than dismissively aware of these issues, but the conclusion that we still need to focus on single genes as causal of most disease, that is, do MuchMoreOfTheSame, which is an implication of the discussion, is not so obviously justified.   We'll begin with our usual contrarian statement that the idea here is being overhyped as if it were new, but we know that except for its details it clearly is not, for reasons we'll also explain.  That is important because presenting it as a major finding, and still focusing on single genes as being truly causal vs mistakenly identified, ignores what we think the deeper message needs to be.

The data come from a mega-project known as ExAC, a consortium of researchers sharing DNA sequences to document genetic variation and further understand disease causation, and now including data from approximately 60,000 individuals (in itself, rather small compared to what the purpose requires). The data are primarily exome sequences, that is, from protein-coding regions of the human genome, not from whole genome sequences, again a major issue.  We have no reason at all to critique the original paper itself, which is large, sophisticated, and carefully analyzed as far as we can tell; but the claims about its novelty are, we think, very much hyperbolized, and that needs to be explained.

Some of the obvious complicating issues
We know that a gene generally does not act alone.  DNA in itself is basically inert.  We've been and continue to be misled by examples of gene causation in which context and interactions don't really matter much, but that leads us still to cling to these as though they are the rule.  This reinforces the yearning for causal simplicity and tractability.  Essentially even this ExAC story, or its public announcements, doesn't properly acknowledge causal context and complexity because it is critiquing some simplistic single-gene inferences, and assuming that the problems are methodological rather than conceptual.

There are many aspects of causal context that complicate the picture, that are not new and we're not making them up, but which the Bigger-than-Ever Data pleas don't address:
1.  Current data are from blood-samples and that may not reflect the true constitutive genome because of early somatic mutation, and this will vary among study subjects,
2.  Life-long exposure to local somatic mutation is not considered nor measured, 
3.  Epigenetic changes, especially local tissue-specific ones, are not included, 
4.  Environmental factors are not considered, and indeed would be hard to consider,
5.  Non-Europeans, and even many Europeans, are barely included, if at all, though this is beginning to be addressed,
6.  Regulatory variation, which GWAS has convincingly shown is much more important to most traits than coding variation, is not included. Exome data have been treated naively by many investigators as if that is what is important, and exome-only data have been used as a major excuse for Great Big Grants that can't find what we know is probably far more important,
7.  Non-coding regions, non-regulatory RNA regions are not included in exome-only data,
8.  A mutation may be causal in one context but not in others, in one family or population and not others, rendering the determination that it's a false discovery difficult,
9.  Single gene analysis is still the basis of the new 'revelations', that is, the idea being hinted at that the 'causal' gene isn't really causal....but one implicit notion is that it was misidentified, which is perhaps sometimes true but probably not always so,
10.  The new reports are presented in the news, at least, as if the gene is being exonerated of its putative ill effects.  But that may not be the case, because if the regulatory regions near the mutated gene have no or little activity, the 'bad' gene may simply not be being expressed.  Its coding sequence could falsely be assumed to be harmless,
11.  Many aspects of this kind of work are dependent on statistical assumptions and subjective cutoff values, a problem recently being openly recognized,
12.  Bigger studies introduce all sorts of statistical 'noise', which can make something appear causal or can weaken its actual apparent cause.  Phenotypes can be measured in many ways, but we know very well that this can be changeable and subjective (and phenotypes are not very detailed in the initial ExAC database), 
13.  Early reports of strong genetic findings have well known upward bias in effect size, the finder's curse that later work fails to confirm.
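
The upward bias in item 13 (often called the winner's curse) can be illustrated with a small simulation. This is only a sketch under assumed, purely hypothetical numbers: a true effect of 0.1, a standard error of 0.1 in each study, and a reporting cutoff of z > 1.96. None of these values come from ExAC or any real study.

```python
import random
import statistics

def simulate_finders_curse(true_effect=0.1, std_error=0.1,
                           z_cutoff=1.96, n_studies=20000, seed=1):
    """Simulate many small studies of one weak effect (all values hypothetical).

    Returns the mean estimate over all studies, the mean estimate over only
    the studies that pass the significance cutoff, and how many passed.
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_studies)]
    # A study is "reported" only if its estimate clears the cutoff.
    reported = [e for e in estimates if e / std_error > z_cutoff]
    return statistics.mean(estimates), statistics.mean(reported), len(reported)

mean_all, mean_reported, n_reported = simulate_finders_curse()
print("true effect:             0.100")
print(f"mean of all estimates:   {mean_all:.3f}")
print(f"mean of reported 'hits': {mean_reported:.3f}  (n = {n_reported})")
```

Because only estimates above 0.196 clear the cutoff in this setup, the average reported effect necessarily exceeds the true value of 0.1, which is why later, larger studies tend to shrink or fail to confirm the first report.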

Well, yes, we're always critical, but this new finding isn't really a surprise
To some readers we are too often critical, and at least some of us have to confess to a contrarian nature.  But here is why we say that these new findings, like so many that are by the grocery checkout in Nature, Science, and People magazines, while seemingly quite true, should not be treated as a surprise or a threat to what we've already known--nor a justification of just doing more, or much more of the same.

Gregor Mendel studied fully penetrant (deterministic) causation.  That is, he studied what we now know to be 'genes', in which the presence of the causal allele (in 2-allele systems) always caused the trait (green vs yellow peas, etc.; the same is true of recessive as of dominant traits, given the appropriate genotype). But as a general picture this is wrong, save at best for exceptions such as those that Mendel himself knowingly and carefully chose to study.  But even this was not so clear!  Mendel has been accused of 'cheating' by ignoring inconsistent results. This may have been data fudging, but it is at least as likely to have been a reaction to what we have known for a century as 'incomplete penetrance'.  (Ken wrote on this a number of years ago in one of his Evolutionary Anthropology columns.)  For whatever reason--and see below--the presence of a 'dominant' gene or 'recessive' homozygosity at a 'causal' gene doesn't always lead to the trait.

For most of the 20th century the probabilistic nature of real-world as opposed to textbook Mendelism was completely known and accepted.  The reasons for incomplete penetrance were not known and indeed we had no way to know them as a rule.  Various explanations were offered, but statistical inferences (estimates of penetrance probability, for example) were common practice and textbook standards.  Even the original authors acknowledge incomplete penetrance, but this essentially shows that what the ExAC consortium is reporting are details, and nothing fundamentally new or surprising.  Clinicians or investigators acting as if a variant were always causal should be blamed for gross oversimplification, and so should hyperbolic news media.

Recent advances such as genomewide association studies (GWAS) in various forms have used stringent statistical criteria to minimize false discovery.  This has led to mapped 'hits' that satisfied those criteria only accounting for a fraction of estimated overall genomic causation.  This was legitimate in that it didn't leave us swamped with hundreds of very weak or very rare false positive genome locations.  But even the acceptable, statistically safest genome sites showed typically small individual effects and risks far below 1.0. They were not 'dominant' in the usual sense.  That means that people with the 'causal' allele don't always, and in fact do not usually, have the trait.  This has been the finding for quantitative traits like stature and qualitative ones like presence of diabetes, heart attack-related events, psychiatric disorders and essentially all traits studied by GWAS. It is not exactly what the ExAC data were looking at, but it is highly relevant and is the relevant basic biological principle.

This does not necessarily mean that the target gene is not important for the disease trait, which seems to be one of the inferences headlined in the news splashes.  This is treated as a striking or even fundamental new finding, but it is nothing of that sort.  Indeed, the genes in question may not be falsely identified, but may very well contribute to risk in some people under some conditions at some age and in some environments.  The ExAC results don't really address this because (for example) to determine when a gene variant is a risk variant one would have to identify all the causes of 'incomplete penetrance' in every sample, but there are multiple explanations for incomplete penetrance, including the list of 1 - 13 above as well as methodological issues such as those pointed out by the ExAC project paper itself.

In addition, there may be 'protective' variants in the other regions of the genome (that is, the trait may need the contribution of many different genome regions), and working that out would typically involve "hyper astronomical" combinations of effects using unachievable, not to mention uninterpretable, sample sizes--from which one would have to estimate risk effects of almost uncountable numbers of sequence variants.  If there were, say, 100 other contributing genes, each with their own variant genotypes including regulatory variants, the number of combinations of backgrounds one would have to sort through to see how they affected the 'falsely' identified gene is effectively uncountable.
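
To give a rough sense of scale for the 100-gene example above, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each contributing gene has a single two-allele variant (so three possible genotypes) and that the genes vary independently; real genes carry many more variants, so this simplification undercounts.

```python
def background_combinations(n_genes, genotypes_per_gene=3):
    """Count distinct multi-gene backgrounds under the simplifying assumption
    of one biallelic site per gene (3 genotypes: AA, Aa, aa)."""
    return genotypes_per_gene ** n_genes

combos = background_combinations(100)
world_population = 8_000_000_000  # rough figure, used only for scale

print(f"{combos:.2e} possible genomic backgrounds")
print(f"{combos / world_population:.2e} backgrounds per living person")
```

Even under this simplified assumption the count is on the order of 10^47, which is finite but far beyond anything a sample of people could ever cover.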

Even the most clearly causal genes such as variants of BRCA1 and breast cancer have penetrance far less than 1.0 in recent data (here referring to lifetime risk; risk at earlier ages is very far from 1.0). The risk, though clearly serious, depends on cohort, environmental and other mainly unknown factors.  Nobody doubts the role of BRCA1 but it is not in itself causal.  For example, it appears to be a mutation repair gene, but if no (or not enough) cancer-related mutations arise in the breast cells in a woman carrying a high-risk BRCA1 allele, she will not get breast cancer as a result of that gene's malfunction.

There are many other examples of mapping that identified genes that even if strongly and truly associated with a test trait have very far from complete penetrance.  A mutation in HFE and hemochromatosis comes to mind: in studies of some Europeans, a particular mutation seemed always to be present, but if the gene itself were tested in a general data base, rather than just in affected people, it had little or no causal effect.  This seems to be the sort of thing the ExAC report is finding.
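
The logic of checking a 'causal' variant against a general database can be sketched with Bayes' rule: penetrance, P(disease | carrier), equals P(carrier | disease) x P(disease) / P(carrier). The numbers below are hypothetical and chosen only to show the arithmetic; they are not estimates for HFE, BRCA1, or any ExAC variant.

```python
def penetrance(prevalence, carrier_freq_in_cases, carrier_freq_in_population):
    """P(disease | carrier) by Bayes' rule.

    Raises ValueError if the inputs are mutually inconsistent, i.e. they
    would imply a probability above 1.
    """
    p = carrier_freq_in_cases * prevalence / carrier_freq_in_population
    if p > 1.0:
        raise ValueError("inputs imply a penetrance above 1; check frequencies")
    return p

# Hypothetical values, for illustration only.
prevalence = 0.005                 # 1 in 200 people have the disease
carrier_freq_in_cases = 0.80       # 80% of patients carry the variant
carrier_freq_in_population = 0.05  # 5% of a general database carries it

p = penetrance(prevalence, carrier_freq_in_cases, carrier_freq_in_population)
print(f"penetrance = {p:.2f}")
```

With these made-up inputs, a variant seen in most patients still leads to disease in only about 8% of carriers, because it is far more common in the general population than the disease itself.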

The generic reason is again that genes, essentially all genes, work only in their context. That context includes 'environment', which refers to all the other genes and cells in the body and the external or 'lifestyle' factors, as well as age and sex.  There is no obvious way to identify, evaluate or measure the effects of all possibly relevant lifestyle effects, and since these change, retrospective evaluation has unknown bearing on future risk (the same can be said of genomic variants for the same reason).  How could these even be sampled adequately?

Likewise, volumes of long-existing experimental and highly focused results tell the same tale. Transgenic mice, for example, in which the same mutation is introduced into their 'same' gene as in humans, very often show little or no, or only strain-specific, effects.  This is true in other experimental organisms. The lesson, and it's by far not a new or very recent one, is that genomic context is vitally important, that is, it is person-specific genomic backgrounds of a target gene that affect the latter's effect strength--and vice versa: that is, the same is true for each of these other genes. That is why to such an extent we have long noted the legerdemain being foisted on the research and public communities by the advocates of Big Data statistical testing.  Certainly methodological errors are also a problem, as the Nature piece describes, but they aren't the only problem.

So if someone reports some cases of a trait that seem too often to involve a given gene, such as the Nature piece seems generally to be about, but searches of unaffected people also occasionally find the same mutations in such genes (especially when only exomes are considered), then we are told that this is a surprise.  It is, to be sure, important to know, but it is just as important to know that essentially the same information has long been available to us in many forms.  It is not a surprise--even if it doesn't tell us where to go in search of genetic, much less genomic, causation.

Sorry, though it's important knowledge, it's not 'radical' nor dependent on these data!
The idea being suggested is that (surprise, surprise!) we need much more data to make this point or to find these surprisingly harmless mutations.  That is simply a misleading assertion, or attempted justification, though it has become the intentional industry standard closing argument.

It is of course very possible that we're missing some aspects of the studies and interpretations that are being touted, but we don't think that changes the basic points being made here.  They're consistent with the new findings but show that for many very good reasons this is what we knew was generally the case, that 'Mendelian' traits were the exception that led to a century of genetic discovery but only because it focused attention on what was then doable (while, not widely recognized by human geneticists, in parallel, agricultural genetics of polygenic traits showed what was more typical).

But now, if things are being recognized as being contextual much more deeply than in Francis Collins's money-strategy-based Big Data dreams, or 'precision' promises, and our inferential (statistical) criteria are properly under siege, we'll repeat our oft-stated mantra: deeply different, reformed understanding is needed, and a turn to research investment focused on basic science rather than exhaustive surveys, and on those many traits whose causal basis really is strong enough that it doesn't really require this deeper knowledge.  In a sense, if you need massive data to find an effect, then that effect is usually very rare and/or very weak.

And by the way, the same must be true for normal traits, like stature, intelligence, and so on, for which we're besieged with genome-mapping assertions, and this must also apply to ideas about gene-specific responses to natural selection in evolution.  Responses to environment (diet etc.) manifestly have the same problem.  It is not just a strange finding of exome mapping studies for disease. Likewise, 'normal' study subjects now being asked for in huge numbers may get the target trait later on in their lives, except for traits basically present early in life.  One can't doubt that misattributing the cause of such traits is an important problem, but we need to think of better solutions than Big Big Data, because not confirming a gene doesn't help, and finding that 'the' gene is only 'the' gene in some genomic or environmental backgrounds is the proverbial and historically frustrating needle-in-the-haystack search.  So the story's advocated huge samples of 'normals' (random individuals) cannot really address the causal issue definitively (except to show what we know, that there's a big problem to be solved).  Selected family data may--may--help identify a gene that really is causal, but even they have some of the same sorts of problems, and the result may apply only to that family.

The ExAC study is focused on severe diseases, which is somewhat like Mendel's selective approach, because it is quite obvious that complex diseases are complex.  It is plausible that severe, especially early onset diseases are genetically tractable, but it is not obvious that ever more data will answer the challenge.  And, ironically, the ExAC study has removed just such diseases from their consideration! So they're intentionally showing what is well known, that we're in needle in haystacks territory, even when someone has reported big needles.

Finally, we have to add that these points have been made by various authors for many years, often based on principles that did not require mega-studies to show.  Put another way, we had reason to expect what we're seeing, and years of studies supported that expectation.  This doesn't even consider the deep problems about statistical inference that are being widely noted and the deeply entrenched nature of that approach's conceptual and even material invested interests (see this week's Aeon essay, e.g.).  It's time to change, but doing so would involve deeply revising how resources are used--of course one of our common themes here on the MT--and that is a matter almost entirely of political economy, not science.  That is, it's as much about feeding the science industry as it is about medicine and public health.  And that is why it's mainly about business as usual rather than real reform.

1 comment:

Ken Weiss said...

One thing we did not mention is that in many exome-based study designs, the association with an exomic allele (coding variant) could be due to linkage disequilibrium, so that the effect is really due to a variant in nearby flanking DNA (which could be regulatory, or some other nearby gene, or something on the other DNA strand, and so on). So even if an exome data base incriminates an exon, it could really be reflecting something else, which may be unrelated to the purported causal gene. This is why exome sequencing can be seen as reflecting a misleading or careless understanding of the genome, or simply as an expedient tactic to keep the sequencers humming (and one knows from personal conversations that some investigators are quite aware of that, but knew among other things that whole genome sequencing is too costly and analytically problematic, especially in years when exome-only data was being generated, and that whole-genome sequence data can be impenetrably complex in its variation, even forgetting sequencing errors).