Tuesday, May 28, 2013

Who, me? I don't believe in single-gene causation! (or do I?). Part IV. Do we need the probabilistic hypothesis?

In the earlier posts in this series on the nature of genetic causation we showed that, while people would routinely say that they don't 'believe' in genetic determinism or single-gene causation, there are clear instances (that everyone recognizes) in which specific genetic variants do seem to have essentially deterministic effects relative to some outcome -- cystic fibrosis is one example, but there are many others.  The data aren't perfect -- there is always measurement noise and other errors, and prediction is rarely perfect -- but in this situation if an individual has a specific single-locus genotype s/he is essentially destined to have the trait it codes for.  Going back to Mendel's famous peas, there is a wealth of animal and plant data for such essentially perfect predictive genotypes.

Yet even there, there are subtleties, as we discussed in earlier installments.  Much of the time, the identified genotype at the 'causal' gene does not predict the outcome with anything resembling certainty.  There is usually evidence for factors that explain this other than lab errors, including variants in other genes, environmental exposures, and so on.

Since most traits seem to be complex far beyond these single-gene subtleties, and are affected by many genes (not to mention environmental exposures), genetics has moved to multilocus causal approaches, of which the most autopilot-like are the omics ones like GWAS (and expression profiling, microbiomics and many others).  These approaches rest on exhaustively enumerative technology and statistical (inherently and explicitly probabilistic) analysis to find associations that might, upon experimental follow-up, turn out to have a mechanistic causal basis (otherwise, the idea is that the association is a statistical fluke or reflects some unmeasured but correlated variable).  In this context, everyone sneers at the very idea of genetic determinism.  But might that attitude be premature, even for complex traits?

Last week we tried to explain how probabilistic causation comes into the picture. Essentially, each tested variable--each spot along the genome--is screened for variants that are more common in cases than in controls (or in individuals with larger rather than smaller values of some quantitatively measured trait; taller, higher blood pressure, etc.).  But the association is rarely complete: there are cases who don't have the test variant, and unaffected controls who do.  That is a strange kind of 'causation' unless, like quantum mechanics, we invoke fundamental probabilism--but what would be the evidence for that and what does it mean?
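To make 'rarely complete' concrete, here is a minimal sketch of what screening a single site amounts to. The counts are invented purely for illustration, and the function name `site_association` is ours, not part of any GWAS package.

```python
def site_association(case_carriers, case_total, control_carriers, control_total):
    """Summarize one site: carrier frequency in cases and controls, plus odds ratio.

    All inputs are hypothetical counts of individuals, not real data.
    """
    case_freq = case_carriers / case_total
    control_freq = control_carriers / control_total

    # 2x2 table: carriers vs non-carriers, cases vs controls
    a, b = case_carriers, case_total - case_carriers
    c, d = control_carriers, control_total - control_carriers

    # Odds ratio is undefined when a denominator cell is empty
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return case_freq, control_freq, odds_ratio


# Invented example: the variant is enriched in cases, yet most cases lack it
# and a fair share of unaffected controls carry it.
case_freq, control_freq, odds_ratio = site_association(300, 1000, 200, 1000)
print(f"carriers among cases:    {case_freq:.0%}")
print(f"carriers among controls: {control_freq:.0%}")
print(f"odds ratio:              {odds_ratio:.2f}")
```

The variant would be called 'associated' with the trait, but in this made-up sample 70% of cases do not carry it and 20% of controls do, which is exactly the strange kind of 'causation' at issue.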

The probability hypothesis in genetic causation
As the astronomer Pierre Laplace is famously quoted as saying to Napoleon when asked why his astronomical theory didn't invoke God's causal hand, "Sire, I had no need of that hypothesis." Likewise, we should ask whether, how, or where we have need for the probability hypothesis in genetic causation.  Is probabilism actually the best explanation for what we observe?  If so, how do we find it?

The usual conclusion is that a genetic variant that passes a statistical significance test (a deeply problematic notion in itself) has some truly causative effect, but one that is in some way probabilistic rather than deterministic.  We assign a 'risk' of the outcome to the presence of the risk factor.  These studies are sample surveys, and sample survey methods essentially assume probabilistic outcomes, so of course the results look that way.  But how can a risk factor be probabilistic? We know of some essentially random processes, like mutation, but how do genetic alleles act probabilistically?  One issue is that statistical analysis is fundamentally about repeatable observations... but what does that mean?

In our previous installment, we likened views that seem to be held about multi-genic causation that result from such studies to a conceptual extension of single-gene causal views, but now transformed into single-genometype causation.  But there are serious problems because genomewide, every individual we sample (or who has ever lived, or ever will live) is genomically unique!  And each person is either a case or a control, or has a specific blood pressure level, or whatever.  In that sense, in our actual data, we do not have probabilistic outcomes!  We have one, and only one, specific outcome for each individual.  Yet we don't want to confess to believing in genetic (or genomic) determinism, so we look at one site at a time and assign probabilistic risks to its variants, as if they were acting alone.

Last time we tried to explain serious problems we see with treating each site in the genome (as in, say, GWAS analysis) separately and then somehow trying to sum up the site-specific effects, each treated probabilistically, to get some personalized risk estimate.  We routinely and blithely overlook the important, usually predominant, effects of environmental factors (most of which are unknown or not accurately estimable).  But, forgetting that huge issue for the moment, maybe deterministic notions of causation are, surprisingly, not so far off the mark after all.

The case for determinism 
The problem is that in survey sample (epidemiological) research everyone's genotype is unique, so we're forced to use probability sampling methods that decompose the genome into separate units, treating it as a sack of loose variants that we can sort through and whose effects we can measure individually.  But to answer the basic questions about genomic epistemology, we do not need to rely so fundamentally on such approaches. Unlike in epidemiology, where each individual is genomically unique, we actually, even routinely, do have observations on essentially unrestricted numbers of replications of essentially the exact same genometype!  And that evidence shows a very high level of genetic predictive power, at least relative to given environments.  We have thousands upon thousands of inbred plants and animals, and they have a very interesting tale to tell.

A week or so ago we posted about the surprising level of trait variation due to environmental variation among inbred animals.  But in a standardized environment most inbred strains have characteristic traits that are manifest uniformly, or at least at frequencies far higher than a few percent.  This is true of simple inbred strains, or of strains based on selection by investigators for some trait, followed by inbreeding.  (We also have some suggestive but much less informative data on human twins, for which we have only two observations of each pair's unique genotype, and these are confounded by notoriously problematic environmental issues.)

Inbred strains provide close to true replications of genotype-phenotype observations.  There is always going to be some variation, but the fact that inbred strains can be very well characterized by their traits (in a standardized  environment) reveals what is essentially a very high level of genetic--or genometypic--determinism!  So, is the sneering at genetic determinism misplaced?

Informative exceptions
A trait that, even in a standard environment in inbred animals, only appears in a fraction of the animals is particularly interesting in this light.  Such a trait might perhaps aptly be described in probabilistic terms.   We have repeated observations, but only some positive outcomes.  If the environment is really being held constant, then such traits provide good reason to try to investigate what it is that adds the probabilistic variation.  Here, in fact, we actually have a focused research question, to which we might actually get epistemically credible answers, unlike the often total lack of such controllable focus in omic-scale analysis. 

So, when the Jackson Labs description of a strain says the mice 'tend to become hypertensive' or 'tend to lose hearing by 1 year of age', that hardly seems environmental, and we can ask what it might mean. Where might such outcome variation come from?  Somatic mutation and the continued influx of germline mutation over generations of the 'same' strain in the lab may account for at least some of it.  Unfortunately, to test directly for somatic mutational effects, one would have to clone many cells from each animal, and that would introduce other sources of variation. But there's a way around this, at least in part:  we can compare the fraction of animals showing the variable outcome among my lab's C57 mice to the fraction among the C57s in your lab; this gives observations on the inbred germline several generations apart, during which some mutations may have accumulated.  If the percentages are similar, they may represent proper, truly stochastic probabilities, and we could try to see why and where those effects arise.  If they differ, and our environments really are similar, then mutation would be a suspect.   Such effects could be evaluated by simulation and some direct experimental testing.  It may not be easy, but at least we can isolate the phenomenon, which is not the case in natural populations.
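As a rough illustration of the between-lab comparison, one could test whether the fraction of affected animals differs between two colonies of the 'same' strain. The sketch below uses a standard pooled two-proportion z-test built from the Python standard library only; the colony counts are invented, and a real analysis would also have to worry about small samples and about whether the two environments are truly comparable.

```python
import math


def two_proportion_z_test(affected_a, total_a, affected_b, total_b):
    """Two-sided z-test for equal proportions, using the pooled variance.

    Returns (fraction_a, fraction_b, z, p_value). Counts are hypothetical.
    """
    p_a = affected_a / total_a
    p_b = affected_b / total_b
    pooled = (affected_a + affected_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        # Both colonies are all-affected or all-unaffected: no difference to test
        return p_a, p_b, 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal deviate
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a, p_b, z, p_value


# Invented example: hypertensive animals in 'my' colony vs 'your' colony
frac_mine, frac_yours, z, p = two_proportion_z_test(32, 100, 29, 100)
print(f"my lab: {frac_mine:.0%}, your lab: {frac_yours:.0%}, z = {z:.2f}, p = {p:.3f}")
```

Similar fractions would be consistent with a stable, strain-specific probability; a clear difference, given similar environments, would point toward accumulated mutation as a suspect.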

Not an easy rescue from GWAS and similar approaches!
So, Aha!, you say, inbreeding shows us that genotypes do predict traits, and now all we have to do is use these mice to dissect the causal genes.  Won't that in a sense rescue GWAS approaches experimentally?  Unfortunately, not so!

Interestingly, and wholly consistent with what we've been saying, if you intercross just two inbred strains, each with its own unvarying alleles so that at most you have two different alleles at any spot in the genome (in those sites where the two strains differ), you immediately regenerate life-like complexity:  each animal differs, and the intercross population has a typical distribution of trait values (e.g., a normal distribution).  And if you want to identify the genotypic reasons for each animal's trait, you no longer have clonal replicates but must resort to statistical mapping methods involving the same kinds of false replicability assumptions, and you get the same kinds of complex probabilistic effects that we see in genomewide association studies in natural populations.  You don't find a neat set of simple additive effects that account for things.

In a sense, despite the simple and replicable genometypic determinism of the parental strains, each intercross offspring has its own unique genometype and trait. This means that the integrated complexity is built into the genometype and not as a rule dissectable into simple causative sites (though as in humans, the occasional strong effect may be found).

Yet, if you took each animal in this now-variable intercross population and inbred it (in effect, repeatedly cloned it), you would recapture, for each new lineage, its specific fixed and highly predictable phenotype!  And among these lineages derived from the varying intercross animals, with their individually unique genometypes, you would find the 'normal' distribution of trait values.  Each animal's genometype would be highly predictive and determinative, but among the set you would have the typical distribution of trait values in the intercross population.

That is, a natural population can be viewed as a collection of highly deterministic genometypes with trait effects distributed as a function of the population's reproductive history.
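A toy simulation can illustrate this picture. It is only a sketch under strong assumptions that are ours, not data: 40 unlinked sites at which the two parental strains differ, a made-up trait function with additive and pairwise interaction terms, a constant environment, and 'cloning' standing in for re-inbreeding each intercross animal.

```python
import random
import statistics

N_LOCI = 40  # hypothetical number of sites at which the two strains differ


def make_phenotype_function(rng, n_loci=N_LOCI):
    """Build a fixed, fully deterministic genometype -> trait mapping."""
    additive = [rng.gauss(0, 1) for _ in range(n_loci)]
    # Pairwise interaction (epistatic) terms between randomly chosen sites
    pairs = [(rng.randrange(n_loci), rng.randrange(n_loci), rng.gauss(0, 1))
             for _ in range(2 * n_loci)]

    def phenotype(genometype):
        value = sum(a * g for a, g in zip(additive, genometype))
        value += sum(w * genometype[i] * genometype[j] for i, j, w in pairs)
        return value

    return phenotype


def f2_individual(rng, n_loci=N_LOCI):
    """One intercross (F2) animal: count of strain-B alleles (0, 1 or 2) per site."""
    return tuple(rng.randint(0, 1) + rng.randint(0, 1) for _ in range(n_loci))


rng = random.Random(1)
phenotype = make_phenotype_function(rng)

strain_a = (0,) * N_LOCI  # parental strain A: no strain-B alleles anywhere
strain_b = (2,) * N_LOCI  # parental strain B: two strain-B alleles everywhere

# Replicate animals of an inbred strain share a genometype, hence a trait value
parental_b_traits = [phenotype(strain_b) for _ in range(5)]

# The intercross population: each animal has its own genometype and trait
f2 = [f2_individual(rng) for _ in range(500)]
f2_traits = [phenotype(g) for g in f2]

# 'Re-inbreed' (here simply clone) the first three intercross animals
clone_traits = [[phenotype(g) for _ in range(5)] for g in f2[:3]]

print("distinct genometypes among 500 intercross animals:", len(set(f2)))
print("trait mean and SD across the intercross:",
      round(statistics.mean(f2_traits), 2), round(statistics.pstdev(f2_traits), 2))
print("distinct trait values within each cloned lineage:",
      [len(set(c)) for c in clone_traits])
```

In this toy world every genometype is perfectly determinative, yet the population shows a spread of trait values, and nothing in the spread itself tells you which sites are responsible for any one animal's value.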

This strong evidence for genometypic determinism might seem to rekindle dreams of personalized genometype prediction, and rescue omics approaches, but it doesn't do that at all.  That's because in itself even a 2-way intercross mapping does not yield simple (or even much simplified) genomic answers for the individuals'  specific trait variants.

Another kind of data is revealing in this light -- transgenic animals: the same transgene typically has different effects in different inbred host strains (e.g., C57, DBA, FVB, ... mice), with the strain-specific effect largely fixed within each strain.  The obvious reason is unique sets of variants in each host strain's background genometype.

Is it time to think differently and stop riding the same horse?  There's no reason here to think anything will make it easy, but there are perhaps nuggets that can be seized and lead to informative research designs and new conceptual approaches. Perhaps we simply have to force ourselves to stop thinking of a genome as a sackful of statistical units, and think of it instead as an integrated whole.

At least, inbreeding reveals both the highly predictive, and hence plausibly determinative, power of genometypes, and provides enormous opportunity for clever research designs to explore genomic epistemology without having to rely on what are the basically counter-factual assumptions of statistical survey analysis.  We can try to work towards an improved actual theory of genetic causation.

But we can't dream of miracles.  Environments could vary, even in the lab as they do in life, and in this sense all bets are off, a topic about which we've posted before.  There seems zero hope of exhaustive DNA enumeration and enumeration of all non-DNA factors, especially in human populations.

At least, things to think about
We hope we are not just luxuriating in philosophical daydreaming here.  There is a lot at stake in the way we think about causation in nature, both intellectually and practically in terms of public resources, public health, disease prediction, and so on. It's incumbent upon the field to wrestle with these questions seriously and not dismiss them because they are inconvenient to think about.

In some ways, the issues are similar to those involved in quantum physics and statistical mechanics, where probabilities are used and their nature often assumed and not questioned, the net result being what is needed rather than an enumeration of the state of each specific factor (e.g., pressure, in the case of the ideal gas law).  The human equivalent, if there is one, might be not to worry about whether causation is ultimately probabilistic or deterministic, nor to have to estimate each element's probabilistic nature specifically, but to deal with the population--with public health--rather than personalized predictions.  But this goes against our sense of individuality and the research sales pitch of personalized genomic medicine.

At least, people should be aware of the issues when they make the kinds of rather hyperbolic claims to the public about the miracles genomics is claiming to deliver, based on statistical survey epistemology.  Genomes may be highly predictive and determinative, at least in specific environments, but the lack of repeatable observations in natural populations raises serious questions about the meaning or usefulness of assuming genetic determinism.
