Yesterday we discussed notions of determinism and complex causation, and the denial that every self-respecting scientist is quick to make: that s/he doesn't believe in deterministic causation. They don't 'believe' it because, in fact, most of the time and with some possible exceptions in the physical sciences, determinism is at best inaccurate, and to an unknown extent. In biology, knowing a putatively causal factor doesn't lead to anything close to perfectly accurate prediction. Yet of course we believe this is a causal rather than mystic world, and naturally want to identify 'causes', especially of things we wish to avoid, so that we can avoid them. Being only human, accepting often rather unquestioningly the methods of science, and for various other reasons, we tend to promise miracles to those who will pay us to do the research to find them.
It is annoying to some readers of MT that we're constantly harping on the problem of determinism (and of exhaustive rather than hypothesis- or theory-driven approaches to inferring causation). Partly this annoyance is to be expected, because our criticisms do challenge vested interests. And partly it arises because if people had better ideas (even granting that our points are cogent), they might be trying them (if they thought they were fundable in our very competitive funding system).
But largely, as a scientific community, we don't seem to know what better to do. The issues about causation in this context are very deep, and we think not enough people are thinking seriously enough about them, for various pragmatic and other reasons. They aren't easy to understand. If we're right, then the critical test or experiment, or truly new concept, just doesn't seem to be 'in the air' the way, say, evolution was in Darwin's time, or relativity in Einstein's.
Yesterday we described various issues having to do with understanding 'single-gene' diseases like cystic fibrosis or familial breast cancer, or smoking-induced lung cancer. Our point was that causation isn't as simple as 'the' gene for a disease, or 'the' environmental risk factor. Indeed, instead of deterministic, billiard-ball-like classic cause-effect views, we largely take a sample-based estimation approach, and what we generally find is that (1) multiple factors--genetic or environmental--are associated with different proportions of the outcome in those exposed compared to those not exposed to a given factor, (2) few factors by themselves account for the outcome in every exposed person, (3) multiple different factors are found to be associated with a given outcome, and yet (4) the known factors do not account for all the instances of that outcome.
In the case of genetics, our main interest, there are reasons to think that inherited factors cause our biological traits, normal as well as disease-related, but when we attempt to identify any such factors by our omics approaches (such as screening the entire genome to find them), most of the apparent inherited causation remains unidentified. So we have at least two different kinds of issue: first, causation seems probabilistic rather than deterministic, and second, the methods themselves, rather than the causal landscape, affect or even determine what we will infer. For example, when we use methods of statistical inference, such as significance testing, we essentially define influences that don't reach significance in our data as not being real.
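To make that last point concrete, here is a minimal simulation sketch (our own construction, with entirely hypothetical risks and sample sizes, not data from any real study): a variant with a genuinely real but modest effect on risk can easily fail a conventional significance cutoff in a modest sample, and by the usual convention it is then treated as if it had no effect at all.

```python
# Hypothetical sketch: a real but small risk difference tested with a
# simple two-proportion z-test; failing the cutoff 'defines' it as not real.
import math
import random

random.seed(1)

def two_prop_z(p1, n1, p2, n2):
    """z statistic for the difference between two sample proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se > 0 else 0.0

n = 500                                  # carriers and non-carriers sampled
risk_carrier, risk_other = 0.12, 0.10    # a real, modest difference in risk
cases_carrier = sum(random.random() < risk_carrier for _ in range(n))
cases_other = sum(random.random() < risk_other for _ in range(n))

z = two_prop_z(cases_carrier / n, n, cases_other / n, n)
significant = abs(z) > 1.96              # conventional 5% two-sided cutoff
print(z, significant)  # often not significant, despite the real effect
```

The effect does not vanish from the world when `significant` comes back False; it only vanishes from our inference.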
Even though we have believable evidence that the collection of a great many minor factors accounts, somehow, for the bulk of the outcomes of interest, we keep investing large amounts of resources in approaches that are not getting us very far. It is this that, we think, justifies continual attempts to press people to think more creatively, and at least to spend resources in ways that might have a better chance of inspiring somebody. We try not to go overboard in voicing our comments, but of course that's what a blog is for--a forum for opinion, hopefully rooted in fact.
What the current research approach yields, most of the time, is not a statistical association between factor and outcome that can then be shown to be deterministic in a billiard-ball mechanistic sense--that is, a statistical exploration that leads to a 'real' kind of cause. That does happen, and we mentioned some instances yesterday, but it's far from the rule. The general finding is some sort of fuzzy probability that associates the presence of the putative causal factor--say, a given allele at a gene--with an outcome, such as a disease. So the explanation, which we think it is fair to characterize as a retreat, is to provide what sounds like a knowing answer: that the cause really is 'probabilistic'. But what does this actually mean, if in fact it isn't just blowing smoke? It is in this context that, we think, the explanations are in a relevant sense in fact deterministic, even if expressed in terms of probabilities or 'risk'.
And besides, everybody knows that causation is probabilistic! So there!
Now what on earth does it mean to say that causation is probabilistic?
In technical terms, one can invoke random processes and, often (sometimes not in a self-flattering way), relate this to the apparently inherent probabilistic nature of quantum physics. In my amateurish way, I'd say that the latter replaced classical billiard-ball notions of causation, which were purely deterministic (if you had perfect measurements), with truly probabilistic causation that holds even with perfect measurement.

Usually, though, those who invoke the everyone-knows sense of causation being probabilistic are referring to our imperfect identification or measurement of contributing factors. But then, to make predictions based on genotypes, we need to know the 'true' probability of an outcome given both the genotype and the degree of imperfect measurement. The latter is estimated from data, but almost always in a ceteris paribus sense--that is, the probability of cancer given a particular BRCA1 mutation is p, averaging over the whole range of other genetic and environmental variants that carriers of the observed genetic variant are individually exposed to.
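That averaging can be made concrete with a tiny sketch (hypothetical numbers of our own invention): if carriers of the variant live against different genetic and environmental backgrounds, each with its own risk, the single p reported for the variant is just a weighted average over those backgrounds.

```python
# Hypothetical sketch: the reported ceteris-paribus p for a variant is a
# weighted average of background-specific risks among carriers.
backgrounds = [
    # (fraction of carriers with this background, risk of outcome there)
    (0.70, 0.10),   # most carriers: low-risk background
    (0.25, 0.40),   # some carriers: moderate-risk background
    (0.05, 0.80),   # a few carriers: high-risk background
]

p_average = sum(weight * risk for weight, risk in backgrounds)
print(p_average)  # the single 'probability of the outcome given the variant'
```

No individual carrier actually faces risk `p_average`; it is a property of the mixture, not of any one person's causal situation.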
But we must think carefully about where that p comes from. It is treated exactly as if we were flipping a coin, but with probability p of coming up Heads rather than a probability of 1/2. For quantitative traits like stature or blood pressure, some probability would be assigned to each possible trait value of a person carrying the observed genetic variant, as if these values were distributed in some fashion--say, like the usual bell-shaped Normal distribution--around the mean, or expected, effect on the trait for those who carry the variant. For example, her probability of being 5'3" tall might be .35, and of being 5'6" only .12.
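The Normal-distribution picture can be sketched in a few lines (our illustration, with made-up numbers: a mean carrier height of 64 inches and a standard deviation of 2.5 inches): the probability of any given height is read off as the probability mass the assumed curve assigns to an interval around that value.

```python
# Hypothetical sketch: treating a carrier's height as if drawn from a
# Normal distribution centered on the variant's expected effect.
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a Normal(mean, sd) up to x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def prob_in_interval(lo, hi, mean, sd):
    """Probability the trait value falls between lo and hi."""
    return normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)

mean_height, sd = 64.0, 2.5          # made-up values for carriers
p_near_5ft3 = prob_in_interval(62.5, 63.5, mean_height, sd)
p_near_5ft6 = prob_in_interval(65.5, 66.5, mean_height, sd)
print(round(p_near_5ft3, 2), round(p_near_5ft6, 2))
```

Everything here hinges on the assumed curve: the probabilities are only as real as the Normal model, and the fixed mean and spread, that we imposed on the data.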
Sounds fine. But how well known, or even how real, are such assumed, fixed, underlying probabilities?
Plausibility is not the same as truth
The probabilistic nature of smoking and lung cancer, or even BRCA1 variants and breast cancer, can be accounted for in meaningful ways related to the random nature of mutations in genes. Of course, we could be wrong even here, since there may be other kinds of causal processes, but at least we don't feel a need to understand every molecular event that may generate a mutation. We don't need to get into the molecular details or quantum mechanical vagaries to believe that we have a reasonable understanding of the association between BRCA1 mutations and breast cancer. In such situations, a probabilistic causation makes reasonable sense, even if it doesn't lead to a clean, closed, classically deterministic billiard-ball causal account.
But that is not the usual story. When we face what seems to be truly complex causation, in which many factors contribute, as in environmental epidemiology or GWAS-like omics studies, we face something different. What do we do in these situations? Generally we identify many different possible causal factors, such as points along the genome or dietary constituents, and estimate their individual statistical association with the outcome of interest (diabetes or Alzheimer's disease, etc.). First, we usually treat them as independent, and do our best to avoid being confused by confounding; for example, we only look at variants along the genome that are themselves not associated with each other by what is called linkage disequilibrium. Then we see, factor by factor, whether its variation is statistically associated with the outcome, as described earlier. We base this on a statistical cutoff criterion like a significance test (not our subject today, but again, see Jim Wood's earlier post on MT).
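The factor-by-factor scan can be caricatured in simulation (again our own sketch, with invented effect sizes and sample sizes, nothing from any actual GWAS): test many independent candidate factors one at a time against a cutoff, and keep whatever passes. Typically some real factors are missed and, with enough tests, a null one can slip through.

```python
# Hypothetical sketch of a factor-by-factor association scan: 50 independent
# candidate factors, the first 5 with a real (small) effect, tested one at a
# time against a conventional significance cutoff.
import math
import random

random.seed(2)

def two_prop_z(c1, n1, c2, n2):
    """z statistic comparing case counts in exposed vs. unexposed groups."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se > 0 else 0.0

n = 1000
base_risk = 0.10
effects = [0.03] * 5 + [0.0] * 45   # true added risk per factor

hits = []
for i, effect in enumerate(effects):
    cases_exposed = sum(random.random() < base_risk + effect for _ in range(n))
    cases_unexposed = sum(random.random() < base_risk for _ in range(n))
    if abs(two_prop_z(cases_exposed, n, cases_unexposed, n)) > 1.96:
        hits.append(i)

print(hits)  # the factors the scan 'finds'; compare with the true first five
```

The scan's output depends on the sample size, the cutoff, and the luck of the draw as much as on the underlying causal structure, which is part of the point we are making.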
Given a 'significant' result, we then estimate the effect of the factor in probabilistic ways, such as the probability of the outcome if you carry the genetic variant (for quantitative traits, like stature or blood pressure, there are different sorts of probabilistic effect-size estimates). Well, this certainly is different from billiard-ball determinism, you say. Isn't it a modern way to deal with causation, and doesn't that make it clear that careful investigators are in fact not asserting determinism or single-gene causation? What could be clearer?
In fact, we think that, conceptually, what is happening is basically an extension of single-cause, or even basically deterministic, thinking. One way to see this: if you identify a single risk factor, such as the CFTR gene, in which some known variants are very strongly associated with cystic fibrosis, then if you have a case of CF and you find a variant in the gene, you say that the variant caused the disease. But for many if not most such variants, the attribution rests on an assumption that the gene is causal and that therefore, in this particular instance, the variant was the cause. There are some rather slippery issues here, and they become even more slippery (though, we think, easier to see) when it comes to multiple-factor, including multi-gene, causation. We will deal with that tomorrow.