Thursday, June 19, 2014

Correlation, cause and how can you tell?

A very good BBC radio program called More or Less tries to point out uses and, mainly, misuses of statistics and data analysis, especially where policy or political claims and the like are involved.  A recent program (June 9) addressed the problem of confusing correlation with causation.  This is something everyone should know about, scientist or not.  Just because two things are seen together does not mean that one causes the other.

The program mentioned an entertaining web site, Spurious Correlations.  It's run by Tyler Vigen, a Harvard law student, and it makes the correlation/causation problem glaringly obvious.  There is even a feature for finding your own spurious correlation.

Figure: Number of people who died by becoming tangled in their bedsheets correlates with total revenue generated by skiing facilities (US).  Source: Spurious Correlations

How to tell if a correlation is spurious is no easy matter.  If many things are woven together in nature, or society, they can change together because of some over-arching shared factor.  But just because they are found together does not mean that in any practical sense they are causally related.  Calling a correlation statistically significant, meaning that it is unlikely to have arisen by chance, rests on a subjective judgment about what you count as 'unlikely.'  After all, very unlikely events do occur!
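To make the 'shared factor' idea concrete, here is a minimal simulation sketch.  Everything in it is an assumption for illustration: the variable names (z for the hidden shared factor, x and y for the two things we observe), the noise levels, and the sample size are invented, not taken from any real data.  Neither x nor y affects the other, yet they come out strongly correlated; once the linear effect of z is removed from both, the correlation should largely vanish.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical hidden factor that drives both observed variables.
z = rng.normal(size=n)

# x and y each depend on z plus independent noise; neither depends on the other.
x = z + 0.5 * rng.normal(size=n)
y = z + 0.5 * rng.normal(size=n)

def residual(v, factor):
    """Return what is left of v after subtracting a straight-line fit on factor."""
    slope, intercept = np.polyfit(factor, v, 1)
    return v - (slope * factor + intercept)

# Correlation as observed, with the shared factor ignored.
raw_r = np.corrcoef(x, y)[0, 1]

# Correlation after removing the linear effect of z from both variables.
partial_r = np.corrcoef(residual(x, z), residual(y, z))[0, 1]

print(f"raw correlation between x and y:   {raw_r:.2f}")
print(f"correlation after adjusting for z: {partial_r:.2f}")
```

Of course this only works because we measured z.  In real observational data the shared factor is often unmeasured or unknown, which is exactly the difficulty.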

Causation can be indirect, and it is not always easy to understand just what it means for one thing to 'cause' another.  If wealth leads to buying racy cars and racy cars are less safe, is it the driving, the car, or the wealth that 'causes' accidents?  If AIDS can be a result of HIV infection, but you can get some symptoms of AIDS without HIV, or have HIV without the symptoms, does HIV cause AIDS?  If the virus is indeed responsible, but only drug users or hookers get or transmit it, is the virus, the drug use, the prostitution, or using prostitutes the 'cause'?

Another problem, besides just the way of thinking of causation, and the judgment about when a correlation is 'significant', relates to what you measure and how you search for shared patterns.  If you look through enough randomly generated patterns, eventually you'll find ones that are similar, with absolutely no causal connection between them.
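The point about searching through enough patterns can be tried directly.  The sketch below is only an illustration under assumptions of my own choosing: 200 independent random walks, each 10 points long (roughly like a decade of yearly figures), with a fixed seed; the function name is made up.  By construction no series has anything to do with any other, yet the best-matching pair among the nearly 20,000 possible pairs should be very strongly correlated.

```python
import numpy as np

def most_correlated_pair(n_series=200, n_points=10, seed=0):
    """Generate independent random walks and return the pair with the largest |r|."""
    rng = np.random.default_rng(seed)
    # Each row is one random walk: the running sum of independent normal steps.
    series = rng.normal(size=(n_series, n_points)).cumsum(axis=1)
    corr = np.corrcoef(series)      # pairwise correlations between rows
    np.fill_diagonal(corr, 0.0)     # ignore each series' correlation with itself
    i, j = np.unravel_index(np.argmax(np.abs(corr)), corr.shape)
    return int(i), int(j), float(corr[i, j])

i, j, r = most_correlated_pair()
print(f"series {i} and series {j} have correlation {r:.3f}")
```

Trending series such as random walks make the effect especially easy to produce, which may be one reason that yearly totals of unrelated things so often line up.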

Looking at the examples on the above web site should be a sobering lesson in how to recognize bad, or overstated, or over-reported science.  It won't by itself answer the question about how to determine when a correlation means causation.  Nobody has really solved that one, if indeed it has any sort of single answer.  And there are some curious things to think about.

Just what is causation?
In a purely Newtonian, deterministic universe, in a sense everything is causally connected to everything else, and was determined by the Big Bang.  For example, with universal gravity everything literally affects everything else and through a totally deterministic causal process.

In that sense nothing at all is truly probabilistic.  But quantum mechanics and various related principles of physical science hold that some things may really be truly probabilistic rather than deterministic.  If that is right, then the idea of a 'cause' becomes rather unclear.  How can some outcome truly occur only with some probability?  It verges on an effect without an actual cause.  For example, if the probability of something happening is, say, 15%, what establishes that value--what causes it?  A systematic or random process with a truly random cause, that is not just our inability to measure it precisely, in a sense redefines the very notion of cause.  Such things, some of them seemingly true of the quantum world, really do violate common sense.  So the whole idea of correlation vs causation takes on many different, subtle colors.


John R. Vokey said...

There has been some advancement on how to think about causality: check out Judea Pearl's book, Causality.

Ken Weiss said...

Nice to hear from you. Several smart people have recommended Pearl's book and I have a copy and have tried to read it. But I find it too formalized and dense, and based on statistical modeling derived from Wright's path analysis, which I think assumes too much parametric regularity for many situations. But again I confess I cannot follow it beyond the first couple of chapters. His 'do' calculus seemed as if it was going to go somewhere, but as I remember, he basically drops it or at least his use isn't clear to me.

As to regularity, if factors are all measured, one might work out their network properties etc. But unmeasured factors are essentially buried in the 'noise' and, to me worse, their future prevalence and that of new factors not present in the analysis of data, undermine the analysis. So, again to me and perhaps I'm very wrong, it's a way of data fitting as a definition of causality, rather than building on a substantive theory.

Have I got it wrong?

Manoj Samanta said...

Causality: a chapter by chapter review

Manoj Samanta said...

and the book itself -

John R. Vokey said...

There is a simple reason Pearl emphasizes Simpson's paradox in his *causal* analyses: it is the quintessential test case. If the conditions of Simpson's paradox are as universal as they appear to be (i.e., we can never know what we have collapsed over in non-experimental research), then non-experimental research is always fundamentally and irremediably flawed. Now, that is not a conclusion most of us would otherwise object to. We have always said so, but have otherwise paid lip-service to it (non-experimental research) being important in some otherwise poorly or not at all well articulated ``other ways''.

I routinely hear that correlations from observational research are ``at least suggestive'' (and, thereby, meritorious), when, statistically, we know that such claims are without merit even with respect to sign (subject to Cornfield's conditions).

Beyond that, if you want to draw (limited) causal conclusions, manipulate something. Otherwise, recognize and acknowledge the causal limitations of the conclusions otherwise drawn, and do not elide the sharp and impassable distinction between them. I am a fan of Peirce's abduction approach to science, but it is not a licence for sloppy causal reasoning.

Ken Weiss said...

I have to go back and read about Peirce again, as I keep forgetting his points. One problem I guess is that when we can't experiment we have to rely on observational data and have to decide how to design data-acquisition procedures (e.g., random sampling). Since causation should under many or most conditions lead to correlation, I guess we're stuck; many correlation studies do succeed in identifying cause.

I think acknowledging limitations is a major problem, if not the major problem. It allows us to keep doing what we're doing even if it isn't working, rather than thinking harder about the designs or even the questions.

In my own view the most successful sciences have had some theory, derived by a mix of data and reasoning, to apply to new data. That is an external application and works very well when we have such a theory. In much of genetics and evolutionary biology, we simply don't have a good enough theory, and must rely (hopefully) on internal comparison (statistical significance of compared hypotheses, case-controls, etc.). That makes us vulnerable to Simpson's paradox and much beyond.

We set up rules (e.g., a specified p-value cutoff--something we've written about before), but if we don't get what we want, we call our results 'suggestive' and keep on applying for grants on that basis, rather than thinking more clearly about both the idea of cutoff values and of doing things differently.

Even manipulating variables is way less powerful in these areas than is usually acknowledged. Experimental biology often assumes far more control than is actually the case.