The causal assumption
A well-posed question can be about the fact of a numerical association, say between some specified variable and some outcome. Or, it can be about the reason for that association. Do we live in a causal universe? This may seem like a very stupid question, but it is actually quite profound. If we live in a causal universe, then everything we see must have a cause: there can be no effect, no observation, without an antecedent cause--a situation that existed before our observation and that is responsible for it.
Actually, some physicists argue from what is known of quantum mechanics (physical processes at the atomic or smaller scale) that there are, indeed, effects that do not have causes. One way to think of this, without straying as far from common sense as much of quantum mechanics does, is this: the world must be ultimately deterministic. Some set of conditions must make our outcome inevitable! If you knew those conditions, the probability of the effect would be 1 (100%).
In that sense, the word 'probability' has no real meaning. But if, to the contrary, outcomes were not inevitable, because things were truly probabilistic, then our attempts to understand nature would, or should, be different in some way. The usual idea of 'laws' of Nature is that the cosmos is deterministic, even if many things, like sampling and measurement, yield errors relative to our ability to infer the whole, exact truth. And 'chaos' theory says that even the tiniest errors, even in a truly deterministic system, can lead to uncontrollably or unknowably large errors.
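The chaos point is easy to make concrete. Here is a minimal sketch using the logistic map, a standard textbook example of deterministic chaos (the starting values and iteration count are chosen just for illustration):

```python
def logistic_map(x, r=4.0):
    """One fully deterministic step of the logistic map x -> r*x*(1-x)."""
    return r * x * (1.0 - x)

# Two trajectories whose starting points differ by one part in a billion --
# a 'measurement error' far smaller than any real study could achieve.
x_a, x_b = 0.4, 0.4 + 1e-9
max_gap = 0.0
for step in range(50):
    x_a, x_b = logistic_map(x_a), logistic_map(x_b)
    max_gap = max(max_gap, abs(x_a - x_b))

# The tiny initial error grows until the two trajectories are effectively
# unrelated, even though every single step is exactly deterministic.
print(f"largest gap over 50 steps: {max_gap:.3f}")
```

The system obeys a perfectly known deterministic law, yet an error of one part in a billion in the starting condition makes the outcome unpredictable within a few dozen steps.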
These ideas are fundamental when we think about our topics of interest in genetics, public health, and evolutionary biology.
The statistical gestalt
A widespread approach in these areas is to see how much juice we can squeeze out of a rock with ever-fancier statistical methods (helped along by the convenient software we have to play with). Whether the variables are genomic or environmental or both, we'll slice and dice them this way and that, and find the fine points of 'risk' (that is, retrospectively ascertained, already-achieved outcomes).
Now, one may think that there really is some 'juice' in the rock: that the world is truly deterministic, and that the juice we see is the causal truth masked by all the mismeasured and unmeasured variables, incomplete samples, and the other errors we fallible beings inevitably make. Our statistical results may not be good for prediction, or may be good but to an unknown extent; but if we think of the world as causal, then there are, for each instance of some trait, wholly deterministic conditions that made it as inevitable as sunrise. In defense of our hard-to-interpret methods and results, we'll say "it's the best we can do, and it will work at least for some sets of factors." And so let's keep on using our same statistical methods to squeeze the rock.
One may explain this view by saying that life is complex, and statistical sampling and inferential methods are how to address it. This could be, and certainly in part seems to be, because many issues of importance involve so many factors, interacting in so many different ways, that each individual is unique (even if causation is 100% deterministic). This raises fundamental issues as to how we could ever deliver the promised successes (people will still get sick, despite what Francis Collins and so many others implicitly promise). Perhaps what we're doing is not just the best we can do now; it's the best that can be done, because it's just how Nature is!
|Rococo mirror and stuccowork, Schloss Ludwigsburg, Stuttgart|
But we should keep in mind one aspect of the usual sorts of statistical methods, no matter how many rococo filigrees they may be gussied up with. These methods find associations (correlated presence) among variables, and if we can specify which occurred first, then we can interpret the earlier ones as causes and the later ones as effects. We can estimate the strength or 'significance' of the observed association. But we should confess openly that these are only what was captured in some particular sample. Even if they truly do reflect underlying causation, this can be so indirect as to be largely impenetrable (unless the cause is so strong that we can design experiments to demonstrate it, or find it with samples so small and cheap that our Chair will penalize us for not proposing a bigger, more expensive study).
Now, associations of this sort can be very indirect relative to the actual underlying mechanism. Why? The mix of factors--their relative proportions or frequencies--in our sample may affect the strength of association. An associated, putatively causal factor may be correlated with some truly causal, but unmeasured, factor. And the complexity of contexts may mean that causation only occurs in the presence of other factors (context-dependency).
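The second of these--a measured factor that merely tracks an unmeasured true cause--can be sketched in a few lines of simulation. All names and effect sizes below are invented for illustration: X has no causal effect on the outcome Y at all, yet the two correlate strongly because both are driven by the unmeasured factor U.

```python
import random

random.seed(1)

n = 10_000
u = [random.gauss(0, 1) for _ in range(n)]   # unmeasured true cause
x = [ui + random.gauss(0, 1) for ui in u]    # measured 'risk factor'
y = [ui + random.gauss(0, 1) for ui in u]    # outcome: note, no x term at all

def corr(a, b):
    """Plain Pearson correlation coefficient, no libraries needed."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

r_xy = corr(x, y)
print(f"correlation of X with Y: {r_xy:.2f}")  # close to 0.5, with no causal link
```

A study that sampled only X and Y would report a strong, replicable, 'significant' association; the association is real, but the causal interpretation would be entirely wrong.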
If we do not have a theory, or mechanism, of causation, we are largely doomed to floundering in the sea of complexity, as we clearly are experiencing in many relevant areas today. More data won't help if it generates more heterogeneity rather than enough replications of similar causal situations. More data will feed SPSS or PLINK--very fine analysis software--but not necessarily yield real knowledge.
This might be true, but to me it's just stalling to justify business more or less as usual. We have had a century of ever-increasing data scale and numerical statistical tricks, and we are where we are. As we and others note, there have been successes. But most of them are because of strong, replicable effects (that were, or could have been, found with cheaper methods and smaller, more focused studies). We're not that much better off for many of the more common problems.
Often it is argued that we can use a simplified 'model' of causation, ignoring minor effects and looking for just the most important ones. This can of course be a good strategy, and sometimes a true attempt is made to connect relevant variables with at least halfway relevant interactions (linear, multiplicative, etc.). But more often what is modeled is a set of basic regression or correlation associations among some sets of variables (such as a 'network'), without a serious attempt at a mechanism that explains the quantitative causal relationships; in a sense this still relies on very indirect associations (usually we just don't have sufficient knowledge to be more specific). We've posted before (in a series that starts here, e.g.) on some of the 'strange' phenomena we know apply to genetics, which mean that local sites along the genome can have their effects only because of complex, very non-local aspects of the genome. Significant associations only sometimes lead to understanding of why.
But if we have a good causal model, both the statistical associations and the nature of causation become open to understanding--and to prediction. With a theory, or mechanism, we can sample more efficiently, control for and understand measurement and other errors more effectively, select variables more relevantly, separate out human fallibility variation (the truly statistical aspects of studies) from causation, interpret the results more specifically, and predict--an ultimate goal--with more confidence and precision. A good causal mechanism provides these and many other huge advantages over generic association testing, which is mainly what is going on today.
Of course, it's hellishly difficult, even in controlled areas like chemistry, to develop good mechanistic understanding of life. That is to me no excuse for the many-layered removal between what we estimate from data now and what we should be trying to understand. Even if it's harder to get grants if we actually try to do things right.
There are some seemingly insuperable issues, and we've written about them before. One is that we fit models to retrospective data, about events that have already happened, yet we often speak of such retrodiction as if it had predictive value. But we know that many risk factors, even genetic ones, depend on future events that are unpredictable in principle, so we can't know enough about what we're modeling. And second, if we're modeling one outcome--diabetes, say--then the risks of other 'competing' outcomes, say cancer rates, clearly affect our risks for the target outcome, and in subtle ways. If cancer were eliminated, that alone would raise the risk of diabetes, because people would live to have longer exposures to diabetes risk factors. Since we don't know about such changes in other competing factors that may occur, this is not really a problem that we can build into causal or mechanistic models of a given trait--a real dilemma and, probably, a rationale for doing the best statistical data-fitting that we can. But I think we too routinely use such things as excuses for not thinking more deeply.
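The competing-risks effect can be illustrated with a deliberately crude life-table sketch. The yearly hazards below are invented round numbers, not real epidemiological rates; the point is only the direction of the effect, not its size.

```python
def lifetime_diabetes_risk(p_diabetes, p_cancer, years=80):
    """Probability of developing 'diabetes' before 'cancer' (or before `years`),
    given constant, independent yearly hazards -- a toy model, not epidemiology."""
    alive_unaffected = 1.0   # fraction still free of both outcomes
    diabetes_total = 0.0
    for _ in range(years):
        diabetes_total += alive_unaffected * p_diabetes
        alive_unaffected *= (1 - p_diabetes) * (1 - p_cancer)
    return diabetes_total

# Same diabetes hazard in both scenarios; only the competing risk changes.
with_cancer = lifetime_diabetes_risk(p_diabetes=0.005, p_cancer=0.01)
without_cancer = lifetime_diabetes_risk(p_diabetes=0.005, p_cancer=0.0)
print(f"lifetime diabetes risk, cancer present: {with_cancer:.3f}")
print(f"lifetime diabetes risk, cancer removed: {without_cancer:.3f}")
```

Nothing about the diabetes hazard itself changes between the two runs, yet removing the competing outcome raises lifetime diabetes risk substantially, simply because more people survive long enough to accumulate exposure.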
Of course, if the causation in our cosmos is truly probabilistic, and effects just don't have causes, or we don't know the probability values involved or how such values are 'determined' (or if they are), then we may really be in the soup when it comes to predicting things we want to predict that have many complex contributing factors.
Forming well-posed questions: not easy, but ....
To me personally, it's time to stop squeezing the statistical rock. Whether or not the world is ultimately deterministic, it is time to ask: What if our current concepts about how the world works are somehow wrong, or, more particularly, what if the concepts underlying statistical approaches are not of the right kind? Or what if we rely on vanilla methods chosen because they allow us to go collect data without actually understanding anything (this is, after all, rather what genomics' boast of 'hypothesis-free' methods confesses)?
Even if we have no good alternatives (yet), one can ask how we might look for something different--not while we continue the same sorts of statistical sampling approaches, but instead of continuing them. Theory can have a bad name if it just means woolly speculation, or great abstraction irrelevant to the real world. But much of standard statistical analysis is based on probability theory, degrees of indirection, and 'models' so generic that they are just as woolly, even if they generate answers--because the 'answers' aren't really answers, and won't really be answers, until we are somehow forced to start asking well-posed questions.
Some day, some lucky person, will hit upon a truly new idea....