Thursday, February 13, 2014

Even more on posing the well-posed....

Here's another installment on the idea of well-posed questions in evolutionary and biomedical genetics, and related areas of epidemiology.  We wrote about why the kinds of questions that these fields have been able to ask about causation have changed, and why well-posed questions are less and less askable now.  We've tried to deconstruct the questions currently being asked and to explore why they aren't well-posed and why they can't be.  Today, we think further about what it means to pose answerable questions about causality.

The causal assumption
A well-posed question can be about the fact of a numerical association, say between some specified variable and some outcome.  Or, it can be about the reason for that association.  Do we live in a causal universe?  This may seem like a very stupid question, but it is actually quite profound. If we live in a causal universe, then everything we see must have a cause: there can be no effect, no observation, without an antecedent cause--a situation that existed before our observation and that is responsible for it.

Actually, some physicists argue from what is known of quantum mechanics (physical processes at the atomic scale or smaller) that there are, indeed, effects that do not have causes.  One way to think about this, and one not as strange relative to common sense as much of quantum mechanics is, is the opposite view: that the world must be ultimately deterministic.  Some set of conditions must make our outcome inevitable!  If you knew those conditions, the probability of the effect would be 1 (100%).

In that sense, the word 'probability' has no real meaning.  But if, to the contrary, outcomes were not inevitable, because things were truly probabilistic, then our attempts to understand nature would, or should, be different in some way.  The usual idea of 'laws' of Nature is that the cosmos is deterministic, even if many things like sampling and measurement yield errors relative to our ability to infer the whole, exact truth.  And 'chaos' theory says that even the tiniest errors, even in a truly deterministic system, can lead to uncontrollably or unknowably large ones.
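
To make the chaos point concrete, here is a minimal sketch (ours, purely illustrative, not from any study discussed here) using the logistic map, a textbook deterministic system: two starting values that differ by a 'measurement error' of one part in ten billion end up on entirely different trajectories within a few dozen steps, even though every step is exactly determined.

```python
# A minimal, purely illustrative sketch: deterministic rules can still make
# prediction hopeless when initial conditions are known only approximately.

def logistic(x, r=4.0):
    """One step of the deterministic logistic map x -> r*x*(1-x)."""
    return r * x * (1.0 - x)

x, y = 0.2, 0.2 + 1e-10   # identical except for a tiny 'measurement error'
for step in range(1, 61):
    x, y = logistic(x), logistic(y)
    if step % 10 == 0:
        print(f"step {step:2d}: x={x:.6f}  y={y:.6f}  |x-y|={abs(x - y):.2e}")
```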

These ideas are fundamental when we think about our topics of interest in genetics, public health, and evolutionary biology.

The statistical gestalt
A widespread approach in these areas is to see how much juice we can squeeze out of a rock with ever-fancier statistical methods (helped along by the convenient software we have to play with).  Whether the variables are genomic or environmental or both, we'll slice and dice this way and that, and find the fine points of 'risk' (that is, of retrospectively ascertained, already-achieved outcomes).

Now, one may think that there really is some 'juice' in the rock:  that the world is truly deterministic, and that the juice we see is the causal truth masked by all the mismeasured and unmeasured variables, incomplete samples, and the other errors we fallible beings inevitably make.  Our statistical results may not be good for prediction, or may be good but to an unknown extent; still, if we think of the world as causal, there is, for each instance of some trait, a wholly deterministic set of conditions that made it as inevitable as sunrise.  In defense of our hard-to-interpret methods and results, we'll say "it's the best we can do, and it will work at least for some sets of factors."  And so let's keep on using our same statistical methods to squeeze the rock.

One may explain this view by saying that life is complex, and that statistical sampling and inferential methods are the way to address it.  This could be, and certainly in part seems to be, because many issues of importance involve so many factors, interacting in so many different ways, that each individual is unique (even if causation is 100% deterministic).  This raises fundamental questions about how we could ever deliver the promised successes (people will still get sick, despite what Francis Collins and so many others implicitly promise).  Perhaps what we're doing is not just the best we can do now but the best that can be done, because that's just how Nature is!

Rococo mirror and stuccowork, Schloss Ludwigsburg, Stuttgart

But we should keep in mind one aspect of the usual sorts of statistical methods, no matter how many rococo filigrees they may be gussied up with.  These methods find associations (correlated presence) among variables, and if we can specify which occurred first, then we can interpret the earlier ones as causes and the later ones as effects.  We can estimate the strength or 'significance' of the observed association.  But we should confess openly that these are only what was captured in some particular sample.  Even if they truly do reflect underlying causation, that causation can be so indirect as to be largely impenetrable (unless the cause is so strong that we can design experiments to demonstrate it, or find it with samples so small and cheap that our Chair will penalize us for not proposing a bigger, more expensive study).

Now associations of this sort can be very indirect relative to the actual underlying mechanism.  Why?  The mix of factors--their relative proportion or frequency--in our sample may affect the strength of association.  An associated,  putatively causal factor may be correlated with some truly causal, but unmeasured, factors.  The complexity of contexts may mean that causation only occurs in the presence of other factors (context-dependency).
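
To illustrate the second of these points, here is a toy simulation (ours, with made-up numbers, not data from any study): a measured 'risk factor' X has no effect at all on the outcome Y, yet the two are solidly correlated in every sample, because an unmeasured factor U drives both.

```python
# A toy sketch with made-up numbers: an association that is real in the data
# but causally indirect, because an unmeasured factor U drives both the
# measured variable X and the outcome Y.
import random

random.seed(1)
n = 10_000
u = [random.gauss(0, 1) for _ in range(n)]    # unmeasured common cause
x = [ui + random.gauss(0, 1) for ui in u]     # measured 'risk factor'
y = [ui + random.gauss(0, 1) for ui in u]     # outcome; note X never enters

def corr(a, b):
    """Plain Pearson correlation, written out to keep the sketch dependency-free."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

# Roughly 0.5, even though changing X would do nothing to Y.
print(f"corr(X, Y) = {corr(x, y):.2f}")
```

Change the mix in the sample (the spread of U relative to the other noise) and the strength of the 'association' changes too, which is the first point above about sample composition.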

If we do not have a theory, or mechanism, of causation, we are largely doomed to flounder in a sea of complexity, as we are clearly experiencing in many relevant areas today.  More data won't help if it generates more heterogeneity rather than enough replications of similar causal situations.  More data will feed SPSS or Plink, very fine analysis software, but will not necessarily yield real knowledge.

This might be true, but to me it's just stalling to justify business more or less as usual.  We have had a century of ever-increasing data scale and numerical statistical tricks, and we are where we are.  As we and others note, there have been successes.  But most of them involve strong, replicable effects (that were, or could have been, found with cheaper methods and smaller, more focused studies).  We're not that much better off for many of the more common problems.

Often it is argued that we can use a simplified 'model' of causation, ignoring minor effects and looking only for the most important ones.  This can of course be a good strategy, and sometimes a genuine attempt is made to connect relevant variables with at least halfway-relevant interactions (linear, multiplicative, etc.).  More often, though, what is modeled is a set of basic regression or correlation associations among some sets of variables (such as a 'network'), without a serious attempt at a mechanism that explains the quantitative causal relationships; in a sense this still relies on very indirect associations (usually we just don't have sufficient knowledge to be more specific).  We've posted before (in a series that starts here, e.g.) on some of the 'strange' phenomena we know apply to genetics, which mean that local sites along the genome can have their effects only because of complex, very non-local aspects of the genome.  Significant associations only sometimes lead to an understanding of why.

But if we have a good causal model, both the statistical associations and the nature of causation become open to understanding--and to prediction.  With a theory, or mechanism, we can sample more efficiently, control for and understand measurement and other errors more effectively, select variables more relevantly, separate out the variation due to human fallibility (the truly statistical aspects of studies) from causation, interpret the results more specifically, and predict--an ultimate goal--with more confidence and precision.  A good causal mechanism provides these and many other huge advantages over generic association testing, which is mainly what is going on today.

Of course, it's hellishly difficult, even in relatively controlled areas like chemistry, to develop a good mechanistic understanding of life.  But that is, to me, no excuse for the many layers of removal between what we estimate from data now and what we should be trying to understand, even if it's harder to get grants when we actually try to do things right.

There are some seemingly insuperable issues, and we've written about them before.  One is that we fit models to retrospective data, about events that have already happened, yet we often speak of that retrodiction as if it conferred predictive value.  But we know that many risk factors, even genetic ones, depend on future events that are unpredictable in principle, so we can't know enough about what we're modeling.  A second is that if we're modeling one outcome--diabetes, say--then the risks of other 'competing' outcomes, say cancer, clearly affect our risks for the target outcome, and in subtle ways.  If cancer were eliminated, that alone would raise the risk of diabetes, because people would live longer and so have longer exposure to diabetes risk factors.  Since we don't know what changes in such competing factors may occur, this is not really a problem we can build into causal or mechanistic models of a given trait--a real dilemma and, probably, a rationale for doing the best statistical data-fitting that we can.  But I think we too routinely use such things as excuses for not thinking more deeply.
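
The competing-risk point can be seen with a back-of-the-envelope simulation (our own toy per-year hazards, not real rates): take away the 'cancer' hazard and the lifetime risk of 'diabetes' goes up, even though the yearly diabetes hazard itself never changes, simply because more people live long enough to accumulate exposure.

```python
# A toy competing-risk sketch with made-up hazards: removing one competing
# cause raises the *lifetime* risk of the other outcome, even though the
# other outcome's per-year hazard is untouched.
import random

random.seed(2)

def lifetime_diabetes_risk(cancer_hazard, diabetes_hazard=0.01,
                           other_hazard=0.02, years=80, n=100_000):
    """Fraction of simulated people who develop 'diabetes' before a competing event."""
    cases = 0
    for _ in range(n):
        for _ in range(years):
            r = random.random()
            if r < diabetes_hazard:
                cases += 1        # developed diabetes this year
                break
            if r < diabetes_hazard + cancer_hazard + other_hazard:
                break             # removed by a competing cause first
    return cases / n

print(f"with cancer hazard:    {lifetime_diabetes_risk(cancer_hazard=0.01):.3f}")
print(f"cancer hazard removed: {lifetime_diabetes_risk(cancer_hazard=0.00):.3f}")
```

With these arbitrary numbers the lifetime figure rises from roughly 0.24 to roughly 0.30, with no change at all to the 'causes' of diabetes themselves.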

Of course, if causation in our cosmos is truly probabilistic, and effects just don't have causes, or we don't know the probability values involved or how such values are 'determined' (or whether they are), then we may really be in the soup when it comes to predicting the things we want to predict, the ones with many complex contributing factors.

Forming well-posed questions, not easy but ....
To me personally, it's time to stop squeezing the statistical rock.  Whether or not the world is ultimately deterministic, it is time to ask:  What if our current concepts about how the world works are somehow wrong, or, more particularly, what if the concepts underlying statistical approaches are not of the right kind?  Or what if we rely on vanilla methods chosen because they allow us to go collect data without actually understanding anything (this is, after all, rather what genomics' boast of 'hypothesis-free' methods confesses)?

Even if we have no good alternatives (yet), one can ask how we might look for something different--not while we continue to do the same sorts of statistical sampling approaches, but instead of continuing to do that.  Theory can have a bad name if it just means woolly speculation or grand abstraction irrelevant to the real world.  But much of standard statistical analysis is based on probability theory, degrees of indirection, and 'models' so generic that they are just as woolly, even if they generate answers; the 'answers' aren't really answers, and won't really be answers, until we are somehow forced to start asking well-posed questions.

Some day, some lucky person will hit upon a truly new idea....

4 comments:

  1. There is some Twitter activity about this post, but I can't tweet this sort of thing in 140 characters. If something is not essentially deterministic, what generates ('causes'?) its probability of occurring? Does the event even have a probability? Is it some single value, p? Some distribution of values with an average of p? In such cases, something must determine that distribution.

    If things truly have no cause, we're in trouble at least at some level (and in some ways, as far as I understand it, quantum mechanics may have some such properties, unless you hold to some multiverse theory).

    Here, we're likely far removed from such ultimate issues. But we have to deal with similar problems if things are as complex as they seem to be. The same situation won't recur, so retrofitting outcomes doesn't provide any clear way to turn that into prediction, unless something stable determines (causes?) the value of the probability.

    Again, many aspects of life are clear enough without these issues arising, and in those cases things work well. But many-level, indirect retrofitting without a good theory is not a good guide to prediction, except perhaps via the weak assumption that the future is like the past and that the next sample is like the last.

    To me, in many key areas of work today, we know that we don't know to what extent these assumptions are accurate or 'true'. We simply refuse to address these issues squarely because they are too daunting. Instead, we press ahead with scaling up things we already know aren't working well.

    Yes, this will generate new knowledge, and if much better solutions exist (which is itself of course doubtful), then presumably they'll eventually be found. But are there far better ways for society to invest in relevant or basic knowledge than what we're doing? I think so, but of course that poses a threat to conceptual and practical vested interests....

  2. A basic idea related to retrofitted statistical approaches is that if you have really fitted a good 'model' and want to understand causation, you need to set up a situation in which a parameter is changed and then see whether the model-predicted change in outcome follows: basically, an experiment. But when our 'model' is evolutionary and we know we can't control things in that way, we are kidding ourselves, and to an unknown extent.

    Well, we can't literally do that, we know. So a 'trick' (one that has perhaps evolved into another more-of-the-same excuse?) is meta-analysis: replicate the study, supposedly, in different samples and see whether the same SNP genotype still confers risk (judged, of course, by subjective cutoff criteria of some sort).

    The inherent variability that evolution has produced (variability that, one might argue, evolution had to produce in order to avoid too-ready extinction) amounts to complex and redundant causation. That means that two samples are not expected to agree--falsifiability is not a good criterion, and replication is only a partially useful one. And animal models, as we know, are VERY dicey 'replicates'.

    Maybe there is nothing better we can do, but we persist in writing about this because people aren't really trying hard enough to get away from what are basically 19th century concepts of inference, gussied up in the 20th.

    As an analogy to the situation as we see it, but at best only an analogy, the 'multiverse' theory in physics is a way to reconcile probabilistic uncertainties with deterministic theory. Whether true or not, it has provided a different, if counter-commonsense, way of thinking, and it, along with other ways of questioning common sense, has led to a lot of progress in physics, including immensely accurate predictions. Even though chemistry and the cosmos are complex, physicists and chemists have found very clever ways to actually test things since they faced perplexing problems a century ago. They have developed many answers with very high predictive power, even if these are clearly still incomplete. They were willing to abandon common sense, and it worked to a very good extent.

    What we are doing, in a sense, is operating in denial, clinging to common sense roughly the way Ptolemaic astronomy kept adding epicycles to hypothesized planetary motions to keep the solar system predictable in an earth-centered universe. Over the 20th century we developed a profound theoretical understanding of life and its basic causal nature, thanks basically to Darwin (and, operationally, to Mendel for his version of a genetic testing approach). But we're not (yet) really facing up to what we know, as physics and chemistry were forced to do.

    We are not yet ready to give up our epicycles.

  3. A few more thoughts:
    Ptolemy wasn't 'wrong', but his method was ad hoc and cumbersome and inaccurate. Copernicus was more convenient and accurate and in some relative sense more 'right'.

    They had two major advantages over us in biology: they had (though they didn't know it at the time) Newton's laws of motion and universal gravitation. And they had the ideal perfection of Euclidean figures--circles, conic sections, and so on.

    Relativity and the absolute speed of light alter things in a way, but basically by allowing the same kinds of ideas to be extrapolated to any other solar system, galaxy, etc.

    We have nothing close to these tools, and probably we're not close to a similarly clear-cut 'paradigm' shift. New sequencing or statistical analytic approaches are, as we have said, our 'epicycles'. Right now, we can't seem to give them up; indeed, we're always adding more of them in our post-hoc 'explanations' and our promises of 'personalized' genomic prediction etc. As long as the biostat, bioinformatics, and computer industries are kept humming by our purchases, it won't change....

  4. Here's a new commentary in Nature on this general subject:
    http://www.nature.com/news/scientific-method-statistical-errors-1.14700

    Personally, I think even this, while recognizing some of the problems, is a sort of apologia for some vague kind of expertise rather than for transformative rethinking, whatever that might be.
