Tuesday, February 11, 2014

The well-posed problem problem

Yesterday Ken wrote about the Gary Taubes piece in Sunday's NYT that asked why nutrition science has failed to explain the obesity and type 2 diabetes epidemics in the US and elsewhere.  As Ken wrote, "After decades of huge, expensive studies by presumably the most knowledgeable investigators doing state-of-the-art science, we know relatively little, with relatively little firmness, about what what we eat does for or to us."  He suggested that this can be attributed in large part to the state of the research establishment, and the pressure scientists feel to keep their labs churning out 'product' -- Taubes notes that there have been 600,000 papers and books on obesity in the last few decades, with little true understanding gained.  The science industry relies on, and indeed encourages, more of the same reductionist thinking that hasn't worked in the past.  And even if not intentionally, that discourages true innovation.

Well-posed questions
So, the research establishment is a large part of the problem.  But must an unwieldy, top-heavy, conceptually conservative research establishment be inefficient?  Is it possible that even a large establishment could be better at answering the questions it asks?  Maybe not, but if it is at least possible in principle, perhaps the place to start is to wonder whether it's the questions being asked that are the problem.

There's an idea in mathematics, due to Jacques Hadamard, that to explain a physical phenomenon the problem should be "well-posed".  That is,
  1. It is structured so that a solution exists, at least in principle
  2. Such a solution should be unique
  3. The solution's behavior changes continuously with the initial conditions
And to make this more relevant to our discussion, we'll add another criterion:
  4. An appropriately sufficient set of variables should be adequately measured.
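Hadamard's original three conditions can be written compactly.  The notation here is ours, for illustration: think of a problem as a map S that takes data (initial conditions, measurements) d to a solution u:

```latex
% Well-posedness in the sense of Hadamard, for a solution map S : D -> U
\begin{align*}
&\text{1. Existence:}  && \forall d \in D,\ \exists\, u \in U \text{ such that } u = S(d) \\
&\text{2. Uniqueness:} && \text{for each } d, \text{ that } u \text{ is the only solution} \\
&\text{3. Stability:}  && d_n \to d \ \Longrightarrow\ S(d_n) \to S(d)
  \quad \text{(the solution depends continuously on the data)}
\end{align*}
```

The third condition is the one epidemiology most obviously fails for complex disease: tiny differences in "initial conditions" (two people's diets, genomes, histories) do not map to predictably similar outcomes.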
Is it reasonable to apply this framework to the kinds of questions asked in epidemiology, as well as genetics?  Sometimes, certainly.  In 1854, John Snow asked why people living around the Broad Street water pump in a neighborhood in London were getting cholera.  His solution was that the water was contaminated.  Most people at the time believed cholera was caused by dirty air, so this wasn't a wildly popular solution.  But he was right, and when the water was no longer contaminated, people no longer got sick.   A well-posed problem with a unique solution.

John Snow's cholera map - the halcyon days of epidemiology (Wikipedia)

In the early days of epidemiology, the field could ask well-posed questions about causality, and indeed correctly discovered the cause of many infectious diseases, as well as of diseases due to environmental exposures -- asbestosis, smoking-induced lung cancer, or, not even that long ago, Legionnaires' disease.  Genetics, too, grew up expecting to be able to ask, and answer, well-posed questions -- what causes cystic fibrosis, or Tay-Sachs disease, or Huntington's disease?

But if we ask what causes obesity, and try to answer it with even state-of-the-art epidemiological or genetic methods, we are now in the territory of ill-posed questions.  Something(s) must cause it, because it exists.  But we know that everyone gets there in their own way -- no two people eat exactly the same thing and expend exactly the same amount of energy on the path to obesity, so there can and will be no single answer to the question.  In fact, this is true of many single-gene diseases as well -- for example, more than 1,000 different variants have been attributed as causes of cystic fibrosis, at least in the sense that they are found in affected persons but transmitted from unaffected carrier parents, suggesting that the combination is responsible for the trait.  Moreover, even in purportedly simple genetic diseases it is often found that a large fraction of people with the 'causal' allele are unaffected, as is the case for hemochromatosis (an iron absorption disease).  This version of reality can't be the answer to a well-posed question.

And therein lies a fundamental problem.  If there can be so many answers to questions of causation, how do we know when we're done?  Have we found the answer when we identify the cause of Mary's obesity or is it not a solution until we've found that in general taking in too many calories relative to calories expended is the cause?  That is, until we can generalize from one or more observations to assume we've found the solution for all cases?  Have we found the cause of cystic fibrosis when we know which variant causes John's, or is it when we know that there are thousands of variants on the CFTR gene that can be causal?  But then what of persons with 'causal' genotypes who are disease-free?

This isn't a trivial problem, because when we try to explain individual cases from the group level -- when we're taking the generalizations we've made from individuals and trying to apply them to people who didn't contribute to the original data set -- we're doing this without some important relevant information, and we often don't know what that is.  We're also, in effect, assuming that everyone is alike, which we know is not true.  And, when we get into the realm of prediction, we're assuming that future environments will be just like past environments, and we know that's not true.  Even then, we assign group risks probabilistically, since usually not everyone in the group (say, smokers) gets the disease.  What kind of cause are we implying?

Well, causation becomes statistical and that can mean we don't really understand the mechanism or, often, are not measuring enough relevant variables.  We use terms like 'cause' too casually, when what we really are suggesting is that the factor contributes to the trait at least in some circumstances among those who have the factor.

So it becomes impossible to know how a group-level answer applies to any given individual, either to explain or predict disease.  What we know is that smoking causes cancer.  Except when it doesn't.  Or, let's consider Gary Taubes' favorite poison these days, sugar.  Can he actually answer the question of what causes obesity and diabetes with a single answer -- 'sugar', as he seems to believe?  If he can, then it is a well-posed question.  But then why is my 86-year-old father thin and diabetes-free even though he has been known to order two desserts following hearty restaurant meals?  Indeed, he used to feed his children ice cream for dinner (we did appreciate it).  And his own father had type 2 diabetes, so there's no excuse for my father to be free of this disease.  Or, so far, his three children, no longer young, who were given those long-ago ice cream dinners.  So, sugar can't be the answer to what causes obesity, though it might be an answer of sorts; but asking what causes obesity and type 2 diabetes can't be a well-posed question if it isn't going to have a unique solution in some specifiable way.

Even thoughtful people who believe GWAS have been a great success will acknowledge that it's hard to know what to make of the findings of tens or hundreds of genes with small effects for many diseases.  What causes X?  Well, either this group of 200 genes, or that group, or that other group there.  But with 200 genes, each with only (say) two different alleles (sequence states), there are three possible genotypes at each gene, and hence 3 to the 200th power different multi-gene genotypes -- even if only a tiny, unknown fraction of these ever actually exist in the population, much less in our particular study sample.
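The arithmetic is worth making concrete.  A minimal sketch (the 200-gene figure is the illustrative number used above, not from any particular study):

```python
# Sketch: how quickly multi-gene genotype space outgrows any possible sample.
# Assumes 200 biallelic genes; each gene then has 3 genotypes (AA, Aa, aa).
n_genes = 200
genotypes_per_gene = 3  # homozygous ref, heterozygous, homozygous alt

total_genotypes = genotypes_per_gene ** n_genes
print(f"distinct multi-gene genotypes: roughly 10^{len(str(total_genotypes)) - 1}")

# Even if all eight billion people alive had unique genotypes, the fraction
# of the genotype space ever realized would be vanishingly small.
world_population = 8 * 10**9
print(f"fraction of genotype space realized: {world_population / total_genotypes:.1e}")
```

The space of possible genotypes is around 10^95, dwarfing the roughly 10^10 humans alive; almost no genotype our models must make predictions for has ever been observed.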

Add to this environmental factors, though we usually don't know which ones matter, or which ones are significant for which people.  It is a kind of legerdemain by which we seem scientific by saying that gene X raises the relative risk of an outcome by such and such an amount.  What do we really mean by that, other than to say that's the excess number of cases in our retrospectively collected data, which we extrapolate to everyone else?  Is the answer in any way clear?  Or was the question about 'risk' not well-posed?

Perhaps, then, we must strike the well-posed problem criterion for understanding disease.  Forget a unique solution; for all practical purposes no solution exists when an effect is different for everyone.  That is clearly true, since a unique observation can't be repeated to test the original assertion.

Or, better said, when you've got essentially as many solutions as you have individuals, or you've only got aggregate solutions, it's both hard to predict and hard to know how to apply group solutions to individuals.  We're all told to lower our LDL cholesterol and raise our HDL, which seems to lower heart disease risk -- except when it doesn't.  And this is something that can't be predicted because, well, because it depends.  It depends on many unknown and unknowable factors; in many, many cases, causation is not reducible.

The best that epidemiology has come up with for determining whether a theorized cause really is one is a set of criteria pulled together long ago, the Hill criteria, a list of 9 rules that looks nice on the face of it: the larger the association between a factor and an effect, the more likely the factor is the cause; the association should be consistent; it should seem plausible, and so forth.  The only problem is that only one of these criteria always must be true -- the cause must precede the effect.  So, in effect, we've got a branch of science that sets out to find the cause of disease but has no clue how to tell when it's done.*  Never being done is nice for keeping the medical research establishment in business, but not so nice for the enhancement of knowledge.

A colleague and friend of ours, Charlie Sing, is a genetic epidemiologist at Michigan.  He has for decades tried to stress the clear fact that it is not just a 'cause' but its context that is relevant.  One gene is another gene's context in any given individual, and ditto for environmental exposures.  Without the context there is no cause, or at least no well-posed cause question.  If the contexts are too complex and individual, and not systematic, or involve too many factors, we're in the soup.  Because, it depends.

Where do we go from here?
So, what's the bottom line?  We don't know what questions to ask, we don't know when we've answered them, and we don't know how to do better.  Our questions and even many of our answers sound clear but are in fact not well-posed.  For reasons of historical success in other fields, we're stuck in a reductionist era looking for single explanations for disease, but that's no longer working.  It's time to expand our view of causation, to understand what it means that my father can eat all the desserts he wants and never gain weight, while that seems not to be true for a lot of people.  But it is fair, indeed necessary, to ask: what are we supposed to do if we reject reductionism, which has stood the test of at least a few centuries' time?

Gregor Mendel, who asked well-posed questions, and answered them

It seems likely that we were seduced by the early successes of epidemiology with point-causes with large effects -- infectious diseases -- and we were similarly seduced by Mendel's carefully engineered successes with similar point causes -- single genes -- for carefully chosen traits, but these are paradigms that don't fit the complex world we're now in.  What Mendel showed was that causal elements were inherited with some specifiable probability, and he did that in a well-posed setting (selective choice of traits and highly systematized breeding experiments).  But Mendel's ideas rested on the notion that while the causal elements (we now call them alleles) were transmitted in a probabilistic way, once inherited they acted deterministically.  Every pea plant with the same genotype had the same color peas, etc.  We now know that that's an approximation for some simple situations, but not really applicable generally.
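Mendel's setting really was well-posed, and it's simple enough to write down.  A sketch of his monohybrid cross (Aa x Aa), where each parent transmits one allele with probability 1/2:

```python
from itertools import product

# Mendel's monohybrid cross: two heterozygous (Aa) parents.
# Each parent transmits either allele with probability 1/2, so each
# of the four ordered allele pairs occurs with probability 1/4.
parent_alleles = ["A", "a"]
offspring = {}
for a1, a2 in product(parent_alleles, parent_alleles):
    genotype = "".join(sorted((a1, a2)))  # "aA" and "Aa" are the same genotype
    offspring[genotype] = offspring.get(genotype, 0) + 0.25

print(offspring)  # the classic 1:2:1 genotype ratio
```

Transmission is probabilistic, but the outcome given a genotype was, in Mendel's peas, deterministic: every "aa" plant shows the recessive trait.  It's that second, deterministic step that fails to generalize to complex disease.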

We are really good at gathering and generating data.  Big Data.  Making sense of it is harder.  The casual assumption is that we can, do, and should ask well-posed questions, but at present perhaps we no longer can.  When every person is different, when there can be more than one pathway to a given outcome (and defining outcomes itself is problematic), and when the pathways are complex, finding simple answers in heterogeneous data sets isn't going to happen.  And finding complex answers in heterogeneous data sets is something we don't really know how to do yet.  And again, generalizing from individuals to groups and back again adds its own layer of difficulty because of just what our friend Charlie says -- the answers are context dependent, but you've lost the context.

There have been a number of attempts to develop statistical analyses that take many variables into account.  Some go site by site along the genome, summing up the estimated risk for the variants found at each site, then accounting for other variables such as sex, perhaps age, and maybe even some environmental factors.  The basic idea is that context is identified as combinations of factors that together have a discernible effect that the variables don't have on their own.
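The additive, site-by-site approach described here can be sketched in a few lines.  Every number below -- the per-site effects, the sex and age adjustments -- is invented purely for illustration, not taken from any real study:

```python
# Minimal sketch of an additive genetic risk score with covariates.
# All effect sizes here are made up for illustration.

# Per-site contribution (log-odds scale) of each copy of the alternate allele
site_effects = {"site_1": 0.10, "site_2": -0.05, "site_3": 0.20}

def risk_score(genotype, sex, age):
    """Sum per-site effects weighted by allele copies (0, 1, or 2),
    then add simple covariate terms for sex and age."""
    score = sum(site_effects[site] * copies for site, copies in genotype.items())
    score += 0.3 if sex == "male" else 0.0  # invented sex effect
    score += 0.01 * (age - 50)              # invented age adjustment
    return score

# Two people with identical variants but different contexts get different scores
print(risk_score({"site_1": 2, "site_2": 1, "site_3": 0}, "male", 60))
print(risk_score({"site_1": 2, "site_2": 1, "site_3": 0}, "female", 40))
```

Note what the sketch quietly assumes: that effects at different sites simply add, with no interaction between genes or between genes and environment -- exactly the context-independence that Charlie Sing argues against.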

This kind of statistical partitioning may help identify very strong risk categories, such as responses to some environmental factor in people with some particular genotype.  But in general everyone is at some risk, and we as a society will have to ask how much detail is worth knowing -- or whether, when too many variables constitute exchangeable sets (that is, sets with similar risk), its usefulness diminishes.  Will it be expected to tell us how many cigs and cokes and Snickers bars we can consume, how many minutes on the treadmill, for our genotype, sex, and age, an apple a day, and keep our risk of, say, stroke, to less than some percent for the next 10 years?  Or is that not a very useful kind of goal -- a well-posed question?  Time will tell.

An unfair rebuke
We are often chastised for being critical of the current system because, while what we say is correct, we haven't provided the magic what-to-do-instead answer.  That is little more than a rebuke of self-interested convenience, because if our points are correct then the system is misbehaving and needs to change, rather than just keep the factory humming.  The rebuke is like a juror voting to convict the accused, even though there is no evidence against him, because the defense hasn't identified the actual guilty party.

That we can't supply miracles is certainly a profound limitation on our part.  But it isn't an excuse for still looking for lost keys under the lamp-post when that's not where they were lost.  If anything, our certainly acknowledged failure should be a stimulus to our critics to do better.

If we can't expect to ask well-posed questions any longer, what kinds of questions can we ask?

*This is somewhat hyperbolic, because with infectious diseases, or environmental factors with large effects -- coal dust and Black Lung Disease, asbestos and asbestosis, e.g. -- it is possible to establish causation with experimental data and overwhelming evidence from exposure data.  But the point certainly holds for the chronic, late-onset diseases that are killing us these days.


Manoj Samanta said...

I guess it is too late to change course. Do you really think Francis Collins will write a book titled - "Sorry folks, we over-promised based on faulty theory" after selling dreams in 1999-2000 about how, by 2010, little Johnny would know everything about his life through quick $1000 genome test after birth?

Anne Buchanan said...

Indeed. You are right. And that was the point of yesterday's post.