Friday, December 10, 2010

Why most Big Splash findings in science are wrong

"Our facts are losing their truth"
We all are taught the 'scientific method' by which we now understand the world.  It is widely held that science is marching steadily toward an ever more refined and objective understanding of the one true truth that's out there.  Some philosophers and historians of science question just how objective the process is, or even whether there's just one truth.  But what about the method itself?  Does it really work as we believe (and is 'believe' the right word for it)?

An article in the Dec 13 issue of the New Yorker, "The Truth Wears Off" by Jonah Lehrer, raises important issues in a nontechnical way.
On September 18, 2007, a few dozen neuroscientists, psychiatrists, and drug-company executives gathered in a hotel conference room in Brussels to hear some startling news. It had to do with a class of drugs known as atypical or second-generation antipsychotics, which came on the market in the early nineties. The therapeutic power of the drugs appeared to be steadily falling. A recent study showed an effect that was less than half of that documented in the first trials, in the early nineties. Before the effectiveness of a drug can be confirmed, it must be tested again and again. The test of replicability, as it’s known, is the foundation of modern research. It’s a safeguard for the creep of subjectivity. But now all sorts of well-established, multiply confirmed findings have started to look increasingly uncertain. It’s as if our facts are losing their truth. This phenomenon doesn’t yet have an official name, but it’s occurring across a wide range of fields, from psychology to ecology. 
Unofficially, it has been called the 'decline effect', and Lehrer cites many examples of strong effects  going on to 'suffer from falling effect size'.

He mentions the assertion by John Ioannidis, a leading advocate of meta-analysis--pooling studies to gain adequate sample size and more stable estimates of effects--made in a paper he wrote about why most big-splash findings are wrong. The gist of the argument is that major scientific journals like Nature (if that's actually a scientific journal) publish big findings--that's what sells, after all.  But what are big findings?  They're the unexpected ones, with strong statistical evidence behind them.

But statistical effects arise by chance as well as by cause.  That's why we have to support our case with some kind of statistical criterion, such as the level of a 'significance' test.  But if hundreds of investigators are investigating countless things, even if they use a test such as the standard 5% (or 1%) significance criterion, some of them will, just by chance, get such a result. The more studies that are tried, and the more people trying them, the more dramatic the most extreme fluke result will be.  Yet that is what gets submitted to Nature and what history shows they love to publish.
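To put rough numbers on this, here is a minimal sketch in Python. It assumes that every test is independent and that none of the things being tested has any real effect; both are simplifying assumptions for illustration, and the function names are ours, not anyone's standard terminology.

```python
def prob_at_least_one_fluke(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n_tests independent tests of a
    true null hypothesis comes out 'significant' at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_flukes(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of chance 'hits' among n_tests true-null tests."""
    return n_tests * alpha


if __name__ == "__main__":
    # How quickly chance findings accumulate as more tests are run.
    for n in (1, 10, 100, 1000):
        print(n, round(prob_at_least_one_fluke(n), 4), expected_flukes(n))
```

Under these assumptions, by the time a hundred independent tests have been run at the 5% level, at least one chance 'finding' is close to certain.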

Genome-wide association studies (GWAS) magnify the problem greatly.  With hundreds of thousands of marker tests in a given study, the probability that some 'significant' result arises purely by chance is substantial.  Investigators are well aware of this, and try to adjust for it by using more stringent significance criteria, but nonetheless with lots of studies and markers and traits, flukes are bound to arise.
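One common form of such an adjustment is the Bonferroni correction, which divides the overall error rate by the number of tests. The sketch below is only illustrative: the marker count of 500,000 and the count of 50 studies are made-up numbers, the tests are treated as independent, and the helper names are our own.

```python
def bonferroni_threshold(n_markers: int, family_alpha: float = 0.05) -> float:
    """Per-marker p-value cutoff that keeps the chance of any false hit
    in one study at or below family_alpha (Bonferroni correction)."""
    return family_alpha / n_markers


def prob_some_study_flukes(n_studies: int, family_alpha: float = 0.05) -> float:
    """Chance that at least one of n_studies independent, properly
    corrected studies still reports a false hit (all nulls assumed true)."""
    return 1.0 - (1.0 - family_alpha) ** n_studies


if __name__ == "__main__":
    # Hypothetical numbers: 500,000 markers per study, 50 separate studies.
    print(bonferroni_threshold(500_000))
    print(round(prob_some_study_flukes(50), 3))
```

The point of the second function is that a correction applied within each study does nothing about the number of studies, traits, and research groups: across the whole field, some corrected study will still turn up a fluke.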

Worse, what makes for a Big Splash is not the significance test value but the effect size.  The usual claim is not just that someone found a GWAS 'hit' in relation to a given trait, but that the effect of the high-risk variant is major--explains a huge fraction of a disease, for example, making it a juicy target for Big Pharma to try to develop a drug or screen for.

[Image from the Dec 6 New Yorker]
But a number of years ago Joe Terwilliger and John Blangero showed by simulation that even when there is no causal element in a genomic search, the estimates of the effect size for the sites that survive the significance tests are bloated...that's how they reached their significance when the criteria were cautiously stringent.  The effect size, conditional on a high significance test, is biased upwards.  So, as Joe et al. said way back then, you have to draw a new, unbiased sample for the particular purported cause to begin estimating the strength of effect that the cause actually has.
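The flavor of that result can be shown with a toy simulation. To be clear, this is not Terwilliger and Blangero's actual simulation, just a minimal sketch under our own assumptions: every marker has a true effect of exactly zero, each estimated effect is drawn directly as pure sampling noise, and the marker count, sample size, and cutoff are arbitrary.

```python
import random
import statistics


def null_scan(n_markers=20000, n_samples=100, z_cutoff=3.0, seed=1):
    """Scan markers that have NO true effect, keep the 'significant' ones,
    then re-estimate those same markers in a fresh, independent sample."""
    rng = random.Random(seed)
    se = 1.0 / n_samples ** 0.5  # standard error of each effect estimate

    # Shortcut (an assumption): with no real effect, the estimate for
    # each marker is just noise, drawn here directly as N(0, se^2).
    estimates = [rng.gauss(0.0, se) for _ in range(n_markers)]

    # Markers that pass the significance cutoff in the first study.
    hits = [i for i, b in enumerate(estimates) if abs(b) / se > z_cutoff]
    first = [abs(estimates[i]) for i in hits]

    # Follow-up: new, unbiased estimates for the surviving markers only.
    replication = [abs(rng.gauss(0.0, se)) for _ in hits]

    return {
        "n_hits": len(hits),
        "se": se,
        "mean_abs_all": statistics.fmean(abs(b) for b in estimates),
        "mean_abs_hits_first": statistics.fmean(first) if first else float("nan"),
        "mean_abs_hits_replication": (
            statistics.fmean(replication) if replication else float("nan")
        ),
    }


if __name__ == "__main__":
    for name, value in null_scan().items():
        print(name, value)
```

Because a marker can only pass the cutoff if its estimate is large, the surviving estimates are large by construction, even though nothing real is there. The fresh estimates for the same markers fall back toward zero, which is a 'decline effect' produced by selection alone.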

And this brings us back to the New Yorker story of the diminution of findings with follow-up, and why facts are losing their truth.

Lehrer concludes,
Even the law of gravity hasn't always been perfect at predicting real world phenomena.  (In one test, physicists measuring gravity by means of deep boreholes in the Nevada desert found a two-and-a-half-per-cent discrepancy between the theoretical predictions and the actual data.)  Despite these findings...the law of gravity remains the same.
...Such anomalies demonstrate the slipperiness of empiricism.  Although many scientific ideas generate conflicting results and suffer from falling effect sizes, they continue to get cited in the textbooks and drive standard medical practice.  Why?  Because these ideas seem true.  Because they make sense.  Because we can't bear to let them go.  And this is why the decline effect is so troubling. [It] reminds us how difficult it is to prove anything.  We like to pretend that our experiments define the truth for us.  But that's often not the case.  Just because an idea is true doesn't mean it can be proved.  And just because an idea can be proved doesn't mean it's true. When the experiments are done, we still have to choose what to believe. 

These points are directly relevant to evolutionary biology and genetics--and the over-selling of genetic determinacy that we post about so often.  They are sobering for those who actually want to do science rather than build their careers on hopes and dreams, using science as an ideological vehicle to do it, in the same way that other ideologies, like religion, are used to advance societal or individual self-interest.

But these are examples of 'inconvenient truths'.   They are well-known but often honored mainly in the breach.  Indeed, even as we write, on our campus a speaker is going to invoke all sorts of high-scale 'omics' (proteomics, genomics, etc.) to tech our way out of what we know to be true: the biological world is largely complex.  Some Big Splash findings may not be entirely wrong, but most are at least exaggerated.  There are too many ways that flukes or subtle wishful thinking can lead science astray, which is why the supposedly iron-clad 'scientific method', as an objective way to understand our world, isn't so objective after all.


James Goetz said...

I'm still a fan of multiple tests with big sample sizes, but I suppose we need to reevaluate the methodology in known cases of decline effect. I know that I'm suggesting that we study these cases in hindsight, which has its limitations in science, but this is how businesses evaluate their processes.

When I've more time, I'll carefully study cases of decline effect.

Ken Weiss said...

The problems are at least two. First, there is an effect-size bias in the first reports, so whether or not to include them in meta-analysis is a legitimate question. Even with several studies jointly analyzed, if the first was inflated by statistical bias, it will continue to inflate the net results.

Second, in biological data you don't necessarily expect replication even if the first finding is not a statistical fluke. Population B may just not have much or any of the causal variation identified in samples from Population A. So 'negative' follow-up may not indicate a false positive first result.

James Goetz said...

Ken, I suppose one way to look at what you're suggesting is that the sample size wasn't large enough. In the case of a medicine with decline effect, the sum of the trials never evaluated a large enough sample size to predict the results of the medicine on a large scale.

"'Negative' follow-up [may or] may not indicate a false positive first result." If that's the case, then a much larger sample size was needed before any strong conclusion can be made.

Anne Buchanan said...

Jim, increasing sample size isn't always a plus. It can have the effect of increasing heterogeneity and expanding the possible genetic, environmental exposure, or gene x environment interaction pathways to the effect, thus _lowering_ the study's power to find causation.

Ken Weiss said...

Big sample size will not eliminate first-study bias. Anne's right about heterogeneity, too. But the problem with big samples is that they can identify effects better...but what they do better is identify small effects.

The follow-up issue has to do with not being able to replicate the same initial sample. Two samples, even of Europeans, may not capture the same mix of allelic effects. And samples from different populations will be different genomically, and hence not (necessarily) expected to replicate the first result. Indeed, it will largely be relatively common, and hence older, alleles whose effects will be replicated by meta-analysis.

There are other ways to follow-up studies, but the effect being talked of here is largely of the kind discussed. Other ways have their own problems.

Overall, however, publication and promotion-spinning bias are, along with statistical fluke results, at the root of many of the problems.

James Goetz said...

Anne, I suppose that if expanding various factors/contingencies ("increasing heterogeneity and expanding the possible genetic, environmental exposure, or gene x environment interaction pathways to the effect") in a larger sample changes the results, then the scientists need to understand those factors and test a larger sample size within the limits of those factors. If the scientists cannot do that, then they actually discovered next to nothing about causation.

James Goetz said...

Ken or Anne, I'm curious, do you agree or disagree with my last comment in this thread? Or was my last comment incomprehensible and you have no idea what I said? (I understand that I might have botched the language.:)

Anne Buchanan said...

Jim, it's not entirely clear what you were saying here, so answering was going to involve sorting that out. This week is crazy (finals and grades and so on), so I haven't yet had the time to do that. Maybe tomorrow!

James Goetz said...

Anne, sorry that I'm unclear about this. Perhaps after you're done grading finals, you could help me to understand what you mean by "increasing heterogeneity and expanding the possible genetic, environmental exposure, or gene x environment interaction pathways to the effect." I suppose I have trouble making myself clear because I know next to nothing about procedures for clinical trials, yet I make some health decisions based on these trials. And I'm concerned about the decline effect for personal decisions and society in general.

I looked up some basics about clinical trials, which I hope will help me to express my concerns.

I suppose what you describe as "increasing heterogeneity and expanding the possible genetic, environmental exposure, or gene x environment interaction pathways to the effect" could have something to do with guidelines and inclusion/exclusion criteria for the participants that make up the sample in a clinical trial. I suppose that increasing the sample size while keeping strict inclusion/exclusion criteria could prevent "lowering the study's power to find causation." For example, if increasing a sample size increases heterogeneity and lowers the study's power to find causation, then the trial might need better inclusion/exclusion criteria that identify the heterogeneity. Am I anywhere close to the ballpark on this?

James Goetz said...

I thought of the most basic philosophical concept behind my concern. Science is based on repeatable empirical experiments. Researchers must repeatedly test new hypotheses. And if the repeatable tests indicate predictable results, then the respective hypothesis ascends to the status of theory. Likewise, a decline effect would indicate methodological problems in the previous experiments. And these methodological problems might involve unknown factors in the sample such as unknown heterogeneity in the sample of participants.

If increases in heterogeneity impact the results of the tests, then researchers must refine the hypothesis by focusing on a more homogeneous sample of participants, which would involve stricter inclusion/exclusion criteria for the participants. However, if researchers cannot identify the variation (heterogeneity) that impacts the test results, then they discovered next to nothing about causation.

Anne Buchanan said...

Exactly. Nicely put. And to complicate things even further, inconsistent results can actually be true -- that is, they can reveal real things going on in study populations that just happen to differ genetically or in environmental exposures, or whatever -- or one or all of them can be false, due to poor study design or methods. Or of course, they can be due to a combination of both.

Remember that every genome is unique. When we're dealing with factors with small effects (whether many factors or just 1 or a few), then genetic background can be more of an influence than when a variable has a large effect, and that itself will influence results.

You're right that replicability is considered a test of a finding's robustness, but singleton results can be just as valid as those replicated many times over -- Mendelian diseases that are unique to single families are a good example.

Ken Weiss said...

Replicability is one of several epistemological criteria in science. The criteria vary and none is perfect. What we use in practice is a rather informal, ad hoc mix of criteria.

For many of the criteria, one can think of the idea of signal vs noise. Noise should represent factors not related to the causation we're trying to infer. It includes mistakes, measurement error, sampling variation, and so on.

In some fields, perhaps chemistry being one, the noise is really that--all electrons are identical for any practical purpose, and whether we detect molecules doing this or that is based on so many molecules that individual variation is small, or ignorable, by comparison.

But as Anne just said, in biology chance is a major actual factor in what we're trying to understand. Thus, discriminating what is experimental noise vs what is natural variation is very difficult. Causation in life is largely stochastic (probabilistic) in nature, that is, it's of a kind with measurement and other sources of 'noise'.

James Goetz said...

I'll clarify that "predictable results" are often probabilistic, for example, microreversibility. And I tend to see probability in most things.

I also completely understand Anne's point that there's no possible way to obtain a large sample size for rarely occurring genetic diseases (tautology?). And something affecting only one family is just as real as something affecting millions of families. However, science has limits for understanding rarely occurring phenomena, while some conclusions are educated guesswork. In fact, reality never changes because repeated empirical tests finally discover some of it.