Jim Wood
In my recent post advocating the abandonment of NHST (null-hypothesis significance testing), I skimmed over two important issues that I think need more elaboration: the multiple-test problem and publication bias (also called the “file-drawer” bias). The two are deeply related and ought to make everyone profoundly uncomfortable about the true meaning of achieved significance levels reported in the scientific literature.
The multiple-test problem is wonderfully illustrated by this cartoon, xkcd's “Significant”, by the incomparable Randall Munroe. The cartoon was the subject of Monday's post here on MT (so you can scroll down to see it), but I thought it worthy of further reflection. (If you don’t know Munroe's cartoons, check ‘em out. In combination they’re like a two-dimensional “Big Bang Theory”, only deeper. I especially like “Frequentists vs. Bayesians”.)
[Image: xkcd, “Frequentists vs. Bayesians”]
If you don’t get the "Significant" cartoon, you need to study up on the multiple-test problem. But briefly stated, here it is: You (in your charming innocence) are doing the Neyman-Pearson version of NHST. As required, you preset an α value, i.e. the highest probability of making a Type I error (rejecting a true null hypothesis) that you’re willing to accept as compatible with a rejection of the null. Like most people, you choose α = 0.05, which means that if your test were repeated 20 times on different samples or using different versions of the test/model, you’d expect to find about one “significant” result even when your null hypothesis is absolutely true. Well, okay, most people seem to find 0.05 a sufficiently small value to proceed with the test and reject the null whenever p < 0.05 on a single test. But what if you do multiple tests in the course of, for example, refining your model, examining confounding effects, or just exploring the data set? If, for example, you do 20 tests on different colors like the jellybean scientists, then there’s a quite high probability of getting at least one result with p < 0.05 even if the null hypothesis is eternally true. If you own up to the multiplicity of tests – or, indeed, if you’re even aware of the multiple-test problem – then there are various corrections you can apply to adjust your putative p values (the Bonferroni correction is undoubtedly the best-known and most widely used of these). But if you don’t, and if you only publish the one “significant” result (see the cartoon), then you are, whether consciously or unconsciously, lying to your readers about your test results. Green jellybeans cause acne!! (Who knew?)
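To put rough numbers on that, here is a minimal sketch of the arithmetic, assuming the 20 tests are independent, every null is true, and each test is run at α = 0.05 (figures taken from the jellybean example, not from any real study):

```python
# Family-wise error rate for the jellybean scenario: 20 independent tests,
# each at alpha = 0.05, with every null hypothesis true.
alpha = 0.05   # per-test Type I error rate
m = 20         # number of tests (one per jellybean color)

fwer = 1 - (1 - alpha) ** m
print(f"P(at least one p < {alpha} across {m} null tests) = {fwer:.2f}")  # ~0.64

# Bonferroni correction: require p < alpha / m on each individual test,
# which keeps the family-wise error rate at or below alpha.
print(f"Bonferroni per-test threshold: {alpha / m:.4f}")  # 0.0025
```

So under these assumptions, roughly two out of three all-null, 20-test studies will yield at least one “significant” color.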
If you do multiple tests and include all of them in your publication (a very rare practice), then even if you don’t do the Bonferroni correction (or whatever), your readers can. But what if you do a whole bunch of tests and include only the “significant” result(s) without acknowledging the other tests? Then you’re lying. You’re publishing the apparent green-jellybean effect without revealing that you tested 19 other colors, thus invalidating the p < 0.05 you achieved for the greenies. Let me say it again: you’re lying, whether you know it or not.
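If you prefer simulation to arithmetic, here is a small sketch of the jellybean study itself. The sample sizes and the choice of a two-sample t-test are illustrative assumptions, not anything from the cartoon or the literature; the point is only to show how often an all-null, 20-color study hands you a publishable “finding”.

```python
# Simulate the jellybean study: 20 two-sample t-tests on pure noise, with only
# the smallest p-value "published". Sample sizes and the number of replicate
# studies are arbitrary choices for illustration.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_studies = 2_000    # how many times we repeat the whole 20-color experiment
n_colors = 20
n_per_group = 50

hits = 0
for _ in range(n_studies):
    pvals = []
    for _ in range(n_colors):
        # Acne scores for jellybean-eaters and controls drawn from the SAME
        # distribution, so the null hypothesis is true for every color.
        treated = rng.normal(0.0, 1.0, n_per_group)
        control = rng.normal(0.0, 1.0, n_per_group)
        pvals.append(ttest_ind(treated, control).pvalue)
    if min(pvals) < 0.05:   # the one result that would get written up
        hits += 1

print(f"Studies with at least one 'significant' color: {hits / n_studies:.2f}")
# Roughly 0.64 -- in about two of every three all-null studies, some color "works".
```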
This problem is greatly compounded by publication bias. Understandably wanting your paper to be cited and to have an impact, you submit only your significant results for publication. Or your editor won’t accept a paper for review without significant results. Or your reviewers find “negative” results uninteresting and unworthy of publication. Then significant test results end up in print and non-significant ones are filtered out – thus the bias. As a consequence, we have no idea how to interpret your putative p value, even if we buy into the NHST approach. Are you lying? Do you even know you’re lying?
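To see just how badly the filter muddies the water, consider a toy model. None of the parameters below come from this post or any particular study – the share of true hypotheses and the power are assumptions pulled out of the air – but the qualitative lesson holds:

```python
# Toy model of the publication filter. Assumed parameters: 10% of tested
# hypotheses describe a real effect, studies have 50% power, and only
# "significant" results (p < 0.05) get written up and published.
import numpy as np

rng = np.random.default_rng(1)
n_studies = 100_000
prior_real = 0.10   # assumed share of hypotheses that are actually true
alpha = 0.05
power = 0.50        # assumed probability of detecting a real effect

real = rng.random(n_studies) < prior_real
significant = np.where(real,
                       rng.random(n_studies) < power,   # real effects detected
                       rng.random(n_studies) < alpha)   # true nulls "detected"

published = significant    # the filter: only significant results appear in print
false_share = np.sum(published & ~real) / np.sum(published)
print(f"Share of published 'discoveries' that are false positives: {false_share:.2f}")
# About 0.47 under these assumptions -- and the journal reader has no way to tell.
```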
This problem is real and serious. Many publications have explored how widespread the problem is, and their findings are not encouraging. I haven’t done a thorough review of those publications, but a few results stick in my mind. (If anyone can point me to the sources, I’d be grateful.) One study (in psychology, I think) found that the probability of submitting significant test results for publication was about 75% whereas the probability of submitting non-significant results was about 5%. (The non-significant results are, or used to be, stuck in a file cabinet, hence the alternative name for the bias.) Another study of randomized clinical trials found that failure to achieve significance was the single most common reason for not writing up the results of completed trials. Another study of journals in the behavioral and health sciences found that something like 85-90% of all published papers contain significant results (p < 0.05), which cannot even remotely reflect the reality of applied statistical testing. Again, please don’t take these numbers too literally (and please don’t cite me as their source) since they’re popping out of my rather aged brainpan rather than from the original publications.
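Taking those half-remembered probabilities purely at face value, and pretending for simplicity that whatever is submitted gets published, the arithmetic does land in the right neighborhood. The base rate of significant results among all completed analyses is yet another assumption, so several values are tried:

```python
# Back-of-the-envelope filter using the (recalled, unverified) probabilities
# above: P(submit | significant) = 0.75, P(submit | non-significant) = 0.05.
p_submit_sig = 0.75
p_submit_nonsig = 0.05

for base_rate in (0.2, 0.3, 0.5):
    sig = p_submit_sig * base_rate            # significant analyses that get submitted
    nonsig = p_submit_nonsig * (1 - base_rate)  # non-significant analyses that get submitted
    print(f"base rate {base_rate:.0%} significant -> "
          f"{sig / (sig + nonsig):.0%} of the submitted literature is significant")
# Prints roughly 79%, 87%, and 94% -- the same ballpark as the 85-90% figure above.
```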
The multiple-test/publication-bias problem is increasingly being seen as a major crisis in several areas of science. In research involving clinical trials, it’s making people think that many – perhaps most – reported results in epidemiology and pharmacology are bogus. Ben Goldacre of “Bad Science” and “Bad Pharma” fame has been especially effective in making this point. There are now several groups advocating the archiving of negative results for public access; see, for example, the Cochrane Collaboration. But in my own field of biological anthropology, the problem is scarcely even acknowledged. This is why N. T. Longford, the researcher cited in my previous post, called the scientific literature based on NHST a “junkyard” of unwarranted positive results. Yet more reason to abandon NHST.
Nice post. I've seen that number (likelihood of publishing) somewhere, and I'm in psychology, and follow the problems with it. Some of the usual suspects would be Greg Francis, Uri Simonsohn, Leslie John.
I also checked Keith Laws' blog, as he talks about this and similar questions, especially in this particular post:
http://keithsneuroblog.blogspot.se/2012/07/negativland-what-to-do-about-negative.html
I also collect a lot about these questions in my own two blogs. (Especially on Åse Fixes Science).
The discussion needs to be spread across disciplines, because I think one of the reasons nothing has changed is that there just hasn't been enough power to shake things out of equilibrium (I see the Crowd wisdom economics link to the right as I speak, where they mention that problem).
The hope is that if things are shaken up enough, perhaps we can move out of this space and into a more robust one.
Thanks for the links. Interestingly, active and ardent discussion of this problem (as well as the general NHST problem) seems to be largely confined to psychology. So you're absolutely right that the discussion needs to spread across disciplines.
We have dealt with this issue on Mermaid's Tale as well, over the past couple of years. Actually, the problem is worse. First of all, you usually have poked around in the area, or in your data, before you actually do your tests. You don't put your analytic plans, specified in very precise terms, in a vault before collecting data and then carry them out.
Secondly, if you do enough tests, you can find what looks interesting and publish that without applying the multiple-testing correction. For example, it's not just the color of the jellybeans that needs multiple-test correction. Suppose in doing the color tests, you notice that there may be some pattern to their shape, so you then do a test of the 'spherical hypothesis' and reject it because (as you had noticed incidentally) the jellybeans were rather oblate.
Instead of using statistics for data exploration – simply saying that such-and-such seems interesting, and then designing a specific study to address it – we want the semblance of rigor when it just doesn't apply.
I think genetics has faced all of these issues in its study-design history as genomic technology has advanced, including the NHST issue, but the latter is deeply rooted in epidemiological thinking. For me the bottom line is that investigators will do whatever it takes to be able to report 'important' results.
Actually, the next stage in your "analysis" after showing that green jellybeans have a significant effect on the risk of acne is to cook up an ex post facto explanation for that effect. No doubt one compound in the green dye has an adverse effect on the sebaceous glands. Worthy of another press release!
Yes, well that's the gig! Sadly, it is always possible that surprise results are true and could lead to progress, but too often it's post-hoc justifying, or the effect is so weak as to be uninteresting even if 'significant'.
Timely post on BMJ group Blogs: "Wim Weber on OPEN—the European research project to study publication bias".
Regarding the above cartoon, Bayesian Andrew Gelman has some interesting things to say:
"I am happy to see statistical theory and methods be a topic in popular culture, and of course I’m glad that, contra Feller, the Bayesian is presented as the hero this time, but . . . . I think the lower-left panel of the cartoon unfairly misrepresents frequentist statisticians."
and,
"I think the cartoon as a whole is unfair in that it compares a sensible Bayesian to a frequentist statistician who blindly follows the advice of shallow textbooks."
Full post:
http://andrewgelman.com/2012/11/10/16808/