Tuesday, November 25, 2014

Let's Abandon Significance Tests

We thought we'd re-run the first blog post Jim Wood wrote (or read), from May 2013, on significance testing.  It was an excellent post the first time, and it's an excellent post again, with a message that doesn't get old.

By Jim Wood

It’s time we killed off NHST.

Ronald Aylmer Fisher


NHST (also derisively called “the intro to stats method”) stands for Null Hypothesis Significance Testing, sometimes known as the Neyman-Pearson (N-P) approach after its inventors, Jerzy Neyman and Egon Pearson (son of the famous Karl). There is also an earlier, looser, slightly less vexing version called the Fisherian approach (after the even more famous R. A. Fisher), but most researchers seem to adopt the N-P form of NHST, at least implicitly – or rather some strange and logically incoherent hybrid of the two approaches. Whichever you prefer, they both have very weak philosophical credentials, and a growing number of statisticians, as well as working scientists who care about epistemology, are calling – loudly and frequently – for their abandonment. Nonetheless, whenever my colleagues and I submit a manuscript or grant proposal that says we’re not going to do significance tests – and for the following principled reasons – we always get at least one reviewer or editor telling us that we’re not doing real science. The demand by scientific journals for “significant” results has led over time to a substantial publication bias in favor of Type I errors, resulting in a literature that one statistician has called a “junkyard” of unwarranted conclusions (Longford, 2005).

Jerzy Neyman

Let me start this critique by taking the N-P framework on faith. We want to test some theoretical model. To do so, we need to translate it into a statistical hypothesis, even if the model doesn’t really lend itself to hypothesis formulation (as, I would argue, is often the case in population biology, including population genetics and demography). Typically, the hypothesis says that some predictor variable of theoretical interest (the so-called “independent” variable) has an effect on some outcome variable (the “dependent” variable) of equal interest. To test this proposition we posit a null hypothesis of no effect, to which our preferred hypothesis is an alternative – sometimes the alternative, but not necessarily. We want to test the null hypothesis against some data; more precisely, we want to compute the probability that the data (or data even less consistent with the null) could have been observed in a random sample of a given size if the null hypothesis were true. (Never mind whether anyone in his or her right mind would believe in the null hypothesis in the first place or, if pressed on the matter, would argue that it was worth testing on its own merits.)   

Egon Pearson

Now we presumably have in hand a batch of data from a simple random sample drawn from a comprehensive sample frame – i.e. from a completely-known and well-characterized population (these latter stipulations are important and I return to them below). Before we do the test, we need to make two decisions that are absolutely essential to the N-P approach. First, we need to preset a so-called α value for the largest probability of making a Type I error (rejecting the null when it’s true) that we’re willing to consider consistent with a rejection of the null. Although we can set α at any value we please, we almost inevitably choose 0.05 or 0.01 or (if we’re really cautious) 0.001 or (if we’re happy-go-lucky) maybe 0.10. Why one of these values? Because we have five fingers – or ten if we count both hands. It really doesn’t go any deeper than that. Let’s face it, we choose one of these values because other people would look at us funny if we didn’t. If we choose α = 0.10, a lot of them will look at us funny anyway.

Suppose, then, we set α = 0.05, the usual crowd-pleaser. The next decision we have to make is to set a β value for the largest probability of committing a Type II error (accepting the null when it’s not true) that we can tolerate. The quantity (1 – β) is known as the power of the test, conventionally interpreted as the likelihood of rejecting a false null given the size of our sample and our preselected value of α. (By the way, don’t worry if you neglect to preset β because, heck, almost no one else bothers to – so it must not matter, right?) Now make some assumptions about how the variables are distributed in the population, e.g. that they’re normal random variates, and you’re ready to go.

So we do our test and we get p = 0.06 for the predictor variable we’re interested in. Damn. According to the iron law of α = 0.05 as laid down by Neyman and Pearson, we must accept the null hypothesis and reject any alternative, including our beloved one – which basically means that this paper is not going to get published. Or suppose we happen to get p = 0.04. Ha! We beat α, we get to reject the null, and that allows us to claim that the data support the alternative, i.e. the hypothesis we liked in the first place. We have achieved statistical significance! Why? Because God loves 0.04 and hates 0.06, two numbers that might otherwise seem to be very nearly indistinguishable from each other. So let’s go ahead and write up a triumphant manuscript for publication.
Significance is a useful means toward personal ends in the advance of science – status and widely distributed publications, a big laboratory, a staff of research assistants, a reduction in teaching load, a better salary, the finer wines of Bordeaux…. [S]tatistical significance is the way to achieve these. Design experiment. Then calculate statistical significance. Publish articles showing “significant” results. Enjoy promotion. (Ziliak and McCloskey, 2008: 32)
This account may sound like a crude and cynical caricature, but it’s not – it’s the Neyman-Pearson version of NHST. The Fisherian version gives you a bit more leeway in how to interpret p, but it was Fisher who first suggested p ≤ 0.05 as a universal standard of truth. Either way, it is the established and largely unquestioned basis for assessing hypothesized effects. An α of 0.05 (or whatever value of α you prefer) divides the universe into truth or falsehood. It is the main – usually the sole – criterion for evaluating scientific hypotheses throughout most of the biological and behavioral sciences.

What are we to make of this logic? First and most obviously, there is the strange practice of using a fixed, inflexible, and totally arbitrary α value such as 0.05 to answer any kind of interesting scientific question. To my mind, 0.051 and 0.049 (for example) are pretty much identical – at least I have no idea how to make sense of such a tiny difference in probabilities. And yet one value leads us to accept one version of reality and the other an entirely different one.

To quote Kempthorne (1971: 490):
To turn to the case of using accept-reject rules for the evaluation of data, … it seems clear that it is not possible to choose an α beforehand. To do so, for example, at 0.05 leads to all the doubts that most scientists feel. One is led to the untenable position that one’s conclusion is of one nature if a statistic t, say, is 2.30 and one of a radically different nature if t equals 2.31. No scientist will buy this unless he has been brainwashed and it is unfortunate that one has to accept as fact that many scientists have been brainwashed.

Think Kempthorne’s being hyperbolic in that last sentence? Nelson et al. (1986) did a survey of active researchers in psychology to ascertain their confidence in non-null hypotheses based on reported p values and discovered a sharp cliff effect (an abrupt change in confidence) at p = 0.05, despite the fact that p values change continuously across their whole range (a smaller cliff was found at p = 0.10). In response, Gigerenzer (2004: 590) lamented, “If psychologists are so smart, why are they so confused? Why is statistics carried out like compulsive hand washing?”
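Kempthorne’s pair of t statistics makes the point concrete. Here is a minimal sketch of the arithmetic; since he gives no degrees of freedom, it assumes a large sample, where the normal approximation to the t distribution is reasonable:

```python
from statistics import NormalDist

# Two-sided p-value under a normal approximation to the t distribution
# (an assumption that holds well for large degrees of freedom).
def two_sided_p(t_stat):
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))

p1 = two_sided_p(2.30)  # Kempthorne's t = 2.30
p2 = two_sided_p(2.31)  # ...versus t = 2.31

print(f"t = 2.30 -> p = {p1:.4f}")   # ~0.0214
print(f"t = 2.31 -> p = {p2:.4f}")   # ~0.0209
print(f"difference: {p1 - p2:.4f}")  # ~0.0006
```

A difference of roughly 0.0006 in p – yet with α = 0.025, say, the two t values would dictate opposite "versions of reality."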

But now suppose we’ve learned our lesson: chastened, we abandon our arbitrary threshold α value and look instead at the exact p value associated with our predictor variable, as many writers have advocated. And let’s suppose that it is impressively low, say p = 0.00073. We conclude, correctly, that if the null hypothesis were true (which we never really believed in the first place) then the data we actually obtained in our sample would have been pretty unlikely. So, following standard practice, we conclude that the probability that the null hypothesis is true is only 0.00073. Right? Wrong. We have confused the probability of the data given the hypothesis, P(Data|H0), which is p, with its inverse probability P(H0|Data), the probability of the hypothesis given the data, which is something else entirely. Ironically, we can compute the inverse probability from the original probability – but only if we adopt a Bayesian approach that allows for “subjective” probabilities. That approach says that you begin the study with some prior belief (expressed as a probability) in a given hypothesis, and adjust that belief in light of your new data.
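The gap between the two probabilities is easy to see with Bayes’ theorem and some made-up numbers. In this sketch, the prior P(H0) = 0.5 and the likelihood of the data under the alternative are pure assumptions chosen for illustration; only the 0.00073 comes from the example above:

```python
# P(Data|H0) is the "p-like" likelihood of the data under the null.
p_data_given_h0 = 0.00073
p_data_given_h1 = 0.01   # assumed likelihood under the alternative
prior_h0 = 0.5           # assumed agnostic prior belief in the null

# Bayes' theorem: P(H0|Data) = P(Data|H0)P(H0) / P(Data)
p_data = p_data_given_h0 * prior_h0 + p_data_given_h1 * (1 - prior_h0)
posterior_h0 = p_data_given_h0 * prior_h0 / p_data

print(f"P(Data|H0) = {p_data_given_h0}")
print(f"P(H0|Data) = {posterior_h0:.4f}")  # ~0.0680, not 0.00073
```

Even with these generous assumptions, the posterior probability of the null is nearly a hundred times larger than the p value – and it changes with the prior, which NHST never asks you to state.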

Alas, the whole NHST framework is by definition frequentist (meaning it interprets your results as if you could do the same study countless times, your actual data being but one such realization) and does not permit the inversion of probabilities, which can only be done by invoking that pesky Bayes’s theorem that drives frequentists nuts. In the frequentist worldview, the null hypothesis is either true or false, period; it cannot have an intermediate probability assigned to it. Which, of course, means that 1 – P(H0|Data), the probability that the alternative hypothesis is correct, is also undefined. In other words, if we do NHST, we have no warrant to conclude that either the null or the alternative hypothesis is true or false, or even likely or unlikely for that matter. To quote Jacob Cohen (1994), “The earth is round (p < 0.05).” Think about it.

(And what if our preferred alternative hypothesis is not the one and only possible alternative hypothesis? Then even if we could disprove the null, it would tell us nothing about the support provided by the data for our particular pet hypothesis. It would only show that some alternative is the correct one.)

But all this is moot. The calculation of p values assumes that we have drawn a simple random sample (SRS) from a population whose members are known with exactitude (i.e. from a comprehensive and non-redundant sample frame). There are corrections for certain kinds of deviations from SRS such as stratified sampling and cluster sampling, but these still assume an equal-probability random sampling method. This condition is almost never met in real-world research, including, God knows, my own. It’s not even met in experimental research – especially experiments on humans, which by moral necessity involve self-selection. In addition, the conventional interpretation of p values assumes that random sampling error associated with a finite sample size is the only source of error in our analysis, thus ignoring measurement error, various kinds of selection bias, model-specification error, etc., which together may greatly outweigh pure sampling error.

And don’t even get me started on the multiple-test problem, which can lead to completely erroneous estimates of the attained “significance” level of the test we finally decide to go with. This problem can get completely out of hand if any amount of exploratory analysis has been done. (Does anyone keep careful track of the number of preliminary analyses that are run in the course of, say, model development? I don’t.) As a result, the p values dutifully cranked out by statistical software packages are, to put it bluntly, wrong.
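The inflation is easy to quantify for the idealized case of independent tests of true nulls (real exploratory analyses involve correlated tests, so this sketch is if anything optimistic):

```python
# Probability of at least one "significant" result among k independent
# tests of true null hypotheses at alpha = 0.05: 1 - (1 - alpha)^k.
alpha = 0.05
for k in (1, 5, 20, 100):
    familywise = 1 - (1 - alpha) ** k
    print(f"{k:3d} tests -> P(at least one p < 0.05 under the null) = {familywise:.3f}")
```

By twenty preliminary tests the chance of at least one spurious "discovery" is about 0.64; by a hundred it is a near certainty, while the software keeps reporting each test as if it were the only one ever run.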

One final technical point: I mentioned above that almost no one sets a β value for their analysis, despite the fact that β determines how large a sample you’re going to need to meet your goal of rejecting the null hypothesis before you even go out and collect your data. Does it make any difference? Well, one survey calculated that the median (unreported) power of a large number of nonexperimental studies was about 0.48 (Maxwell, 2004). In other words, when it comes to accepting or rejecting the null hypothesis you might as well flip a coin.
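To see what the power arithmetic that almost no one does actually looks like, here is a sketch using the standard normal approximation for a two-sided, two-sample test; the effect size (a standardized difference of 0.3) and the sample size are invented for illustration, not taken from Maxwell’s survey:

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z test for a
    standardized mean difference d with n_per_group subjects per arm."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    noncentrality = d * sqrt(n_per_group / 2)
    return 1 - Z.cdf(z_crit - noncentrality)

def n_for_power(d, target_power=0.8, alpha=0.05):
    """Per-group sample size needed to reach the target power."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    z_pow = Z.inv_cdf(target_power)
    return ceil(2 * ((z_crit + z_pow) / d) ** 2)

print(f"power at d = 0.3, n = 50 per group: {power_two_sample(0.3, 50):.2f}")  # ~0.32
print(f"n per group needed for 80% power:   {n_for_power(0.3)}")               # 175
```

A study of 50 per group chasing a modest effect has power in coin-flip territory, which is exactly Maxwell’s point: reaching the conventional 0.8 would require more than three times the sample.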

And one final philosophical point: what do we really mean when we say that a finding is “statistically significant”? We mean an effect appears, according to some arbitrary standard such as p < 0.05, to exist. It does not tell us how big or how important the effect is. Statistical significance most emphatically is not the same as scientific or clinical significance. So why call it “significance” at all? With due deference to R. A. Fisher, who first came up with this odd and profoundly misleading bit of jargon, I suggest that the term “statistical significance” has been so corrupted by bad usage that it ought to be banished from the scientific literature.

In fact, I believe that NHST as a whole should be banished. At best, I regard an exact p value as providing nothing more than a loose indication of the uncertainty associated with a finite sample and a finite sample alone; it does not reveal God’s truth or substitute for thinking on the researcher’s part. I’m by no means alone (or original) in coming to this conclusion. More and more professional statisticians, as well as researchers in fields as diverse as demography, epidemiology, ecology, psychology, sociology, and so forth, are now making the same argument – and have been for some years (for just a few examples, see Oakes 1986; Rothman 1998; Hoem 2008; Cumming 2012; Fidler 2013; Kline 2013).

But if we abandon NHST, do we then have to stop doing anything other than purely descriptive statistics? After all, we were taught in intro stats that NHST is a virtual synonym for inferential statistics. But it’s not. This is not the place to discuss the alternatives to NHST in any detail (see the references provided a few sentences back), but it seems to me that instead of making a categorical yes/no decision about the existence of an effect (a rather metaphysical proposition), we should be more interested in estimating effect sizes and gauging their uncertainty through some form of interval estimation. We should also be fitting theoretically interesting models and estimating their parameters, from which effect sizes can often be computed. And I have to admit, despite having been a diehard frequentist for the last several decades, I’m increasingly drawn to Bayesian analysis (for a crystal-clear introduction to which, see Kruschke, 2011). Thinking in terms of a posterior distribution of support for a model, in which belief based on previous research is modified by new data, seems a quite natural and intuitive way to capture how scientific knowledge actually accumulates. Anyway, the current literature is full of alternatives to NHST, and we should be exploring them.
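As a minimal illustration of the estimation-over-testing attitude, here is a sketch that reports an effect size with an interval instead of a p value. The data are made up, and the interval uses a normal approximation as a simplifying assumption (with samples this small, a t-based interval would be somewhat wider):

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Two small made-up samples, e.g. an outcome under treatment vs. control.
treatment = [5.1, 6.2, 4.8, 5.9, 6.4, 5.5, 6.0, 5.2]
control   = [4.6, 5.0, 4.2, 5.3, 4.9, 4.4, 5.1, 4.7]

# Effect size: the difference in means, with its standard error.
diff = mean(treatment) - mean(control)
se = sqrt(stdev(treatment) ** 2 / len(treatment)
          + stdev(control) ** 2 / len(control))

# 95% interval via the normal approximation (an assumption; see lead-in).
z = NormalDist().inv_cdf(0.975)
lo, hi = diff - z * se, diff + z * se

print(f"estimated effect: {diff:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The interval carries strictly more information than “p < 0.05”: it reports how big the effect is estimated to be and how uncertain that estimate is, and it invites the reader to judge whether the whole range is substantively important.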

By the way, the whole anti-NHST movement is relevant to the “Mermaid’s Tale” because most published biomedical and epidemiological “discoveries” (including what’s published in press releases) amount to nothing more than the blind acceptance of p values less than 0.05. I point to Anne Buchanan’s recent critical posting here about studies supposedly showing that sunlight significantly reduces blood pressure. At the p < 0.05 level, no doubt.


Cohen, J. (1994) The earth is round (p < 0.05). American Psychologist 49: 997-1003.

Cumming, G. (2012) Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.

Fidler, F. (2013) From Statistical Significance to Effect Estimation: Statistical Reform in Psychology, Medicine and Ecology. Routledge, New York.

Gigerenzer, G. (2004) Mindless statistics. Journal of Socio-Economics 33: 587-606.

Hoem, J. M. (2008) The reporting of statistical significance in scientific journals: A reflexion. Demographic Research 18: 437-42.

Kempthorne, O. (1971) Discussion comment in Godambe, V. P., and Sprott, D. A. (eds.), Foundations of Statistical Inference. Toronto: Holt, Rinehart, and Winston.

Kline, R. B. (2013) Beyond Significance Testing: Statistics Reform in the Behavioral Sciences. Washington: American Psychological Association.

Kruschke, J. K. (2011) Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Amsterdam: Elsevier.

Longford, N. T. (2005) Model selection and efficiency: Is “which model…?” the right question? Journal of the Royal Statistical Society (Series A) 168: 469-72.

Maxwell, S. E. (2004) The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods 9: 147-63.

Nelson, N., Rosenthal, R., and Rosnow, R. L. (1986) Interpretation of significance levels and effect sizes by psychological researchers. American Psychologist 41: 1299-1301.

Oakes, M. (1986) Statistical Inference: A Commentary for the Social and Behavioral Sciences. New York: John Wiley and Sons.

Rothman, K. J. (1998) Writing for Epidemiology. Epidemiology 9: 333-37.

Ziliak, S., and McCloskey, D. N. (2008) The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor: University of Michigan Press.


Miika A. said...

Thanks for this entertaining read.

I somehow get the feeling something's gotten mixed up here though. Bunching p-values, Fisher and Neyman-Pearson into a single 'Frequentist' entity is a sure way of causing confusion.

Have a read of Efron's brilliant paper 'R. A. Fisher in the 21st century', http://projecteuclid.org/download/pdf_1/euclid.ss/1028905930

I think the paper is very clear on the differences between the three (3, not 2) schools of thought: Fisherian, Frequentist and Bayesian.

The assumptions behind the theory of using p-values may not hold in practice, but then again isn't it true that all models are wrong but some are useful? I've found p-values to be extremely useful in practice due to the simplicity of setting them up. Compare this to the nightmare of putting together a Bayesian model (no way to do it quickly) and estimating it in 5 weeks rather than 5 milliseconds.

'we should be more interested in estimating effect sizes and gauging their uncertainty through some form of interval estimation': Neyman-Pearson confidence intervals are for this exact purpose, no? They're very frequentist.

Anonymous said...

Worse: a p=0.05 result usually isn't as good as most people think it is. There's much more in this vein (with added equations) brilliantly expounded by David Colquhoun at:

On the hazards of significance testing (blog)


An investigation of the false discovery rate and the misinterpretation of p-values (paper).
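Colquhoun's argument can be reproduced with three assumed numbers: the fraction of tested hypotheses that are real effects, the power of the tests, and α. All three values below are illustrative assumptions in the spirit of his paper, not figures taken from it:

```python
# False discovery rate among "p < 0.05" findings.
alpha = 0.05
power = 0.8
real_fraction = 0.1  # assumed fraction of tested hypotheses that are true effects

false_pos = alpha * (1 - real_fraction)  # true nulls that reach significance
true_pos = power * real_fraction         # real effects that reach significance
fdr = false_pos / (false_pos + true_pos)

print(f"fraction of 'significant' results that are false: {fdr:.2f}")  # 0.36
```

Even with well-powered tests, if only a tenth of the hypotheses we try are real, over a third of our "significant" results are false discoveries – far worse than the 5% that the α level seems to promise.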