The Mermaid's Tale: Let's Abandon Significance Tests

Wednesday, May 15, 2013

Let's Abandon Significance Tests

By his own admission, this is not only the first blog post Jim Wood has ever written, it's also the first one he's ever read. We're hoping not the last of either, given this debut. Jim is a biological anthropologist and demographer in the Penn State Dept of Anthropology, a man of many interests. His statistical knowledge is deep and of course informs his every academic endeavor, from endocrinology to infectious disease to household agriculture and beyond. We think every student should print this post and carry it around with them everywhere. Well, anyone.
------------------------


Ronald Aylmer Fisher (1890-1962)

By Jim Wood

It’s time we killed off NHST.

NHST (also derisively called “the intro to stats method”) stands for Null Hypothesis Significance Testing, sometimes known as the Neyman-Pearson (N-P) approach after its inventors, Jerzy Neyman and Egon Pearson (son of the famous Karl). There is also an earlier, looser, slightly less vexing version called the Fisherian approach (after the even more famous R. A. Fisher), but most researchers seem to adopt the N-P form of NHST, at least implicitly – or rather some strange and logically incoherent hybrid of the two approaches. Whichever you prefer, they both have very weak philosophical credentials, and a growing number of statisticians, as well as working scientists who care about epistemology, are calling – loudly and frequently – for their abandonment. Nonetheless, whenever my colleagues and I submit a manuscript or grant proposal that says we’re not going to do significance tests – and for the following principled reasons – we always get at least one reviewer or editor telling us that we’re not doing real science. The demand by scientific journals for “significant” results has led over time to a substantial publication bias in favor of Type I errors, resulting in a literature that one statistician has called a “junkyard” of unwarranted conclusions (Longford, 2005).

Jerzy Neyman
(1894-1981)

Let me start this critique by taking the N-P framework on faith. We want to test some theoretical model. To do so, we need to translate it into a statistical hypothesis, even if the model doesn’t really lend itself to hypothesis formulation (as, I would argue, is often the case in population biology, including population genetics and demography). Typically, the hypothesis says that some predictor variable of theoretical interest (the so-called “independent” variable) has an effect on some outcome variable (the “dependent” variable) of equal interest. To test this proposition we posit a null hypothesis of no effect, to which our preferred hypothesis is an alternative – sometimes the alternative, but not necessarily. We want to test the null hypothesis against some data; more precisely, we want to compute the probability that the data (or data even less consistent with the null) could have been observed in a random sample of a given size if the null hypothesis were true. (Never mind whether anyone in his or her right mind would believe in the null hypothesis in the first place or, if pressed on the matter, would argue that it was worth testing on its own merits.)

Egon Pearson
(1895-1980)

Now we presumably have in hand a batch of data from a simple random sample drawn from a comprehensive sample frame – i.e. from a completely-known and well-characterized population (these latter stipulations are important and I return to them below). Before we do the test, we need to make two decisions that are absolutely essential to the N-P approach. First, we need to preset a so-called α value for the largest probability of making a Type I error (rejecting the null when it’s true) that we’re willing to consider consistent with a rejection of the null. Although we can set α at any value we please, we almost inevitably choose 0.05 or 0.01 or (if we’re really cautious) 0.001 or (if we’re happy-go-lucky) maybe 0.10. Why one of these values? Because we have five fingers – or ten if we count both hands. It really doesn’t go any deeper than that. Let’s face it, we choose one of these values because other people would look at us funny if we didn’t. If we choose α = 0.10, a lot of them will look at us funny anyway.

Suppose, then, we set α = 0.05, the usual crowd-pleaser. The next decision we have to make is to set a β value for the largest probability of committing a Type II error (accepting the null when it’s not true) that we can tolerate. The quantity (1 – β) is known as the power of the test, conventionally interpreted as the likelihood of rejecting a false null given the size of our sample and our preselected value of α. (By the way, don’t worry if you neglect to preset β because, heck, almost no one else bothers to – so it must not matter, right?) Now make some assumptions about how the variables are distributed in the population, e.g. that they’re normal random variates, and you’re ready to go.

So we do our test and we get p = 0.06 for the predictor variable we’re interested in. Damn. According to the iron law of α = 0.05 as laid down by Neyman and Pearson, we must accept the null hypothesis and reject any alternative, including our beloved one – which basically means that this paper is not going to get published. Or suppose we happen to get p = 0.04. Ha! We beat a, we get to reject the null, and that allows us to claim that the data support the alternative, i.e. the hypothesis we liked in the first place. We have achieved statistical significance! Why? Because God loves 0.04 and hates 0.06, two numbers that might otherwise seem to be very nearly indistinguishable from each other. So let’s go ahead and write up a triumphant manuscript for publication.

Significance is a useful means toward personal ends in the advance of science – status and widely distributed publications, a big laboratory, a staff of research assistants, a reduction in teaching load, a better salary, the finer wines of Bordeaux…. [S]tatistical significance is the way to achieve these. Design experiment. Then calculate statistical significance. Publish articles showing “significant” results. Enjoy promotion. (Ziliak and McCloskey, 2008: 32)

This account may sound like a crude and cynical caricature, but it’s not – it’s the Neyman-Pearson version of NHST. The Fisherian version gives you a bit more leeway in how to interpret p, but it was Fisher who first suggested p less than or equal to 0.05 as a universal standard of truth. Either way, it is the established and largely unquestioned basis for assessing hypothesized effects. A p of 0.05 (or whatever value of a you prefer) divides the universe into truth or falsehood. It is the main – usually the sole – criterion for evaluating scientific hypotheses throughout most of the biological and behavioral sciences.

What are we to make of this logic? First and most obviously, there is the strange practice of using a fixed, inflexible, and totally arbitrary a value such as 0.05 to answer any kind of interesting scientific question. To my mind, 0.051 and 0.049 (for example) are pretty much identical – at least I have no idea how to make sense of such a tiny difference in probabilities. And yet one value leads us to accept one version of reality and the other an entirely different one.

To quote Kempthorne (1971: 490):

To turn to the case of using accept-reject rules for the evaluation of data, … it seems clear that it is not possible to choose an a beforehand. To do so, for example, at 0.05 leads to all the doubts that most scientists feel. One is led to the untenable position that one’s conclusion is of one nature if a statistic t, say, is 2.30 and one of a radically different nature if t equals 2.31. No scientist will buy this unless he has been brainwashed and it is unfortunate that one has to accept as fact that many scientists have been brainwashed.

Think Kempthorne’s being hyperbolic in that last sentence? Nelson et al. (1986) did a survey of active researchers in psychology to ascertain their confidence in non-null hypotheses based on reported p values and discovered a sharp cliff effect (an abrupt change in confidence) at p = 0.05, despite the fact that p values change continuously across their whole range (a smaller cliff was found at p = 0.10). In response, Gigerenzer (2004: 590) lamented, “If psychologists are so smart, why are they so confused? Why is statistics carried out like compulsive hand washing?”

But now suppose we’ve learned our lesson: and so, chastened, we abandon our arbitrary threshold a value and look instead at the exact p value associated with our predictor variable, as many writers have advocated. And let’s supposed that it is impressively low, say p = 0.00073. We conclude, correctly, that if the null hypothesis were true (which we never really believed in the first place) then the data we actually obtained in our sample would have been pretty unlikely. So, following standard practice, we conclude that the probability that the null hypothesis is true is only 0.00073. Right? Wrong. We have confused the probability of the data if you are given the hypothesis, P(Data|H₀), which is p, with its inverse probability P(H₀|Data), the probability of the hypothesis if you are given the data, which is something else entirely. Ironically, we can compute the inverse probability from the original probability – but only if we adopt a Bayesian approach that allows for “subjective” probabilities. That approach says that you begin the study of some prior belief (expressed as a probability) in a given hypothesis, and adjust that in light of your new data.

Alas, the whole NHST framework is by definition frequentist (that means it interprets your results as if you could do the same study countless times and your data are but one such realization) and does not permit the inversion of probabilities, which can only be done by invoking that pesky Bayes’s theorem that drives frequentists nuts. In the frequentist worldview, the null hypothesis is either true or false, period; it cannot have an intermediate probability assigned to it. Which, of course, means that 1 – P(H₀|Data), the probability that the alternative hypothesis is correct, is also undefined. In other words, if we do NHST, we have no warrant to conclude that either the null or the alternative hypothesis is true or false, or even likely or unlikely for that matter. To quote Jacob Cohen (1994), “The earth is round (p < 0.05).” Think about it.

(And what if our preferred alternative hypothesis is not the one and only possible alternative hypothesis? Then even if we could disprove the null, it would tell us nothing about the support provided by the data for our particular pet hypothesis. It would only show that some alternative is the correct one.)

But all this is moot. The calculation of p values assumes that we have drawn a simple random sample (SRS) from a population whose members are known with exactitude (i.e. from a comprehensive and non-redundant sample frame). There are corrections for certain kinds of deviations from SRS such as stratified sampling and cluster sampling, but these still assume an equal-probability random sampling method. This condition is almost never met in real-world research, including, God knows, my own. It’s not even met in experimental research – especially experiments on humans, which by moral necessity involve self-selection. In addition, the conventional interpretation of p values assumes that random sampling error associated with a finite sample size is the only source of error in our analysis, thus ignoring measurement error, various kinds of selection bias, model-specification error, etc., which together may greatly outweigh pure sampling error.

And don’t even get me started on the multiple-test problem, which can lead to completely erroneous estimates of the attained “significance” level of the test we finally decide to go with. This problem can get completely out of hand if any amount of exploratory analysis has been done. (Does anyone keep careful track of the number of preliminary analyses that are run in the course of, say, model development? I don’t.) As a result, the p values dutifully cranked out by statistical software packages are, to put it bluntly, wrong.

One final technical point: I mentioned above that almost no one sets a β value for their analysis, despite the fact that β determines how large a sample you’re going to need to meet your goal of rejecting the null hypothesis before you even go out and collect your data. Does it make any difference? Well, one survey calculated that the median (unreported) power of a large number of nonexperimental studies was about 0.48 (Maxwell, 2004). In other words, when it comes to accepting or rejecting the null hypothesis you might as well flip a coin.

And one final philosophical point: what do we really mean when we say that a finding is “statistically significant”? We mean an effect appears, according to some arbitrary standard such as p < 0.05, to exist. It does not tell us how big or how important the effect is. Statistical significance most emphatically is not the same as scientific or clinical significance. So why call it “significance” at all? With due deference to R. A. Fisher, who first came up with this odd and profoundly misleading bit of jargon, I suggest that the term “statistical significance” has been so corrupted by bad usage that it ought to be banished from the scientific literature.

In fact, I believe that NHST as a whole should be banished. At best, I regard an exact p value as providing nothing more than a loose indication of the uncertainty associated with a finite sample and a finite sample alone; it does not reveal God’s truth or substitute for thinking on the researcher’s part. I’m by no means alone (or original) in coming to this conclusion. More and more professional statisticians, as well as researchers in fields as diverse as demography, epidemiology, ecology, psychology, sociology, and so forth, are now making the same argument – and have been for some years (for just a few examples, see Oakes 1986; Rothman 1998; Hoem 2008; Cumming 2012; Fidler 2013; Kline 2013).

But if we abandon NHST, do we then have to stop doing anything other than purely descriptive statistics? After all, we were taught in intro stats that NHST is a virtual synonym for inferential statistics. But it’s not. This is not the place to discuss the alternatives to NHST in any detail (see the references provided a few sentences back), but it seems to me that instead of making a categorical yes/no decision about the existence of an effect (a rather metaphysical proposition), we should be more interested in estimating effect sizes and gauging their uncertainty through some form of interval estimation. We should also be fitting theoretically-interesting models and estimating their parameters, from which effect sizes can often be computed. And I have to admit, despite having been a diehard frequentist for the last several decades, I’m increasingly drawn to Bayesian analysis (for a crystal-clear introduction to which, see Kruschke, 2011). Thinking in terms of the posterior distribution, of support for a model provided by previous research as modified by new data seems a quite natural and intuitive way to capture how scientific knowledge actually accumulates. Anyway, the current literature is full of alternatives to NHST, and we should be exploring them.

By the way, the whole anti-NHST movement is relevant to the “Mermaid’s Tale” because most published biomedical and epidemiological “discoveries” (including what’s published in press releases) amount to nothing more than the blind acceptance of p values less than 0.05. I point to Anne Buchanan’s recent critical posting here about studies supposedly showing that sunlight significantly reduces blood pressure. At the p < 0.05 level, no doubt.

REFERENCES

Cohen, J. (1994) The earth is round (p < 0.05). American Psychologist 49: 997-1003.

Cumming, G. (2012) Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.

Fidler, F. (2013) From Statistical Significance to Effect Estimation: Statistical Reform in Psychology, Medicine and Ecology. Routledge, New York.

Gigerenzer, G. (2004) Mindless statistics. Journal of Socio-Economics 33: 587-606.

Hoem, J. M. (2008) The reporting of statistical significance in scientific journals: A reflexion. Demographic Research 18: 437-42.

Kempthorne, O. (1971) Discussion comment in Godambe, V. P., and Sprott, D. A. (eds.), Foundations of Statistical Inference. Toronto: Holt, Rinehart, and Winston.

Kline, R. B. (2013) Beyond Significance Testing: Statistics Reform in the Behavioral Sciences. Washington: American Psychological Association.

Kruschke, J. K. (2011) Doing Bayesian Analysis. Amsterdam: Elsevier.

Longford, N. T. (2005) Model selection and efficiency: Is “which model…?” the right question? Journal of the Royal Statistical Society (Series A) 168: 469-72.

Maxwell, S. E. (2004) The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods 9: 147-63.

Nelson, N., Rosenthal, R., and Rosnow, R. L. (1986) Interpretation of significance levels and effect sizes by psychological researchers. American Psychologist 41: 1299-1301.

Oakes, M. (1986) Statistical Inference: A Commentary for the Social and Behavioral Sciences. New York: John Wiley and Sons.

Rothman, K. J. (1998) Writing for Epidemiology. Epidemiology 9: 333-37.

Ziliak, S., and McCloskey, D. N. (2008) The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor: University of Michigan Press.

30 comments:

Holly DunsworthMay 15, 2013 at 6:56 AM
Thank you Jim! I recently attended my first medical conference and was absolutely flummoxed by the obsession with p values. People were scanning posters only for a p value and then moving to the next. I was literally frightened. And now I know better why.
ReplyDelete
Replies
Holly DunsworthMay 15, 2013 at 8:40 AM
Here's a link someone just tweeted (thank you!) to a Kruschke paper of relevance (not same as referenced above but...):
Bayesian Data Analysis
ReplyDelete
Replies
Anne BuchananMay 15, 2013 at 8:46 AM
Twitter has lit up on this! Here's another paper just tweeted, James Lumsden, 1976, "Test theory", and one from Jason Moore, 2013, "The limits of p-values for biological data mining".
ReplyDelete
Replies
Ken WeissMay 15, 2013 at 8:59 AM
Ian Hacking has a very fine book called The Taming of Chance, about the origins of probability in science, and there are others.

Significance testing has its own history, but basically it all boils down to the Enlightment era's change from deductive to inductive reasoning: that is, from just thinking vaguely about nature to observing repeatable aspects and inferring causation from them.

For various reasons, this led to the immensely practical use of statistics and probability _where the replication assumption is valid_

We're now in a new territory and need new thinking.

For example, "The limits of p-values" paper deals with genomic significance testing and shows how silly it has become.

There, to illustrate a point, GWAS etc makes the null hypothesis that genes don't affect the trait (like presence of disease), and tests every SNP in a huge set of markers for a case-control difference. Since hundreds of thousands of tests are done, the accepted p-value for a site to be called 'significant' is exceedingly small.

In fact, GWAS is done (when it's done properly, which isn't all the time) only if we already know the trait has some genetic component (e.g., it runs in families). So the significance tests in mapping are bogus. All you can do is search for regions of the genome that are most likely to be having an effect. There is no need or call to interpret the SNP sites as 'significant', even if one needs a way to decide which (if any) to follow up.

There is so much mis-use of statistical concepts that people are or should be aware of, as to be amazing to contemplate.
ReplyDelete
Replies
Herman PontzerMay 15, 2013 at 10:45 AM
Fantastic skewering of simple minded stats. Great piece. I'd be curious to hear your thoughts on the value of the other "beta" estimates (effect estimates) and r2 values and the like, which provide a measure of effect size but rely on many of the same problematic assumptions you note here for p-values. I for one would happily switch to Bayesian stats but am too lazy to put up with the BS of being an early adopter (e.g., uninformed reviewers, lack of stats software, etc). Perhaps we're near a tipping point on this?
ReplyDelete
Replies
Dave NussbaumMay 15, 2013 at 12:03 PM
Given that these critiques have been made before, I think the most important question is: how do we get from here to there? It's clearly not sufficient to merely explain rationally that our current methods are flawed -- the status quo is hard to change. The real question is how do you go about changing it.

The fact that there are many NHST alternatives out there is great -- but could actually be harmful to the overall goal of replacing NHST because there is no clear alternative. And there are big hurdles to overcome when, in the name of better science, researchers get fewer publishable results. That's clearly a hurdle worth overcoming, but it's generally one we ignore rather than grapple with.

Any thoughts?
ReplyDelete
Replies
AnonymousMay 16, 2013 at 12:24 PM
Great post. I completely agree with most of your criticisms. But I feel I should push back a bit. In particular I'm afraid that some might read this post and think 'frequentist statistics bad'. Frequentist reasoning at its core just means: think about what you may have observed instead of what you did observe. And this is a good idea. On the other hand, Bayesian reasoning is good too. I, and many other people who are smarter than me, find that it provides good methods of estimation. For example, in a small sample we might find a large and significant effect of ESP, but Bayesian reasoning allows us to temper how much we 'listen' to the raw data.

Here's a bit more on my opinions:

http://stevencarlislewalker.wordpress.com/2012/10/23/basis-functions-for-locating-statistical-philosophies/

Andrew Gelman's blog is another great place to find this kind of statistical pluralism. He has an idea that I like which he calls the methodological attribution problem: when someone does successful research, we have a tendency to attribute that success to the methods, but often the methods themselves have less to do with it. Rather, good research gets done when the researcher uses whatever methodological framework well. Gelman notices that good Bayesians let themselves use frequentist reasoning (a good description of Gelman himself) and good frequentists let themselves use Bayesian reasoning (eg. Efron -- the boorstrap guy).

In general, I agree with these criticisms of NHST here but think that maybe the tone is a little too strong?? Not sure...great post all the same.
ReplyDelete
Replies
John K. KruschkeMay 16, 2013 at 5:06 PM
Bayesian estimation supersedes the t test. That's a new article, explaining how to do Bayesian analysis and what's wrong with p values. Here: http://www.indiana.edu/~kruschke/BEST/
ReplyDelete
Replies

Add comment