Wednesday, May 15, 2013

Let's Abandon Significance Tests

By his own admission, this is not only the first blog post Jim Wood has ever written, it's also the first one he's ever read. We're hoping not the last of either, given this debut. Jim is a biological anthropologist and demographer in the Penn State Dept of Anthropology, a man of many interests. His statistical knowledge is deep and of course informs his every academic endeavor, from endocrinology to infectious disease to household agriculture and beyond. We think every student should print this post and carry it around with them everywhere.  Well, anyone.

Ronald Aylmer Fisher

By Jim Wood

It’s time we killed off NHST.

NHST (also derisively called “the intro to stats method”) stands for Null Hypothesis Significance Testing, sometimes known as the Neyman-Pearson (N-P) approach after its inventors, Jerzy Neyman and Egon Pearson (son of the famous Karl). There is also an earlier, looser, slightly less vexing version called the Fisherian approach (after the even more famous R. A. Fisher), but most researchers seem to adopt the N-P form of NHST, at least implicitly – or rather some strange and logically incoherent hybrid of the two approaches. Whichever you prefer, they both have very weak philosophical credentials, and a growing number of statisticians, as well as working scientists who care about epistemology, are calling – loudly and frequently – for their abandonment. Nonetheless, whenever my colleagues and I submit a manuscript or grant proposal that says we’re not going to do significance tests – and for the following principled reasons – we always get at least one reviewer or editor telling us that we’re not doing real science. The demand by scientific journals for “significant” results has led over time to a substantial publication bias in favor of Type I errors, resulting in a literature that one statistician has called a “junkyard” of unwarranted conclusions (Longford, 2005).

Jerzy Neyman
Let me start this critique by taking the N-P framework on faith. We want to test some theoretical model. To do so, we need to translate it into a statistical hypothesis, even if the model doesn’t really lend itself to hypothesis formulation (as, I would argue, is often the case in population biology, including population genetics and demography). Typically, the hypothesis says that some predictor variable of theoretical interest (the so-called “independent” variable) has an effect on some outcome variable (the “dependent” variable) of equal interest. To test this proposition we posit a null hypothesis of no effect, to which our preferred hypothesis is an alternative – sometimes the alternative, but not necessarily. We want to test the null hypothesis against some data; more precisely, we want to compute the probability that the data (or data even less consistent with the null) could have been observed in a random sample of a given size if the null hypothesis were true. (Never mind whether anyone in his or her right mind would believe in the null hypothesis in the first place or, if pressed on the matter, would argue that it was worth testing on its own merits.)   

Egon Pearson
Now we presumably have in hand a batch of data from a simple random sample drawn from a comprehensive sample frame – i.e. from a completely-known and well-characterized population (these latter stipulations are important and I return to them below). Before we do the test, we need to make two decisions that are absolutely essential to the N-P approach. First, we need to preset a so-called α value for the largest probability of making a Type I error (rejecting the null when it’s true) that we’re willing to consider consistent with a rejection of the null. Although we can set α at any value we please, we almost inevitably choose 0.05 or 0.01 or (if we’re really cautious) 0.001 or (if we’re happy-go-lucky) maybe 0.10. Why one of these values? Because we have five fingers – or ten if we count both hands. It really doesn’t go any deeper than that. Let’s face it, we choose one of these values because other people would look at us funny if we didn’t. If we choose α = 0.10, a lot of them will look at us funny anyway.

Suppose, then, we set α = 0.05, the usual crowd-pleaser. The next decision we have to make is to set a β value for the largest probability of committing a Type II error (accepting the null when it’s not true) that we can tolerate. The quantity (1 – β) is known as the power of the test, conventionally interpreted as the likelihood of rejecting a false null given the size of our sample and our preselected value of α. (By the way, don’t worry if you neglect to preset β because, heck, almost no one else bothers to – so it must not matter, right?) Now make some assumptions about how the variables are distributed in the population, e.g. that they’re normal random variates, and you’re ready to go.

So we do our test and we get p = 0.06 for the predictor variable we’re interested in. Damn. According to the iron law of α = 0.05 as laid down by Neyman and Pearson, we must accept the null hypothesis and reject any alternative, including our beloved one – which basically means that this paper is not going to get published. Or suppose we happen to get p = 0.04. Ha! We beat α, we get to reject the null, and that allows us to claim that the data support the alternative, i.e. the hypothesis we liked in the first place. We have achieved statistical significance! Why? Because God loves 0.04 and hates 0.06, two numbers that might otherwise seem to be very nearly indistinguishable from each other. So let’s go ahead and write up a triumphant manuscript for publication.
Significance is a useful means toward personal ends in the advance of science – status and widely distributed publications, a big laboratory, a staff of research assistants, a reduction in teaching load, a better salary, the finer wines of Bordeaux…. [S]tatistical significance is the way to achieve these. Design experiment. Then calculate statistical significance. Publish articles showing “significant” results. Enjoy promotion. (Ziliak and McCloskey, 2008: 32)
This account may sound like a crude and cynical caricature, but it’s not – it’s the Neyman-Pearson version of NHST. The Fisherian version gives you a bit more leeway in how to interpret p, but it was Fisher who first suggested p ≤ 0.05 as a universal standard of truth. Either way, it is the established and largely unquestioned basis for assessing hypothesized effects. A p of 0.05 (or whatever value of α you prefer) divides the universe into truth or falsehood. It is the main – usually the sole – criterion for evaluating scientific hypotheses throughout most of the biological and behavioral sciences.

What are we to make of this logic? First and most obviously, there is the strange practice of using a fixed, inflexible, and totally arbitrary α value such as 0.05 to answer any kind of interesting scientific question. To my mind, 0.051 and 0.049 (for example) are pretty much identical – at least I have no idea how to make sense of such a tiny difference in probabilities. And yet one value leads us to accept one version of reality and the other an entirely different one.

To quote Kempthorne (1971: 490):
To turn to the case of using accept-reject rules for the evaluation of data, … it seems clear that it is not possible to choose an α beforehand. To do so, for example, at 0.05 leads to all the doubts that most scientists feel. One is led to the untenable position that one’s conclusion is of one nature if a statistic t, say, is 2.30 and one of a radically different nature if t equals 2.31. No scientist will buy this unless he has been brainwashed and it is unfortunate that one has to accept as fact that many scientists have been brainwashed.

Think Kempthorne’s being hyperbolic in that last sentence? Nelson et al. (1986) did a survey of active researchers in psychology to ascertain their confidence in non-null hypotheses based on reported p values and discovered a sharp cliff effect (an abrupt change in confidence) at p = 0.05, despite the fact that p values change continuously across their whole range (a smaller cliff was found at p = 0.10). In response, Gigerenzer (2004: 590) lamented, “If psychologists are so smart, why are they so confused? Why is statistics carried out like compulsive hand washing?”
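To make the continuity of p values concrete, here is a minimal Python sketch. It assumes a two-sided test on a standard normal statistic, which is a simplification and not Kempthorne's t or the design of any study cited here; the function name `two_sided_p` and the three z values are hypothetical, chosen only to straddle the conventional cutoff.

```python
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p value for a standard normal test statistic z."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Illustrative statistics that straddle the conventional 0.05 cutoff.
for z in (1.95, 1.96, 1.97):
    p = two_sided_p(z)
    verdict = "reject" if p < 0.05 else "do not reject"
    print(f"z = {z:.2f}  p = {p:.4f}  -> {verdict} the null at alpha = 0.05")
```

The p values differ only in the third decimal place across the three statistics, yet the accept-reject verdict flips between the first and the last. That is the cliff in miniature.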

But now suppose we’ve learned our lesson, and so, chastened, we abandon our arbitrary threshold α value and look instead at the exact p value associated with our predictor variable, as many writers have advocated. And let’s suppose that it is impressively low, say p = 0.00073. We conclude, correctly, that if the null hypothesis were true (which we never really believed in the first place) then the data we actually obtained in our sample would have been pretty unlikely. So, following standard practice, we conclude that the probability that the null hypothesis is true is only 0.00073. Right? Wrong. We have confused the probability of the data if you are given the hypothesis, P(Data|H0), which is p, with its inverse probability P(H0|Data), the probability of the hypothesis if you are given the data, which is something else entirely. Ironically, we can compute the inverse probability from the original probability – but only if we adopt a Bayesian approach that allows for “subjective” probabilities. That approach says that you begin the study with some prior belief (expressed as a probability) in a given hypothesis, and adjust that in light of your new data.

Alas, the whole NHST framework is by definition frequentist (that means it interprets your results as if you could do the same study countless times and your data are but one such realization) and does not permit the inversion of probabilities, which can only be done by invoking that pesky Bayes’s theorem that drives frequentists nuts. In the frequentist worldview, the null hypothesis is either true or false, period; it cannot have an intermediate probability assigned to it. Which, of course, means that 1 – P(H0|Data), the probability that the alternative hypothesis is correct, is also undefined. In other words, if we do NHST, we have no warrant to conclude that either the null or the alternative hypothesis is true or false, or even likely or unlikely for that matter. To quote Jacob Cohen (1994), “The earth is round (p < 0.05).” Think about it.

(And what if our preferred alternative hypothesis is not the one and only possible alternative hypothesis? Then even if we could disprove the null, it would tell us nothing about the support provided by the data for our particular pet hypothesis. It would only show that some alternative is the correct one.)
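The gap between P(Data|H0) and P(H0|Data) can be sketched numerically with Bayes's theorem. This is a rough illustration under loud assumptions: it conditions on the event "the test came out significant" rather than on an exact p value, it pretends there is exactly one alternative hypothesis, and the prior, α, and power values are all hypothetical, as is the function name.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(H0 | test is 'significant') from Bayes's theorem.

    Assumes exactly two hypotheses: H0, with prior probability prior_null,
    and a single alternative that the test detects with probability `power`.
    """
    sig_and_null = alpha * prior_null         # P(significant | H0) * P(H0)
    sig_and_alt = power * (1.0 - prior_null)  # P(significant | H1) * P(H1)
    return sig_and_null / (sig_and_null + sig_and_alt)

# Hypothetical inputs: alpha = 0.05, power = 0.5, three different priors.
for prior in (0.1, 0.5, 0.9):
    post = prob_null_given_significant(prior, alpha=0.05, power=0.5)
    print(f"P(H0) = {prior:.1f}  ->  P(H0 | significant) = {post:.3f}")
```

The same "significant" result leaves very different probabilities for the null depending on the prior, and none of them equals α or p. Without a prior, which the frequentist framework declines to supply, the calculation cannot be done at all.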

But all this is moot. The calculation of p values assumes that we have drawn a simple random sample (SRS) from a population whose members are known with exactitude (i.e. from a comprehensive and non-redundant sample frame). There are corrections for certain kinds of deviations from SRS such as stratified sampling and cluster sampling, but these still assume an equal-probability random sampling method. This condition is almost never met in real-world research, including, God knows, my own. It’s not even met in experimental research – especially experiments on humans, which by moral necessity involve self-selection. In addition, the conventional interpretation of p values assumes that random sampling error associated with a finite sample size is the only source of error in our analysis, thus ignoring measurement error, various kinds of selection bias, model-specification error, etc., which together may greatly outweigh pure sampling error.

And don’t even get me started on the multiple-test problem, which can lead to completely erroneous estimates of the attained “significance” level of the test we finally decide to go with. This problem can get completely out of hand if any amount of exploratory analysis has been done. (Does anyone keep careful track of the number of preliminary analyses that are run in the course of, say, model development? I don’t.) As a result, the p values dutifully cranked out by statistical software packages are, to put it bluntly, wrong.
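A back-of-the-envelope sketch of how quickly the multiple-test problem grows, assuming the tests are independent and every null hypothesis is true. Exploratory analyses run on the same data are not independent, so treat this only as a rough guide; the function name is hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent
    tests of true null hypotheses, each run at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

# How the chance of a spurious "finding" grows with the number of tests.
for k in (1, 5, 20, 100):
    print(f"{k:>3} tests -> {familywise_error_rate(0.05, k):.3f}")
```

Under these assumptions, twenty tests at α = 0.05 already make at least one spurious "significant" result more likely than not.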

One final technical point: I mentioned above that almost no one sets a β value for their analysis, despite the fact that β determines how large a sample you’re going to need to meet your goal of rejecting the null hypothesis before you even go out and collect your data. Does it make any difference? Well, one survey calculated that the median (unreported) power of a large number of nonexperimental studies was about 0.48 (Maxwell, 2004). In other words, when it comes to accepting or rejecting the null hypothesis you might as well flip a coin.
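The link between β, effect size, and sample size can be sketched as follows. This uses a normal approximation to a two-sided, two-sample comparison of means (a real analysis would typically use the t distribution), and the standardized effect size d = 0.5 and the sample sizes are hypothetical, not taken from Maxwell (2004).

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z test.

    d is the true standardized mean difference (difference / common SD).
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = d * sqrt(n_per_group / 2.0)  # expected value of the z statistic
    upper = 1.0 - _STD_NORMAL.cdf(z_crit - shift)  # reject in the upper tail
    lower = _STD_NORMAL.cdf(-z_crit - shift)       # reject in the lower tail
    return upper + lower

def n_per_group_for(d, alpha=0.05, beta=0.20):
    """Approximate n per group for power of about 1 - beta."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_beta = _STD_NORMAL.inv_cdf(1.0 - beta)
    return ceil(2.0 * ((z_alpha + z_beta) / d) ** 2)

for n in (15, 30, 63):
    print(f"n = {n:>2} per group -> power = {power_two_sample(0.5, n):.2f}")
print("n per group for 80% power at d = 0.5:", n_per_group_for(0.5))
```

With these made-up inputs, 30 subjects per group gives power close to one half, which is the coin flip described above. Presetting β is what tells you, before data collection, how far short of adequate such a sample is.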

And one final philosophical point: what do we really mean when we say that a finding is “statistically significant”? We mean an effect appears, according to some arbitrary standard such as p < 0.05, to exist. It does not tell us how big or how important the effect is. Statistical significance most emphatically is not the same as scientific or clinical significance. So why call it “significance” at all? With due deference to R. A. Fisher, who first came up with this odd and profoundly misleading bit of jargon, I suggest that the term “statistical significance” has been so corrupted by bad usage that it ought to be banished from the scientific literature.

In fact, I believe that NHST as a whole should be banished. At best, I regard an exact p value as providing nothing more than a loose indication of the uncertainty associated with a finite sample and a finite sample alone; it does not reveal God’s truth or substitute for thinking on the researcher’s part. I’m by no means alone (or original) in coming to this conclusion. More and more professional statisticians, as well as researchers in fields as diverse as demography, epidemiology, ecology, psychology, sociology, and so forth, are now making the same argument – and have been for some years (for just a few examples, see Oakes 1986; Rothman 1998; Hoem 2008; Cumming 2012; Fidler 2013; Kline 2013).

But if we abandon NHST, do we then have to stop doing anything other than purely descriptive statistics? After all, we were taught in intro stats that NHST is a virtual synonym for inferential statistics. But it’s not. This is not the place to discuss the alternatives to NHST in any detail (see the references provided a few sentences back), but it seems to me that instead of making a categorical yes/no decision about the existence of an effect (a rather metaphysical proposition), we should be more interested in estimating effect sizes and gauging their uncertainty through some form of interval estimation. We should also be fitting theoretically-interesting models and estimating their parameters, from which effect sizes can often be computed. And I have to admit, despite having been a diehard frequentist for the last several decades, I’m increasingly drawn to Bayesian analysis (for a crystal-clear introduction to which, see Kruschke, 2011). Thinking in terms of the posterior distribution, of support for a model provided by previous research as modified by new data seems a quite natural and intuitive way to capture how scientific knowledge actually accumulates. Anyway, the current literature is full of alternatives to NHST, and we should be exploring them.
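As one small illustration of reporting an effect size with its uncertainty instead of a yes/no verdict, here is a sketch of a difference in means with an interval estimate. It assumes a normal approximation (small samples like these would ordinarily call for a t-based interval), and the data and function name are invented for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_difference_ci(group_a, group_b, level=0.95):
    """Difference in means (a - b) with a normal-approximation interval."""
    diff = mean(group_a) - mean(group_b)
    # Standard error of the difference, allowing unequal variances.
    se = sqrt(stdev(group_a) ** 2 / len(group_a) + stdev(group_b) ** 2 / len(group_b))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return diff, (diff - z * se, diff + z * se)

# Made-up measurements, for illustration only.
treated = [5.1, 4.8, 6.0, 5.5, 5.9, 6.2, 5.3, 5.7]
control = [4.9, 4.4, 5.2, 5.0, 4.7, 5.6, 4.5, 5.1]
diff, (low, high) = mean_difference_ci(treated, control)
print(f"estimated difference = {diff:.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```

The output is an estimate of how big the effect is and how uncertain that estimate is, which invites a judgment about scientific importance rather than a verdict about existence.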

By the way, the whole anti-NHST movement is relevant to the “Mermaid’s Tale” because most published biomedical and epidemiological “discoveries” (including what’s published in press releases) amount to nothing more than the blind acceptance of p values less than 0.05. I point to Anne Buchanan’s recent critical posting here about studies supposedly showing that sunlight significantly reduces blood pressure. At the p < 0.05 level, no doubt.


Cohen, J. (1994) The earth is round (p < 0.05). American Psychologist 49: 997-1003.

Cumming, G. (2012) Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.

Fidler, F. (2013) From Statistical Significance to Effect Estimation: Statistical Reform in Psychology, Medicine and Ecology. New York: Routledge.

Gigerenzer, G. (2004) Mindless statistics. Journal of Socio-Economics 33: 587-606.

Hoem, J. M. (2008) The reporting of statistical significance in scientific journals: A reflexion. Demographic Research 18: 437-42.

Kempthorne, O. (1971) Discussion comment in Godambe, V. P., and Sprott, D. A. (eds.), Foundations of Statistical Inference. Toronto: Holt, Rinehart, and Winston.

Kline, R. B. (2013) Beyond Significance Testing: Statistics Reform in the Behavioral Sciences. Washington: American Psychological Association.

Kruschke, J. K. (2011) Doing Bayesian Data Analysis: A Tutorial with R and BUGS. Amsterdam: Elsevier.

Longford, N. T. (2005) Model selection and efficiency: Is “which model…?” the right question? Journal of the Royal Statistical Society (Series A) 168: 469-72.

Maxwell, S. E. (2004) The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods 9: 147-63.

Nelson, N., Rosenthal, R., and Rosnow, R. L. (1986) Interpretation of significance levels and effect sizes by psychological researchers. American Psychologist 41: 1299-1301.

Oakes, M. (1986) Statistical Inference: A Commentary for the Social and Behavioral Sciences. New York: John Wiley and Sons.

Rothman, K. J. (1998) Writing for Epidemiology. Epidemiology 9: 333-37.

Ziliak, S., and McCloskey, D. N. (2008) The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor: University of Michigan Press.


Holly Dunsworth said...

Thank you Jim! I recently attended my first medical conference and was absolutely flummoxed by the obsession with p values. People were scanning posters only for a p value and then moving to the next. I was literally frightened. And now I know better why.

Holly Dunsworth said...

Here's a link someone just tweeted (thank you!) to a Kruschke paper of relevance (not same as referenced above but...):
Bayesian Data Analysis

Anne Buchanan said...

Twitter has lit up on this! Here's another paper just tweeted, James Lumsden, 1976, "Test theory", and one from Jason Moore, 2013, "The limits of p-values for biological data mining".

Ken Weiss said...

This looks very good! Thanks for adding it to the Comments so readers could look at this.

Bayesian concepts are controversial largely because they are explicitly subjective, resting on a 'degrees of belief' interpretation of 'probability', whereas the frequentist and other similar schools of thought use probability to mean the proportion of repeated trials with a certain outcome.

The issue is that probability and statistical testing are designed for the widespread circumstances in which we try to infer causation from imperfect data. But often the data cannot be repeated (your observed whatever, say the behavior of adults in a small deme of humans or primates, is all there is, and there can be no replication).

Since decision-making criteria are inherently subjective, the Bayesians, I think, correctly acknowledge that our inferences are subjective and our obligation is to acknowledge that formally, and not pretend otherwise.

Jim Wood said...

And I thank you, Holly. One of the things I find frightening -- something I didn't discuss in my posting -- is the incredibly simpleminded view biomedical types have of biological causation, especially when they do bivariate analyses with no attention to confounding influences. There's either an effect of the "independent" variable or there's not. Context doesn't matter.

Ken Weiss said...

Ian Hacking has a very fine book called The Taming of Chance, about the origins of probability in science, and there are others.

Significance testing has its own history, but basically it all boils down to the Enlightenment era's change from deductive to inductive reasoning: that is, from just thinking vaguely about nature to observing repeatable aspects and inferring causation from them.

For various reasons, this led to the immensely practical use of statistics and probability _where the replication assumption is valid_

We're now in a new territory and need new thinking.

For example, "The limits of p-values" paper deals with genomic significance testing and shows how silly it has become.

There, to illustrate a point, GWAS etc makes the null hypothesis that genes don't affect the trait (like presence of disease), and tests every SNP in a huge set of markers for a case-control difference. Since hundreds of thousands of tests are done, the accepted p-value for a site to be called 'significant' is exceedingly small.

In fact, GWAS is done (when it's done properly, which isn't all the time) only if we already know the trait has some genetic component (e.g., it runs in families). So the significance tests in mapping are bogus. All you can do is search for regions of the genome that are most likely to be having an effect. There is no need or call to interpret the SNP sites as 'significant', even if one needs a way to decide which (if any) to follow up.

There is so much mis-use of statistical concepts that people are or should be aware of, as to be amazing to contemplate.

Ken Weiss said...

Yes, and that's weird since DNA doesn't do anything at all (other than slowly degrade) except in the context of other things in the cell and organism's environment! The idea that genes act on their own is simply wrong and only sometimes sufficiently accurate (as in Mendel's peas) to let us ignore context.

Herman Pontzer said...

Fantastic skewering of simple minded stats. Great piece. I'd be curious to hear your thoughts on the value of the other "beta" estimates (effect estimates) and r2 values and the like, which provide a measure of effect size but rely on many of the same problematic assumptions you note here for p-values. I for one would happily switch to Bayesian stats but am too lazy to put up with the BS of being an early adopter (e.g., uninformed reviewers, lack of stats software, etc). Perhaps we're near a tipping point on this?

Jim Wood said...

There's no simple answer. I do use estimates of effect size, R2, and confidence intervals (which are sometimes accused of being hypothesis tests in disguise, but that depends on how you use them). But I interpret them (I hope) critically, and I refuse to use a single number as the sole criterion for scientific judgment. I also believe quite firmly that no single statistical analysis has ever conclusively answered an interesting scientific question. Replication is all. I understand and sympathize with your reservations about Bayesian stats -- laziness is a trait I'm all too familiar with -- but I do think things are changing. You can certainly do Bayesian stuff in R (the Kruschke book is a nice introduction). Some journals are becoming much more welcoming to Bayesian analysis -- but you're right: some reviewers will still reject it out of hand. I also think that the anti-NHST crowd is still casting about for a consensus on what the alternatives are. I guess the one encouraging thing is that a lot of smart people are now struggling with the issues. Stay tuned...

Holly Dunsworth said...

I wonder how many scientists we lost who were unable to overcome these issues and play along like there's no problem.

Ken Weiss said...

We need all to try to understand and implement these points in our own writing and work. We need to avoid excessive claims of definitive 'findings'.

And then, we need to train students in appropriate critical thinking....from the very first time we see them. The idea would be to avoid building in bad habits that are then much harder to get out of.

It's not easy, though Jim is about as good at it as anyone I know....

Dave Nussbaum said...

Given that these critiques have been made before, I think the most important question is: how do we get from here to there? It's clearly not sufficient to merely explain rationally that our current methods are flawed -- the status quo is hard to change. The real question is how do you go about changing it.

The fact that there are many NHST alternatives out there is great -- but could actually be harmful to the overall goal of replacing NHST because there is no clear alternative. And there are big hurdles to overcome when, in the name of better science, researchers get fewer publishable results. That's clearly a hurdle worth overcoming, but it's generally one we ignore rather than grapple with.

Any thoughts?

Ken Weiss said...

Very cogent comments!

Reform has to come from forced circumstances (like uncontrollable but serious budget cuts for research) or the grass roots (faculty, esp. junior faculty, resisting the publish-every-day-or-perish and get-a-grant-or-perish pressures that administrators, except perhaps the very few with vision and imagination, have no incentive to change).

There are deeper issues, as well, and they have (I think) to do with the very notion of probabilistic causation and inference, especially with complex many-to-many factor-outcome relationships.

The negative or lack-of-real results is part of the safety of doing incremental science that the above pressures entail. But you're very right to ask what can bring the change about...

Jim Wood said...

Excellent point. (And, by the way, I hope I managed to communicate that none of what I said was either new or original.) As I mentioned in an earlier reply, anti-NHST forces clearly exist and are growing, but they have not reached any consensus about what the best alternatives are. I'm a consumer not a producer of that kind of wisdom! Until the alternatives are firmly established, I guess I recommend flexibility, not taking your inferences too seriously, and, well, intellectual humility. A typical analysis produces a lot of numerical output in addition to p values. Any judgement should take all this evidence into account, and should also admit that none of it is absolutely definitive (even p values provide a little bit of information). Things like AICs and R2s and likelihood ratios are all useful as long as we don't take them as revealed truth.

Ken Weiss said...

I can say first of all that these issues were lively way (way, way) back when I was a graduate student. Then, likelihood methods were being promoted (as well as Bayesian ones) to get around the assumption of replicability. Even then, some sort of 'significance' tests (like LOD score cutoffs) were used...people trying then, as now, to have it both ways.

I would also say that as I think you and others noted, we mess around a lot with data and ad hoc study issues as if they did not affect the assumptions of our significance or other statistical testing. To me this immediately affects the actual, as opposed to computed p-values, and to an unknown extent but likely to inflate them. That's because you do all of this to force the data to show what one believes is true.

So if you have reason to believe something is, say, genetic (because it clusters in families in ways that appear to be correlated appropriately), all the mapping tests you do should not test the 'null' hypothesis, but instead should just ask 'given that I think there are genes involved, where in the genome are they most likely?' or perhaps the 'null' hypothesis should be what you think is the truth, and really try to falsify it. That's harder to construct but to me makes more sense.

So why not just accept one's prior beliefs and see how well, or if, they add to or subtract from the strength you can have of the belief?

This is at least as pernicious, and far more subtle, than the formal study design issues most discussions are about. So the problem is actually worse than most of even this discussion above would have it.

Jim Wood said...

Fisher advocated using your pet hypothesis as the null and trying your damnedest to falsify it. That at least gets around the problem that most null hypotheses are vacuous.

Ken Weiss said...

To me, that basically verges on Bayesian analysis.

(By the way, this approach is named after Thomas Bayes (1701-1761), who stated the basic relationships among prior probability--your confidence or degree of belief in a hypothesis with which you begin, say, a study--and posterior probability, or your confidence as it has been increased or decreased by the results of the study. That's what 'Bayesian analysis' is about)

Jim Wood said...

Dave, I want to disagree (partially) with one thing you said. You said it's not enough merely to explain rationally why our current methods are flawed. Yes, we have to do more to develop new methods. But my experience is that NHST is still the core dogma for almost all of the science I have contact with -- which spans the biological, epidemiological, and social sciences. It's also still the foundation for most stats instruction. One of my grad students just took a grad-level mathematical statistics course (i.e. NOT an intro course), and she complained that it was all NHST. So there's still plenty of room for critiques of the current methods.

Dave Nussbaum said...

Thanks for all the comments! Jim, on your last point, I agree. I'm not saying that critique is not useful or necessary or worth repeating. It's just that, by itself, it's not enough. Sometimes the time isn't ripe for reform, but it seems to me that this is one of those times where there's at least a hint at an opening. That's why I'm so interested in people's thoughts about how to take steps to make these sorts of changes in an effective way. It's not a criticism of the criticism, it's a genuine attempt to consider how to turn the criticisms into real changes. Thanks for the thoughtful comments and suggestions. Cheers! Dave

Ken Weiss said...

Grants are expected, by reviewing panels, to justify their study design by showing the estimated power to detect an effect (here, of a genetic variant somewhere in the genome, if such exists).

Since there is (or should be) already good evidence that the trait is 'genetic' (e.g., elevated familial risk consistent with kinship relationship), then the study logic really is: (1) this trait is genetic, (2) the genetic risk factor(s) must be somewhere in the genome, (3) my sample size should give me xx% power to detect the effect.

NIH reviewers usually expect something like 80% power for studies they'll fund. Joe Terwilliger has informally (and perhaps formally--I don't know) estimated what fraction of studies actually find what they have promised to find, and it's in a sense way less than 80% (unless, perhaps, one counts effects so trivial as to be basically irrelevant to find); yet the same sorts of studies keep getting funded.

So some sort of accountability criterion, with some sort of 'sanction', by funding agencies might help. If you propose an overly rosy grant, your 'Bayesian' success likelihood should be lowered, and that should lower you on the funding priority list for your next grant (adjust the priority score to reflect past record of promised achievement).

That might reduce wasteful funding on me-too incremental research designs. Really innovative studies don't or needn't have the 'power-estimation' structure and can be more exploratory. And they might, under a revised set of criteria, have a better chance of actually being funded compared to now.

But it would be hard to impose such structures on the funding system (or in publishing as well), since it would undermine what many people are, treadmill hamster style, spending their research lives doing.

Jim Wood said...

Dave, it looks like you and I are in complete agreement. So, okay, folks, where do you see statistical inference going if we abandon NHST?

stevencarlislewalker said...

Great post. I completely agree with most of your criticisms. But I feel I should push back a bit. In particular I'm afraid that some might read this post and think 'frequentist statistics bad'. Frequentist reasoning at its core just means: think about what you may have observed instead of what you did observe. And this is a good idea. On the other hand, Bayesian reasoning is good too. I, and many other people who are smarter than me, find that it provides good methods of estimation. For example, in a small sample we might find a large and significant effect of ESP, but Bayesian reasoning allows us to temper how much we 'listen' to the raw data.

Here's a bit more on my opinions:

Andrew Gelman's blog is another great place to find this kind of statistical pluralism. He has an idea that I like which he calls the methodological attribution problem: when someone does successful research, we have a tendency to attribute that success to the methods, but often the methods themselves have less to do with it. Rather, good research gets done when the researcher uses whatever methodological framework they have chosen well. Gelman notices that good Bayesians let themselves use frequentist reasoning (a good description of Gelman himself) and good frequentists let themselves use Bayesian reasoning (e.g. Efron, the bootstrap guy).

In general, I agree with these criticisms of NHST here but think that maybe the tone is a little too strong?? Not sure...great post all the same.

Ken Weiss said...

I don't think the tone is too strong, but I agree with what you say, and in a way it boils down (I think--I can't speak for Jim!) to whether the assumption holds that the same experiment--like flipping coins or rolling dice--could really be repeated countless times, and whether the hypothesis is carefully focused. That is, essentially, what the frequentist approach was designed for and where it can work very well.

But that is not the same as saying that the use of the approach is carefully restricted to appropriate situations, I think.

Jim Wood said...

Good point. I didn't mean to dis the frequentists. I've been one for a long time, and I still remain one in many respects. Insofar as I'm a Bayesian at all, I'm a reluctant and uncertain one. I'm still not sure that it's proper to quantify subjective uncertainty. And I'm not sure that a uniform distribution is the right way to specify a completely neutral prior -- and on and on. Bayesian analysis is not a panacea. I've come to accept it because it seems to be USEFUL in some contexts. But I still use confidence intervals, which are basically frequentist. So I totally agree with you about the virtues of statistical pluralism. And I love Andrew Gelman's point about the methodological attribution problem. I got the result I wanted, so my method must be a really good one! (I've actually had people say this to me.)

Steve Walker said...

Maybe it's like the 'guns don't kill people...' thing: p-values don't hurt people, bad hypotheses do. For example, you can calculate a p-value under an 'interesting' (i.e. non-null) hypothesis. If p is small in such a case, I think you've learned something interesting. Now it might be hard to figure out what in particular was bad about the hypothesis, but at least you've got something interesting to think about. On the other hand, p-values can be virtually useless if the null hypothesis is so devoid of relevance that you are really just waiting for a large enough sample to reject it. This is the kind of thing Jim rightly criticizes. But this is not the only way to use p-values.

So yes 'coin-flipping' hypotheses are often (usually) stupid, but not all statistical hypotheses are stupid and p-values can be another useful tool for assessing them.

OK let me now admit that I like confidence intervals better than p-values, usually. The sample size dependence of p-values *is* weird to say the least, etc...what Jim said. But it feels wrong to just throw them out.
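The sample-size dependence is easy to see in a small sketch (my own illustration, not from the comment). It assumes a one-sample z-test of a zero-mean null with known standard deviation 1, and holds a practically trivial observed effect of 0.01 fixed while only the sample size grows. The function names and all the numbers are hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist

def two_sided_p(effect, sd, n):
    """Two-sided z-test p-value for H0: mean = 0, given an observed mean."""
    z = effect / (sd / sqrt(n))
    # 2 * (1 - Phi(|z|)) written with erfc to stay accurate for large |z|
    return erfc(abs(z) / sqrt(2))

def ci95(effect, sd, n):
    """95% confidence interval for the mean under the same assumptions."""
    half_width = NormalDist().inv_cdf(0.975) * sd / sqrt(n)
    return effect - half_width, effect + half_width

# Same tiny effect every time; only n changes.
for n in (100, 10_000, 1_000_000):
    p = two_sided_p(0.01, 1.0, n)
    lo, hi = ci95(0.01, 1.0, n)
    print(f"n = {n:>9,}: p = {p:.3g}, 95% CI = ({lo:.4f}, {hi:.4f})")
```

Under these assumptions the p-value goes from nowhere near significant to vanishingly small purely because n grew, while the confidence interval keeps reporting the same thing throughout: an effect close to 0.01, estimated with increasing precision.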

One final thought: I think that any serious philosophical attack on p-values these days probably needs to address the work of Deborah Mayo, who is providing very interesting justifications for NHST:

Jim Wood said...

Thanks, Steve. I checked out Deborah Mayo's work, and it does indeed look very interesting (even if I might not end up agreeing with her). It is certainly relevant to the debate going on here. Her book "Error and the Growth of Experimental Knowledge" looks like a spirited defense of NHST. It includes a chapter called "Why you can't be a little bit of a Bayesian," which would seem to put paid to the idea of statistical pluralism. Her other book of readings, "Error and Inference," also looks good and a bit more, well, pluralistic. It's nice seeing a philosopher paying serious attention to all these issues. By the way, none of this changes (so far!) anything I said in my posting.

Steve Walker said...

Thanks for your thoughtful responses. This is fun.

I also find that Mayo's stuff doesn't really change my mind, but her perspectives are useful when I find myself tempted to just dismiss empirical research because it doesn't present effect sizes (or because of other standard p-value issues).

To defend her extreme stance a bit, she is a philosopher and in philosophy departments Bayes rules. So I think that her pushing back against the Bayesian domination of the field of the philosophy of statistics is very useful. I personally don't like to take such an extreme position, but I appreciate why she does it. Maybe this is where you are coming from, but in reverse -- your field is dominated by NHST so you feel you need to push back? Fair enough.

Re "none of this changes (so far!) anything I said in my posting", the main issue for me is with the title: "let's *abandon* significance testing". For me, Mayo's work suggests that this advice is too harsh.

I think that the real problems generating the many many bad p-value-based analyses are: 1. probability and statistics are difficult, 2. there is too much pressure to publish significant findings (the problem Holly mentioned in the first comment). In other words, any statistical paradigm will be abused, because statistics are hard.

Anyways, I'm starting to repeat myself now. Thanks for the fun post. Good to 'meet' you Jim. And I would like to second the comment that you should continue blogging!

Jim Wood said...

Hey, Steve, I just checked out your website. Great! I'm going to spend the rest of my life trying to figure out where I fall in your 4-D vector space. I tend to agree with Jeremy Fox that understanding trumps prediction. I've seen a lot of deliberately over-simplified models that greatly enhance understanding while being lousy at prediction (see von Thunen's models in geography).

By the way, just to prove how unoriginal my post is, check out the following, which was posted ten years ago and makes essentially the same points I made:

Or just google NHST for lots of examples.

John K. Kruschke said...

Bayesian estimation supersedes the t test. That's a new article, explaining how to do Bayesian analysis and what's wrong with p values. Here:

Mayo said...

I think you might be interested in my current blogpost which takes up the question as to what in the world posteriors can really mean. Bayesians will not tell you.