Monday, August 31, 2015

It's not just about psych studies; it's about core aspects of inference.

A paper just published in Science by the "Open Science Collaboration" reports the results of a multi-year, multi-institution effort to replicate 100 psychology studies published in three top psychology journals in 2008.  The effort has often been discussed since it began in 2011, in large part because replicability is integral to the 'scientific method' as the way scientific results are confirmed; yet replication studies aren't a terribly creative use of a researcher's time and are difficult to publish, so they aren't often on researchers' to-do lists.  So this was unusual.

[Image: Les Twins; source: Wikipedia]
There are many reasons a study can't be replicated.  Sometimes the study was poorly conceived or carried out (assumptions and biases not taken into account), sometimes the results pertain only to the particular sample reported (a single family or population), sometimes the methods in an original study aren't described well enough to be replicated, sometimes random or even systematic error (instrument behaving badly) skews the results.

Because there's no such thing as a perfect study, replication studies can be victims of any of the same issues, so interpreting lack of replication isn't necessarily straightforward, and certainly doesn't always mean that the original study was flawed.

The Open Science Collaboration was scrupulous in its efforts to replicate original studies as carefully and faithfully as possible.  Still, the results weren't pretty.  The authors write:
Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects.
Interestingly enough, the authors aren't quite sure what any of this means.  First, as they point out, direct replication doesn't verify the theoretical interpretation of a result, which could have been flawed originally and remain flawed.  And when a study fails to replicate, it's impossible to know why: whether the original was flawed, the replication effort was flawed, or both.
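To make one of those headline numbers concrete, here is a minimal sketch (in Python, using invented numbers purely for illustration, not figures from the paper) of the kind of check behind the "47% of original effect sizes were in the 95% confidence interval of the replication effect size" statistic: effects are put on a common correlation scale, a confidence interval is built around the replication estimate via the Fisher z-transform, and one asks whether the original point estimate falls inside it.

    import math

    def fisher_z(r):
        """Fisher z-transform of a correlation coefficient."""
        return 0.5 * math.log((1 + r) / (1 - r))

    def inv_fisher_z(z):
        """Back-transform from the z scale to a correlation."""
        return math.tanh(z)

    def replication_ci(r_rep, n_rep, z_crit=1.96):
        """Approximate 95% CI for a replication correlation, via Fisher z."""
        se = 1.0 / math.sqrt(n_rep - 3)   # standard error on the z scale
        z = fisher_z(r_rep)
        return inv_fisher_z(z - z_crit * se), inv_fisher_z(z + z_crit * se)

    # Hypothetical numbers, purely for illustration (not from the OSC paper):
    r_original, r_replication, n_replication = 0.40, 0.18, 120

    lo, hi = replication_ci(r_replication, n_replication)
    print("replication 95%% CI: (%.2f, %.2f)" % (lo, hi))
    print("original estimate inside replication CI:", lo <= r_original <= hi)

With these invented numbers the original estimate falls outside the replication interval, which is the situation the paper reports for roughly half of the study pairs.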

This effort has been the subject of much discussion, naturally enough.  In a piece published last week in The Atlantic, Ed Yong quotes several psychologists, including the project's lead author, saying that this project has been a welcome learning experience for the field.  There are plans afoot to change how things are done, including pre-registration of hypotheses so that reported results can't be cherry-picked, and increasing the size of studies to increase their power, as has been done in the field of genetics.
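On the power point, a minimal sketch (assuming a simple two-group comparison and a normal approximation, not any particular study from the project) of why small effects demand much larger studies:

    import math

    def n_per_group(d, z_alpha=1.96, z_beta=0.84):
        """Approximate per-group sample size for a two-sample comparison
        at two-sided alpha = 0.05 and 80% power, for standardized effect d."""
        return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

    for d in (0.8, 0.5, 0.2, 0.1):   # conventionally 'large' down to 'very small' effects
        print("effect d = %.1f: about %d subjects per group" % (d, n_per_group(d)))

Halving the effect size roughly quadruples the required sample, which is part of why genetics moved to very large studies, and a reminder that most of the effects being chased are small.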

We'll see whether this is just a predictable wagon-circling welcome, or really means something.  One has every reason to be skeptical, and to wonder whether these fields really are sciences in the proper sense of the term.  Indeed, it's quite interesting to see genetics held up as an exemplar of good and reliable study design.  After billions of dollars spent on studies large and small of the genetics of asthma, heart disease, type 2 diabetes, obesity, hypertension, stroke, and so on, we've got not only a lot of contradictory findings, but also, for the most part, genes with small effects.  And epidemiology, many of the 'omics fields, evolutionary biology, and others haven't done any better.

Why?  The vagueness of the social and behavioral sciences is only part of the problem (unlike, say, force, outcome variables such as stress, aggression, crime, or intelligence are hard to consistently define, and can vary according to the instrument with which they are measured).  Biomedical outcomes can be vague and hard to define as well (autism, schizophrenia, high blood pressure).  We don't understand enough about how genes interact with each other or with the environment to understand complex causality.

Statistics and science
The problem may be much deeper than any of this discussion of non-replicable results suggests.  First, from an evolutionary point of view, we expect organisms to be different, not replicates.  This is because mutation (and recombination) makes each individual organism's genotype unique; and second, the need to adapt--Darwin's central claim or observation--means that organisms have to be different so that their 'struggle for life' can occur.

We have only a general theory for this, since life is an ad hoc adaptive/evolutionary phenomenon.  Far more broadly than just the behavioral or social sciences, our investigative methods are based on 'internal' comparisons (e.g., cases vs controls, various levels of blood pressure and stroke, fitness relative to different trait values) that evaluate samples against each other, rather than against an externally derived, a priori theory.  When we rely on statistics, p-value significance tests, probabilities and so on, we are implicitly confessing that we don't in fact know what's going on; all we can get is a kind of shadow of the underlying process, cast by the differences we detect, and we detect them with generic (not to mention subjective) rather than specific criteria.  We've written about these things several times in the past here.

The issue is not just weakly defined terms and study designs. As Freeman Dyson (in "A meeting with Enrico Fermi") wrote in 2004:
In desperation I asked Fermi whether he was not impressed by the agreement between our calculated numbers and his measured numbers. He replied, "How many arbitrary parameters did you use for your calculations?" I thought for a moment about our cut-off procedures and said, "Four." He said, "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk." With that, the conversation was over. . . .
Here, 'parameters' refers to what are more properly called 'free parameters', that is, ones not fixed in advance but estimated from the data.  By contrast, in physics, for example, the speed of light and the gravitational constant are known, fixed values a priori, not re-estimated from each new data set (though data were used to establish those values).  We simply lack such understanding in many areas of science, not just the behavioral sciences.
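A minimal sketch of the free-parameter point (a generic polynomial fit to pure noise, nothing to do with Fermi's actual calculation): give a model enough parameters estimated from the data and it will fit beautifully, whether or not there is anything real to fit.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 10)
    y = rng.normal(size=10)           # pure noise: there is nothing to explain

    for n_params in (2, 5, 10):
        coeffs = np.polyfit(x, y, deg=n_params - 1)   # n_params free parameters
        resid = y - np.polyval(coeffs, x)
        r2 = 1.0 - resid.var() / y.var()
        print("%2d free parameters: in-sample R^2 = %.3f" % (n_params, r2))

The in-sample fit improves as parameters are added and becomes perfect once there are as many parameters as data points, even though the 'data' here are nothing but noise.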

In a sense we are using Ptolemaic tinkering to fit a theory that doesn't really fit, in the absence of a better (e.g., Copernican or Newtonian) theoretical understanding.  The social and behavioral sciences are far behind the considerably more rigorous genetic and evolutionary sciences, at least when the latter are done at their best (which isn't always).  Like the shadows of reality seen in Plato's cave, statistical inference reflects the shadows of the reality we want to understand, but for many cultural and practical reasons we don't recognize that, or don't want to acknowledge it.  The weakness and frangibility of our predictive 'powers' could, if properly understood by the general public, be a threat to our business, and our culture doesn't reward candor when it comes to that business.  The pressures (including from the public media, with their own agenda and interests) necessarily lead to reducing complexity to simpler models, and to claims far beyond what has legitimately been understood.

The problem is not just weakly measured variables or poorly defined terms for, say, outcomes.  Nor is the problem if, when, or that people use methods wrongly.  The problem is that statistical inference is based on a sample, is often retrospective, and is mainly empirical, resting on only rather generic theory.  No matter how well chosen and rigorously defined, in these various areas (unlike much of physics and chemistry) the parameters are estimated from data that are necessarily about the subjects' past, such as their culture, upbringing, or lifestyles; in the absence of adequate formal theory, such findings cannot be used to predict the future with knowable accuracy.  That is because the same conditions can't be repeated, say, decades from now, we don't know what future conditions will be, and so on.
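A toy sketch of that last point (the 'environment' variable here is an arbitrary stand-in for conditions that later change; nothing about any real study is implied): a model fitted retrospectively can be perfectly well calibrated on the data it came from and still be systematically wrong once the unmeasured conditions shift.

    import numpy as np

    rng = np.random.default_rng(1)

    def make_data(n, env):
        """Toy outcome depending on a measured x and an unmeasured environment."""
        x = rng.normal(size=n)
        y = 0.5 * x + env + rng.normal(scale=0.5, size=n)
        return x, y

    # Fit on retrospective data gathered under one set of conditions (env = 0)...
    x_past, y_past = make_data(500, env=0.0)
    slope, intercept = np.polyfit(x_past, y_past, deg=1)

    # ...then predict a future in which the unmeasured conditions have shifted.
    x_future, y_future = make_data(500, env=1.5)
    print("mean prediction error, past data:   %.2f"
          % np.mean(y_past - (slope * x_past + intercept)))
    print("mean prediction error, future data: %.2f"
          % np.mean(y_future - (slope * x_future + intercept)))

Nothing in the fitted model, or in its in-sample diagnostics, signals the coming failure; the information about how conditions will change simply isn't in the data.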

Rather alarmingly, we were recently discussing this with a colleague who works in very physics- and chemistry-rigorous materials science.  She immediately told us that they, too, face problems in data evaluation because of the number of variables they have to deal with, even under what the rest of us would enviously call very well-controlled conditions, where the assumptions of statistics--basically amounting to replicability of some underlying mathematical process--really should apply well.

So the social and related sciences may be far weaker than other fields, and should acknowledge that.  But the rest of us, in various purportedly 'harder' biological, biomedical, and epidemiological sciences, are often not so much better off.  Statistical methods and theory work wonderfully well when their assumptions are closely met.  But there is too much out-of-the-box analytic toolware that lures us into thinking quick and definitive answers are possible.  Those methods never promised that, because what statistics does is account for repeated phenomena that follow the same rules, and in many sciences the phenomena, in their essence, do not follow such rules.
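What 'assumptions closely met' means here can be shown with a toy simulation (pure noise, not modeled on any real study): a test that is well calibrated on independent observations produces far too many 'significant' results once the observations are correlated, which is the kind of structure real biological and social data usually have.

    import numpy as np

    rng = np.random.default_rng(2)

    def false_positive_rate(correlated, n=30, trials=5000, t_crit=2.045):
        """Share of pure-noise datasets that a one-sample t-test (df = 29,
        two-sided alpha = 0.05) declares 'significant'."""
        hits = 0
        for _ in range(trials):
            data = rng.normal(size=n)
            if correlated:
                # smooth the series to induce correlation between neighbouring values
                data = np.convolve(data, np.ones(5) / 5, mode="same")
            t = data.mean() / (data.std(ddof=1) / np.sqrt(n))
            hits += abs(t) > t_crit
        return hits / trials

    print("independent observations:", false_positive_rate(correlated=False))
    print("correlated observations: ", false_positive_rate(correlated=True))

The test isn't being misused in any obvious way; its independence assumption simply isn't true of the second kind of data, and the p-values silently stop meaning what they are taken to mean.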

But the lure of easy-answer statistics, and the understandable lack of deeply better ideas, perpetuate the expensive and misleading games that we are playing in many areas of science.

9 comments:

O. Douglas Jennings said...

So, from your article, my take-away should be: mixed results on repeated psych studies reveal the need for all scientists to avoid easy-answer jumps to conclusions?

Ken Weiss said...

I would say generally, yes. Very much higher standards for claiming findings, running to the news media, and the like. Quiet, humble and slower rather than the rush to judgment. There are many biases we did not even mention in the post, that are all well known. In many ways, in science, less is more.

Manoj Samanta said...

I had a very shocking experience with a psychology paper. The authors made an outrageous claim, and we decided to redo all of their analysis to try to figure out what was going on. We reproduced all of their results from their data and reported methods. That means the study was 'reproducible', right? Well, the major claim was based on only a small sample size. Moreover, removing just one number had a dramatic impact on all the plots the authors used to make their outrageous claims.

In the meantime, the authors surreptitiously replaced their raw data file at NCBI with a new file in which that single number, the one with the dramatic effect on the plots, had been removed. That change happened 12 months after publication, and the authors did not tell anyone.

Now we cannot 'reproduce' the study, but nobody in the field of psychology would listen to our sob story.

M.

Ken Weiss said...

Much of the problem, including the instances of dishonor, is the result of the positions the system puts people in, to keep their jobs. There will always be human error and human mischief, but we have built conditions that favor a much worse problem, which is essentially hasty rush to publish and publicize. We can't be surprised at the result. Finding the truth of the world is difficult enough without having to deal with those sociocultural factors.

O. Douglas Jennings said...

I wonder if this is one reason that often the news media reports the findings of a certain remarkable study and then, some time later, reports the findings of a contradictory study. It can undermine the credibility of science when rushed-to-publish studies are found to be grievously flawed.

Ken Weiss said...

Everyone, boosted by the new authors and the news media, seems to take the latest study as both the definitive one that overthrows the former one, and also as being more or less the true or even final word. There are now several studies of the positive upward bias in reports, especially in the most prominent journals, such that in some ways they are the least reliable. This exaggerates the effect sizes by the 'winner's curse' of luck and report bias, and there is the complementary issue of under-reporting of negative studies or replicate studies. It's a systemic problem in the literal sense: a problem with the nature of the science system as it has become.
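The winner's curse part is easy to see in a toy simulation (arbitrary numbers, not modeled on any particular literature): if only estimates that cross the significance threshold get reported, the reported effects are biased upward, most severely when studies are small.

    import numpy as np

    rng = np.random.default_rng(3)

    true_effect, sigma, n, n_studies = 0.2, 1.0, 30, 20000
    se = sigma / np.sqrt(n)                       # standard error of each study's estimate

    estimates = rng.normal(loc=true_effect, scale=se, size=n_studies)
    published = np.abs(estimates / se) > 1.96     # only 'significant' results get reported

    print("true effect:                   %.2f" % true_effect)
    print("mean of all estimates:         %.2f" % estimates.mean())
    print("mean of 'published' estimates: %.2f" % estimates[published].mean())

In this setup the surviving estimates average more than twice the true effect, which is just the sort of inflation that replication attempts then expose.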

James Goetz said...

Hmm, so don't sell the farm based on psychological research with 95% confidence....

Anonymous said...

My wife did a postdoc in the Stanford lab of a famous (HHMI) professor, and often told me that many results from the lab could not be reproduced. She tried in vain to reproduce the results of one such high-profile Science paper, whose first author is now a professor at a big-name university.

She also noticed that people whose results could not be easily reproduced climbed fast, with many Science/Nature/Cell papers, whereas more thorough people could hardly publish one paper in many years.

Fast forward seven years. Here is one of the fast climbers from the lab -

http://www.marketwatch.com/story/amgen-finds-data-falsified-in-obesity-diabetes-study-featuring-grizzly-bears-2015-09-01-121032242

Ken Weiss said...

There will always be sinners and fallible people whose work has flaws they did not detect. But we are also clearly in a time when there are many reasons that details can be conveniently ignored or caveats buried in 'supplemental information'. The news media and journals, by heating up the system for their own self-interested reasons, exacerbate the problem. It is by now very well known, but there are so many ways to be a part of the system oneself that incentives to doubt, or whistle blow, are minimized. And it's no pleasure to be doubting everything all the time. It is much more appealing to believe the claimed progress is real, or, at least, much more efficient than it really is.