Thursday, May 23, 2013

Who, me? I don't believe in single-gene causation! (or do I?). Part II. Probabilistic causation--what do we mean?

Yesterday we discussed notions of determinism and complex causation, and the denial that every self-respecting scientist tries to make that s/he doesn't believe in deterministic causation.  They don't 'believe' it because, in fact, most of the time and with some possible exceptions in the physical sciences, determinism is at best inaccurate, and to an unknown extent.  Knowing a putatively causal factor doesn't lead to anything close to perfectly accurate prediction (in biology).  Yet of course we believe this is a causal rather than mystic world, and naturally want to identify 'causes', especially of things we wish to avoid so that we can avoid them.  Being only human, accepting often rather unquestioningly the methods of science, and for various other reasons, we tend to promise miracles to those who will pay us to do the research to find them.

It is annoying to some readers of MT that we're constantly harping on the problem of determinism (and exhaustive rather than hypothesis- or theory-driven approaches to inferring causation).  Partly this annoyance is to be expected, because our criticisms do challenge vested interests.  And partly because if people had better ideas (even if one granted that our points are cogent), they might be trying them (if they thought they were fundable in our very competitive funding system). 

But largely, as a scientific community, we don't seem to know what better to do.  The issues about causation in this context seem very deep and we think not enough people are thinking seriously enough about them, for various pragmatic and other reasons. They aren't easy to understand. If we're right, then the critical test or experiment, or truly new concept, just doesn't seem to be 'in the air' the way, say, evolution was in Darwin's time, or relativity in Einstein's. 

Yesterday we described various issues having to do with understanding 'single-gene' diseases like Cystic Fibrosis or familial breast cancer, or smoking-induced lung cancer.  Our point was that causation isn't as simple as 'the' gene for disease, or 'the' environmental risk factor.  Indeed, instead of deterministic billiard-ball like classic cause-effect views, we largely take a sample-based estimation approach, and what we generally find is that (1) multiple factors--genes or environmental--are associated with different proportions of test outcomes in those exposed compared to those not exposed to a given factor, (2) few factors by themselves account for the outcome in every exposed person, (3) multiple different factors are found to be associated with a given outcome, and yet (4) the known factors do not account for all the instances of that outcome.

In the case of genetics, our particular main interest, there are reasons to think that inherited factors cause our biological traits, normal as well as disease, but when we attempt to identify any such factors by our omics approaches (such as screening the entire genome to find them), most of the apparent inherited causation remains unidentified.  So we have at least two different kinds of issue:  First, causation seems probabilistic rather than deterministic, and second, the methods themselves, rather than the causal landscape, affect or even determine what we will infer.  For example, when we use methods of statistical inference, such as significance testing, we essentially define influences that don't reach significance in our data as not being real.
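To make the point about cutoffs concrete, here is a small sketch with invented numbers (the effect sizes, the standard error, and the threshold are all assumptions for illustration, not estimates from any study).  One large effect and twenty small ones are tested against a fixed z-score cutoff; the small ones fail, and so the analysis 'sees' well under half of the total effect that is, by construction, really there.

```python
# Hypothetical true effects: one large factor and twenty small ones.
# All values are invented for illustration.
true_effects = [0.40] + [0.03] * 20
STANDARD_ERROR = 0.05   # assumed identical for every factor
Z_CUTOFF = 5.0          # assumed stringent significance threshold

def significant(effect, se=STANDARD_ERROR, cutoff=Z_CUTOFF):
    """A factor counts as 'real' only if its z-score clears the cutoff."""
    return effect / se >= cutoff

detected = [e for e in true_effects if significant(e)]
fraction_seen = sum(detected) / sum(true_effects)

print(f"{len(detected)} of {len(true_effects)} factors pass the cutoff")
print(f"fraction of total effect accounted for: {fraction_seen:.2f}")
```

The sketch leaves out sampling noise entirely; even with perfectly measured effects, the cutoff alone decides which influences get counted as real.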

Even though we have believable evidence that the collection of a great many minor factors accounts, somehow, for the bulk of the outcomes of interest, we keep investing large amounts of resources in approaches that are not getting us very far.  It is this which, we think, justifies continual attempts to pressure people to think more creatively and at least spend resources in ways that might have a better chance of inspiring somebody.  We try not to go overboard in voicing our comments, but of course that's what a blog is for--a forum for opinion, hopefully rooted in fact.

What the current research approach gets most of the time is not statistical associations between factor and outcome that can then be shown to be deterministic in a billiard-ball mechanistic sense, that is, in which a statistical exploration leads to a 'real' kind of cause.   That does happen and we mentioned some instances yesterday, but it's far from the rule.  The general finding is some sort of fuzzy probability that associates the presence of the putative causal factor--say, a given allele at a gene--and an outcome, such as a disease.  So, the explanation, which we think is fair to characterize as a retreat, is to provide what sounds like a knowing answer that the cause really is 'probabilistic'.  But what does this actually mean, if in fact it isn't just blowing smoke?  It is in this context that, we think, the explanations are in a relevant sense in fact deterministic, even if expressed in terms of probabilities or 'risk'.

And besides, everybody knows that causation is probabilistic!  So there!
Now what on earth does it mean to say that causation is probabilistic?  In technical terms, one can invoke random processes and, often (sometimes not in a self-flattering way) relate this to the apparently inherent probabilistic nature of quantum physics.  In my amateurish way, I'd say that the latter replaced classical billiard-ball notions of causation which were purely deterministic (if you had perfect measurements) with truly probabilistic causation even if you had perfect measurement.

Probably those who invoke the everyone-knows sense of causation being probabilistic are referring to our imperfect identification or measurement of contributing factors.  But then, to make predictions based on genotypes, we need to know the 'true' probability of an outcome given both the genotype and the degree of imperfect measurement.  The latter is estimated from data, but almost always in the ceteris paribus notion--that is, the probability of cancer given a particular BRCA1 mutation is p, averaging over all the range of other genetic and environmental variants that carriers of the observed genetic variant are individually exposed to.

But we must think carefully about where that p comes from.  It is treated exactly as if we were flipping a coin, but with probability p of coming up Heads rather than a probability of 1/2.  For quantitative traits like stature or blood pressure, some probability would be assigned to each possible trait value of a person carrying the observed genetic variant, as if these values were distributed in some fashion, say, like the usual bell-shaped Normal distribution around the mean, or expected effect on the trait for those who carry the variant.  For example, her probability of being 5'3" tall might be .35, but only .12 of being 5'6".
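A toy calculation may help show what such an averaged p hides.  The individual risks below are invented and, in practice, unobservable; the only thing a study would report is their average.

```python
from statistics import mean

# Hypothetical carriers of one variant. Each has a different individual
# risk because of other genes and exposures. Numbers are invented.
individual_risks = [0.05, 0.10, 0.20, 0.60, 0.90, 0.95]

def pooled_probability(risks):
    """The single 'p' a study would report: the average over all backgrounds."""
    return mean(risks)

p = pooled_probability(individual_risks)
spread = max(individual_risks) - min(individual_risks)
print(f"pooled p = {p:.3f}, range of individual risks = {spread:.2f}")
```

Under these assumptions every carrier would be told the same p, even though nobody in the list actually has a risk close to it.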

Sounds fine.  But how well known, or even how real, are such assumed, fixed, underlying probabilities?

Plausibility is not the same as truth
The probabilistic nature of smoking and lung cancer, or even BRCA1 variants and breast cancer, can be accounted for in meaningful ways related to the random nature of mutations in genes.  Of course, we could be wrong even here, since there may be other kinds of causal processes, but at least we don't feel a need to understand every molecular event that may generate a mutation.  We don't need to get into the molecular details or quantum mechanical vagaries to believe that we have a reasonable understanding of the association between BRCA1 mutations and breast cancer.  In such situations, a probabilistic causation makes reasonable sense, even if it doesn't lead to a clean, closed, classically deterministic billiard-ball causal account.

But that is not the usual story.  When we face what seems to be truly complex causation, in which many factors contribute, as in environmental epidemiology or GWAS-like omics studies, we face something different.  What do we do in these situations?  Generally we identify many different possible causal factors, such as points along the genome or dietary constituents, and estimate their individual statistical association with the outcome of interest (diabetes or Alzheimer's disease, etc.).  First, we usually treat them as independent, and do our best to avoid being confused by confounding; for example, we only look at variants along the genome that are themselves not associated with each other by what is called linkage disequilibrium.  Then, we see, factor by factor, whether its variation is statistically associated with the outcome, as described in our third paragraph above.  We base this on a statistical cutoff criterion like a significance test (not our subject today, but again, see Jim Wood's earlier post on MT).

Given a 'significant' result, we then estimate the effect of the factor in probabilistic ways, such as the probability of the outcome if you carry the genetic variant (for quantitative traits, like stature or blood pressure, there are different sorts of probabilistic effect-size estimates).  Well, this certainly is different from billiard-ball determinism, you say.  Isn't it a modern way to deal with causation, and doesn't that make it clear that careful investigators are in fact not asserting determinism or single-gene causation?  What could be clearer?

In fact, we think that, conceptually, what is happening is basically an extension of single-cause, or even basically deterministic thinking.  As one way to see this, if you identify a single risk factor, such as the CFTR gene in which some known variants are very strongly associated with cystic fibrosis, then if you have a case of CF and you find a variant in the gene, you say that the variant caused the disease.  But for many if not most such variants, the attribution is based on an assumption that the gene is causal and that therefore, in this particular instance the variant was the cause.  There are some rather slippery issues, here, but they get clearer (we think) and even much more slippery, when it comes to multiple-factor, including multi-gene, causation.  We will deal with that tomorrow.

Wednesday, May 22, 2013

Who, me? I don't believe in single-gene causation! (or do I?). Part I. What does it mean?

We were told in no uncertain terms the other day that no one believes in single gene causation anymore.  Genetic determinism is passé, and everyone knows that most traits are complex, caused by multiple genes, gene x environment interaction, or if they're really sophisticated, epigenetics or the microbiome. But is this the view that investigators actually follow?  That's not so clear.

We all throw around the word 'complex' as if we actually believe it and perhaps even understand it.  Of course, (nearly) everyone recognizes that some traits are 'complex', meaning that one can't find a single clear-cut deterministic cause, the way being hit on the side of the head with a baseball bat by itself can send one out for the count.  But in fact the hunt for the gene (or, alright, if you insist, genes!) 'for' a trait is still on. You name your trait: cancer, diabetes, IQ, ability to dunk a basketball, or get into Harvard without a Kaplan course, and someone's still looking for the gene that causes it.

It is this push to find genes for traits, despite all sorts of denials, that fuels the GWAS and similar fires.  Caveats notwithstanding (and usually offered just to provide technical escape lest one is wrong), that is what the promised 'personalized genomic medicine' in its various forms and guises is all about.

So let's take a careful, and hopefully even thoughtful look at the idea of genetic causation.  We are personally (it must be obvious!) quite skeptical of what we think are excessive claims of genetic determinism (or, now, microbiomial determinism), but they are still being made, so let's tease a few of them apart.

Single gene causation does exist, at least sometimes (doesn't it?)!
First, though, what do we mean by the word 'causation'?  Generally, we think people mean that gene X or risk factor Y is sufficient to cause trait Z.  But it might also mean that gene X or risk factor Y is a necessary but not sufficient cause of trait Z.  The baseball bat might have caused your concussion, but in fact someone had to swing it.

There are well-documented single risk factors, genetic and otherwise, that everyone accepts 'cause' some disease in a very meaningful sense. Examples are some alleles (variant states) of the CFTR gene and Cystic Fibrosis (CF), BRCA1 and 2 variants and breast cancer, or smoking and lung cancer.  Having the alleles associated with CF or breast cancer, or being a long-time smoker, does put people at high risk of disease.

But, for these and other examples there are usually healthy people walking around with serious mutations in the gene, or heavy smokers who enjoyed their cigarettes well into old age.  Gene X or factor Y isn't sufficient to cause trait Z.  There are also hundreds of genetic variants found in patients that are assumed to be causal, but for elusive reasons (for example, mutations in non-coding regions near to the CFTR coding regions that have no known function).  What is it about gene X that causes trait Z?  We don't know, but gene X looks damaged in this person, so it must be causal.

The CF case is interesting.  This is an ion channel disease.  Ion channels are gated openings on the cell surface that pass sodium, potassium, calcium and other ions into and out of the cell in response to local circumstances.  CF is characterized by abnormal passage of chloride and sodium through ion channels, causing thick viscous mucus and secretions, primarily in the lungs but with involvement of the pancreas as well.  If the ion channel is badly built, or doesn't get to the surface of cells lining various organs like pancreas and lungs, then the cell cannot control its water content, secretion, or absorption, and the person with the malfunctioning channels has CF.  Again, gene X causes trait Z.

But there are gradations in channel malfunction, and gradations in severity of the disease, and we have no way to know how many people are walking around with variants but no actual disease.  Here, we can say that when it happens, CFTR mutations do cause the trait in the usual way.  But what about when there are mutations but no disease?  Gene X doesn't cause the disease after all?  Or disease and none of the known causal mutations?  Wait, we thought gene X caused the disease?  Could we be assuming single gene causation, and looking only at the CFTR gene, rather than at many other aspects of the genome that may affect ion channels in the same cells or, indeed, may cause the trait in a way we could understand if we but identified them?  This is an open question--but it applies to many other purportedly single-gene diseases.  Gene X and some other gene/s, or some environmental factor cause the disease in at least some instances.  Is it simple or isn't it?

The BRCA story is also interesting.  A BRCA1 variant associated with disease does not lead directly to cancer.  Instead, BRCA1 is a gene that detects and repairs genomic mutations in breast (and other) cells.  If you have a dysfunctional BRCA1 genotype, you are at risk of some one breast cell acquiring a set of mutations that don't get detected and repaired.  What causes those mutations?  Some happen when cells divide, so the activity of breast cells affects the rate of mutational accumulation.  Other lifestyle factors do as well (parity, age of childbearing, lactation and apparently things like diet and exercise). And a person with a causal BRCA mutation lives perfectly healthfully for decades, which, if you think in classical Mendelian terms, would not happen if s/he had a 'bad' gene.  BRCA doesn't exactly cause cancer, but it allows it to be caused.  Gene X plus time plus environmental risk factors cause the disease.  Yet we all believe it's a single gene, BRCA1 or 2, that causes cancer.

The obvious non-genetic instance, smoking and lung cancer, is similar but not exactly the same.  Smoking is, among other things, a mutagen: it damages genes.  So one, if not the major, reason for the association is that the mutations caused by smoke can damage genes in lung cells that lead those cells to proliferate out of control.  The reason the risk is probabilistic -- that is, a smoker doesn't have a 100% chance of getting lung cancer -- is that it's impossible to know how many or which mutations a given person's smoking has led to.  In fact, smoking is only an indirect cause, since it is mutant genes in lung cells that, after accumulating in an unlucky way, start the tumor.  Still, in this case, knowing how much a person has smoked can allow one to estimate in some probabilistic way the relative risk of lung cancer due to enough mutations having arisen in at least one lung cell. Still, many who smoke don't get cancer, and many get cancer who don't smoke.  Since smoking, and a few other such risk factors (e.g., exposure to asbestos, and some toxic chemicals) have strong effects,  even if probabilistic, everyone is generally comfortable with thinking of them as causal.  Risk factor Y causes trait Z.  But, in fact, risk factor Y plus time plus unlucky mutations cause trait Z.

Deterministic genotype 'pseudo-single-gene' causation?
More problematic are 'complex' traits that clearly are not due to simple single gene variants--they don't follow patterns of trait appearance in families that would be consistent with simple Mendelian inheritance.  The number of such traits is legion, and is driving the GWAS industry, as we have commented on many times.  They are typically common in the population.  They are complex because we feel--know, really--that many different factors combine somehow to generate the risk.  Most instances are not due to one factor alone, though some may be in the sense of BRCA and CFTR.  So, we do something like genomewide association studies to try to identify all the potentially causal parts of the genome (and, similarly, but less definitively as a rule, lifestyle factors, too).  Here, we'll assume that all of the genomic variants that might contribute are known and can be typed in every person (this is, as everyone knows, far, far from being true at present, if it's even possible).

Advocates can deny any element of single-gene thinking in GWAS reports, where hundreds of loci are claimed to have been found, but these are treated as causal, and major journals are filled to the gills with papers with titles to the effect of "five novel genes for xyz-itis".  This is the slippery slope of simple causation thinking.

If multiple factors contribute, what we know is that most do so only probabilistically in the above senses.  That there are other things at work is rather obvious for the many traits that even in those at high risk don't arise until later or even late in life.  And, in reality, it is nearly always true that cases of the disease are associated with individually unique combinations of the risk factors.  So, your personalized risk is computed as some kind of combination, like the sum, of your estimated risk at each of the putative sites, R = R1 + R2 + R3 + ..., where at each site one allele is given the minimal risk (or, perhaps, zero) and the other allele the risk estimated from the difference in prevalence of the trait in cases vs in controls.  This might be considered the very opposite of single-gene causation, but conceptually it's pretty much the same, because it treats your aggregate as a single kind of risk score, as if it were acting as a unit.  The idea would be completely analogous to the risk associated with a specific variant at the CFTR or BRCA gene.  Your genotype as a whole would be viewed essentially as a single cause.
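A bare-bones sketch of such a score shows how it collapses different genotypes into one number.  The site names and per-site values are hypothetical, and real risk scores are usually weighted and scaled in more elaborate ways than this simple sum.

```python
# Hypothetical per-site risk estimates, in percentage points (invented values).
PER_SITE_RISK = {"site_1": 2, "site_2": 3, "site_3": 5}

def risk_score(risk_alleles_carried, per_site_risk=PER_SITE_RISK):
    """R = R1 + R2 + R3 + ...: add the estimate for each risk allele carried."""
    return sum(per_site_risk[site] for site in risk_alleles_carried)

person_a = {"site_1", "site_2"}   # two small-effect variants
person_b = {"site_3"}             # one different, larger-effect variant

score_a = risk_score(person_a)
score_b = risk_score(person_b)
print(score_a, score_b, score_a == score_b)
```

The two hypothetical people share no risk alleles at all, yet the score treats them as carrying the same single 'cause'.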

These are examples of causal rhetoric.  But these causes are probabilistic.  What does that mean?  It means that you are not 100% protected from getting the trait, nor 100% doomed.  Your fortune is estimated by some number in between.  We call it a probability, estimated either from the presence of one risk-factor or from the fixed set that you inherited.  But what does that probability mean, and how do we arrive at the value and how reliable is it?  Indeed, how often can we not even know how reliable it is?

These are topics for tomorrow.

Tuesday, May 21, 2013

Microbiomes R Us -- another form of science marketing

The microbiome hits the big time 
A piece by food writer/journalist Michael Pollan, "Some of My Best Friends Are Germs", was the cover story of the New York Times magazine on Sunday.  Pollan says the interest he developed in fermented foods while he was writing his latest book -- beer, kimchi, cheeses -- naturally led into an interest in the fermentation that goes on in our large intestines with the help of resident microbes, and this led him to think generally about the interaction between microbes and us.

The current estimate is that microbial cells in and on our body outnumber our own cells ten to one.  The Human Microbiome Project, funded by the NIH as yet another Big Science do-it-all-and-think-about-it-later investment, was launched in 2008 with the goal of sequencing the microbes in the nasal passages, oral cavities, skin, and the gastrointestinal and urogenital tracts of a fairly small sample of men and women.  The project was completed last spring with sequences from samples from more than 240 people (completed means the authors can now go to the press and demand even more money because 'more research is needed' before we understand anything.....).

But, that's not the end.  In a drive to document the microbiome of America, The American Gut Project will sequence your microbiomes for $99.  It's an open source, open access project which allows participants to compare their data with data from people around the world. 

Why?  Because the microbiome is the new genetics.  Forget (our own) genes, the idea is that we are who we are, healthy or sick, because of the microbes we share our bodies with. Well, of course, it's the microbes' genes, so it's really just more genetics.  As Pollan puts it,
To the extent that we are bearers of genetic information, more than 99 percent of it is microbial. And it appears increasingly likely that this “second genome,” as it is sometimes called, exerts an influence on our health as great and possibly even greater than the genes we inherit from our parents.
Our microbiome may make us fat or keep us thin, predispose us to diabetes or heart disease or asthma and allergy. And, the microbiome apparently influences our immune system and trains it in how it responds to the world, and may be responsible for the increase in autoimmune diseases in the West. Indeed, Pollan writes of "an impoverished 'Westernized microbiome'" -- some researchers suggest that the microbiota in our gut should be restored to look more like the microbiomes of people who eat less processed food and take fewer antibiotics (and, by the way, are more likely to die early of infectious diseases than we are). 

But, happily for us, changing our second genome is going to be a lot easier than changing our first one.  We do it all the time, through our diet, the medicines we take, the people (and their microbes) we come into contact with.  Happily for Big Pharma and Big Food, once we know which microbes are best for us, they'll be able to sell you any number of products to reverse these changes and reduce your chances of getting all those diseases geneticists have been trying to find the causes of for so long.  You'll have personalized microbiomalgenomic medicine.  Enough to put another smile on Francis Collins' funds-securing face!

There has been a lot of talk lately about 'fecal transplants' which are just what they sound like -- the transfer of fecal bacteria from healthy people into the colons of unhealthy people, primarily people with Clostridium difficile infections, intestinal infections resistant to antibiotics.  And yes, there is a lot of information on the web for those who want to learn to do this kind of self-medicating at home.  

Overpromising yet?
Pollan (whose qualification for writing this piece is that he is a food journalist whose writing sells well) says a few times that people involved in researching the microbiome don't want to make the same mistake that geneticists did with the Human Genome Project, overpromising the extent to which their work will lead to the cure for everything that ails us.  But, if as Pollan claims, the microbiome community is actually talking about the Grand Unified Theory of Chronic Disease, there's apparently a fine line between hype and overpromising -- not to mention borrowing the line from physics that is used to justify the Large Hadron Collider.  And there are certainly plenty of 'probiotic' options in the grocery store these days, so someone's jumping on the bandwagon. 

All satire aside, the point isn't that we doubt that there's a connection between microbes, health and disease.  Nor do we doubt that there are indeed lots of nuggets of grain in the bin being described.  Instead, it's that these are early days yet, and the whole approach to this project is already sounding too reductionist for words.  Microbiome researchers are in danger of mimicking the genome project not only by overpromising and over-enumerating, but also by reducing everything to their favorite microbe (for 'microbe' we used to read 'gene'). The bugs for ability to play the bass or baseball, score well on IQ tests, or tolerate abuse with equanimity may be right around the corner.

Enough Just-So stories yet?
Indeed, the Just-So storytellers are already out in full force. Why are there complex carbohydrates in human breast milk, if babies can't digest them?  As Pollan puts it, "Evolutionary theory argues that every component of mother's milk should have some value to the developing baby or natural selection would have long ago discarded it as a waste of the mother's precious resources." So, it turns out they are there for a particular gut bacterium that breaks them down and uses them.  
“Mother’s milk, being the only mammalian food shaped by natural selection, is the Rosetta stone for all food,” says Bruce German, a food scientist at the University of California, Davis, who researches milk. “And what it’s telling us is that when natural selection creates a food, it is concerned not just with feeding the child but the child’s gut bugs too.”
And, we evolved this commensal relationship with microbes because they evolve so much faster than we do, so can quickly evolve mechanisms to cope with new kinds of toxins in our environment and so forth.

And, the bacterium that causes ulcers and perhaps some stomach cancers, Helicobacter pylori, is an endangered species, writes Pollan.  But his informants tell him it shouldn't be.  H. pylori also has beneficial roles to play in our stomachs -- preventing acid reflux by regulating acidity, for example, which Pollan suggests they do to render the stomach inhospitable to competing microbes, or regulating levels of an appetite hormone -- and because they do these good things, they should be nurtured rather than killed off.  Why do they do both good and bad?  Well, they do the bad stuff when we're middle-aged or older, so Pollan's informant suggests "this microbe’s evolutionary role might be to help shuffle us off life’s stage once our childbearing years have passed."

Of course, among other curious aspects of this scenario, how something evolved to kill us off once we're no longer contributing genes to the human gene pool is not explained. In order for this to work, the fitness of the microbe that could do this would have to be increased by killing us, its host, and it doesn't work that way.

Stripping it down to the truth
Again, fine, it seems quite likely that our microbiome does make contributions to our health and disease.  That's interesting enough.  For various theoretical reasons, rapidly dividing microbes do present interesting evolutionary challenges, and there's no doubt that we, and our genomes, must respond successfully if we are to persist.  If infection, broadly defined, has early negative effects because of the host's genotype, then selection favoring the bacteria, and likewise selection favoring human resistance can both be strong.  Culture and climate and habits also contribute to this potentially very dynamic evolutionary mix.  Infectious diseases with strong effect on survival, like malaria and HIV and others have clearly demonstrable effects of this kind.

But that is not the same as invoking specific selective stories for complex, ephemeral, varying fluxes of bacteria, which must coexist as well as keep a host alive so they can stay alive.  It's not the same as inventing pat, closed Just-So stories about how this or that effect must have evolved, stories that involve none of the subtleties that we know are applicable, including the range and mobility of humans, modes of transmission, population size and so on.  We have had a difficult, and often unsuccessful, time working out clear-cut examples of natural selection at work in humans, though our evolved defenses against malaria (which is not caused by bacteria but involves many relevant evolutionary issues) may be the best one. 

So why can't researchers, writers, and the rest of us just concentrate on figuring out how and when microbes are harmful or beneficial, without the hyperbole, the suggestion that microbes now do everything that genes did not long ago, and the made-up stories about how this all evolved? 

This whole microbiome thing needs to slow down and let the science catch up.

Monday, May 20, 2013

Retirement harmful to health or... an uncertainty principle?

Years and years:  but who's counting?
Breaking news!  As reported by the BBC ("Retirement Harmful to Health"): "...the chances of becoming ill appear to increase with the length of time spent in retirement."  Even more astonishing, the effect is the same for men and women. 
The study, published by the Institute of Economic Affairs (IEA), a think tank, found that retirement results in a "drastic decline in health" in the medium and long term.
The IEA said the study suggests people should work for longer for health as well as economic reasons.
This is of course just as astonishing as the fact that having more birthdays increases your lifespan (someone must have won a Nobel prize for that discovery! or at least got a headline story in the NY Times Science supplement).

Retirement is, of course, highly correlated with aging, which is, obviously, highly correlated with length of retirement and, of course, aging is highly correlated with ill health.  Further, people still working but already in ill health are more likely to retire than people healthy and still able to work well into old age.  And, since the report considers mental as well as physical health, it's also relevant that people with an ill spouse may be more likely to retire, which may increase their chances of becoming depressed.  So if this study had reached any other conclusions than that retirement is correlated with ill health, that would have been worthy of headlines.

It turns out that the background to the report treats the question in a relatively nuanced way, even if the conclusions are much less nuanced. E.g., from the report:
...evidence suggests that poorer health increases the likelihood of retirement. When looking at health and retirement it is therefore very difficult to separate cause from effect. In addition, a plethora of variables that cannot be observed are likely to bias results in any empirical studies -- and it is difficult to predict the direction of the bias.
Further, "Theoretically, the impact of retirement on health is far from certain."  "Other mechanisms by which retirement can affect health appear equally ambiguous."  "...an observed correlation between retirement and health says nothing about causation."  "Overall, the most methodologically convincing research on the health effects of retirement is rather mixed. This is likely to be due to researchers employing different research strategies and data."

But, they do report, from interview data with 7000 - 9000 people after varying numbers of years of retirement, more self-reported mental illness, more prescription drug usage, more diagnosed physical problems, and so on among retired people than those still working, and finally conclude that retirement is harmful to health.  Indeed, the report is titled "Work Longer, Live Healthier" so there's no missing their point.

It's true that other studies have found positive effects of retirement, but the authors write, "The results have been cross-checked against the methodologies used in earlier research studies, and it has been found that the positive impact of retirement on health found in earlier studies is, at the very least, partly due to shortcomings in that research."

Well, in fact it's hard to know how to do a study that would properly answer the retirement/health question.  When poor health can 'cause' retirement and retirement can (says this report) 'cause' poor health, how do cause and effect get teased out? 

A study comparing cases and controls would be problematic; can 'controls' who are still working be assumed to match retired 'cases' if the variable being measured (health) can affect whether someone works or retires? And, aging and ill health are already highly correlated, as are retirement and aging.  So disentangling cause from effect is inherently difficult.  
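To make the reverse-causation worry concrete, here is a minimal simulation sketch. It builds a made-up world in which ill health makes people more likely to retire, but retirement itself has no effect at all on health. The 20% illness rate and the 60%/30% retirement probabilities are assumptions chosen purely for illustration; they do not come from the IEA report.

```python
import random

def simulate(n=100_000, seed=1):
    """Ill health raises the chance of retiring; retirement never changes health."""
    rng = random.Random(seed)
    counts = {"retired": [0, 0], "working": [0, 0]}  # [people, people in ill health]
    for _ in range(n):
        ill = rng.random() < 0.20            # assumed: 20% are in ill health
        p_retire = 0.60 if ill else 0.30     # assumed: the ill retire twice as often
        status = "retired" if rng.random() < p_retire else "working"
        counts[status][0] += 1
        counts[status][1] += ill
    # Proportion in ill health within each group
    return {status: n_ill / total for status, (total, n_ill) in counts.items()}

rates = simulate()
print(f"ill among retired: {rates['retired']:.3f}")
print(f"ill among working: {rates['working']:.3f}")
```

Under these assumptions the retired group should come out with roughly a third in ill health versus roughly an eighth of those still working, even though retirement harmed nobody. A cross-sectional comparison alone cannot tell this world apart from one in which retirement really is harmful.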

And several other things.
The risks and histories clearly involve cultural and lifestyle factors.  These change all the time, and indeed are affected by stories like the current one that, in itself, might lead readers of the story not to retire, because they'll think they're committing suicide to do it.  And what about all sorts of other factors like smoking history, involvement in wars and economic crashes and their harmful effects, and who knows what else that affects our health and our attitudes on a daily basis?  One thing is certain: our future exposure to those things, which would affect health after retirement, is uncertain in principle.  Like any study that purports to project results into an unknown future environment, this one cannot have easily knowable implications beyond, at best, the immediate future.

And, if you have a relative who died early from, say, cancer or a coronary, you may be driven to retire early to have a chance at enjoying life before your number comes up, if you think your relative's experience reflects your own vulnerabilities.  Or conversely, if your father lived to 110, you might think you will too, so why not sock away a few extra years' pension funds, publish some more astonishing research papers, or whatever.  That is, what may be irrelevant or at least unmeasured factors can confound this type of study.  Even knowing that you don't have to retire may affect what you decide.  So may seeing what happens to your peers as they drop out of the office and/or off their perch.

The analogy with quantum mechanics: the Heisenberg principle
In a sense, what we see here, at least potentially, is something like the phenomenon in quantum mechanics in which an electron or photon exists as a wave, until you measure it.  Then, it collapses to a point, but because you've measured, say, its location, you can no longer measure its momentum.  The reason is that the very act of observing and measuring it changes its behavior.  This, loosely speaking, is the Heisenberg uncertainty principle.

Here, too, there is a quite similar-seeming uncertainty principle:  the very act of doing the study and publishing its results will affect the future course of the very people whose future you're trying to predict with your data.  The relevant behavior of the people you studied, and others who read the research, is affected by the fact that you did the study.  How that alters behavior is uncertain and basically not knowable.

But, the bad news that retirement is harmful to people's health is good news for governments looking to save money. Raising the age at which people can begin to draw their pensions is one way to save a lot of money, because people will contribute to retirement funds for more years and draw it for fewer. We just wish the evidence were sturdier.

Friday, May 17, 2013

Of mice and men: Genes, environment, and whatever

Is the nature/nurture question finally solved? Are we who we are as a consequence of luck?  A paper in Science, Freund et al., "Emergence of Individuality in Genetically Identical Mice," assesses the effects of the nonshared environment on neural and behavioral development in 40 inbred female mice living in an enriched environment compared with genetically identical mice in a non-enriched environment.  Does the plasticity of the brain in response to environmental stimuli go a long way toward explaining who we are? 

Freund et al. write:
...the emergence of experience-based individual differences within groups of genetically identical animals exposed to the same enriched environment has rarely been addressed. We used a large group of animals and a particularly complex environment to capture the emergence of individual differences in brain and behavior over time. We used exploration as a marker of behavioral development, and adult neurogenesis in the hippocampus as a marker for continued brain development.
  This figure from the paper gives a schematic of the experimental set-up.

Experimental setup and effects on body and brain weight.
(A) Schematic illustration of the large enrichment enclosure housing 40 mice including RFID antenna positions (shown as red rings). Positions of levels, water sources, nesting boxes, and connecting tubes are drawn to scale. (Inset) Schematic illustration of animal tracking; an RFID passive integrated transponder (PIT) is implanted in mouse’s neck. The electromagnetic field issued by the antenna induces the PIT to emit the number identifying the animal. This information is then picked up by the antenna and stored into a database together with spatial and temporal annotations. (B) Experimental time line. (C) Body weight development: weights (in grams) of CTR (blue) and ENR (red) mice at the beginning and end of the experiment. (D) Brain weights at perfusion (in grams). The difference in variance between CTR and ENR missed conventional statistical significance at P = 0.057. Source: Freund et al., 2013, Science.
After 3 months, mice in the enriched environment were heavier, with more variability in brain and body size than the control mice (though, there were only 8 control mice) and more neuronal connections were made in the hippocampus of the enriched adult mice than in the brains of the controls. "This finding supports the idea that the key function of adult neurogenesis is to shape hippocampal connectivity according to individual needs and thereby to improve adaptability over the life course and to provide evolutionary advantage."

While the authors point out a number of ways in which these mice may not in fact be genetically identical, primarily, they suggest, because of epigenetic changes due to such variables as position in the uterus, maternal disease, nutrition and interactions with the mother, maternal imprinting and so on (the 40 mice were randomly picked from different litters of the same strain of inbred mice to minimize these kinds of effects), they still consider them to be "identical."

But...wait a second!  Are they identical?
Much as we like the authors' general conclusion, because it moves away from the excessive level of imputed genetic determinism of our traits, we must add a word of caution.  These mice were actually not even genetically identical. Every time a cell divides, mutation is likely to happen. Based on some estimates from various kinds of data, that can be about 150 changes per cell division.

Such mutational variation accumulates from conception to an animal's sperm or egg production, and is thus transmitted across generations.  This is true, of course, even in inbred laboratory animals.  So even in a litter of 'identical' pup embryos, there is genetic variation.  But the picture is even more complex.

With every cell division during a mouse's (or your) lifetime, mutations occur.  If we assume that the rate is roughly as above, and given that even a mouse has billions of cells (and you have trillions), there is a lot of genetic variation within an organism.  Once such a somatic (body cell rather than germ cell) mutation occurs, the cell's daughter cells inherit the change when it divides, and so do all their descendants throughout the future life of the organism.  Thus, the earlier during embryological development that a somatic mutation occurs, the larger the tree of cellular descent--the more organs or larger the part of a developing organ--that will inherit the change.
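The point about timing can be put in numbers with a toy calculation. The sketch below assumes an idealized lineage in which every cell divides in lockstep and none die, so that after n rounds of division there are 2^n cells; real development is nothing like this tidy. The function name and the choice of 30 rounds (about a billion cells) are ours, for illustration only.

```python
def carriers(division_of_mutation, total_divisions):
    """Number of final cells inheriting a mutation that arose in a single cell
    just after the given round of division (idealized synchronous doubling)."""
    if not 0 <= division_of_mutation <= total_divisions:
        raise ValueError("mutation must arise between round 0 and the last round")
    return 2 ** (total_divisions - division_of_mutation)

total_divisions = 30  # assumed: 2**30 is roughly a billion cells
final_cells = 2 ** total_divisions
for k in (1, 10, 20, 30):
    n = carriers(k, total_divisions)
    print(f"mutation after division {k:2d}: {n:>11,d} cells ({n / final_cells:.2e} of the body)")
```

In this idealized tree a mutation arising after the first division ends up in half the body, while one arising in the last round sits in a single cell. Either could matter, or not, depending on where in the genome and in which tissue it falls.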

Now most of these by far will occur in unimportant areas of the genome--not in any actual genes at all, or in genes that aren't used in the tissues in which the mutation has occurred.  But this cannot be assumed as a general fact and one has to consider the nearly inevitable likelihood that whatever trait is being studied, the animals, even inbred lab animals, are not genetically identical in respect to it.

The challenge here is to identify the variation and figure out if it matters to the trait.  To do that one needs to do tissue-specific, if not detailed individual cell-specific genome sequencing, and even then one needs to identify gene expression at the cell level--and maybe (probably) at different times during development or environmental exposure changes--to attempt to identify those genomic elements whose variation might be involved.   This is essentially impossible as a general rule, and certainly only under some unusual circumstances would it even be worth undertaking.  And, of course, you'd kill the animal in the process, so its behavior would be (only) in the mind of the investigator!  Even to justify the cost and effort, without this minor mortal stumbling block, one would once again have to believe that slicing and dicing the genome, cell by cell, minute by minute, will explain complex traits.

Of mice and men....
Is there really such variation in inbred animals?  Well, we used ForSim, a forward evolutionary simulation program developed in my lab by me and Brian Lambert, to simulate a mouse experiment.  Two independent lines of mice were simulated, using many genes and a lot of DNA to get enough statistical stability in the results.  After generating a normal level of variation, as seen in wild animals and people, we simulated selection of some trait in the opposite direction (small trait values in one, large in the other strain), and then inbreeding for a large number of generations (around 200) with the population kept small, roughly as inbred lines are developed.  Then, we examined these simulated animals for sequence variation and we found a substantial amount of it: roughly as many different sites were varying among the animals as had been fixed by selection and inbreeding.  Yet in the usual mapping and experimental approaches such variation is assumed not to exist.  But it does.

Hey, are we against genes or for genes, after all??
Given our predilection for criticizing what we believe are excessive claims of genetic causation, one has to think carefully and avoid oversimplifying.  The argument cuts all ways.  We must therefore also raise similar cautionary questions about excessive dismissal of genomic effects!

Today, our point is that even here, where traits vary to a surprising extent even in putatively identical animals, it cannot really all be attributed to 'chance' (unless that includes mutations), nor to learning, nor environment.  The causal mix is inextricably complex under widespread if not most conditions.

It is for reasons like those this study reveals, in what is otherwise a clear demonstration, that we write so often to try to temper the enthusiasm for genetically deterministic thinking, and all the more for gene-based predictions of individuals' futures.  But genes do vary, and they vary subtly.  There is no one crystal ball, not even for mice!

Thursday, May 16, 2013

Breast cancer, and probabilities, in the news (again)

Angelina Jolie's New York Times editorial on her decision to have bilateral mastectomies when she tested positive for BRCA1 mutations associated with high risk of breast cancer brings up a number of issues.  We have previously commented on the problem of assessing competing risks, in the context of debates about screening, detection, and risk associated with breast cancer.  This is so common, and so serious, a problem that it naturally draws a lot of attention.   Just as naturally, it involves many sorts of probabilities: does a given test detect actual cancer?  Does every cancer need to be detected, or will some go away spontaneously?  And so on.

Under these kinds of conditions, the balance between costs, risks, important detection,  treatment options and the like all involve probabilities.  For example, Jolie writes that she was given an 87% probability of eventually having cancer.  Projecting risks--your net future in regard to this disease--is of vital interest, and because most of the probabilities involved are very inaccurately known, it is even a problem to know whether or to what extent to believe the probability estimates we already have. That is also why the same thing seems to require study over and over and over, without clear results.

In the end, for most women, and their physicians, whether they know it or not, they and their lives and health are highly dependent on the statistical aspects of studies to estimate and assess a wealth of probabilities.  In such cases, there are important, often wildly misunderstood or misapplied statistical approaches, and they often yield probabilities whose accuracy is not high or is even unknown.  Yet the point of statistical and probabilistic analysis is to make decisions about the state of the world.  If that is important, then how we interpret the results is important, but so, to a fundamental extent, is how we come to our results and interpretations in the first place.

This is so serious and widespread an issue in science that we posted a very fine 2-page primer on statistical design and the basic nature of probabilistic inference that was written for MT by our very knowledgeable colleague, Jim Wood, here in our Department.

And yet, the Jolie story shows that under some circumstances, it is unnecessary and perhaps even wrong to worry about the details.  She decided to undergo double mastectomy as a preventive against breast cancer.  That is, she decided that she in a sense already had breast cancer, in that it was a ticking time bomb in her genome.  Prevention being better than cure, she made that awesomely serious decision.

Jolie discovered that she carried one of the known variants in the BRCA1 gene that confer very high risk of breast (and ovarian) cancer.   She said she was told that the risk she'd get breast cancer was over 85%.  Now, there are actually major uncertainties about even this risk, as different cohorts of women (that is, born in different places or times) have very different risks as estimated by retrospective studies of women known to carry, compared to those known not to carry, such variants.  For some cohorts, the risk by age 60 or so has been estimated at only about half that in other cohorts.

Yet, here it doesn't matter and there is no need to worry about statistical finery or even the specific risk estimate.  Why?  Because under all the established risk scenarios, these mutations confer extremely high, potentially lethal risk.  It matters not at all, at least to most of us, whether a risk is 90% or 50%, if the risk is avoidable and the consequences dire.

Further, the BRCA1 gene function is basically known: it relates to corrections of DNA copying mistakes.  If miscopied DNA is not repaired, the risk is that in some breast cell a mutation will arise that leads the cell to be transformed into a cancer cell, that then proliferates and spreads.  So here we have not just estimates of very high risks, whatever they are, but also a mechanism.  And we have replicated findings in different populations.  So here, that these specific variants are truly risk factors themselves, rather than just being associated with some unmeasured factor, is pretty convincing--convincing enough to bet your life on it.

The nature of epidemiological risks
The cohort dependence of the BRCA1 risk for those with the clear-cut, well-studied variants, raises a very important point, one we've mentioned before.  Risks are expressed in terms of the likelihood of future events.  But how do we know what those are?  The answer is that we generally only know them from the past.  That is, we do retrospective studies, that compare those with or without exposure to a putative risk factor (here, a genetic variant) and see what happened to them.  We estimate how much more happened to those with, compared to those without, exposure to the risk factor.  But how can we predict future risk from such data?  The answer is basically an assumption that might be called uniformitarianism (a term related to the history of geology and that led Darwin to his insight about evolution): we assume that in all relevant ways, the future will be like the past.

That means that we assume that exposure to the same risk factors in the future will have the same effects as exposure did in the past (which we discovered from our sample of cases and controls, etc.).  But this assumes that we measured all the relevant factors in the past and, much more importantly, it assumes that people with the genetic risk factor will be exposed to the same other factors in the future to an extent that justifies our uniformitarian extrapolation.

However, even if that were the case, we do not understand the many factors well enough and cannot, even in principle, know what exposures will be like in the future.  We simply do not know what our lifestyles will be like.  So, we have no way to make accurate risk  predictions.  It is, like the parts of the universe beyond which light cannot get here for us to see, literally beyond our reach.

This is why BRCA1 variants are 'lucky', in that whatever happens in regard to future lifestyles, there is no known scenario in which these variants would not also seem to confer very high risk.  The same cannot be said of the vast majority of genetic risk factors that are known today.  For them, risk estimates such as those offered by various genome-testing companies, or promised by NIH's drive for 'personalized genomic medicine', are misleading--to an unknown extent.

It needs to be pointed out in this context that there are hundreds of other mutational variants in the BRCA1 gene (and in a handful of others, one of which is BRCA2) that are so rare, or were found only in patients' tumor cells, that we really cannot legitimately attribute causation to them.  Indeed, if they are too rare (as most are) we cannot apply statistical tests to even estimate risk.  All we can do is assume that the gene is relevant and therefore that the variant we find is causal.  Those variants are listed in disease-gene data bases as if they are causal, but that verges on simply being circular: assume a gene is causal and then conclude that the variant in that gene is therefore a cause.  That's bad reasoning.

Again, even here there are environmental (that is, non-genetic) risk factors that are not well established that may make an even larger proportional difference in risk, so that even if these various mutations are in fact causal in some way, that may depend entirely on the environmental context.  If so, that way is highly probabilistic and much farther from certain than the known variants in these genes. Or, some may be exceedingly dangerous, but not statistically demonstrable in the sense of probability that Jim Wood posted about in his excellent primer and discussion yesterday.

In fact, even here one can ask why the same BRCA1 mutations do not cause comparably elevated risks for any and all tissues in the body.  Such associations are generally low and not well established.  So here, the mechanism that seems to be known (DNA repair) should predict--should lead to a prior expectation of--high cancer risk in any tissue in which the gene's expressed.  In a standard 'Bayesian' analysis, the lack of strong effect in other tissues could actually undermine our confidence in our causal expectation for breast cancer itself.  Perhaps explanations for this exist, but we don't know of them.

Unfortunately, for most women there is no pre-smoking gun to guide what here are preventive decisions.  So while it's very unlucky to have inherited such a variant as Jolie did, in a strange sense she was very lucky.  At least she knew.  About 9-10 percent of women in developed countries (that is, where this has been studied) are at risk for breast cancer.  A close friend of ours has just had the same kind of operation as Jolie, but after a tumor was already found, and without carrying the known risk-variants.  Fortunately, although as we noted in our earlier post on this subject (link given above), the story is not entirely rosy, at least there are treatments that can be effective, even after the cancer has already occurred.

Everyone probably knows people who have been affected by breast cancer, and for many it's in their own families.  But unlike those with the clear-cut variants, they must face the kinds of exquisitely difficult decisions, based on very poorly understood, or inaccurate to unknown extent, competing probabilities.  For them, and for most of us in regard to the various disease time bombs silently ticking away inside us, the fine points of statistical analysis really are matters of life and death.  And they are fine points that nobody really understands.....and those who claim to are being misleading.

Wednesday, May 15, 2013

Let's Abandon Significance Tests

By his own admission, this is not only the first blog post Jim Wood has ever written, it's also the first one he's ever read. We're hoping not the last of either, given this debut. Jim is a biological anthropologist and demographer in the Penn State Dept of Anthropology, a man of many interests. His statistical knowledge is deep and of course informs his every academic endeavor, from endocrinology to infectious disease to household agriculture and beyond. We think every student should print this post and carry it around with them everywhere.  Well, anyone.
------------------------

Ronald Aylmer Fisher
   (1890-1962)


By Jim Wood

It’s time we killed off NHST.
 

NHST (also derisively called “the intro to stats method”) stands for Null Hypothesis Significance Testing, sometimes known as the Neyman-Pearson (N-P) approach after its inventors, Jerzy Neyman and Egon Pearson (son of the famous Karl). There is also an earlier, looser, slightly less vexing version called the Fisherian approach (after the even more famous R. A. Fisher), but most researchers seem to adopt the N-P form of NHST, at least implicitly – or rather some strange and logically incoherent hybrid of the two approaches. Whichever you prefer, they both have very weak philosophical credentials, and a growing number of statisticians, as well as working scientists who care about epistemology, are calling – loudly and frequently – for their abandonment. Nonetheless, whenever my colleagues and I submit a manuscript or grant proposal that says we’re not going to do significance tests – and for the following principled reasons – we always get at least one reviewer or editor telling us that we’re not doing real science. The demand by scientific journals for “significant” results has led over time to a substantial publication bias in favor of Type I errors, resulting in a literature that one statistician has called a “junkyard” of unwarranted conclusions (Longford, 2005).


Jerzy Neyman
(1894-1981)
Let me start this critique by taking the N-P framework on faith. We want to test some theoretical model. To do so, we need to translate it into a statistical hypothesis, even if the model doesn’t really lend itself to hypothesis formulation (as, I would argue, is often the case in population biology, including population genetics and demography). Typically, the hypothesis says that some predictor variable of theoretical interest (the so-called “independent” variable) has an effect on some outcome variable (the “dependent” variable) of equal interest. To test this proposition we posit a null hypothesis of no effect, to which our preferred hypothesis is an alternative – sometimes the alternative, but not necessarily. We want to test the null hypothesis against some data; more precisely, we want to compute the probability that the data (or data even less consistent with the null) could have been observed in a random sample of a given size if the null hypothesis were true. (Never mind whether anyone in his or her right mind would believe in the null hypothesis in the first place or, if pressed on the matter, would argue that it was worth testing on its own merits.)   

Egon Pearson
(1895-1980)
Now we presumably have in hand a batch of data from a simple random sample drawn from a comprehensive sample frame – i.e. from a completely-known and well-characterized population (these latter stipulations are important and I return to them below). Before we do the test, we need to make two decisions that are absolutely essential to the N-P approach. First, we need to preset a so-called α value for the largest probability of making a Type I error (rejecting the null when it’s true) that we’re willing to consider consistent with a rejection of the null. Although we can set α at any value we please, we almost inevitably choose 0.05 or 0.01 or (if we’re really cautious) 0.001 or (if we’re happy-go-lucky) maybe 0.10. Why one of these values? Because we have five fingers – or ten if we count both hands. It really doesn’t go any deeper than that. Let’s face it, we choose one of these values because other people would look at us funny if we didn’t. If we choose α = 0.10, a lot of them will look at us funny anyway.

Suppose, then, we set α = 0.05, the usual crowd-pleaser. The next decision we have to make is to set a β value for the largest probability of committing a Type II error (accepting the null when it’s not true) that we can tolerate. The quantity (1 – β) is known as the power of the test, conventionally interpreted as the likelihood of rejecting a false null given the size of our sample and our preselected value of α. (By the way, don’t worry if you neglect to preset β because, heck, almost no one else bothers to – so it must not matter, right?) Now make some assumptions about how the variables are distributed in the population, e.g. that they’re normal random variates, and you’re ready to go.

So we do our test and we get p = 0.06 for the predictor variable we’re interested in. Damn. According to the iron law of α = 0.05 as laid down by Neyman and Pearson, we must accept the null hypothesis and reject any alternative, including our beloved one – which basically means that this paper is not going to get published. Or suppose we happen to get p = 0.04. Ha! We beat α, we get to reject the null, and that allows us to claim that the data support the alternative, i.e. the hypothesis we liked in the first place. We have achieved statistical significance! Why? Because God loves 0.04 and hates 0.06, two numbers that might otherwise seem to be very nearly indistinguishable from each other. So let’s go ahead and write up a triumphant manuscript for publication. 
Significance is a useful means toward personal ends in the advance of science – status and widely distributed publications, a big laboratory, a staff of research assistants, a reduction in teaching load, a better salary, the finer wines of Bordeaux…. [S]tatistical significance is the way to achieve these. Design experiment. Then calculate statistical significance. Publish articles showing “significant” results. Enjoy promotion. (Ziliak and McCloskey, 2008: 32)
This account may sound like a crude and cynical caricature, but it’s not – it’s the Neyman-Pearson version of NHST. The Fisherian version gives you a bit more leeway in how to interpret p, but it was Fisher who first suggested p less than or equal to 0.05 as a universal standard of truth. Either way, it is the established and largely unquestioned basis for assessing hypothesized effects. A p of 0.05 (or whatever value of α you prefer) divides the universe into truth or falsehood. It is the main – usually the sole – criterion for evaluating scientific hypotheses throughout most of the biological and behavioral sciences.

What are we to make of this logic? First and most obviously, there is the strange practice of using a fixed, inflexible, and totally arbitrary α value such as 0.05 to answer any kind of interesting scientific question. To my mind, 0.051 and 0.049 (for example) are pretty much identical – at least I have no idea how to make sense of such a tiny difference in probabilities. And yet one value leads us to accept one version of reality and the other an entirely different one.

To quote Kempthorne (1971: 490):
To turn to the case of using accept-reject rules for the evaluation of data, … it seems clear that it is not possible to choose an α beforehand. To do so, for example, at 0.05 leads to all the doubts that most scientists feel. One is led to the untenable position that one’s conclusion is of one nature if a statistic t, say, is 2.30 and one of a radically different nature if t equals 2.31. No scientist will buy this unless he has been brainwashed and it is unfortunate that one has to accept as fact that many scientists have been brainwashed.
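Just how small is the gap that the threshold turns into a verdict? As a rough sketch, assume a two-sided test whose statistic is standard normal under the null (an assumption made here for simplicity; a t statistic with many degrees of freedom behaves similarly). The helper name is ours. We can then ask which test statistics correspond to p = 0.049 and p = 0.051:

```python
from statistics import NormalDist

def z_for_two_sided_p(p):
    """Return the |z| statistic whose two-sided standard-normal tail area equals p."""
    return NormalDist().inv_cdf(1 - p / 2)

z_049 = z_for_two_sided_p(0.049)
z_051 = z_for_two_sided_p(0.051)
print(f"p = 0.049 -> z = {z_049:.3f}")
print(f"p = 0.051 -> z = {z_051:.3f}")
print(f"difference in z: {z_049 - z_051:.3f}")
```

The two statistics differ by less than 0.02 standard errors, far smaller than the noise in almost any real measurement, and yet they fall on opposite sides of the conventional line.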

Think Kempthorne’s being hyperbolic in that last sentence? Nelson et al. (1986) did a survey of active researchers in psychology to ascertain their confidence in non-null hypotheses based on reported p values and discovered a sharp cliff effect (an abrupt change in confidence) at p = 0.05, despite the fact that p values change continuously across their whole range (a smaller cliff was found at p = 0.10). In response, Gigerenzer (2004: 590) lamented, “If psychologists are so smart, why are they so confused? Why is statistics carried out like compulsive hand washing?”

But now suppose we’ve learned our lesson: and so, chastened, we abandon our arbitrary threshold α value and look instead at the exact p value associated with our predictor variable, as many writers have advocated. And let’s suppose that it is impressively low, say p = 0.00073. We conclude, correctly, that if the null hypothesis were true (which we never really believed in the first place) then the data we actually obtained in our sample would have been pretty unlikely. So, following standard practice, we conclude that the probability that the null hypothesis is true is only 0.00073. Right? Wrong. We have confused the probability of the data if you are given the hypothesis,  P(Data|H0), which is p, with its inverse probability P(H0|Data), the probability of the hypothesis if you are given the data, which is something else entirely. Ironically, we can compute the inverse probability from the original probability – but only if we adopt a Bayesian approach that allows for “subjective” probabilities. That approach says that you begin the study with some prior belief (expressed as a probability) in a given hypothesis, and adjust that in light of your new data.
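A small worked example may help show why the two probabilities differ. The sketch below applies Bayes's theorem to a null and a single alternative hypothesis. All the inputs are hypothetical: strictly, a p value is a tail area rather than the probability of the exact data, so we simply pretend the data have probability 0.00073 under the null and, as a further assumption, 0.01 under the alternative.

```python
def posterior_null(p_data_given_h0, p_data_given_h1, prior_h0):
    """P(H0 | Data) by Bayes's theorem, with H1 as the only alternative."""
    prior_h1 = 1 - prior_h0
    joint_h0 = p_data_given_h0 * prior_h0
    joint_h1 = p_data_given_h1 * prior_h1
    return joint_h0 / (joint_h0 + joint_h1)

# Hypothetical inputs: the same data, judged from two different prior beliefs
for prior in (0.5, 0.9):
    post = posterior_null(0.00073, 0.01, prior)
    print(f"prior P(H0) = {prior:.1f} -> P(H0 | Data) = {post:.3f}")
```

With these made-up numbers the probability of the null given the data is nowhere near 0.00073; it depends on the prior and on how well the alternative predicts the data, neither of which appears anywhere in the p value.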

Alas, the whole NHST framework is by definition frequentist (that means it interprets your results as if you could do the same study countless times and your data are but one such realization)  and does not permit the inversion of probabilities, which can only be done by invoking that pesky Bayes’s theorem that drives frequentists nuts. In the frequentist worldview, the null hypothesis is either true or false, period; it cannot have an intermediate probability assigned to it. Which, of course, means that 1 – P(H0|Data), the probability that the alternative hypothesis is correct, is also undefined. In other words, if we do NHST, we have no warrant to conclude that either the null or the alternative hypothesis is true or false, or even likely or unlikely for that matter. To quote Jacob Cohen (1994), “The earth is round (p < .05).” Think about it.

(And what if our preferred alternative hypothesis is not the one and only possible alternative hypothesis? Then even if we could disprove the null, it would tell us nothing about the support provided by the data for our particular pet hypothesis. It would only show that some alternative is the correct one.)

But all this is moot. The calculation of p values assumes that we have drawn a simple random sample (SRS) from a population whose members are known with exactitude (i.e. from a comprehensive and non-redundant sample frame). There are corrections for certain kinds of deviations from SRS such as stratified sampling and cluster sampling, but these still assume an equal-probability random sampling method. This condition is almost never met in real-world research, including, God knows, my own. It’s not even met in experimental research – especially experiments on humans, which by moral necessity involve self-selection. In addition, the conventional interpretation of p values assumes that random sampling error associated with a finite sample size is the only source of error in our analysis, thus ignoring measurement error, various kinds of selection bias, model-specification error, etc., which together may greatly outweigh pure sampling error.

And don’t even get me started on the multiple-test problem, which can lead to completely erroneous estimates of the attained “significance” level of the test we finally decide to go with. This problem can get completely out of hand if any amount of exploratory analysis has been done. (Does anyone keep careful track of the number of preliminary analyses that are run in the course of, say, model development? I don’t.) As a result, the p values dutifully cranked out by statistical software packages are, to put it bluntly, wrong.
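A rough sense of how fast the multiple-test problem grows can be had from a short sketch. It assumes that every null hypothesis is true and that the tests are independent, which real exploratory analyses rarely are; the function names and the choice of 20 tests are hypothetical.

```python
import random


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one 'significant' result among n_tests
    independent tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_familywise_error(alpha, n_tests, n_studies=20000, seed=0):
    """Monte Carlo check: under a true null, a p value is uniform on (0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # One simulated 'study' runs n_tests tests and reports any hit.
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(familywise_error_rate(0.05, k), 3),
              round(simulate_familywise_error(0.05, k), 3))
```

With 20 such tests at a nominal 0.05 level, the chance of at least one spurious “finding” is closer to two in three than to one in twenty.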

One final technical point: I mentioned above that almost no one sets a β value for their analysis, despite the fact that β determines how large a sample you’re going to need to meet your goal of rejecting the null hypothesis before you even go out and collect your data. Does it make any difference? Well, one survey calculated that the median (unreported) power of a large number of nonexperimental studies was about 0.48 (Maxwell, 2004). In other words, when it comes to accepting or rejecting the null hypothesis you might as well flip a coin.
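The link between power and sample size can be sketched with a normal approximation for a two-group comparison of means. This is an illustration under stated assumptions, not the method used in the survey cited above: it takes power to be 1 – β (the usual convention), uses a standardized effect size with equal group sizes, ignores the negligible far tail of the two-sided test, and the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z test."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect is real.
    noncentrality = effect_size * sqrt(n_per_group / 2)
    return STD_NORMAL.cdf(noncentrality - z_crit)


def n_per_group_for_power(effect_size, power=0.8, alpha=0.05):
    """Smallest per-group sample size that reaches the requested power."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # Hypothetical medium-sized standardized effect of 0.5.
    print(n_per_group_for_power(0.5))
    print(round(approx_power(0.5, 30), 2))
```

Running the numbers before collecting data shows how many subjects a given power demands, and how quickly power sinks toward a coin flip when the sample falls short of that.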

And one final philosophical point: what do we really mean when we say that a finding is “statistically significant”? We mean that an effect appears, according to some arbitrary standard such as p < 0.05, to exist. It does not tell us how big or how important the effect is. Statistical significance most emphatically is not the same as scientific or clinical significance. So why call it “significance” at all? With due deference to R. A. Fisher, who first came up with this odd and profoundly misleading bit of jargon, I suggest that the term “statistical significance” has been so corrupted by bad usage that it ought to be banished from the scientific literature.
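The gap between statistical and scientific significance is easy to show numerically. The sketch below uses a normal approximation for a two-group comparison and treats the observed standardized effect as given; both scenarios are hypothetical, as is the function name.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_size, n_per_group):
    """Approximate two-sided p value for an observed standardized
    mean difference between two equal-sized groups (z test)."""
    z = effect_size * sqrt(n_per_group / 2)
    return 2 * NormalDist().cdf(-abs(z))


# Hypothetical scenario 1: a trivially small effect in a huge sample.
trivial_effect_huge_sample = two_sided_p(0.02, 100_000)
# Hypothetical scenario 2: a large effect in a very small sample.
large_effect_small_sample = two_sided_p(0.8, 10)

print(f"effect 0.02, n = 100000 per group: p = {trivial_effect_huge_sample:.2g}")
print(f"effect 0.80, n = 10 per group:     p = {large_effect_small_sample:.2g}")
```

In this sketch the negligible effect clears the conventional threshold with room to spare while the large one does not, so the p value by itself says little about how big or how important an effect is.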

In fact, I believe that NHST as a whole should be banished. At best, I regard an exact p value as providing nothing more than a loose indication of the uncertainty associated with a finite sample and a finite sample alone; it does not reveal God’s truth or substitute for thinking on the researcher’s part. I’m by no means alone (or original) in coming to this conclusion. More and more professional statisticians, as well as researchers in fields as diverse as demography, epidemiology, ecology, psychology, sociology, and so forth, are now making the same argument – and have been for some years (for just a few examples, see Oakes 1986; Rothman 1998; Hoem 2008; Cumming 2012; Fidler 2013; Kline 2013).

But if we abandon NHST, do we then have to stop doing anything other than purely descriptive statistics? After all, we were taught in intro stats that NHST is a virtual synonym for inferential statistics. But it’s not. This is not the place to discuss the alternatives to NHST in any detail (see the references provided a few sentences back), but it seems to me that instead of making a categorical yes/no decision about the existence of an effect (a rather metaphysical proposition), we should be more interested in estimating effect sizes and gauging their uncertainty through some form of interval estimation. We should also be fitting theoretically interesting models and estimating their parameters, from which effect sizes can often be computed. And I have to admit, despite having been a diehard frequentist for the last several decades, I’m increasingly drawn to Bayesian analysis (for a crystal-clear introduction to which, see Kruschke, 2011). Thinking in terms of the posterior distribution, that is, of the support for a model provided by previous research as modified by new data, seems a quite natural and intuitive way to capture how scientific knowledge actually accumulates. Anyway, the current literature is full of alternatives to NHST, and we should be exploring them.
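As one example of what reporting an effect size with its uncertainty might look like, here is a minimal sketch of a difference in means with an approximate confidence interval. It is only one of many possible forms of interval estimation; it uses a large-sample normal quantile (a t quantile would give a somewhat wider interval for small samples), and the data and function name are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_difference_interval(group_a, group_b, confidence=0.95):
    """Return (difference, lower, upper) for mean(group_a) - mean(group_b),
    using an unpooled standard error and a normal quantile."""
    diff = mean(group_a) - mean(group_b)
    # Standard error without assuming equal variances in the two groups.
    se = sqrt(variance(group_a) / len(group_a)
              + variance(group_b) / len(group_b))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical measurements for two groups.
treated = [5, 6, 7, 8, 9]
control = [3, 4, 5, 6, 7]

diff, lower, upper = mean_difference_interval(treated, control)
print(f"difference = {diff:.2f}, 95% interval = ({lower:.2f}, {upper:.2f})")
```

The output is an estimate of how large the effect is and how uncertain we are about it, rather than a yes/no verdict on whether it exists.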

By the way, the whole anti-NHST movement is relevant to the “Mermaid’s Tale” because most published biomedical and epidemiological “discoveries” (including what’s published in press releases) amount to nothing more than the blind acceptance of p values less than 0.05. I point to Anne Buchanan’s recent critical posting here about studies supposedly showing that sunlight significantly reduces blood pressure. At the p < 0.05 level, no doubt.



REFERENCES

Cohen, J. (1994) The earth is round (p < 0.05). American Psychologist 49: 997-1003.

Cumming, G. (2012) Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.

Fidler, F. (2013) From Statistical Significance to Effect Estimation: Statistical Reform in Psychology, Medicine and Ecology. New York: Routledge.

Gigerenzer, G. (2004) Mindless statistics. Journal of Socio-Economics 33: 587-606.

Hoem, J. M. (2008) The reporting of statistical significance in scientific journals: A reflexion. Demographic Research 18: 437-42.

Kempthorne, O. (1971) Discussion comment in Godambe, V. P., and Sprott, D. A. (eds.), Foundations of Statistical Inference. Toronto: Holt, Rinehart, and Winston.

Kline, R. B. (2013) Beyond Significance Testing: Statistics Reform in the Behavioral Sciences. Washington: American Psychological Association.

Kruschke, J. K. (2011) Doing Bayesian Data Analysis. Amsterdam: Elsevier.

Longford, N. T. (2005) Model selection and efficiency: Is “which model…?” the right question? Journal of the Royal Statistical Society (Series A) 168: 469-72.

Maxwell, S. E. (2004) The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods 9: 147-63.

Nelson, N., Rosenthal, R., and Rosnow, R. L. (1986) Interpretation of significance levels and effect sizes by psychological researchers. American Psychologist 41: 1299-1301.

Oakes, M. (1986) Statistical Inference: A Commentary for the Social and Behavioral Sciences. New York: John Wiley and Sons.

Rothman, K. J. (1998) Writing for Epidemiology. Epidemiology 9: 333-37.

Ziliak, S., and McCloskey, D. N. (2008) The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor: University of Michigan Press.