Monday, November 3, 2014

TMI? Personalized probability

I went to my dentist on Friday expecting to go home with a permanent crown on one of my molars, but instead I went home with an appointment with an endodontist, to be evaluated for a root canal.  My dentist, who I think is a very good one, said that 6% of people getting crowns go on to need a root canal, and, based on my symptoms, he was 90% sure I'd be one of those 6%.  Lovely.

But what does that 6% even mean?  Just off the top of my head, it could reflect the fallibility of the indications for a crown; it could mean that the need for a crown and the need for a root canal are unrelated events that just happened to occur at the same time; it could mean that some dentists are better diagnosticians than others.  And of course, it's based on the assumption that all crowns are necessary and are fixing something real, because if not, people getting crowns they didn't need aren't actually at risk of a root canal.  Maybe it's all of these.  Someone might know, but not I.  That 90% was, I assume, mostly professional courtesy, allowing for the possibility that the endodontist may not agree -- that is, it was a subjective statement, not really a probability, and how he arrived at it is anyone's guess (maybe even his own!).

Root canal; Wikipedia


My dentist also told me that there is probably a subgroup of crown patients that is more likely to need root canals than others.  If I remember correctly what he said -- keep in mind that I had just been told I'd need one of the most dreaded dental procedures there is, so my memory could be faulty here -- he said that the subgroup consists of people who experienced more pain before the crown procedure than the other 94%.  Which I translated to mean that these could be people with problems that their dentists misdiagnosed; maybe they needed a root canal from the beginning, but I could be totally wrong about that because I really know next to nothing about how dentists decide when a tooth needs a crown or a root canal.  In any case, it did make me think about the whole idea of probability and prediction.

First, I wasn't at all surprised that I landed in that 6% group.  My family has experienced what we think is an unlikely number of rare events, so of course I would be in that 6%.  Right?  Well, no.  Each event has been seemingly independent, with separate causes, so what has happened before has no bearing on whether or not I'm going to have a root canal.  One person being hit by lightning (say) doesn't make it more likely that another will be hit by a bus.  Unless, you know, bad luck is genetic.

But let's say these things aren't genetic.  Let's say they are random events and everyone at risk is equally at risk.  It's easy to calculate the chances of something like winning the lottery because the denominator, the number of people at risk, is readily available -- it's the number of people who buy a lottery ticket.  It's a lot harder to calculate the chances of being hit by lightning or by a bus because we can't know how many people are at risk.  We'd have to know how many people are outside during a storm, or crossing a street when a bus comes along, and that's impossible unless we're asking about your street corner or my backyard.

In that case, we've made it easy to calculate the denominator, but the likelihood of the event happening exactly there has now become astronomically low, no matter how precisely we know how many are at risk.  And if we expand the geographic area again to a city or a state so that the event has a reasonable probability of happening there, we might know how many people are struck by lightning or a bus (though that's a hard one too, because surely people don't always report being hit by a bus) but again, we've lost sight of the denominator, so it's impossible to calculate the probabilities.  Unless, like the National Weather Service, we assume the whole population of the US is at risk every year, in which case, according to their calculations, each of us has a probability of being hit by lightning in a given year ranging from 1/1,900,000 to 1/960,000 depending on whether we're using actual reported lightning strikes or estimated strikes.
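
Just to make the arithmetic concrete, here is a minimal sketch of how a population-wide estimate like the Weather Service's works when we simply assume everyone is in the denominator.  The strike counts below are not official figures; they're illustrative values back-calculated from the odds quoted above.

```python
# "1 in X" annual odds, assuming the whole US population is equally at risk.
# The strike counts are illustrative, reverse-engineered from the quoted odds,
# not official NWS figures.

us_population = 319_000_000        # rough 2014 US population
reported_strikes_per_year = 168    # illustrative: implies roughly 1 in 1,900,000
estimated_strikes_per_year = 332   # illustrative: implies roughly 1 in 960,000

def one_in(population, events_per_year):
    """Annual odds for a randomly chosen person, with the whole
    population as the denominator."""
    return population / events_per_year

print(f"reported strikes:  1 in {one_in(us_population, reported_strikes_per_year):,.0f}")
print(f"estimated strikes: 1 in {one_in(us_population, estimated_strikes_per_year):,.0f}")
```

The whole trick, of course, is in that denominator: change who counts as "at risk" and the odds change with it.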

What are the odds of being hit by a bus in a given year?  This is a hard statistic to find, and it's necessarily just an estimate, not only because such accidents aren't always reported, and we haven't specified whether we mean fatal or not, but also because sometimes bus accidents involve people inside the bus.  Given all that, the odds of being hit by a bus per year in the UK have been estimated to be 1/500,000.  That presumably means something like: there were so many reported bus accidents, and so many people in the UK, in that year (but are they counting tourists and undocumented workers?).  But that's also a bit suspect, because some people live and walk around in cities, and others in the country where there are no buses.  Or what if they drive?  Does being hit by a bus while in one's car count in the reported statistic?

Let's say someone in your family has been hit by a bus, and another family member has been hit by lightning.  If the two are independent -- and how would we know that? -- then the odds would be the product of the two figures given above, roughly ten to the minus 12th power, or 11 zeroes after the decimal point.  Since there are only about 10 to the 9th power humans on earth, your nuclear family is probably the only family on Earth in which these two events have occurred together (actually, this is impossible to estimate for a given family if we don't specify family size and so on, but the point is that there can be events so rare that the fact of both of them happening in one family is a one-time occurrence).  If we exclude indigenous people living where there are no buses (though there is likely to be lightning), even just these two events make you super-unique.  And it's such a rare circumstance that it's hard to know what it has to teach us about the probabilities of such things happening.  Most of the useful stuff we learn from science is based on reproducible events -- because if not, the odd thing that happened ends up on the News of the Weird page in your local paper.  But even things like lightning strikes are not truly reproducible.
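
For a rough sense of the numbers, here's a back-of-envelope version of that multiplication, using the figures quoted above and the very independence assumption the text questions; the family count is just a guess for illustration.

```python
# Back-of-envelope: if the two events are independent, the chance that one
# family sees both is roughly the product of the two annual odds quoted above.
# The family count is a guess, used only for scale.

p_bus = 1 / 500_000          # annual odds of being hit by a bus (UK estimate above)
p_lightning = 1 / 960_000    # annual odds of being struck by lightning (US, estimated strikes)

p_both = p_bus * p_lightning     # about 2e-12, i.e. 11 zeroes after the decimal point
n_families = 2_000_000_000       # rough guess at the number of families on Earth

print(f"P(both events in one family, in a year) ~ {p_both:.1e}")
print(f"Expected number of such families worldwide ~ {n_families * p_both:.4f}")
```

With numbers like these the expected count comes out far below one, which is the sense in which a family with both events is probably one of a kind.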

We can enumerate all the steps that led to my Uncle Ed being hit by a bus, but they won't necessarily apply to your Uncle Ed's experience with a bus.  The outcome, that they both ended up in front of a bus at the wrong time, is the only generalizable and useful fact here.  That my Uncle Ed stopped to buy a newspaper to read this week's News of the Weird, and was so absorbed in the story about the elephant that he stepped off the curb to cross the street just when the bus came around the corner is irrelevant to the facts of your Uncle Ed's ill-fated meeting with his bus. Your Uncle Ed, as it happened, was in a hurry and, yes, texting, and he stepped off the curb to get around the throng of tourists gawking at the Empire State Building just as a bus came along.

New York City bus; Wikipedia

The facts in each case, interesting and important as they are to each Uncle Ed and his family, can't be used to predict the steps by which someone else's Uncle Ed will end up being hit by a bus.  We might conclude that reading the News of the Weird is dangerous, but that seems illogical.  We might, but probably shouldn't, conclude that tourists are dangerous.  But if we begin to collect enough stories about people being hit by a bus while texting, we can start to suspect that texting might be dangerous.  But by looking at causation in aggregate, we're likely to pay attention to common causes and ignore the rare ones.  Indeed, my dentist didn't feel he needed to tell me that a tooth that seemed to need a simple crown might end up needing a root canal as well.

But, in every case, we've got a lot of information that doesn't apply to everyone else, or even anyone else, even if it applies in the individual case.  TMI, except that if the event is rare -- and particularly if it's unique -- we can't always determine which information is relevant to its cause, especially if there are multiple pathways to the outcome.  It's only when we've got a lot of replications of the same event that we can begin to sort that out, but even then we usually can't begin to predict the event.  Not everyone texting gets hit by a bus, just as not everyone who smokes gets cancer, and not everyone with a BRCA risk allele gets breast cancer.  Public health policy warns against the effects of smoking anyway, even when only 10% of smokers get cancer.  We're not good at predicting who will be among that 10%, so we warn everyone.

The danger of numerology
We often read about coincidences like the Uncle Eds'.  Probably, you can find the story in the News of the Weird -- two Uncle Eds hit by a bus.  And if it was the same bus, all the better.  But there is a very common danger in that.  If you sort through enough possible events, what in itself seems like an incredible coincidence becomes essentially expectable.  There are enough rare events in this world that the chance that two Eds share at least some such event is not so small.  The chance that my family has experienced the particular rare events that we've experienced is probably so small that we're the only family in the history of families on Earth to have experienced those specific events, but families that are unique in some such way are in fact probably quite common.  And depending on how we classify events, every family is unique.  (A good discussion of this is in a recent episode of the BBC Radio 4 program More or Less, which tries to debunk misuses of statistics.)  We see the problem of assessing the uniqueness of events all the time in twin studies, too.  If my dog and your dog have the same name it's just a curious coincidence, but if my dog and my twin's dog have the same name, Nature will publish that as a genetic finding (thank you, Holly).  So we have to be careful of that sort of numerological game.
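
A toy calculation can show how sorting through enough possible events turns a seemingly incredible coincidence into something expectable.  The numbers here (1,000 possible rare events, each a one-in-a-million chance per family, two billion families) are invented purely for illustration.

```python
# Toy illustration of the numerology trap: a *specific* pair of rare events is
# essentially never expected in any family, but *some* pair of rare events, in
# *some* family, is close to inevitable. All numbers are invented for illustration.

K = 1_000                    # distinct rare events a family could, in principle, experience
p = 1e-6                     # chance of any one of them, per family, assumed independent
n_families = 2_000_000_000   # rough guess at families worldwide

# One particular pair (say, bus + lightning):
p_specific_pair = p ** 2

# At least two of the K events, whichever two they happen to be:
p_some_pair = 1 - (1 - p) ** K - K * p * (1 - p) ** (K - 1)

print(f"specific pair: p = {p_specific_pair:.1e}, expected families = {n_families * p_specific_pair:.3f}")
print(f"some pair:     p = {p_some_pair:.1e}, expected families = {n_families * p_some_pair:,.0f}")
```

Under these made-up numbers the specific coincidence is expected in essentially no family, while roughly a thousand families worldwide have some equally startling pair of stories to tell, which is why the News of the Weird never runs dry.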

The nature of prediction, done properly
The concept of 'probability' is a very elusive one, and really it has no single definition.  But most uses refer to repeatable events, as mentioned above.  That is, the probability of an outcome is the fraction of times that outcome is observed among repeated experiences.  Likewise, prediction is generally based on aggregate data.  That's what epidemiology does: collect a pile of data in a population and figure out which variables, or exposures, are most likely to be associated with an outcome or disease.  In genetic epidemiology, if enough people with a disease share a genetic variant, that variant starts to look causal.  In the easiest cases, which are usually rare diseases, the variant can then often, but certainly not always, be considered predictive, or at least diagnostic.
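
As a minimal sketch of what that frequentist reading amounts to in practice, the snippet below just counts outcomes over repeated trials; the 6% "true" risk is an assumed value, used only to show the estimate converging on whatever actually generates the data.

```python
# Frequentist probability in one line: the fraction of repeated trials in which
# the outcome occurs. The 6% figure is an assumed generating value, not data.
import random

def estimated_probability(trial, n_repeats=100_000, seed=1):
    """Estimate P(outcome) as the observed fraction across repeated trials."""
    rng = random.Random(seed)
    return sum(trial(rng) for _ in range(n_repeats)) / n_repeats

# A stand-in "crown patient" whose underlying risk of needing a root canal is 6%.
needs_root_canal = lambda rng: rng.random() < 0.06

print(estimated_probability(needs_root_canal))   # close to 0.06, given enough repeats
```

The definition only bites when the "trial" really can be repeated; for a one-off event, the fraction has nothing to range over, which is the elusiveness the paragraph above is pointing at.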

As we've said many times before, epidemiology and many other areas of observational science do best when causation is simple.  Simple, single-factor causes make the notion of replication more sensible and more reliably observable.  But this is far less clear when causation is complex, involving multiple simultaneous contributing factors, no one combination of which is necessary.

Too often, even if hundreds of millions of dollars have been spent on trying to understand the cause of a disease, prediction is hard.  In a very nice post on the blog 'chronic health', Lutz Kraushaar shows that this is true for heart disease.  He writes:
If weather forecasts were as reliable as cardiovascular risk prediction tools, meteorologists would miss two thirds of all hurricanes, expect rain for 8 out of 10 sunny days, and fail to see the parallels to fortune telling.  
Epidemiologists have tried for decades to separate the wheat from the chaff with heart disease, to determine which risk factors are predictive, which are necessary and sufficient.  In Uncle Ed terms, this means they've tried to determine whether reading the News of the Weird is causal or whether the cause is actually texting.  Or maybe both are causal.  And even though heart disease is a major cause of death in rich countries, we still don't know very well how to predict who'll be affected.  In a way, that's because every heart attack is a rare, even unique event, and aggregating data may discard the variables that were important to the cause of my heart attack.  For my Uncle Ed, the News of the Weird was an important cause of his getting hit by the bus, but it's not predictive for anyone else, or at least not reliably predictive.

But we're all rare events.  Every genome that has ever existed has been unique.  We are each now a 1 in 7.5 billion event (and that's without counting our also entirely unique ancestors).  And this means that every event for every one of us follows its own path.  Even the phenotype of single gene diseases can differ between individuals because, while they may share a causal variant, it's in a unique genomic context, and the modifying effects of that context are unique.  Further, when a disease involves environmental exposures, everyone's exposure history will be unique, and this makes retrodicting cause (which is what epidemiology does) difficult; predicting disease is even more difficult, because we can't predict future environments, and we can't know which particular genomic contexts, each one of them unique, are significant risk factors or what kinds of modifiers they'll be.

So, while believers in personalized medicine based on our DNA are promising us miracles, in some ways this will be too much information, information we don't know what to make of.  Predicting complex diseases, those due to environmental exposures interacting with many genes, is going to be impossible, in principle, much of the time.  Who knew the News of the Weird was going to cause Uncle Ed to walk in front of a bus?

3 comments:

  1. I really enjoyed this, Anne. It reminds me, as your posts often do, of the trouble with claiming that any prediction of less than 100% was accurate. If Nate Silver says Obama has a 60% likelihood of winning and then Obama wins, Nate Silver is said to have been "right", etc... This isn't so. He would also have been right if Obama had lost, since he gave that a 40% probability (in this hypothetical example). So ... what does probability mean? At the level of the event or the person? Whatever we use it for to fit our narrative, is what I'm inclined to answer.

  2. Thanks, Holly. The Kraushaar post I linked to above is great on the bad job medicine does of predicting heart attacks, and it's sobering. But I'm still stuck on what a 20% risk of heart attack even means.

    I think public health decided long ago that even if risk is small, if an outcome is preventable, PH action should be taken. In a sense, that's using probabilities to fit our narrative.

  3. You're on the mark, Holly.
    'Probability' is a very problematic (probabilmatic?) concept to start with. In life sciences it almost always means something about clearly replicable events, which much of life isn't like.

    Or it is somebody's hunch to which they give an entirely subjective number between 0 and 1, as if the situation in question could be repeated, which it often can't be; in which case it's just another way to sound insightful, scientific, and technical while uttering a hunch. An example is the ever-wise Dawkins' statement that he's 99% sure that God doesn't exist.

    At least those advocating Bayesian approaches openly acknowledge that they are using the system to provide a clear means of stating one's hunch, as a hunch.
