Tuesday, April 1, 2014

Is complex disease risk predictable, or even parseable? Problems without solutions?

A very smart friend of ours tells us that he likes our posts about the problems that are impeding progress toward explaining and predicting disease, which are major goals of genetics and epidemiology.  What causes asthma, or heart disease, or obesity, or hypertension, or breast cancer?  And how can we know who is going to get these diseases?  We have written a lot about why these questions are so hard to answer.  But now our friend is asking us for a solution.

Perhaps it does seem unfair to criticize without proposing a solution, though we (naturally) think otherwise.  At risk of sounding like a broken record, as we've said before, asking the critic to solve the problem he or she is criticizing is like asking a jury that votes not to convict to identify the guilty party!  We've also said before that the solution, if there is one, is going to come from some smart young person who sees both the problems and a way beyond them.  Such insights can't be ordered up the way you order a burger at McDonald's; history shows that they arise unpredictably -- but in prepared minds.

The very successful early history of advances in genetics and epidemiology is telling, as it illuminates what it is about the diseases these fields still haven't explained that makes them seem so intractable.  In retrospect, these fields picked the low-hanging fruit, which led to the assumption that all the fruit would be as easy to pluck.  In some ways, rather arrogantly, investigators even trumpeted the fact that they were just picking the low fruit.

The same could be said for epidemiology -- infectious disease agents were, in retrospect, easy, and finding them led to major successes in treatment and prevention, with major improvements in public health.  These were largely solutions to 'point cause' problems -- exposure to one thing was enough to generate the disease.  Infectious disease led the way, but single-gene 'Mendelian' science wasn't too far behind -- though at first the actual gene was hard to identify even when the single-gene cause seemed clear.

Again as we've written many times before, genetics was flying high starting in the 1980's when various technologies led to single genes being identified in which alleles (variants in the population) were essentially sufficient to cause disease.  Among these, the CFTR gene mutation that causes cystic fibrosis in Europeans was a major and justifiably heady find, and reinforced the idea that genes 'for' many diseases, and other traits, were just waiting to be found.  And, in the past 30 years or so, something like 6000 genes associated with rare genetic diseases have indeed been identified.

[xkcd comic]
Some environmental causes of less clear-cut chronic diseases, which generally arise slowly during life rather than being present at birth, were also identified in the mid to late 20th century, and public health measures to prevent diseases caused by smoking, asbestos, coal dust, sunlight, and other exposures were enacted and were to some extent successful.

Why were these relatively easy?  Because the effect of the individual risk factor or gene is strong, and exposure to it can be identified and, often, the level of exposure measured.  Even in the 'simple' single-cause diseases, there turn out to be complications (not everyone with a 'causal' gene will get the disease, and not everyone with what looks like the same disorder has the same causal allele), but by and large when someone is exposed to the bacterium Vibrio cholerae he or she will get cholera, and by and large -- though it is clear that this is not always true, and the extent varies by gene and trait -- when someone is born with an allele associated with a rare genetic disease like Huntington's or Tay-Sachs, he or she will have or develop the disease.

Once the gene, such as CFTR, was known and could be sequenced, many other, rare, variants were found that contributed to various levels of severity in the trait. The simple 'recessive' or 'dominant' nature of the trait blurred into a more quantitative relationship between variation in the gene and severity of the trait.  But while this is complicated, it is tractable because at least we know what gene to look at.

Over 20-some years of applying modern laboratory, computer, and statistical technologies, there have been many advances on many fronts.  But the major diseases and disorders that are still unexplained in terms of major causes and predictability are generally chronic diseases like cardiovascular disease, obesity, and asthma: these are complex traits, meaning that there is rarely if ever a one-to-one relationship between cause and effect; instead it's many-to-many.  Many different phenotypes, and many ways to get to each of them.  And rarely are there single causes with strong effects, as with infectious diseases, diseases caused by smoking, or the generally rare genetic diseases for which causal alleles have been identified.  Instead, GWAS and other studies are identifying sometimes hundreds of genes associated with single diseases, and epidemiological studies, try as researchers might to pin blame on single causes like sugar or saturated fats, are finding many factors that might be associated with chronic conditions like heart disease and the like.  In the case of genetic factors related to these complex traits, there may be variants with strong effects at individual genes, but these usually turn out to be very rare in the population and/or to require some particular lifestyle factor to bring out their harmful effect -- and this rarity makes them very hard to identify.

So? you say.  Just because it's hard, doesn't mean we can't do it!  We'll sequence thousands, or tens or hundreds of thousands of individuals, and identify all the rare causal alleles that way.  Or we'll start biobanks, amassing exposure and genetic data for equally large numbers of people, and figure out causes, and this will let us predict who'll get which diseases later on.  The more the merrier!  Big Data science is good for one and all, since even small effects will be detectable if we have ginormous sample sizes and equally large research groups to comb through the data.

Sometimes there is no solution
But there's a problem with this approach.  It assumes that the solutions that work for causes with major effects are applicable to chronic complex diseases.  There are many reasons why that isn't true.  Indeed, a large part of the problem may be that there can be no solution to many components of the problem.  Larger sample sizes may not just 'even out' the various sorts of error and complication; instead, larger samples may simply introduce ever-larger amounts of complicating variation (technically speaking, the 'signal to noise' ratio may not behave as required by most statistical theory).
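To make that concrete, here is a minimal simulation sketch, in Python, with entirely invented frequencies and risks, of how pooling ever more heterogeneous subgroups can dilute a perfectly real genetic signal rather than sharpen it.  It is an illustration of the general point, not of any particular study:

```python
import numpy as np

rng = np.random.default_rng(0)

def pooled_odds_ratio(n_subgroups, n_per_group=5000):
    """Variant V strongly raises risk in subgroup 0 only; elsewhere it is neutral.
    Returns the odds ratio for V estimated in the pooled sample."""
    carrier_all, case_all = [], []
    for g in range(n_subgroups):
        carrier = rng.random(n_per_group) < 0.10          # 10% carry V (invented)
        risk = np.where(carrier & (g == 0), 0.40, 0.05)   # V matters only in subgroup 0
        case = rng.random(n_per_group) < risk
        carrier_all.append(carrier)
        case_all.append(case)
    carrier = np.concatenate(carrier_all)
    case = np.concatenate(case_all)
    a = np.sum(carrier & case)      # exposed cases
    b = np.sum(carrier & ~case)     # exposed non-cases
    c = np.sum(~carrier & case)     # unexposed cases
    d = np.sum(~carrier & ~case)    # unexposed non-cases
    return (a * d) / (b * c)

for k in (1, 5, 20):
    print(f"{k:2d} subgroups pooled -> odds ratio ~ {pooled_odds_ratio(k):.1f}")
# In subgroup 0 alone the odds ratio is large; as more heterogeneous groups
# are pooled it drifts toward 1 -- a bigger sample, but a weaker signal.
```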

Some of the reasons we're not able to identify the causes of many diseases are methodological; confounding can always throw a wrench in the works.  A classic example of confounding comes from the mid-1950s and '60s, when only middle- and upper-class Americans had telephones, and studies of breast cancer suggested that having a telephone caused the disease.  As it turned out, other factors associated with wealth -- confounders such as delayed age at first birth and fewer children -- appeared to be the actual causes, not the phone; but data on those factors had not been collected.  How can you ever know that you have observed all possible risk factors directly?

This is especially true of gradual- and late-onset traits, because to identify the causal factors -- if that phrase even has meaning -- you would need to measure a person's whole lifetime's worth of exposures, perhaps including prenatal exposures, which people may not be aware of or may have forgotten.
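For the telephone story above, a toy simulation (again with made-up numbers, purely to illustrate confounding) shows how an exposure with no effect at all can look guilty when the wealth-linked factors are never recorded:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Wealth drives both telephone ownership and the real (hypothetical) risk
# factor, delayed first birth; the telephone itself does nothing.
wealthy = rng.random(n) < 0.30
telephone = rng.random(n) < np.where(wealthy, 0.90, 0.10)
late_first_birth = rng.random(n) < np.where(wealthy, 0.50, 0.10)
disease = rng.random(n) < np.where(late_first_birth, 0.04, 0.01)

print("risk among phone owners :", round(disease[telephone].mean(), 4))
print("risk among non-owners   :", round(disease[~telephone].mean(), 4))
# Phone owners show a clearly elevated risk even though the phone plays no
# causal role, because the wealth-linked confounder was never measured.
```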

Study biases, which may be introduced wholly inadvertently by investigators, are equally likely to mess with study results, as are inappropriate study designs or statistical methods that rely on inaccurate or inapt assumptions, and so forth.  Yes, these are preventable and sometimes reparable, at least in principle; but perfect studies do not exist, and that's a problem.  Certainly it was a problem in the heyday of genetics and epidemiology too, but when the risk factor has strong effects, these issues are less likely to impede success.  In those cases imperfect designs were good enough, but that supply of low-hanging fruit eventually becomes exhausted.

Predicting the future
But there are issues that are actually impossible to correct for, even in principle.  It's impossible to predict the future, for example.  And yet, when we want to predict who will get a disease, we assume that environmental exposures will remain constant over time.  This is a fundamental problem with biobanks, for which data are collected on the assumption that gene-environment interactions will be in two decades what they are today, and it's a problem with the push to sequence everyone at birth in order to predict their future illnesses.

Before World War II, Native Americans were essentially free of obesity, type 2 diabetes, and gallbladder disease.  These diseases are now endemic in Native American populations, and given that the particular natural history of this group of diseases (high gallbladder disease rates, relatively low heart disease rates given obesity levels, and the quick response to some environmental change) is characteristic of these populations, it's possible that there is a fairly simple genetic susceptibility that may have come to the New World with migrants who crossed the Bering Strait ten thousand or more years ago.  (Ken blogged about this here.)

If susceptibility is due to genetic risk, Native Americans would have carried this risk for millennia with no ill effect.  Only in a 'properly' provocative environment did these alleles become disease alleles.  This was entirely unpredictable, just as the novel gene/environment interactions that will cause disease in us or our children or grandchildren in 20, 30, or 40 years are unpredictable today.

Irreproducibility
Reproducibility is a gold standard of science, a measure of the credibility of a result.  One of the criticisms of GWAS searches for causal alleles has been that they are so often irreproducible, and indeed this is a criticism of many epidemiological studies, too.  Is the Mediterranean diet good for us?  Are saturated fats harmful?  The list of irreproducible results is long, and often the butt of jokes.

But sometimes results can be correct but still can't be duplicated.  There are many families with rare disease-causing alleles that aren't found in any other family, even families that have the same or similar phenotype.  Population isolates, too, can have private diseases or alleles that aren't found in other populations.  If there is a significant genetic contribution to type 2 diabetes, the cause in Native Americans may be different from that in Samoans.  If the cause of asthma is, say, excessive hygiene in some kids, maybe it's cockroaches in other kids.

So it's impossible to know what irreproducibility is telling us.  The first study, later found to be irreproducible, may in fact correctly represent the specific population that participated in it, each participant with his or her own unique path to disease; the fact that the results can't be duplicated may mean nothing about the accuracy of either the first or the second study.
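Here is a toy version of 'correct but irreproducible', with invented frequencies and risks: two populations with different causal variants, each study faithful to its own population, neither replicating the other:

```python
import numpy as np

rng = np.random.default_rng(2)

def study(causal_variant, n=20_000):
    """Estimate the relative risk for variants 'X' and 'Y' in a population
    where only `causal_variant` actually raises risk."""
    rr = {}
    for v in ("X", "Y"):
        carrier = rng.random(n) < 0.05                             # 5% carry the variant
        risk = np.where(carrier & (v == causal_variant), 0.30, 0.05)
        case = rng.random(n) < risk
        rr[v] = case[carrier].mean() / case[~carrier].mean()
    return rr

print("Population A (X causal):", {v: round(r, 1) for v, r in study("X").items()})
print("Population B (Y causal):", {v: round(r, 1) for v, r in study("Y").items()})
# Each study accurately describes its own population; the failure of one to
# 'replicate' the other says nothing about the accuracy of either.
```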

Another superficially invoked 'gold standard' criterion is the rather self-congratulatory and fanciful claim that we try very hard to falsify our hypotheses, because learning what isn't true, that we thought might be, gradually leads us to what is true.  First, science rarely really tries to falsify its ideas; to the contrary, researchers usually try to cook the study-design books to get the answer they want.  Even 'Big Data' hypothesis-free approaches assume there must be (for example) a genetic cause, and will work very hard to show it.

But more importantly, the amount of variation in life means that 'falsifying' an idea by failing to support it in a study doesn't necessarily mean that the idea is wrong.  There could be lab or sampling errors, or statistical results that aren't quite what one's significance cutoff demanded.  Or populations or model animals may simply vary too much to expect that a result will hold up.  Falsifiability is far from a definitive criterion for finding truth.

Statistical issues
Some future exposures may not be predictable precisely, but one might at least know the distributions of exposure in some probabilistic terms.  If we knew how many people were going to drive in what types of cars for how many miles, we could predict at least statistically how many accidents there might be and what a person's risk, given his/her exposure, would be.  But even here, these factors are unknowable, because we don't know and can't know what safety features or driving habits might be present in the future.
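Just to illustrate the kind of statistical prediction we mean, and how fragile it is, here is a trivial worked example with invented per-mile crash rates and mileages:

```python
# Assumed (invented) per-mile crash rates and annual mileages.
crashes_per_mile_now = 4e-6
crashes_per_mile_later = 2e-6     # if, say, future safety features halve the rate
miles_per_year = {"light driver": 3_000, "average driver": 12_000, "heavy driver": 30_000}

for label, miles in miles_per_year.items():
    print(f"{label:15s}: expected crashes/yr ~ {miles * crashes_per_mile_now:.3f} now, "
          f"{miles * crashes_per_mile_later:.3f} if the rate shifts")
# The arithmetic is trivial once rate and exposure are known; the trouble is
# that future rates and future exposures are exactly what we can't know.
```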

We often model risk factors as if they act probabilistically, and in part that is necessary because of our various unavoidable sources of error (sampling, measuring, lab errors, etc.).  But often something appears probabilistic because of our ignorance, not our inaccuracy.  So we may often be modeling things in inapt ways when we apply particular statistical approaches.  And even if complex traits were 100% determined, there are far too many unpredictable factors -- exposure to cosmic rays at just the right vulnerable time, say -- for us ever to be able to collect them all.

Statistical approaches to complexity can be only as good as the collected data.  If the data are inherently unavailable (future environmental risk factors, say), or haven't been collected (confounders), no method can make up for that.  

And success depends on what question is being asked.  Asking why this patient has this rare disease potentially has a more concise answer than asking what causes this rare disease in general, because it reduces heterogeneity and poses a precise question.

Population data
The generalizability of population averages to individuals is problematic.  When studies show that people with blood pressure over a given threshold are at higher risk of disease than those with lower blood pressure, and especially when people with lower blood pressure can get the disease as well, what does this mean about your risk?  It's impossible to know.  This is especially problematic when extrapolating what we estimate from the past to the future.
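A back-of-the-envelope example, with invented prevalence and risks, shows why even a genuine doubling of relative risk above some blood-pressure threshold says very little about any one person:

```python
# Invented numbers: 25% of people above the blood pressure threshold,
# lifetime risk 6% above it vs 3% below it (relative risk = 2).
high_bp_fraction = 0.25
risk_high, risk_low = 0.06, 0.03

cases_from_high = high_bp_fraction * risk_high
cases_from_low = (1 - high_bp_fraction) * risk_low
share_from_low = cases_from_low / (cases_from_high + cases_from_low)

print("Chance a high-BP person ever gets the disease:", risk_high)                 # 94% never do
print("Share of all cases arising below the cutoff  :", round(share_from_low, 2))  # ~0.6
# A doubled relative risk is real and useful at the population level, yet it
# tells any one person very little, and most cases come from below the cutoff.
```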

Is all this a solution?
No, we raise these points because they reflect our view of why it's so hard for us to even conceive of a solution.  One can never say 'never', but the issues are serious and demand serious thought, in our opinion.  Except for the single-gene late onset diseases that we know about already (early-onset Alzheimer's, Huntington's, e.g.), by and large it's probably not going to be possible to predict the chronic diseases we will get as we age, either from our genomes or from our environmental exposures, particularly those that haven't happened yet.  In fact all of us are walking around with genetic mutations that should have felled us, but haven't.

There are answerable questions, but these are generally when a risk factor's effect is strong -- but even then not everyone who smokes gets lung cancer, and we don't yet know the extent to which 'causal alleles' aren't causing disease.

Public health as a field long ago relied on population-based approaches to make very good progress against many disorders.  You don't need to know who will get cavities or goiter to know that fluoridated water and iodized salt will reduce the frequency of those problems.  In many ways, it may simply not be worth the effort to try to 'personalize' predictions as the NIH is now grandiosely promising to do from genetic data.

But a sobering thought is that even the public health, population-based approach has proven itself to be weak in the face of causal complexity, as examples mentioned above, and in many of our posts, show.  Even population risks thought to be associated with specific factors, like sugar or saturated fat, are uncertain and unstable at best.  And if our list of reasons for unpredictability is cogent, public health as well as individual prediction is in deep trouble.

Tomorrow we'll try to express these thoughts in the context of a concept that is very familiar and seems very well established, in physics and cosmology, in a way that makes our ideas easier to grasp, and, if they are wrong, to refute.  
