Tuesday, October 21, 2014

And it's even worse for Big Data...

Last week we pointed out how the history of technology has enabled ever larger, more extensive and exhaustive enumeration of genomic variation, applied to understanding the causes of human traits: important diseases, or even normal variation like that in the recent stature paper.

We basically noted that even the earlier, cruder methods such as 'linkage' and 'candidate gene' analysis did not differ logically in their approach as much as advocates of mega-studies often allege. More importantly, all the approaches have for decades told the same story: that from a genomic causal point of view, with a few exceptions, complex traits that don't seem to be due to strongly-causal simple genotypes aren't. Yet the profession has been vigorously insisting on ever bigger, more costly, long-term, too-big-to-stop studies to look for some vaguely specified pot of tractably simple causation at the end of the genomic rainbow.


When a genetic effect is strong, we can usually find it in family members; Wikipedia

But the problem is even worse.  The history we reviewed in last week's post did not include other factors that make the problem even more complex.  First is somatic mutation, about which we wrote separately last week.  But environment and lifestyle exposures are far more important to the causation of most traits than genomic variation.  The clearest evidence is that heritability is almost always quite modest.  With a few exceptions like stature (which is only an exception if you standardize height for cohort or age, that is, for differing environments), most traits of interest have heritability on the order of 30%.
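
To make concrete what "modest heritability" implies, here is a toy simulation, with made-up numbers of our own rather than data from any study, of a trait whose variance is 30% genetic.  Even an oracle that knew every person's total genetic value exactly could explain only 30% of the phenotypic variance:

    # Toy illustration (our numbers, not from any study): a trait that is
    # 30% heritable.  Even knowing each person's total genetic value exactly
    # explains only ~30% of the phenotypic variance.
    import numpy as np

    rng = np.random.default_rng(0)
    h2 = 0.30                                 # assumed heritability
    n = 100_000
    g = rng.normal(0, np.sqrt(h2), n)         # total genetic value
    e = rng.normal(0, np.sqrt(1 - h2), n)     # everything else: environment, noise
    p = g + e                                 # the phenotype

    r = np.corrcoef(g, p)[0, 1]
    print(f"variance explained by a perfect genetic score: {r**2:.2f}")   # ~0.30

The correlation between that perfect genetic score and the phenotype is the square root of 0.30, about 0.55, and that is a hard ceiling for any genotype-based prediction of such a trait, no matter how exhaustively the genome is sequenced.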

But why are environmental contributions problematic?  If we can Next-Gen our way to enumerating all the (germline) genomic variants, why can't someone invent a Next-Env assessor of environments?  There are several reasons, each profound and challenging, to say the least.
1. Identifying all the potentially major environmental contributors, which may occur only sporadically, decades before someone is surveyed, unbeknownst to the person, or simply unknown to the investigator, is a largely insuperable challenge.
2. Defining each exposure is difficult, and measurement error is probably large, to an unknown or even unknowable extent.  We might refine our list, but how accurately can we really expect to measure subtle things like behavioral habits in early childhood, exposures in utero, or even such seemingly obvious things as fat intake, or smoking exposure?  (The sketch after this list shows what such error does to effect estimates.)
3. Somatic mutations will interact with environmental factors just as germline ones will, so our models of causation are simply inaccurate to an unknown extent.
4. Suppose we could get perfect recall, and perfect measurement, of all exposure variables that affected our sampled individuals from in utero to their current ages (and even of pre-conception effects on gene expression, such as epigenetic marking).  That would enable accurate retrospective data fitting, but there is simply no way to predict a current person's future exposures.  The agents, the exposure levels, and even the identity of future risk factors are simply unknowable, even in principle.
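Point 2 deserves emphasis, because measurement error doesn't just add noise: under the classical error model it systematically biases estimated exposure effects toward zero, by a factor equal to the measurement's reliability.  A minimal simulation sketch, again with our own made-up numbers rather than any real exposure data:

    # Toy sketch (our own numbers): classical measurement error in an
    # exposure shrinks its estimated effect toward zero by the reliability
    # of the measurement, var(true exposure) / var(observed exposure).
    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000
    beta = 1.0                                # true effect of the exposure
    x = rng.normal(0, 1, n)                   # true exposure, never seen directly
    y = beta * x + rng.normal(0, 1, n)        # outcome

    for noise_sd in (0.0, 0.5, 1.0):          # increasing measurement error
        x_obs = x + rng.normal(0, noise_sd, n)   # what the survey records
        cov = np.cov(x_obs, y)
        slope = cov[0, 1] / cov[0, 0]            # OLS estimate of the effect
        reliability = 1.0 / (1.0 + noise_sd**2)
        print(f"error sd {noise_sd}: estimated effect {slope:.2f}, "
              f"true effect x reliability = {beta * reliability:.2f}")

With an error standard deviation equal to that of the true exposure itself, half of the real effect disappears, and nothing in the observed data announces that this has happened.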
These facts, and they are relevant facts, show that no matter how sophisticated the DNA sequencing, we cannot expect to turn complex traits into simple traits, or even into enumerable, predictable ones.

More on history
In last Friday's (Oct 17) post on Big Data, we noted that history has shown that even the cruder methods succeeded in giving us the basic message, even if it wasn't the message we wanted to find.

The history of major, long-term, large-scale studies in general shows another reason we invest false expectations in the desire for Big Data.  Setting aside physics and other sciences, there have been a number of very big, long-term studies of various types.  We have had registries, studies of radiation exposure, follow-ups on diet, drug, hormone and other usage, and studies on lipids in places like Framingham, sometimes with multi-generational follow-up.  They have made some findings, sometimes important findings, but usually two things have happened.  First, they quickly reached the point of diminishing returns, even if the investigators pressed to keep the studies going.  Second, many if not most of the major findings have turned out, years or decades later, even in the same study, not to hold up very well.

This isn't the investigators' fault.  It's the fault of the idea itself.  Even things like, say, lipids and other such risk factors are estimated in different ways later in a study than they were earlier, as ideas about what to measure evolve.  Measurement technology evolves, too, so that earlier data are no longer considered reliable, accurate, or cogent enough.  The repeated desire is to re-survey or re-measure, but then all sorts of other things happen to the study population (not least of which is that subjects die and can't be recontacted).  Yet, and we've all heard this excuse for renewal in Study Section meetings, "after all that we've invested, it would be a shame to stop now!"

In a very real scientific investment sense, the shame in many such cases is not to stop now.  But that involves some very difficult politics of various kinds, and of course scientists are people, too: we don't simply react to the facts, however clearly we know them.

Is an Awakening in progress?
Here and there (not just here on MT!) there are signs not only that the complexity is being openly recognized as a problem, but perhaps that an awakening, a formal recognition of the important lessons of Big Data, is under way.

Although the advocates of Big Data Evermore surely remain predominant, a number of tweets out of this week's American Society of Human Genetics meetings suggest at least a recognition among some that caution is warranted when it comes to the promises of Big Data.  The dawning has not yet, we think, reached the serious recognition that Big Data is a wastefully costly and misleading way to go, but there is at least acknowledgment that genotype-based predictions have been over-promised.  How much more time, and how much intellectual as well as fiscal resource, will be scattered to the winds before a serious re-thinking of the problem or the goals occurs?  Or is the technology-first approach too entrenched in current ways of scientific thinking for any such change of direction to happen?
