The Mermaid's Tale: BigData: scaling up, but no Big message

As technology has advanced dramatically during the past few decades, we have been able to look at the relationship between genotypes and phenotypes in ever more detail, and to address phenogenetic questions, that is, to search for putative genomic causes of traits of interest, on an ever-increasing scale. At each point, there has been excitement and a hope (and widespread promises) that as we overcome barriers of resolution, the elusive truth will be found. The claim has been that prior methods could not identify our quarry, but the new approach can finally do that. The most recent iteration of this is the rush to very Big Data approaches to genetics and disease studies.

Presumably there is a truth, so it's interesting to review the history of the search for it.

The specifics in this instance

Modern scientific approaches to understanding the genetic basis of disease began shortly after Mendel's principles were recognized, around 1900. At that time, the transmission of human traits, like important non-infectious diseases, was documented by the same sorts of methods, called 'segregation analysis', that Mendel used on his peas. Specific traits, including some diseases in humans or pea color in plants, could be shown to be inherited as if they were the result of single genetic causal variants.

Segregation analysis had its limits: while the recurrence of a clearly segregating disorder might be usefully predicted, as for couples wanting to know whether they are likely to have affected children, there was rarely any way to identify the gene itself. However, as methods for identifying chromosomal or metabolic anomalies grew, the gene responsible for some such disorders, usually serious pediatric traits present at birth, could be identified. For decades in the 20th century this was a matter of luck, but by the 1980s methods were developed to search the genome systematically for causal gene locations in the cases of clearly segregating traits. The idea was to trace known genetically varying parts of the genome (called genetic 'markers', typed because they were known to vary, not because of anything they might actually cause) and search for co-occurrence among family members between a marker and the trait. This was called 'linkage mapping' because it tracked markers and causal sites that were together, or 'linked', on the same chromosome.

Linkage analysis was all that was possible for many years, for a variety of reasons. Meanwhile, advances in protein and enzyme biology identified many genes (or their coded proteins) that were involved in particular physiological functions and whose variants could be associated causally with traits like disease. A classic example was the group of hemoglobin protein variants associated with malaria or anemias. The field burgeoned, and identified many genes that were likely 'candidates' for involvement in diseases of the physiological systems in which they acted. So candidate genes were studied to try to find association between their variation and disease presence or traits.

Segregation and linkage analysis helped find many genes involved in serious early-onset disease, and a few late-onset ones (like a subset of breast cancers due to BRCA genetic variants). But too many common traits, like stature or diabetes, were not mappable in this way. The failure of diseases of interest to have clear Mendelian patterns (to ‘segregate’) in families showed that simple high-effect variants were not at work or, more likely, that multiple variants with small effects were. Candidate genes accounted for some cases of a trait, but they also failed to account for the bulk of cases. Again, this could be because responsible variants had small effects and/or were just not common enough in samples to be detected. Finding small effects requires large samples.
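
To give a rough sense of the scale implied by that last point, here is a minimal sketch of a textbook power approximation. It is not taken from any study discussed here. It assumes a single variant explaining a fraction q of the variance in a quantitative trait, a two-sided normal test, and the conventional genome-wide significance threshold of 5e-8. The function name `samples_needed` and its default values are illustrative choices.

```python
import math
from statistics import NormalDist


def samples_needed(variance_explained, alpha=5e-8, power=0.8):
    """Approximate sample size to detect one variant.

    Assumption: the association test statistic is roughly normal, with
    squared non-centrality N * q / (1 - q), where q is the fraction of
    trait variance the variant explains.
    """
    q = variance_explained
    if not 0 < q < 1:
        raise ValueError("variance_explained must be between 0 and 1")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided significance cutoff
    z_beta = std_normal.inv_cdf(power)           # quantile for the desired power
    return math.ceil((z_alpha + z_beta) ** 2 * (1 - q) / q)


if __name__ == "__main__":
    for q in (0.1, 0.01, 0.001):
        print(f"variance explained {q}: about {samples_needed(q)} individuals")
```

Under these assumptions, a variant explaining 1% of trait variance needs on the order of a few thousand individuals for 80% power, and each tenfold drop in effect size raises the requirement roughly tenfold.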

So things went for some time. But a practical barrier was removed and a different approach became possible.

Association mapping is linkage mapping and it is candidate gene testing

As it became possible to type many thousands, and now millions, of markers across the genome, causal effects could be located much more precisely, but such high-density mapping required large samples to resolve linkage associations. Family data are hard and costly to collect, especially families with many members affected by a given disease.

However, linkage does not just occur in close relatives, because linkage relationships between close sites on a chromosome last for a great many generations, so a new approach became possible. Variants arise and are transmitted (if not lost from the population) generation upon generation in expanding trees of descent. So if we just compare cases and controls, we can statistically associate map locations with causal variants: we know the marker location, and the association points to a chromosomally nearby causal site.
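
The persistence of these relationships can be illustrated with a standard population-genetic result: in a large, randomly mating population, the statistical association (linkage disequilibrium) between two sites shrinks by a factor of (1 - r) each generation, where r is the recombination fraction between them. The sketch below assumes that idealized model and ignores drift, selection, and population structure. The function names and the example values of r are illustrative.

```python
import math


def ld_after(d0, r, generations):
    """Linkage disequilibrium remaining after some number of generations.

    Assumes the idealized decay D_t = D_0 * (1 - r) ** t.
    """
    return d0 * (1 - r) ** generations


def ld_half_life(r):
    """Generations needed for the association to fall to half its start."""
    return math.log(0.5) / math.log(1 - r)


if __name__ == "__main__":
    # r = 0.5 means unlinked sites; smaller r means the sites sit closer together.
    for r in (0.5, 0.01, 0.0001):
        print(f"r = {r}: half-life of about {ld_half_life(r):.0f} generations")
```

In this model, unlinked sites (r = 0.5) lose half their association in a single generation, while very close sites with r = 0.0001 keep half of it for close to 7,000 generations. That is why a marker can still point to a nearby causal site in people whose shared ancestry is far too old to document.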

It is perhaps not widely appreciated, especially by those with only a casual background in evolutionary genetics, that markers can be associated with a trait in case-control and similar comparisons because such 'association' studies are in fact linkage studies: we assume that individuals sharing a given marker do so because of some implicit (unknown but assumed) family connection, perhaps many generations deep. The attraction of association studies is that you don't have to ascertain all the family members who connect these individuals, though of course such direct-transmission data provide much more statistical power to detect true effects. Still, if causation is tractably simple, genome-wide association studies (GWAS), a statistically less powerful form of linkage analysis, should work with the huge samples available.

And GWA studies are essentially also a form of indirect candidate gene studies. That's because they identify chromosomal locations that are then searched for plausible genetic candidates for affecting the trait in question. GWAS are just another way of identifying functional candidates. If the candidate's causal role is tractably simple, candidate gene studies should work with the huge samples available.

But by now we all know what’s been found so far—and it’s been just what good science should be: consistent and clear. The traits are not simple, no matter how much we might wish that. All the methods, from their cruder predecessors to their expansive versions today, have yielded consistent results.

Where the advocacy logic becomes flawed

This history shows the false logic in the claim typically raised to advocate even larger and more extensive studies. The claim is that, since the earlier methods have not answered our questions, the newly proposed method will do so. But it's simply untrue that because one approach failed, some specific other approach must succeed. The new method may work, but there is no logical footing for asserting certainty. And there is a logical reason for doubting it.

The argument has been that prior methods failed because of inadequate sample size and hence poor resolution of causal connections. And since GWAS is in essence just scaled-up candidate-gene and linkage analysis, it should ‘work’ if the problem in the first place was just one of study size. Yet, clearly, the new methods are not really working, as so many studies have shown (e.g., the recent report on stature). But there's an important twist. It isn't true that the older methods didn't work!

In fact, the prior methods have worked. First of all, they stimulated technology development that enabled us to study greatly expanded sample sizes. By now, those newer, dense mapping methods have been given more than a decade of intense trial. What we now know is that the earlier and the newer methods both yield basically the same story. It is a story that we have good reason to accept, even if it is not the one we wished for, which was that causation would turn out to be tractably simple. From the beginning the consistent message has been one of complexity: weak, variable causation.

Indeed, all the methods old and new also work in another, important sense: when there's a big strong genetic signal, they all find it! So the absence of strong signals means just what it says: they're not there.

By now, the assertion that ever-larger studies 'will work', or even 'might work', should be changed to 'are unlikely to work'. Science should learn from experience, and react accordingly.

Friday, October 17, 2014
