The Mermaid's Tale: When the crystal ball is cloudy: calling sequence data correctly

Thursday, April 4, 2013

Here's a monkey wrench of a paper (O'Rawe et al.), just published in Genome Medicine. We're all being sold on the idea that knowing our whole genome sequence is going to make us much healthier. The DNA sequencer cum crystal ball will tell us what we're likely to be in for, and this will give us plenty of lead time to prevent it -- by running, lowering our cholesterol intake, losing weight, or whatever -- or to prepare for it.

But, among many other assumptions, this assumes first and foremost that the data are being read correctly, with no false positives or negatives. And here's the clincher: O'Rawe et al. compared five different software packages that read and interpret DNA sequence data, and they report low concordance among the results. Discrepancies have been found before, but not when comparing reads of the same raw data.

This group sequenced the whole exomes of 15 different individuals, in 4 families, and fed the raw data through 5 sequence analysis pipelines. They also sequenced one whole genome. Sequencing was done at 20-154X coverage (120X on average), meaning each nucleotide was read at least 20 times, but most often more, and at least 80% of the target sequence was obtained.

They found that the 5 programs agreed on single nucleotide variants (SNVs) about 60% of the time. That is, 40% of the time an SNV was called by fewer than 5 of the programs. Each of the pipelines detects variants that the others do not, and they aren't necessarily all false positives.
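To make the idea of concordance concrete, here is a minimal sketch of one way it could be computed: the share of all called variants that every pipeline reports. The pipeline names, the toy variants, and the intersection-over-union definition are illustrative assumptions, not the exact procedure used by O'Rawe et al.

```python
from typing import Dict, Set, Tuple

# A variant is identified here by (chromosome, position, reference allele,
# alternate allele). This representation is an assumption for illustration.
Variant = Tuple[str, int, str, str]


def concordance(calls: Dict[str, Set[Variant]]) -> float:
    """Fraction of all called variants that every pipeline reported."""
    call_sets = list(calls.values())
    union = set().union(*call_sets)
    if not union:
        return 0.0  # nothing was called by any pipeline
    shared = set.intersection(*call_sets)
    return len(shared) / len(union)


# Hypothetical call sets from three made-up pipelines.
calls = {
    "pipeline_a": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")},
    "pipeline_b": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "pipeline_c": {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"), ("chr3", 10, "T", "C")},
}

print(concordance(calls))  # 1 shared variant out of 4 distinct calls
```

In this toy case only one of the four distinct calls is made by all three pipelines, so a variant that really is there could easily be missed if you trusted just one of them.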

This disagreement is likely the result of many factors including alignment methods, post alignment data processing, parameterization efficacy of alignment and variant calling algorithms, and the underlying models utilized by the variant calling algorithm(s).

That is, each step along the way potentially introduces errors. Concordance rates for indels (insertions/deletions, segments of DNA one or more nucleotides in length) were even lower, at 26% between three indel calling programs. (The paper goes into much more detail about specific pipelines and error rates.) Using family data can help reduce inaccuracies when it is possible to determine which calls just cannot be correct. But, otherwise, with current methods, reducing false positives means increasing false negatives, and vice versa.
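The point about family data can be sketched in a few lines. The check below flags a child's genotype that could not have come from the parents' called genotypes at the same site. It is a simplified illustration that assumes a mother-father-child trio and an autosomal, diploid site; the function name and genotype encoding are made up for the example. A flagged site could be a miscall in any of the three people, or a genuine new mutation, so it marks a call to re-examine rather than an error for certain.

```python
from typing import Tuple

# A genotype is the pair of alleles called at one site (assumed encoding).
Genotype = Tuple[str, str]


def is_mendelian_consistent(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """True if one child allele could come from each parent."""
    first, second = child
    return (first in mother and second in father) or (
        second in mother and first in father
    )


# Both parents called A/A, child called A/G: at least one call is suspect.
print(is_mendelian_consistent(("A", "G"), ("A", "A"), ("A", "A")))  # False

# Mother A/A, father G/G, child A/G: the calls fit together.
print(is_mendelian_consistent(("A", "G"), ("A", "A"), ("G", "G")))  # True
```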

The authors write,

In the realm of biomedical research, every variant call is a hypothesis to be tested in light of the overall research design. Missing even a single variant can mean the difference between discovering a disease causing mutation or not. For this reason, our data suggest that using a single bioinformatics pipeline for discovering disease related variation is not always sufficient.

This somewhat understates the problem. Serious testing of a SNP (single nucleotide polymorphism) to see if it has an effect on disease risk -- especially when these effects are typically very small in any case, and biased upwards in GWAS-type data -- is no joke. What do you do? Put that single change into a lab mouse or rat and see if it might be more likely to develop slightly higher blood pressure at old age? Or have a slightly higher risk of some sort of cancer (again, to be a human model, it should be at older ages)? Which mouse strain would you use? If humans are to be used for validation, how would you do it?

The questions are serious because miscalls by sequencers go both ways. A sequencer can miss a SNV call, so you don't identify one of the variants that you really want to be checking. Or, it can give you a false positive, and lead you farther astray. And if you must choose between hundreds of variants across the genome, with comparable estimated effects, you are already in a bit of a bind even if they are all perfectly called!

No technology, or medical test, will be correct 100 percent of the time, and sequencing technologies are likely to get better, not worse (though if MS Windows is any guide, that's not necessarily true!). But, when disease risk estimates depend on accurate DNA sequence, it is obvious that we are way premature in proclaiming findings so loudly and demanding that so much effort and so many resources be poured into doing more of the same. Again, focused studies on problems that are more important, clearer, and less vulnerable to these kinds of errors are where the effort should be going.

And, some subtle manipulation, too?
By the way, the standard term for a single nucleotide variation in a population is SNP (single nucleotide polymorphism). Now, some authors use SNV (single nucleotide variant), essentially doing two things. First, they are rhetorically equating 'variant' with causal variant--that is, tacitly, subtly, or surreptitiously planting in your mind that they are onto something causal. And second, they are tacitly, subtly, or surreptitiously suggesting that one of the two is the 'good' or 'normal' (i.e., health-associated) variant. This perpetuates the 'wild type' thinking--see our earlier post 'walk on the wild-type side'.

These are ways in which the community of researchers inadvertently or intentionally (you decide which) cooks the books in your mind, in journalists' minds, and even in their own, entrenching a de facto genetic-causation worldview into their and everybody's thinking. That's good for business, of course.

18 comments:

Anonymous, April 4, 2013 at 11:03 AM
Thanks for posting this, I wasn't aware of the paper. And excellent point regarding de facto "cooking" the data, spinning occurrence into something causal.
Anonymous, April 4, 2013 at 10:13 PM
Ken, today more than most days I feel your "explanatory framework" comment-- after having received yet another rejection on an article of mine reporting what I would consider a potentially foundational aspect of autism genetics but which, I admit, may pose more questions than it answers. Alas, editors are not fond of these kinds of papers. Am having to lower my sights re journals and go with a sure thing just to get it published. I've been surprised by the resistance to a subtle shift in conceptual framework. But then again, I'm not so surprised after all.

Anne, I'm definitely going to be going over the paper in more detail. Was overrun with labwork today though, but have it sitting on my computer desktop in wait. :)
Anonymous, April 5, 2013 at 9:41 AM
Very true, and even though I've had critics who've toed the line of critique and just delved straight into criticism, some of their messages have rung true and, when I listened, often made the paper that much stronger. At the moment, however, the paper-- while definitely reporting results from an experiment-- is more an observation about a common aspect which seems to characterize autism-related genes in general. It was a small study and meant mainly to be a bioinformatics approach as opposed to anything wetlab-based. The first critique was simply that the study was too specific a topic for the journal; the second, that they wanted to see more work performed on underlying mechanisms. This latter one I can definitely agree with, but the purpose of the study was more preliminary and we'll be moving into the lab-based work in future.

But I guess one problem was that it was more an observation with hints of causation but nothing strictly delineated. And I made it so purposefully, as an observation which could lead to new avenues of research. I stressed "relationship" as opposed to "causation". Which was why your comment about how articles are "spun" reminded me of my current situation.

But I do agree regarding smaller journals and simply getting it out there on the net. Which is Plan B and so it's gone on for review now.
Anonymous, April 5, 2013 at 2:42 PM
Thanks for the resources, Ken and Anne. A new age certainly seems to be dawning, one in which I hope that information reigns king more so than researchers' names (not that we should lose the importance of having "leaders" in the respective fields as well, who are also necessary for basic progress). I've been curious about PeerJ since I heard about it. After these latest experiences, a shock to me, a young researcher who, so far, has had unusual luck remaining relatively naive to them due to connections made over a lifetime career by my partner, I can see the importance of this grass roots movement even more. I am definitely being made a convert.
Anonymous, April 5, 2013 at 9:29 PM
Congrats on the editorial position at PJ, Ken. :) I was wandering around the site and noticed a call for people; I'm wondering whether I might be helpful as a reviewer perhaps.

I didn't see the piece about Broadreach, I'll have to backtrack a bit.
Daniel, April 8, 2013 at 6:10 PM
This is a reasonable summary of the take-homes from this paper, although I think it's worth emphasizing that most findings from next-gen sequencing are validated with independent technology as a matter of course - so the impact of false positives on the research enterprise is not as great as you might imagine. False negatives, on the other hand, continue to plague us.

Unfortunately I really disagree with this section of your post:

First, they are rhetorically equating 'variant' with causal variant--that is, tacitly, subtly, or surreptitiously planting in your mind that they are onto something causal.

Both the intent and, I would argue, the effect of this terminology change is precisely the opposite of this: this is a term that decreases rather than increases ambiguity about causality. In fact the term variant has been adopted precisely because it is a neutral term that avoids the baggage associated with the terms "polymorphism" (a variant generally assumed to be benign) and "mutation" (a variant thought to be pathogenic).

And second, they are tacitly, subtly, or surreptitiously suggesting that one of the two is the 'good' or 'normal' (i.e., health-associated) variant.

This objection has more merit in it - but not enough, I think, to argue against adopting the term over its historically muddled alternatives.