Wednesday, September 4, 2013

Collecting data is easy; making sense of it is not

Big Data
More Big Data is on the horizon -- a new estimate of the number of viruses that known mammal species harbor is generating excitement over the plausibility of sequencing them all.  A paper in mBio ("A Strategy To Estimate Unknown Viral Diversity in Mammals," Anthony et al.) reports that sequencing viruses from the Indian Flying Fox, a bat and the first species whose 'virodiversity' was characterized, found 58 viruses, 50 of them previously unknown.  Extrapolating from this, the researchers estimated that the roughly 5,500 known mammal species harbor at least 320,000 viruses.
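The arithmetic behind that figure is easy to check on the back of an envelope. The sketch below uses only the rounded numbers quoted above and assumes, for simplicity, that every mammal species carries about as many unique viruses as the flying fox; the variable names are ours, and the paper's own method may well be more involved.

```python
# Rough check of the extrapolation, using the rounded figures quoted above.
# Assumption (ours): each mammal species harbors about as many unique
# viruses as the Indian Flying Fox does.
viruses_per_species = 58
known_mammal_species = 5_500

estimated_total = viruses_per_species * known_mammal_species
print(f"{estimated_total:,} viruses")  # 319,000, i.e. roughly 320,000
```

The whole estimate, in other words, rests on one species standing in for all the rest.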

The majority of viruses that infect humans come to us from wildlife, so if we sequence them all, which now seems feasible, or at least imaginable in principle, we'd have at least some handle on the kinds of epidemics that might be in our future.  And this for a mere $6.3 billion, trivial compared with the cost of a pandemic, assuming that's what would be in store otherwise. And it would be even less costly if only 85% of all viruses were characterized.
If annualized over a 10-year period, the discovery of 85% of mammalian viral diversity would be just $140 million/year, which is both a one-off cost and a fraction of the cost of globally coordinated pandemic control programs such as the “One World, One Health” program, estimated at $1.9 to 3.4 billion per year, recurring.  While these programs will not themselves prevent the emergence of new zoonotic viruses, they will further contribute to pandemic preparedness by enhancing our understanding of viral ecology and the mechanisms of disease emergence and by providing sequences and other insights that reduce the morbidity, mortality, and economic impact of emerging infectious diseases by expediting recognition and intervention. 
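To put those cost figures side by side, here is a minimal sketch using only the numbers quoted above. It assumes the $140 million/year is a plain ten-way split of the total with no discounting, which is our reading rather than something the quoted passage spells out; the variable names are ours.

```python
# Figures quoted above, in millions of US dollars.
full_cost = 6_300                      # characterizing all viruses, one-off
annual_cost_85 = 140                   # 85% coverage, annualized
years = 10
one_health_per_year = (1_900, 3_400)   # recurring, low and high estimates

# Assumption (ours): annualized cost times years gives the total, undiscounted.
total_cost_85 = annual_cost_85 * years
share_of_full = total_cost_85 / full_cost

low, high = one_health_per_year
share_of_one_health = (annual_cost_85 / high, annual_cost_85 / low)

print(total_cost_85)                               # 1400
print(round(share_of_full, 2))                     # 0.22
print([round(s, 3) for s in share_of_one_health])  # [0.041, 0.074]
```

On these numbers, stopping at 85% would cost about $1.4 billion in total, a bit over a fifth of the full $6.3 billion, and each year's outlay would be roughly 4 to 7 percent of the annual One World, One Health estimate.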
This all sounds extremely valuable to global public health.  But will it be?  We have known about the mode of transmission of influenza for decades, and have developed vaccines that we're all encouraged to get each fall, but we are unable to eliminate the virus in its animal reservoirs.  It mutates every year, it's yet another virus that medicine is unable to treat, and it still kills tens of thousands every winter.

MERS coronavirus; Wikimedia Commons
We know everything we need to know about HIV and it's still killing people.  We worry about avian flu, and about mutations that will make it easily transmissible between humans, and we are unable to prevent that, even though the virus has been well characterized.  Even when the reservoir for MERS is discovered, we'll not be able to treat it, nor prevent its transmission to humans, nor prevent its mutation.  Sequencing a virus is easy; it's what gets done after that that's hard.  And having 320,000 viral sequences won't change that.

Another Framingham -- alas
At least one commenter on this project likens its potential to that of the Framingham Project, a decades-long study of heart disease which many believe has been responsible for a sharp decline in cardiovascular disease in the West.  But in fact incidence of CVD began to fall before any Framingham-related interventions became widespread, and indeed the underlying reason is still not understood.  And some of the fundamental Framingham findings have not been borne out by subsequent research. So, if the project is similar to Framingham, it may well be in collecting huge amounts of data that may not make much difference to public health.

The project has the potential to be similar to Framingham in another major way, and that is in tying up money for decades.  The investigators are already talking about ten years of funding.  And ten years from now, the project will be too big to defund.

And of course this project will have a lot in common with human genetics -- let's just collect the data, and then it will tell us what it means.  Science has gotten very, very good at collecting loads of data.  But that doesn't always go along with asking the right questions, or with understanding what the data mean.

There seems to be a clear pattern in which investigators get together to hatch ever larger, longer and costlier projects, argued on fanciful promises that often boil down to fear tactics (horrible epidemic if you don't fund us!!!).  That is the 'Big Data' worldview.  Time has already proven that these projects are very costly and difficult to terminate after they reach diminishing returns, that they compete with focused individual-investigator research, and that they are a big grab on shrinking or limited funds.

So the very first question one should ask, after filtering out the excess verbiage, is this: would the project, at this time, given the other problems being faced, be worth investing in?  Or should the investigators be told to focus on more immediate or real problems, or on areas of virology and epidemiology with more likely short-term relevance?  After all, some years from now, say ten, technology costs may have plummeted and this could be more affordable, given a recovered economy.  But affordability is not necessarily the same as guaranteed progress.


Ken Weiss said...

I want to add an afterword, to be clear. The Framingham study may well have made major contributions, findings for its time that led to or sped up other findings. The issue to us, here, is that such big studies become perpetual and don't get phased out when they reach diminishing returns. The politics of continuing trump the political risks of termination.

Framingham is just one of many, some much larger, such efforts that one could mention. So our comment was not aimed at this particular study, except as it serves as an example.

Manoj Samanta said...

I agree with your entire post except one tiny part -

"technology costs may have plummeted and this could be more affordable, given a recovered economy"

The US economy will never recover. The country is following the same course as the Netherlands did after 1650.

Anne Buchanan said...

You could be right. And the point should have been clearer that we meant conditional on a recovered economy!