Tuesday, September 6, 2011

Science on autopilot

Are scientists' science skills getting rusty?

A new study has shown that in recent years aircraft accidents, including serious fatal ones, are due not to malfunctioning of the plane but of the pilots.  Not because they're asleep at the wheel, but because they're no longer trained at the wheel.  Pilots are given less and less time flying planes, and more and more training on the autopilot, the onboard computer that does everything for them, so they can flirt with the flight attendant (well, that's illegal now, so maybe they read comics).  Computers are taking over where human thought and judgment turn out to be better.
"What happens is, you don't actually hand-fly or manipulate the controls, whether it's a control yoke or a sidestick controller," Hiatt said. "Therefore, your computer skills get greatly enhanced, but your flying skills start to get rusty."
How is this relevant to science?  In our field, it's highly relevant, at least conceptually.  We are moving, or being pushed, more and more towards collecting samples (blood or tissue), processing them with a robotic instrument to extract DNA or protein or something of that sort, and sending that off to an even bigger, more expensive, and more wholly automated instrument that will generate far more data than a person can actually look at.

Then this mass of results is fed into some ready-made, often online computer software, often on a large network of computers somewhere 'out there', which sends back ready-made results files.  DNA sequencing and genotyping provide one example.  Each blood sample from a single person, or leaf from a single plant, is processed to extract DNA, the DNA is sent to a microchip-based sequencer or genotyper, and the millions of results are processed by a computer to sieve out unreliable 'calls'.  Then this file is sent to a genome browser or other high-throughput device that summarizes the results.  How many variable sites are there between the two copies of the genome in this person, or among the persons whose DNA was submitted?  How many of those are in the Human Genome or Mouse databases?  What is each variant's frequency?  The variants are then statistically analyzed, again by sophisticated if 'canned' statistical packages, to find those that are associated with others in the sample, and those that are associated with some trait, like a disease or a crop plant's yield.
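To make the sieving step concrete, here is a minimal sketch of the kind of filter-and-summarize pass described above -- dropping unreliable genotype 'calls' and computing variant frequencies per site. The site names, genotypes, and quality cutoff are all made up for illustration; real pipelines use far more elaborate filters.

```python
# Hypothetical genotype calls: for each site, one (genotype, quality) pair
# per person. Genotype = number of copies of the variant allele (0, 1, or 2).
calls = {
    "rs0001": [(0, 45), (1, 50), (2, 12)],   # third call is low quality
    "rs0002": [(1, 60), (1, 55), (0, 58)],
    "rs0003": [(0, 40), (0, 48), (0, 52)],   # not variable at all
}

MIN_QUALITY = 30  # arbitrary cutoff for this sketch; real pipelines tune this

def allele_frequency(site_calls, min_q=MIN_QUALITY):
    """Variant allele frequency among calls that pass the quality filter."""
    kept = [g for g, q in site_calls if q >= min_q]
    if not kept:
        return None  # no reliable calls at this site
    return sum(kept) / (2 * len(kept))  # diploid: 2 alleles per person

freqs = {site: allele_frequency(c) for site, c in calls.items()}
variable_sites = [s for s, f in freqs.items() if f is not None and 0 < f < 1]

print(freqs)           # per-site variant frequencies after filtering
print(variable_sites)  # sites still variable after filtering
```

The point of the sketch is that every choice in it -- the quality cutoff, what to do with sites that lose all their calls -- is a judgment call that, in practice, is usually made by whoever wrote the pipeline, not by the scientist running it.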

The blurry-eyed scientist at the end of this process stares at masses of results for hours.  Then s/he writes up something describing the findings for a journal, leaving gobs of data out of the paper but submitting them along with it as Supplemental material that, when the paper is published, will be available online on the journal's web page.

But this is not all.  Now the journal's web site runs a centralized reviewing program that helps the editors assign reviewers, process the manuscript, receive the reviews, and notify everybody of the status of the paper, the reviews, and the revisions.  Not only that, the author gets automated messages about the paper's status, and even (of course) instructions on how to pay the publication cost.

Open access online journals do have their upside.  They are open access, for one thing, so that, in principle, results are accessible to far more people than with fee-for-access and/or print-only journals.  And publication is generally faster than in print -- so all our cutting-edge results are out there as quickly as possible.

But when the author pays for publication, it's hard to see that the quality of the journal isn't compromised by the conflict of interest this creates for the publisher.  It's a lovely you-scratch-my-back-I'll-scratch-yours system, serving the insatiable need for investigators to publish, and the need of the publisher to turn a profit.  

In some ways this system does not make the science--DNA sequencing and analysis--bad, because something so intricate may be better handled by software and hardware developed by narrow but highly technical professionals.  But ordinary scientists--most of us--may be much more likely to blunder into mistakes of all kinds that readers, who haven't the time, technology, or talent, would find difficult to detect.

The data may become part of public databases (a very positive trend in the life sciences) but may contain layers of errors, undetected by the scientist.  Whether or not the interpretation is correct, the centralized technology probably does constrain errors up to a point -- but not completely.

But investigators are in a sense losing the primary skills that could be important in better designing and carrying out studies.  Like students who no longer know how to do multiplication, logarithms, and the like by hand, most biologists know how to manage equipment but not how to do the primary work, such as writing computer programs specifically tailored to the problems they are working on.  (Many lab methods have become so automated, or off-the-shelf, that scientists may not even know how to do simple things like pour gels or make buffers anymore, either.)  This can force analysis into what some large lab's programmers decided was the right way to do it. Assumptions differ, program algorithms differ, and options differ.  Local judgment is being suspended to a considerable extent.  If you don't have to think about the problem except to decide whether to use Structure or Admixmap, Haploview or SeattleSNPs, or which analytic package to use, each with its own often very complex, subtle, or even unstated operating assumptions, then you are at least one step removed from a deeper understanding.

Hi-tech science is so high-tech that it is, to a considerable extent, unrealistic to expect it to be anything other than the kind of operation it is today.  Where the balance of knowledge, skills, and centralization should lie is a matter of judgment. But the haste, the numbers of 'errata' regularly published in journals, and the overwhelming amounts of data being produced through a proliferating array of journals, raise important questions.


Anne Buchanan said...

A piece in The Guardian yesterday makes much this same point, and offers a solution -- papers should be self-published, allowing for anonymous comments, researchers should publish no more than 2 papers a year, and have only 1 research grant. http://www.guardian.co.uk/science/2011/sep/05/publish-perish-peer-review-science

EllenQ said...

I recently took a tour of a high-end boutique guitar factory where they had extremely fancy computerized lathes that precisely cut out the frames of the guitars so that all the joints were perfect and the stress of the strings was evenly distributed. These computer-cut pieces then all went to a single individual who tapped on every sounding board and used a razor to shave the back until it produced the sound he wanted. Their philosophy, and mine, is to let the technology do what it is best at and let the people do what they are best at.

I think the automation of data production is a fine thing (I also appreciate not having to calculate natural logarithms by hand), and I don't think being able to Sanger sequence in your own lab is necessarily a virtue. But certainly, the interpretation of the data is a complex question that must be handled with the appropriate tools.

Ken Weiss said...

For better or worse, that kind of specialization is accelerating. Of course we aren't anywhere near the situation in, say, Galileo's or Boyle's time, when they made their own instruments as well as used them.

Whether we are now hyperspecialized or too far removed from what matters is open to debate, and whether we can even know what would have been better (if anything) may not be clear even in retrospect.

But the phenomenon is happening, and there's no doubt about that.

Anne Buchanan said...

Many fine guitars have been made entirely by hand, but a whole genome can't be done by Sanger sequencing in a single lifetime. Automation is required for today's genetics. The problem is that there isn't always the equivalent of the individual tapping every sounding board and shaving the wood at the other end. There are lots of errors in data generated automatically, but people don't always check or think to interpret results with that in mind.