Friday, March 16, 2012

The shelf life of a banana

We're in New York, where I'm working with collaborators on a project developing software to simulate complex genetic systems, so that, knowing the 'truth', we can investigate some of the questions being pursued these days in both epidemiological and evolutionary genetics.  Simulations don't generate actual reality, but when they generate something very similar to it, one can hopefully make inferences about the truth, which in this arena is in some ways unknowable.  The idea of simulation is to improve our ability to guess the truth from data, when things are as complex as they clearly are here.

For some years our collaborators (Joe Terwilliger and Joe Lee at Columbia University) have been saying that much of the discussion of problems is needlessly about limitations in the available genetic data.  GWAS and other similar genomic approaches to the genetic causes of traits like disease, and to the results of evolution, have used increasingly extensive kinds of data.  For example, more sites of the genome are used to try to infer causation in parts of the genome near those sites (we can refer to these as 'mapping markers').

Genetics has been going from one fad to another.  We first had mapping with limited sets of markers, which suggested regions of the genome that could be causally involved in some question (like adaptation to diet, or the risk of diabetes).  But the implicated regions of the genome were large, with many functional elements in each, so we couldn't easily identify the actual causal site or sites.

Then someone discovers that the genome is used for more than an intermediate code for protein (that's messenger RNA, mRNA): some RNA has direct function of its own; one class, called microRNA (miRNA), affects the translation of mRNA into protein.  Someone else discovers that gene regulatory sites in the genome are important in controlling the expression of protein-coding regions.  Somebody else probes interactions among genes, claiming that these 'networks' are higher-order functional units.  Then it's found that chunks of DNA are duplicated or lost in some people but not others (called 'copy number variation', or CNV). Others explore the chemical modification of DNA in cells, which affects which genes are expressed; this is called 'epigenetics'.

All of these things become 'omicized'--in the scramble for money, attention, and yes, even to actually do some real science, analytic platforms are developed for detecting these various elements in the genome: special genomewide tests for epigenetic sites, or non-coding RNA, or rapid sequencing methods for the protein-coding parts of the genome (called 'exome sequencing').

Many people recognize that these are temporary stop-gaps.  Even so, those who think that CNV will be the killer discovery that explains a huge fraction of the cause of diabetes or autism, or that miRNA is the key to regulation, argue about and develop special molecular and statistical tools for detecting it.

Even those specializing in one or another of these fads, or subsets of genetically related causation, know that the tools are rather temporary.  The individual applications are discovering complex causal elements, but everyone knows that in total they still promise to account for only a fraction of the causation of the traits of interest.  They are holding actions, and this is openly acknowledged.  They keep the funds and research moving, and feed the technology companies, so they can develop the next level of genomic methods.  We know this, but we are institutionalized, so we can't wait for better tools.  We must keep the factories moving with these methods.

But they have the shelf-life of a banana.

It's easy, and perhaps correct, to be cynical about the great hype machinery that keeps the system in high gear.  It's costly, but we need to be paid, the tech companies need to sell something today while they develop a tool for tomorrow, and we need to keep the graduate-student pipeline flowing.

For years in our various talks and mini-courses and papers, Joe Terwilliger and I have been saying that rather than spending too much time arguing about the best way to use and analyze these kinds of bananas, we should just assume that whole-genome sequences will be available for everybody in the population. Given the lock that the science establishment has on funds, and the way that technology seriously increases the amount of data we can generate and analyze, driving costs down in turn, it seems likely that, barring international catastrophe, ubiquitous sequence data will be available.

This is now viewed by some as a kind of inevitable end point: finally we'll have all the data we need from a genomic point of view, and we can then really, truly, identify all the genetic causation that is involved in diabetes, cancer, how you vote, or whether you respond negatively to being sexually abused.

There are a couple of problems with this view.  First, the system will need to continue to produce and sell new gear, so clearly new things will be discovered that need documentation.  That itself means that even whole genome sequencing is likely to have the shelf life of a banana.  The kinds of things we measure will be shown to be incomplete, and in that sense 'out of date'.  The way we measure the trait--like diabetes or cancer--will be elaborated in this way, so that prior measures will be denigrated as primitive.  We'll have to do the same megastudies over again.

But no matter how comprehensive these tools and data will become--and some have already gone beyond genes, even whole genomes, to enumeration of all cellular processes and so on, as though in recognition that genomes really aren't the answer, but that the research machinery must march on--they will not in themselves solve the main issue that we face:  causation is often clearly complex, changeable, statistically elusive, and not really reducible to an enumerable set of causes.  In addition to making sure that each step is only a partial step--rarely if ever reaching the point where something is actually 'solved' (because that would put us out of business)--we have not really come to grips with the fluid and complex nature of causation of the traits we're interested in.

The more contributing 'causes' there are for a trait, and the weaker each is on its own, the more unstable their actual effects will be, the harder they will be to estimate accurately, and the less useful such estimates will be as predictors.  In a sense, the trait may be the same, but its causes can always be substantially different.  The problem is dealing with the trait, rather than attempting to enumerate its ephemeral causes.  How to do that is a question for the future, if we are really to come to grips with it.  That, we think, is where real innovation, rather than the kinds of technical improvements that are steadily being made, will have to come.
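The instability described above can be seen in a toy simulation of the kind mentioned at the start of this post. In this hypothetical sketch (all numbers are made up for illustration, not drawn from any real study), a trait is built from many variants, each with a weak effect, plus noise; two independent samples then each try to estimate those same effects, GWAS-style, one marker at a time:

```python
import numpy as np

N_CAUSES = 200        # many contributing variants, each weak on its own
N_PEOPLE = 1000       # sample size per study
rng = np.random.default_rng(0)
true_effects = rng.normal(0.0, 0.05, N_CAUSES)   # small true per-variant effects

def estimate_effects(seed):
    """Simulate one study and estimate each variant's effect marginally."""
    study = np.random.default_rng(seed)
    # genotypes: 0/1/2 copies of an allele with frequency 0.3
    genotypes = study.binomial(2, 0.3, (N_PEOPLE, N_CAUSES)).astype(float)
    noise = study.normal(0.0, 1.0, N_PEOPLE)
    trait = genotypes @ true_effects + noise
    # single-marker regression slope for each variant (GWAS-style)
    g = genotypes - genotypes.mean(axis=0)
    t = trait - trait.mean()
    return (g.T @ t) / (g ** 2).sum(axis=0)

est_study1 = estimate_effects(1)
est_study2 = estimate_effects(2)

# Two studies of the SAME underlying effects agree only partially:
r = np.corrcoef(est_study1, est_study2)[0, 1]
print(f"between-study correlation of estimated effects: {r:.2f}")
```

With effects this weak, the two studies' estimates correlate only moderately even though the true effects are identical in both: each individual estimate is dominated by sampling noise, which is one way of seeing why enumerating many weak causes makes for poor prediction.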

Whether in a population that is already 7 billion strong, it will be good to continue to nibble away at the causes of traits that affect us as we age, or whether we'll just be creating countless new problems due to stress on resources and so on, is essentially a philosophical question.
