Friday, February 27, 2015

The story of a rare disease

By Ellen Weiss

Despite being the product of  two of the authors of this blog – two people skeptical about just how many of the fruits of genetic testing that we've been promised will ever actually materialize  – I have been involved in several genetic studies over the years, hoping to identify the cause of my rare disease.

February 28 is Rare Disease Day (well, Feb 29 technically; the last day of February which is, every four years, a rare day itself!); the day on which those who have, or who advocate for those who have, a rare disease publicly discuss what it is like to live with an unusual illness, raise awareness about our particular set of challenges, and talk about solutions for them.

I have hypokalemic periodic paralysis, which is a neuromuscular disease; a channelopathy that manifests itself as episodes of low blood potassium in response to known triggers (such as sodium, carbohydrates, heat, and illness) that force potassium from the blood into muscle cells, where it remains trapped due to faulty ion channels.  These hypokalemic episodes cause muscle weakness (ranging from mild to total muscular paralysis), heart arrhythmias, difficulty breathing or swallowing and nausea.  The symptoms may last only briefly or muscle weakness may last for weeks, or months, or, in some cases, become permanent.

I first became ill, as is typical of HKPP, at puberty.  It was around Christmas of my seventh grade year, and I remember thinking to myself that it would be the last Christmas that I would ever see.  That thought, and the physical feelings that induced it, were unbelievably terrifying for a child.  I had no idea what was happening; only that it was hard to breathe, hard to eat, hard to walk far, and that my heart skipped and flopped all throughout the day.  All I knew was that it felt like something terrible was wrong.

Throughout my high school years I continued to suffer. I had numerous episodes of heart arrhythmia that lasted for many hours, that I now know should've been treated in the emergency department, and that made me feel as if I was going to die soon; it is unsettling for the usually steady, reliable metronome of the heart to suddenly beat chaotically. But because I was bound within the privacy teenagers are known for, and reluctant to talk about what was happening in my body, my parents struggled to make sense of my new phobic avoidance of exercise and other activities.

HKPP is a genetic disease and causal variants have been found in three different ion channel genes.  Although my DNA has been tested, the cause of my particular variant of the disease has not yet been found.  I want my mutation to be identified.  Knowing it would likely not improve my treatment or daily life in any applicable way.  I'm not sure it would even quell any real curiosity on my part, since, despite having the parents I have, it probably wouldn't mean all that much to this non-scientist.  

But I want to know, because genetics has become the gold standard of diagnostics.  Whether it should be or not, a genetic diagnosis is considered to be the hard-wired, undeniable truth.  I want that proof in my hand to give to physicians for the rest of my life.  And of course, I would also like to contribute to the body of knowledge about HKPP in the hopes that future generations of us will not have to struggle with the unknown for so many years.

For many people, having a rare disease means having lived through years of confusion, terrible illness, misdiagnoses, and the pressure to try to convince skeptical or detached physicians to engage in investigating their suffering.

I was sick for all of my adolescent and young adult years; so sick that I neared the edge of what was bearable.  The years of undiagnosed, untreated chaos in my body created irrevocable changes in how I viewed myself and my life.  It changed my psychology, induced serious anxiety and phobias, and was the backdrop to every single detail of every day of my life.  And yet, it wasn't until I was 24 years old that I got my first clinical clues of what was wrong.  An emergency room visit for arrhythmia revealed very low blood potassium.  Still, for 4 more years I remained undiagnosed, and there was horrible suffering during which my loved ones had to take care of me like a near-infant, accompanying me to the hospital, watching me vomit, struggle to eat or walk to the bathroom, and waking up at 3am to take care of me.  For 4 more years I begged my primary physician and countless ER doctors during desperate visits to investigate what was going wrong, asked them to believe that anxiety was a symptom not a cause, and scoured medical information myself, until I was diagnosed.  It wasn't until I was 28 that I found a doctor who listened to me when I told him what I thought I had, made sense of my symptoms, recognized the beast within me, and began to treat me.

My existence, while still stained to a degree every day by my illness, has improved so immeasurably since being treated properly that the idea of returning to the uncontrolled, nearly unbearable sickness I once lived with frightens me very much.  I fear having to convince physicians of what I know of my body again.

What I went through isn't all that uncommon among the millions of us with a rare disease.  Lengthy periods of misdiagnoses, lack of diagnoses, begging well-meaning but stumped, disbelieving, or truly apathetic physicians to listen to us are common themes.  These lost years lay waste to plans, make decisions for us about parenthood, careers, and even whether we can brush our own teeth.  They induce mistrust, anxiety, exhaustion.

Each rare disease is, of course, by definition rare.  But having a rare disease isn't. Something like 10% of us have one.  It shouldn't be a frightening, frustrating, lengthy ordeal to find a physician willing to consider that what a patient is suffering from may be outside of the ordinary, since that isn't all that unlikely.  Mathematically, it only makes sense for doctors to keep their eye out for the unusual.

I hope that one day the messages we spread on Rare Disease Day will have swept through our public consciousness enough that they will penetrate the medical establishment.  Until then, I will continue to crave the irrefutable proof of my disorder.  I will continue to worry about someday lying in a hospital bed, weak and verging on intolerably sick, trying to convince a doctor that I know what my body needs, a fear I am certain many of my fellow medically-extraordinary peers share.

And that is why I, this child of skeptics, seek answers, hope and proof through genetics.

Thursday, February 26, 2015

Digesting yeast's message

A new paper in Nature by Levy et al. reports on the genomic consequences of large-scale selection experiments in yeast.  Yeast reproduce asexually and clones can be labeled with DNA 'barcode' tags and followed in terms of their relative frequency in a colony over time.  This study was able to deal with very large numbers of yeast cells and because they used barcodes the investigators could practicably follow individual clones without needing to do large-scale genome sequencing.  Prior to this, this sort of experiment was prohibitively costly and laborious.  So the authors add to findings in selection experiments using bacteria or flies and so on, where mostly aggregate responses could be identified.

In this case, nutrient stress was imposed, and as beneficial mutations occurred and gave their descendant cells (identified by their barcode) an advantage, the dynamics of adaptation could be followed.  The authors showed, in essence, that at the beginning the fitness of the overall colony increased as some clones, bearing advantageous mutations, rose rapidly in relative frequency.  Then, the overall colony fitness stabilized and subsequent advantageous mutations were largely kept at low frequency (most eventually went extinct).  But overall, the authors found thousands of colonies with different advantageous variants; most fitness effects were of only a small (or, for the majority, very small) percent.  Once a set of large numbers of 'fit' variants had become established, new ones had a difficult time making any difference, and hence staying around very long.

This study will be of value to those interested in evolutionary dynamics, though I think the interpretation may be rather more limited than it should be, for reasons I'll suggest below.  But I would like to comment on the implications beyond this study itself.

Who cares about yeast (except bakers, brewers, and a few labs)?  You should!
This is interesting (or not) you might say, depending on whether you're running a yeast lab, or in the microbrew or bakery business. But there are important lessons for other areas of science, especially genomics and the promises being made these days.  Of course, the lesson isn't a pleasant one (which, you might correctly assume, is why we're writing about it!).

This study has important implications for basic evolutionary theory perhaps, but also for much that is going on these days in human biomedical (and also evolutionary) genetics, where causal connections between genomic genotypes and phenotypes are the interest.  In evolution, selection only works on what is inherited, mainly genotypes, but if causation is too complex, the individual genotype components have little net causal effect and as a result are hardly 'seen' by selection, and evolve largely by chance.  That's important because it's very different from Darwin's notions and the widespread idea that evolution is causally rather simple or even deterministic at the gene level.

Put another way, genomic causation evolved via the evolutionary process.  If natural selection didn't or couldn't refine causation to a few strong-effect genes, that is, to make it highly deterministic at the individual gene level, then biomedical prediction from genome sequences won't work very effectively.  This is especially true for traits, disease or otherwise, that are heavily affected by the environment (as most are) or for late-onset traits that were hardly present in the past or arose post-reproductively and hence didn't affect reproductive fitness and are not really 'specified' by genes.

There was considerable genomic variation between the authors' two replicate yeast experiments.  As one might say, meta-analysis would have some troubles here.  Likewise, from cell lineage to cell lineage, different sets of mutations were responsible for the fitness of the lineage in this controlled, fixed environment. This means that even in this very simplified set-up, genomic causation was very complex.  No 'precise' yeastomic prognostication!

In real biological history, even for yeast and much more so for sexually reproducing species in variable environments, selection has never been unitary or fixed, and genomes have been much more complex. Human populations were, until very recently, very much smaller than the populations of 10^8 in the yeast experiments, and recent population expansion will make the number of low-frequency variants much greater and, with recombination, make individuals vastly more genomically unique.

The bottom line here is that our traits should be much less predictable from genotypes than traits in yeast. We have not reached, nor did our ancestors ever reach, the kind of fitness equilibrium reached in the yeast study under controlled selection, and fixed environments.

Somatic mutation
The authors also compare the large numbers of cells whose evolution they were able to follow with their barcode-tagging method, to the evolution of genetic variation in cancer and microbial infections, where there are even larger numbers of cells in an affected person and, importantly, clones expanding because of advantageous mutations. From the yeast results, these clonal advantages may not generally be due to one or two specific mutations (with perhaps, hopefully, exceptions when chemotherapy or antibiotics exert far stronger selection than was imposed in the yeast experiment). But the general complexity of such clonal expansions presents major challenges, because they may end up with descendant branches distributed throughout the body where even in principle the responsible variation can't be directly assessed.

But the implications go far beyond cancer.  As we've recently posted, cancer is a clear but perhaps only a single manifestation of a more general phenotypic relevance of the accumulation of somatic mutations, that occur in body cells during life and can in aggregate have systemic or organismal-level implications.  The older we get the more likely we are to generate such clones, all over the body, and it seems likely that they can become manifest not just as individually ill-behaving cells, but as disease for the whole person.

But it's not just late onset implications that the yeast work may forebode.  There are already huge numbers of cells in the early embryo and fetus whose even huger descendant clades of cells during life grow many, many fold by adulthood.  There is no reason not to expect that each of us will carry clades that include differently-than-normal functioning cells in our tissues.  Let age, environmental exposure, and further mutations add to this and disease or age-related degeneration can result.  Yet none of this can be detected in the usual individual's 'genome' as currently viewed.  This is a potentially important fact that, for practical reasons or what one might call reasons of convenience, is ignored in the wealth of mega-sequencing projects being lobbied for based on genome sequencing (precision prediction being the most egregious claim).

So a bit of brewer's yeast may be telling us a lot--including a lot that we don't want to hear. Inconvenient facts can be dismissed.  Oh, well, that's just yeast!  They evolve differently!  That was just a lab experiment!  Brewers and bakers won't even care!

So let's just ignore it, as if it only applies to those rarefied yeast biologists.  Eat, drink, and be merry!

Wednesday, February 25, 2015

Survival of the safest: Darwinian conservatism, not derring-do

A mantra for many in life science is 'survival of the fittest'.  This phrase, one Darwin liked and used many times after he saw its use by Herbert Spencer, reflects Darwin's view of life as a relentlessly competitive phenomenon.  To Darwin, life was an unending struggle for survival (and reproduction) among individuals in every species all the time.  Natural selection, a relentless force like Newtonian gravity, always identified the 'fittest', weeding out the others.

Darwin's objective was to show how new characters could arise, that were suited--'fitted'--to their environment, without the intervention of God via special creation events.  Because organisms, all and always, were struggling against each other for limited resources, they 'tried' (via their inherited genomic drivers) to be better, different, more exploitive of environmental opportunities than their fellows.  Dare to be different!

But in perhaps fundamental ways, Darwin had it very wrong, perhaps inverted from what is really going on.  We know this from the analysis of genomes, the presumed source of all evolutionary evidence, since everything that's inherited goes back, at least indirectly and usually directly, to information carried in DNA.

When DNA sequences are compared within or between species, there are segments that are seen to have very little variation among the sequences, and segments with much more variation.  Now we have learned how to identify truly functional parts like coding exons, transcription start sites, introns, promoter and some regulatory regions, functional RNAs (like tRNA, rRNA and so on), telomeres, and so on.  And we have also identified many parts (the majority, actually) that have far less obvious or strong function, if indeed any function at all.  So what do we see?

The clear, consistent pattern is that the more strongly functional, the more highly conserved (there are a few exceptions, like sensory system genes in the olfactory and immune system, but even their variation proves the rule).  The less, or non-functional regions vary much more, both within and between species.
Herd of dairy goats, Polymeadows Farm; photo A Buchanan

This has been seen so consistently, that for many purposes (like the ENCODE project to characterize all DNA elements) sequence conservation is the very definition of biological function.  The reason is that evolution conserves function but doesn't care about bits that have no function.  One can quibble about the details, but the main gist of the message seems unequivocally correct.  But this now near-dogmatic principle has some little-digested implications.

Darwinian evolution
The problem Darwin wanted to solve was to explain the differences among species, and the way they were suited to their ways of life, in terms of historical processes rather than Divine creation.  He had a deep sense of geological change and biogeography from his trip on the Beagle, that showed the evidence of local relationships that suggested common ancestry.  And then he had an idea of a law of Nature, an ineluctable force-like process of adaptive change, the way gravity is a force, that would gradually form the kinds of differences that characterized species.

'Natural selection' was the name he gave to that force.  And because it was a force, like the way gravity is a force, it could detect the tiniest differences among competing organisms and favor them to produce the next generation.

The idea is that species always over-reproduce relative to their resources (an idea that was already 'in the air' in Britain at the time), and struggle to obtain what become limited resources.  Because of inherited variation, the individuals with the best genotype (to use our term for it) reproduced, while their poor lesser peers fell to what Tennyson would call 'Nature, red in tooth and claw'.

Darwin's idea was that selection always favored the innovator.  Relentless striving to be different from the herd, to get the scarce food or mate supply.  As the late thinker Leigh Van Valen suggested, the Darwinian struggle was like the Red Queen in Through the Looking-Glass--always running as fast as she could, but never getting ahead because the competition was always trying to out-do her with its own adaptations.

This is so entrenched in the biological and evolutionary literature that it may be surprising to realize how different it is from what we see in the actual data--the genetic data, our most precise indicator--about how evolution works.

Or is this Darwinian?
What we actually see in the genetic data is not the kind of chaotic variation that an intense, force-like, relentless struggle to be better than your peers would lead us to expect.  Instead, what we see is what can only be called herd behavior at the genome level.  Our ancestors did, and our contemporaries do, their very best to stay with the herd.  The high conservation of functional DNA sequence suggests that mutational variation is mainly harmful to fitness, that what selection really favors is conformism. Don't be very different, or you'll get pruned away from your species' posterity!

Herd of flamingos, the Camargue, France; A Buchanan
Instead of 'survival of the fittest', what is by far mainly going on is 'survival of the safest'.  Stay with the mean.  Why is that?  A standard answer that is likely accurate, is that we today are the product of a long past in which the traits we bear were able to survive.  We're very complex organisms, so what evolution hath joined, let no one put asunder except at their peril!

Survival of the safest might suggest that there is no innovation, which is clearly not correct, since different species have different adaptations--fish swim, cats eat meat, bats fly, we write blog posts. So clearly differences do arise and have been favored regularly in the past.  However, that seems inconsistent with the high level of genomic conservatism that is so predictably identified.

One way, perhaps the major way, that these apparent contradictions are reconciled is this:  As Darwin stressed repeatedly, evolutionary change is very, creepingly slow.  That means that either there are occasional short bursts of rapid, major change, brought about by largely catastrophic changes in circumstances, or, only a very minor 'ooze' of the distribution of traits occurs in some favored direction from one generation to the next.  This is imperceptibly slow at any given time, because being near the mean is still the safest place to be.  Just be a tiny bit different.

Survival of the safest is Darwinian in that it is a form of natural selection.  Indeed, it is a lot more Darwinian than Darwin was himself.  Selection is more probabilistic and less force-like than he thought (he lived still in Newton's shadow), but it is always at work as he said it was.  It's just that it's mainly at work removing rather than favoring what is different.  Every geneticist knows this, but it is far from thoroughly integrated into the common view of evolution even by professionals.

If a trait were being strongly driven by selection in some new direction all the time, as in the more exclusive connotation of survival of the fittest, we might expect only a few variants with strong effect in the favored direction would be contributing to its newly adaptive instances.  Mapping the variation in the trait would perhaps yield a rather simple genetic causal picture as a result.

But if a trait is being roughly maintained, by survival of the safest, pruning away serious deviants, then any genotypes that are consistent with being somewhere near the average can stay around, with individual variants coming and going by chance (genetic drift).  Variants conferring trait values too far from the mean are pruned by selection.  But most variants can hang around.  Mapping would reveal very large numbers of contributing variants across the genome.  And there would not be precise predictability from genotype to phenotype.

This is what we see in biology.

Survival of the safest in daily life, too
'Survival of the safest' thus seems to be a better metaphor for adaptive biological evolution.  But if you think about it you'll see that we see much of the same regularly in most aspects of our society. We may say heady things like 'dare to be different!', but those who dare to be very different are quickly punished by being ignored or directly slapped down.  This is true through history. It's the general fact in religion, government, social behavior.  And it's true in science, too.  Business as usual is safe, real innovation is a threat to the established.  We see the press of society to claim to be different, but not really to be different.  Innovation mostly means incremental change trumpeted with exaggerated verbiage.  We may even think we want major, rapid change, but emotionally we shy from it, and feel too nervous about what it might mean for our own current state.

Organizations and sociocultural and political systems are very slow to change, or they change in herd-like fashion.  This is true even in realms, like the business world, where one often hears a rather self-satisfied pronouncement that allowing free-market Darwinian competition is the way to get innovation. Innovation is often claimed, but much less often really major.  It's true in science, too: everyone is playing grantsmanship, to seem different to draw attention or funds, in this case, but you dare to be very different at your peril, lest reviewers suspect, distrust, fail to grasp, or envy your idea.  Some rapid change may occur, but it's not so common relative to the inertia of survival of the safest.

That is what we see in society.

Tuesday, February 24, 2015

Causation revisited again

A paper* published recently in The Medical Journal of the Islamic Republic of Iran ("X-ray radiation and the risk of multiple sclerosis: Do the site and dose of exposure matter?" Motamed et al.) explores the possibility that X-rays are a risk factor for multiple sclerosis (MS).  (Do we routinely read this journal?  No.  Ken sent me a pdf of the paper, and when I asked him where he'd gotten it, he said he thought I'd sent it to him.  Which I had not.  On looking back at the email, he finds that it contained no actual message, just the pdf, and not even an identifiable sender.  Creepy spam? I guess we'll find out.  But until our computers are taken over by bots, despite its iffy provenience, the paper does bring up some interesting questions.)

From the paper abstract:
Methods: This case-control study was conducted on 150 individuals including 65 MS patients and 85 age- and sex-matched healthy controls enrolled using non-probability convenient sampling. Any history of previous Xray radiation consisted of job-related X-ray exposure, radiotherapy, radiographic evaluations including chest Xray, lumbosacral X-ray, skull X-ray, paranasal sinuses (PNS) X-ray, gastrointestinal (GI) series, foot X-ray and brain CT scanning were recorded and compared between two groups. Statistical analysis was performed using independent t test, Chi square and receiver operating characteristics (ROC) curve methods through SPSS software. 
Results: History of both diagnostic [OR=3.06 (95% CI: 1.32-7.06)] and therapeutic [OR=7.54 (95% CI: 1.59-35.76) X-ray radiations were significantly higher among MS group. Mean number of skull X-rays [0.4 (SD=0.6) vs. 0.1 (SD=0.3), p=0.004] and brain CT scanning [0.9 (SD=0.8) vs. 0.5 (SD=0.7), p=0.005] was higher in MS group as well as mean of the cumulative X-ray radiation dosage [1.84 (SD=1.70) mSv vs. 1.11 (SD=1.54) mSv; p=0.008].
So, it was a very small study, but the odds ratios were quite significant, particularly for therapeutic X-ray, for which dosage is likely to be higher than for diagnostic X-ray.

Chest x-ray; Wikipedia

And this isn't the only study, in fact, that has found an association between X-ray and MS.  Axelson et al. find a similar link in Sweden, described here, also in a very small study.  But this about exhausts the reports of such a link.  The problem is that the risk of MS is small (the National Multiple Sclerosis Society estimates that there are 400,000 people in the US with MS, or about 1/1000) relative to the number of people getting X-ray, therapeutic or diagnostic.  This means that even if X-ray is causal, and the odds ratios (relative risk) in these small studies are fairly large, the actual (absolute) risk is minuscule.
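The gap between relative and absolute risk can be made concrete with a little arithmetic. The sketch below is only an illustration under stated assumptions: it treats the roughly 1/1000 prevalence quoted above as a stand-in for the baseline risk among the unexposed (prevalence and lifetime risk are not the same thing), and it plugs in the diagnostic X-ray odds ratio of 3.06 from the abstract. The function name is ours, not the paper's.

```python
def risk_from_odds_ratio(baseline_risk: float, odds_ratio: float) -> float:
    """Convert a baseline risk and an odds ratio into the risk among the exposed."""
    # Odds = p / (1 - p); the odds ratio multiplies the baseline odds.
    baseline_odds = baseline_risk / (1.0 - baseline_risk)
    exposed_odds = baseline_odds * odds_ratio
    # Convert the exposed odds back into a probability.
    return exposed_odds / (1.0 + exposed_odds)


baseline = 1 / 1000   # assumption: ~1/1000 prevalence used as baseline risk
odds_ratio = 3.06     # diagnostic X-ray odds ratio quoted in the abstract

exposed = risk_from_odds_ratio(baseline, odds_ratio)
print(f"baseline risk:     {baseline:.3%}")
print(f"risk if exposed:   {exposed:.3%}")
print(f"absolute increase: {exposed - baseline:.3%}")
```

Under these assumptions the exposed risk comes out at roughly 0.3%, so even a tripling of the odds leaves well over 99% of exposed people unaffected.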

The cause of MS is unknown.  Many hypotheses have been considered -- it may be immune-related, or viral, or genetic, or perhaps environmental, and some believe lack of vitamin D is a prime candidate.  But as with other complex diseases, with the kind of varying and complex phenotype that is seen with MS, it's possible that there are numerous causes, and/or numerous triggers, rather than a single one.  So, if X-ray really is causal, perhaps it causes some tissue irritation that stimulates immune response, triggering some over-response that contributes to MS risk.  Thus, it's conceivably possible that X-ray is in fact contributory in some cases.  But one can imagine many such explanations.

But this again raises a larger question, one we've been blogging about off and on, well, forever, but recently, including last week, and again yesterday with respect to the new dietary recommendations, that no longer include cautions against eating foods high in cholesterol.  Why is it so hard to determine the cause of so many diseases?  Why don't we yet know the cause of MS, or heart disease, or obesity, or many other common diseases?  Essentially, it comes down to the fact that our methods for determining causation just aren't good enough when every case is different.

In the near future, we'll write about this issue in the context of how epidemiology is done these days.

*Here's the link, if you, too, want to chance it.

Monday, February 23, 2015

When the methodology fails

An op-ed piece by Nina Teicholz in Friday's NYTimes lays it on the line, chastising the government for its regular bulletins on dietary advice that, for 50 or so years, have altered what we eat, what we fear to eat, and what the risks are.  Now, new studies tell us that what was bad is good and what was good is bad, and that the prior half-century of studies were wrong.  We've eliminated fats and cholesterol, and replaced them with carbohydrates, but, as Teicholz writes,
...recent science has increasingly shown that a high-carb diet rich in sugar and refined grains increases the risk of obesity, diabetes and heart disease — much more so than a diet high in fat and cholesterol.
But why should we believe these new studies?  Teicholz basically takes the underlying methodology to task, and yet she has written a book recommending that we eat more fats ("The Big Fat Surprise: Why Butter, Meat and Cheese Belong in a Healthy Diet").  Those recommendations are based on the very same faulty methodology as the recommendations with which she, and the current USDA advisory committee, find fault.

Embrace the fat! (Wikipedia)

The same, almost exactly the same, critiques are earned by many of the 'big data' genomics studies (and other long-term go-not-very-far megaprojects).  It is the statistical correlation methodology.  When many factors are studied at once (perhaps properly since many factors, genetic and environmental, are responsible for health or other traits), we can't expect simple answers.  We can't expect correlation to imply causation.  We can't expect replication.  We can't predict the risk factors that people, for whom risk advice is based on such studies, will face in the future.

The real conclusion is to shut down the nutrition megaprojects at Harvard (singled out by the op-ed) and the other genetics and public health departments that have been running them for decades, and do something different.  The megaprojects have become part of the entrenched System, with little or no real accountability.

Pulling the plug would be a major acknowledgment of failure by the feds for what they funded, the program officers for defending weak portfolios and their budgets, the universities defending their overhead and prestige projects and, of course, the investigators who are either simply unable to recognize what they're doing, or too dishonest and self-protecting to come clean about it.  And then they and their students could go on to do something actually productive.

Of course such a multi-million dollar threat will be resisted, and that's why the usual answer to the kinds of conflicting, confusing reports that so often come out of these megaprojects is to increase their size, length and, geez, what a surprise!, their cost.  To keep funding the same investigators and their protégés.  This is only to be expected, and many people's jobs are covered by the relevant grants, a genuine concern.  However, research projects are not supposed to be part of a welfare system, but to solve real problems.  And the same people's skills could be put to better use, addressing real problems in ways that might be more effective and accountable.

And we used to laugh at the Soviets' entrenched, never successful, Five Year Plans!

It is a public misappropriation that is taking place.  Yes, there are health problems we wish to avoid, and government and universities are set up to identify them and recommend changes.  But, for most of today's common chronic diseases, lifestyle changes would largely do the trick.

But then, that would just let people live longer to get diseases that might be worse, even if at older ages.  And meanwhile we aren't putting on a full court press for things that really are genetic, or really do have identifiable life-style causes.

Much of this research is being done at taxpayer expense.  We should let the people keep their money, or we should spend it more effectively.  We won't be able to do the latter until we admit, formally and fully, that we have a problem.  Given vested and entrenched interests, getting that to happen is a very hard trick to pull off.

Monday, February 16, 2015

Occasionality (vs Probability)?

In genetics and epidemiology, we want to explain the causes of diseases, and we want to be able to predict who will get them. We generally do this statistically, by working out the probability that we are right in our guesses about alleles or environmental causal factors.  But this is problematic for a number of reasons.

The concept of probability and the term itself have always been elusive or variable in meaning.  But they refer to collections of objects and the relative proportions of some specified component, or alternatively to the distribution of outcomes of repeated identical trials.  They work great for truly repeatable phenomena such as flipping coins or rolling dice or sampling colored balls from the proverbial urn (actually, there are problems even here, of both practical and theoretical sorts, but they are minor relative to the point we wish to make).

Dice; Wikipedia

In discussions of the phenomena pertaining to evolutionary genetics, 'probability' arises all the time.  We have genetic drift--the 'probability' of change in frequency of a given allele in a population of specified size; natural selection and evolutionary fitness--the relative probability of the bearer of a given genotype to reproduce relative to the probability of alternative genotypes (adaptive fitness); the probability of migration and hence gene flow, and much more.

The same sorts of things arise in biomedical genetics: Mendelian inheritance--the probability of a given allele being transmitted to a given offspring; the relative probability of a disease arising in a person of a given genotype; the probability of lung cancer in a smoker (per cigarette consumed, etc.); the probability of having a given stature or blood pressure in a person of a given genotype, and much more.

Such things are as routinely discussed as the daily news, or what some guest said on The Daily Show. But there is a problem.  The meaning of the term is unclear, often to the user as well as the hearer. The term is deeply built into many areas of evolutionary and causative genetics, our means of drawing conclusions from both observational and experimental data.  But upon closer scrutiny, the term is being used based on highly questionable or hopeful assumptions that at best only apply to a subset of situations.

Replicability and probability: p-eeing into the wind
The standard way to represent a probability is by a symbol, often the letter p.  Once you have identified this value, it is easy to write it into various equations; for example, the probability of observing 3 Heads in a row is p^3 (p cubed), where p is the probability of Heads on any given flip. Or, if you assume that your data involve sampling from the same urn filled with a certain proportion of red and blue balls (or, say, of diabetics or tall people in a sample from a population of people with a given genotype), and the probability p of getting your result just by chance (that is, no genetic causation involved) is less than 0.05, then you declare to the world that something unusual is going on--that is, your research has found something (such as that the genotype is a causal factor for diabetes or tall stature).  But if your actual set-up is not something simple like balls in an urn, or dice, or that sort of thing, then what is it that is really happening?  How reliable or interpretable are these probabilities?
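To make the coin example concrete, here is a minimal sketch in Python. It assumes the idealized case the paragraph describes, in which every flip is a truly identical, independent trial; the function names and the seed are ours, chosen only for illustration. It computes p^3 directly and then checks it by brute-force repetition, which is exactly the kind of replication that the rest of this post argues is usually unavailable in biology.

```python
import random

def prob_all_heads(p, n):
    """Probability of n Heads in a row when each flip is Heads with probability p."""
    return p ** n

def simulate_all_heads(p, n, trials, seed=0):
    """Estimate the same probability by repeating the n-flip experiment many times.

    This only works because every trial is assumed to be an identical,
    independent repeat of the same process.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(trials):
        if all(rng.random() < p for _ in range(n)):
            hits += 1
    return hits / trials

exact = prob_all_heads(0.5, 3)                   # 0.5 cubed
estimate = simulate_all_heads(0.5, 3, 100_000)   # should land close to the exact value
print(exact, estimate)
```

The agreement between the two numbers depends entirely on the flips being replicable; nothing in the code tells us whether a real-world situation meets that condition.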

The assumed sort of regular repeatable condition, in the real evolutionary and biomedical world where we live, is usually a fiction!  That doesn't mean there is no truth involved.  It means that we think there is some truth in how we've organized our description of what's going on, and in a somewhat circular way we're using our data to check that.  Unfortunately, we usually don't know even what we think is going on in a rigorous enough way for the p's we employ to be accurate, or to put it more fairly, they are inaccurate to an unknown extent.  Is that extent even knowable in principle?

In the areas of biology and health discussed here, we are usually far from having such knowledge. We use 'p' values or their statistical equivalents all the time, both for the processes (e.g., risk of some disease or outcome) and for unusualness in support of our causal assertions (e.g., statistical significance).

In most or perhaps every case, we are making strong underlying assumptions about the real world that it is our objective to understand.  The central assumption is one of replicability--that is, that we are dealing with repeatable events, often with some ‘random’ aspect of causation (itself usually a vague assumption not seriously explained).  That in turn implies that there truly exist some p-values or probability distributions underlying what we are studying.  Repeatability is a basic assumption, what one might even call a tenet about the nature of the organization of the world we are studying.

In what sense do such distributions or probabilities actually exist?  Is the world of our interest organized, symmetric, and repeatable enough for the models being tested to be realistic? If the cosmos were purely deterministic (e.g., no truly probabilistic quantum phenomena), then we would always be sampling from worlds of proportions and our probability values would reflect repeated sampling events ('random' assuming we really knew what that meant in practice).

But suppose to the contrary that to an important extent each event is unique, even if perhaps wholly predictable if we had enough information, and that it is literally impossible for us to be both in the world and have enough information about the world in order to make such predictions.  Perhaps either truly, or pragmatically, the 'probabilities' underlying so much of what we are about in modern genetics, and, of course, the promises being made, simply don't exist.  To the extent that's true, the edifice of probabilistic science and statistical inference is mistaken or misleading, to an unknown and probably unknowable extent.

Occasionality: a more appropriate alternative concept--where there's no oh!
When many factors contribute to some measured event, and these are either not all known or measured, or in non-repeatable combinations, or not all always present, so that each instance of the event is due to a unique context-dependent combination, we can say that it is an ‘occasional’ result.  In the usual parlance, the event occasionally happens and each time the conditions may or may not be similar.  This is occasionality rather than probability, and there may not be any 'o-value' that we can assign to the event.

This is, in fact, what we see.  Of course, regular processes occur all around us, and our event will involve some regular processes, just not in a way to which probability values can be assigned.  That is, the occasionality of an event is not an invocation of mystic or immaterial causation.  The word merely indicates that instances of the event are individually unique to an extent not appropriately measured, or not measured with knowable accuracy or approximation, by probabilistic statistical (or tractable deterministic) approaches.  The assumption that the observations reflect an underlying repeated or repeatable process is inaccurate to such an extent as to undermine the idea of estimation and prediction upon which statistical and probabilistic concepts are based.  The extent of that inaccuracy is itself unknown or even unknowable.

There are clearly genetic causal events that are identifiable and, while imperfect because of measurement noise and other unmeasured factors, sufficiently repeatable for practical understanding in the usual way and even treated with standard probability concepts.  Some variants in the CFTR gene and cystic fibrosis fall into that category.  Enough is known of the function of the gene and hence of the causal mechanism of the known allele that screening or interventions need not take into account other contextual factors that may contribute to pathogenesis but in dismissible minor ways. But this seems to be the exception rather than the rule.  Based on present knowledge, I would suggest that that rule is occasionality.

One problem clearly is definitional.  Where causation has traditionally been assigned to some specific outcome, we now realize that outcomes, like causes, constitute a spectrum or syndrome of effects, so that we have as much variation in 'the' outcome as in the input variables.  Clearly things are not working!

Occasionality can be vaguely viewed as something like a rubbery lattice of non-rigid interacting factors, with some connections and factors (nodes) missing or added from time to time without known (or knowable?) causal processes (e.g., what new mutations might lead part of the genome to contribute to some spectrum of phenotype outcomes?), connecting rungs being broken or new ones added as the lattice jostles down a roiling stream in and around obstacles.  Particular vortices, or outcomes, are sets of vertices (traits) that may or may not bob, occasionally, up to the surface individually or in different combinations to be measured as 'outcomes'.  This is at least an attempt at a heuristic way to envision the elusive fish the biological community is trying to land.

Metaphysics and physics envy
It has often been observed that biology's love affair with the kinds of mathematical/statistical approaches being taken is a form of physics envy.  It's an aspect of the belief system that pervades both evolutionary and genetic thinking.  It avoids the somewhat metaphysical reality, that we know life involves emergent properties that are not enumerable in ways ordinary physical and chemical systems seem (at least to outsiders like us, and at the macro-scale) to involve.  If what is wrong or missing (if anything) were clear, these areas of the life sciences would be doing things differently.

This may sound far too vague and airy, but the evidence is all around us.  Last week, as we wrote about here, even 40 or more years of nutritional studies, very sophisticated and extensive, are reported to have been 'wrong' in regard to the risk of dietary cholesterol and salt consumption.  'Wrong' means not supported by recent research.  The nature even of blood levels of many factors like these also seems ephemeral, that is, findings of one study are contradicted by another study.  Which (if any) are to be believed?  Why do we think our new studies or methods obsolesce the older ones, when in essence there aren't very substantial differences?  Why haven't we got solid, replicable, clear-cut causal understanding?

The difference between occasionality and probability is somewhat like the difference between continuous and discontinuous (quantitative vs qualitative) variation.  The challenge is to understand and make inferences about the latter when we don’t have underlying analytic (mathematical function based) theory to apply.  We don’t want to say an effect has no cause (unless we are involved in quantum mechanics, and even that's debated!), but we do need to understand unique combinatorial causation in ways we currently don’t.  This will have to be mixed with the statistical (probabilistic) aspects we can't avoid, such as sampling and measurement effects.

There are many reasons to dismiss such a term or concept as occasionality.  First, because it is not operationalized: we don’t really know what to do about it.  Second, there is a tendency to reject some new term that is invoked in a critical vein.  Third, the practical reason is that even if the idea is correct, if it were acknowledged too fully it would undermine the business that most in this area of science are about, and certainly would be resisted.  But if we were to force ourselves to ask what we would do if we knew the phenomena we want to understand were manifestations of occasionality rather than probability, it might force us to think and act differently.

Thursday, February 12, 2015

What's a 'healthy diet' anyway?

A nutrition advisory panel is convened by the US Department of Agriculture every five years to review recent research in the field, make sense of it, and offer recommendations about what Americans should be eating for optimal health.  Those food pyramids, MyPlate, decades of advice to limit our cholesterol, saturated fat and salt intake?  All the work of these panels of experts who scoured the data and told us what it meant.  And as a result, from the 1960's onward, good, conscientious people reduced their cholesterol intake, took the salt shaker off the table and the whole milk and butter out of the refrigerator, cattle were bred to be leaner, eggs were banned from breakfast, low-salt and low-fat processed foods appeared on the shelves, and heart disease death rates ... continued to fall.

Deaths due to diseases of the heart (United States: 1900–2006). Circulation.2010; 121: e46-e215

Now, according to numerous news accounts, including here at the Washington Post, after more than 50 years of anti-cholesterol, anti-fat expert advice, the current advisory panel is reportedly poised to recommend that we no longer need to limit the amount of cholesterol we eat, nor worry about our salt intake.  Bring on the shrimp, the lobster, the eggs, the meat, banish the guilt!

Oh, wait.  Hold on a sec.  Let's go back to those falling heart disease rates.  I remember one of the first lectures I heard as a new grad student at the University of Texas School of Public Health in the late 1970s, given by heart disease epidemiologist Reuel Stallones, Dean of the school.  His point was that rates had been falling since the 1960's and epidemiologists had no idea why.  He systematically destroyed every argument then (and now) current that might explain the rise and fall of heart disease death rates -- changing diet, decreased smoking, de/increased exercise.

In fact, he made the same case in 1980 in Scientific American in an article called "The Rise and Fall of Ischemic Heart Disease".  (The terms ischemic heart disease, coronary heart disease, and arteriosclerotic heart disease are more or less interchangeable, according to Stallones.)
In the U.S. the death rates attributed to heart attack and other results of the obstruction of the arteries that nourish the heart have fallen since the 1960's. Why they have is not understood.
Here's Stallones' graph of heart disease death rates, from 1900 to 1980.  I can't explain why deaths are still rising in the above graph in the 1950's, but falling in this graph; presumably this has to do with differing classifications of deaths due to heart disease.  Anyway, here rates rose rapidly starting in 1920 and then began to fall in the 1950's, steeply in the 1960's.
Stallones, Scientific American 354(5); 53-59

Here's another graph showing the decline more or less in line with Stallones' data, from a 2000 paper in Circulation.

"Death rates for major cardiovascular diseases in the United States from 1900 to 1997. *Rates are age-adjusted to 2000 standard." Source: Circulation, Cooper et al., 2000

As Stallones wrote, "Plainly a sustained decline in the death rate for ischemic heart disease commands attention and calls for explanation." And, he backs up to ask not only why the fall, but why it was that heart disease mortality began to rise so quickly in 1920, particularly among men.  Whatever explains the decrease must also explain the increase.

Smoking began to rise after World War 1, which fits the rise in heart disease mortality, and began to decrease from the mid-1960's or so, which fits the decrease.  But such an explanation would assume no latency period between beginning to smoke and its effects on heart disease.  And, middle-aged men quit smoking at higher rates than women, and this is not, as Stallones said, in concordance with the pattern of decline in heart disease mortality.  Treatment isn't an effective explanation either, because incidence rates -- new cases -- followed the same pattern as mortality.

He concluded,
In summary, four major variables are known to be associated with the risk of ischemic heart disease in individuals. Among the four, hypertension does not fit the trend of the mortality from ischemic heart disease at all; physical activity fits only the rising curve, serum cholesterol fits only the falling curve and only cigarette smoking fits both. In no case is the fit as precise as one would like. This raises doubt that any of the factors is a fully satisfactory explanation for the variation in mortality.
So, in sum, as of 1980 epidemiology had no explanation for the rise and fall of heart disease mortality rates.

And epidemiology still can't tell us what causes heart disease, or predict who'll get it.  So, apparently we'll soon be told that we no longer have to monitor our cholesterol intake, and there's a lot of talk about fat consumption not being linked with heart disease anymore, and it's not clear whether obesity or hypertension are actually causal.  At least smoking is probably still a problem.

It's clear from the rise and fall of death rates through the 20th century that genes aren't going to be the major explanation because genes can't explain the spike in the 1920's and the fall 40 years later.  That experience also makes it clear that we can't predict environmental changes (or, often, even figure out what they were in hindsight) that might be associated with risk, and thus we aren't going to be able to predict the future, despite the claims of precision medicine advocates.

Stallones suggested that heart disease mortality data might indicate a single environmental cause to explain the rise and fall of death rates, but found reasons to argue against each of the most obvious ones.  Could the cause have been inflammatory?  If so, that would reinforce the idea that predicting future environments and causes is not going to be possible.

And, if there was a single cause, it's curious that our reductionist approaches, with large carefully designed samples and sophisticated statistical analysis, were unable to identify it, because that's what they are widely thought to be best at.  This makes it more likely that heart disease in populations has multiple causes.  And in fact every heart attack is unique, because no two people eat the same things, do the same amount of exercise, suffer the same infectious diseases, and so on. So maybe the very word 'cause', and the very approach (statistical), both of which assume some regular, replicable properties, are not being appropriately conceived.  This is a subject we'll discuss next time.....

"Experts" responding to the coming cholesterol recommendations have said that we still need to eat a healthy diet.  But when we still have no idea what's unhealthy, it's hard to know what is.

Monday, February 9, 2015

Simulation: surprisingly common!

The other day we did a post suggesting that given the commitment to a mega-genomic endeavor based on the belief that genomes are predictive entities about our individual lives generally, we should explore the likely causal landscape that underlies the assumption, with tools including research-based computer simulation.  Simulation can show how genomic causation can most clearly be assessed from sequence data.  By comparison to the empirical project itself, simulation is very inexpensive and trivially fast.  Research simulation has a long history of various sorts, and is a tool of choice in many serious sciences.  Research simulations are not just 'toy model' curiosities, but are real science, because the conclusions and predictions can be tested empirically.

Still, many feel free to dismiss simulations, as we discussed. It seems in a sense too easy--too easy to cook up a simulation to prove what you want to prove, too easy to dismiss the results as unrelated to reality.  But what alternatives do we have?

First, roughly speaking, in the physical sciences there exists very powerful theory based on well-established principles, or 'laws'.  Laws of gravity or motion are classic examples.  A key component of this theory is replicability.  In biology, our theory of evolution and hence of genomic causation is far less developed in terms of, yes, its 'precision'.  Or perhaps it's more accurate to say that as currently understood, genetic theory is fine but we are inaptly applying statistical methods designed for replicable phenomena to things that are not replicable as the methods assume (e.g., every organism is different).  In the physical sciences to a great extent, the theory comes externally to a new situation under study, that is, it is already established and is applied from the outside to the new situation.  Newton's laws, the same laws, very specifically help us predict rates of fall here on earth, and also how to land a robot on a comet far away in the solar system, or even study the relative motion of remote galaxies.

By contrast, evolution is a process of differentiation, not similarity. It has its laws that may be universal on earth, and in their own way comparably powerful, but that are very vaguely general: evolution is a process of ad hoc change based on local genomic variation and local circumstances--and luck.  They do not give us specific, much less precise 'equations' for describing what is happening.

We usually must approach evolution and genomic causation from an internal comparison point of view.  We evaluate data by comparing our empirical samples to what we'd get if the experiment were repeatable and nothing systematic were going on (the 'null' hypothesis), and ask whether it's more likely that something we specify is going on (the alternative hypothesis).  That is, we use things like significance tests, and make a subjective judgment about the results.  Or, we construct some formulation of what we think is going on, like a regression equation Y = a + bX, where Y is our outcome and X is some measured variable, a is an assumed constant and b is an assumed relative-effect constant per unit of X.  This is just one illustration; a huge array of versions of such formulations exists.

But these 'models' are almost always very generic (in part because we have good ready-made algorithms or even programs to estimate a and b). We make some subjective judgment about our estimate of the parameters of this 'model' such as that our a and b estimates are correct to a known extent and are in fact causal factors. One might estimate the value of a or b to be zero, and thus declare, on the basis of some such subjective judgment, that the factor is not involved in the phenomenon.  Such applied statistical approaches are routine in our areas of science, and there are many versions of the basic idea.  They are these days often, if informally and especially in terms of policy decisions, taken to be true parameters of nature.  For example, as in this simplistic equation, the assumption is made that the effect of X on Y is linear.  There is usually no substantial justification for that assumption.
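A small sketch may help show how generic the fitting step is. The code below is our own illustration, not anyone's published method: it uses the textbook ordinary least-squares formulas to estimate a and b in Y = a + bX, and the data values are made up. The point is that the same routine happily returns an a and a b whether or not the underlying process is linear.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of a and b in Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of X, and of cross-products of X and Y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

xs = [0, 1, 2, 3, 4]                   # made-up values of the measured variable X
linear_ys = [1 + 2 * x for x in xs]    # outcome that really is linear: Y = 1 + 2X
curved_ys = [x ** 2 for x in xs]       # outcome that is not linear: Y = X^2

a_lin, b_lin = fit_line(xs, linear_ys)
a_cur, b_cur = fit_line(xs, curved_ys)
print(a_lin, b_lin)   # recovers the a and b that generated the linear data
print(a_cur, b_cur)   # still yields an intercept and slope, though no such constants generated the data
```

Nothing in the estimates themselves flags the second case as a misuse; that judgment has to come from outside the fitting procedure.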

In a fundamental sense, statistical analysis itself is computer simulation!  It is not the testing of an a priori theory against data, either to estimate the values of things involved or to test the reality of the theory. The models tested rarely are intended to specify the actual process going on, or are purposely ignoring other factors that may be involved (or, of course, are ignored because they aren't known of).

Proper research simulation goes through the same sort of epistemological process.  It makes some very specific assumptions about how things are, generates some consequent results, which are then compared to real data.  If the fit is good then, as with regular statistical analysis, the simulated model is taken to be informative.  If the fit is not good then, as with regular statistical analysis, the simulation is adjusted (analogous to adding or removing terms, or differently estimating parameters, in a statistical 'model').  Simulation models are not constrained in advance--you can simulate whatever you specify.  But neither are statistical models--you can data-fit whatever 'model' you specify.
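The assume, generate, compare, adjust cycle just described can be sketched in a few lines. Everything here is hypothetical and chosen only for illustration: the toy model (disease risk rising by a fixed amount per copy of one risk allele), the allele frequency, the baseline risk, the 'observed' prevalence, and the candidate parameter values. It is not a model of any real disease.

```python
import random

def simulate_prevalence(allele_freq, risk_per_allele, baseline, n_people, seed=0):
    """Simulate disease prevalence under a toy additive single-locus model."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    cases = 0
    for _ in range(n_people):
        # Each person draws two alleles; count how many are the risk allele
        copies = (rng.random() < allele_freq) + (rng.random() < allele_freq)
        risk = min(1.0, baseline + risk_per_allele * copies)
        cases += rng.random() < risk
    return cases / n_people

observed = 0.10  # hypothetical prevalence standing in for real data

# Specify candidate assumptions, generate results, and compare each to the data
candidates = [0.00, 0.05, 0.10, 0.15, 0.20]
fits = {
    r: abs(simulate_prevalence(0.25, r, 0.05, 20_000) - observed)
    for r in candidates
}

# 'Adjust' the model by keeping the assumption that fits best
best = min(fits, key=fits.get)
print(best, fits[best])
```

As with a fitted regression, a good match here shows only that this assumption is compatible with the data, not that the toy model describes the actual process.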

There are lots of simulation efforts going on all the time.  Many are using the approach to try to understand the complexity of gene-interaction systems in experimental settings.  There is much of this also in trying to understand details such as DNA biochemistry, protein folding, and the like. There is much simulation in population ecology, and in microbiology (including infectious disease dynamics).  I personally think (because it's what I'm particularly interested in) that not nearly enough is being done to try to understand genetic epidemiology and the evolution of genomically complex traits.  To a great extent, the reason in my oft-expressed opinion is the drive to keep the support for very large-scale and/or long-term studies, given the promises that have been made of medical miracles from genetic screening.  Those reasons are political, not scientific.

Analytic vs Descriptive theory
I think that the best way to get closer to the truth in any science is to have a more specific analytic theory, that describes the actual causal factors and processes relative to the nature of what you're trying to understand.  Such explanations begin with first principles and make specific predictions, as contrasted to the kind of generic descriptive theory of statistical models.  Both approaches are speculative until tested, but generic statistical descriptions generally do not explain the data, and hence have to make major assumptions when extrapolating retro-fitted statistical estimates to prospective prediction.  The latter is the goal of analytic theory, and certainly what is being grandly promised.  Here, 'analytic' can include probabilistic factors including, of course, the pragmatic ones of measurement and sampling errors, etc.  It is an open question whether the nature of life inherently will thwart both of these, leaving us with the challenge to at least optimize the way we use genomic data.

It is not clear if we have current theory that is roughly as good as it gets in regard to explaining evolutionary histories or genomic predictive power.  Life certainly did evolve, and genomes do affect traits in individuals, but how well those phenomena can be predicted by enumerative approaches to variation in DNA sequences (not to mention other factors we currently lump as 'environment') based on classic assumptions of replicability, is far from clear.  At least, at present it seems that as a general rule, with exceptions that can't easily be predicted but can be understood (such as major single-gene diseases or traits with strong effects of the sort Mendel studied in peas), we do not in fact have precise predictive power.  Whether or when we can achieve that remains to be seen.

There is one difference, however:  statistical models cannot be applied until you have gone to the extent and expense of collecting adequate amounts of data.  In truth, you cannot know in advance how much, or often even what type of, data you would need.  That is why, in essence, our collect-everything mentality is so prevalent.  By contrast, prospective research computer simulation can do its work fast, flexibly: it only has to sample electrons!

Computer simulation should not be denigrated by anyone who is actually doing it while calling the work by another name, such as statistical modeling.  There is plenty of room for serious epistemological discussion about whether or how we are or aren't yet able to understand evolutionary and genomic process with sufficient accuracy.

Thursday, February 5, 2015

Populations, individuals and imprecise disease prediction

As Michael Gerson writes in the Washington Post, "Preventable infectious disease is making its return to the developed world, this time by invitation." When anti-vaxxers were few, and they chose not to expose their kids to what they consider toxins, their kids benefited from the herd immunity that resulted from most parents choosing to have their kids vaccinated (or, as Gerson puts it, anti-vaxxers chose to be free-riders).  They could claim there were no costs to their action (or non-action), because as long as they were a small minority, there weren't.

But unlike many complex non-infectious diseases, infectious diseases are very predictable. Once the proportion of a population that is immunized falls below a certain threshold, as determined by the rigorous and empirically tested mathematics of infectious disease, the kids of anti-vaxxers are then sitting ducks for disease once they are exposed, and the disease is then likely to spread even to the immunized population because no vaccine is 100% effective.  And this is happening in the US now with measles.  Anti-vaxxers convinced enough of their neighbors not to vaccinate that they can no longer claim no cost, only benefits to their beliefs.

From Mother Jones, 2014

In theory, herd immunity protects a population from measles when at least 90-95% of the population is vaccinated, so the above map would suggest that even in Oregon, with a rate of non-medical immunization exemption over 6%, the disease would be unable to gain a foothold.  But, infectious disease researcher Marcel Salathé, who is here at Penn State, nicely describes herd immunity here, and suggests that something closer to 100% coverage would actually be required to protect against measles because of pockets of lower vaccination rates, non-random mixing of the population and so forth.
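For readers who want to see where a figure like 90-95% can come from, here is a minimal sketch using the standard textbook approximation for a well-mixed population: the herd immunity threshold is 1 - 1/R0, where R0 is the basic reproduction number. The R0 range of 12 to 18 for measles and the 97% vaccine efficacy are commonly cited values that we are assuming here, not numbers from the map or from Salathé, and the function names are ours.

```python
def herd_immunity_threshold(r0):
    """Fraction of the population that must be immune, assuming random mixing."""
    return 1 - 1 / r0

def required_coverage(r0, vaccine_efficacy):
    """Vaccination coverage needed when the vaccine is less than 100% effective."""
    return herd_immunity_threshold(r0) / vaccine_efficacy

# Assumed, commonly cited R0 range for measles and an assumed vaccine efficacy
for r0 in (12, 18):
    threshold = herd_immunity_threshold(r0)
    coverage = required_coverage(r0, 0.97)
    print(r0, round(threshold, 3), round(coverage, 3))
```

Under these assumptions the immunity threshold comes out at roughly 92% to 94%, and the coverage needed is higher still once imperfect vaccine efficacy is counted. The formula assumes random mixing, so pockets of unvaccinated people of the kind Salathé describes push the practical requirement higher again.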

The science on the safety and efficacy of vaccination is well-established. Vaccines can have side-effects, but it's pretty clear the list doesn't include autism.  (Has anyone estimated the prevalence of autism in anti-vaxxer communities yet?  If they were right that the MMR vaccine causes autism, the rate should be a lot lower in unvaccinated kids by now, no?)  The freedom of choice issue, of whether the state has a right to require individuals to be vaccinated, is a live one, and any conscientious objector to the right of society to make decisions for individuals has to be struggling with this one, given the societal consequences.  This is distinctly not the same as an individual's freedom to decide whether to smoke or to drink jumbo soft drinks because in the case of vaccines, what's good for society is also good for individuals, and vice versa.

Measles virus; Wikipedia (Cynthia S Goldsmith Content Provider, CDC) 

But that's not what interests me particularly here.  What interests me is the interplay between population and individual disease dynamics.  Infectious disease dynamics depend on the proportion of susceptible individuals in the population; too few and the disease dies out, enough and the disease sticks around, cyclically infecting people as, say, the flu, or endemic, as, e.g., venereal diseases.  So, in a very real sense infectious disease happens to a group, in a group, and because of a group, at the same time it's happening to individuals in that group.  But chronic non-infectious diseases (CNIDs) don't work that way.  Chronic non-infectious diseases happen to individuals only, irrespective of what's happening to anyone else in the population (though, see below).

But, what we know about and predict for individuals depends on what we know (or think we know) about a CNID in a population.  Epidemiologists collect data in a group on what may be relevant risk factors, and then statistically estimate their importance and impact. Observations on a single individual don't have the power to allow epidemiologists to determine which risk factors are likely to be important.  That requires repeated observations, on many people.

So, calculations of how likely you are to have a heart attack given your age, body mass index, cholesterol levels and so on are based on population associations between such factors and actual heart attacks.  These are based on past observed experience in many individuals.  As we've written many times before, it's not really clear what 'risk' represents, other than the observed proportion of a population with apparent past exposure to tested risk factors who went on to develop a disease.  But, a lot of people with the same risk factors didn't develop disease, or have a heart attack or whatever, and a lot of people without those risk factors did.  So, risk estimates are population-based statistics that may or may not apply to you -- or anyone individually, really.  They are collective data, and clearly don't explain all risk, or allow precise prediction.

Now, social epidemiologists would say that chronic diseases can be as much a result of population factors as infectious diseases are.  Smoking, obesity, drinking, stress-related diseases are correlated with social class, so in a very real sense, population dynamics can affect risk of CNIDs as well.  So, if we want to explain CNIDs by distal rather than proximal risk factors, population dynamics become important, as with infectious diseases.  But they are still population-level factors -- not every low-income individual is obese or has high blood pressure, and not every overweight individual is poor.  But, everyone with measles has been infected by the measles virus.

The population issue, I think, goes a long way toward explaining why it's so hard to predict chronic disease, genetic or otherwise.  We're forced to apply group statistics to individuals, and that's never going to be precise.

Wednesday, February 4, 2015

Exploring genomic causal 'precision'

Regardless of whether or not some geneticists object to the cost or scientific cogency of the currently proposed Million Genomes project, it is going to happen.  Genomic data clearly have a role in health and medical practice. The project is receiving kudos from the genetics community, but it's easy to forget that many questions related to understanding the actual nature of genomic causation, and the degree to which that understanding can in practice lead to seriously 'precise' and individualized predictive or therapeutic medicine, remain at best unanswered.  The project is inevitable if for no other reason than that DNA sequencing costs are rapidly decreasing.  So let's assume the data, and think about what our current state of knowledge tells us we'll be able to predict from it all.

An important scientific (rather than political or economic) point in regard to recent promises, is that, currently, disease prediction is actually not prediction but data-fitting retrodiction.  The data reflect what has happened in the past to bearers of identified genotypes.  Using the results for prediction is to assume that what is past is prologue, and to extrapolate retrospectively estimated risks to the future. In fact, the individual genomewide genotypes that have been studied to estimate past risk will never recur in the future: there are simply too many contributing variants to generate each sampled person's genotype.
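A rough back-of-the-envelope sketch shows why multi-site genotypes do not recur.  It rests on simplifying assumptions that are mine, not part of the argument above: biallelic sites, independence between sites, Hardy-Weinberg proportions, and a round illustrative population figure.

```python
WORLD_POPULATION = 8_000_000_000  # rough illustrative figure, an assumption


def possible_genotypes(n_sites):
    """Number of distinct genotypes across n biallelic sites (3 per site)."""
    return 3 ** n_sites


def match_probability(n_sites, allele_freq=0.5):
    """Chance two random people share the same genotype at every site,
    assuming independent sites in Hardy-Weinberg proportions."""
    p, q = allele_freq, 1 - allele_freq
    per_site = (p * p) ** 2 + (2 * p * q) ** 2 + (q * q) ** 2
    return per_site ** n_sites


def sites_needed(population):
    """Smallest number of sites whose genotype count exceeds the population."""
    n = 0
    while possible_genotypes(n) <= population:
        n += 1
    return n


print(sites_needed(WORLD_POPULATION))   # number of sites that already suffices
print(possible_genotypes(100))          # combinations for 100 contributing sites
print(match_probability(100))           # chance of an exact match at 100 sites
```

Under these assumptions a couple of dozen contributing sites already allow more combinations than there are people, and traits with hundreds of contributing sites are far beyond that.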

Secondly, if current theory underlying causation and measures like heritability is even remotely correct, most of the individual genomic factors associated with risk are inherited in a Mendelian way but do not as a rule cause traits that way.  Instead, each factor's effects are genome context-specific and act in a combinatorial way with the other contributing factors, including the other parts of the genome, and the other cells in the individual, and more.

Thirdly, in general, most risk seems not due to inherited genetic factors or context, because heritability is usually far below 100%.  Risk is due in large part to lifestyle exposures and interactions.  It is very important to realize that we cannot, even in principle, know what current subjects' future environmental/lifestyle exposures will be, though we do know that they will differ in major ways from the exposures of those patients or subjects from whom current retrospective risks have been estimated.  It is troubling enough that we are not good at evaluating, or even measuring, current subjects' past exposures, whose effects we are now seeing along with their genotypes.

In a nutshell, the assumption underlying current 'personalized' medicine is one of replicability of past observations, and statistical assessments are fundamentally based on that notion in one way or another.
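To see what the replicability assumption costs when exposures change, consider this deliberately simple sketch.  The risk table, the exposure prevalences, and the name `carrier_risk` are all invented for illustration; the point is only the arithmetic of averaging over an environment that will not stay the same.

```python
# Hypothetical disease risks indexed by (genotype, exposure):
# genotype 1 = carries the variant, exposure 1 = has the lifestyle exposure.
RISK = {
    (0, 0): 0.02,
    (0, 1): 0.04,
    (1, 0): 0.03,
    (1, 1): 0.30,   # the variant matters mostly in exposed people
}


def carrier_risk(exposure_prevalence, genotype=1):
    """Average risk for a genotype, given how common the exposure is."""
    unexposed = (1 - exposure_prevalence) * RISK[(genotype, 0)]
    exposed = exposure_prevalence * RISK[(genotype, 1)]
    return unexposed + exposed


# Risk 'retrofitted' from a past cohort in which 60% were exposed...
past = carrier_risk(0.6)
# ...versus what carriers would face if only 10% are exposed in future.
future = carrier_risk(0.1)
print(f"retrospective estimate: {past:.3f}, future risk: {future:.3f}")
```

With these made-up numbers the retrospective estimate for carriers is more than three times what they would actually experience, even though nothing about the genotype has changed.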

Furthermore, most risk estimates in use are, for practical reasons, based essentially on additive models:  add up the estimated risk from each relevant genome site (hundreds of them) to get the net risk.  Depending on the analysis, this leaves little room for non-additive effects, because things are estimated statistically from a population of samples, etc.  These issues are well known to statisticians, perhaps less so to many geneticists, even if there are many reasons, good and bad, to keep them in the shadows. Biologically, as extensive systems analysis shows clearly, DNA functions through its coded products interacting with each other and with everything else the cell is exposed to.  There is simply no reason to assume that within each individual those interactions are strictly additive at the mechanistic level, even if they are assessed (estimated) statistically from large population samples.
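A toy two-site example may help show the gap between an additive statistical fit and a non-additive mechanism.  The multiplicative `true_phenotype` below is a hypothetical mechanism chosen for illustration, and the fit uses an evenly weighted grid of the nine genotypes, an assumption that lets each slope be computed separately because the two sites are uncorrelated on that grid.

```python
from itertools import product


def true_phenotype(g1, g2):
    # Hypothetical purely interactive (non-additive) mechanism.
    return g1 * g2


def mean(xs):
    return sum(xs) / len(xs)


def slope(xs, ys):
    """Least-squares slope of ys on xs (valid here because sites are balanced)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


# All nine two-site genotypes, coded as copies (0, 1, 2) of one allele.
genotypes = list(product([0, 1, 2], repeat=2))
values = [true_phenotype(g1, g2) for g1, g2 in genotypes]
g1s = [g[0] for g in genotypes]
g2s = [g[1] for g in genotypes]

b1, b2 = slope(g1s, values), slope(g2s, values)
intercept = mean(values) - b1 * mean(g1s) - b2 * mean(g2s)

for (g1, g2), actual in zip(genotypes, values):
    additive = intercept + b1 * g1 + b2 * g2
    print(f"genotype ({g1},{g2}): true={actual}, additive fit={additive:+.1f}")
residuals = [v - (intercept + b1 * a + b2 * b)
             for v, (a, b) in zip(values, genotypes)]
```

The additive fit looks respectable on average, yet it is wrong for exactly the individuals at the corners of the table, including assigning a negative value to the genotype whose true value is zero.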

For these and several other fundamental reasons, we're skeptical about the Million Genomes project, and we've said that upfront (including in this blog post).  But supporters of the project are looking at the exact same flood of genomic data we are, and seeing evidence that the promises of precision medicine are going to be met.  They say we're foolish, we say they're foolish, but who's right?  Well, the fundamental issue is the way in which genotypes produce phenotypes, and if we can parameterize that in some way, we can anticipate the realms in which the promised land can be reached, ways in which it is not likely to be reached, and how best to discriminate between them.

Based on work we've done over the past few years, one avenue we think should be taken seriously, which can be done at very low cost and could potentially save a large amount of costly wheel-spinning, is computer simulation of the data and the approaches one might take to analyze it.

Computer simulation is a well-accepted method of choice in fields dealing with complex phenomena, as in chemistry, physics, and cosmology.  Computer simulation allows one to build in (or out) various assumptions, and to see how they affect results.  Most importantly, it allows testing whether the results match empirical data. When data are too complex, total enumeration of factors, much less analytical solution of their interactions, is simply not possible.

Biological systems have the kind of complexity that these other physical-science fields have to deal with.  A good treatment of the nature of biological systems, and their 'hyper-astronomical' complexity, is Andreas Wagner's recent book The Arrival of the Fittest.   This illustrates the types of known genetically relevant complexity that we're facing.  If simulation of cosmic (merely 'astronomical') complexity is a method of choice in astrophysics, among other areas, it should be legitimate for mere genomics.

A computer simulation can be deterministic or probabilistic, and with modern technology can mimic most of the sorts of things one would like to know in the promised miracle era of genomewide sequencing of everything that moves.  Simulation results are not real biology, of course, any more than simulated multiple galaxies in space are real galaxies.  But simulated results can be compared to real data.  As importantly, with simulation, there is no measurement or ascertainment error, since you know the exact 'truth', though one can introduce sampling or other sorts of errors to see how they affect what can be inferred from imperfect real data.  If simulated parameters, conditions, and results resemble the real world, then we've learned something.  If they don't, then we've also learned something, because we can adjust the simulations to try to understand why.

Many sneer at simulation as 'garbage in, garbage out'.  That's a false defense of relying on empirical data that we know are loaded with all sorts of errors.  Just as with simulation, an empiricist can design samples and collect empirical data in garbage in, garbage out ways, too.

Computer simulation can be done at a tiny fraction of the cost of collecting empirical data. Simulation involves no errors or mistakes, no loss or inadvertent switching of blood samples, no problems due to measurement errors or imprecise definition of a phenotype, because you get the exact data that were simulated.  It is very fast if a determined effort is made to do it.  Even if the commitment is made to collect vast amounts of data, one can use simulation to make the best use of them.

Most simulations are built to aid in some specific problem.  For example, under (say) a model of some small number of genes, with such-and-such variant frequency, how many individuals would one need to sample to get a given level of statistical power to detect the risk effects in a case-control study?
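A minimal sketch of that kind of question-specific simulation follows.  It is not taken from any existing package: the single-variant model, the allele-count z-test, the nominal 1.96 threshold (genomewide studies use far stricter ones), and all frequencies are assumptions made for illustration.

```python
import math
import random


def simulate_power(n_per_group, control_freq, case_freq,
                   z_threshold=1.96, replicates=500, seed=1):
    """Estimate power of a case-control allele-frequency comparison.

    Each replicate draws 2 * n_per_group alleles for cases and for controls,
    then applies a two-proportion z-test.  Returns the fraction of replicates
    in which the difference is declared significant.
    """
    rng = random.Random(seed)
    n_alleles = 2 * n_per_group
    hits = 0
    for _ in range(replicates):
        case_count = sum(rng.random() < case_freq for _ in range(n_alleles))
        control_count = sum(rng.random() < control_freq for _ in range(n_alleles))
        pooled = (case_count + control_count) / (2 * n_alleles)
        se = math.sqrt(pooled * (1 - pooled) * 2 / n_alleles)
        diff = abs(case_count - control_count) / n_alleles
        if se > 0 and diff / se > z_threshold:
            hits += 1
    return hits / replicates


# How does power grow with sample size for a hypothetical modest effect?
for n in (50, 200, 1000):
    print(n, simulate_power(n, control_freq=0.20, case_freq=0.25))
```

Changing one argument and rerunning takes seconds, which is the practical appeal: one can ask how a design would behave before committing to collecting the samples.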

In a sense, that sort of simulation is either quite specific, or in some cases has been designed to prove some point, such as to support a proposed design in a grant application.  These can be useful, or they can be transparently self-serving.  But there is another sort of simulation, designed for research purposes.

Evolution by phenotype
Most genetic simulations treat individual genes as evolving in populations.  They are genotype-based in that sense, essentially simulating evolution by genotype.  But evolution is phenotype-based: individuals as wholes compete, reproduce, or survive, and the genetic variation they carry is or isn't transmitted as a whole.  This is evolution by phenotype, and is how life actually works.

There is a huge difference between phenotype-based and gene-based simulation, and the difference is highly pertinent to the issues presently at stake.  That is because it is from multiple genetic variants, each changing in frequency under drift or natural selection (almost always it's both, with the former having the greatest effect), that we get the kind of causal complexity and elusive gene-specific causal effects that we clearly see.  And environmental effects need to be taken into account directly as well.
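To make the distinction concrete, here is a bare-bones sketch of what 'evolution by phenotype' means in code.  It is a toy written for this illustration only, not the method of any existing simulation package; the function name, the Gaussian stabilizing selection, free recombination, and every parameter value are assumptions.  Selection sees only each individual's whole phenotype (genes plus environment), and the variants are carried along with whoever reproduces.

```python
import math
import random


def simulate_by_phenotype(pop_size=200, n_loci=20, generations=50,
                          optimum=0.0, selection_width=2.0, env_sd=1.0,
                          mutation_rate=0.001, seed=42):
    """Toy phenotype-based simulation; returns (mean phenotype per generation,
    final allele frequencies, per-allele effect sizes)."""
    rng = random.Random(seed)
    effects = [rng.gauss(0.0, 0.5) for _ in range(n_loci)]  # hypothetical effects

    def new_haplotype():
        return [1 if rng.random() < 0.5 else 0 for _ in range(n_loci)]

    def phenotype(ind):
        genetic = sum(e * (a + b) for e, a, b in zip(effects, ind[0], ind[1]))
        return genetic + rng.gauss(0.0, env_sd)   # environment acts directly

    def gamete(ind):
        # Free recombination between loci, then rare mutation (allele flips).
        hap = [rng.choice((a, b)) for a, b in zip(ind[0], ind[1])]
        return [1 - x if rng.random() < mutation_rate else x for x in hap]

    population = [(new_haplotype(), new_haplotype()) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        phenos = [phenotype(ind) for ind in population]
        history.append(sum(phenos) / pop_size)
        # Fitness depends on the phenotype only, never on single genes.
        fitness = [math.exp(-((p - optimum) ** 2) / (2 * selection_width ** 2))
                   for p in phenos]
        mothers = rng.choices(population, weights=fitness, k=pop_size)
        fathers = rng.choices(population, weights=fitness, k=pop_size)
        population = [(gamete(m), gamete(f)) for m, f in zip(mothers, fathers)]

    freqs = [sum(ind[0][i] + ind[1][i] for ind in population) / (2 * pop_size)
             for i in range(n_loci)]
    return history, freqs, effects


history, freqs, effects = simulate_by_phenotype()
print("final mean phenotype:", round(history[-1], 3))
print("final allele frequencies:", [round(f, 2) for f in freqs])
```

Rerunning with a different `seed` gives different allele frequencies at the individual sites, since drift and selection act on whole individuals rather than on any one gene.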

I know this not just by observing the data that are so plentiful but because I and colleagues have developed an evolution by phenotype simulation program (it's called ForSim, and is freely available and open-source, so this is not a commercial; email me if you would like the package).  It is one of many simulation packages available, which can be found here: NCI link.  Our particular approach to genetic variation and its dynamics in populations has been used to address various problems and it can address the very questions at issue today.

With simulation, you try to get an idea of some phenomenon so complex or extensive that data in hand are inadequate or where there is reason to think proposed approaches will not deliver on what is expected.  If a simulation gives results that don't match the structure of available empirical data reasonably closely, you modify the conditions and run it again, in many cases requiring only minutes and a desktop computer.  Larger or very much larger simulations are also easily and inexpensively within reach, without waiting for some massive, time-demanding and expensive new technology. Even very large-scale simulation does not require investing in high technology, because major universities already have adequate computer facilities.

Simulation of this type can include features such as:
* Multiple populations and population history, with specifiable separation depth, and with or without gene flow (admixture)
* The number of contributing genes, their length and spacing along the genome
* Recombination, mutation, and genetic drift rates, environmental effects, and natural selection of various kinds and intensities to generate variation in populations
* Additive, or function-based non-additive, single or multiple phenotype determination
* Single and multiple related or independent phenotypes
* Sequence elements that do and that don't affect phenotype(s) (e.g., mapping-marker variants)

Such simulations provide
* Deep known (saved) pedigrees
* Ability to see the status of these factors as they evolve, saving data at any point in the past (can even mimic fossil DNA)
* Each factor can be adjusted or removed, to see what difference it makes.
* Testable sampling strategies, including 'random', phenotype-based (case-control, tail, QTL, for GWAS, families and Mendelian penetrance, admixture, population structure effects)
* Precise testing of the efficacy, or conditions, for predicting retrofitted risks, to test the characteristics of 'precision' personalized medicine.

These are some of the things the particular ForSim system can do, listed only because I know what we included in our particular program.  I don't know much about what other simulation programs can do (but if they are not phenotype-based they will likely miss critical issues).  Do it on a desktop or, for grander scale, on some more powerful platform.  Other features that relate to some of the issues that the currently proposed whole-genome sequencing implicitly raises could be built in by parameter specification or by program modification, at a minuscule fraction of the cost of launching off on new sequencing.

Looking at this list of things to decide, you might respond, in exasperation, "This is too complicated!  How on earth can one specify or test so many factors?"  When you say that, without even yet pressing the 'Run' key, you have learned a major lesson from simulation!  That's because these factors are, as you know very well, involved in evolutionary and genetic processes whose present-day effects we are being told can be predicted 'precisely'.  Simulation both clearly shows what we're up against and may give ideas about how to deal with it.

Above is a schematic illustration of the kinds of things one can examine by simulation, and check with real data.  In this figure (related to work we've been involved with), mouse strains were selected for 'opposite' trait value, then interbred, and then a founding few from each strain were crossed, and then intercrossed for many generations to let recombination break up gene blocks.  One then uses markers identified in the sequenced parental strain to map variation causally related to the strains' respective inbred trait. Many aspects and details of such a design can be studied with the help of such results, and there are surprises that can guide research design (e.g., there is more variation than the nominal idea of inbreeding and 'representative' strain-specific genome sequencing generally considers, among other issues).

Possibilities in, knowledge out
As noted above, it is common to dismiss simulation out of hand, because it's not real data, and indeed simulations can certainly be developed that are essentially structured to show what the author believes to be true.  But that is not the only way to approach the subject.

A good research simulation program is not designed to generate any particular answer, but just the opposite.  Simulation done properly doesn't even take much time to get to useful answers.  What it gives you is not real data but verisimilitude--when you match real data that are in hand, you can make sharper, focused decisions on what kinds of new data to obtain, or how to sample or analyze them, or, importantly, what they can actually tell you.  Just as importantly, if not more so, if you can't get a good approximation to real data, then you have to ask why.  In either case, you learn.

Because of its low relative cost, the preparatory use of serious-level simulation should be a method of choice in the face of the kinds of genomic causal complexity that we know constitutes the real world. Careful, honest use of simulation to know about nature and as a guide is one real answer to the regularly heard taunt that if someone doesn't have a magic answer about what to do instead, s/he has no right to criticize business as usual.

Simulation, when not done just to cook the books in favor of what one already is determined to do, can show where one needs to look to gain an understanding. It is no more garbage in, garbage out than mindless data collection, but at least when mistakes or blind alleys are found by simulation, there isn't that much garbage to have to throw out before getting to the point.  Well-done simulation is not garbage in, garbage out, but a very fast and cost-effective 'possibilities in, knowledge out'.

Our prediction
We happen to think that life is genetically as complex as it looks from a huge diversity of studies large and small, on various species. One possibility is that this complexity implies there is in fact no short-cut to disease prediction for complex traits. Another is that some clever young persons, with or without major new data discoveries, will see a very different way to view this knowledge, and suggest a 'paradigm shift' in genetic and evolutionary thinking.  Probably more likely is that, if we take the complexity seriously, we can develop a more effective and sophisticated approach to understanding phenogenetic processes, the connections between genotypes and phenotypes.