Showing posts with label forward simulation. Show all posts

Wednesday, February 4, 2015

Exploring genomic causal 'precision'

Regardless of whether or not some geneticists object to the cost or scientific cogency of the currently proposed Million Genomes project, it is going to happen.  Genomic data clearly have a role in health and medical practice. The project is receiving kudos from the genetics community, but it's easy to forget that many questions remain at best unanswered: questions about the actual nature of genomic causation, and about the degree to which understanding it can in practice lead to seriously 'precise' and individualized predictive or therapeutic medicine.  The project is inevitable if for no other reason than that DNA sequencing costs are rapidly decreasing.  So let's assume the data, and think about what our current state of knowledge tells us we'll be able to predict from it all.

An important scientific (rather than political or economic) point in regard to recent promises, is that, currently, disease prediction is actually not prediction but data-fitting retrodiction.  The data reflect what has happened in the past to bearers of identified genotypes.  Using the results for prediction is to assume that what is past is prologue, and to extrapolate retrospectively estimated risks to the future. In fact, the individual genomewide genotypes that have been studied to estimate past risk will never recur in the future: there are simply too many contributing variants to generate each sampled person's genotype.
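A rough back-of-the-envelope sketch shows why genomewide genotypes do not recur. It assumes, purely for illustration, independent sites with two alleles each (so three possible diploid genotypes per site) and a round figure of 8 billion people; the function names are hypothetical. Real genotypes are not equally likely, so this only bounds the number of distinct combinations, but the scale of the problem is clear.

```python
# Assumption: each contributing site is biallelic, so a diploid individual
# carries 0, 1 or 2 copies of the variant allele (3 possible genotypes).

def possible_genotypes(n_sites: int) -> int:
    """Number of distinct multi-site genotypes across n_sites independent sites."""
    return 3 ** n_sites


def min_sites_exceeding(population: int) -> int:
    """Smallest number of sites whose genotype combinations outnumber `population`."""
    n = 0
    while possible_genotypes(n) <= population:
        n += 1
    return n


if __name__ == "__main__":
    world = 8_000_000_000  # illustrative round number, not a census figure
    print(min_sites_exceeding(world))
    print(possible_genotypes(100))
```

Under these assumptions a couple of dozen sites already allow more combinations than there are people, and traits studied by GWAS are typically attributed to hundreds of sites.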

Secondly, if current theory underlying causation and measures like heritability is even remotely correct, the individual genomic factors that carry the bulk of the risk are inherited in a Mendelian way but do not as a rule cause traits that way.  Instead, each factor's effects are genome context-specific and act in a combinatorial way with the other contributing factors, including the other parts of the genome, and the other cells in the individual, and more.

Thirdly, in general, most risk seems not due to inherited genetic factors or context, because heritability is usually far below 100%.  Risk is due in large part to lifestyle exposures and interactions.  It is very important to realize that we cannot, even in principle, know what current subjects' future environmental/lifestyle exposures will be, though we do know that they will differ in major ways from the exposures of those patients or subjects from whom current retrospective risks have been estimated.  It is troubling enough that we are not good at evaluating, or even measuring, current subjects' past exposures, whose effects we are now seeing along with their genotypes.

In a nutshell, the assumption underlying current 'personalized' medicine is one of replicability of past observations, and statistical assessments are fundamentally based on that notion in one way or another.

Furthermore, most risk estimates used are perforce for practical reasons based essentially on additive models:  add up the estimated risk from each relevant genome site (hundreds of them) to get the net risk.  Depending on the analysis, this leaves little room for non-additive effects, because things are estimated statistically from a population of samples, etc.  These issues are well known to statisticians, perhaps less so to many geneticists, even if there are many reasons, good and bad, to keep them in the shadows. Biologically, as extensive systems analysis shows clearly, DNA functions work by its coded products interacting with each other and with everything else the cell is exposed to.  There is simply no reason to assume that within each individual those interactions are strictly additive at the mechanistic level, even if they are assessed (estimated) statistically from large population samples.
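The difference between an additive risk score and one with interactions can be sketched in a few lines. This is only an illustration under stated assumptions: per-site effects are on a log-odds scale, genotypes are coded as 0, 1 or 2 copies of the variant, and the pairwise interaction terms are hypothetical, not estimates from any study.

```python
import math


def additive_log_odds(genotype, effects, baseline=0.0):
    """Sum per-site effects weighted by allele count (0, 1 or 2 copies)."""
    return baseline + sum(g * b for g, b in zip(genotype, effects))


def interacting_log_odds(genotype, effects, pair_effects, baseline=0.0):
    """Additive score plus hypothetical pairwise terms given as {(i, j): effect}."""
    total = additive_log_odds(genotype, effects, baseline)
    for (i, j), effect in pair_effects.items():
        total += effect * genotype[i] * genotype[j]
    return total


def risk_from_log_odds(log_odds):
    """Convert a log-odds score to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    genotype = [1, 0, 2]        # allele counts at three sites
    effects = [0.1, 0.2, 0.3]   # made-up per-allele effects
    pairs = {(0, 2): 0.5}       # made-up interaction between sites 0 and 2
    print(risk_from_log_odds(additive_log_odds(genotype, effects)))
    print(risk_from_log_odds(interacting_log_odds(genotype, effects, pairs)))
```

The same person gets a different risk under the two models, and an analysis that fits only the additive terms has no place to put the difference except into the per-site estimates or the residual.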

For these and several other fundamental reasons, we're skeptical about the million genome project, and we've said that upfront (including in this blog post.)  But supporters of the project are looking at the exact same flood of genomic data we are, and seeing evidence that the promises of precision medicine are going to be met.  They say we're foolish, we say they're foolish, but who's right?  Well, the fundamental issue is the way in which genotypes produce phenotypes, and if we can parameterize that in some way, we can anticipate the realms in which the promised land can be reached, ways in which it is not likely to be reached, and how best to discriminate between them.

Simulation
Based on work we've done over the past few years, one avenue we think should be taken seriously, which can be done at very low cost and could potentially save a large amount of costly wheel-spinning, is computer simulation of the data and of the approaches one might take to analyze them.

Computer simulation is a well-accepted method of choice in fields dealing with complex phenomena, as in chemistry, physics, and cosmology.  Computer simulation allows one to build in (or out) various assumptions, and to see how they affect results.  Most importantly, it allows testing whether the results match empirical data. When data are too complex, total enumeration of factors, much less analytical solutions to their interactions, is simply not possible.

Biological systems have the kind of complexity that these other physical-science fields have to deal with.  A good treatment of the nature of biological systems, and their 'hyper-astronomical' complexity, is Andreas Wagner's recent book The Arrival of the Fittest.   This illustrates the types of known genetically relevant complexity that we're facing.  If simulation of cosmic (merely 'astronomical') complexity is a method of choice in astrophysics, among other areas, it should be legitimate for mere genomics.

A computer simulation can be deterministic or probabilistic, and with modern technology can mimic most of the sorts of things one would like to know in the promised miracle era of genomewide sequencing of everything that moves.  Simulation results are not real biology, of course, any more than simulated multiple galaxies in space are real galaxies.  But simulated results can be compared to real data.  As importantly, with simulation, there is no measurement or ascertainment error, since you know the exact 'truth', though one can introduce sampling or other sorts of errors to see how they affect what can be inferred from imperfect real data.  If simulated parameters, conditions, and results resemble the real world, then we've learned something.  If they don't, then we've also learned something, because we can adjust the simulations to try to understand why.

Many sneer at simulation as 'garbage in, garbage out'.  That's a false defense of relying on empirical data, which we know are loaded with all sorts of errors.  Just as with simulation, an empiricist can design samples and collect empirical data in garbage in, garbage out ways, too.

Computer simulation can be done at a tiny fraction of the cost of collecting empirical data. Simulation involves no errors or mistakes, no loss or inadvertent switching of blood samples, no problems due to measurement errors or imprecise definition of a phenotype, because you get the exact data that were simulated.  It is very fast if a determined effort is made to do it.  Even if the commitment is made to collect vast amounts of data, one can use simulation to make the best use of them.

Most simulations are built to aid in some specific problem.  For example, under (say) a model of some small number of genes, with such-and-such variant frequency, how many individuals would one need to sample to get a given level of statistical power to detect the risk effects in a case-control study?
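A minimal sketch of that kind of power simulation, for a single variant, might look like the following. Everything here is an assumption made for illustration: one biallelic site, an allelic odds ratio, an allele-count chi-square test with 1 degree of freedom at the 0.05 level (critical value 3.841), and hypothetical function names. A genomewide study would need a far stricter threshold.

```python
import random


def case_allele_freq(control_freq, odds_ratio):
    """Allele frequency in cases implied by an allelic odds ratio."""
    return odds_ratio * control_freq / (1.0 - control_freq + odds_ratio * control_freq)


def count_alleles(rng, n_people, freq):
    """Number of variant alleles among 2 * n_people sampled chromosomes."""
    return sum(1 for _ in range(2 * n_people) if rng.random() < freq)


def allelic_chi_square(case_alt, control_alt, n_alleles):
    """1-df chi-square statistic for the 2x2 table of allele counts."""
    a, b = case_alt, n_alleles - case_alt
    c, d = control_alt, n_alleles - control_alt
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a margin is empty, so the table carries no information
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom


def estimate_power(n_per_group, control_freq, odds_ratio, reps=500, seed=0):
    """Fraction of simulated studies in which the test exceeds the critical value."""
    critical = 3.841  # chi-square, 1 df, alpha = 0.05 (single test only)
    rng = random.Random(seed)
    p_case = case_allele_freq(control_freq, odds_ratio)
    n_alleles = 2 * n_per_group
    hits = 0
    for _ in range(reps):
        case_alt = count_alleles(rng, n_per_group, p_case)
        control_alt = count_alleles(rng, n_per_group, control_freq)
        if allelic_chi_square(case_alt, control_alt, n_alleles) > critical:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    for n in (100, 500, 1000):
        print(n, estimate_power(n, control_freq=0.2, odds_ratio=1.3, reps=200))
```

Running it over a range of sample sizes gives a rough power curve for the chosen frequency and effect size; changing those two numbers is the whole experiment.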

In a sense, that sort of simulation is either quite specific, or in some cases has been designed to prove some point, such as to support a proposed design in a grant application.  These can be useful, or they can be transparently self-serving.  But there is another sort of simulation, designed for research purposes.

Evolution by phenotype
Most genetic simulations treat individual genes as evolving in populations.  They are genotype-based in that sense, essentially simulating evolution by genotype.  But evolution is phenotype-based: individuals as wholes compete, reproduce, or survive, and the genetic variation they carry is or isn't transmitted as a whole.  This is evolution by phenotype, and is how life actually works.

There is a huge difference between phenotype-based and gene-based simulation, and the difference is highly pertinent to the issues presently at stake.  That is because it is from multiple genetic variants, whether changing under drift or natural selection (almost always, it's both, with the former having the greatest effect), that we get the kind of causal complexity and elusive gene-specific causal effects that are clearly the case.  And environmental effects need to be taken into account directly as well.

I know this not just by observing the data that are so plentiful but because I and colleagues have developed an evolution by phenotype simulation program (it's called ForSim, and is freely available and open-source, so this is not a commercial--email me if you would like the package).  It is one of many simulation packages available, which can be found here: NCI link.  Our particular approach to genetic variation and its dynamics in populations has been used to address various problems and it can address the very questions at issue today.
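To give a flavor of what 'evolution by phenotype' means in code, here is a toy forward simulation. It is emphatically not ForSim and does not reflect how ForSim is implemented; it is a bare sketch under simple assumptions: diploid individuals, unlinked biallelic sites with made-up additive effects, Gaussian environmental noise, stabilizing selection on the phenotype, a constant population size, and no mutation. All names and parameter values are hypothetical.

```python
import math
import random


def make_individual(rng, n_loci, freq):
    """Two haplotypes, each a list of 0/1 alleles drawn with frequency `freq`."""
    return [[1 if rng.random() < freq else 0 for _ in range(n_loci)] for _ in range(2)]


def phenotype(ind, effects, env_sd, rng):
    """Additive genetic value plus Gaussian environmental noise."""
    genetic = sum(e * (a + b) for e, a, b in zip(effects, ind[0], ind[1]))
    return genetic + rng.gauss(0.0, env_sd)


def fitness(pheno, optimum, width):
    """Stabilizing selection: fitness falls off with distance from the optimum."""
    return math.exp(-((pheno - optimum) ** 2) / (2.0 * width ** 2))


def gamete(ind, rng):
    """Free recombination: each locus independently passes on one of its two alleles."""
    return [ind[rng.randrange(2)][k] for k in range(len(ind[0]))]


def next_generation(pop, effects, optimum, width, env_sd, rng):
    """Selection acts on whole individuals via phenotype, never on genes directly."""
    weights = [fitness(phenotype(ind, effects, env_sd, rng), optimum, width) for ind in pop]
    children = []
    for _ in range(len(pop)):
        # Parents drawn in proportion to fitness; selfing is not excluded in this toy.
        mother, father = rng.choices(pop, weights=weights, k=2)
        children.append([gamete(mother, rng), gamete(father, rng)])
    return children


def allele_frequencies(pop):
    """Variant allele frequency at each locus."""
    n_loci = len(pop[0][0])
    return [sum(ind[0][k] + ind[1][k] for ind in pop) / (2 * len(pop)) for k in range(n_loci)]


if __name__ == "__main__":
    rng = random.Random(1)
    effects = [0.5] * 10
    pop = [make_individual(rng, 10, 0.5) for _ in range(100)]
    for _ in range(20):
        pop = next_generation(pop, effects, optimum=5.0, width=2.0, env_sd=1.0, rng=rng)
    print(allele_frequencies(pop))
```

Even in this toy, no site has a fitness of its own: each variant's fate depends on the phenotypes of the individuals that happen to carry it, along with everything else they carry and the environmental noise they draw.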

With simulation, you try to get an idea of some phenomenon so complex or extensive that data in hand are inadequate or where there is reason to think proposed approaches will not deliver on what is expected.  If a simulation gives results that don't match the structure of available empirical data reasonably closely, you modify the conditions and run it again, in many cases only requiring minutes and a desk-top computer.  Larger or very much larger simulations are also easily and inexpensively within reach, without waiting for some massive, time-demanding and expensive new technology. Even very large-scale simulation does not require investing in high technology, because major universities already have adequate computer facilities.

Simulation of this type can include features such as:
* Multiple populations and population history, with specifiable separation depth, and with or without gene flow (admixture)
*  The number of contributing genes, their length and spacing along the genome
*  Recombination, mutation, and genetic drift rates, environmental effects, and natural selection of various kinds and intensities to generate variation in populations
* Additive, or function-based non-additive, single or multiple phenotype determination
* Single and multiple related or independent phenotypes
* Sequence elements that do and that don't affect phenotype(s) (e.g., mapping-marker variants)

Such simulations provide
* Deep known (saved) pedigrees
* Ability to see the status of these factors as they evolve, saving data at any point in the past (can even mimic fossil DNA)
* Each factor can be adjusted or removed, to see what difference it makes.
* Testable sampling strategies, including 'random', phenotype-based (case-control, tail, QTL, for GWAS, families and Mendelian penetrance, admixture, population structure effects)
*  Precise testing of the efficacy, or conditions, for predicting retrofitted risks, to test the characteristics of 'precision' personalized medicine.

These are some of the things the particular ForSim system can do, listed only because I know what we included in our particular program.  I don't know much about what other simulation programs can do (but if they are not phenotype-based they will likely miss critical issues).  Do it on a desktop or, for grander scale, on some more powerful platform.  Other features that relate to some of the issues that the currently proposed whole-genome sequencing implicitly raises could be built in by various parameter specification or by program modification, at a minuscule fraction of the cost of launching off on new sequencing.

Looking at this list of things to decide, you might respond, in exasperation, "This is too complicated!  How on earth can one specify or test so many factors?"  When you say that, without even yet pressing the 'Run' key, you have learned a major lesson from simulation!  That's because these factors are, as you know very well, involved in evolutionary and genetic processes whose present-day effects we are being told can be predicted 'precisely'.  Simulation both clearly shows what we're up against and may give ideas about how to deal with it.

Above is a schematic illustration of the kinds of things one can examine by simulation, and check with real data.  In this figure (related to work we've been involved with), mouse strains were selected for 'opposite' trait value, then interbred, and then a founding few from each strain were crossed, and then intercrossed for many generations to let recombination break up gene blocks.  Markers identified in the sequenced parental strains are then used to map variation causally related to the strains' respective inbred trait. Many aspects and details of such a design can be studied with the help of such results, and there are surprises that can guide research design (e.g., there is more variation than the nominal idea of inbreeding and 'representative' strain-specific genome sequencing generally considers, among other issues).

Possibilities in, knowledge out
As noted above, it is common to dismiss simulation out of hand, because it's not real data, and indeed simulations can certainly be developed that are essentially structured to show what the author believes to be true.  But that is not the only way to approach the subject.

A good research simulation program is not designed to generate any particular answer, but just the opposite.  Simulation done properly doesn't even take much time to get to useful answers.  What it gives you is not real data but verisimilitude--when you match real data that are in hand, you can make sharper, focused decisions on what kinds of new data to obtain, or how to sample or analyze them, or, importantly, what they can actually tell you.  Just as importantly, if not more so, if you can't get a good approximation to real data, then you have to ask why.  In either case, you learn.

Because of its low relative cost, the preparatory use of serious-level simulation should be a method of choice in the face of the kinds of genomic causal complexity that we know constitutes the real world. Careful, honest use of simulation to know about nature and as a guide is one real answer to the regularly heard taunt that if someone doesn't have a magic answer about what to do instead, s/he has no right to criticize business as usual.

Simulation, when not done just to cook the books in favor of what one already is determined to do, can show where one needs to look to gain an understanding. It is no more garbage in, garbage out than mindless data collection, but at least when mistakes or blind alleys are found by simulation, there isn't that much garbage to have to throw out before getting to the point.  Well-done simulation is not garbage in, garbage out, but a very fast and cost-effective 'possibilities in, knowledge out'.

Our prediction
We happen to think that life is genetically as complex as it looks from a huge diversity of studies large and small, on various species. One possibility is that this complexity implies there is in fact no short-cut to disease prediction for complex traits. Another is that some clever young persons, with or without major new data discoveries, will see a very different way to view this knowledge, and suggest a 'paradigm shift' in genetic and evolutionary thinking.  Probably more likely is that, if we take the complexity seriously, we can develop a more effective and sophisticated approach to understanding phenogenetic processes, the connections between genotypes and phenotypes.

Tuesday, June 5, 2012

Steal This Book! Computer simulation and scientific theory

Abbie Hoffman
In the riotous protest times of the '70s, leading protester Abbie Hoffman published Steal This Book, "a manual of survival in the prison that is Amerika."  Of course, one must assume that Hoffman wasn't too opposed to the system to decline any royalties, nor that he really meant for copies to be stolen: presumably the idea was to read the book and understand the realities of society at the time.  Then you could choose to accept or fight the system, or at least understand it and what it's doing to you.

We devote a lot of effort in MT to commenting on and, yes, criticizing what we believe are deserving targets in contemporary science, especially as relates to genetics and evolution. We premised MT on ideas in our book of the same name, because we think evolution and an over stress on simplified genetic causal thinking diverts attention from many aspects of biology that we feel are at least as important.

People resist learning some lessons we think they should learn, perhaps largely out of ignorance (though in many ways intentionally not facing what might dampen various vested interests). 

With Brian Lambert, I have developed a highly general and flexible computer simulation program called ForSim, for simulating genetic causation and its evolution.  It has two major purposes. The first is to generate simulated, but realistic, data to test various theories and detection methods for complex phenotypes--such as those so intensely being pursued by GWAS and other methods.   Users can simulate the data and then sample it in various ways (families, case-control studies, etc.) to see how much of the truth one can find, and how; the truth is known because the simulation generates all the data required.  Secondly, the evolution of that genetic architecture within and between populations can be simulated, to understand how genetic effects change.

ForSim is a net-effects program, that omits many important aspects of genetic + environmental causation, such as those that make up the bulk of the book MT.  Thus, it greatly oversimplifies reality. But it tries to be natural in many ways (future additions will explicitly allow simulation of gene networks, developmental biology, and episodic traits like some diseases).

ForSim is a complex, intricate program and most readers of this blog would not be interested in or attempt to use it.  Fine, that's not our point. Our point in mentioning it is that just to see what is involved in complex traits and their evolution is a sobering lesson in why we object to simplistic ideas and rosy promises. 

ForSim execution flow
If one absorbs the message, one should be less sanguine or naive about what is being promised and found (or not) in the real world.  And one can get a sense of why we say what we do!  We did not invent biological complexity or the reasons why gene mapping (GWAS and similar approaches) is struggling as it is (as reflected rather clearly in the flood of papers aggressively praising its dramatic success).

We don't expect you to use ForSim, but if you're interested in seeing just what is involved in even a restricted evolutionary simulation, read the ForSim book!  You don't have to steal it, because the Manual can be downloaded here.  It's free (as is the program for any MT reader who might want to try it).

Again, we're not advertising anything from which we make any monetary or other gain.  We use the program, but just reading the Manual can be very instructive.  We wrote the program, and use it, and talk about it, because one way or another we think everyone, scientists and public alike, should be made aware of the realities of the causal complexity that so often is an inherent part of life.

But....wait!  What exactly is computer simulation?  Can't you simulate anything you want, the way a video game simulates Dungeons and Space Fighters?  Isn't what we really need an improved actual theory, some laws of life that really work well in terms of relating your genes to your traits?  Surprisingly, the answer is yes, you can simulate anything you want, but no, simulation isn't inferior to other kinds of theory even in this same respect.  We'll explain that next time.

Monday, May 4, 2009

How complex is 'complex'?

The word 'complex' is frequently used, though not always as clearly as it might be. In today's genetics arena it means a trait that is the result of multiple genetic elements as well as environmental factors that are usually unknown or not specified, but can include the genetic element's genomic background. Can we get a clearer understanding of this interaction in some way that has not yet been well-explored?

Most complex traits, whose genetic contributors GWAS and related mapping methods are designed to find (see earlier posts), show substantial evidence of being 'genetic' in some sense: there is correlation of the trait among relatives or an association of risk of the trait--like a disease--among family members.

The problem is that despite evidence for genetic involvement, GWAS and other methods have only been able to identify a small fraction of the contributing elements. One response is that we need larger studies. Another is that the objective is not to account for the disease in terms of genes, but to find genetic pathways that are involved.

Most common diseases have increased substantially, if not dramatically, within living memory and more importantly within the time since trustworthy epidemiological data on incidence (rate of new cases per year) or prevalence (fraction of persons affected) have been available.

This would suggest to reasonable people, even including some geneticists, that at least for preventive purposes the major responsible (and avoidable) factors for the disease are environmental, such as exposures to risk factors like toxins, lifestyle changes such as in diet, etc.

A few years ago, the molecular technology infrastructure for mapping studies was laid down, and paid for on the rationale that common genetic variants were likely responsible for these common diseases--and hence that genetics was a right way to approach them. Common variants for common disease (CVCD) became a mantra.

In response to the environmental and other arguments raised even at the time, proponents of CVCD and the investment in the gene-mapping infrastructure (e.g., the HapMap project) said that, yes, environmental factors clearly were involved, but the increase in prevalence was due to their interaction with common genetic susceptibility variants.

Subsequent mapping, including numerous, often huge genomewide association studies, has generally failed to find such variants. The meaning of 'common' can of course be adjusted to fit results, but the bulk of the heritability of these many studied traits remains unexplained. It's a fair question whether these traits are truly complex and largely unmappable, or whether we just haven't studied them enough.

A kind of widespread relevant evidence may be the following. The substantial heritability of common disease as well as normal traits suggests that many genes contribute; the traits are often called 'polygenic' for that reason. But these many genes might individually vary in their effects. For many theoretical and empirical reasons, one would expect some alleles (genetic variants) at one or a few genes, to interact or respond more strongly to changing environmental factors.

If that is the case, then the more important genes that were not identifiable in case-control or family samples before the environmental change should be mappable afterward. That's because those variants would be the main responders to the environmental change, whatever it was. Their individual effects, modest before the change, should be major after it.

Yet, today, after a long list of diseases have had large, rapid increases in prevalence, the GWAS findings are as we have seen: they are not identifying much that is of population-scale importance. On the surface, this suggests that the argument about complexity really is correct: there are, indeed, many genes involved, but they each have a very small net effect on risk. A few are detected whose effects are greater, but they are few and even their effects are only modest.

From this perspective, which is based on data, not theory, secular trends in risk and the failure of GWAS to find CVCD variants are relevant data, and they suggest that complex traits really are basically homogeneous in terms of genetic causation.

Now, if this is true it constitutes material evidence that should change our understanding of the nature of these traits: why would it be that there are generally no major alleles waiting for environmental changes to give them a chance to be expressed? Indeed, isn't that just how natural selection is supposed to work, with environmental change favoring 'good' genetic variants in the population and raising them to high frequency? Those variants should have substantial effect on the trait so the organism carrying the variants would reproduce more successfully.

If our thinking is correct, then this tells us something. Perhaps the networks of which biological traits are built are internally adjusting--strong changes in one part of a pathway network lead to slowing down of others. Yet, secular trends show that the net result can involve major change. It is indeed somewhat difficult to believe that the genetic responses to environmental changes are so internally homogeneous that even after major stimulus none really stands out even when studied in large samples. There must be a message there--if we can but figure it out!

These are just superficial ideas at this point, but they could help direct changes in what we look for, or how we look. We are starting to use an evolutionary simulation program that Brian Lambert in our group has written (see the description of ForSim on Ken's web page for details) to see if this point is correct as we think, or if there is some aspect of genetic control that we are overlooking. Stay tuned for results.