Showing posts with label Computer simulation. Show all posts

Thursday, March 12, 2015

Simulating complexity and predicting the future

Predicting complex disease is the latest genomics flavor of the day. Or rather, it's the old flavor with a new name -- precision medicine.  So, we were pleased to be alerted to a new paper (H/T Peter Tennant and Mel Bartley; "The mathematical limits of genetic prediction for complex chronic disease," Keyes et al., Journal of Epidemiology and Community Health) that addresses the prediction question by simulating a lot of data to look at how plausible it will be to predict complex disease given the wealth of potentially interacting risk factors that will need to be taken into account.  This question is of course particularly timely, given the new million-genomes, precision-medicine effort proposed by President Obama and endorsed by the head of the NIH, among many others.

The Crystal Ball, by John William Waterhouse: scrying in crystal; Wikipedia

A few weeks ago, Ken blogged about the advantages of using computer simulation to probe causal connections in genetics and epidemiology (here and here).  Simulations can be valuable because they allow exploration of complexity with known assumptions built in explicitly and hence testably, and because there are no data or measurement errors (unless introduced intentionally, in which case they're still identifiable).  If the results resemble real data, then one has confidence in the assumptions.  If not, conditions can be changed to explore why not.  Also, things far too complex to be affordably tested in the real world can be simulated; simulation is a fast and inexpensive way to explore the nature of causation in a given context.

Keyes et al. simulate one million populations, with 10,000 individuals per population, to explore how possible it will be to predict complex disease given epistasis (gene-gene interaction) and gene-by-environment interaction. They point out that while genetic and epidemiological studies have been useful for finding correlations between risk factors and disease, they've been less useful for predicting which individuals will develop a given disease.

And, they point out that genome wide association studies (GWAS) have been invaluable for demonstrating that complex diseases are by and large polygenic, and that different subsets of many genes are apparently interacting in individuals who share a disease.  But, they haven't been useful for prediction of complex traits.

But identifying interacting genes, and characterizing gene-by-gene interactions, has proven difficult.  Thus, Keyes et al. write, "[i]n this paper, we use simulated data of one million separate populations to demonstrate the drivers of the association between a germline genetic risk factor and a disease outcome, drawing observations that have implications for personalised medicine and genetic risk prediction."

They first create a hypothetical disease, one that is caused by a germ line genetic variant and environmental exposure to one or more risk factors.  Risk of disease is higher in those exposed to both than the additive effect in individuals exposed to genetic risk or environmental risk alone.  And, importantly, the disease can also be caused in many other ways.  Keyes et al. varied the rate of genetic exposure, environmental exposure, and background prevalence of disease in each of their simulated populations.

They simulated this enormous number of populations in order to accommodate every possible prevalence of the combination of risk factors, from 1 to 100%.  They then compared nine different scenarios combining low, moderate, and high levels of genetic and environmental risk exposure, estimating the risk of disease for those with the risk allele compared with those without it.
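The core of such a design is easy to sketch. Below is a minimal, hedged illustration in the spirit of their setup (the parameter values and the function are ours, invented for illustration, not the paper's): disease occurs at an elevated rate only in people carrying both the risk allele and the interacting exposure, and at a background rate in everyone else.

```python
import random

random.seed(1)

def simulate_population(n, p_gene, p_env, background, joint_risk):
    """Simulate one population: disease occurs with probability
    `joint_risk` in those with both the risk allele and the exposure,
    and at the `background` rate in everyone else."""
    carrier_cases = carriers = noncarrier_cases = noncarriers = 0
    for _ in range(n):
        g = random.random() < p_gene     # risk allele present?
        e = random.random() < p_env      # interacting exposure present?
        diseased = random.random() < (joint_risk if (g and e) else background)
        if g:
            carriers += 1
            carrier_cases += diseased
        else:
            noncarriers += 1
            noncarrier_cases += diseased
    # risk ratio: disease risk in carriers vs non-carriers
    return (carrier_cases / carriers) / (noncarrier_cases / noncarriers)

# The same allele, with the same biology, looks strong when the
# interacting exposure is common and weak when it is rare:
for p_env in (0.5, 0.05):
    print(p_env, round(simulate_population(100_000, 0.3, p_env, 0.01, 0.20), 2))
```

With a common exposure the estimated risk ratio is large; with a rare one it shrinks toward 1 even though the allele's biology is unchanged, which is the paper's central point about context-dependence.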
Using simulations that span the range of potential possible prevalences of genes, environmental factor and unrelated factors, we show that the magnitude of both the risk ratio and risk difference [risk of disease to those exposed to the genetic risk factor vs those not exposed] association between a genetic factor and health outcome depends entirely on the prevalence of two factors: (1) the factors that interact with the genetic variant of interest; and (2) the background rate of disease in the population. These results indicate that genetic risk factors can only adequately predict disease in the presence of common interacting factors, suggesting natural limits on the predictive ability of individual common germline genetic factors in preventative medicine.
And they draw four conclusions.  First, predicting complex disease from genes will continue to be largely unsuccessful unless the environmental context and gene interactions are understood.  Second, it's when background disease rates are low, and environmental risk factors common, that predicting disease from genes is going to be most reliable.  Third, environmental context is important in predicting the effect of genes. And, fourth, the non-replicability of many genotype/phenotype studies is likely due to differing prevalences of genetic and environmental risk factors in the different study populations.  Trait 'heritability' is context dependent, not an inherent characteristic of the trait itself.

Our simulation program, ForSim, referred to earlier, takes a more complex and evolutionarily sophisticated approach, specifying things less explicitly, and can be applied to multiple populations and other scenarios.  But if anything its simulated results, with more realistic causation and variation, suggest that causation will be even less precisely estimable or predictable than in the current paper, whose results are already quite convincing.

We'd just add that while this study is certainly a cautionary tale, the authors don't, in our view, acknowledge the full import of their conclusions.  Every genome is unique and unpredictable, and future environments are unpredictable, even in principle, so that if predicting complex diseases depends on knowing environmental and genomic context, it's not going to be possible.  It may be possible to retrodict complex disease based on understanding past environment and observed genes and genomes, but solving the prediction problem is another question.

Monday, February 9, 2015

Simulation: surprisingly common!

The other day we did a post suggesting that given the commitment to a mega-genomic endeavor based on the belief that genomes are predictive entities about our individual lives generally, we should explore the likely causal landscape that underlies the assumption, with tools including research-based computer simulation.  Simulation can show how genomic causation can most clearly be assessed from sequence data.  By comparison to the empirical project itself, simulation is very inexpensive and trivially fast.  Research simulation has a long history of various sorts, and is a tool of choice in many serious sciences.  Research simulations are not just 'toy model' curiosities, but are real science, because the conclusions and predictions can be tested empirically.

Still, many feel free to dismiss simulations, as we discussed. It seems in a sense too easy--too easy to cook up a simulation to prove what you want to prove, too easy to dismiss the results as unrelated to reality.  But what alternatives do we have?

First, roughly speaking, in the physical sciences there exists very powerful theory based on well-established principles, or 'laws'.  Laws of gravity or motion are classic examples.  A key component of this theory is replicability.  In biology, our theory of evolution and hence of genomic causation is far less developed in terms of, yes, its 'precision'.  Or perhaps it's more accurate to say that as currently understood, genetic theory is fine but we are inaptly applying statistical methods designed for replicable phenomena to things that are not replicable in the way the methods assume (e.g., every organism is different).  In the physical sciences, to a great extent, the theory comes to a new situation under study from the outside; that is, it is already established and is applied externally to the new situation.  Newton's laws, the same laws, very specifically help us predict rates of fall here on earth, and also how to land a robot on a comet far away in the solar system, or even study the relative motion of remote galaxies.

By contrast, evolution is a process of differentiation, not similarity. It has its laws that may be universal on earth, and in their own way comparably powerful, but that are very vaguely general: evolution is a process of ad hoc change based on local genomic variation and local circumstances--and luck.  They do not give us specific, much less precise 'equations' for describing what is happening.

We usually must approach evolution and genomic causation from an internal comparison point of view.  We evaluate data by comparing our empirical samples to what we'd get if the experiment were repeatable and nothing systematic is going on ('null' hypothesis) or if it's more likely that something we specify is (alternative hypothesis).  That is, we use things like significance tests, and make a subjective judgment about the results.  Or, we construct some formulation of what we think is going on, like a regression equation Y = a + bX, where Y is our outcome and X is some measured variable, a is an assumed constant and b is an assumed relative-effect constant per unit of X.  This just illustrates what we have a huge array of versions of.
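For concreteness, here is what that generic machinery amounts to: an ordinary least-squares fit of Y = a + bX, sketched in plain Python (the data points are invented for illustration).

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of a and b in Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# The machinery returns estimates whether or not a linear, replicable
# process actually generated the data:
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # roughly a = 0.05, b = 1.99
```

Note that the fit succeeds, and yields numbers, regardless of whether the linearity assumption has any biological justification.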

But these 'models' are almost always very generic (in part because we have good ready-made algorithms, or even programs, to estimate a and b). We then make some subjective judgment about our estimates of the parameters of this 'model', such as that our a and b estimates are correct to a known extent and are in fact causal factors. One might estimate the value of a or b to be zero, and thus declare, on the basis of some such subjective judgment, that the factor is not involved in the phenomenon.  Such applied statistical approaches are routine in our areas of science, and there are many versions of the basic idea.  These days the estimates are often, if informally and especially in policy decisions, taken to be true parameters of nature.  For example, in this simplistic equation, the assumption is made that the effect of X on Y is linear.  There is usually no substantial justification for that assumption.

In a fundamental sense, statistical analysis itself is computer simulation!  It is not the testing of an a priori theory against data, either to estimate the values of things involved or to test the reality of the theory. The models tested are rarely intended to specify the actual process going on, and they often purposely ignore other factors that may be involved (or, of course, ignore them because they aren't known of).

Proper research simulation goes through the same sort of epistemological process.  It makes some very specific assumptions about how things are and generates some consequent results, which are then compared to real data.  If the fit is good then, as with regular statistical analysis, the simulated model is taken to be informative.  If the fit is not good then, as with regular statistical analysis, the simulation is adjusted (analogous to adding or removing terms, or estimating parameters differently, in a statistical 'model').  Simulation models are not constrained in advance--you can simulate whatever you specify.  But neither are statistical models--you can data-fit whatever 'model' you specify.

There are lots of simulation efforts going on all the time.  Many are using the approach to try to understand the complexity of gene-interaction systems in experimental settings.  There is much of this also in trying to understand details such as DNA biochemistry, protein folding, and the like. There is much simulation in population ecology, and in microbiology (including infectious disease dynamics).  I personally think (because it's what I'm particularly interested in) that not nearly enough is being done to try to understand genetic epidemiology and the evolution of genomically complex traits.  To a great extent, the reason, in my oft-expressed opinion, is the drive to sustain support for very large-scale and/or long-term studies, given the promises that have been made of medical miracles from genetic screening.  Those reasons are political, not scientific.

Analytic vs Descriptive theory
I think that the best way to get closer to the truth in any science is to have a more specific analytic theory, one that describes the actual causal factors and processes relative to the nature of what you're trying to understand.  Such explanations begin with first principles and make specific predictions, as contrasted to the kind of generic descriptive theory of statistical models.  Both approaches are speculative until tested, but generic statistical descriptions generally do not explain the data, and hence have to make major assumptions when extrapolating retro-fitted statistical estimates to prospective prediction.  The latter is the goal of analytic theory, and certainly what is being grandly promised.  Here, 'analytic' can include probabilistic factors including, of course, the pragmatic ones of measurement and sampling errors, etc.  It is an open question whether the nature of life inherently will thwart both of these, leaving us with the challenge to at least optimize the way we use genomic data.

It is not clear if we have current theory that is roughly as good as it gets in regard to explaining evolutionary histories or genomic predictive power.  Life certainly did evolve, and genomes do affect traits in individuals, but how well those phenomena can be predicted by enumerative approaches to variation in DNA sequences (not to mention other factors we currently lump as 'environment') based on classic assumptions of replicability, is far from clear.  At least, at present it seems that as a general rule, with exceptions that can't easily be predicted but can be understood (such as major single-gene diseases or traits with strong effects of the sort Mendel studied in peas), we do not in fact have precise predictive power.  Whether or when we can achieve that remains to be seen.

There is one difference, however:  statistical models cannot be applied until you have gone to the extent and expense of collecting adequate amounts of data.  In truth, you cannot know in advance how much, or often even what type of, data you would need.  That is why, in essence, our collect-everything mentality is so prevalent.  By contrast, prospective research computer simulation can do its work fast, flexibly: it only has to sample electrons!

Computer simulation should not be denigrated by anyone who is, in effect, doing it themselves under another name, such as statistical modeling.  There is plenty of room for serious epistemological discussion about whether or how we are or aren't yet able to understand evolutionary and genomic processes with sufficient accuracy.

Wednesday, February 4, 2015

Exploring genomic causal 'precision'

Regardless of whether or not some geneticists object to the cost or scientific cogency of the currently proposed Million Genomes project, it is going to happen.  Genomic data clearly have a role in health and medical practice. The project is receiving kudos from the genetics community, but it's easy to forget that many questions related to understanding the actual nature of genomic causation, and the degree to which that understanding can in practice lead to seriously 'precise' and individualized predictive or therapeutic medicine, remain at best unanswered.  The project is inevitable if for no other reason than that DNA sequencing costs are rapidly decreasing.  So let's assume the data, and think about what our current state of knowledge tells us we'll be able to predict from it all.

An important scientific (rather than political or economic) point in regard to recent promises, is that, currently, disease prediction is actually not prediction but data-fitting retrodiction.  The data reflect what has happened in the past to bearers of identified genotypes.  Using the results for prediction is to assume that what is past is prologue, and to extrapolate retrospectively estimated risks to the future. In fact, the individual genomewide genotypes that have been studied to estimate past risk will never recur in the future: there are simply too many contributing variants to generate each sampled person's genotype.

Secondly, if current theory underlying causation and measures like heritability is even remotely correct, the individual genomic factors that carry the bulk of the risk are inherited in a Mendelian way but do not as a rule cause traits that way.  Instead, each factor's effects are genome context-specific and act in a combinatorial way with the other contributing factors, including the other parts of the genome, the other cells in the individual, and more.

Thirdly, in general, most risk seems not due to inherited genetic factors or context, because heritability is usually far below 100%.  Risk is due in large part to lifestyle exposures and interactions.  It is very important to realize that we cannot, even in principle, know what current subjects' future environmental/lifestyle exposures will be, though we do know that they will differ in major ways from the exposures of those patients or subjects from whom current retrospective risks have been estimated.  It is troubling enough that we are not good at evaluating, or even measuring, current subjects' past exposures, whose effects we are now seeing along with their genotypes.

In a nutshell, the assumption underlying current 'personalized' medicine is one of replicability of past observations, and statistical assessments are fundamentally based on that notion in one way or another.

Furthermore, most risk estimates used are perforce for practical reasons based essentially on additive models:  add up the estimated risk from each relevant genome site (hundreds of them) to get the net risk.  Depending on the analysis, this leaves little room for non-additive effects, because things are estimated statistically from a population of samples, etc.  These issues are well known to statisticians, perhaps less so to many geneticists, even if there are many reasons, good and bad, to keep them in the shadows. Biologically, as extensive systems analysis shows clearly, DNA functions work by its coded products interacting with each other and with everything else the cell is exposed to.  There is simply no reason to assume that within each individual those interactions are strictly additive at the mechanistic level, even if they are assessed (estimated) statistically from large population samples.
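In code, the additive logic is almost trivially simple, which is part of both its appeal and its limitation. A minimal sketch (the sites and effect sizes below are hypothetical, not from any real study):

```python
# Hypothetical per-site effect estimates, of the kind a GWAS might supply.
effect_sizes = {"site_1": 0.12, "site_2": -0.05, "site_3": 0.30}

def additive_risk_score(genotype):
    """Sum per-site effects weighted by allele count (0, 1, or 2).
    By construction there is no term for interactions among sites,
    or between any site and the environment."""
    return sum(effect_sizes[site] * count for site, count in genotype.items())

person = {"site_1": 2, "site_2": 1, "site_3": 0}
print(additive_risk_score(person))  # 2*0.12 + 1*(-0.05) + 0*0.30, about 0.19
```

Whatever non-additive biology is at work inside the cell, a score of this form can only ever report a weighted sum.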

For these and several other fundamental reasons, we're skeptical about the million genome project, and we've said that upfront (including in this blog post).  But supporters of the project are looking at the exact same flood of genomic data we are, and seeing evidence that the promises of precision medicine are going to be met.  They say we're foolish, we say they're foolish, but who's right?  Well, the fundamental issue is the way in which genotypes produce phenotypes, and if we can parameterize that in some way, we can anticipate the realms in which the promised land can be reached, ways in which it is not likely to be reached, and how best to discriminate between them.

Simulation
Based on work we've done over the past few years, one avenue we think should be taken seriously, which can be done at very low cost and could potentially save a large amount of costly wheel-spinning, is computer simulation of the data and of the approaches one might take to analyze it.

Computer simulation is a well-accepted method of choice in fields dealing with complex phenomena, as in chemistry, physics, and cosmology.  Computer simulation allows one to build in (or out) various assumptions, and to see how they affect results.  Most importantly, it allows testing whether the results match empirical data. When data are too complex, total enumeration of factors, much less analytical solutions to their interactions is simply not possible.

Biological systems have the kind of complexity that these other physical-science fields have to deal with.  A good treatment of the nature of biological systems, and their 'hyper-astronomical' complexity, is Andreas Wagner's recent book The Arrival of the Fittest.  This illustrates the types of known genetically relevant complexity that we're facing.  If simulation of cosmic (merely 'astronomical') complexity is a method of choice in astrophysics, among other areas, it should be legitimate for mere genomics.

A computer simulation can be deterministic or probabilistic, and with modern technology can mimic most of the sorts of things one would like to know in the promised miracle era of genomewide sequencing of everything that moves.  Simulation results are not real biology, of course, any more than simulated multiple galaxies in space are real galaxies.  But simulated results can be compared to real data.  As importantly, with simulation, there is no measurement or ascertainment error, since you know the exact 'truth', though one can introduce sampling or other sorts of errors to see how they affect what can be inferred from imperfect real data.  If simulated parameters, conditions, and results resemble the real world, then we've learned something.  If they don't, then we've also learned something, because we can adjust the simulations to try to understand why.

Many sneer at simulation as 'garbage in, garbage out'.  But that's a false defense of relying on empirical data, which we know are loaded with all sorts of errors.  Just as with simulation, an empiricist can design samples and collect empirical data in garbage-in, garbage-out ways, too.

Computer simulation can be done at a tiny fraction of the cost of collecting empirical data. Simulation involves no errors or mistakes, no loss or inadvertent switching of blood samples, no problems due to measurement errors or imprecise definition of a phenotype, because you get the exact data that were simulated.  It is very fast if a determined effort is made to do it.  Even if the commitment is made to collect vast amounts of data, one can use simulation to make the best use of them.

Most simulations are built to aid in some specific problem.  For example, under (say) a model of some small number of genes, with such-and-such variant frequency, how many individuals would one need to sample to get a given level of statistical power to detect the risk effects in a case-control study?
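A hedged sketch of that kind of power calculation, with invented parameter values and a simple two-proportion z-test standing in for whatever association test a real study would use:

```python
import random
from math import sqrt

random.seed(2)

def power_case_control(n_per_arm, carrier_freq, risk_ratio,
                       n_sims=500, z_crit=1.96):
    """Monte Carlo power estimate for a case-control study of one risk
    allele.  Carrier frequency among cases follows from Bayes' rule
    given the assumed risk ratio; each simulated study is analyzed
    with a two-proportion z-test.  All values here are illustrative."""
    p_ctrl = carrier_freq
    p_case = (carrier_freq * risk_ratio
              / (carrier_freq * risk_ratio + 1 - carrier_freq))
    hits = 0
    for _ in range(n_sims):
        cases = sum(random.random() < p_case for _ in range(n_per_arm))
        ctrls = sum(random.random() < p_ctrl for _ in range(n_per_arm))
        p1, p2 = cases / n_per_arm, ctrls / n_per_arm
        pooled = (cases + ctrls) / (2 * n_per_arm)
        se = sqrt(2 * pooled * (1 - pooled) / n_per_arm)
        if se > 0 and abs(p1 - p2) / se > z_crit:
            hits += 1
    return hits / n_sims  # fraction of simulated studies that 'detect' the allele

# A modest effect needs far larger samples than a strong one:
print(power_case_control(500, 0.2, 1.2))  # weak effect: low power
print(power_case_control(500, 0.2, 2.0))  # strong effect: near-certain detection
```

Running such a sketch before collecting a single sample already tells you whether a proposed design has any realistic chance of detecting the effect it assumes.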

In a sense, that sort of simulation is either quite specific, or has in some cases been designed to prove some point, such as to support a proposed design in a grant application.  These can be useful, or they can be transparently self-serving.  But there is another sort of simulation, designed for research purposes.

Evolution by phenotype
Most genetic simulations treat individual genes as evolving in populations.  They are genotype-based in that sense, essentially simulating evolution by genotype.  But evolution is phenotype-based: individuals as wholes compete, reproduce, or survive, and the genetic variation they carry is or isn't transmitted as a whole.  This is evolution by phenotype, and is how life actually works.

There is a huge difference between phenotype-based and gene-based simulation, and the difference is highly pertinent to the issues presently at stake.  That is because it is when multiple genetic variants change under drift or natural selection (almost always it's both, with the former having the greater effect) that we get the kind of causal complexity and elusive gene-specific causal effects that we clearly observe.  And environmental effects need to be taken into account directly as well.
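To make the distinction concrete, here is a toy evolution-by-phenotype simulation (not ForSim, just a sketch with made-up parameters): selection and drift act on whole organisms via their phenotypes, and individual allelic effects merely ride along.

```python
import random

random.seed(3)

N_LOCI, POP_SIZE, GENERATIONS = 20, 200, 100

def phenotype(genome):
    """Phenotype is the whole-organism sum of many small allelic effects
    plus an environmental term; selection sees only this total, never
    any individual locus."""
    return sum(genome) + random.gauss(0, 1.0)

# Haploid genomes for simplicity: each locus contributes 0 or a small effect.
pop = [[random.choice([0, 0.2]) for _ in range(N_LOCI)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    # Truncation selection on phenotype: the fitter half reproduces,
    # with drift entering through random sampling of parents.
    ranked = sorted(pop, key=phenotype, reverse=True)
    parents = ranked[:POP_SIZE // 2]
    pop = []
    for _ in range(POP_SIZE):
        mom, dad = random.sample(parents, 2)
        child = [random.choice(pair) for pair in zip(mom, dad)]  # free recombination
        if random.random() < 0.01:                               # occasional mutation
            i = random.randrange(N_LOCI)
            child[i] = 0.2 - child[i]                            # flip 0 <-> 0.2
        pop.append(child)

mean_trait = sum(sum(g) for g in pop) / POP_SIZE
print(round(mean_trait, 2))  # the trait rises, but via different loci in different runs
```

Re-run this with different seeds and the trait responds to selection every time, while the particular loci that end up fixed differ from run to run: gene-specific effects are elusive even though the phenotype-level process is robust.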

I know this not just by observing the data that are so plentiful but because I and colleagues have developed an evolution by phenotype simulation program (it's called ForSim, and is freely available and open-source, so this is not a commercial--email me if you would like the package).  It is one of many simulation packages available, which can be found here: NCI link.  Our particular approach to genetic variation and its dynamics in populations has been used to address various problems and it can address the very questions at issue today.

With simulation, you try to get an idea of some phenomenon so complex or extensive that data in hand are inadequate or where there is reason to think proposed approaches will not deliver on what is expected.  If a simulation gives results that don't match the structure of available empirical data reasonably closely, you modify the conditions and run it again, in many cases only requiring minutes and a desk-top computer.  Larger or very much larger simulations are also easily and inexpensively within reach, without waiting for some massive, time-demanding and expensive new technology. Even very large-scale simulation does not require investing in high technology, because major universities already have adequate computer facilities.

Simulation of this type can include features such as:
* Multiple populations and population history, with specifiable separation depth, and with or without gene flow (admixture)
* The number of contributing genes, their length and spacing along the genome
* Recombination, mutation, and genetic drift rates, environmental effects, and natural selection of various kinds and intensities to generate variation in populations
* Additive, or function-based non-additive, single or multiple phenotype determination
* Single and multiple related or independent phenotypes
* Sequence elements that do and that don't affect phenotype(s) (e.g., mapping-marker variants)

Such simulations provide
* Deep known (saved) pedigrees
* Ability to see the status of these factors as they evolve, saving data at any point in the past (can even mimic fossil DNA)
* Ability to adjust or remove each factor, to see what difference it makes
* Testable sampling strategies, including 'random', phenotype-based (case-control, tail, QTL, for GWAS, families and Mendelian penetrance, admixture, population structure effects)
* Precise testing of the efficacy, or conditions, for predicting retrofitted risks, to test the characteristics of 'precision' personalized medicine

These are some of what the particular ForSim system can do, listed only because I know what we included in our particular program.  I don't know much about what other simulation programs can do (but if they are not phenotype-based they will likely miss critical issues).  Do it on a desktop or, for grander scale, on some more powerful platform.  Other features that relate to some of the issues that the current whole genome proposed sequence implicitly raises could be built in by various parameter specification or by program modification, at a minuscule fraction of the cost of launching off on new sequencing.

Looking at this list of things to decide, you might respond, in exasperation, "This is too complicated!  How on earth can one specify or test so many factors?"  When you say that, without even yet pressing the 'Run' key, you have learned a major lesson from simulation!  That's because these factors are, as you know very well, involved in evolutionary and genetic processes whose present-day effects we are being told can be predicted 'precisely'.  Simulation both clearly shows what we're up against and may give ideas about how to deal with it.

Above is a schematic illustration of the kinds of things one can examine by simulation, and check with real data.  In this figure (related to work we've been involved with), mouse strains were selected for 'opposite' trait values and inbred; then a founding few from each strain were crossed, and the offspring intercrossed for many generations to let recombination break up gene blocks.  Markers identified in the sequenced parental strains were then used to map variation causally related to the strains' respective inbred traits. Many aspects and details of such a design can be studied with the help of such results, and there are surprises that can guide research design (e.g., there is more variation than the nominal idea of inbreeding and 'representative' strain-specific genome sequencing generally considers, among other issues).

Possibilities in, knowledge out
As noted above, it is common to dismiss simulation out of hand, because it's not real data, and indeed simulations can certainly be developed that are essentially structured to show what the author believes to be true.  But that is not the only way to approach the subject.

A good research simulation program is not designed to generate any particular answer, but just the opposite.  Simulation done properly doesn't even take much time to get to useful answers.  What it gives you is not real data but verisimilitude--when you match real data that are in hand, you can make sharper, focused decisions on what kinds of new data to obtain, or how to sample or analyze them, or, importantly, what they can actually tell you.  Just as importantly, if not more so, if you can't get a good approximation to real data, then you have to ask why.  In either case, you learn.

Because of its low relative cost, the preparatory use of serious-level simulation should be a method of choice in the face of the kinds of genomic causal complexity that we know constitutes the real world. Careful, honest use of simulation to know about nature and as a guide is one real answer to the regularly heard taunt that if someone doesn't have a magic answer about what to do instead, s/he has no right to criticize business as usual.

Simulation, when not done just to cook the books in favor of what one already is determined to do, can show where one needs to look to gain an understanding. It is no more garbage in, garbage out than mindless data collection, but at least when mistakes or blind alleys are found by simulation, there isn't that much garbage to have to throw out before getting to the point.  Well-done simulation is not garbage in, garbage out, but a very fast and cost-effective 'possibilities in, knowledge out'.

Our prediction
We happen to think that life is genetically as complex as it looks from a huge diversity of studies large and small, on various species. One possibility is that this complexity implies there is in fact no short-cut to disease prediction for complex traits. Another is that some clever young persons, with or without major new data discoveries, will see a very different way to view this knowledge, and suggest a 'paradigm shift' in genetic and evolutionary thinking.  Probably more likely is that, if we take the complexity seriously, we can develop a more effective and sophisticated approach to understanding phenogenetic processes, the connections between genotypes and phenotypes.