The Mermaid's Tale: Is genomic causation deterministic or probabilistic? Some further musings about a central issue in biology

NOTE: This post is a musing about issues of causation that I think are fundamental to the nature of life, but are (to me, at least) elusive. In that sense, these are notes to myself, to get thoughts on 'paper' as they race through, in case they contain more than baloney. But I would welcome any reflections by readers (if politely expressed!).

We have recently done a series of posts (here's the first in the series of four) on whether one should 'believe' in genetic determinism in life. The issue arises for many reasons, but not least because of the way that we infer the mode of causation of ordinary as well as disease phenotypes (traits) from individuals' genotypes (or, for multiple sites along the genome and the two copies of the genome in diploid species like animals and plants, their 'genometypes').

Determinism and genes
It's not fashionable these days to express causation in billiard-ball-like deterministic terms. Instead, whether rightly or not, we use probabilistic causation (a genometype, or a part of it, confers some 'risk' rather than certainty of a given outcome). Yet, using modern terminology, Mendel showed us that for the DNA sequences inherited at a given place in the genome (a 'gene' or functional genomic unit), causation of traits (those he carefully selected to study, like green or yellow, smooth or wrinkled peas) is highly deterministic: if a plant inherited a specific combination, its peas essentially always reflected the effects of that combination. The causation is, for practical purposes at least, deterministic. For some human traits, including important diseases, things seem to be 'Mendelian' in the same way, and, as we say, 'segregate' in families in a Mendelian way.

For many traits, however, this doesn't work. The trait values (like height or blood pressure) are correlated among relatives in ways that clearly suggest genetic causative factors. There is every reason to think this is because many genes contribute jointly to the trait. But because the effects of the contributing gene(s) don't segregate easily, it is hard to find them.

For this reason, omics strategies, like genomewide 'mapping' technologies, are used to scan the entire genome to find every place where sequence variation is associated with trait variation in a sample from the population. But we know this only identifies a small fraction of the overall evidence for inherited factors. And it uses a statistical approach to causation that itself, rather than what's going on at the gene level, may determine how we interpret the results. The methods estimate the probabilistic effects of a given site on the trait. We can then take such data for all tested sites and estimate a person's overall genometypic risk, by methods that, in essence, sum up what we have found at each site, each treated as having independent, probabilistic effects. (There are some modifications of this latter statement, but they don't change the gist of it.)

This suggests that each genomic site contributes probabilistically, and that the scrambling of all these contributors among people is what generates the generally smooth (e.g., 'normal' or bell-shaped) distribution of traits among people in the population (some with high, some low, and most with intermediate height, weight, blood pressure, etc.). We treat the trait distribution as random draws by each individual, of variants at each spot in the genome, from a 'gene pool' of their respective frequencies, and sum these individual effects to get the net result. (Here we're ignoring environments, a very important factor but one beside our current point.)
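To make that summing-of-draws picture concrete, here is a minimal sketch of the additive model, not anyone's actual analysis pipeline. The number of sites, the allele frequencies, and the effect sizes are all made-up assumptions, and the function name simulate_additive_traits is hypothetical.

```python
import random
import statistics

def simulate_additive_traits(n_people=2000, n_sites=200, seed=1):
    """Draw each person's alleles from a 'gene pool' and sum per-site effects."""
    rng = random.Random(seed)
    # Assumed (invented) allele frequencies and per-allele effects, one pair per site.
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_sites)]
    effects = [rng.gauss(0.0, 1.0) for _ in range(n_sites)]
    traits = []
    for _ in range(n_people):
        total = 0.0
        for p, beta in zip(freqs, effects):
            # Diploid genotype: variant-allele count over two copies (0, 1, or 2).
            count = (rng.random() < p) + (rng.random() < p)
            total += beta * count
        traits.append(total)
    return traits

traits = simulate_additive_traits()
mean = statistics.mean(traits)
sd = statistics.stdev(traits)
# For a roughly bell-shaped distribution, about two thirds of values
# fall within one standard deviation of the mean.
within_one_sd = sum(abs(t - mean) <= sd for t in traits) / len(traits)
print(f"mean={mean:.2f} sd={sd:.2f} fraction within 1 sd={within_one_sd:.2f}")
```

The point of the sketch is only that independent per-site contributions, summed, are enough to produce a smooth bell-like spread. It says nothing about whether nature actually works this way.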

Yet we have a problem: is the genotype of a person really probabilistically determinative, or are its causative effects stronger, even if only individually partial to the final trait? Since each person has a unique genotype, we have only a single observation of that genotype and its result ('affected' or not, specific blood pressure or weight). The probabilities we use to estimate net risk for the observed genotype are assembled as if of independent units, because that is built into the results by the statistical methods used, which may not be how nature actually works! The reason is that each variant is found both in cases and controls, but at different frequencies, yielding a probabilistic association with the trait. But how we measure things should not be an end in itself but a tool for getting to the truth.

We know there is a problem. If you look at inbred animals, where you really can get many replicates of essentially the same genotype, the traits are often quite specific and don't vary much. We discussed this in the latter parts of our recent series. This shows, clearly, that individual genotypes can be very highly determinative (again, with some sources of error, and the huge caveat about environments, which are highly standardized in the lab).

Now if this is the case, then every person (every chimp, robin, fruit fly, or oak) has a unique, determinative genotype, and it is the assembly of these fixed causal packets among individuals that generates the 'normal' trait variation in the population. This seems strange, and perhaps inconsistent with the fact that the same distribution can be explained, probabilistically, by assembling each person's genome with the individual contributing variants treated as independent probabilistic units. The latter suggests that these genes really do rattle around more or less independently, to generate the trait and can be used to generate personalized genomic predictions. But that's incorrect.

No matter how complex the interactions among individual parts of the genome, since recombination scrambles the elements, the resulting population of wholly deterministic genometypes will also have a similar kind of trait distribution. But that is not due to an assemblage of individually probabilistic elements, nor is it itself necessarily even probabilistic--each genometype is likely to be highly causative in a given environment, as in inbred strains of animals and plants.
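As a toy illustration of this point, the sketch below uses a genotype-to-trait rule with no randomness in it at all, and one that is deliberately non-additive (each site's effect depends on its neighbor). The rule, the weights, and the names trait_from_genometype and random_genometype are invented for illustration. Random assortment of alleles stands in, crudely, for recombination.

```python
import random
import statistics

N_SITES = 100
_weights_rng = random.Random(7)
# Hypothetical interaction weights: one per pair of neighboring sites.
PAIR_WEIGHTS = [_weights_rng.gauss(0.0, 1.0) for _ in range(N_SITES - 1)]

def trait_from_genometype(genotype):
    """Fully deterministic, non-additive map from genotype to trait value."""
    return sum(w * a * b for w, a, b in zip(PAIR_WEIGHTS, genotype, genotype[1:]))

def random_genometype(rng):
    """Crude stand-in for recombination: variant-allele counts (0, 1, 2) per site."""
    return [(rng.random() < 0.5) + (rng.random() < 0.5) for _ in range(N_SITES)]

rng = random.Random(11)
population = [random_genometype(rng) for _ in range(2000)]

# An outbred population: every individual is a different, deterministic packet.
outbred_traits = [trait_from_genometype(g) for g in population]

# An 'inbred strain': many replicates of one and the same genotype.
inbred_traits = [trait_from_genometype(population[0]) for _ in range(50)]

print("outbred spread (sd):", round(statistics.stdev(outbred_traits), 2))
print("distinct trait values among inbred replicates:", len(set(inbred_traits)))
```

Under these assumptions the population shows a spread of trait values even though nothing probabilistic happens inside any individual, while the replicates of a single genotype are identical.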

Recombination, evolution, and multifactorial causation
We know that individual genetic elements are exchanged by the process of recombination, between the differing copies each new parent had inherited from its parents. And population genetics is the study of how the frequencies of these elements change over time, because of mutation, natural selection, and the chance aspects of reproduction.

These considerations, all well understood, explain the relative frequencies of the elements, and the frequencies of their expected combinations in an infinite population. In a real, that is finite, population, only a historically driven and trivial subset of those possible combinations (ever changing because of mutation, birth, and death) will ever exist, much less be captured in a given sample.
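A back-of-envelope calculation shows how small that subset is. The figures here are assumptions chosen only for illustration: 100 variable sites, each with two alleles (so three diploid genotypes per site), and a population of eight billion.

```python
def possible_genotypes(n_sites):
    """Distinct diploid genotypes across n biallelic sites (3 per site)."""
    return 3 ** n_sites

# Hypothetical numbers, for illustration only.
n_sites = 100
population_size = 8_000_000_000

combos = possible_genotypes(n_sites)
# Upper bound: even if every individual carried a different combination.
max_fraction_realized = population_size / combos

print(f"possible combinations: {combos:.3e}")
print(f"largest fraction a population this size could hold: {max_fraction_realized:.3e}")
```

Real genomes vary at far more than 100 sites, so the realized fraction would be smaller still.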

But if this tells us something about what combinations of elements will be found, which whole genome sequencing can tell us explicitly (in principle at least), it tells us nothing about how the elements interact to produce organisms' traits. Molecular genetics can do experiments and identify interacting partners, networks, and the like--we ourselves along with countless hundreds of biologists are working out such things. Even that exploration of mechanism in stereotypical settings like cell culture, however, cannot tell us about what the variation in these elements in each copy of the genome--each genometype--will do to the resulting phenotype in a population of individuals if each of their genotypes is perfectly deterministic. Since each genometype is unique in history, it isn't clear where such information can come from. Unless we can cleverly get it from the approximation we can get from studying inbred strains, or something like that.

Macroscale rendering of microscale determinism...
If these thoughts are on track, they could have a very important consequence: We perhaps can't parse a genometype's effects with current statistical methods, even in principle, because the parts aren't separable. If so, neither can we legitimately treat its effects probabilistically: we only get those probabilities by analyzing the parts separately! Thus, the risks assigned by individual genome predictors (e.g., companies that do this) are, in that sense, inaccurate to an unknown, and with current methods, unknowable extent, if not downright bogus. And this has implications for clinical reporting of 'incidental' genetic findings, as we posted on yesterday.

The exceptions that prove the rule don't really do that: they are the single-site effects that are so strong on their own that other effects don't matter, even though we know they must be there. We have many examples, though mostly they're rare. And we also know that even in these, environments can make a huge difference in outcome. But they don't really tell us about the nature of causation and are not really relevant for the bulk of outcomes that, as noted above, don't 'segregate' in families, etc.

This seems to us to be a serious issue in the epistemology of life viewed from a genomic point of view. Even if things are, causally, truly probabilistic, we can't access enough information to see how. Instead, we apply a method that may turn truly deterministic causation into what appears to be probabilistic causation. One hates to borrow from other sciences, but this has something of the aroma of quantum mechanical uncertainties about it, at least in some logical senses. Is a population a probabilistic distribution of deterministic possibilities, in which the observed trait in an individual 'collapses' the distribution to a point--that particular trait value--but, by being unique, makes it impossible to recapture those possibilities?

At the same time, and curiously, organisms are microscapes of deterministic factors rendered on the macroscale: they have millions of cells, each expressing genes in ways that involve millions of molecular interactions that are individually probabilistic at the fundamental level. Different individuals have different numbers of cells, different developmental stuttering, and so on. And then there is somatic mutation that makes every cell's genome sequence different at least to some extent. And there are traits in inbred strains that have high probability or delayed onset, but not certainty, among the animals in a given strain (which we said in our recent series may provide good opportunities to try to understand why).

The apparently highly deterministic causation of individual inbred genotypes seems itself to be the net result of so many probabilistic interactions that a kind of 'law of large numbers' makes the net results very stable. Still, as the result seems similar for large species as well as small--if our impression of the literature, admittedly impressionistic, is accurate--it may not just be a matter of the number of cells, but some other factors may exist that constrain outcomes. This is something to investigate specifically.
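The 'law of large numbers' idea can be sketched in a few lines. This is a cartoon, not a model of development: each 'cell' independently does or does not show some event with a fixed probability, and the organism's trait is just the fraction of cells that do. The probability, the cell counts, and the function names are all assumptions made up for the example.

```python
import random
import statistics

def organism_trait(n_cells, p_on, rng):
    """Fraction of cells in which a hypothetical probabilistic event occurs."""
    return sum(rng.random() < p_on for _ in range(n_cells)) / n_cells

def spread_across_replicates(n_cells, n_animals=200, p_on=0.3, seed=3):
    """Standard deviation of the trait among genetically identical replicates."""
    rng = random.Random(seed)
    values = [organism_trait(n_cells, p_on, rng) for _ in range(n_animals)]
    return statistics.stdev(values)

spreads = {n: spread_across_replicates(n) for n in (10, 1000, 10000)}
for n_cells, sd in spreads.items():
    print(f"{n_cells:>6} cells per animal: sd among replicates = {sd:.4f}")
```

In this cartoon the replicate-to-replicate spread shrinks as the number of cells grows, which is the sense in which many probabilistic events can add up to something that looks deterministic. Whether cell number alone accounts for the stability of real inbred strains is exactly the open question raised above.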

If these points are correct, they show deep problems in trying to understand the variation in life from a genomic causal point of view.

Or, maybe these thoughts just reflect something disagreeable that I ate.

Tuesday, June 4, 2013
