In genetics and epidemiology, we want to explain the causes of diseases, and we want to be able to predict who will get them. We generally do this statistically, by working out the probability that we are right in our guesses about alleles or environmental causal factors. But this is problematic for a number of reasons.
The concept of probability and the term itself have always been elusive or variable in meaning. But they refer to collections of objects and the relative proportions of some specified component, or alternatively to the distribution of outcomes of repeated identical trials. They work great for truly repeatable phenomena such as flipping coins or rolling dice or sampling colored balls from the proverbial urn (actually, there are problems even here, of both practical and theoretical sorts, but they are minor relative to the point we wish to make).
In discussions of the phenomena pertaining to evolutionary genetics, 'probability' arises all the time. We have genetic drift--the 'probability' of change in frequency of a given allele in a population of specified size; natural selection and evolutionary fitness--the relative probability of the bearer of a given genotype to reproduce relative to the probability of alternative genotypes (adaptive fitness); the probability of migration and hence gene flow, and much more.
The same sorts of things arise in biomedical genetics: Mendelian inheritance--the probability of a given allele being transmitted to a given offspring; the relative probability of a disease arising in a person of a given genotype; the probability of lung cancer in a smoker (per cigarette consumed, etc.); the probability of having a given stature or blood pressure in a person of a given genotype, and much more.
Such things are as routinely discussed as the daily news, or what some guest said on The Daily Show. But there is a problem. The meaning of the term is unclear, often to the user as well as hearer. The term is deeply built into many areas of evolutionary and causative genetics, our means of drawing conclusions from both observational and experimental data. But upon closer scrutiny, the term is being used based on highly questionable or hopeful assumptions that at best only apply to a subset of situations.
Replicability and probability: p-eeing into the wind
The standard way to represent a probability is by a symbol, often the letter p. Once you have identified this value, it is easy to write it into various equations: for example, the probability of observing 3 Heads in a row is p^3 (p cubed), where p is the probability of Heads on any given flip. Or suppose you assume that your data involve sampling from the same urn filled with a certain proportion of red and blue balls (or, say, of diabetics or tall people in a sample from a population of people with a given genotype). If the probability of getting your result just by chance (that is, with no genetic causation involved), p, is less than 0.05, then you declare to the world that something unusual is going on--that is, your research has found something (such as that the genotype is a causal factor for diabetes or tall stature). But if your actual set-up is not something simple like balls in an urn, or dice, or that sort of thing, then what is it that is really happening? How reliable or interpretable are these probabilities?
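To make the coin and urn arithmetic concrete, here is a minimal sketch in Python. The function names and the numbers (100 draws, 60 red, an urn assumed to be half red) are hypothetical choices for illustration, not data from any study. The point is only that the calculation is trivial once a fixed, repeatable p is assumed; whether that assumption holds is the real question.

```python
from math import comb


def prob_heads_in_a_row(p: float, k: int) -> float:
    """Probability of k Heads in a row, assuming the same p on every flip."""
    return p ** k


def binomial_upper_tail(n: int, k: int, p: float) -> float:
    """Probability of seeing k or more 'red' draws out of n,
    assuming every draw is independent with the same probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Three Heads in a row with a fair coin: p cubed.
print(prob_heads_in_a_row(0.5, 3))  # 0.125

# Hypothetical urn: 60 red balls in 100 draws, if the urn were really half red.
tail = binomial_upper_tail(100, 60, 0.5)
print(tail, tail < 0.05)  # compare against the conventional 0.05 cutoff
```

Everything here rests on the `p` passed in being the same for every flip or draw. If each draw effectively comes from a different urn, the number printed is still a number, but it no longer means what we say it means.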
The assumed sort of regular repeatable condition, in the real evolutionary and biomedical world where we live, is usually a fiction! That doesn't mean there is no truth involved. It means that we think there is some truth in how we've organized our description of what's going on, and in a somewhat circular way we're using our data to check that. Unfortunately, we usually don't know even what we think is going on in a rigorous enough way for the p's we employ to be accurate, or to put it more fairly, they are inaccurate to an unknown extent. Is that extent even knowable in principle?
In the areas of biology and health discussed here, we are usually far from having such knowledge. We use 'p' values or their statistical equivalents all the time, both for the processes (e.g., risk of some disease or outcome) and for unusualness in support of our causal assertions (e.g., statistical significance).
In most or perhaps every case, we are making a strong underlying assumption about the real world that it is our objective to understand. That assumption is one of replicability--that is, that we are dealing with repeatable events, often with some ‘random’ aspect of causation (itself usually a vague assumption not seriously explained). That in turn implies that there truly exist some p-values or probability distributions underlying what we are studying. Repeatability is a basic assumption, what one might even call a tenet about the nature of the organization of the world we are studying.
In what sense do such distributions or probabilities actually exist? Is the world of our interest organized, symmetric, and repeatable enough for the models being tested to be realistic? If the cosmos were purely deterministic (e.g., no truly probabilistic quantum phenomena), then we would always be sampling from worlds of proportions and our probability values would reflect repeated sampling events ('random' assuming we really knew what that meant in practice).
But suppose to the contrary that to an important extent each event is unique, even if perhaps wholly predictable had we enough information, and that it is literally impossible for us to be both in the world and have enough information about the world to make such predictions. Perhaps, either truly or pragmatically, the 'probabilities' underlying so much of what we are about in modern genetics, and, of course, the promises being made, simply don't exist. To the extent that's true, the edifice of probabilistic science and statistical inference is mistaken or misleading, to an unknown and probably unknowable extent.
Occasionality: a more appropriate alternative concept--where there's no oh!
When many factors contribute to some measured event, and these are either not all known or measured, or in non-repeatable combinations, or not all always present, so that each instance of the event is due to a unique context-dependent combination, we can say that it is an ‘occasional’ result. In the usual parlance, the event occasionally happens and each time the conditions may or may not be similar. This is occasionality rather than probability, and there may not be any 'o-value' that we can assign to the event.
This is, in fact, what we see. Of course, regular processes occur all around us, and our event will involve some regular processes, just not in a way to which probability values can be assigned. That is, the occasionality of an event is not an invocation of mystic or immaterial causation. The word merely indicates that instances of the event are individually unique to an extent not appropriately measured, or not measured with knowable accuracy or approximation, by probabilistic statistical (or tractable deterministic) approaches. The assumption that the observations reflect an underlying repeated or repeatable process is inaccurate to such an extent as to undermine the idea of estimation and prediction upon which statistical and probabilistic concepts are based. The extent of that inaccuracy is itself unknown or even unknowable.
There are clearly genetic causal events that are identifiable and, while imperfect because of measurement noise and other unmeasured factors, sufficiently repeatable for practical understanding in the usual way and even treated with standard probability concepts. Some variants in the CFTR gene and cystic fibrosis fall into that category. Enough is known of the function of the gene and hence of the causal mechanism of the known allele that screening or interventions need not take into account other contextual factors that may contribute to pathogenesis but in dismissible minor ways. But this seems to be the exception rather than the rule. Based on present knowledge, I would suggest that that rule is occasionality.
One problem clearly is definitional. Where causation has traditionally been assigned to some specific outcome, we now realize that outcomes, like causes, constitute a spectrum or syndrome of effects, so that we have as much variation in 'the' outcome as in the input variables. Clearly things are not working!
Occasionality can be vaguely viewed as something like a rubbery lattice of non-rigid interacting factors, with some connections and factors (nodes) missing or added from time to time without known (or knowable?) causal processes (e.g., what new mutations might lead part of the genome to contribute to some spectrum of phenotype outcomes?), and with connecting rungs being broken or new ones added as the lattice jostles down a roiling stream in and around obstacles. Particular vortices, or outcomes, are sets of vertices (traits) that may or may not bob, occasionally, up to the surface individually or in different combinations to be measured as 'outcomes'. This is at least an attempt at a heuristic way to envision the elusive fish the biological community is trying to land.
Metaphysics and physics envy
It has often been observed that biology's love affair with the kinds of mathematical/statistical approaches being taken is a form of physics envy. It's an aspect of the belief system that pervades both evolutionary and genetic thinking. It avoids the somewhat metaphysical reality, that we know life involves emergent properties that are not enumerable in ways ordinary physical and chemical systems seem (at least to outsiders like us, and at the macro-scale) to involve. If what is wrong or missing (if anything) were clear, these areas of the life sciences would be doing things differently.
This may sound far too vague and airy, but the evidence is all around us. Last week, as we wrote about here, even 40 or more years of nutritional studies, very sophisticated and extensive, are reported to have been 'wrong' in regard to the risk of dietary cholesterol and salt consumption. 'Wrong' means not supported by recent research. Even the nature of blood levels of many factors like these seems ephemeral; that is, findings of one study are contradicted by another study. Which (if any) are to be believed? Why do we think our new studies or methods obsolesce the older ones, when in essence there aren't very substantial differences? Why haven't we got solid, replicable, clear-cut causal understanding?
The difference between occasionality and probability is somewhat like the difference between continuous and discontinuous (quantitative vs qualitative) variation. The challenge is to understand and make inferences about the latter when we don’t have underlying analytic (mathematical function based) theory to apply. We don’t want to say an effect has no cause (unless we are involved in quantum mechanics, and even that's debated!), but we do need to understand unique combinatorial causation in ways we currently don’t. This will have to be mixed with the statistical (probabilistic) aspects we can't avoid, such as sampling and measurement effects.
There are many reasons to dismiss a term or concept such as occasionality. First, it is not operationalized: we don’t really know what to do about it. Second, there is a tendency to reject any new term that is invoked in a critical vein. Third, the practical reason is that even if the idea is correct, if it were acknowledged too fully it would undermine the business that most in this area of science are about, and it certainly would be resisted. But if we were to force ourselves to ask what we would do if we knew the phenomena we want to understand were manifestations of occasionality rather than probability, it might force us to think and act differently.
Interesting stuff. Here's a quick reaction, after a quick read.
I'm reminded of the statistical idea of 'varying effects'. The idea is that each time you study something, you are sampling from a different population and so we can't expect there to be a single effect to estimate. Instead, effects are themselves considered random. In practice the goal of this perspective is to try to also estimate variation in the effect. Estimating such variation can be attempted by studying something repeatedly and under various conditions. Sometimes this is called hierarchical modelling.
So on one hand I think you are identifying something that statisticians already know about. On the other hand:
1. looking at a problem from a different perspective is useful
2. estimating variation in an effect is difficult, because modelling is hard and requires self-criticism
3. it is a natural human tendency to turn grey areas into black-and-white ones, and so there is little motivation to try to estimate variation in effects
Anyways ... thanks for the thought-provoking post.
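The 'varying effects' idea in the comment above can be sketched with a small simulation. This is only an illustration under stated assumptions: each hypothetical study gets its own true effect drawn from a normal distribution, observations within a study are normal around that effect, and the between-study variance is estimated by a simple moment calculation (variance of the study means minus the average sampling variance). The function names and parameter values are made up for the example, and a real hierarchical model would do considerably more.

```python
import random
import statistics


def simulate_studies(n_studies, n_per_study, mean_effect, effect_sd, noise_sd, seed=0):
    """Simulate studies whose true effects differ from study to study.

    Returns the per-study effect estimates and their estimated sampling variances.
    """
    rng = random.Random(seed)
    estimates, sampling_vars = [], []
    for _ in range(n_studies):
        # Each study samples from its own 'population': the effect itself is random.
        true_effect = rng.gauss(mean_effect, effect_sd)
        obs = [rng.gauss(true_effect, noise_sd) for _ in range(n_per_study)]
        estimates.append(statistics.fmean(obs))
        sampling_vars.append(statistics.variance(obs) / n_per_study)
    return estimates, sampling_vars


def between_study_variance(estimates, sampling_vars):
    """Moment estimate of how much the true effect varies across studies.

    Total variance of the estimates minus the part explained by sampling noise,
    truncated at zero because a variance cannot be negative.
    """
    return max(0.0, statistics.variance(estimates) - statistics.fmean(sampling_vars))


estimates, sampling_vars = simulate_studies(
    n_studies=200, n_per_study=50, mean_effect=1.0, effect_sd=0.5, noise_sd=1.0
)
print("estimated between-study variance:", between_study_variance(estimates, sampling_vars))
print("value used to generate the data: ", 0.5 ** 2)
```

Note that even this sketch assumes the effects vary according to a fixed, repeatable distribution. That is exactly the kind of assumption the post questions, so the simulation illustrates the statistician's answer rather than settling whether it is adequate.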
Thanks for your comments, Steve.
Most biostatistical and bioinformatic (and evolutionary genetics) approaches want 'linear', steady causes. I don't know about variable effects but will see what I can find out. But I would guess that statisticians are so deeply wedded to distributions based on replicable probabilities--say, distributions of probabilities among samples--as to cling to what may, even with that approach, rely too much on probability. The notion of 'occasionality', such as it is, is that there is no such probability in a tractable sense.
Another natural tendency that people must (or should) recognize, is to turn grey areas into green ones--that is, the uncertainties you mention, and that we tried to address by calling them 'occasionality', are not good if you have to propose a fundable project.
I rarely believe published research claims. The way I think about this tendency is that, in my experience and opinion, most researchers underestimate variation and uncertainty. When reading a study, I try to identify sources of variation and uncertainty that are not accounted for in the statistical analysis. I recognize that I too have probably missed some of these sources, so even when a study looks good I'm still skeptical. I am also very self-skeptical and as a result my career is floundering (green-areas indeed!). But just because modelling all of these sources of variation and uncertainty is difficult, I still find it useful to try, at least from the perspective of getting closer to the truth. How does the occasionality concept fit into this attitude?
I have retired from running a lab as of a year ago. I do still work with simulations (see earlier posts a week or two ago), which I believe should be a method of choice in this area. Simulations cannot in themselves tell us what is going on, but they can show us reasons for thinking that we're missing something and we should know it. Simulation easily generates the kind of mapping complexity that we've seen for the last 20 years.
I think that one can simulate what seem like realistic scenarios, show that by treating them in the currently popular ways we get similar answers, and yet show why these approaches are not solving the problems they have promised to solve.
My work to date has not even included simulating somatic mutations, orders of magnitude more complex. But simulation shows at least the nature of the problem. To me, if enough people (especially young ones with the freedom to think) confront the realities, someone will develop better answers. If the idea of 'occasionality' is more than just a word I've coined, it may trigger that sort of insight in someone. That's a hope, anyway. But unlike so much that we read from others, I can't promise such precision understanding.
Well put. I really liked the simulation post by the way, especially "Under many conditions [simulation] does what is asked of it, but if the conditions don't hold, or we have no idea if they do, then we are asking for trouble."
The usual story of transformative ideas (real ones, not the daily claims in the BBC or NYTimes, Nature, or Science) is that old ideas are shown to be approximate, for understandable reasons, but insufficiently accurate or mistaken about the underlying causal processes.
Of course, if evolution really leads to as much genomic variation and so on as we see--which I've referred to as 'occasionality' in this post--then maybe a new 'theory' isn't called for; instead, what would be called for is that we actually respect the understanding we already have and not pretend it's something other than that.
This is challenging. If scientists know that an outcome *might* occur, then the bare minimum statistical meaning is that the outcome has a probability greater than 0 and less than 1. From a philosophical perspective, I call this an *indeterministic outcome.* At best, we can only make an approximation of that probability. For example, in theory, a fair coin toss has a .5 probability of heads and a .5 probability of tails. But we can never know if a particular coin toss is completely fair or, say, has a .5000000000001 probability of heads. Other indeterministic outcomes are less predictable, such as the hourly temperature and 'real feel' forecast for 5-10 days in the future.
Scientists still take stabs at this, but salt is always needed.
I believe in *real* probabilities for everything but humbly admit that the world's best geniuses cannot calculate those probabilities. So I guess that means I believe in the reality of probability while acknowledging that calculating it ahead of time in many cases in impossible. This is a sum of part of my take mathematical platonism.
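The coin example in the comment above can be given a rough number. As a back-of-the-envelope sketch, assume identical independent tosses and ask how many would be needed before an approximate 95% confidence interval around the observed proportion of heads is narrow enough to separate a fair coin from one whose heads probability differs from one half by a tiny amount. The function name, the `delta` values and the 1.96 multiplier (the usual normal approximation) are assumptions for illustration.

```python
from math import ceil


def tosses_needed(delta: float, z: float = 1.96) -> int:
    """Approximate number of tosses for the confidence-interval half-width
    z * sd / sqrt(n) to shrink to delta, for a coin close to fair.

    Uses the normal approximation; sd per toss is sqrt(0.5 * 0.5) = 0.5.
    """
    sd_per_toss = 0.5
    return ceil((z * sd_per_toss / delta) ** 2)


# A bias of one percentage point is detectable with a few thousand tosses.
print(tosses_needed(0.01))

# A bias of about 1e-13 would need on the order of 10**25 tosses.
print(f"{tosses_needed(1e-13):.2e}")
```

So even in the idealized, perfectly repeatable case, a sufficiently small departure from fairness is unknowable in practice, which is the commenter's point.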
Oops, I pushed published when I first meant to press review. I see two typos in the last two sentences:
"So I guess that means I believe in the reality of probability while acknowledging that calculating it ahead of time in many cases [is] impossible. This is a sum of part of my take [on] mathematical platonism."