There has been quite a stir over recent attempts to provide a general calculator of heart disease risk and its associated recommendations. In particular, how reliable and how independent are those recommendations, given that they prescribe lifetime use of medicines like statins? How skeptical should the general public be about the reliability of the risk estimates, and about the disinterestedness--the lack of potential gain or self-interest--behind recommendations that a high fraction of the population go on life-long meds?
We've commented on aspects of this general issue before, and in particular on the shaky aspects of the GetOnStatins push. Here's another article in the NYTimes about this, raising questions about the accuracy of the risk estimates--the predictions--based on the chosen risk factors (cholesterol levels, age, sex, and numerous others). These calculators give wide-ranging results and are now widely viewed as inaccurate. And notably, the attacks we've seen are not accusations that pharma is pushing the recommendations to increase sales--though conflicts of interest seem likely to be a factor--but that the risk estimates themselves are unreliable.
There are a couple of issues that are deeply problematic across a wide spectrum of the kinds of biomedical (and, indeed, evolutionary) research that is getting so much funding and public play.
First, the research and interpretations are based on what amounts to a deep belief system--yes, that's an apt word for it--an assumption that raw data collection, handed to high-speed computers, can find the patterns that will properly estimate risk, given various observed traits of the sampled person. That in turn amounts to assuming that such data do, in fact, contain and will reveal the nature of causal truth. That is essentially the rationale for the glamorous touting of what is now catch-phrased as 'Big Data' (we need our branding, apparently). It sounds impressive, and it generates attention and large, long-term studies that are the boon of professors who need to fund their own salaries and their research empires.
Whether the belief in computer analysis of uncritically collected data is scientifically appropriate or not is hard to separate from the thicket of vested career interests and the glamour of technology in our current society. But there are a few clear-cut problems that a few write about but that the system as a whole dare not think too hard about because it might slow down the train.
Problem 1: Risks are estimated retrospectively--from the past experience of sampled individuals, whether in a properly focused study or in a Big Data extravaganza. But risks are only useful prospectively: that is, about what will happen to you in your future, not about what already happened to somebody else (which, of course, we already know).
We frequently mention this issue because it's a fundamental problem, and in the case of the heart disease calculator the problem is that the people from whom the risks were estimated lived importantly different lifestyles from the people now using the calculator--and that's before considering unknown future risk exposures. Smoking rates are lower and exercise rates higher. Legal drug exposures are different, as are diagnoses and treatment options, and so on. Not to mention changes in the prevalence of unknown risk factors.
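The point can be illustrated with a toy simulation--all numbers here are invented for illustration, not real epidemiological figures. Risk estimated from a past cohort with high smoking rates overstates the risk for a present cohort with lower rates, even though the dose-response effect itself hasn't changed:

```python
import random
random.seed(0)

def observed_risk(n, smoking_rate, base_risk=0.05, smoking_multiplier=2.0):
    # Simulate n people; smokers have double the base risk (made-up numbers).
    events = 0
    for _ in range(n):
        smoker = random.random() < smoking_rate
        p = base_risk * (smoking_multiplier if smoker else 1.0)
        events += random.random() < p
    return events / n

past_cohort = observed_risk(100_000, smoking_rate=0.45)   # the sampled generation
today_cohort = observed_risk(100_000, smoking_rate=0.15)  # the calculator's users
# The overall risk estimated from the past cohort overstates today's risk,
# purely because the exposure distribution has shifted.
```

The dose-response relationship is identical in both cohorts; only the prevalence of the exposure has changed, and that alone is enough to make the retrospective estimate wrong prospectively.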
This reveals something deeper, at the inescapable heart of the problem: we respond to 'news' about risks by altering our behavior, companies market different things to us as a result, and we undertake other behaviors whose relation to the disease is unknown or whose exposure levels can't be predicted--not even in principle. Thus we are essentially saying, "if you live like your grandparents did, this is your risk." But in truth we are not seers, nor is there any Oz behind the curtain who knows, or can know, how you will live--no one who controls all the outcomes, if only we could discover him.
Problem 2: The idea is that by doing statistical association studies (correlation or regression analysis, for example) on what happened in the past we are revealing the causal understructure of the results we care about. Symbolically, we write
DiseaseProb = RiskExposureAmount × DoseResponseEffect + OtherStuffIncludingErrors.
Such equations are routinely referred to as constituting a causal 'model', but in truth it's nothing of the kind. Instead it's generic rather than being developed from or tied in any serious way to the actual causation--the mechanism--that may apply to the risk factor. And 'may' is decided by a subjective statistical test of some kind that we choose to apply to any associations we find in the form of data we choose to collect (our study or sample design).
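To make the generic nature of such a 'model' concrete, here is a minimal sketch of the kind of calculation a risk calculator performs. The logistic form, the variable names, and every coefficient are illustrative assumptions, not any real calculator's values:

```python
import math

def predicted_risk(exposures, effects, baseline):
    # The generic form: a weighted sum of risk-factor levels plus a
    # baseline term, squeezed into a probability by a logistic function.
    # Nothing here encodes any mechanism of disease.
    score = baseline + sum(x * b for x, b in zip(exposures, effects))
    return 1 / (1 + math.exp(-score))

# Hypothetical person: standardized cholesterol, standardized age, and a
# smoking flag, with made-up fitted 'dose-response' coefficients.
risk = predicted_risk([1.2, 0.5, 1.0], [0.4, 0.6, 0.7], baseline=-3.0)
```

The coefficients come from fitting to past data; the equation itself is entirely indifferent to what the factors mean or how they might actually cause disease--which is exactly the point.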
We are usually not actually applying any serious form of 'theory' to the model or to the results. We are just searching for non-random associations (correlations) that may be mere chance, may be due to the measured factor, or may be due to other confounding but unmeasured factors. And even this depends on how 'randomness' plays into our sampling and into causation, and on how important the rare events are that most samples cannot capture (e.g., very rare genetic variants, or somatically arising variants that are not transmitted).
It is by such reasoning that we feel we can assume that the same association we observe in our sample will apply to other people whom we haven't observed. Since we don't know how different your life will be from the lives of those from whom the estimates were derived, we can't know how different your risks will be, even if we've identified the right factors.
Problem 3: Statistical analysis is based on probability concepts, which in turn rest on (a) ideas of repeatability, like coin flipping, and (b) the assumption that the probabilities can be accurately estimated. But people, not to mention their environments, are not replicable entities (not even 'identical' twins). We are not all totally unlike each other, but we are never exactly alike. No two populations or samples are identical (a fact that, ironically, is based on a real theory: that of evolution). Exhaustive national databases will always have such problems, the extent of them is essentially unknown, and we have no theory for it--largely because of Problem 2: we have no real theory for disease causation.
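A toy illustration of the estimation side of this (numbers invented): even two samples drawn from a population with the exact same underlying risk yield different estimates, and real populations never share the same underlying risk to begin with:

```python
import random
random.seed(2)

def estimated_risk(n, true_risk):
    # Count events among n sampled people who all share the same true risk.
    return sum(random.random() < true_risk for _ in range(n)) / n

sample_a = estimated_risk(2_000, true_risk=0.10)
sample_b = estimated_risk(2_000, true_risk=0.10)
# The two estimates differ by sampling chance alone; with non-identical
# populations (the real situation), they would differ for deeper reasons too.
```

Sampling error is at least quantifiable in principle; the non-replicability of people and environments is not, because we have no theory for it.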
Problem 4: Competing causes inevitably gum up the works. Your risk of a heart attack depends on your risk from completely unrelated causes, like car crashes, drug overdoses, gun violence, cancer, or diabetes. If those change, your risk of heart disease changes. If you're killed in a crash today you can't have a stroke tomorrow. But we cannot know whether any such exposure factors will change, or by how much. The car-crash illustration is perhaps trivial, but environmental changes on a large scale certainly are not: war and pestilence are extreme examples that we know occur regularly.
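Here is a toy simulation of this competing-risks effect (all hazards invented): a change in a completely unrelated hazard alters the observed lifetime incidence of heart disease, even though the annual heart-disease hazard itself never changes:

```python
import random
random.seed(1)

def lifetime_heart_incidence(n_people, years, p_heart, p_other):
    # Fraction who ever have a heart event; dying of an unrelated
    # cause first removes you from the pool of potential cases.
    heart_events = 0
    for _ in range(n_people):
        for _ in range(years):
            if random.random() < p_other:   # crash, overdose, etc.
                break
            if random.random() < p_heart:
                heart_events += 1
                break
    return heart_events / n_people

low_competing = lifetime_heart_incidence(50_000, 40, p_heart=0.005, p_other=0.002)
high_competing = lifetime_heart_incidence(50_000, 40, p_heart=0.005, p_other=0.010)
# Same heart-disease hazard in both runs, yet more competing mortality
# means fewer observed heart events over a lifetime.
```

So a risk estimate tacitly assumes the whole background landscape of other causes stays fixed--something nobody can guarantee.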
Problem 5: Theory in physics is in many ways the historic precedent on which we base our thinking; it arose in the time of Galileo and Newton. One advantage physics has is that its objects are highly replicable: it is believed that every electron, everywhere in the universe, is identical. If you want to know whether certain data can reveal, say, the Higgs boson, you collide beams of gazillions of identical protons into each other, and the splatter pattern tells you something about their makeup. Things scatter in a predictable pattern, and though it's probabilistic, you know the expected distribution from your theory and can then estimate, to the closest possible extent, whether the result fits.
But life is not replicable in that way. Life is the product of an evolutionary process that involves both imperfect replication (of DNA from one generation to the next) and differential proliferation depending on local circumstances. Life is about difference, not replicability.
Problem 6: Big Data is proposed as the appropriate approach, not a focused hypothesis test. Big Data are uncritical data--by policy! This raises all sorts of issues, such as the nature of the sample and the accuracy of measurements (of genotypes and of phenotypes). Somatic mutations, cell-specific interactions, and the microbiome, for example, are not measured, and can't really be collected retroactively. If the Big Data, Computers Are Everything approach works, then we will have undergone a change in the basic nature of science--which could happen, of course, but for now it's more a belief, held for various reasons, than anything with sound theory behind it (other than theories of computing and statistics, etc., of course).
There has been far too little evaluation of how well whole-population databases, including CDC databases and cancer registries, have actually identified things not knowable in other ways--or of what kinds of things (e.g., other than major risk factors) they have found. Even the poster children, like the Framingham Heart Study of heart disease risks, or studies of the drug warfarin and clotting, have not been unambiguous successes. Had this work been done modestly, without the 1990s-era promises of magic bullets, and on a smaller scale, followed by proof of principle in the form of actual cures and the like, the situation might be different.
Based on the fundamental belief that ultimate control in the cosmos lies at the level of basic physics, science holds that the world must be predictable and that everything, from gravity on up to life, must follow universal laws. That drives the belief that if we observe enough things we can find out how those laws apply to life, which is more complex than a hydrogen atom. But there could be, say, uncomputably many arrangements of components underlying much of biological causation--arrangements that statistical models can't explain even when statistical studies can reveal them, or causal interactions too indirect for our current kinds of non-theory-based statistical study designs to reveal.
This is old news that nobody wants to acknowledge
We're disclosing no secrets here, except the secret that the profession simply won't stop foisting this sort of approach onto the funding agencies and the media, along with promises of major positive social impact, rather than reforming the way science is done. The profession is not confessing that, basically, one writes regression equations because computers can run them, Big Data can be fed into them, and... really... we usually haven't much of a clue how the causation actually works. Even if, as is likely often the case, a measured risk factor really is a risk factor, its connection with the outcomes is typically not even mildly understood at present. Yet we make our daily proclamations to the media.
Even major environmental or genetic risk factors often take extensive, highly focused studies to work out in any quantitative way that is useful in telling people what their risks are or helping them decide what to do about them. If a particular gene gives you, say, a 50% risk of a nasty disease, you might want to be screened, or to stop some behavior that confers that level of risk. You don't much care whether the true figure is 55% or 71%.
But most risk factors change your risk by only a few percent, or less--changes that are hardly measurable with accuracy, except by invoking the beliefs we discussed in Problem 3.
None of this is new to science or to this particular era in science. But in an age of impatience, where so many vested interests drive the system, communication is so rapid, and self-promotion so prevalent among professionals and the media etc., the issues are not given much attention. It doesn't pay to face up to them. They are too challenging.
The drumbeat of critiques of research results that emblazon the media is over-matched by the blare of excited promises and proclamations. It's as if there's an Oz behind the curtain who pulls various levers and makes things happen (and the scientist seems to claim that s/he is that Oz, or has secret communication with him). It's an understandable human tendency, it's our way of doing business, even if it's only somewhat connected to the problems we say we're doing our best to solve. Often, our best would be to slow down, scale down, focus and think harder. That's no guarantee, and it doesn't mean we shouldn't use technology and even large-scale approaches. But we should stop using them mainly as a way of keeping the tap open.
Now yesterday there was another story, reflecting a long-known finding that obesity is a risk factor for breast cancer. This isn't far-fetched, if the idea is correct that obesity relates to cholesterol which is a molecule in pathways related to various steroid hormones which could stimulate growth in breast cancer cells. The idea is that if one lowers cholesterol, this risk might be lowered. And how does one lower cholesterol? Taking statins is one way. So, like so many things, this is a complicated story.
It could be that the way things are being done, including Big Data 'omics' approaches, is the best approach to genomic causation that there is, and we just won't get the kinds of rigor that physics and chemistry enjoy. Or something better may be due, and it'll come along when somebody has the right insight. But that insight will likely come to someone who is banging his or her head against problems like those we've outlined here, and in other posts.