Blogs like this are venues for expressing views on the current scene, in our case, related to genetics, evolution and a few other things we throw in. If you express a view, unless it's just plain vanilla, you will irritate some readers. In a sense, if you don't then there's no point in writing the blogpost. In this case, we heavily criticized the recent NYTimes article reporting that the government has now backed off its claim that dietary cholesterol is a heart disease risk factor. We try to be responsible, but that doesn't mean we have to expect agreement or to mince words!
We argued that herky-jerky yes/no results from huge long-term megastudies are so common that they show the studies to be rather useless. We think such studies should be phased out, the results to date archived for anyone who wants to mine them, and the funds put to something that actually generates more trustworthy and stable results (if risks are stable enough to be estimated in these ways).
Well, this generated a very heated message from an old friend, a prominent genetic epidemiologist, who said that if we were listened to, it would lead to throwing the baby out with the bathwater. He was upset because, he said, it was not the data but the analysis of these big epidemiological (environmental or genetic) studies that was at fault. The studies are based essentially on correlation or regression models that assume everyone starts out as an equal blank slate and that each individual's risk is the basically additive total of his or her various risk-factor exposures. Your sex gives you a 'dose' of risk, which age adds to, then smoking history, diet, and so on. Once all your risk-factor measures are toted up, your net risk can be estimated. It is this general approach that, regardless of sophisticated details in the statistical method, is not good at finding what we really should be looking for.
Our friend's idea is that, for example, dietary cholesterol may on average not be harmful, but there are likely some subsets of the population for which it is a risk. Standard models may be convenient to apply, and everyone knows how to do that, but they miss the boat. The key problem is basically that risk factors interact, and searching for complex interactions is rarely done because it is very demanding in terms of sample size, sample structure, and analytic tools. But there is no reason, for example, to think that males and females respond to a given risk factor equally per exposure dose. So interactions ('epistasis' in relation to genome elements) are given very light treatment and basically wished away.
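To make our friend's point concrete, here is a minimal sketch in Python. Every number in it is invented purely for illustration (the subgroup labels, the case counts, and the population shares are assumptions, not estimates from any study). It shows how an exposure can have no effect at all on average across the whole population while still clearly raising risk in one subset, which is the kind of thing a model that only adds up average effects would not see.

```python
# Cases of disease per 1000 people, by subgroup and exposure status.
# All values are made up for illustration; they come from no real study.
cases_per_1000 = {
    "A": {"unexposed": 100, "exposed": 100},  # exposure makes no difference
    "B": {"unexposed": 100, "exposed": 250},  # exposure raises risk
    "C": {"unexposed": 100, "exposed": 50},   # exposure lowers risk
}

# Hypothetical share of the population in each subgroup, in percent.
percent_of_population = {"A": 60, "B": 10, "C": 30}


def subgroup_effects(cases):
    """Extra cases per 1000 attributable to exposure, within each subgroup."""
    return {g: c["exposed"] - c["unexposed"] for g, c in cases.items()}


def average_effect(cases, shares):
    """Population-weighted average of the subgroup-specific effects."""
    effects = subgroup_effects(cases)
    return sum(shares[g] * effects[g] for g in effects) / sum(shares.values())


effects = subgroup_effects(cases_per_1000)
overall = average_effect(cases_per_1000, percent_of_population)

print("Effect within each subgroup (extra cases per 1000):", effects)
print("Average effect across the population:", overall)
```

With these invented numbers the average effect is zero even though subgroup B carries 150 extra cases per 1000 when exposed. The catch, of course, is that in real data nobody hands us the subgroup labels; finding them is exactly the hard part.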
My irate friend basically argued that what is needed is not an end to the data but to the methods.
One recent paper I was referred to, which applies a method for finding high-risk subsets and has references to earlier descriptions of the methodology, is this: Frikke-Schmidt R et al., "Subgroups at high risk for ischaemic heart disease: identification and validation in 67 000 individuals from the general population", Int J Epidemiol. 2015 Feb;44(1):117-28 (unfortunately it is not freely available).
How effectively this will find really different subgroups is an open question. We know, for example, that males and females are not at equal risk and respond differently to other factors, as mentioned above. We know genomic components interact. We know that as you get older you get closer to various risks, such as heart disease or cancer, and that the same exposure has different impacts with age, and so on.
The idea, in public health and in medicine (and in evolutionary inference), of identifying causation as effectively as possible, which includes identifying high-risk individuals as early as possible, is of course absolutely the right thing. There are many instances of genetic risk factors, like some variants in the gene responsible for cystic fibrosis or in the BRCA1 gene related to breast cancer, where the Who Cares? principle applies: the single factor's effect is so predictably strong that one intervenes regardless of the details of how much risk is associated or what other factors might slightly modify the outcome. We know that the nominal risk factor (e.g., a mutation) doesn't always lead to the same degree of severity, but the variation isn't enough to cause doubt: Who Cares about the details?
But whether in general this sort of method of searching for statistical associations can identify risk early enough, when preventive measures might be more helpful, is unclear. We know about age and sex and smoking and so on, and maybe we don't really gain much from adjusting the exact values. Or maybe we would. Likewise for the complex interactions among hundreds of contributing genomic factors. But there the number of factors, the assumption of independence, and so on need to be recognized as being as daunting as they are, relative to statistical risk analysis.
In my view, this is still walking in dreamland. The major factors will be identified, perhaps with more precision, but we face a huge, open-ended kind of 'multibody' problem. It's just not possible to analyze all the possible combinations of factors and their interactions to get combination-specific risk estimates. First, risks are contingent, one factor's effect depending on what else is present, as discussed above. Second, not all combinations will show up in the data, even in huge samples, so risk estimates that must account for interactions will simply come up short or, perhaps better put, will have unknown or even unknowable precision.
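A bit of back-of-the-envelope arithmetic shows why not all combinations can show up. The sketch below is only an illustration under simplifying assumptions of my own: every factor is treated as a simple yes/no, the sample size of one million is hypothetical, and people are imagined to be spread perfectly evenly over the combinations, which is the best case and never true of real data.

```python
def n_combinations(n_factors, levels=2):
    """Number of distinct risk-factor profiles if each factor has `levels` states."""
    return levels ** n_factors


def mean_people_per_combination(sample_size, n_factors, levels=2):
    """Average people per profile if the sample were spread perfectly evenly."""
    return sample_size / n_combinations(n_factors, levels)


sample_size = 1_000_000  # hypothetical megastudy size, chosen for illustration

for n_factors in (5, 10, 20, 30):
    combos = n_combinations(n_factors)
    per_combo = mean_people_per_combination(sample_size, n_factors)
    print(f"{n_factors:>2} yes/no factors: {combos:>13,} combinations, "
          f"about {per_combo:,.4f} people per combination")
```

With only 20 yes/no factors there are already more possible combinations (1,048,576) than people in the hypothetical sample, so most combinations must be empty or nearly so. Real risk factors are often graded rather than yes/no and number in the hundreds, which only makes matters worse.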
Third, we know very well, even from just the past few decades, that the incidence of outcomes changes hugely while genomes basically don't. We may estimate the effects of the particular combinations of environments our sampled individuals were exposed to, but we simply cannot, even in principle, know what environmental factors the current individuals, for whom we are being promised precise predictions, will be exposed to. Yet heritabilities, with all their problems, clearly show that genomes typically contribute far less than half of all risk. Environmentally and genomically, specific factors or variants come and go, and no two people are identical or even close to it.
A major issue is not just that there is no way, even in principle, to know what risks are associated with most risk factors, much less with interactions among them, but that we have no way of knowing the degree of precision of predictions. This is why, among other things, even if increasing our understanding is a very noble pursuit, promising 'precision' in prediction based on genomes or, really, almost any other risk factors, is irresponsible.