When there isn't an adequate formal theory for determining cause and effect, we often must rely on searches for statistical associations between variables that we, for whatever reason, think might cause an outcome and the occurrence of the outcome itself. One criterion is that the putative cause must arise before its effect, that is, the outcome of interest. That time-order is sometimes not clear in the kinds of data we collect, but we would normally say we're lucky in genetics because a person is 'exposed' to his or her genotype from the moment of conception. Everything that might cause an outcome, say a disease, comes after that. So gene mapping searches for correlations between inherited genomic variation and variation in the outcome. But the story is not as crystal clear as is typically presented.
In genomewide mapping studies, like case-control GWAS (or QTL mapping for quantitative traits), we divide the data into categories based on, say, two variants (SNP alleles), A and B, at some genome position X. Then, if the outcome--say some disease under investigation--is more common among A-carriers than among B-carriers at some chosen statistical significance level, it is common to infer or even to assert that the A-allele is a causal factor for the disease (or, less often put this way, that B is causally protective). The usual story is that the difference is far from categorical, that is, the A-bearing group is simply at a higher probabilistic risk of manifesting the trait.
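The comparison just described can be made concrete with a toy allele-count test. Everything below--the counts, the cutoff--is invented for illustration; this is a minimal sketch in Python using only the standard library, not the pipeline any real GWAS uses:

```python
import math

# Hypothetical allele counts (all numbers invented for illustration):
#              allele A   allele B
# cases           240        760
# controls        200        800
a_case, b_case = 240, 760
a_ctrl, b_ctrl = 200, 800

# Odds ratio: odds of carrying A among cases vs among controls
odds_ratio = (a_case * b_ctrl) / (b_case * a_ctrl)

# Pearson chi-square on the 2x2 table (1 degree of freedom)
n = a_case + b_case + a_ctrl + b_ctrl
row1, row2 = a_case + b_case, a_ctrl + b_ctrl
col1, col2 = a_case + a_ctrl, b_case + b_ctrl
chi2 = 0.0
for obs, r, c in [(a_case, row1, col1), (b_case, row1, col2),
                  (a_ctrl, row2, col1), (b_ctrl, row2, col2)]:
    expected = r * c / n
    chi2 += (obs - expected) ** 2 / expected

# For 1 df, the upper-tail p-value is erfc(sqrt(chi2 / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"OR = {odds_ratio:.2f}, chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

With these made-up counts the A allele is only modestly enriched in cases (an odds ratio near 1.26), yet it crosses the conventional 0.05 threshold--exactly the kind of weak, probabilistic signal the text describes.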
However, the usually unstated inference is that the presence of SNP A has some direct, even if only probabilistic, effect in causing the outcome. The gene may, for example, be involved in some signaling pathway related to the disease, so that variation in A affects the way the pathway, as a whole, protects or fails to protect the person.
Strategies to protect against statistical artifact
We know that in most cases many, or even typically most, people with the disease do not carry the A allele, because the relative risks associated with the A allele are usually quite modest. So the correlation might be spurious because, as we know and too often overlook, correlation does not in itself imply causation. One way to guard against such confusion is to compare AA, AB, and BB genotypes to see whether the 'dose' (number of copies) of A is correlated with some aspect of the disease. Usually there aren't enough data to resolve such differences with convincing statistical rigor.
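A dose-response check of this sort is easy to sketch. The genotype counts below are entirely hypothetical, and a formal analysis would use something like a Cochran-Armitage trend test; but even just computing per-genotype risk shows the idea:

```python
# Hypothetical (affected, unaffected) counts per genotype; all numbers invented
counts = {
    "AA": (30, 470),    # two copies of A
    "AB": (80, 1920),   # one copy
    "BB": (90, 2910),   # zero copies
}

# Risk of disease within each genotype class
risks = {g: aff / (aff + unaff) for g, (aff, unaff) in counts.items()}

# Crude dose-response check: does risk rise with each extra copy of A?
monotonic = risks["AA"] >= risks["AB"] >= risks["BB"]

for g in ("BB", "AB", "AA"):
    print(f"{g}: risk = {risks[g]:.3f}")
print("monotonic dose-response:", monotonic)
```

Here the invented counts happen to show a monotonic trend (3%, 4%, 6%), but with realistic sample sizes the confidence intervals around such small differences are typically too wide to be convincing, which is the point made above.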
Another approach to protect against false associations is to extend the study and see whether the same association is retained. But this is often very costly because of sample-size demands or, if the study is in a small population, perhaps impossible. Likewise, people exit and enter study groups, changing the mix of variation and risking obscuring the signal. One way to try to show systematic effects is to do a meta-analysis, pooling studies. If the overall correlation is still there, even if individual studies have come to different risk estimates, one may have more confidence. To my understanding, this is usually not done by regressing risk on allele frequency, which seems like something that should be done; but there is heterogeneity in method, genotyping accuracy, and size among studies, so this is likely to be problematic.
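Pooling studies in the way just described is commonly done by inverse-variance weighting of per-study log odds ratios (a standard fixed-effect meta-analysis). The odds ratios and standard errors below are invented for illustration:

```python
import math

# Hypothetical per-study results: (odds ratio, standard error of log-OR)
# All values invented for illustration
studies = [
    (1.30, 0.12),
    (1.10, 0.15),
    (1.45, 0.20),
    (0.95, 0.25),
]

# Fixed-effect inverse-variance pooling of log odds ratios
num = den = 0.0
for or_, se in studies:
    w = 1.0 / se ** 2          # weight = inverse of the variance of log(OR)
    num += w * math.log(or_)
    den += w
pooled_log_or = num / den
pooled_se = math.sqrt(1.0 / den)

# 95% confidence interval on the pooled odds ratio
lo = math.exp(pooled_log_or - 1.96 * pooled_se)
hi = math.exp(pooled_log_or + 1.96 * pooled_se)
print(f"pooled OR = {math.exp(pooled_log_or):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

In this toy example the pooled odds ratio comes out near 1.22 with a confidence interval excluding 1, even though one individual study points the other way. Heterogeneity among studies, as noted above, can make the fixed-effect assumption behind this pooling questionable.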
An issue that seems often, if not usually, to have been overlooked
An upshot of this typical finding of very weak effect sizes is that the disorder results from any of a variety of genomic backgrounds (genomewide genotypes), as well as from lifestyle exposures that aren't being measured. The background differences may complement A's effects in varying ways, so that the net effect is real but, on average, weak. That is why non-A carriers have almost the same level of risk (again, that is, the net effect size of A is small).
But the problem arises when the excess risk in the A-carriers is assumed to be due to that allele. In fact, it is very likely that even in many affected A-carriers the disease arose because of risk variants in networks other than the one involving the 'A/B' gene. That is why nearly as high a proportion of non-A-carriers are affected. Because of independent assortment and recombination among genes and chromosomes, the same distribution of backgrounds will be found in the A-bearing cases (though no two individuals will share the same genomewide genotype). In those A-individuals the outcome may be due entirely to variants other than the A allele itself, for example, variants in genes in other networks. That is, some, many, or even most A-bearing cases may not, in fact, be affected because of the A allele.
This seems to me very likely to be a common phenomenon, given what we know about the complex genotypic variation at the thousands of potentially relevant sites typed in genomewide analyses like GWAS in any sample. One well-known issue, which GWAS methods can and often do correct for, is that factors related to population structure (the origins of, and marriage patterns among, the sampled individuals) can induce false correlations. But even after that correction, given the true underlying causal complexity, it is likely that at some SNP sites mere chance differences in the distribution of complex background genotypes between A- and non-A carriers suffice to generate weak statistical effects, when so many sites in the genome are tested.
Suppose the A allele's estimated effects raise the risk of the disease to 5% in A-carriers, and let's assume for the moment the convenient fiction that there is no error in this estimate. It may be that 5% of A-carriers bear a fully penetrant version of the risk and are doomed to the disease, while the other 95% of A-carriers are risk-free. Or it could be that every A-carrier's risk is elevated by some fraction, so that the average is 5%. Given that almost as many cases are usually seen in non-A carriers, such uniformity seems unlikely to be true. Almost certainly, the A-carriers have a distribution of risks, for whatever background genomic, environmental, or stochastic reasons, whose average is 5%. These alternative interpretations are very difficult to test, and when does anyone actually bother?
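The observational equivalence of these interpretations is easy to demonstrate by simulation. All three hypothetical models below (names and numbers are mine, for illustration only) produce the same roughly 5% prevalence in a cross-sectional sample, so prevalence alone cannot distinguish them:

```python
import random

random.seed(1)
N = 200_000  # simulated A-carriers per model

# Model 1: 5% of A-carriers are deterministically 'doomed'; 95% are risk-free
doomed = sum(random.random() < 0.05 for _ in range(N))

# Model 2: every A-carrier shares the same uniform 5% risk
uniform = sum(random.random() < 0.05 for _ in range(N))

# Model 3: each A-carrier draws an individual risk averaging 5%
# (here risk ~ Uniform(0, 0.10), so the mean risk is 0.05)
hetero = sum(random.random() < random.uniform(0.0, 0.10) for _ in range(N))

for name, k in [("doomed subset", doomed),
                ("uniform risk", uniform),
                ("heterogeneous risk", hetero)]:
    print(f"{name}: prevalence = {k / N:.4f}")
```

Telling these models apart would require other kinds of data entirely--recurrence risks in relatives, say, or longitudinal follow-up--which is rarely pursued.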
The problem relates to intervention strategies
For many years, we have known from all sorts of mapping studies that most identified sites have very weak effects. We know that in many cases environmental factors are vastly more important (because, for example, the disease's prevalence has changed dramatically in the last few decades). But the justification (rationale, or excuse) for continuing the huge Big Data approach is that it will at least identify 'druggable target' genes so that the culpable pathway can be intervened in. Hoorah for Big Pharma!--even if the gene itself isn't that important, a network has many intervention points.
However, to the extent that the correlation fallacy discussed here is at play, targeting genetically based therapy at the A allele may fail not because the targeting doesn't work but because most A-carriers are affected mainly or exclusively for other reasons. If the inferential fallacy is not addressed, think how long and how costly it will be before we end up doing better.
The correlation fallacy discussed here doesn't even assume that the A-allele is not a true risk factor, though, as we noted above, that may often be the case if the results of Big Data studies are largely statistical artifacts. The issue is not that the A effect is strong--it is unlikely to be, or it would be easier to see in family studies (as some rare alleles, in fact, are)--but simply that most individuals in both the A and non-A categories are affected for completely unrelated reasons. Again, what we know about recombination, somatic mutation, independent assortment, and the complexity of gene and gene-environment interactions suggests that this simply must often be true. The correlation fallacy may pervasively lurk behind widely proclaimed discoveries from genome mapping.