Wednesday, May 9, 2018

Are common diseases generally caused by 'common' variants?

The funding mantra of genomewide mapping is that common variants cause common disease (CV-CD).  This was convenient for HapMap and other association-based attempts to find genetic causation: the approach didn't require very dense genotyping or massive sample sizes, for example.  Normally, based on Mendel's widely known experiments and so on, one would expect anything 'genetic' to run in families; however, because of notions like low penetrance--a low probability of having the trait even if you've inherited the causal variants--small nuclear families can't work, as a rule, and big enough families would be too costly or even impossible to ascertain.  In particular, for traits due to many genes, environmental factors, and/or variants with only weak effects, family studies are not really practicable.

So, conveniently, when DNA sequencing on a genomewide scale became practicable, the idea was that even if sequence variants did not have wholly determinative effects, their effects might be strong enough that we need only find them in the population as a whole, not in the smallish families we have a hard enough time ascertaining.  People carrying such a variant would simply have a higher probability of showing the trait.

It was a convenient short-cut, but there is a legitimate evolutionary rationale behind it:  The same mutation will not recur very often, so if there are many copies of a causative allele (sequence variant) in a population, they are probably identical by descent (IBD), from a single ancestral mutational event.  In that sense, genomewide association studies (GWAS) are finding family members carrying the same allele, but without having to work through the actual (inaccessible, very large, multi-generation ancestral) pedigrees that connect them.  If the IBD assumption were not basically true, then different instances of the same nucleotide change would have different local genomic backgrounds, and their effects would often, even likely, vary among the descendants of the different mutations, affecting association tests--though the analysis rarely, if ever, attempts to detect or adjust for this.
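For readers who want the mechanics: a single GWAS test is, at heart, a case-control comparison of allele counts at one genome site.  Below is a minimal sketch in Python, with invented counts; real pipelines fit regression models with covariates, but the underlying comparison is the same.

    # Minimal sketch of one allelic association test (counts are invented).
    # Rows: cases, controls; columns: risk-allele count, other-allele count.
    from scipy.stats import chi2_contingency

    cases    = [1150,  850]    # 1000 cases -> 2000 allele copies
    controls = [1000, 1000]    # 1000 controls -> 2000 allele copies

    chi2, p, dof, _ = chi2_contingency([cases, controls])
    print(f"chi2 = {chi2:.1f}, p = {p:.2e}")

A GWAS simply repeats this kind of comparison at every variable site across the genome, which is where the multiple-testing burden discussed below comes from.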

In principle this can work well if a trait really is caused by alleles at a tractably small number of genes.  That's a very big 'if', but granting it--which is roughly to assume the trait is a classical Mendelian trait--one can find association of the allele with the trait among affected people, because, at that site at least, they are distant relatives.  Even then, the effect of a given allele has to be strong enough, and its frequency in the sample high enough, to pass a statistical significance test.  This is a potentially major issue, because when a study searches countless sites across the genome, the significance threshold must be corrected for all those tests, which in effect requires an allele's frequency and/or individual impact to be high.
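To put a number on that bar: the conventional genome-wide significance threshold is a Bonferroni-style correction of a 0.05 study-wide error rate for roughly a million effectively independent common-variant tests.  A quick sketch of how demanding that is:

    # The conventional genome-wide significance threshold, and the
    # chi-square statistic (1 df) a single site must produce to clear it.
    from scipy.stats import chi2

    alpha_genomewide = 0.05 / 1_000_000          # the familiar 5e-8
    needed = chi2.isf(alpha_genomewide, df=1)    # roughly 29.7
    print(alpha_genomewide, round(needed, 1))

Only an allele whose frequency and per-copy effect jointly produce a statistic that large, in the sample at hand, gets declared a 'hit'; everything weaker or rarer stays invisible, however real its contribution.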

In essence, this is the underpinning and implicit justification for the huge GWAS empire.  There are many details, but one important assertion by the leaders of the new EverBigger (and more costly) All Of Us project is that common diseases are their target.  Rare diseases generally just won't show up often enough to find statistically reliable 'hits'.

Of course, 'common' is a subjective term, and if one searches millions of genome sites whose allele frequencies vary in the sample, tons of them might be 'common' by such a definition.  They will also have to have strong enough effects to be detectable under suitably convincing significance criteria.  So we might expect CV-CD to be a proper description of such studies.  But there is a subtle difference: the implication (and, 20 years ago, the de facto expectation) was that one or a few common variants cause each common disease.

Obviously, if that assumption of convenience were roughly true, then one could devise pharmaceutical or other preventive measures to target the causal variants in these genes in affected persons.  In fact, we have largely based the nearly 20-year GWAS effort on such a wedge rationale, starting with smaller-scale projects like HapMap.  Unfortunately, that effort was a huge success!

Why unfortunately?  Because, no matter how you define 'common', what we've clearly found, time and again, trait after trait, is that each of these common diseases is due to the effects of a different set of 'common' alleles whose effects are individually weak.  In that sense, an individual allele per se is not very predictive, because many unaffected people also carry it.  Every case is genetically unique, so one Pharma does not fit all.  It is, I would assert, highly misrepresentative if not irresponsible to suggest otherwise, as the common PR legerdemain does.

Instead, what we know very clearly is that in many or most 'common' disease instances, since each case is caused by a different set of alleles, not only is each case causally unique, but no one allele is, in itself, anywhere near necessary for the disease.  There is usually no single 'druggable' target of Pharma's dreams.  There was perhaps legitimate doubt about this 20 years ago when the adventure began, but no longer.

Indeed, it is generally rare for anything close to a majority of cases, compared to controls, to share any given allele, and even when that happens, the cause, as statistically estimated by comparing cases and controls, is usually only slightly attributable to that allele's effects.  Even then, most of the variation, as measured against the trait's estimated heritability, is typically not accounted for: it seems due to a plethora of alleles too weak or rare to be detected in the sample, even though they are there and are, collectively, the greatest contributor to risk.  And, of course, we've not mentioned lifestyles and other environmental factors, nor the often largely non-overlapping results from different populations, nor various other complications.
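One standard way to quantify 'only slightly attributable' is the population attributable fraction.  Here is a sketch using the textbook formula with invented but GWAS-typical numbers: an allele at 30% frequency whose relative risk is 1.1 per copy.

    # Population attributable fraction (Levin's formula) for one allele,
    # with invented but GWAS-typical frequency and effect size.
    p, rr = 0.30, 1.10                       # allele frequency, relative risk
    paf = p * (rr - 1) / (1 + p * (rr - 1))
    print(round(paf, 3))                     # ~0.029, i.e., about 3% of cases

Even a perfectly 'common' hit like this accounts for only a few percent of cases; removing it from the population entirely would barely dent the disease.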

The non-Mendelian Mendelian reality of life
I think that as a community we were led into this causal cul-de-sac by taking Mendel too literally, or too hopefully.  To be sure, some traits are qualitative--they appear in two or a few distinct states, like green vs yellow peas--and these are basically the kinds of traits Mendel studied, because they were tractable.  In such cases each gene transmits in families in a regular way that, in his honor, we call 'Mendelian'.  And human genetics had great success identifying such traits and their causal genes (cystic fibrosis is one well-known example, but there have been many others).  However, common diseases are generally not caused by individual alleles at single genes.  Quantitative geneticists, such as agricultural breeders, have basically known about the complexity of most traits for a century, even if specific contributing genes couldn't be identified until methods like GWAS came along 15-20 years ago.

Since we know all this now, from countless studies, it is irresponsible to hijack huge funding for more and more of the same, based on a CV-CD promise that neither the public nor many investigators understand (or, if they do, dare acknowledge).  One might go farther and suggest that this makes 'CV-CD' a semantic shell-game that Congress and the public are still buying--bravely assuming that the administrators and scientists themselves, who are pushing this view, actually understand the genomic (and environmental) landscape.

NIH Director Collins is busy and has to worry about his agency's budget.  He may or may not know the kinds of things we've mentioned here--but he should!  His staff and his advisors should!  We have not invented these issues, whether or not we've explained them fully or precisely enough, and we have no vested interest in the viewpoint we're expressing.  But the evidence shows that research should now be capitalizing, so to speak, on what we've actually learned from the genomic mapping era, rather than just doing more of the same, no matter how safe that is for careers (a structural problem that society should remedy).

Instead of ever more wheel-spinning, what we really need is new thinking--different thinking, rather than just more of the same Big Data enumeration.  Until new ideas bubble up, neither we nor anyone else can specify what they should be.  Continuing to pay for ever bigger data serves several immediate interests very well: the academic enterprise, whose lifeblood includes faculty salaries and overhead funding for research done in its institutions; the media and equipment suppliers, who thrive on ever-biggerness; and the administrators and scientists whose imagination is too impoverished to generate actual new ideas.  More is easier; more insightful is very much harder.

So, yes, common diseases are caused by common variants--tens or hundreds of them!  Enumerating them is becoming a stale, repetitive, costly business, and maybe 'business' is the right word.  The public is paying for more, but in a sense getting less.  Until, some day, someone thinks differently.

Sunday, May 6, 2018

"All of us" Who are 'us'?

So the slogan du jour, All Of Us, is the name of a 1.4 billion dollar initiative being launched today by NIH Director Francis Collins.  The plan is to enroll one million volunteers in this mega-effort, the goal of which is, well, it depends.  It is either to learn how to prevent and treat "several common diseases" or, according to Dr Collins, who talked about the initiative here, "It's gonna give us the information we currently lack" to "allow us to understand all of those things we don't know that will lead to better health care." He's very enthusiastic about All of Us (aka Precision Medicine), calling it a "national adventure that's going to transform medical care."  This might be viewed in the context of promises in the late 1900s that by now we'd basically have solved these problems--rather than needing ever-bigger, longer-term 'data'.

And one can ask how data quality can possibly be maintained when the medical records of whoever volunteers vary in quality, verifiability, and so on.  But that is a technical issue.  There are sociological and ontological issues as well.

All of Us?
Serving 'all of us' sounds very noble and representative.  But let's see how sincere this publicly hyped promise really is.  Using very rough figures, which will serve the point, there are 320 million Americans, so 1 million volunteers would be about 0.3% of 'all' of us.  So first we might ask: what about achieving some semblance of real inclusive fairness in our society, by making a special effort to oversample African Americans, Hispanics, and Native Americans, before the privileged, mainly white, middle class get their names on the rolls?  That might make up for past abuses affecting their health and well-being.

So, OK, let's stop dreaming but at least make the sample representative of the country, white and otherwise.  Does that imply fairness?  There are, for example, about 300,000 Navajo Native Americans in the country.  If All Of Us means what it promises, there would be about 950 Navajos in the sample.  And about 56 Hopi tribespeople.  And there are, of course, many other ethnic groups that would have to be included.  Random (proportionate) sampling would include about 600,000 'white' people in the sample.
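The arithmetic behind those figures is simple proportional allocation.  A sketch, using the same rough Google-level figures (the Hopi total of about 18,000 and the white total of about 192 million are back-solved from the numbers quoted above; all of these are crude):

    # Proportional ('representative') allocation of 1 million volunteers,
    # using the rough population figures discussed in the text.
    us_total = 320_000_000
    sample_n = 1_000_000
    groups = {"Navajo": 300_000, "Hopi": 18_000, "white": 192_000_000}
    for name, population in groups.items():
        print(name, round(sample_n * population / us_total))
    # -> Navajo ~940, Hopi ~56, white ~600,000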

These are just crude subpopulation counts from superficial Google searching, but the point is that in no sense is the proposed self-selected sample of volunteers going to represent All Of Us in anything resembling a fair distribution of medical benefits.  You can't get as much detailed genomewide (not to mention environmental) information from a few hundred sampled individuals as from hundreds of thousands.  To be fair and representative in that sense, the sample would have to be stratified in some way rather than volunteer-based.  It seems very unlikely that the volunteers who will be included will in any real sense be representative of the US rather than, say, of university and other privileged communities, major cities, and so on--even if not because of intentional bias, but simply because such people are more likely to learn of All Of Us and to participate.

Of course, defining what is fair and just is not easy.  For example, there are far more Anglo Americans than Navajo or Hopi, so the Anglos might expect to get most of the benefits.  But that isn't what All Of Us seems to be promising.  To get adequate information from a small group, given the causal complexity we are trying to understand, such groups should probably be heavily oversampled.  Even doing that would leave room for samples from the larger Anglo and African-American populations adequate for the kind of discovery we could anticipate from this sort of Big Data study of the causes of common disease.

More problems than sociology
That is the sociological problem with claiming to represent 'all' of us.  But of course there is a deeper problem that we've discussed many times, and that is the false implied promise of essentially blanket (miracle?) cures for common diseases.  In fact, we know very well that the complex causation of the common diseases that are the purported target of this initiative involves tens to thousands of variable genome locations, not to mention environmental factors that are beyond simple counting.  Further, and this is a serious, nontrivial point, we know that these contributing causes include genetic and environmental exposures in the sampled individuals' futures, and these cannot be predicted, even in principle.  These are the realities.

And even if the project were truly representative of the US population demographically, as a sample of self-selected volunteers there remains the problem of representing diseases within the population subsets.  Presumably this is why they are focusing on "common diseases", but the sample will still have to be stratified by possible causal exposures (lifestyles, diets, etc.) and by ethnicity, and then they'll have to have enough controls to make case-control comparisons meaningful.  So, how many common diseases, and how will each be represented (males/females, early/late onset, related to which environmental lifestyles, etc.)?  One million volunteers is not going to be a representative sample, nor a large enough one once it has to be stratified for statistical analysis, especially if it also includes the ethnic diversity the project promises; the toy calculation below makes the point.
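To see how quickly stratification eats a million people, consider this sketch; every stratum count below is invented purely for illustration.

    # How stratification fragments a sample (all stratum counts invented).
    n = 1_000_000
    factors = {"disease": 20, "sex": 2, "onset": 2,
               "ethnic group": 10, "lifestyle/exposure": 5}

    cells = 1
    for count in factors.values():
        cells *= count
    print(cells, n // cells)   # 4000 cells, ~250 people per cell on average

And the cases of any one disease within any one such cell would be a small fraction of that 250, before a single matched control is counted.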

And there's the epistemological problem that causation is too individualistic for this kind of hypothesis-free data fishing to solve--indeed, it is precisely this kind of research that has shown us clearly that it is not what we need now.  We need research focused on problems that really are 'genetic', and some movement of resources toward new thinking, rather than perpetuation of the same kind of open-ended 'Big Data' investment.

And more
In this context, the PR seems mostly to be spin for more money for NIH and its welfare clients (euphemistically called 'universities').  Every lock on Big Money for the Big Data lobby--or perhaps belief-system--excludes funding for focused research on, for example, diseases that seem tractable to real science rather than to a massive hypothesis-free fishing expedition.

How could the 1.4 billion dollars be better spent?  A legitimate goal might be to do a trial run of a linked electronic records system as part of an explicit move toward what we really need, and what really would include all of us: a real national healthcare system.  This could be openly explained--we're going to learn how to run such a comprehensive system so we don't get overwhelmed with mistakes.  But then, for the very same reason, a properly representative project is what should be done.  That would involve stratified sampling and a more carefully thought-out design.  But that would require new thinking about the actual biology.