The Mermaid's Tale: March 2019

A couple of decades ago, several of us, led by Luca Cavalli-Sforza, Marc Feldman, Ken Kidd, and others (including yours truly), got together to suggest a worldwide sampling of human genetic diversity that would specifically include the diverse 'anthropological' populations (traditional tribal groups who still existed but were being surrounded or incorporated -- or worse -- by the growing, large agricultural/industrial civilizations). The idea, called the Human Genome Diversity Project (HGDP), was to collect DNA samples from hundreds of populations worldwide who would otherwise be un- or under-represented in the available data on human genomic variation. The large agricultural/industrial populations are swamping (if not literally exterminating) these more ethnically aboriginal peoples. Yet their pattern of genomic diversity is that from which the dense populations derived, and the latter's variation may tell us about the origins and nature, and perhaps adaptive fitness, interactions and so on of the larger pan-human population into which 'we' grew.

The idea of a global HGDP was stifled by two things. The first was attacks by political opportunists (culpably played up by the media) who felt this global sampling was demeaning to the aboriginal populations or would be designed imperialistically to profit from those peoples by patenting findings. The second was the hungry economic maw of the human genome sequencing project, then in progress and preemptive.

The upshot was that the HGDP was never funded. Luca donated the set of global samples then available to him, to the France-based CEPH website, where they were given the HGDP name (that is still the case, though I think it wrong, because the data set was not a systematically global design-first-then-sample project, so it rather co-opted the HGDP name). Nonetheless, and to the good, the DNA along with analytic results from those samples are freely available to qualified researchers.

Another HGDP organizer, Ken Kidd at Yale (along with his wife, Judy, and other collaborators), has produced an excellent, publicly accessible website called ALFRED, which provides allele frequency data from populations around the world, plus documentation of the sampled populations and a variety of other user-friendly features. Among other things, this is a fine tool for teaching global human diversity.

Now, a new paper by Sarah Tishkoff and others (Sirugo et al., "The Missing Diversity in Human Genetic Studies", in Cell 177, March 21, 2019) makes the case for sampling human genomic diversity, of a sort, pointing out various reasons why it would be good to address the current bias in genetics towards Europeans with global sampling of human variation. Obviously, I agree with that although many technical points could be raised about whether the inevitably smaller samples from scattered small populations could possibly be analyzed as effectively as the very large samples required to identify risk variants that are being over-peddled to us via the various 'omics and Big Data advocates.

What are the 'populations' and what does 'diversity' properly include?
The value, potential and humane importance of properly sampling humans beyond the major large populations in Europe and North America is obvious, but the new paper makes the case mainly for the larger 'mainline' populations other than Europeans. Unfortunately, numerous though they may be in the census sense, they are heterogeneous, and it is unclear exactly whom current data represent, and how. Can we just blithely say we need to include 'Africans' to address the representativeness problem? Are, for example, African-Americans, not to mention 'Hispanic-Americans', all the same among possible samples? And the same regarding Asians. The current paper deals with these issues at least to some extent. But then what about, say, New Zealand natives, or Cherokees, or other small populations? Which castes, and from which parts of India, must we collect data? How exhaustively should we sample, and how can complex genomes effectively be parsed in this way (not to mention environments--a topic at least acknowledged by Sirugo et al.)?

Francis Collins' current 'All of Us' sloganeering is, to me, a culpable mis-representation to the public, a strategy to pry huge funds out of Congress in open-ended ways, too big to terminate, a welfare project for university research and its various supporting industries and interests. The idea seems to be implicit, though unjustified, that any sort of open-ended Big Data 'omical project can be fair to small sub-groups (indeed, I would argue from various aspects of what we know already, it can't for the major ethnic groups either). So what does the promise that this is for 'All of us' actually mean, beyond a transparent strategy to pry open-ended funding from Congress?

Problems with the promise in the first place
Now while I agree that increasing sampling of human diversity is important for many reasons, not least being fairness, the paper promises that it will increase or improve 'precision' medicine. To me, that is sloganeering, and avoids facing up to what Big Data 'omics have already shown us about causal complexity of the important non-Mendelian traits--complexity not only in the genomic but also environmental senses.

There are several obvious, but conveniently ignored, reasons for this. First, 'genetic' causation involves more than inherited genomic variation. Important variation arises during life, when cells divide. This somatic variation is genetic, but not sampled in the usual genome-sequencing way. Yet somatic variation clearly has important consequences because a cell doesn't 'know' if its genome sequences were inherited from the individual's parents, or arose during the individual's life.

Secondly, the whole enterprise assumes that induction can lead to deduction, that is, that what we've observed in the past leads us to predict the future. It is not just inherited and somatic mutations whose future is literally unpredictable, but the same is true for lifestyle exposures. Yet lifestyle exposures are vital components of complex disease risks. They cannot be predicted, even in principle. That means past exposures do not predict future ones (to environments or mutations). This is not a dark secret, no matter how inconvenient for the 'omics prediction industries. Unlike many areas in chemistry and physics, induction does not lead to deduction in life.

What we need is deep re-thinking of the problem of genomic effects on disease and other traits. But that is not easy to arrange when careers and institutions depend on very large, very predictable, basically permanent funding for the persons involved. To improve these aspects of our science, we need a different way to support it: new economics, not bigger data or more sequencing. And a side benefit of such reform, were it ever possible, would be to free up investigators' minds to move from surviving to surmising--to new ideas.

Our "I'm first!!" era in science
I do have to note that the tendency to ignore, or be ignorant of, prior work is manifest in this paper, which does not mention the HGDP. We are in an "I'm first!" era in science. I think Shakespeare understood the clearer truth: 'What's past is prologue'.

Good ideas need to be followed up, and properly sampling the world is one such good idea. But this paper doesn't really deal with the small, traditional aboriginal populations. In the case of the HGDP effort, there was simply a lack of support for sampling small, relatively isolated populations to build a picture of human genomic diversity out of the context from which it actually arose. But it was an effort that explicitly recognized the issues, as they stood at that time. So it is not excusable that the new paper fails to acknowledge the precedent advocating worldwide population sampling. The senior author was very familiar with that effort.

A good idea, one that should not seem novel, would be for scientists to read, and cite, their predecessors who had prior recognition of an issue or problem and who inevitably, even if indirectly, helped stimulate subsequent work. But crediting others doesn't help one's career score-counting, and it takes at least a tad of effort to find out what an idea's ancestors may have thought, not to mention crediting them. In this case, the senior author had every reason to know directly about this history. Indeed, she did her doctoral and post-doctoral work in places deeply involved in the HGDP!

Anyway, this griping aside, it is at least worth discussing in a serious way whether and how a global sampling of worldwide populations, beyond the main 'racial' groups, would be a good thing to do. I think it would. We are, after all, throwing away countless millions (or is it billions?) on proudly hypothesis-free Big Data 'omical enumerations, projects too big to stop (no matter how, by now, largely pointless). We now know the basic landscape, and it is not nearly as encouraging as its self-interested press regularly blares. Its valuable results should stimulate hard, new thinking, but as long as business as usual pays and absorbs careers, who knows when that will happen?

Even if reform is difficult because of vested interests that we've allowed to develop, it is proper to acknowledge one's intellectual ancestors.

Monday, March 25, 2019

Human Genome Diversity: important to recognize, but not a new issue

Tuesday, March 5, 2019

Tales for children (and lessons for scientists, of all ages)