Tuesday, September 10, 2013

Who begat you??

Henry Louis Gates, who has done so much to interest people in DNA-based ancestry testing, tells the story of his own first ancestry report. He was told that his maternal line probably traced back to the Nubian people in Egypt. Given that the ancestors of most African Americans left Africa by force, it's difficult for their descendants to trace their family histories further back than their arrival on these shores, so Gates was excited to learn of his North African past.

Five years and another ancestry test later, however, he was told that his maternal ancestors were in fact more likely to have been European than African. Why didn't the first test give the same result? As we recall the story, it was because the goal of the first ancestry testing company was to trace African American ancestry back to Africa. They had no European samples in their database, and this forced the closest fits to be African, even when they weren't.

The quality of any ancestry results depends on the representativeness of the ancestral data. If an ancestry company only had, say, Tahitian, Mongolian and Finnish Saami samples with which to compare our DNA, we'd all look like an admixture of Tahitian, Mongolian and Saami. And much the same is true when a researcher is trying to determine the geographic history of a population, or to search for genetic signatures of natural selection.
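To make the point concrete, here is a toy sketch in Python. It assumes a simple nearest-match rule (smallest squared difference in allele frequencies), which is not how any particular company computes ancestry, and the population labels and frequencies are invented purely for illustration.

```python
# Toy illustration: a "closest fit" can only come from the reference panel.
# All population labels and allele frequencies below are hypothetical.

def closest_reference(sample, panel):
    """Return the name of the panel population whose allele frequencies
    are nearest (by squared distance) to the sample's."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(panel, key=lambda name: distance(sample, panel[name]))

# Allele frequencies at five made-up SNPs.
full_panel = {
    "pop_A": [0.10, 0.80, 0.30, 0.60, 0.20],
    "pop_B": [0.15, 0.70, 0.40, 0.50, 0.25],
    "pop_C": [0.90, 0.20, 0.70, 0.10, 0.80],
}

# A sample whose frequencies really resemble pop_C.
sample = [0.85, 0.25, 0.65, 0.15, 0.75]

# With pop_C missing from the panel, a best match is still reported,
# even though it is a poor one.
partial_panel = {k: v for k, v in full_panel.items() if k != "pop_C"}
print(closest_reference(sample, partial_panel))
print(closest_reference(sample, full_panel))
```

The rule never answers "none of the above": remove the right population from the panel and the sample is simply assigned to the least-bad alternative.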

Population ancestry determination
A recent paper in BioEssays ("SNP ascertainment bias in population genetic analyses: Why it is important, and how to correct it," Lachance and Tishkoff, online 9 July 2013, in print Sept 2013) discusses the biases that may be built into demographic and genetic analyses if researchers determine allele frequencies -- the frequency of different gene variants -- based on single nucleotide polymorphisms (SNPs) which have been found to vary in substantial frequency in other, perhaps distantly related populations. (SNPs are single sites in the genome where some people have, say, a T and others an A.) These criteria for identifying variable sites that are to be used later to examine ancestry of general samples of people can bias the ancestry estimates.

Lachance and Tishkoff report the results of population genetic analyses using whole genome sequences of 15 African hunter-gatherers: 5 Hadza from Tanzania, 5 Pygmies from Cameroon, and 5 Sandawe from Tanzania. The sequences contained millions of variants that hadn't been seen in other populations before -- and this will be true of every population with a history of geographic isolation. Indeed, each of us contains hundreds of brand-new mutations private to us, others are private to our near kin, and so on. Depending on the sample size and source, one will detect some, but by no means all, of the genetic variation in the population (any population). Lachance and Tishkoff compared analyses based on whole genomes with SNP-based analyses (that is, analyses of a subset of variable sites that had been identified in some other samples, rather than whole sequence), and determined that the latter are biased.
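How much variation a small sample can be expected to pick up follows from simple probability. The sketch below is our own illustration, not a calculation from the paper. It assumes chromosomes are sampled at random and independently from the population, and the function name is hypothetical.

```python
def detection_probability(allele_freq, n_individuals):
    """Chance that a variant at the given population frequency shows up at
    least once among 2 * n_individuals randomly sampled chromosomes."""
    n_chromosomes = 2 * n_individuals  # diploid individuals
    # Probability of missing the variant on every chromosome is
    # (1 - allele_freq) ** n_chromosomes; detection is the complement.
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes

# Five individuals, as in each of the hunter-gatherer samples.
for freq in (0.01, 0.05, 0.20):
    print(freq, round(detection_probability(freq, 5), 3))
```

Under these assumptions, a variant at 1% frequency turns up in a sample of five people only about one time in ten, while a variant at 20% is rarely missed. Small discovery samples therefore favor common variants.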

SNP ascertainment bias is the systematic deviation of population genetic statistics from theoretical expectations, and it can be caused by sampling a nonrandom set of individuals or by biased SNP discovery protocols. Unless the whole genome of every individual in a population is sequenced, there will always be some form of SNP ascertainment bias.

Population statistics are affected by SNP ascertainment bias in a variety of ways. For example, populations may falsely seem to have shrunk in size, and SNP age estimates may be skewed toward older ages because pre-ascertained SNPs are usually older than population-specific SNPs. Measures of population heterozygosity, and its variation, can be affected because those measures depend on how much variation there is within and between populations. If you only use sites found to vary substantially in, say, Europeans (as was the case for the earliest SNP sets), then statistically speaking Europeans are likely to seem more variable than other populations.
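The heterozygosity effect can be seen in a small simulation. This is a deliberately crude sketch of our own, not the authors' method: it assumes two populations whose allele frequencies are drawn independently from the same U-shaped distribution (real populations share ancestry, so their frequencies are correlated), and it "discovers" SNPs only in a small panel from population A. All names and parameter values are hypothetical.

```python
import random

def heterozygosity(p):
    """Expected heterozygosity at a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)

def simulate(n_sites=20000, panel_individuals=10, min_minor_count=2, seed=1):
    rng = random.Random(seed)
    n_chrom = 2 * panel_individuals
    all_a = all_b = asc_a = asc_b = 0.0
    n_asc = 0
    for _ in range(n_sites):
        # Toy assumption: both populations draw frequencies independently
        # from the same U-shaped distribution (most variants are rare).
        p_a = rng.betavariate(0.3, 0.3)
        p_b = rng.betavariate(0.3, 0.3)
        all_a += heterozygosity(p_a)
        all_b += heterozygosity(p_b)
        # SNP discovery: the site is kept only if the minor allele is seen
        # at least min_minor_count times in a panel drawn from population A.
        count = sum(rng.random() < p_a for _ in range(n_chrom))
        if min(count, n_chrom - count) >= min_minor_count:
            n_asc += 1
            asc_a += heterozygosity(p_a)
            asc_b += heterozygosity(p_b)
    return {
        "all_sites": (all_a / n_sites, all_b / n_sites),
        "ascertained": (asc_a / n_asc, asc_b / n_asc),
        "n_ascertained": n_asc,
    }

result = simulate()
print("mean heterozygosity (A, B), all sites:  ", result["all_sites"])
print("mean heterozygosity (A, B), ascertained:", result["ascertained"])
```

Across all sites the two populations are equally variable by construction. Restricted to the sites discovered in population A, population A looks clearly more variable than population B, even though nothing about the populations themselves differs.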

 Population geneticists and evolutionary biologists are often interested in determining whether signatures of natural selection can be detected in a specific population. That is, is there genetic evidence that the population has adapted to local environmental conditions? One major signal is reduced variation, because strong selection can eliminate all but the most favored variant(s) in a given relevant gene. But if you only look at common variants, you may not see this. Lachance and Tishkoff report that they identified many possible targets of natural selection when they looked at the complete genome sequences, but most of these were missed when the analyses used the pre-ascertained SNPs.

The authors suggest a number of possible ways to work around SNP ascertainment bias, the best being whole genome sequencing, which of course is still prohibitively expensive for even a moderate sample size. Others include correcting for ascertainment bias in the models and analyses, although this isn't perfect.

The idea is that if you sequence whole genomes in enough people from enough geographic sites, you can identify variable sites directly, rather than looking in population 2 only at sites you already found to vary in population 1. The degree to which this is feasible without introducing your own types of sampling bias is debatable, as is the adequate sample size--and whether it is worth the expenditure. But it is at least important to realize the extent to which what you decide to collect may, in advance and even unwittingly, affect what you conclude. How, and how well, one can actually detect natural selection's signatures with such data is another subject, and it's a difficult one. We should not leap into costly sequencing efforts just to satisfy some curiosity on the subject, because there are many other kinds of issues that one would have to consider--and it's often not clear what the value of even a correctly identified signal is.

And there are considerations other than SNP ascertainment bias that affect the reliability of population genetic analyses. They include sample size and choice, among other things. But, as with personal ancestry testing, if you're trying to understand population 2 primarily with knowledge about population 1, your results may not be reliable if they don't 'triangulate' with regard to enough appropriate alternatives. And it can be hard to tell if you've done that.

Nonetheless, while recognition of these issues is at least 20 years old, they are still worth being aware of.
