Friday, August 29, 2014

Genomic cold fusion? Part II. Realities of mapping

Mapping to find genomic causes of a trait of interest, like a disease, is done when the basic physiology is not known—maybe we have zero ideas, or the physiology we think is involved doesn’t show obvious differences between cases and controls.  If you know the biology, you won't have to use mapping methods, because you can explore the relevant genes directly.  Otherwise, as is often the case today, we have to go fishing in the genome to find places that vary in association—statistical regularity—with the trait.

The classical way to do this is called linkage analysis.  That term generally refers to tracing cases and marker variants in known families.  If parents transmit a causal allele (variant at some place in the genome) to their children, then we can find clusters of cases in those families, but no cases in other families (assuming one cause only).  We have Mendel’s classical rules for the transmission pattern and can attempt to fit that pattern to the data—for example, to exclude some non-genetic trait sharing.  After all, family members might share many things just because they have similar interests or habits.  Even disease can be due to shared environmental exposures. Mendelian principles allow us, with enough data, to discriminate.

“Enough data” is the catch.  Linkage analysis works well if there is a strong genetic signal.  If there is only one cause, we can collect multiple families and analyze their transmission patterns jointly.  Or, in some circumstances, we can collect very large, multi-generational families (often called pedigrees) and try to track a marker allele with the trait across the generations.  This has worked very well for some very strong-effect variants conferring very high risk for very specific, even quite rare, disorders.  That is because the linkage disequilibrium—the association between a marker allele and a causal variant due to their shared evolutionary history, as described in Part I—ties the two together in these families.
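
To make the tracking idea concrete, here is a minimal sketch of the classical two-point LOD score on which family linkage analysis rests: the log10 likelihood ratio of linkage at some recombination fraction versus free recombination (0.5).  The counts of recombinant and non-recombinant meioses below are invented for illustration.

```python
import math

def lod_score(recombinants, nonrecombinants, theta):
    """Two-point LOD score: log10 likelihood ratio of linkage at
    recombination fraction theta versus no linkage (theta = 0.5)."""
    n = recombinants + nonrecombinants
    log_l_theta = (recombinants * math.log10(theta)
                   + nonrecombinants * math.log10(1 - theta))
    log_l_null = n * math.log10(0.5)   # free recombination
    return log_l_theta - log_l_null

# Hypothetical pedigree data: 1 recombinant among 10 informative meioses.
# A conventional threshold for declaring linkage is a LOD of 3.
scores = {t: lod_score(1, 9, t) for t in (0.05, 0.1, 0.2)}
```

With strong-effect variants in big pedigrees, the informative meioses accumulate and the LOD climbs quickly; with weak or heterogeneous causes it never does, which is the "enough data" problem in miniature.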

But it is often very costly or impractical to collect actual large pedigrees that include many children in each generation, across multiple generations.  Family members who have died cannot be studied and medical records may be untrustworthy; family members may have moved, refuse to participate in a study, or be inaccessible for many other reasons.  So a generation or so ago the idea arose that if we collect cases from a population, we also collect copies of nearby marker alleles in linkage disequilibrium with the causal variant—shared evolutionary history in the population.  As described in Part I, such a marker allele has been transmitted through many generations of an unknown but assumed pedigree, along with the causal variant.  This is implicit linkage analysis, called genomewide association analysis (GWAS), about which we’ve commented many times in the past.  GWAS looks for association between marker and causal site in implicit but assumed pedigrees, and is thus another form of linkage analysis.
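
For readers who want the linkage-disequilibrium idea in concrete form, here is a toy calculation of D and r², the standard measures of how well a marker allele "stands in" for a nearby causal variant.  The haplotype and allele frequencies are invented for illustration.

```python
def ld_stats(p_ab, p_a, p_b):
    """D and r^2 between two loci, given the frequency of the AB
    haplotype (p_ab) and the allele frequencies p_a and p_b."""
    d = p_ab - p_a * p_b                       # departure from independence
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2

# Perfect LD (marker allele always rides with the causal allele):
d_perfect, r2_perfect = ld_stats(p_ab=0.2, p_a=0.2, p_b=0.2)

# No LD (the two alleles are inherited independently):
d_none, r2_none = ld_stats(p_ab=0.04, p_a=0.2, p_b=0.2)
```

When r² is near 1, a case-control signal at the marker is effectively a signal at the causal site; as recombination erodes the shared history, r² falls and so does the marker's usefulness.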

When genetic causation is simple enough, this will work.  Indeed, it is far easier and less costly to collect many cases and controls than many deep pedigrees, so that a carefully designed GWAS can identify causes that are reasonably strong.  But this may not always work when a trait is ‘complex’ and has many different genetic and/or environmental contributing causes.
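
The core test a GWAS applies at each marker can be sketched very simply: a 2x2 chi-square comparing allele counts in cases versus controls.  The counts below are invented for illustration.

```python
def allele_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square for a 2x2 table of allele counts
    (allele A vs allele B, in cases vs controls)."""
    table = [[case_a, case_b], [ctrl_a, ctrl_b]]
    total = case_a + case_b + ctrl_a + ctrl_b
    rows = [case_a + case_b, ctrl_a + ctrl_b]
    cols = [case_a + ctrl_a, case_b + ctrl_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical marker: allele A is enriched in cases.
chi2 = allele_chi_square(case_a=60, case_b=40, ctrl_a=40, ctrl_b=60)
```

Real GWAS repeat this (or a regression version of it) at hundreds of thousands of markers, which is why genome-wide significance thresholds are so stringent; weak causes, by definition, produce tables close to their expected values and are easily lost in the multiple-testing noise.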

If causation is complex, families provide a more powerful kind of sample to use in searching for genetic factors.  The reason is simple: in general a single family will be transmitting fewer causal variants than a collection of separate families.  Related to this is the reason that isolate populations, like Finland or Iceland, can in principle be good places to search, because they represent very large, even if implicit, pedigrees.  Sometimes the pedigree can actually be documented in such populations.

If causation is complex, then linkage analysis in families will hopefully be better than big population samples for finding causal contributors, simply because a family will be segregating (transmitting) fewer different causal variants than a big population.  We might find a variant by linkage analysis in a big family, or an isolate population, but of course if there are many different variants, a given family may point us only to one or two of them.  For this reason, many argue that family analysis is useless for complex traits—one commenter on a previous Tweet we made from our course likened linkage analysis for complex traits to ‘cold fusion’.  That comparison is mistaken.

Association analysis, the main alternative to linkage analysis, is just a combination of many different implicit families, for the population-history reasons we’ve described here and in Part I.  The more families you combine, whether explicit or implicit, the more variation, including statistical ‘noise’, you incorporate.  The rather paltry findings of many GWAS are a testament to this fact: they have explained only a small fraction of the variation in most traits to which the method has been applied.  Worse, the larger a sample of this type, like cases vs controls, the more environmental variation you may be grouping together, again greatly watering down the already weak signals of many or, probably, by far most genetic causal factors.

In fact, if you are forced to go fishing for genetic causes, you may well be fishing in dreamland, because you may simply be in denial of the implications of causal complexity.  All mapping is a form of linkage analysis; rather than dismissing one method in favor of another, one should tailor one’s approach to the realities of the data and the trait.  Some complex trait genes have been found by linkage analysis (e.g., the BRCA breast-cancer associated genes), though of course here we might quibble about the definition of 'complexity'.

Sneering at linkage analysis because it is difficult to get big families, or because even single deep families may themselves be transmitting multiple causes (as is often found in isolate studies, in fact), is often simply a circle-the-wagons defense of Big Data studies, which capture huge amounts of funding with relatively little payoff to date.

A biological approach?
Many linkage and association analyses are done because we don’t understand the basic biology of a trait well enough to go straight to ‘candidate’ genes to detect, prevent, or develop treatments for it.  Even though this approach has been the rule for nearly 20 years now, with little payoff, the defense today is often still that more, more, and even more data will solve the problem.  But if causation is too complex, this can also be a costly, self-interested, weak defense.

The hope is that if we have whole genome sequences on huge numbers of people, or even everyone in a population, or in many populations so we can pool data, we will find the pot of gold (or is it cold fusion?) at the end of the rainbow.

One argument for this is to search population-wide genome-sequenced biomedical databases for variants that may be transmitted from parents to offspring, but that are so rare they cannot generate a useful signal in huge, pooled GWAS studies.  This usually still takes the form of linkage analysis: a marker in a given causal gene is transmitted with the trait in occasional families, but the same gene is identified, even if via different families.  That is, if variation in the same gene is found to be involved in different individuals, but with different specific alleles, then one can take that gene seriously as a causal candidate.
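
The gene-level logic described here can be sketched as a simple tally: different families carry different rare alleles, but what counts is how many independent carriers converge on the same gene.  The gene and family labels below are hypothetical.

```python
def gene_burden(variant_calls):
    """variant_calls: list of (gene, carrier_id) pairs, one per rare
    variant observed.  Returns the number of distinct carriers per gene,
    regardless of which specific allele each carrier has."""
    carriers = {}
    for gene, carrier in variant_calls:
        carriers.setdefault(gene, set()).add(carrier)
    return {gene: len(people) for gene, people in carriers.items()}

# Hypothetical calls: three unrelated affected families each carry a
# different rare variant in GENE1; GENE2 appears only once.
calls = [("GENE1", "fam1_case"), ("GENE1", "fam2_case"),
         ("GENE1", "fam3_case"), ("GENE2", "fam1_case")]
```

Real rare-variant 'burden' tests elaborate on this tally with statistical models and comparison to unaffected carriers, but the convergence-on-a-gene idea is the same.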

This sometimes works, but usually only when the gene’s biology is known enough to have a reason to suspect it.  Otherwise, the problem is that so much is shared between close family members (whether implicitly or explicitly in known pedigrees) that if you don’t know the biology there will be too much to search through, too much co-transmitted variation.  Causal variation need not be in regular ‘genes’, but can be, and for complex traits seems typically to be, in regulatory or other regions of the genome, whose functional sites may not be known.  Also, we all harbor variation in genes that is not harmful, and we all carry ‘dead’ genes without problems, as many studies have now shown.

If one knows enough biology to suspect a set of genes, and finds variants of known effect (such as truncating a gene’s coding region so a normal protein isn’t made) in different affected individuals, then one has strong evidence of having found a target gene.  There are many examples of this for single-gene traits.  But for complex traits, even most genes that have been identified have only weak effects: the same variant is most of the time also found in healthy, unaffected individuals.  In this case, which seems often to be the biological truth, there is no big-cause gene to be found, or a gene has a big effect only in some unusual genotypes in the rest of the genome.

Even knowing the biology doesn't say whether a given gene's protein code is involved, rather than its regulation or other related factors (like making the chromosomal region available in the right cells, downregulating its messenger RNA, and other genome functions).  Even across multiple instances of a gene region, there may be many nucleotide variants observed among cases and controls.  The hunt is usually not easy even knowing the biology, and this is, of course, especially true if the trait isn't well-defined, as is often the case, or if it is complex or has many different contributors.

Big Data, like any other method, works when it works.  The question is when, and whether it is worth its cost, regardless of how advantageous it is for investigators who like playing with (or having and managing) huge resources.  Whether it is any less ‘cold fusion’ than classical linkage analysis in big families is debatable.

Again, most searches for causal variation in the genome rest on statistical linkage between marker sites and causal sites due to shared evolutionary history.  Good study design is always important.  Dismissal of one method over another is too often little more than advocacy of a scientist’s personal intellectual or vested interests.

The problem is that complex traits are properly named:  they are complex.  Better ideas are needed than those being proposed these days.  We know that Big Data is ‘in’ and the money will pour in that direction.  From such databases all sorts of samples, family-based or otherwise, can be drawn.  Simulation of strategies (such as with programs like our ForSim, which we discussed in our recent Logical Reasoning course in Finland) can be done to try to optimize studies.

In the end, however, fishing in a pond of minnows, no matter how it’s done, will only find minnows. But these days they are very expensive minnows.


Manoj Samanta said...

"In the end, however, fishing in a pond of minnows, no matter how it’s done, will only find minnows. But these days they are very expensive minnows."

I am going to call it 'genome wide fishing for minnows' (GWFM) from now on. The slogan will be -

'Try GWFM long enough and you will see lochness monster in the pond of minnows'.

Ken Weiss said...

Reply to Manoj
Well, occasionally what is a minnow in one pond is a Loch Nessie in another, and that's the context-specificity lesson, I think. Of course, anyone running a carnival, or perhaps a charivari if you know that term, has an interest in noisily proclaiming minnows to be monsters.