The classical way to do this is called linkage
analysis. That term generally refers to
tracing cases and marker variants in known families. If parents transmit a causal allele (variant
at some place in the genome) to their children, then we can find clusters of
cases in those families, but no cases in other families (assuming one cause only). We have Mendel’s classical rules for the
transmission pattern and can attempt to fit that pattern to the data—for
example, to exclude some non-genetic trait sharing. After all, family members might share many
things just because they have similar interests or habits. Even disease can be due to shared
environmental exposures. Mendelian principles allow us, with enough data, to
discriminate.
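To make the transmission-fitting idea concrete, here is a minimal Python sketch (ours, not any standard linkage software) of the classical two-point LOD score, which compares how well family data fit linkage at a chosen recombination fraction against free recombination. It assumes, for simplicity, fully informative meioses, and the example counts are invented.

```python
import math

def lod_score(n_meioses, n_recombinants, theta):
    """Two-point LOD score for fully informative meioses (toy illustration).

    Compares the likelihood of observing `n_recombinants` recombinants out of
    `n_meioses` meioses at recombination fraction `theta` with the likelihood
    under free recombination (theta = 0.5, i.e. no linkage).
    """
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    r, n = n_recombinants, n_meioses
    log_l_theta = r * math.log10(theta) + (n - r) * math.log10(1 - theta)
    log_l_null = n * math.log10(0.5)  # marker and trait unlinked
    return log_l_theta - log_l_null

# Invented example: 10 informative meioses, 1 recombinant, tested at theta = 0.1.
# A LOD of 3 or more is the conventional threshold for declaring linkage.
print(round(lod_score(10, 1, 0.1), 2))  # ~1.6 here: suggestive but not conclusive
```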
“Enough data” is the catch.
Linkage analysis works well if there is a strong genetic signal. If there is only one cause, we can collect
multiple families and analyze their transmission patterns jointly. Or, in some circumstances, we can collect
very large, multi-generational families (often called pedigrees) and try to track
a marker allele with the trait across the generations. This has worked very well for some very
strong-effect variants conferring very high risk for very specific, even quite
rare, disorders. That is because linkage disequilibrium, the association between
a marker allele and a causal variant due to their shared evolutionary history
(as described in Part I), ties the two together in these families.
But it is often very costly or impractical to collect real
large pedigrees with many children in each of multiple
generations. Family members who have died cannot be
studied, medical records may be untrustworthy, and living relatives may have
moved, refuse to participate in a study, or be inaccessible for many other
reasons. So a generation or so ago the
idea arose that if we collect cases from a population, we also collect
copies of nearby marker alleles that are in linkage disequilibrium with the
causal variant through their shared evolutionary history in the population. As
described in Part I, such a marker allele has been transmitted through many
generations of an unknown but assumed pedigree along with the causal
variant. This is implicit linkage analysis, known as genomewide association
analysis (GWAS), about which we've commented many times in the past.
A GWAS looks for association between marker and causal site in implicit
but assumed pedigrees; it is simply another form of linkage analysis.
When genetic causation is simple enough, this will
work. Indeed, it is far easier and less
costly to collect many cases and controls than many deep pedigrees, so that a
carefully designed GWAS can identify causes that are reasonably
strong. But this may not always work when a trait is 'complex', that is, when it
has many different genetic and/or environmental contributing causes.
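Concretely, the per-marker test at the heart of a GWAS is very simple. The hypothetical Python sketch below computes a basic allelic chi-square from invented case and control allele counts; a real analysis would add quality control, covariate adjustment, correction for population structure, and a genome-wide multiple-testing threshold.

```python
def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Basic 1-df allelic chi-square for one marker in a case-control sample.

    Inputs are allele counts (two per person). This is only the core idea;
    it ignores quality control, covariates, population structure, and the
    multiple-testing burden of scanning the whole genome.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = sum(sum(row) for row in table)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = sum(table[i]) * (table[0][j] + table[1][j]) / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Invented counts: the alternate allele on 300 of 2000 case chromosomes
# versus 220 of 2000 control chromosomes.
print(round(allelic_chi_square(300, 1700, 220, 1780), 1))  # ~14 for these counts
```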
If causation is complex, families provide a more powerful
kind of sample to use in searching for genetic factors. The reason is simple: in general a single
family will be transmitting fewer causal variants than a collection of separate
families. For a related reason, isolate populations such as Finland or Iceland
can in principle be good places to search, because they represent very large,
even if implicit, pedigrees. Sometimes the pedigree can actually be documented
in such populations.
If causation is complex, then linkage analysis in families will hopefully
be better than big population samples for finding causal contributors, simply
because a family will be segregating (transmitting) fewer different causal
variants than a big population. We might
find the variant in linkage analysis in a big family, or an isolate population,
but of course if there are many different variants, a given family may point us
only to one or two of them. For this reason, many argue that family analysis is
useless for complex traits; one commenter on a previous tweet from our course
likened linkage analysis for complex traits to 'cold fusion'. In fact, that
comparison is mistaken.
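That the comparison is mistaken is easy to see even in a deliberately oversimplified simulation. The Python sketch below uses invented assumptions (fifty possible causal variants, one causal variant per case, complete sharing among affected relatives in a family) purely to illustrate why a single pedigree narrows the list of candidate variants.

```python
import random

random.seed(1)

# Toy assumption: the trait can be caused by any one of many different rare variants.
N_CAUSAL_VARIANTS = 50

def population_case_variants(n_cases):
    """Each unrelated case drawn from the population has its own causal variant."""
    return [random.randrange(N_CAUSAL_VARIANTS) for _ in range(n_cases)]

def family_case_variants(n_cases):
    """Affected relatives in one family all inherit the founder's causal variant."""
    founder_variant = random.randrange(N_CAUSAL_VARIANTS)
    return [founder_variant] * n_cases

print("distinct causal variants among 20 unrelated cases:  ",
      len(set(population_case_variants(20))))
print("distinct causal variants among 20 affected relatives:",
      len(set(family_case_variants(20))))
```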
Association analysis, the main alternative to linkage
analysis, is just a way of combining many different implicit families, for the
population-history reason we've described here and in Part I. The more families you combine, whether they
are explicit or implicit, the more variation, including statistical 'noise',
you incorporate. The rather paltry
findings of many GWAS are a testament to this fact, explaining as they have only a
small fraction of the variation in most traits to which that method has been applied. Worse, the larger
such a sample, say of cases and controls, the more environmental
variation you may be grouping together, further diluting the already
weak signal of many, probably most, genetic causal factors.
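A worked number shows why the explained fractions are so small. Under the standard additive model, a biallelic variant with allele frequency p and per-allele effect beta contributes 2p(1-p)beta^2 to the trait variance; the short Python sketch below plugs in hypothetical values of the sort typical of reported GWAS effect sizes.

```python
def variance_explained(allele_freq, effect_size, trait_variance=1.0):
    """Fraction of trait variance explained by one additive biallelic variant.

    Under the additive model the variant contributes 2 * p * (1 - p) * beta**2
    to the trait variance; the values of p and beta used below are hypothetical.
    """
    p, beta = allele_freq, effect_size
    return 2 * p * (1 - p) * beta ** 2 / trait_variance

# A common variant (frequency 0.3) with a small effect of 0.05 trait standard deviations:
print(f"{variance_explained(0.3, 0.05):.3%}")  # roughly 0.1% of the variance
```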
If you are forced to go fishing for genetic cause,
you may well be fishing in dreamland, simply in denial of the
implications of causal complexity. All mapping, after all, is a form of linkage
analysis. Instead of fishing blindly, one should tailor one's approach to
the realities of the data and the trait. Some
complex-trait genes have been found by linkage analysis (e.g., the BRCA
breast-cancer associated genes), though of course here we might quibble about the definition of 'complexity'.
Sneering at linkage analysis because it is difficult to get
big families, or because even single
deep families may themselves be transmitting multiple causes (as is often found
in isolate studies, in fact), is often simply a circle-the-wagons defense of Big
Data studies, which capture huge amounts of funding with relatively little
payoff to date.
A biological approach?
Many linkage and association analyses are done because we
don't understand the basic biology of a trait well enough to go straight to
'candidate' genes for detection, prevention, or treatment. Even though this approach has been the
rule for nearly 20 years, with little payoff, the defense is often still
that more, more, and even more data will solve the problem. But if causation is too complex, this can also
be a costly, self-interested, weak defense.
The hope is that if we have whole-genome sequence on huge numbers of people,
or even on everyone in a population, or in many populations whose data we can pool,
we will find the pot of gold (or is it cold fusion?) at the end of the
rainbow.
One argument for this is to search population-wide, genome-sequenced
biomedical databases for variants that are transmitted from
parents to offspring but are so rare that they cannot generate a useful
signal in huge, pooled GWAS.
This is usually still a form of linkage analysis: a marker
in a given causal gene is transmitted with the trait in occasional families, and
the same gene is identified even though the families, and the specific alleles,
differ. That is, if variation in the same gene is
found to be involved in different individuals, but with different specific
alleles, then one can take that gene seriously as a causal candidate, as the sketch below illustrates.
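That gene-level reasoning is essentially what rare-variant 'collapsing' or burden approaches do: pool different rare alleles by gene and ask whether carriers pile up among affected individuals. The Python sketch below shows only that bookkeeping step, with invented person and gene names and no formal statistics.

```python
from collections import defaultdict

def gene_burden_tally(case_variants, control_variants):
    """Collapse different rare variants by gene and compare carrier counts.

    Each input is a list of (person_id, gene) pairs, one per rare qualifying
    variant observed. A real analysis would follow this tally with a formal
    burden test and correct for the number of genes examined.
    """
    def carriers(pairs):
        by_gene = defaultdict(set)
        for person, gene in pairs:
            by_gene[gene].add(person)
        return by_gene

    case_c, control_c = carriers(case_variants), carriers(control_variants)
    for gene in sorted(set(case_c) | set(control_c)):
        print(gene, "case carriers:", len(case_c.get(gene, set())),
              "control carriers:", len(control_c.get(gene, set())))

# Invented input: different alleles in the same gene, seen in different affected people.
gene_burden_tally(
    case_variants=[("fam1_child", "GENE_A"), ("fam2_child", "GENE_A"), ("fam3_child", "GENE_B")],
    control_variants=[("control_7", "GENE_B")],
)
```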
This gene-collapsing approach sometimes works, but usually only when the gene's
biology is known well enough to give a reason to suspect it. Otherwise, the problem is that so much is
shared between close family members (whether implicitly or explicitly in known
pedigrees) that if you don’t know the biology there will be too much to search
through, too much co-transmitted variation.
Causal variation need not be in regular ‘genes’, but can be, and for
complex traits seems typically to be,
in regulatory or other regions of the genome, whose functional sites may not be
known. Also, we all harbor variation in
genes that is not harmful, and we all carry ‘dead’ genes without problems, as
many studies have now shown.
If one knows enough biology to suspect a set of genes, and
finds variants of known effect (such as truncating a gene’s coding region so a
normal protein isn't made) in different affected individuals, then one has
strong evidence of having found a target gene.
There are many examples of this for single-gene traits. But for complex traits, even most genes that
have been identified have only weak effects: most of the time the same variant
is also found in healthy, unaffected individuals.
In this case, which seems often to be the biological truth, there is no
big-cause gene to be found, or a gene has a big effect only in some unusual
genotypes in the rest of the genome.
Even knowing the biology doesn't tell you whether a given gene's protein code is involved, rather than its regulation or other related factors (such as making the chromosomal region available in the right cells, down-regulating its messenger RNA, and other genome functions). Even when the same gene region turns up repeatedly, there may be many different nucleotide variants observed among cases and controls. The hunt is usually not easy even when the biology is known, and this is especially true if the trait isn't well defined, as is often the case, or if it is complex and has many different contributors.
Big Data, like any other method, works when it works. The question is when and whether it is worth
its cost, regardless of how advantageous it is for investigators who like playing
with (or having and managing) huge resources.
Whether or not it is any less 'cold fusion' than classical linkage
analysis in big families is debatable.
Again, most searches for causal variation in the genome rest on
statistical linkage between marker sites and causal sites due to shared
evolutionary history. Good study design
is always important. Dismissal of one method
over another is too often little more than advocacy of a scientist’s personal
intellectual or vested interests.
The problem is that complex traits are properly named: they are complex. Better ideas are needed
than those being proposed these days.
We know that Big Data is ‘in’ and the money will pour in that
direction. From such databases all
sorts of samples, family or otherwise, can be drawn. Simulation of strategies (such as with
programs like our ForSim, which we discussed in our recent Logical Reasoning
course in Finland) can be done to try to optimize studies.
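For readers curious what such a simulation involves, here is a toy forward-in-time (Wright-Fisher style) sketch in Python. It is nothing like ForSim itself, which models phenotypes, sampling designs, and much more; it only illustrates, with invented parameter values, how a causal allele's frequency can drift or be lost across generations, which is one of the things a study-design simulation has to track.

```python
import random

random.seed(2)

def forward_simulate(pop_size=500, start_freq=0.01, generations=50):
    """Toy Wright-Fisher forward simulation of one causal allele's frequency.

    Each generation, 2N allele copies are resampled from the previous
    generation's frequency. Parameter values are invented; real programs
    such as ForSim track far more (genotypes, phenotypes, sampling designs).
    """
    freq = start_freq
    for _ in range(generations):
        copies = sum(random.random() < freq for _ in range(2 * pop_size))
        freq = copies / (2 * pop_size)
        if freq == 0.0:  # the allele has been lost by drift
            break
    return freq

# How often does a rare causal allele survive 50 generations of drift?
final_freqs = [forward_simulate() for _ in range(100)]
print("lost in", sum(f == 0.0 for f in final_freqs), "of", len(final_freqs), "replicates")
```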
In the end, however, fishing in a pond of minnows, no matter
how it’s done, will only find minnows. But these days they are very expensive minnows.
"In the end, however, fishing in a pond of minnows, no matter how it’s done, will only find minnows. But these days they are very expensive minnows."
ReplyDeleteI am going to call it 'genome wide fishing for minnows' (GWFM) from now on. The slogan will be -
'Try GWFM long enough and you will see lochness monster in the pond of minnows'.
Reply to Manoj
ReplyDeleteWell, occasionally what is a minnow in one pond is a Loch Nessie in another, and that's the context-specificity lesson, I think. Of course, anyone running a carnival, or perhaps a charivari if you know that term, has an interest in noisily proclaiming minnows to be monsters.