Friday, October 21, 2011

Identifying candidate genes; theory and practice

So, we, along with collaborators here at Penn State, and other institutions, happen to be working on a project to map genes associated with craniofacial morphology in mice and baboons.  Yes, you can laugh, given all we've said about the problems with looking for genes for traits, that we too are involved in genomewide mapping!  But, genes are involved in morphology, and mice and baboons do differ, perhaps even systematically (by lineage), so, in theory, it's not a futile search.  And, we're hoping that all our skepticism about how these kinds of studies are usually done can point us toward more useful ways to think about the problem--not to enumerate genes but to document or better understand complex genetic causal architecture.  In theory.   

We're at the point now where traits have been measured on about 1200 mice, and these mice have been genotyped at thousands of variable nucleotide marker sites across the genome, the statistics have been run, 'lod' scores generated, and we're ready to identify candidate genes associated with these traits.

The lod ('log of the odds') scores -- each the base-10 logarithm of the ratio between how likely the data are if a genetic marker, a variable locus along a chromosome, is near a gene that contributes to variation in a trait, and how likely they are if it is not -- can point to fairly wide or fairly narrow intervals on a chromosome.  The scores of interest are those that are statistically significant, or 'suggestive', that is, almost statistically significant.  Some intervals indicated by statistically interesting marker sites are in gene-rich regions and some are in gene-poor regions of their respective chromosomes, so some are easier to mine than others.
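
For readers who want to see the arithmetic, here is a minimal sketch of one common way to compute a lod score at a single marker: regress the trait on genotype and compare the fit with and without the marker.  This is an illustration only, not the analysis used in our project; the function name, the 0/1/2 genotype coding, and the toy numbers are all assumptions made for the example.

```python
import math

def lod_score(genotypes, phenotypes):
    """Single-marker lod score from a simple linear regression.

    genotypes  -- marker genotypes coded 0, 1, 2 (an assumed coding)
    phenotypes -- trait measurements, one per animal
    Returns (n/2) * log10(RSS_null / RSS_marker).
    """
    n = len(phenotypes)
    mean_y = sum(phenotypes) / n
    mean_g = sum(genotypes) / n

    # Residual sum of squares with no marker effect (mean only).
    rss_null = sum((y - mean_y) ** 2 for y in phenotypes)

    # Least-squares slope of trait on genotype.
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(genotypes, phenotypes))
    slope = sxy / sxx

    # Residual sum of squares once the marker is in the model.
    rss_marker = rss_null - slope * sxy
    return (n / 2) * math.log10(rss_null / rss_marker)

# Toy, made-up data: six animals whose trait tracks genotype closely.
toy_genotypes = [0, 0, 1, 1, 2, 2]
toy_phenotypes = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
print(lod_score(toy_genotypes, toy_phenotypes))
```

A marker unrelated to the trait gives a lod near zero, and the score grows as the marker explains more of the trait variation.  Real mapping software handles missing genotypes, covariates, and positions between markers, none of which this sketch attempts.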

What we do with each chromosomal interval is to identify all the genes along it, and then hunt for data on how each gene has been characterized.  We are doing this manually, but it can be, and often is, automated with a program that looks for certain kinds of genes, in specified gene families, for example.  We are choosing to do this by hand because we are trying to reduce the preconceived assumptions with which we approach the problem.  Our goal is to collect expression and function data about each gene in every interval.  Yes, that's a lot of genes, but we know that most genes are pleiotropic: expressed in many tissues, with multiple functions.  Most genes have also not been fully characterized, and we don't want to assume things a priori, such as that a given class of gene is what we're looking for.
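
The first step, listing every gene in an interval without filtering by gene family, might be sketched like this.  The `Gene` record, the coordinates, and the gene names are hypothetical placeholders, not data from our study or from any real annotation database.

```python
from typing import List, NamedTuple

class Gene(NamedTuple):
    symbol: str   # placeholder name, not a real gene symbol
    chrom: str
    start: int    # base-pair coordinates, made up for the example
    end: int

def genes_in_interval(genes: List[Gene], chrom: str,
                      start: int, end: int) -> List[Gene]:
    """Return every gene that overlaps the interval at all.

    Nothing is filtered by family or presumed function, so no class
    of gene is ruled out a priori.
    """
    return [g for g in genes
            if g.chrom == chrom and g.start <= end and g.end >= start]

# Hypothetical annotation records.
example_genes = [
    Gene("geneA", "8", 1_000, 2_000),
    Gene("geneB", "8", 4_500, 6_000),   # straddles the interval boundary
    Gene("geneC", "8", 9_000, 9_500),   # outside the interval
    Gene("geneD", "3", 1_500, 1_800),   # wrong chromosome
]

for gene in genes_in_interval(example_genes, "8", 1_000, 5_000):
    print(gene.symbol)
```

Keeping any gene that overlaps the interval, rather than only those wholly inside it, is a deliberately inclusive choice in keeping with the aim of not excluding candidates in advance.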

But this is quite problematic. 

Let's take an interval we slogged through yesterday.  Several different traits yielded significant lod scores for it, so it is potentially of greater interest than the intervals that were only tagged by one trait.  It may, for example, reflect multiple traits that have correlated genetic causation during the development of the head.  Anyway, this interval, on mouse chromosome 8, includes 95 identified genes, and a few identified microRNAs (short RNA molecules that affect the expression of other genes).  In fact, and not unusually, very little of the non-coding DNA in this region has been characterized, which is problematic itself, since some of that is surely involved in gene regulation, and who's to say that the variation we're looking at is not due to gene regulation?

Of these 95 genes, we can't find expression data for 49 of them.  Over half.  So, are they expressed in the anatomical region relevant to the head traits that mapped to this area?  We don't know.  It's a major task to study the expression of even one gene in mice, much less a host of genes.  Apparently no functional data are available for 15 of these genes, and of the rest, for which function has been characterized, many are assumed to have a given function just because their DNA sequence is similar to that of a gene whose function is known.  Many others are characterized as "may be involved in" such and such.  And a handful have been identified as involved in a specific function, but expression data show that they are expressed in other tissues or regions of the body -- this seems to be particularly true for 'spermatogenesis' genes, which tend to show up everywhere.*  Olfactory (odor) receptors as well.
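
A tally like this is easy to keep as you go.  Here is a rough sketch of how one might count the gaps; the record fields (`expression`, `function`, `evidence`) and the toy records are invented for the illustration, and only the totals 95, 49 and 15 come from our interval.

```python
def annotation_gaps(records):
    """Count genes with no expression data, no function data, or a
    function inferred only from sequence similarity.

    Each record is a dict; the keys used here are assumed names.
    """
    summary = {"total": len(records), "no_expression": 0,
               "no_function": 0, "function_inferred": 0}
    for rec in records:
        if not rec.get("expression"):
            summary["no_expression"] += 1
        if not rec.get("function"):
            summary["no_function"] += 1
        elif rec.get("evidence") == "sequence_similarity":
            summary["function_inferred"] += 1
    return summary

# Toy records with placeholder names, just to exercise the function.
toy_records = [
    {"symbol": "geneA", "expression": ["brain"], "function": "binding protein",
     "evidence": "experiment"},
    {"symbol": "geneB", "expression": [], "function": "kinase",
     "evidence": "sequence_similarity"},
    {"symbol": "geneC", "expression": [], "function": None},
]
print(annotation_gaps(toy_records))

# The proportions for the chromosome 8 interval described above.
total_genes, no_expression, no_function = 95, 49, 15
print(f"no expression data: {no_expression / total_genes:.1%}")
print(f"no functional data: {no_function / total_genes:.1%}")
```

Separating "function inferred from similarity" from "function shown by experiment" matters here, because the two kinds of annotation deserve very different levels of trust.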

So, given the inconsistencies, vagaries, and incompleteness of the data, how are we supposed to go about choosing candidate genes?  One way is the Drunk Under the Lamp-post approach (looking for lost keys where there's light even if that's not where they were dropped):  to grab a gene we already know about, or narrow our search to genes in families that make some kind of biological sense.  That's looking for things we already know and making the comforting assumption they must be the right ones.  This will clearly work sometimes, and may be better than totally random guessing.  But such an approach of convenience perforce ignores many plausible candidates, both those that have been characterized and those that have not.  So, even when genomic databases are complete, this approach may not identify all actual candidates.  And that day may never arrive: different patterns of gene expression are associated with different environments, so characterization may never be complete, which reminds us of Wolpert's caution about yet another knock-out mouse with no apparent phenotype: "But have you taken it to the opera?"

And this is all true even if we believe we're looking for single genes 'for' our traits.  What about when we throw in the kinds of complexity we are always going on about here on MT?  It's fair to say that most gene mapping studies don't result in confirmed genes for traits, or a few genes are found but usually with only a minor contribution compared to that of other, unidentified genes.  See most GWAS, for example.  And if we had access to unpublished GWAS results, this would be even more convincing.

So, while we do believe that genes underlie traits in some collective way, and that in theory they are identifiable, current methods and the current state of (in)completeness of the databases leave us more convinced than ever that identifying genes 'for' traits is an uphill slog.


* Here's a section from an in situ experiment on a mouse embryo, taken from the extensive gene expression database, GenePaint.  The darker areas are where the gene in question is expressed.  In this case, the strongest expression can be seen in the brain, the pituitary gland, the lungs, the whiskers, the growing sections of the digits, and there is expression in other tissues as well.  The function of this gene is characterized as, "Probably plays an important role in spermatogenesis." 


James Goetz said...

"Probably plays an important role in spermatogenesis."

Hmm, this modest conclusion does not help with publicity for your lab :-)

Ken Weiss said...

Well, it shows the problems. One is the naming of genes not for a particular molecular property or basic function (such as 'secreted signaling factor' or 'such-and-such binding protein') but for the first tissue or mutational effect that led to their discovery.