
Tuesday, October 16, 2018

Where has all the thinking gone....long time passing?

Where did we get the idea that our entire nature, not just our embryological development but everything else, was pre-programmed by our genome?  After all, the very essence of Homo sapiens, compared to all other species, is that we use culture--language, tools, etc.--to do our business, rather than just our physical biology.  In a serious sense, we evolved to be free of our bodies: our genes made us freer from our genes than most if not all other species!  And we evolved to live long enough to learn the language and technology on which our long lives depend.

Yet isn't an assumption of pre-programming the only assumption by which anyone could legitimately promise 'precision' genomic medicine?  Of course, Mendel's work, adopted by human geneticists over a century ago, allowed great progress in understanding how genes lead to at least the simpler of our traits, those with discrete (yes/no) manifestations.  These do include many diseases that really, perhaps surprisingly, behave in Mendelian fashion, for which concepts like dominance and recessiveness have been applied and, sometimes, at least approximately hold up to closer scrutiny.

Even 100 years ago, agricultural and other geneticists who could do experiments, largely confirmed the extension of Mendel to continuously varying traits, like blood pressure or height.  They reasoned that many genes (whatever they were, which was unknown at the time) contributed individually small effects.  If each gene had two states in the usual Aa/AA/aa classroom example sense, but there were countless such genes, their joint action could approximate continuously varying traits whose measure was, say, the number of A alleles in an individual.  This view was also consistent with the observed correlation of trait measure with kinship-degree among relatives.  This history has been thoroughly documented.  But there are some bits, important bits, missing, especially when it comes to the fervor for Big Data 'omics analysis of human diseases and other traits.  In essence, we are still, a century later, conceptual prisoners of Mendel.
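To make that classroom model concrete, here is a minimal simulation sketch (my own illustration, not from the original sources): many biallelic loci, each contributing one 'dose' per A allele, produce a smooth, approximately normal trait distribution.

```python
# Toy polygenic model: the trait is simply each person's total count of
# 'A' alleles across many independent biallelic loci.
import numpy as np

rng = np.random.default_rng(0)
n_people, n_loci, freq_a = 10_000, 1_000, 0.5

# 0, 1, or 2 copies of A at each locus for each person.
genotypes = rng.binomial(2, freq_a, size=(n_people, n_loci))
trait = genotypes.sum(axis=1)

print(f"mean = {trait.mean():.1f}, sd = {trait.std():.1f}")
# With this many loci, the distribution of the summed counts is
# essentially indistinguishable from a normal curve (central limit
# theorem), mimicking continuously varying traits like height.
```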

'Omics over the top: key questions generally ignored
Let us take GWAS (genomewide association studies) at face value.  GWAS find countless 'hits': sites of whatever sort across the genome whose variation affects variation in WhateverTrait you choose to map (everything simply must be 'genomic' or some other 'omic, no?).  WhateverTrait varies because every subject in your study has a different combination of contributing alleles.  Somewhat resembling classical Mendelian recessiveness, contributing alleles are found in cases as well as controls (or across the measured range of quantitative traits like stature or blood pressure).  The measured trait reflects how many A's one has: WhateverTrait is essentially the sum of A's in 'cases', which may be interpreted as a risk--some sort of 'probability' rather than certainty--of having been affected or of having the measured trait value.

We usually treat risk as a 'probability': a single value, p, that applies to everyone with the same genotype.  Here, of course, no two subjects have exactly the same genotype, so some sort of aggregate risk score, adding up each person's 'hits', is assigned a p.  This, however, tacitly assumes that each site contributes some fixed risk or 'probability' of affection.  But this treats these values as if they were essential to the site, each thus acting as a parameter of risk.  That is, sites are treated as a kind of fixed value or, one might say, 'force', relative to the trait measure in question.
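As a hedged sketch of that kind of aggregate score (the per-site weights and baseline below are invented for illustration, not real GWAS estimates), the usual recipe sums fixed per-allele effects over a person's genotype and converts the total to a 'probability':

```python
# Hypothetical aggregate (polygenic) risk score: per-site log odds
# ratios, estimated once from past samples, are treated as fixed
# parameters of risk and summed over each person's allele counts.
import numpy as np

rng = np.random.default_rng(1)
n_people, n_hits = 5, 200

genotypes = rng.binomial(2, 0.3, size=(n_people, n_hits))  # allele counts
log_or = rng.normal(0.0, 0.05, size=n_hits)   # invented 'effect' estimates
baseline = -2.5                               # invented baseline log odds

score = genotypes @ log_or + baseline
risk = 1.0 / (1.0 + np.exp(-score))           # the assigned 'p'

for i, p in enumerate(risk):
    print(f"person {i}: assigned risk p = {p:.3f}")
```

Everything questioned in the surrounding text is visible here: the weights are past-sample estimates treated as fixed parameters, and the sum assumes each site's contribution is additive and context-free.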

One obvious and serious issue is that these are necessarily estimated from past data, that is, by induction from samples.  Not only is there sampling variation that usually is only crudely estimated by some standard statistical variation-related measure, but we know that the picture will be at least somewhat different in any other sample we might have chosen, not to mention other populations; and those who are actually candid about what they are doing know very well that the same people living in a different place or time would have different risks for the same trait.

No study is perfect, so we use some conveniently assumed well-behaved regression/correction adjustments to account for the statistical 'noise' due to factors like age, sex, and unmeasured environmental effects.  Much worse than these issues, there are clear sources of imprecision, and the major one, taboo even to think about much less to mention, is that relevant future factors (mutations, environments, lifestyles) are unknowable, even in principle.  So what we really do, are forced to do, is extend what the past was like to the assumed future.  Besides this, we don't count somatic changes (mutations arising in body tissues during life, that were not inherited), because they'd mess up our assertions of 'precision', and we can't measure them well in any case (so just shut one's eyes and pretend the ghost isn't in the house!).

All of these together mean that we are estimating risks from imperfect existing samples and past life-experience, but treating them as underlying parameters so that we can extend them to future samples.  What that does is equate induction with deduction, assuming the past is rigorously parametric and will be the same in the future;  but this is simply scientifically and epistemologically wrong, no matter how inconvenient it is to acknowledge this.  Mutations, genotypes, and environments of the future are simply unpredictable, even in principle.

None of this is a secret, or a new discovery, in any way.  What it is, is inconvenient truth.  These things by themselves--without even badgering investigators about environmental factors that, as we know very well, typically predominate--should have been enough to show that the NIH's precision promises cannot be accurate ('precise'), even to a knowable degree.  Yet this 'precision' sloganeering is being sheepishly aped all over the country by all sorts of groups who don't think for themselves and/or who go along lest they get left off the funding gravy train.  This is the 'omics fad.  If you think I am being too cynical, just look at what's being said, done, published, and claimed.

These are, to me, deep flaws in the way the GWAS and other 'omics industries, very well-heeled, are operating these days, to pick the public's pocket (pharma may, slowly, be awakening-- Lancet editorial, "UK life science research: time to burst the biomedical bubble," Lancet 392:187, 2018).  But scientists need jobs and salaries, and if we put people in a position where they have to sing in this way for their supper, what else can you expect of them?

Unfortunately, there are much more serious problems with the science, and they have to do with the point-cause thinking on which all of this is based.

Even a point-cause must act through some process
By far most of the traits, disease or otherwise, that are being GWAS'ed and 'omicked these days, at substantial public expense, are treated as if the mapped 'causes' are point causes.  If there are n causes, and a person has an unlucky set m out of many possible sets, one adds 'em up and predicts that person will have the target trait.  And there is much that is ignored, assumed, or wishfully hidden in this 'will'.  It is not clear how many authors treat it, tacitly, as a probability vs a certainty, because no two people in a sample have the same genotype and all we know is that they are 'affected' or 'unaffected'.

The genomics industry promises, essentially, that from conception onward your DNA sequence will predict your diseases, even if only in the form of some 'risk'; the latter is usually a probability and, despite the guise of 'precision', it can of course be adjusted as we learn more.  For example, it must be adjusted for age, and usually for other variables.  Thus we need ever larger, more numerous, and longer-lasting samples.  This alone should steer people away from being profiteered by DNA testing companies.  But that snipe aside, what does this risk or 'probability' actually mean?

Among other things, those candid enough to admit it know that environmental and lifestyle factors have a role, interacting with the genotype if not, usually, overwhelming it.  That means, for example, that the genotype only confers some, often modest, risk probability, with the actual risk much more affected by lifestyle factors, most of which are not measured, not measured accurately, or not even yet identified.  And usually there is some aspect that relates to age, or some assumption about what 'lifetime' risk means.  Whose lifetime?

Aspects of such a 'probability'
There are interesting, longstanding issues about these probabilities, even if we assume they have some kind of meaning.  Why do so many important diseases, like cancers, only arise at some advanced age?  How can a genomic 'risk' be so delayed, and so different among people?  Why do mice, with genotypes very similar to ours (which is why we experiment on them to learn about human disease), live only to 3 while we live to our 70s and beyond?

Richard Peto raised some of these questions many decades ago.  But they were never really addressed, even in an era when NIH et al. were spending much money on 'aging' research, including studies of lifespan.  There were generic evolutionary theories suggesting why some diseases were deferred to later ages (antagonistic pleiotropy is the usual term), but nobody tried seriously to explain from a molecular/genetic point of view why that was.  Why do mice live only 3 years, anyway?  And so on.

These are old questions and very deep ones but they have not been answered and, generally, are conveniently forgotten--because, one might argue, they are inconvenient.

If a GWAS score increases the risk of a disease that has a long-delayed onset pattern, often striking late in life, and highly variable among individuals or over time, what sort of 'cause' is that genotype?  What is it that takes decades for the genes to affect the person?  There are a number of plausible answers, but they get very little attention, at least in part because addressing them stands in the way of the vested interests of entrenched, too-big-to-kill, faddish Big Data 'research' that demands instant promises to the public it is trephining for support.  If the major reason is lifestyle factors, then the very delayed onset should be taken as persuasive evidence that the genotype is, in fact, by itself not a very powerful predictor.

Why would the additive effects of some combination of GWAS hits lead to disease risk?  That is, in our complex nature, why would each gene's effects be independent of every other contributor's?  In fact, mapping studies usually show evidence that other things, such as interactions, are important--but these are at present almost too complex to understand.

Does each combination of genome-wide variants have a separate age-onset pattern, and if not, why not?  And if so, how does the age effect work (especially if not due to person-years of exposure to the truly determining factors of lifestyle)?  If such factors are at play, how can we really know, since we never see the same genotype twice? How can we assume that the time-relationship with each suspect genetic variant will be similar among samples or in the future?  Is the disease due to post-natal somatic mutation, in which case why make predictions based on the purported constitutive genotypes of GWAS samples?

Obviously, if long-delayed onset patterns are due not to genetics but to lifestyle exposures interacting with genotypes, then perhaps lifestyle exposures should be the health-related target, not exotic genomic interventions.  Of course, the value of genome-based prediction clearly depends on environmental/lifestyle exposures, and the future of these exposures is obviously unknowable (as we clearly do know from seeing how unpredictably past exposures have affected today's disease patterns).

The point here is that our reliance on genotypes is a very convenient way of keeping busy and bringing in the salaries, while not facing up to the much more challenging issues that the easy approach (running lots of data through DNA sequencers) can't address.  I did not invent these points, and it is hard to believe that at least the more capable and less me-too scientists don't know them clearly, if quietly.  Indeed, I know this from direct experience.  Yes, scientists are fallible and vain, and we're only human.  But of all human endeavors, science should be based on honesty, because we have to rely on trust in each other's work.

The scientific problems are profound, not easily solved, and not soluble in a hurry.  But much of the problem comes from the funding and careerist system that shackles us.  This is, in many ways, the deeper explanation.  The paint on the House of Science is the science itself, but it is the House that supports that paint that is the real problem.

A civically responsible science community, and its governmental supporters, should be freed from the iron chains of relentless Big Data for their survival, and start thinking, seriously, about the questions that their very efforts over the past 20 years, on trait after trait, in population after population, and yes, with Big Data, have clearly revealed.

Sunday, October 15, 2017

Understanding Obesity? Fat Chance!

Obesity is one of our more widespread and serious health-threatening traits.  Many large-scale mapping as well as extensive environmental/behavioral epidemiological studies of obesity have been done over recent decades.  But if anything, the obesity epidemic seems to be getting worse.

There's deep meaning in that last sentence: the prevalence of obesity is changing rapidly.  This is being documented globally, and happening rapidly before our eyes.  Perhaps the most obvious implication is that this serious problem is not due to genetics!  That is, it is not due to genotypes that in themselves make you obese.  Although everyone's genotype is different, the changes are happening during lifetimes, so we can't attribute it to the different details of each generation's genotypes or their evolution over time. Instead, the trend is clearly due to lifestyle changes during lifetimes.

Of course, if you see everything through gene-colored lenses, you might argue (as people have) that sure, it's lifestyles, but only some key nutrient-responding genes are responsible for the surge in obesity.  These are the 'druggable' targets that we ought to be finding, and it should be rather easy, since the change is so rapid that the genes must be few; so even if we can't rein in McD and KFC toxicity, or passive TV-addiction, we can at least medicate the result.  That was always, at best, wishful thinking, and at worst, rationalization for funding Big Data studies.  Such a simple explanation would be good for KFC, and an income flood for BigPharma, the GWAS industry, DNA sequencer makers, and more.....except not so good for those paying the medical price, and for those trying to think about the problem in a disinterested scientific way.  Unfortunately, even when it is entirely sincere, that convenient hope for a simple genetic cause is being shown to be false.

A serious parody?
Year by year, more factors are identified that, by statistical association at least and sometimes by experimental testing, contribute to obesity.  A very fine review of this subject by Ghosh and Bouchard has appeared in the mid-October 2017 Nature Reviews Genetics; it takes seriously not just genetics but all the plausible causes of obesity, including behavior and environment, and their relationships as best we know them, and outlines the current state of knowledge.

Ghosh and Bouchard provide a well-caveated assessment of these various threads of evidence now in hand, and though they do end up with the pro forma plea for yet more funding to identify yet more details, they provide a clear picture that a serious reader can evaluate on its own merits.  However, we think the proper message is not the usual one: it is that we need to rethink what we've been investing in so heavily.

To their great credit, the authors melded behavioral, environmental, and genetic causation in their analysis. This is shown in this figure, from their summary; it is probably the best current causal map of obesity based on the studies the authors included in their analysis:



If this diagram were being discussed by John Cleese on Monty Python, we'd roar with laughter at what was an obvious parody of science.  But this isn't a parody, and it is by no means unusual in shape or complexity: diagrams like this (though with little if any environmental component) have been produced by analyzing gene expression patterns even just in the early development of the simple sea urchin.  Yet nobody is laughing, which is understandable, because these are serious diagrams.  On the other hand, we don't seem to be reacting other than by saying we need more of the same.  I think that is rather weird for scientists, whose job it is to understand, not just list, the nature of Nature.

We said at the outset of this post that 'the obesity epidemic seems to be getting worse'.  There's a deep message there, but one essentially missing even from this careful obesity paper: many of the causal factors, including genetic variants, are changing before our eyes.  The frequencies of genetic variants change from population to population and generation to generation, so that all samples will look different.  And mutations happen in every meiosis, adding new variants to a population every time a baby is born.  The results of many studies, as reflected in the current summary by Ghosh and Bouchard, show that while many gene regions contribute to obesity, their total net contribution is still minor.  It is possible, though perhaps very difficult to demonstrate, that an individual site might account for more than a minimal share of risk in some individual carriers, in ways GWAS results can't really identify.  And the authors do cite published opinions claiming a higher efficacy of GWAS for obesity than we think is seriously defensible; but even if we're wrong, causation is very complex, as the figure shows.

The individual genomic variants will vary in their presence or absence or frequency or average effect among studies, not to mention populations.  In addition, most contributing genetic variants are too rare or weak to be detected by the methods used in mapping studies, because of the constraints on statistical significance criteria, which is why so much of the trait's heritability in GWAS is typically unaccounted for by mapping.  These aspects and their details will differ greatly among samples and studies.
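A toy simulation (my own construction, making no claim about the actual obesity data) shows how the significance constraint alone hides most weak or rare contributors:

```python
# Simulate a trait truly affected by many weak variants, then count how
# many clear the conventional genome-wide significance threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_people, n_loci = 5_000, 500
freqs = rng.uniform(0.01, 0.5, n_loci)     # many contributing alleles are rare
betas = rng.normal(0.0, 0.03, n_loci)      # individually tiny true effects

G = rng.binomial(2, freqs, size=(n_people, n_loci))
y = G @ betas + rng.normal(0.0, 1.0, n_people)   # trait = genetics + noise

hits = sum(stats.linregress(G[:, j], y).pvalue < 5e-8 for j in range(n_loci))
print(f"{hits} of {n_loci} truly causal loci reach genome-wide significance")
# Most truly causal loci are missed, so the mapped heritability falls far
# short of the total -- without any environmental complications at all.
```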

Relevant risk factors will come or go or change in exposure levels in the future--but these cannot be predicted, not even in principle.  Their interactions and contributions are also manifestly context-specific, as secular trends clearly show.  Even with the set of known genetic variants and other contributing factors, there are essentially an unmanageable number of possible combinations, so that each person is genetically and environmentally unique, and the complex combinations of future individuals are not predictable.

Risk assessment is essentially based on replicability, which in a sense is why statistical testing, on which these sorts of results heavily rely, can be used at all.  However, because these risk-factor combinations are each unique, they're not replicable.  At best, as some advocate, the individual effects are additive, so that we can measure each factor in an individual, add up the effects, and predict the person's obesity (if the effects are not additive, this won't work).  We can probably predict, if perhaps not control, at least some of the major risk factors (people will still down pizzas or fried chicken while sitting in front of a TV).  But even the known genetic factors in total only account for a small percentage of the trait's variance (the authors' Table 2), though the paper cites more optimistic authors.

The result of these indisputable facts is that as long as our eyes are focused, for research strategic reasons or lack of better ideas, on the litter of countless minor factors, even those we can identify, we have a fat chance of really addressing the problem this way.

If you pick any of the arrows (links) in this diagram, you can ask how strong or necessary that link is, how much it may vary among samples or depend on the European nature of the data used here, or to what extent even its identification could be a sampling or statistical artifact.  Links like 'smoking' or 'medication', not to mention specific genes, even if they're wholly correct, surely have quantitative effects that vary among people even within the sample, and the effect sizes probably often have very large variance.  Many exposures are notoriously inaccurately reported or measured, or change in unmeasured ways.  Some are quite vague, like 'lifestyle' and 'eating behavior', among many others--both hard to define and hard to assess with knowable precision, much less predictability.  Whether their many effects are additive or interact in more complex ways is another issue, and the connectivity diagram may be tentative in many places.  Maybe--probably?--for such traits, simple behavioral changes would override most of these behavioral factors, leaving those persons for whom obesity really is due to their genotype, who would then be amenable to gene-focused approaches.

If this is a friable diagram--that is, if the items, strengths, connections, and so on are highly changeable, even through no fault of the authors whatever--we can ask when, where, and how this complex map is actually useful, no matter how carefully it was assembled.  Indeed, even if this is a rigidly accurate diagram for the samples used, how applicable is it to other samples, or to the future?  Or how useful is it in predicting not just group patterns, but individual risk?

Our personal view is that the rather ritual plea for more and more and bigger and bigger statistical association studies is misplaced, and, in truth, a way of maintaining funding and the status quo, something we've written much about--the sociopolitical economics of science today.  With obesity rising at a continuing rate and about a third of the US population recently reported as obese, we know that the future health care costs for the consequences will dwarf even the mega-scale genome mapping on which so much is currently being spent, if not largely wasted.  We know how to prevent much or most obesity in behavioral terms, and we think it is entirely fair to ask why we still pour resources into genetic mapping of this particular problem.

There are many papers on other complex traits that might seem simple, like stature and blood pressure, not to mention more mysterious ones like schizophrenia or intelligence, in which hundreds of genomewide sites are implicated, strewn across the genome.  Different studies find different sites, and in most cases most of the heritability is not accounted for, meaning that many more sites are at work (and this doesn't include environmental effects).  In many instances, even the trait's definition itself may be comparably vague, or may change over time.  This is a landscape whose every detail differs, within and between traits, but whose overall 'shape' is common to complex traits.  That in itself is a tipoff that there is something consistent about these landscapes, but we've not yet really awakened to it or learned how to approach it.

Rather than being skeptical about Ghosh and Bouchard's careful analysis or their underlying findings, I think we should accept their general nature, even if the details in any given study or analysis may not individually be so rigid and replicable, and ask: OK, this is the landscape--what do we do now?

Is there a different way to think about biological causation?  If not, what is the use or point of this kind of complexity enumeration, in which every person is different and the risks for the future may not be those estimated from past data to produce figures like the one above?  The rapid change in prevalence shows how unreliable these factors must be at prediction--they are retrospective summaries of the particular patterns of the study subjects.  Since we cannot predict the strengths, or even presence, of these or other new factors, what should we do?  How can we rethink the problem?

These are the harder questions, much harder than analyzing the data; but they are, in our view, the real scientific questions that need to be asked.

Saturday, June 17, 2017

The GWAS hoax....or was it a hoax? Is it a hoax?

A long time ago, in 2000, in Nature Genetics, Joe Terwilliger and I critiqued the idea then being pushed by the powers-that-be that genomewide mapping of complex diseases was going to be straightforward, because of the 'theory' (that is, rationale) then being proposed that common variants caused common disease.  At one point, the idea was that only about 50,000 markers would be needed to map any such trait in any global population.  My collaborators and I can claim that in several papers in prominent journals, in a 1992 Cambridge University Press book, Genetic Variation and Human Disease, and many times on this blog, we have pointed out numerous reasons, based on what we know about evolution, why this was going to be a largely empty promise.  It has been inconvenient for this message to be heard, much less heeded, for reasons we've also discussed in many blog posts.

Before we get into that, it's important to note that unlike me, Joe has moved on to other things, like helping Dennis Rodman's diplomatic efforts in North Korea (here, Joe's shaking hands as he arrives in his most recent trip).  Well, I'm more boring by far, so I guess I'll carry on with my message for today.....




There's now a new paper, coining a new catch-word (omnigenic), to proclaim the major finding that complex traits are genetically complex.  The paper seems solid and clearly worthy of note.  The authors examine the chromosomal distribution of sites that seem to affect a trait, in various ways including chromosomal conformation.  They argue, convincingly, that mapping shows that complex traits are affected by sites strewn across the genome, and they provide a discussion of the pattern and findings.

The authors claim an 'expanded' view of complex traits, and as far as that goes it is justified in detail. What they are adding to the current picture is the idea that mapped traits are affected by 'core' genes but that other regions spread across the genome also contribute. In my view the idea of core genes is largely either obvious (as a toy example, the levels of insulin will relate to the insulin gene) or the concept will be shown to be unclear.  I say this because one can probably always retroactively identify mapped locations and proclaim 'core' elements, but why should any genome region that affects a trait be considered 'non-core'?

In any case, that would be just a semantic point were it not, predictably, the phrase that launched a thousand grant applications.  I think neither the basic claim of conceptual novelty nor the breathless, exploitive treatment of it by the news media is warranted: we've known these basic facts about genomic complexity for a long time, even if the new analysis provides other ways to find or characterize the multiplicity of contributing genome regions.  And this assumes that mapping markers are close enough to functionally relevant sites for the latter to be found, that the unmappable fraction of the heritability isn't leading to over-interpretation of what is 'mapped' (reached significance), and that what isn't mapped won't change the picture.

However, I think the first thing we really need to do is understand the futility of thinking of complex traits as genetic in the 'precision genomic medicine' sense, and the last thing we need is yet another slogan by which hands can remain clasped around billions of dollars for Big Data resting on false promises.  Yet even the new paper itself ends with the ritual ploy, the assertion of the essential need for more information--this time, on gene regulatory networks.  I think it's already safe to assure any reader that these, too, will prove to be as obvious and as elusively ephemeral as genome wide association studies (GWAS) have been.

So was GWAS a hoax on the public?
No!  We've had a theory of complex (quantitative) traits since the early 1900s.  Other authors argued similarly, but RA Fisher's famous 1918 paper is the usual landmark.  His theory was, simply put, that infinitely many genome sites contribute to quantitative (what we now call polygenic) traits.  The general model has jibed with the age-old experience of breeders, who have used empirical strategies to improve crop and pet species.  Since association mapping (GWAS) became practicable, breeders have used mapping-related genotypes to help select animals for breeding; but genomic causation is so complex and changeable that they've recognized even this will have to be regularly updated.

But when genomewide mapping of complex traits was first really done (a prime example being the BRCA genes and breast cancer), it seemed that apparently complex traits might, after all, have mappable genetic causes.  BRCA1 was found by linkage mapping in multiply affected families (an important point!), in which a strong-effect allele was segregating.  Association mapping was a tool of convenience: it used random samples (like cases vs controls) because one could hardly get sufficient multiply affected families for every trait one wanted to study.  GWAS rested on the assumption that genetic variants were identical by descent from common ancestral mutations, so that a current-day sample captured the latest descendants of an implied deep family: quite a conceptual coup, based on the ability to identify marker alleles across the genome identical by descent from the un-studied shared remote ancestors.

Until it was tried, we really didn't know how tractable such mapping of complex traits might be.  Perhaps heritability estimates based on quantitative statistical models were hiding what could really be enumerable, replicable causes, in which case mapping could lead us to functionally relevant genes.  It was certainly worth a try!

But it was quickly clear that this was in important ways a fool's errand.  Yes, some good things were to be found here and there, but the hoped-for miracle findings generally weren't there to be found. This, however, was a success not a failure!  It showed us what the genomic causal landscape looked like, in real data rather than just Fisher's theoretical imagination.  It was real science.  It was in the public interest.

But that was then.  It taught us its lessons, in clear terms (of which the new paper provides some detailed aspects).  But it long ago reached the point of diminishing returns.  In that sense, it's time to move on.

So, then, is GWAS a hoax?
Here, the answer must now be 'yes'!  Once the lesson is learned, bluntly speaking, continuing on is more a matter of keeping the funds flowing than profound new insights.  Anyone paying attention should by now know very well what the GWAS etc. lessons have been: complex traits are not genetic in the usual sense of being due to tractable, replicable genetic causation.  Omnigenic traits, the new catchword, will prove the same.

There may not literally be infinitely many contributing sites, core or peripheral, as in the original statistical models, but infinitely many isn't so far off.  Hundreds or thousands of sites that together account for only a fraction of the heritability mean essentially infinitely many contributors, for any practical purpose.  This is particularly so since the set is not a closed one: new mutations are always arising and current variants dying away, and, along with somatic mutation, the number of contributing sites is open-ended, not enumerable within or among samples.

The problem is actually worse.  All these data are retrospective statistical fits to samples of past outcomes (e.g., sampled individuals' blood pressures, or cases' vs controls' genotypes).  Past experience is not an automatic prediction of future risk.  Future mutations are not predictable, not even in principle.  Future environments and lifestyles, including major climatic dislocations, wars, epidemics, and the like, are not predictable, not even in principle.  Future somatic mutations are not predictable, not even in principle.

GWAS almost uniformly have found (1) different mapping results in different samples or populations, (2) only a fraction of heritability is accounted for by tens, hundreds, or even thousands of genome locations and (3) even relatively replicable 'major' contributors, themselves usually (though not always) small in their absolute effect, have widely varying risk effects among samples.

These facts are all entirely expectable based on evolutionary considerations, and they have long been known, both in principle, indirectly, and from detailed mapping of complex traits.  There are other well-known reasons why, based on evolutionary considerations, among other things, this kind of picture should be expected.  They involve the blatantly obvious redundancy in genetic causation, which is the result of the origin of genes by duplication and the highly complex pathways to our traits, among other things.  We've written about them here in the past.  So, given what we now know, more of this kind of Big Data is a hoax, and as such, a drain on public resources and, perhaps worse, on the public trust in science.

What 'omnigenic' might really mean is interesting.  It could mean that we're pressing up ever more intensely against the log-jam of understanding based on an enumerative gestalt about genetics: ever more detail, always promising that if we just enumerate and catalog a bit more (in this case, the authors say we need to study gene regulatory networks) we'll understand.  But that is a failure to ask the right question: why and how could every trait be affected by every part of the genome?  Until someone starts looking at the deeper mysteries we've been identifying, we won't have the transformative insight that seems to be called for, in my view.

To use Kuhn's term, this really is normal science pressing up against a conceptual barrier, in my view.  The authors work the details, but there's scant hint that they recognize we need something more than more of the same.  What is called for, I think, is young people who haven't already been propagandized into the current way of thinking, the current grantsmanship path to careers.

Perhaps more importantly, I think the situation is at present an especially cruel hoax, because there are real health problems, and real, tragic, truly genetic diseases that a major shift in public funding could enable real science to address.

Thursday, January 8, 2015

Genomewide mapping and a correlation fallacy

When there isn't an adequate formal theory for determining cause and effect, we often must rely on searches for statistical associations between variables that we, for whatever reason, think might cause an outcome and the occurrence of the outcome itself.  One criterion is that the putative cause must arise before its effect, that is, the outcome of interest.  That time-order is sometimes not clear in the kinds of data we collect, but we would normally say we're lucky in genetics because a person is 'exposed' to his or her genotype from the moment of conception.  Everything that might cause an outcome, say a disease, comes after that.  So gene mapping searches for correlations between inherited genomic variation and variation in the outcome.  But the story is not as crystal clear as is typically presented.

In genomewide mapping studies, like case-control GWAS (or QTL mapping for quantitative traits), we divide the data into categories, based on say two variants (SNP alleles), A and B, at some genome position X. Then, if the outcome--say some disease under investigation--is more common among A-carriers than among B-carriers at some chosen statistical significance level, it is common to infer or even to assert that the A-allele is a causal factor for the disease (or, less often put this way, that B is causally protective).  The usual story is that the difference is far from categorical, that is, the A-bearing group is simply at a higher probabilistic risk of manifesting the trait.
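For concreteness, the basic test behind such a claim can be sketched as a contingency-table comparison of allele counts (the counts here are invented purely for illustration):

```python
# Case-control allele-count test at one SNP: are A vs B allele
# frequencies different between cases and controls?
from scipy.stats import chi2_contingency

#            allele A   allele B
cases    = [    520,       480  ]   # 2N alleles among cases
controls = [    450,       550  ]   # 2N alleles among controls

chi2, p, dof, expected = chi2_contingency([cases, controls])
print(f"chi2 = {chi2:.2f}, p = {p:.5f}")
# A small p at one of ~1,000,000 tested sites is what gets reported as a
# 'hit' -- and, too often, read directly as causation.
```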

However, the usually unstated inference is that the presence of SNP A has some direct, even if only probabilistic, effect in causing the outcome.  The gene may, for example, be involved in some signaling pathway related to the disease, so that variation in A affects the way the pathway, as a whole, protects or fails to protect the person.

Strategies to protect against statistical artifact
We know that in most cases many, or even most, people with the disease do not carry the A allele, because the relative risks associated with the A allele are usually quite modest.  So the correlation might be false because, as we know and too often overlook, correlation does not in itself imply causation.  One way to test for a spurious association is to compare AA, AB, and BB genotypes to see if the 'dose' (number of copies) of A is correlated with some aspect of the disease; see the sketch below.  Usually there isn't enough data to resolve such differences with convincing statistical rigor.
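Here is a minimal stand-in for that dose check (counts invented; a Cochran-Armitage trend test is the usual formal tool, approximated here by comparing mean allele dose between cases and controls):

```python
# Allele-dose check: do cases carry more copies of A, on average, in an
# ordered AA > AB > BB fashion?
import numpy as np
from scipy import stats

#  copies of A:          2(AA)  1(AB)  0(BB)
case_counts = np.array([   90,   260,   650])
ctrl_counts = np.array([   60,   240,   700])
doses = np.array([2, 1, 0])

case_doses = np.repeat(doses, case_counts)   # expand to per-person doses
ctrl_doses = np.repeat(doses, ctrl_counts)

t, p = stats.ttest_ind(case_doses, ctrl_doses)
print(f"mean dose: cases {case_doses.mean():.3f}, "
      f"controls {ctrl_doses.mean():.3f}, p = {p:.4f}")
```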

Another approach to protect against false associations is to extend the study and see if the same association is retained.  But this is often very costly because of sample size demands or, if the study is in a small population, perhaps impossible.  Likewise, people exit and enter study groups, changing the mix of variation and risking obscured signal.  One way to try to show systematic effects is to do a meta-analysis, pooling studies.  If the overall correlation is still there, even if individual studies have come to different risk estimates, one may have more confidence.  To my understanding this is usually not done by regressing risk on allele frequency, which seems like something that should be done; but the heterogeneity in method, genotyping accuracy, and size among studies makes this problematic.
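To show what the usual pooling step actually does, here is a minimal fixed-effect (inverse-variance) meta-analysis sketch, with invented per-study estimates:

```python
# Inverse-variance pooling of per-study log odds ratios: each study is
# weighted by the precision of its estimate.
import numpy as np

log_or = np.array([0.12, 0.05, 0.20, -0.02])   # invented study estimates
se     = np.array([0.06, 0.04, 0.10,  0.07])   # their standard errors

w = 1.0 / se**2
pooled = (w * log_or).sum() / w.sum()
pooled_se = np.sqrt(1.0 / w.sum())
print(f"pooled log OR = {pooled:.3f} (se {pooled_se:.3f})")
# Note what is assumed away: heterogeneity in genotyping, phenotyping,
# and sampling among studies -- exactly the caveat raised above.
```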

An issue that seems often, if not usually, to have been overlooked
An upshot of this typical kind of finding of very weak effect sizes is that the disorder is the result of any of a variety of genomic backgrounds (genomewide genotypes) as well as lifestyle exposures that aren't being measured.  The background differences may complement the A's effects in varying ways so that the net effect is real, but on average weak.  That's why non-A carriers have almost the same level of risk (again, that is, the net effect size of A is small).

But the problem arises when the excess risk in the A-carriers is assumed to be due to that allele.  In fact, it is very likely that in many affected A-carriers the disease arose because of risky variables in networks other than the one involving the 'A/B' gene.  That is why nearly as high a proportion of non-A-carriers are affected.  Because of independent assortment and recombination among genes and chromosomes, the same distribution of genomic backgrounds will be found among the A-bearing cases (though no two individuals' genome-types will be the same).  In those A-individuals, the outcome may be due entirely to variants other than the A allele itself, for example, variants in genes in other networks.  That is, some, many, or even most A-bearing cases may not, in fact, be affected because of the A allele.

This seems to me very likely to be a common phenomenon, given what we know about the complex genotypic variation at the thousands of potentially relevant sites typed in genomewide analysis like GWAS in any sample. One well-known issue that GWAS methods can and often do correct for, is that some factors related to population structure (origins and marriage patterns among the sampled individuals) can induce false correlations.  But even after that correction, given the true underlying causal complexity, it is likely that for some SNP sites it is only chance distributions of different complex genotypes between the A- and non-A SNP genotypes that suffice to generate the weak statistical effects, when so many sites in the genome are tested.

Suppose the A allele's estimated effects raise the risk of the disease to 5% in A-carriers, and let's assume for the moment the convenient fiction that there is no error in this estimate.  It may be the case that 5% of the A-bearing cases are due to a very strong A-allele effect, and are doomed to the disease, whereas the other 95% of A-carriers are risk-free.  Or it could be that every A-carrier's risk is elevated by some fraction, so that the average is 5%.  Given that almost as many cases are usually seen in non-A carriers, such uniformity seems unlikely to be true.  Almost certainly, the A-carriers have a distribution of risk, for whatever background genomic, environmental, or stochastic reasons, whose average is 5%.  These alternative interpretations are very difficult to test, and when does anyone actually bother?
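The indistinguishability of these interpretations is easy to see in a toy contrast (my own, with invented parameters): three risk models with the same 5% carrier average produce essentially the same case counts.

```python
# Three ways A-carriers can average 5% risk: a doomed/risk-free mixture,
# a uniform 5% for everyone, and a continuous spread of risks.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

risk_mixture = np.where(rng.random(n) < 0.05, 1.0, 0.0)  # 5% doomed
risk_uniform = np.full(n, 0.05)                           # all at 5%
risk_spread  = rng.beta(0.5, 9.5, n)                      # mean 0.05

for name, r in [("mixture", risk_mixture), ("uniform", risk_uniform),
                ("spread", risk_spread)]:
    affected = rng.random(n) < r              # simulate outcomes
    print(f"{name}: mean risk {r.mean():.3f}, cases {affected.sum()}")
# All three yield roughly 5,000 cases; affected/unaffected data alone
# cannot tell the models apart.
```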

The problem relates to intervention strategies
For many years, we have known from all sorts of mapping studies that most identified sites have very weak effects.  We know that in many cases environmental factors are vastly more important (because, for example, the disease prevalence has changed dramatically in the last few decades).  But the justification (rationale, or excuse) for continuing the huge Big Data approach is that it will at least identify 'druggable target' genes so that the culpable pathway can be intervened in.  Hoorah for Big Pharma!--even if the gene itself isn't that important, a network has many intervention points.

However, to the extent that the potential correlation fallacy discussed here is at play, targeting genetically based therapy at the A allele may fail not because the targeting doesn't work but because most A-carriers are affected mainly or exclusively for other reasons.  If the inferential fallacy is not addressed, think of how long and how costly it would be to end up doing better.

The correlation fallacy discussed here doesn't even assume that the A-allele is not a true risk factor, which, as we noted above, may often be the case if the results in Big Data studies are largely statistical artifacts.  The issue is not that the A effect is unlikely to be very strong--though it usually is weak, or it would be easier to see, or would show up in family studies (as some rare alleles do, in fact)--but simply that most individuals in both the A and non-A categories are affected for completely unrelated reasons.  Again, what we know about recombination, somatic mutation, independent assortment, and the complexity of gene and gene-environment interactions suggests that this simply must often be true.  The correlation fallacy may pervasively lurk behind widely proclaimed discoveries from genome mapping.

Friday, December 19, 2014

Survivorship bias and genetics

I was a mathematics major as an undergraduate.  However, not then or since have I been anything that one could call a mathematician.  At least, I hope I learned something about trying to think logically about life even if I never do equations.  But this interest led me to read a new book I was told of, called How Not to be Wrong: the Power of Mathematical Thinking, by Jordan Ellenberg (2014, NY, Penguin Press).

This is a popular rather than technical book, but it shows in interesting and serious ways how mathematical thinking can lead to improved understanding of the real world.  I think it has relevance to an important area in current evolutionary and biomedical or agricultural genetics.  So I thought I'd write a post about it.

Survivorship bias
Ellenberg begins his book with an illustration of how abstract logical thinking can solve important real-world problems in subtle ways.  In WWII, a mathematics research group was asked by the Army to help decide where to place armor plating on fighter aircraft.  The planes were returning to base with scattered bullet holes from enemy fire, and the idea was to put protective plating where it would do the most good without adding cumbersome, mileage-eating weight.  The mathematician suggested putting the plating where the bullet holes weren't.  This seemed strange until he explained that the observed bullet holes hadn't done much damage: bullets hitting elsewhere had brought planes down, so those hits were never observed because those planes never returned to base.  The engine compartment was the case in point: a shot to the engine was fatal to the aircraft, but to the wings and body, much less so.
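A small simulation (my own, with invented hit and survival probabilities) makes the logic concrete: even under uniform enemy fire, the surviving planes show few engine holes precisely because engine hits are the fatal ones.

```python
# Survivorship bias: bullet holes observed on returning planes
# under-represent the sections where hits are most often fatal.
import numpy as np

rng = np.random.default_rng(4)
n_planes = 10_000
p_down_if_hit = {"engine": 0.9, "fuselage": 0.2, "wings": 0.1}

observed = {s: 0 for s in p_down_if_hit}
for _ in range(n_planes):
    hits = [s for s in p_down_if_hit if rng.random() < 0.3]  # uniform fire
    if all(rng.random() > p_down_if_hit[s] for s in hits):   # plane returns
        for s in hits:
            observed[s] += 1

print(observed)   # engine holes are rare among the survivors we can see
```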

This is a case of survivorship bias.  It can apply widely, and evolution and genetic causation provide instances where it seems likely to be a useful principle.  As geneticists, we ask which genes' variation causes variation in adaptive or biomedically interesting outcomes.  This is what genome mapping in its various forms is intended to identify.

Ironically, it seems, when we do experiments involving development, or test genetic mechanisms by, say, knocking out a gene, or when we observe the major gene-usage switches that occur as some part of an embryo's body forms, we can identify specific genes that seem to be very important.

Several pieces of evidence can suggest they are important.  One is the finding that the same gene is used in similar roles in very distantly related species (often, even, between humans and flies, or even more distant species): its usage has been conserved.  Secondly, there is usually far less variation within or between species in such genes than in what we believe to be non-functional or marginally functional parts of genomes.  This seems to suggest that variation hasn't been tolerated by natural selection.  Thirdly, many congenital diseases in plants and animals, including humans, have proven to be due to the effects of variants, often newly arisen mutations, in a specific gene.  Most cases of diseases like Cystic Fibrosis, Phenylketonuria, Muscular Dystrophy, or Tay-Sachs Disease are of this sort.  Some congenital traits like, say, eye or skin color are also due to inheriting specific variants in at least relatively few genes.

Such findings at least indirectly fueled the fervor for mapping every trait one can define, with grand promises of discovering the genes 'for' the trait.  Conscientious investigators justified expensive mapping efforts by showing that their trait of interest had substantial heritability, for example, because trait-values were to a substantial extent correlated among close relatives in predicted patterns.  However, for most traits like diabetes, cancer, heart disease, or behavioral characteristics, such findings are few and far between.

Despite a welter of PR spin to the contrary, instead of dramatic findings of the expected (and promised) sort, what was found was that the traits were affected by variation in tens, hundreds, or even thousands of different parts of the genome. Even taking all these together, they typically only accounted for a fraction--usually a small fraction--of the estimated heritability.

What is this 'missing heritability'?

Evolutionary survivorship bias
A central theoretical idea is that a fundamental genomic criterion for showing biological function is sequence conservation.  Most evolution is purifying: what has been put together over billions of years is risky to change.  So most mutations in clearly functional areas of DNA are either neutral or deleterious.  As a result, more variation accumulates in non- or weakly functional DNA than in important genes.

What that means is that the variation we see misses what existed heretofore, and hence is not a representative sample of all the variation that arises.  Most of the variation that does survive, then, cannot be of major functional importance.  As a result, the tendency is to assume that non-conserved areas of the genome are non-functional.  This may be true, but it may be that our belief that conservation equals function is a corollary of a belief in strong Darwinian natural selection in molding traits.  In fact, most genomic variation is not of the highly conserved sort, but our analysis and explanation of functional genomics is biased by our predilection for ignoring less-conserved variation.

This can be seen as a kind of survivorship bias in that we assume that variation in non-conserved genome areas just doesn't survive for very long--isn't conserved because it has no function.  That's a kind of circular reasoning and has been, for example, highly contentious in the interpretation of the ENCODE project's objective to identify all causal elements in the genome, and in questions about whether selectively neutral variation exists at all. The same conceptual bias leads to reconstructions of evolutionary adaptive history that centers on the conserved genes as if they were the genes that were involved.  Finally, important genes that were involved in a trait's evolution to its current state may no longer be involved, and hence not be considered because their role did not survive to be identified today.

Biomedical survivorship bias
The same sort of bias in ascertaining the spectrum of causal variation exists on the shorter, lifetime scale of biomedical genetics.  There is a big discrepancy between the clearly key role of genes identified in experimental and developmental genetics, and the deeply conserved nature of those genes, and the general lack of 'hits' in those genes when genomewide mapping is done on the traits they affect.

How can a gene be central to the development of the basis of a trait, and yet not be found in mapping to identify variation that causes failures of the trait?  Indeed, the basic finding of GWAS and most other mapping approaches is that the tens or hundreds or thousands of genome 'hits' have individually trivial effects.

The answer may lie in survivorship bias.  Like the lethality of bullets to the engine of a fighter, most variation in the main genes, those whose sequence is more highly conserved, is lethal to the embryo, or manifests in pathology so clear that it never becomes the subject of case-control or other sorts of Big Data mapping.  In other words, genome mapping may be systematically, inevitably constrained to find small effects!  That's exactly the opposite of what's been promised, and the reason is that the promises were, psychologically or strategically, based on extrapolation from the findings of strong, single-gene effects causing severe pediatric disease--a legacy of Mendel's carefully chosen two-state traits.

To the extent this is a correct understanding, genomewide mapping as it's now being done is, from an evolutionary genomic perspective, necessarily rainbow-chasing.  Indeed, a possibility is that most adaptive evolution is itself also due to the effects of minor variants, not major ones.  Once the constraining interaction of the major genetic factors is in place, mostly what can nudge organisms in this direction or that, whether adaptively or in relation to complex, non-congenital disease, is the assembled effect of individually very minor variants.  In turn, that could be why slow gradualism so obviously seemed to Darwin to be the way evolution worked, and why it generally still seems that way today.

Survivorship bias is a kind of misunderstanding of statistics and sampling that careful reasoning can illuminate.  It is so easy to collect biased samples, and so hard to do otherwise, and consequently so easy to make convenient but erroneous inferences.  Science is a complex business, and it's an unending challenge to do it right--even to know when we are doing it right!

Wednesday, December 11, 2013

Sometime geneticist Joe Terwilliger on genetics

We may recently have given the false impression that geneticist Joe Terwilliger gives less priority to science, or at least good science, than to other perhaps more frivolous pursuits (he is Abe Lincoln every February, for example, and tuba player the rest of the time -- unless he's cleaning up bean debacles as a diplomat, or being a basketball and language coach to Dennis Rodman), so we wanted to help correct any such misconceptions here.  Perhaps to that end, Joe (now known in South Korea, we're afraid, as "sometime geneticist Joe Terwilliger") suggested we republish a blog post he first posted on his own short-lived blog in 2008.  He recently dug this up again and says that few could disagree, even 5 years later.


Joe as Abe on the balcony (but not of Ford's Theater)

The point is that sometimes there is a lot of convenient hard-of-hearing even in science, which fancies itself to be an objective search for truth. Some of the details in Joe's post are out-of-date but we, and he, think that the basic thrust is not.  In a sense that makes the conclusion all the more cogent, because the same modes of thinking about genomic causation are still predominant, despite the vastly costly but essentially consistent results in the five years since 2008.  And, as Joe points out, he and Ken had much the same message in 2000.

One not-so-subtle change, we will note, is that promises by NIH Director Dr Francis Collins, and many others in presumably responsible positions, have steadily altered their due date, which recedes into the distance like, say, an oasis as you grope for water, the outfield fences if you want your pitchers to have a better earned run average, or a preacher's promises of ultimate salvation if you weekly plunk coins into the basket.

So perhaps the lesson is that under these circumstances, rather than just dismiss critics, science--actual science as it's supposed to be--should feel a need to take stock of what it's doing.  But we leave it to you to judge.

And if you hear about Joe in other contexts in weeks to come, remember that he was a sometime geneticist here first:


The Rise and Fall of Human Genetics and the Common Variant - Common Disease Hypothesis
By Joe Terwilliger
Nov 2008

There has been an enormous amount of positive press coverage for the Human Genome Project and its successor, the HapMap Project, even though within the field the initial euphoric party, when the first results came out, has already done a full 180, replaced by the hangover that inevitably follows such excesses.

For those of you not familiar with the history of this field and the controversies about its prognosis, which were present from the outset, I refer you to a review paper a colleague and I wrote back in 2000, at the height of the controversy (Nature Genetics 26:151-157).  The basic gist of the argument put forward for the HapMap project was the so-called common variant/common disease (CV/CD) hypothesis, which proposed that "most of the genetic risk for common, complex diseases is due to disease loci where there is one common variant (or a small number of them)" [Hum Molec Genet 11:2417-23].  Under those circumstances it was widely argued that, using the technologies being developed for the HapMap project, one would be able to identify these genes using "genome-wide association studies" (GWAS)--basically by scoring the genotype of each individual in a cross-sectional study at each of 500,000 to 1,000,000 individual marker loci--the argument being that if common variants explained a large fraction of the attributable risk for a given disease, one could identify them by comparing allele frequencies at nearby common variants in affected vs unaffected individuals.  This point was contested by researchers only with regard to how many markers you might have to study for this to work, if that model of the true state of nature applied.  Many overly optimistic scientists initially proposed that 30,000 such loci would be sufficient, and when Kruglyak suggested it might take 500,000 such markers, people attacked his models; yet today the current technological platforms use 1,000,000 and more markers, with products in the pipeline to increase this even more, because it quickly became clear that the earlier models of regular and predictable levels of linkage disequilibrium were not realistic--something that should have been clear from even the most basic understanding of population genetics, or even from empirical data on lower organisms.

Today such studies are widespread, having been conducted for virtually every disease under the sun, and yet the number of common variants with appreciable attributable fractions that have been identified is minuscule. Scientists have trumpeted such results as have been found for Crohn's disease, in which 32 genes were detected using panels of thousands of individuals genotyped at hundreds of thousands of markers. This sounds great until you start looking at the fine print, where it is pointed out that all of these loci put together explain less than 10% of the attributable risk of the disease, and that for various well-known statistical reasons even this is a gross overestimate of the actual percentage of the variance explained. Most of these loci individually explain far less than half a percent of the risk, meaning that while this may be biologically interesting, it has no impact at all on public health, as most of the risk remains unexplained. This is the complete opposite of the CV/CD hypothesis as defined above. And this is about the best case for any complex trait studied; in virtually every example dataset I have personally looked at, absolutely nothing was discovered at all.

At the beginning of the euphoria for such association studies, the example "poster child" used to justify the proposal was the relationship between variation at the ApoE gene and risk of Alzheimer disease. In an impressively gutsy recent paper, a GWAS was performed in Alzheimer disease and published as an important result, with a title that sent me rolling on the floor in tears laughing: "A high-density whole-genome association study reveals that APOE is the major susceptibility gene for sporadic late-onset Alzheimer's disease" [J Clin Psychiatry 2007 Apr;68(4):613-8]. In an amazingly negative study, they did not even have the expected number of false-positive findings--just ApoE and absolutely nothing else... And the authors went on to describe how important this result was, and claimed it means they need more money to do bigger studies to find the rest of the genes. Has anyone ever heard of stopping rules--that maybe there aren't any common variants of high attributable fraction??? This is a claim Ken Weiss and I have put forward many times over the past 15 years, and Ken had been making the point for a decade before that, in his book "Genetic Variation and Human Disease", which anyone working in this field should read if they are not familiar with the basic evolutionary theory and empirical data showing why no one should ever have expected the CV/CD hypothesis to hold...

In many other fields, the studies that have been done at enormous expense have found absolutely nothing, and in what Ken Weiss calls a form of Western Zen (in which no means yes), the failure of one's research to find anything means one should get more money to do bigger studies, since obviously there are things to find and the studies were simply not big enough, with enough patients or enough markers--it could not possibly be that the hypotheses are wrong and should be rejected... It is a truly bizarre world where failure is rewarded with more money. But when it comes to promising upper-middle-aged men (i.e., Congress) that they might not die if they fund our projects, they are happy to invest in things that have by now pretty much been proven not to work...

Meanwhile, in a truly bizarre propaganda piece, a parting sycophantic commentary (J Clin Invest 2008 May;118(5):1590-605), Francis Collins claimed that the controversy about the CV/CD hypothesis was "...ultimately resolved by the remarkable success of the genetic association studies enabled by the HapMap project." He went on to list a massive table of "successful" studies, including loci for such traits as bipolar disorder, Parkinson disease, and schizophrenia, and of course the laughable success of ApoE and Alzheimer disease. To be objective about these claims, let me quote what researchers studying those diseases had to say.

Parkinson disease: "Taken together, studies appear to provide substantial evidence that none of the SNPs originally featured as PD loci (sic from GWAS studies) are convincingly replicated and that all may be false positives...it is worth examining the implications for GWAS in general." Am J Hum Genet 78:1081-82

Schizophrenia: "...data do not provide evidence for involvement of any genomic region with schizophrenia detectable with moderate [sic 1500 people!] sample size" Mol Psych 13:570-84

Bipolar AND Schizophrenia: "There has been great anticipation in the world of psychiatric research over the past year, with the community awaiting the results of a number of GWAS's... Similar pictures emerged for both disorders - no strong replications across studies, no candidates with strong effect on disease risk, and no clear replications of genes implicated by candidate gene studies." - Report of the World Congress of Psychiatric Genetics.

Ischaemic stroke: "We produced more than 200 million genotypes...Preliminary analysis of these data did not reveal any single locus conferring a large effect on risk for ischaemic stroke." Lancet Neurol. 2007 May;6(5):383-4.

And the list of traits for which nothing was found goes on and on, with the authors concluding they need more money for bigger studies with more markers. It is really scary that people are never willing to let go of hypotheses that did not pan out. Clearly CV/CD is not a reasonable model for complex traits. Even the diseases for which enormous success is claimed do not fit the model: very small p-values are obtained for associations that confer relative risks of 1.03 or so--not "the majority of the risk" that the CV/CD hypothesis proposed.

One must recall that in the initial paper proposing GWAS, by Risch and Merikangas (Science 1996 Sep 13;273(5281):1516-7)--a paper which, incidentally, pointed out that one always has more power for such studies when collecting families rather than unrelated individuals--the authors stated that "despite the small magnitude of such (sic: common variants in) genes, the magnitude of their attributable risk (the proportion of people affected due to them) may be large because they are quite frequent in the population (sic: meaning >>10% in their models), making them of public health significance." The obvious corollary is that if such variants are not quite frequent, they do NOT have high attributable fractions and are therefore NOT of public health significance.
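Levin's classic formula makes this corollary easy to check numerically: the population attributable fraction is PAF = p(RR - 1) / (1 + p(RR - 1)), where p is the carrier frequency and RR the relative risk. A small sketch, using two purely hypothetical variants:

```python
# Population attributable fraction (Levin's formula).  The two variants
# below are hypothetical, chosen only to illustrate the corollary.
def paf(p, rr):
    """Attributable fraction for carrier frequency p and relative risk rr."""
    return p * (rr - 1) / (1 + p * (rr - 1))

print(f"common variant, small effect (p=0.30, RR=1.1):  PAF = {paf(0.30, 1.1):.1%}")
print(f"rare variant, large effect   (p=0.001, RR=5.0): PAF = {paf(0.001, 5.0):.1%}")
# ~2.9% and ~0.4% respectively: without high frequency there is no high
# attributable fraction, whatever the relative risk.
```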

And yet you still have scientists claiming that the results of these studies will lead to a scenario in which "we will say to you, 'suppose you have a 65% chance of getting prostate cancer when you're 65. If you start taking these pills when you're 45, that percent will change to 2" (Leroy Hood, quoted in the Seattle Post-Intelligencer)--an amazing claim when the empirical evidence is clear that the majority of the risk of the majority of complex diseases is not explained by anything common across ethnicities, or common in populations... Francis Collins recently claimed that by 2020, "new gene-based designer drugs will be developed for ... Alzheimer disease, schizophrenia and many other conditions", and that by 2010, "predictive genetic tests will be available for as many as a dozen common conditions". This does not jibe with the empirical evidence... In breast cancer, for example, researchers once claimed that knowledge of the BRCA1 and BRCA2 genes (which confer enormously high risk of breast cancer on carriers) was uninteresting because it accounted for such a small attributable fraction in the population. Of course, they have now performed GWAS on tens of thousands of individuals and identified several additional loci which, put together, have a much smaller attributable fraction than BRCA1 and BRCA2, yet they claim this proves how important GWAS is. Interesting how the arguments change to fit the data, and how everything is made to sound as if it were consistent with the theory.

I suggest that people go back and read "How many diseases does it take to map a gene with SNPs?" (Nature Genetics 26:151-157, 2000). There are virtually no arguments we made in that controversial commentary 8 years ago that we could not make even more strongly today, as the empirical data that have accumulated since then support our position almost perfectly and conclusively refute the CV/CD hypothesis, despite Francis Collins' rather odd claims to the contrary...

In the end, these projects will likely continue to be funded for another 5 or 10 years before people start realizing the boy has been crying wolf for a damned long time... This is a real problem for science in America, however, as NIH is spending big money on these non-scientific, technology-driven, hypothesis-free projects at the expense of investigator-initiated, hypothesis-driven science. Even more tragically, training grants are enormously plentiful, meaning that we are training an enormous number of students and postdocs in a field in which there will never be job opportunities for them, even if things are successful. Hypothesis-free science should never be allowed to result in Ph.D. degrees, if one believes that science is about questioning what truth is and asking questions about nature, while engineering is about how to accomplish a definable task (like sequencing the genome quickly and cheaply).

The mythological "financial crisis" at NIH is really more a function of the enormous amounts of money going into projects that are predetermined to be funded by political appointees and government bureaucrats, rather than through the marketplace of ideas via investigator-initiated proposals. Enormous amounts of government funding concentrated in a small number of projects is a bad idea--one which began when Eric Lander's group at MIT proposed building large factories for sequencing the genome rather than spreading the work across sites, with the goal of getting it done faster (an engineering goal), instead of getting more sites involved so that better scientific research might have come along the way. This has led, years later, to a scenario in which the factories now want to do science and not just engineering, which is totally contrary to their raison d'etre, and leads to further concentration of funding in a small number of hands--when science might be better served by a larger number of groups receiving smaller amounts of money, so that more brains work in different directions, thinking of novel and innovative ideas not reliant on pure throughput.

Human genetics has been transformed from a field with low funding, driven by creative thinking, into a field driven by big money and by sheep following whatever shepherd du jour tells them what to do (i.e., 'innovative' comes to mean following the current trend rather than doing something truly original and creative). This is bad for science, and it is also bad science. GWAS has been successful technologically, and through empirical data it has resoundingly rejected the CV/CD hypothesis. If we accept this and move on, we can consign the HapMap and the HGP to the same scientific fate as the Supercollider, and get back to thinking instead of throwing money at problems that are fundamentally biological, not technological!



Wednesday, December 19, 2012

Is it 'progress' to identify 100s of genes for a trait? If not...what is it?

What is known
Crohn's disease (CD), an inflammatory bowel disease, has a large genetic component, but specific genes, as for most such complex diseases, have been elusive.  A paper in this month's American Journal of Human Genetics, "Refinement in Localization and Identification of Gene Regions Associated with Crohn Disease," Elding et al, reports that they believe they are zeroing in on the answer.

The gene most closely associated with Crohn's disease plays a role in the immune response: NOD2, which codes for a protein that recognizes peptidoglycans (components of bacterial cell walls) and stimulates the immune system to respond.  It makes sense that genes involved in immunity would be implicated, as the disease is inflammatory in nature, perhaps due to an impaired innate immune system that leads to a chronic inflammatory response by the adaptive immune system to microbes in the gut.

A number of genomewide association studies (GWAS) of Crohn's disease have been done, but none has identified genes with large explanatory power.  A recent meta-analysis of six studies (reported here) identified 32 new loci associated with Crohn's which, added to those identified by 2010, brought the total to 71.  This doesn't mean 71 single genes, but rather stretches of chromosome that generally contain multiple genes -- sometimes hundreds -- and 'gene' may mean kinds of function other than protein coding, such as regulation, directly functional RNA, and so on.  The previously identified loci explained 20% of the variation in the disease, and the additional 32 brought that total to 23.2%, which indicates that most of these loci represent genes with very small effects, and that many more loci are left to be found. And what about the 77% that is still unexplained?

The Elding et al. paper reports use of a "mapping approach that localizes causal variants based on genetic maps in linkage disequilibrium units (LDU maps)." That means chromosome locations, not specific nucleotides or functional elements.  The authors confirm 66 of the previously reported 71 loci, and home in on "more precise location estimates" within those intervals (that is, they come closer to identifying candidate genes rather than just chromosomal intervals). They identified 78 additional regions that were statistically significant and that provide "strong evidence for 144 genes." They also found 56 "nominally significant signals, but with more stringent and precise colocalization." So this paper reports 200 gene regions in total associated with Crohn's disease, most of which, the authors say, unambiguously implicate single genes. The reason for that inference isn't clear, since clusters of DNA units can function together.  Again, many of these loci contain genes involved in the immune system. The authors suggest that "The precise locations and the evidence that some genes reflect phenotypic subgroups will help identify functional variants and will lead to greater insight of CD etiology."

The immune system is complex and involves many components, so a mapping 'hit' in a region that happens to contain some immune-system elements might occur by chance if you have 200 hits.  Also, the immune system responds to external threats (like viruses and bacteria) as well as to internal problems (recognizing and repairing damaged tissue), so the reason for the 'immune' involvement is unclear -- and that is a challenge to determining what is responding to what.

The authors have previously demonstrated genetic heterogeneity within the NOD2 gene region -- that is, that different genes explain risk in different people. They also found independent involvement of a nearby gene, CYLD. They further demonstrated the importance of precise definition of the phenotype for identifying loci that might explain risk in multiple cases.  In this new paper they use a high-resolution linkage disequilibrium map, basically meaning that their test markers are closely spaced so that implicated regions are fairly short, which helps to identify genes in the implicated region of the chromosome, and they fine-tune phenotype definition as well.  They were able to replicate 66 of the previous 71 gene locations, and identified an additional 134 signals, many of which contain genes.  One might always quibble with this or that statistical issue, but the overall conclusion is unlikely to change.  In the authors' words:
This is a major step forward in identifying the relevant genes and functional variants and thus elucidating the genetics of CD etiology. The very large numbers of genes [we identify] confirm that CD is truly polygenic and complex in nature. Many genes show functions that are compatible with involvement in immune and/or inflammatory processes as well as integrity of the intestinal epithelium and differentiation.  
But
In fact, do we know anything more than we did before this study?  It has long been clear that CD is polygenic and complex, with immune and/or inflammatory gene involvement.  Whether 71 genes are involved or 200, it means that this disease is another instance of many pathways leading to a complex trait.

Much money and effort by the highest-quality investigators has been expended on this disease.  This means that by now we probably can't claim the complexity is an artifact of imprecise methods.  The complexity seems to be real.  Each CD patient is affected by a different combination of variants at his or her two copies of these 200 genes -- and, if this is 'simple' complexity, by another 600 or so unidentified genes needed to top up the current 23% of explained causation.
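That 600 figure is just back-of-the-envelope extrapolation from the numbers quoted above, on the simplistic assumption that loci yet to be found contribute about as much as the latest batch did:

```python
# Rough extrapolation from the figures quoted above (assumption: future
# loci contribute about as much as the most recently discovered ones).
newly_explained = 0.232 - 0.20      # latest 32 loci added ~3.2 points
per_locus = newly_explained / 32    # ~0.1% of variation per locus
remaining = 1.0 - 0.232             # ~77% unexplained
print(f"per-locus contribution: {per_locus:.2%}")
print(f"loci still needed at that rate: {remaining / per_locus:.0f}")  # ~770
# The same order of magnitude as the 600 mentioned above -- and newly
# found loci tend to have smaller effects, not larger, so if anything
# this understates the problem.
```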

Nor does this include any serious identification of environmental contributions.  Environments -- say, diet at some point in life -- may have different effects on different genotypes, so that a given genotype may have no harmful consequences under one lifestyle and be harmful under another.  And if there are complex interactions among the different contributing genomic regions, then enumerating the variants at 200 (much less 800) genomic regions will not make prediction, or perhaps even treatment, very tractable from the genome alone.

It is likely that after all of this, some genetic variants will have predictive power or medical treatment relevance, and that of course will be genuine progress.  But it leaves the question that so many traits leave: how do we actually deal usefully with this sort of routine-level complexity?

Wednesday, November 21, 2012

Hold the presses! Growth factor genes do NOT affect head shape!

We tend to wax rather cynical about mapping studies (GWAS and the like) that find many miniscule effects and only an occasional major, replicable one.  We feel that this approach has played out (or been spent out) and that better, more thoughtful and focused science is in order these days.

One near-daily occurrence is the breathless announcement of the finding of a gene 'for' some trait or other, be it a disease or even whether you voted for Romney or were one of the misguided bearers of the 'liberal' gene.  Extra!  Extra!  Read all about it!  New gene for xxxx discovered!!!

We, too, are guilty!
Well, we must confess:  we are also involved in gene mapping.  Though in fact we've confessed this before (here, e.g.), if in an MT sort of way.  In our work, we are using head-shape data on a large set of baboons, deceased members of a huge and completely known genealogy housed in Texas, and on over 1000 mice from a well-controlled cross between two inbred parental strains.  Now these data sets are about as well-controlled as you can get.  In each case, marker loci across the genome were typed in all animals, and a GWAS-like approach was taken to identify markers in genomic regions in which variation was associated with various head dimensions (as shown in the figure).  These are then regions that affect the trait, and are worth exploring.

We regularly see proclamations of discoveries of genes 'for' craniofacial shape dimensions.  Two recent papers report Bmp genes, thought to be involved in head development, and many others have reported major effects of FGF (fibroblast growth factor) genes on shape.  These findings have been so casually, or even perhaps carelessly, accepted as to have become part of the standard lore.

But based on our large, very well-controlled study of two species we beg to differ!

But, no--this time Extra! Extra!  We really mean it: NO!!
After careful examination of signals from our mapping of baboons, and separately of the inbred mouse cross, we have come to the startling finding, with high levels of statistical significance rarely matched by other GWAS on this subject, that neither Bmp nor FGF genes are involved in head-shape development!

Scanning the genome markers separately in both large data sets shows clearly that there is no effect of these genes on any head shape dimension.  None!

In fact, this is not so unusual a finding, as many studies simply fail to find even what were thought to be clear-cut causal genes.  Yet such findings are neither eagerly sought nor published by Nature, Science, or The New York Times (the leading scientific outlets these days -- we do not include People in our list, despite its typically similar level of responsible fact-reporting).

We are using lots of exclamation points, but we are totally serious.  Though we shout from the rooftops that we have found no evidence for these genes' involvement in head shape, we will have no hearers.  It seems just not in anyone's interest to see their previous, hard-won findings debunked. 

Or is there a different kind of lesson here?

What evolution leads us to expect
The lack of reporting of negative results is something of a scandal, because negative evidence tends to undermine previous findings by failing to support them.  That seems important, but even to skeptics like us there are issues here that should be understood.

A widely proclaimed tenet of modern science is called 'falsificationism'. The idea is that all the positive findings in the world don't prove that something is true, but a single failure to find it proves that the idea was just plain wrong.  Just because the sun has risen every day so far does not by itself prove it will rise tomorrow.  Night does not cause day!

But some negative findings don't really falsify previous positive findings.  There can be sampling issues or experimental failure and so on.  So our not finding FGF effects on head shape doesn't falsify previous findings.  We could be wrong.

But even if we are wholly right, as we're pretty sure we are, this does not in any way undermine prior findings, which seemed quite solid, that these genes are in fact involved.  That is because standard ideas like falsification are based on a kind of science in which repeat experiments are expected to give statistically similar results: rolling dice, working out planetary orbits, or chemical reactions are examples of things that follow natural laws and are reliable.

Here, however, we're dealing with life, and life is not replicable in that sense!  Each sample really is different: different sets of people have different sets of genotypes.  A negative finding is not a refutation of a prior assertion.  Negative findings are expected in this area of science!

The reason we did not find FGF effects in our baboon or mouse studies is not that the FGF genes were uninvolved in head shape during development but that there was no relevant variation in those genes in our particular sample.  Neither of our inbred mouse strains carried variants at those genes that affect head traits.  But they did both use FGF genes in making their own heads, and the intercross animals did, too.

Even if a gene is involved in a trait in a functional way, there is no reason to expect that the same gene varies in a given study at all, or varies enough to have a detectable effect.  Indeed, sometimes such genes are so central that they aren't free to vary without deleterious or fatal consequences.  It's one of the problems of mapping that it is not a direct search for function; it is a search for variation that may lead us to function. If by bad luck our sample has no variation in a particular gene, case-control kinds of statistical comparison can't lead us to that gene.
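The point is easy to simulate.  In this toy sketch (all numbers invented), gene A affects the trait and varies in the sample; gene B is just as functionally involved but is fixed -- monomorphic -- as it would be if both inbred parental strains carried the same allele:

```python
# Toy simulation: a functionally essential but monomorphic gene gives
# no association signal.  All parameters are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1000                                   # number of animals

geno_a = rng.binomial(2, 0.4, n)           # gene A: 0, 1 or 2 copies, varying
geno_b = np.full(n, 2)                     # gene B: identical in every animal

# Both genes contribute equally to the trait, plus noise.
trait = 0.5 * geno_a + 0.5 * geno_b + rng.normal(0, 1, n)

r_a, p_a = stats.pearsonr(geno_a, trait)
print(f"gene A: r = {r_a:.2f}, p = {p_a:.1e}")           # clear signal
print(f"gene B genotypic variance: {geno_b.var():.0f}")  # zero: nothing to map
```

Gene B helps build every head in the sample, yet no case-control or quantitative comparison can ever point to it, because there is no variation to correlate with anything.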

Dang it!  Despite how important and under-appreciated these points are, this means that our negative findings aren't really negative, and won't be of any interest to the Times.

However, this doesn't mean that mindless GWASing is OK after all.  We've explained our view on this many times before.

But it does mean we need to contact the Editor at People.

Friday, February 3, 2012

Ectopic thinking?

Our gene mapping project, which we first blogged about here, is starting to get interesting.  And not necessarily for the reasons that we'd hoped.  You might remember that our project is looking for genes involved in variation in specific craniofacial traits in the F34 generation of descendants of a cross between two inbred mouse lines.  We've measured a handful of traits in over a thousand mouse skulls, and mapped their genetic effects by looking for genetic variation across the genome that might be associated with variation in the traits.

Sparing you the gory details, we'll just say that the chromosomal intervals to which one or more of the traits mapped span 30% of the genome and contain 10% of all coding genes. That's 2400 or so genes that could potentially be of interest in affecting head shape in just these particular mice, with the restricted genetic variation they have (because they are descendants of only two inbred parental strains).  That's a lot of genes to wade through to figure out which might be most likely to be involved in the traits we're looking at.

One way to prioritize candidate genes from such a study is to look for the genes in every interval that you know from prior work to be involved in your trait of interest.  Or to identify genes in families that include genes involved in your trait of interest -- these would then be considered guilty by association.

But this means most genes don't have a fighting chance of being considered, because you don't happen to know anything about them, or because nobody knows anything about them, or because what's known about them only partially represents what they do. 

To try to minimize this, many people automate the search, with programs that cull from the literature the genes that might be of interest, or that seem to be expressed where you want them to be.  This might solve the problem of no single person knowing everything about all genes, but it doesn't solve the problem of nothing being known about so many genes, or of knowledge being only partial.  And it doesn't solve the problem of having to tell the program what to look for, which means you're constraining it in the same way you would if you were doing the search by hand, looking for specific families of genes.  Nor, of course, does it address what's happening in all the non-coding DNA that flanks those genes.

Thus, we decided that the least biased way to comb the data was to go through all the genes in all the intervals by hand.   We're still making sense of all that, not least because we are hoping not to be constrained by the usual ideas about statistical significance, but we've learned some interesting things along the way.

For example, one of the intervals of interest is loaded with olfactory receptor (OR) genes.  Olfactory receptors reside on the cell surface of olfactory receptor neurons, and are involved in odorant detection.  ORs form the largest family of genes in many genomes -- about 1000 different genes -- and they cluster in sets of genes in various locations on a number of chromosomes.  ORs have a distinctive expression pattern, with only one expressed per neuron in the tissue lining the nose, where they each are sensitive to particular aspects of molecules the animal inhales, and hopes to smell.  How expression of the remaining 999 genes in each cell is blocked is still not known.

ORs are an interesting example of something we've blogged about before, but that continually surprises us.  One of the ways we're evaluating the possible role of all these genes in the development of the traits we're looking at is to examine where they are expressed in the developing embryo.  We initially thought this would help narrow the search, but it turns out that about 95% or more of genes (for which there are expression data) are expressed in the head (80% in the brain alone), so expression isn't proving all that helpful for narrowing things down.  But it does mean we've looked at images of gene expression for around 2000 genes.

Olfr66, GenePaint, E14.5
And ORs are a good example of how what we think we know can inhibit our understanding.  Here, e.g., are the expression results for olfactory receptor 66 (Olfr66) in a developing mouse (at embryonic day 14.5).  Just to orient you if you're not used to looking at such images: it's a single front-to-back section, with the snout halfway down the image and pointing to the left, and the tail at the bottom.  The dark blue is a stain showing cells in which the gene is expressed at this particular stage of development.  It's no surprise to see it in the olfactory epithelium in the snout, but notice that it's also in the axial skeleton (vertebral column), probably in cartilage cells that will soon become bone.

What's it doing there?  These are olfactory receptors!  You don't smell with your backbone!  In fact, a lot of ORs are known to be expressed outside the olfactory region, particularly in the testes, but also in the spleen, the thyroid, salivary glands, the uterus, the skin, and other tissues.  A 2006 paper is of interest in this regard, not only because it documents non-olfactory related expression, but because of its title -- "Widespread ectopic expression of olfactory receptor genes".  Ectopic expression, meaning expression where it's not supposed to be. 

But it's only not supposed to be expressed in the axial skeleton because that's not where its name says it will be, not because Nature says so!  People named these genes!  And there is some discussion in the paper about how ORs might be involved in the chemotaxis of sperm as they try to reach and penetrate the egg -- how they direct their movement based on chemicals in their environment.  That is equivalent to assuming they are essentially carrying out their olfactory function in the testes, where a molecular reaction different from odorant detection is going on.  But what about in cartilage, in the image above?  It's hard to imagine that chemotaxis has anything to do with OR function there.

Well then, maybe it's an experimental artifact -- maybe the assay picked up expression of a gene sort of like Olfr66, but not quite, along with Olfr66 itself?  Maybe.  But then we'd have to explain away all the expression studies showing non-olfactory expression of many ORs, and it's rather unlikely that it's all due to experimental artifact.  This is how our own assumptions constrain what we know, or even want to know, about the function of so many genes.  Maybe Olfr66 has a function we don't yet understand.  As do other ORs.  And, by extension, so many other genes.

But calling unexpected expression 'ectopic', or naming genes for only a single role, or for their involvement in disease when they have other perfectly normal functions, are ways of building in assumptions that, once accepted, can keep us from recognizing how much we don't yet understand about genes.

Monday, December 5, 2011

Next to ... meaningless

We posted a few weeks ago about a gene mapping project we're involved in, and the myriad difficulties in identifying genes that could be responsible for the variation in craniofacial shape in two strains of mice.   I'm still slogging through map intervals, though I am willing to say out loud that I can now see the end.

Brca1 expression, developing mouse embryo. Image from GenePaint
I haven't revised my sense of what this all means: the function of many genes is still unknown -- indeed, most genes are expressed in multiple tissues throughout the body, so it's surely often true that a gene's multiple functions are still unknown.  And because many genes are named for a disease they were discovered to be associated with, or for a single tissue in which they are expressed at a specific time, even though they can have many other functions, it's hard to know whether they are relevant or not.  For example, a gene in one of the intervals I was characterizing last week was named for its involvement in follicle maturation, but it turns out to be expressed in many tissues in the developing mouse embryo, including the head, so it's hard to rule it out as possibly contributing to the development of the traits we're interested in.

I also found the Brca1 gene, the "breast cancer 1" gene, in one of the intervals I was looking at.  Candidate?  This is one of the genes that, when mutated, is associated with high risk of early-onset breast cancer.  But in spite of its name, it's not a gene 'for' breast cancer.  It's a DNA-repair gene, expressed in a variety of organs at various developmental stages.  In the picture to the left, the darker stain in the section of mouse embryo represents Brca1 expression; it's in the brain, the facial region, the liver, the intestine, the lungs, and so on.

And it's bad enough to be named after a function you have only when you're mutated -- a name that ignores your major purpose in life.  But here's the height of indignity for a poor gene: the gene that sits next to Brca1 on mouse chromosome 11 is called Nbr1, short for the "next to BRCA1" gene.  What a way to go through life!  This gene was first identified in the mid-1990s, and was of interest because of its chromosomal proximity to Brca1.  But its function is still not known.  If you look it up in GeneCards, a compendium of what's known about tens of thousands of genes, you find this:
The protein encoded by this gene was originally identified as an ovarian tumor antigen monitored in ovarian cancer. The encoded protein contains a B-box/coiled coil motif, which is present in many genes with transformation potential, but the function of this protein is unknown. This gene is located on a region of chromosome 17q21.1 that is in close proximity to tumor suppressor gene BRCA1. Three alternatively spliced variants encoding the same protein have been identified for this gene. (provided by RefSeq)
And, listed under "Function":
Acts probably as a receptor for selective autophagosomal degradation of ubiquitinated targets.
That is, it may be involved in the recycling of proteins within cells.  Maybe.

This in itself has nothing whatever to do with breast cancer, nor with being 'near' to breast cancer whatever that might mean!  Clusters of genes near to each other on their respective chromosome often comprise gene families -- that is, genes that arose by miscopying of an existing gene so that in the descendant cell there are two adjacent copies of the ancestral gene.  This kind of thing is a regular if rare occurrence and is, indeed, a major way that genomes evolve:  duplicate genes can initially provide redundancy for the parent gene's functions, but one of the two can then diverge through mutational changes to take on a new function.  But the functions need not be relevant even to the same tissue.

And unrelated genes that happen to be near each other on a chromosome can sometimes, but certainly need not, be expressed in the same cell types or be involved in the same functions.  So being 'near' a 'breast cancer' gene, and hence even mildly suspected of being related to breast cancer, or any cancer, is a nefarious case of guilt by association!

Too often we are driven by labels rather than using them in a neutrally informative way.  There are pressures to coin a name for a gene you've discovered, just as there are incentives to give a new species name to a fossil you've discovered.  Our reward system, too often based on publicity and marketing, is to some extent responsible, along with the normal everyday vanity to which we are perhaps all subject.

But it's not good for science.

These are but two examples of why identifying candidate genes for traits of interest is difficult.  And these are genes we know something about.  When nothing's known -- neither where a gene is expressed nor what it does -- that gene can't be considered a candidate for anything, even though it certainly does something.  And that's true for a lot of genes.