Wednesday, May 22, 2013

Who, me? I don't believe in single-gene causation! (or do I?). Part I. What does it mean?

We were told in no uncertain terms the other day that no one believes in single gene causation anymore.  Genetic determinism is passé, and everyone knows that most traits are complex, caused by multiple genes, gene x environment interaction, or if they're really sophisticated, epigenetics or the microbiome. But is this the view that investigators actually follow?  That's not so clear.

We all throw around the word 'complex' as if we actually believe it and perhaps even understand it.  Of course, (nearly) everyone recognizes that some traits are 'complex', meaning that one can't find a single clear-cut deterministic cause, the way being hit on the side of the head with a baseball bat can by itself send one out for the count.  But in fact the hunt for the gene (or, all right, if you insist, genes) 'for' a trait is still on. You name your trait: cancer, diabetes, IQ, the ability to dunk a basketball or to get into Harvard without a Kaplan course, and someone's still looking for the gene that causes it.

It is this push to find genes for traits, despite all sorts of denials, that fuels the GWAS and similar fires.  Caveats notwithstanding (and usually offered just to provide technical escape lest one is wrong), that is what the promised 'personalized genomic medicine' in its various forms and guises is all about.

So let's take a careful, and hopefully even thoughtful, look at the idea of genetic causation.  We are personally (it must be obvious!) quite skeptical of what we think are excessive claims of genetic determinism (or, now, microbiomial determinism), but they are still being made, so let's tease a few of them apart.

Single gene causation does exist, at least sometimes (doesn't it?)!
First, though, what do we mean by the word 'causation'?  Generally, we think people mean that gene X or risk factor Y is sufficient to cause trait Z.  But it might also mean that gene X or risk factor Y is a necessary but not sufficient cause of trait Z.  The baseball bat might have caused your concussion, but in fact someone had to swing it.

There are well-documented single risk factors, genetic and otherwise, that everyone accepts 'cause' some disease in a very meaningful sense. Examples are some alleles (variant states) of the CFTR gene and Cystic Fibrosis (CF), BRCA1 and 2 variants and breast cancer, or smoking and lung cancer.  Having the alleles associated with CF or breast cancer, or being a long-time smoker, does put people at high risk of disease.

But for these and other examples, there are usually healthy people walking around with serious mutations in the gene, or heavy smokers who enjoyed their cigarettes well into old age.  Gene X or factor Y isn't sufficient to cause trait Z.  There are also hundreds of genetic variants found in patients that are assumed to be causal, though the reasons are elusive (for example, mutations in non-coding regions near the CFTR coding regions that have no known function).  What is it about gene X that causes trait Z?  We don't know, but gene X looks damaged in this person, so it must be causal.

The CF case is interesting.  This is an ion channel disease.  Ion channels are gated openings on the cell surface that pass sodium, potassium, calcium and other ions into and out of the cell in response to local circumstances.  CF is characterized by abnormal passage of chloride and sodium through ion channels, causing thick viscous mucus and secretions, primarily in the lungs but with involvement of the pancreas as well.  If the ion channel is badly built, or doesn't get to the surface of cells lining various organs like the pancreas and lungs, then the cell cannot control its water content, secretion, or absorption, and the person with the malfunctioning channels has CF.  Again, gene X causes trait Z.

But there are gradations in channel malfunction, and gradations in severity of the disease, and we have no way to know how many people are walking around with variants but no actual disease.  Here, we can say that when it happens, CFTR mutations do cause the trait in the usual way.  But what about when there are mutations but no disease?  Gene X doesn't cause the disease after all?  Or when there is disease but none of the known causal mutations?  Wait, we thought gene X caused the disease?  Could we be assuming single gene causation, and looking only at the CFTR gene, rather than at many other aspects of the genome that may affect ion channels in the same cells or, indeed, may cause the trait in a way we could understand if we but identified them?  This is an open question--but it applies to many other purportedly single-gene diseases.  Gene X and some other gene/s, or some environmental factor, cause the disease in at least some instances.  Is it simple or isn't it?

The BRCA story is also interesting.  A BRCA1 variant associated with disease does not lead directly to cancer.  Instead, BRCA1 is a gene that detects and repairs genomic mutations in breast (and other) cells.  If you have a dysfunctional BRCA1 genotype, you are at risk of some one breast cell acquiring a set of mutations that don't get detected and repaired.  What causes those mutations?  Some happen when cells divide, so the activity of breast cells affects the rate of mutational accumulation.  Other lifestyle factors do as well (parity, age of childbearing, lactation, and apparently things like diet and exercise). And a person with a causal BRCA mutation lives perfectly healthfully for decades, which, if you think in classical Mendelian terms, would not happen if s/he had a 'bad' gene.  BRCA doesn't exactly cause cancer, but it allows it to be caused.  Gene X plus time plus environmental risk factors cause the disease.  Though we all believe it's a single gene, BRCA1 or 2, that causes cancer.

The obvious non-genetic instance, smoking and lung cancer, is similar but not exactly the same.  Smoking is, among other things, a mutagen: it damages genes.  So one reason for the association, if not the major one, is that the mutations caused by smoke can damage genes in lung cells in ways that lead those cells to proliferate out of control.  The reason the risk is probabilistic -- that is, a smoker doesn't have a 100% chance of getting lung cancer -- is that it's impossible to know how many or which mutations a given person's smoking has led to.  In fact, smoking is only an indirect cause, since it is mutant genes in lung cells that, after accumulating in an unlucky way, start the tumor.  Even so, knowing how much a person has smoked can allow one to estimate in some probabilistic way the relative risk of lung cancer due to enough mutations having arisen in at least one lung cell.  Still, many who smoke don't get cancer, and many get cancer who don't smoke.  Since smoking and a few other such risk factors (e.g., exposure to asbestos and some toxic chemicals) have strong effects, even if probabilistic, everyone is generally comfortable with thinking of them as causal.  Risk factor Y causes trait Z.  But, in fact, risk factor Y plus time plus unlucky mutations cause trait Z.
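
To make 'risk factor Y plus time plus unlucky mutations' a bit more concrete, here is a deliberately crude toy calculation in Python. Everything in it is an assumption made up for illustration: the function names, the idea that a cell needs a fixed number of independent 'hits', and all of the numbers. It is not a model of real lung cancer biology; it only shows how raising a per-division mutation chance raises the probability that at least one cell somewhere gets unlucky, without ever making that probability 100%.

```python
import math

def p_cell_transformed(mu, divisions, hits_needed):
    """Toy chance that one cell lineage has acquired every required hit.

    mu          -- assumed chance per division of damaging any one given gene
    divisions   -- assumed number of divisions (a stand-in for 'time')
    hits_needed -- assumed number of different genes that must all be damaged
    """
    # P(a given gene is hit at least once) = 1 - (1 - mu)**divisions,
    # written with log1p/expm1 so that very small values are not rounded away.
    p_gene = -math.expm1(divisions * math.log1p(-mu))
    # Hits are treated as independent, which is itself an assumption.
    return p_gene ** hits_needed

def p_any_cell(mu, divisions, hits_needed, n_cells):
    """Toy chance that at least one of n_cells independent lineages transforms."""
    p = p_cell_transformed(mu, divisions, hits_needed)
    return -math.expm1(n_cells * math.log1p(-p))

# Hypothetical numbers only: the 'smoker' simply has a 5-fold higher mu.
baseline_risk = p_any_cell(mu=1e-7, divisions=100, hits_needed=3, n_cells=10**9)
smoker_risk = p_any_cell(mu=5e-7, divisions=100, hits_needed=3, n_cells=10**9)

print(f"toy baseline risk: {baseline_risk:.3g}")
print(f"toy smoker risk:   {smoker_risk:.3g}")
print(f"relative risk:     {smoker_risk / baseline_risk:.1f}")
```

Even in this cartoon, the exposure changes only the odds. Which cell, which mutations, and whether it happens at all in a given person remain matters of luck, which is the sense in which the 'cause' is probabilistic.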

Deterministic genotype 'pseudo-single-gene' causation?
More problematic are 'complex' traits that clearly are not due to simple single gene variants--they don't follow patterns of trait appearance in families that would be consistent with simple Mendelian inheritance.  The number of such traits is legion, and is driving the GWAS industry, as we have commented on many times.  They are typically common in the population.  They are complex because we feel--know, really--that many different factors combine somehow to generate the risk.  Most instances are not due to one factor alone, though some may be in the sense of BRCA and CFTR.  So, we do something like genomewide association studies to try to identify all the potentially causal parts of the genome (and, similarly, but less definitively as a rule, lifestyle factors, too).  Here, we'll assume that all of the genomic variants that might contribute are known and can be typed in every person (this is, as everyone knows, far, far from being true at present, if it's even possible).

Advocates can deny any element of single-gene thinking in GWAS reports, where hundreds of loci are claimed to have been found, but these are treated as causal, and major journals are filled to the gills with papers with titles to the effect of "five novel genes for xyz-itis".  This is the slippery slope of simple causation thinking.

If multiple factors contribute, what we know is that most do so only probabilistically in the above senses.  That there are other things at work is rather obvious for the many traits that even in those at high risk don't arise until later or even late in life.  And, in reality, it is nearly always true that cases of the disease are associated with individually unique combinations of the risk factors.  So, your personalized risk is computed as some kind of combination, like the sum, of your estimated risk at each of the putative sites, R = R1 + R2 + R3 + ..., where at each site one allele is given the minimal risk (or, perhaps, zero) and the other allele the risk estimated from the difference in its frequency between cases and controls.  This might be considered the very opposite of single-gene causation, but conceptually it's pretty much the same, because it treats your aggregate as a single kind of risk score, as if it were acting as a unit.  The idea would be completely analogous to the risk associated with a specific variant at the CFTR or BRCA gene.  Your genotype as a whole would be viewed essentially as a single cause.
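
A minimal sketch of that kind of additive score, in Python, may help show why it behaves like a single cause. The site names, the per-allele risk values, and the function name are all hypothetical, invented only for illustration; real scores are built from estimated effect sizes and usually on a different scale.

```python
import math

def additive_risk_score(genotype, per_allele_risk):
    """Sum of site-by-site risks: R = R1 + R2 + R3 + ...

    genotype        -- dict mapping site name to number of risk alleles carried (0, 1 or 2)
    per_allele_risk -- dict mapping site name to the assumed risk per risk allele;
                       the other allele at each site is given zero risk
    """
    return sum(risk * genotype.get(site, 0) for site, risk in per_allele_risk.items())

# Hypothetical per-allele risks at three putative sites.
per_allele_risk = {"site1": 0.02, "site2": 0.01, "site3": 0.03}

# Two people with entirely different genotypes...
person_a = {"site1": 1, "site2": 1, "site3": 0}
person_b = {"site1": 0, "site2": 0, "site3": 1}

score_a = additive_risk_score(person_a, per_allele_risk)
score_b = additive_risk_score(person_b, per_allele_risk)

# ...are assigned the same number (up to floating-point rounding).
print(score_a, score_b, math.isclose(score_a, score_b))
```

Once the sum is taken, the score no longer records which sites contributed. Two different genotypes are treated as the same 'cause', exactly as if each person carried one variant with that risk.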

These are examples of causal rhetoric.  But these causes are probabilistic.  What does that mean?  It means that you are not 100% protected from getting the trait, nor 100% doomed.  Your fortune is estimated by some number in between.  We call it a probability, estimated either from the presence of one risk-factor or from the fixed set that you inherited.  But what does that probability mean, and how do we arrive at the value and how reliable is it?  Indeed, how often can we not even know how reliable it is?

These are topics for tomorrow.


Anonymous said...

As far as I can tell, this argument boils down to:

1. It's better to say "contributes to" than "causes" when discussing probabilistic events (e.g. almost all genetic associations with disease.)

2. Most mutations aren't fully penetrant.

I don't know a single GWAS proponent who would disagree with either point. Am I missing something here?

Ken Weiss said...

We'll describe more about these points in the next two days' posts in this series. I'd say that probabilistic causation is essentially the same as incomplete penetrance, and essentially skirts the issue of mechanism that makes the concept challenging. That doesn't make it a wrong concept, of course.

Whether you will think you're missing something, well, our next two installments will answer that if you read them, and since this is a blog and not statements issued by God, who knows what your reaction will be?

Anne Buchanan said...

People use the term "incomplete penetrance" as though it actually has biological meaning. But it was developed as a statistical way to make perplexing family data fit the Mendelian model. It's a fudge factor to fit data, not a biological explanation.

Anonymous said...

I know it's a statistical concept rather than a biological one (which isn't to say there aren't plenty of biological processes that can explain different cases of incomplete penetrance). I've not heard it described as a "fudge factor," though. Genuinely curious: Where can I read about the history of the concept, to learn more about how it was developed?

Also, thanks for installment 2; looking forward to 3.

Anne Buchanan said...

Thanks. We try to address the penetrance question in tomorrow's post (part 3), but if it's inadequate let us know. We'll get back to you with references on the history of the concept.

Ken Weiss said...

I have books only going back to the '60s. The issue, in human genetics, had to do with genetic counseling, to give potential parents the risk that their child would be affected by some genetic trait given what was known about the family.

Clearly the notion arose at some point in human or other Mendelian analysis, in which 'single gene' traits were being analyzed and it became possible to make predictions of genotypes and work out Mendelian probabilities in situations less clear-cut than Mendel's.

The hand-waving reason, that is, that incomplete penetrance was about manifestation of a phenotype in people with a genotype, was acknowledged clearly in the best texts. But that was an explicit acknowledgment that the concept was about proportions observed in retrospective data. Nobody could, at the time, attribute molecular reasons to this in most, if not nearly all, cases.

But in every source I have in my library, and I think that represents the view from the 60s on at least, the penetrance probability is as we describe in part III (Friday's installment) of this series: it is a fixed probability parameter estimated as the same for all individuals observed with the given genotype.

Essentially, this treats it formally as a fixed, universal, probabilistic property of the genotype alone.

In segregation analysis, it allowed meaningful estimation of genetic transmission as was done at the time. For example, likelihood ratios were standard ways to test models of inheritance, comparing (say) dominant to recessive or sex-linked. If one instance of non-penetrance were observed, then the likelihood of a dominant model is zero, no matter how convincing that otherwise appeared to be in family data. That would be infinitely less likely than any other model one might want to test--a clearly nonsensical situation!

So the penetrance parameter was introduced to allow segregation analysis to be done if there was even a single non-penetrant case. That's a fudge factor!

But these assumptions

Ken Weiss said...

One last thought. (I left a trailing phrase in my previous comment, so just ignore it)

I have traced the earliest use of the term in human genetics that I can find to a paper in the Am J Hum Genet in 1957 by Curt Stern. He was a leader in this field and in experimental genetics, so I am confident his view reflects what was thought then (and, indeed, is how the term is still used--even if carelessly!).

It is as I said earlier, a single parameter conferred on an allele or genotype. Good papers using the term back then did acknowledge that there might be many reasons, including other genes, but they kept to a single value for all instances, a frequentist value (i.e., fraction of non-penetrant cases in known carriers, etc.).