Wednesday, October 8, 2014

The height of folly: are the causes of stature numerous but at least 'finite'?

In 1926, geneticist Thomas Hunt Morgan wrote this about stature:
A man may be tall because he has long legs, or because he has a long body, or both. Some of the genes may affect all parts, but other genes may affect one region more than another. The result is that the genetic situation is complex and, as yet, not unraveled. Added to this is the probability that the environment may also to some extent affect the end-product.
                                  (TH Morgan, The Theory of the Gene, p 294, 1926)
His point, of course, was not about stature per se but about the difficulty of identifying genes 'for' traits because there are many pathways to a trait, and they aren't all genetic.  This was understood eighty-eight years ago, and yet we have had study after study, ever larger, merging smaller studies, and all sorts of fancy statistics to account for various internal complications in genome sampling, and still the results pour forth as if we haven't learned what we need to know about this and many traits like it.


The most recent is an extensive study in (where else?) Nature Genetics, by a page-load of authors. In summary, the authors found, pooling all the data from many different independent studies in different populations, that at their most stringent statistical acceptance ('significance') level, 697 independent (uncorrelated) spots scattered across the genome each individually contributed in a significance-detectable way to stature.  The individual contributions were generally very small, but together about 20% of the overall estimated genetic component of stature was accounted for.  However, using other criteria and analysis, including lowering the acceptable significance level, and depending on the method, up to 60% of the heritability could be accounted for by around 9,500 different 'genes'.  Don't gasp!  This kind of complexity was anticipated by many previous studies, and the current work confirms that.

Many issues are swept under the rug here, however--that is, relegated to a sometimes obscure, tunneling warren of Supplemental information.  The individuals were all of European descent so that genetic contributions in other populations are not included.  The analysis was adjusted for sex and age.  Subjects were all, presumably, 'normal' in stature (i.e., no victims of Marfan or dwarfism), and all healthy enough to participate as volunteers.  The majority were older than 18, but the range was 14 to 103, and 45% were male, 55% female.  The data were also adjusted for family relationships.

These are very capable authors--with such a major size crew, consisting of more 'authors' than locations in the genome, there must be at least some who know their business (and have actually read and understood the paper they purportedly authored).  In all seriousness, however, there is no reason to make serious criticism of their methods or their findings.  They pooled many different data sources and had to adjust for various sampling and data-quality aspects.  But it is comforting that the study confirms the earlier projections in general terms (and, we think, the data from earlier studies are part of the current meta-analysis).

But if there's no reason to question the study itself, there is a deeper issue.

The deeper issue
The authors rather glibly come to what they seem to feel (did they take a poll?) is the comforting conclusion that while the number of causally contributing genes is huge, it isn't 'infinite'.  But that is at most technically true and in fact is farther from the real truth than the authors intended, or perhaps even realized.

First, if by 'gene' the authors meant coding genes, then their 9500 is nearly half of all such genes in our genome.  Of course the definition of 'gene' is vague and debatable these days so we'll let that problem pass.  Still, 9500 is a lot, and it's just the proverbial tip (see below).

Secondly, stature is affected to a substantial extent by non-sequence-based variables, such as diet, early childhood diseases, maternal nutrition and so on.  There is no clear limit to what or how many such influences there are, and they are often quantitative, which means exposure can be measured in principle to an infinitesimal degree and hence there are infinitely many exposures and infinitely many statures.  This may not be much of a problem in Europeans, but if we're talking about human stature, we have to recognize the causes not included here.

But, third, there are about 3 billion nucleotides in the human genome, each of which can take on 4 possible nucleotides.  Each person is diploid, so there are 10 possible site-specific genotypes (AA, AC, AG, AT, CC, CG, CT, GG, GT, TT) at each site, and thus 10 to the 3-billionth power possible genomewide genotypes.  That is vastly more than all the stars, or even all the atoms, in the universe, and of course the vast majority of these possible combinations of contributing variants will never occur in real people, and probably don't affect stature (under any conditions?), but we basically have no way of knowing which those are, or of showing that they don't matter.
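For anyone who wants to check the arithmetic, here is a minimal sketch. It uses the round figure of 3 billion sites from the paragraph above; the figure of roughly 10 to the 80th atoms in the observable universe is an outside assumption, not something from the paper. The variable names are just for illustration.

```python
from itertools import combinations_with_replacement

NUCLEOTIDES = "ACGT"

# Each diploid genotype at a site is an unordered pair of nucleotides.
site_genotypes = [
    "".join(pair) for pair in combinations_with_replacement(NUCLEOTIDES, 2)
]
print(site_genotypes)
print(len(site_genotypes))  # 10

GENOME_SITES = 3_000_000_000  # the round figure used in the post
ATOMS_EXPONENT = 80  # assumption: roughly 10**80 atoms in the observable universe

# 10**GENOME_SITES is never built here; written out in full it would
# need GENOME_SITES + 1 decimal digits.
digits_needed = GENOME_SITES + 1
# How many powers of ten the genotype count exceeds the atom count by.
orders_beyond_atoms = GENOME_SITES - ATOMS_EXPONENT
print(digits_needed, orders_beyond_atoms)
```

The count is compared by exponent rather than computed outright, since the number itself would run to some three billion digits.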

Well, fine, you say, that's a really big number of contributors--but it's not infinite!  That isn't true.


Actually, it is infinite!
By saying the number of causes of stature variation is finite they essentially mean that they are enumerable and that there is an end to the counting, by which time all causes of stature will have been accounted for.  That is simply false.

There are deletions and insertions of open-ended locations, numbers and size in our genomes.  There is no known upper limit to genome size (as some huge-genomed species show), and hence no limit to the number of insertions that may occur sometime or somewhere.  And, as Morgan recognized so long ago, there are many ways to be short or tall.  Not all genome sites need have an effect on stature, but we don't know that and the more we look the more we find, so in practice the number of genomic influences on stature, not counting environmental factors, is in fact not limited or countable: it can keep on growing indefinitely.  It is literally not finite, but is infinite!

One must also remember that by far the greatest effects on stature are not even being counted!  These studies only include adults in a 'normal' range at normal adult ages (old people, for example, lose height substantially as their posture sags and their intervertebral disks shrink).  The authors report that across their age range they didn't see dramatic mapping differences, but in pooled studies with different numbers at different ages, this has to be a weak test.  More important, however, is that different cohorts, decades apart, have very different statures.  In Japan, for example, mean height has changed by several inches in the last 75 years, and of course in most of the world stature is impaired by chronic infectious disease, nutritional deficits, and the like.  Thus, sex and cohort (when you are living, such as now vs the Middle Ages in Europeans, the subjects of this study) are regressed out and don't count, and the effects being reported are just those relatively small modifications of these major causes.  So when one speaks of cause, this study is retro-fitting genomic data to existing individuals, which is by no means the same as using genomes to predict stature of future individuals--and that raises questions about the finiteness of the whole process.

Further, even if a mere 10,000 positions in the genome affect stature, there are still 10 to the 10,000th power possible contributing genotypes, vastly more than the number of people alive today (large but finite), and not measurable even in principle in people who have died or are yet to be born--and many millions of such deaths and births have happened since the studies in this paper were done.

Most of the causal variant sites are likely to be exceedingly rare in the population, because most variants in general are due to very recent mutations, and that means first that they aren't in the authors' samples, and likely won't be in anybody's samples, even if some bewildered agency keeps funding such studies.  And the variants' effects won't be detectable because a very rare variant in large samples like these cannot generate a statistically significant result.

Worse, you have roughly, say, a trillion cells in your body.  All their genomes are descended from the dual set of chromosomes in the fertilized egg by which you started life, but each cell division creates a few new mutations that if not lethal to that cell are transmitted to the cell lineage into which that cell divides.  So you have 6 billion nucleotides, each potentially variable in one or any unspecified number of a trillion cells.  Your 'genotype' is the aggregate of these possible variants--and that is far more potential variation than in the whole 7 billion living humans, each of whom has a net genotype of his/her trillion cells.  Since cells in a person are constantly dying and being lost, while new ones are produced, not even a single person's genotype is in any sense countable or enumerable. And again, we have no way to know which variants in which cells in which tissues (in which lifestyle and other environments) affect stature.

Well, fine, you say, but the number of causes is at least manageable.  But that isn't true, either!

And not even countably infinite!
We can push this even further.  In the 1800s the German mathematician Georg Cantor pioneered the study of infinity.  He identified various levels or types of infinity.  The smallest, if such a term can be used, was countable infinity.  Like the integers, 0,1,2,...., one can count or enumerate them, even if one never reaches the end.  Further, one can pair them in a way, for example, the even and odd numbers are an equal-sized infinite series, and can be paired such as 1-2, 3-4,5-6,.... one by one, without limit.
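The odd-even pairing can be sketched as a generator that never runs out. This is only an illustration of the idea; the function name is made up for the example.

```python
from itertools import count, islice

def odd_even_pairs():
    """Yield the pairs 1-2, 3-4, 5-6, ... one by one, without limit."""
    for k in count(1):
        yield (2 * k - 1, 2 * k)

# Take only the first five pairs; the generator itself has no end.
first_pairs = list(islice(odd_even_pairs(), 5))
print(first_pairs)  # [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]
```

Every odd number gets exactly one even partner and vice versa, which is what it means for the two infinite sets to be the same size.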

But then there is a next level of uncountable infinity.  The range of 'real' (decimal) numbers from zero to 1 is like that.  One cannot match them up with countably infinite numbers, such as integers.  Intuitively, that's because one can always cut a real-number interval like that between 0 and 1 into more and more, ever smaller pieces.  We might start by matching this way:  1 - 0.01, 2 - 0.02, 3 - 0.03 and so on.  But we can also chop the 0-1 interval into smaller values 0.001, 0.002, 0.003 etc.  And if we tried to pair them up with the integers we could quickly chop even finer: 0.0001, 0.00011, 0.00012...  Strictly speaking, decimals that terminate like these can still be listed one by one; what cannot be listed is the full set of reals, including those whose decimal expansions never end.  Cantor showed that any attempted pairing with the integers leaves some real number out, so these uncountably infinite 'real' numbers will always be ahead of the countably infinite numbers we'd try to pair them up with.
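The standard rigorous form of this claim is Cantor's diagonal argument: give me any list of reals between 0 and 1, and I can write down a number that differs from the first entry in its first decimal place, from the second in its second, and so on, so it cannot be anywhere on the list. Below is a toy sketch on a finite list. The entries are written as the digits after the decimal point, padded with zeros; the function name and the sample list are hypothetical, chosen only for illustration.

```python
def diagonal_escape(listed):
    """Build a number that differs from the n-th listed expansion
    in its n-th decimal place, so it matches no entry on the list."""
    digits = []
    for n, expansion in enumerate(listed):
        # Use only 4 and 5 to avoid the 0.0999... = 0.1000... ambiguity.
        digits.append("4" if expansion[n] == "5" else "5")
    return "0." + "".join(digits)

# Digits after the decimal point for 0.01, 0.05, 0.03 and 0.001.
listed = ["0100", "0500", "0300", "0010"]
escaped = diagonal_escape(listed)
print(escaped)  # 0.5455
```

The same recipe works on an endless list, which is why no pairing of the integers with the reals can ever be complete.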

This is a second level of infinity.  In fact, if you start thinking of DNA as a sequence of nucleotides, even with some of the complications mentioned above, you might think that the contributors to stature were countably infinite, one per nucleotide.  But it's worse. That's because stature itself is infinitesimally measurable, significance cutoff values are infinitesimally divisible, and DNA is not just a digital sequence: it has 4-dimensional properties that involve other molecules (epigenetic modification, for example), and levels of gene expression depend quantitatively on the efficiency of binding of regulatory proteins and what happens in one part of DNA depends on what's happening on other parts at a given time.  For such reasons, stature does not come in a countably infinite number of values and so neither can its causes.  From a causal point of view, stature is uncountably infinite.

Another way to see this is that even with the enumerably infinite (or even with the authors' claim that stature is causally 'finite') we cannot discriminate in the most proper causal way.  That is because open-ended, unknown (probably unknowable) numbers or sets or combinations of causes can yield the same stature value (which, of course, changes during life, historic period, and sex).  We cannot infer genomic cause from stature (an inference I've called Detectance in the past).  Nor can we predict stature from genotype (the Predictance of a genotype), because there is more to stature than genes and a continuous range of possible stature values for a given genotype (just as we won't be able to predict an individual's facial appearance from genes, but that's another story).  And here we're not considering computing power or measurement error, etc.  And the prediction can't be done in the usual statistical way, because that requires replicability but each person's genotype (as estimated from a blood sample, not to mention all his/her cells) is unique.  And each person's stature is changing during life.

What to do
In essence, we're not just playing word games here.  The number of causes, even just the genetic causes, of stature variation is truly infinite.  It is misleading of the authors to try to reassure readers that at least the number is finite.  That is in essence a tactic, perhaps inadvertent, that justifies the enumeration-approach form of business as usual.

In fact, the causes of stature have no limit (at least not until the death of our species).  That means that we basically cannot understand such traits comprehensively by enumeration or cause-counting approaches.  It's no more enumerable than, say, the values in a magnetic or gravitational field which are different in every infinitesimally small location.

The wise TH Morgan realized these issues in early 20th century form without needing to have all the expensive and extensive data that we are amassing.  But his statement was generic and one might say it called for confirmation.  We have had many other sorts of confirmation for similar traits, but perhaps what's been reported for stature closes the book on the basic question.

So now, if the science is to advance beyond a pretense of causal enumerability, what we need to do is develop some new, quantitative rather than enumerative causal concepts.  How we should do that is unknown, unclear, debatable,.... and in our business-as-usual environment, probably unfundable.


Thiago Carvalho said...

Great post.

You might like this in the New Yorker, some good anecdotes that illustrate some of your points:

Ken Weiss said...

I hadn't seen that story (we get the New Yorker). Anyone who has visited a museum of medieval suits of armor or clothing knows that stature is not just the reading of a genomic program.

keith grimaldi said...

BBC Men's average height 'up 11cm since 1870s'

Daniel M Parker said...

Yes, great post!

Ken Weiss said...

Reply to Keith G
A Swiss canton measured stature in the 1880's (or some time thereabouts) and again in the late 1900s. A paper in the Am J Phys Anth (as I recall which journal) compared them. There was of course a major shift upward in mean height.

So, if this is a response to environmental change then, just as if it were natural selection, one would expect only a few 'major' genes, ones that responded to (say) caloric or fat intake, would be easily found by a GWAS. But the opposite is true, as the new paper shows.

Likewise, there is no real difference in the tails of the stature distribution as the more recent (and, I think, new) papers found.

So simplistic models don't work and that should tell us something that should change our behavior (our approach). But the opposite is what the new paper shows!

keith grimaldi said...

I agree with you Ken. GWAS is statistically and scientifically "tight" but overall is "simple". Thousands of causal variants - what does that mean? Does it mean that each one of those thousands adds or subtracts a micrometer from my eventual height?

Or does it mean that in any particular individual *some* of these variants may influence eventual height?

Would it be for example that the actual subset of these genes in an individual that influenced height in that individual depended a lot on the individual's environment from conception onwards? Since almost all the atoms in my body have come from food eaten and a bit from the stuff I breathe I would suppose that what and how much I eat over the years of growth would affect which genes "caused" me to be my present height. Add that to infinity. Maybe a contributor to the poor predictability of common variants.

Your article reminded me of Hoagy Carmichael

Ken Weiss said...

Yes, I agree, Keith. We know that some alleles in some genes have major effects (e.g., Marfan's), and undoubtedly even within the 'normal' range (however we define that), some genes contribute more quantitatively in a given environment, or perhaps in a range of environments. But we have many precedents to suggest that even these may not have similar effects in different genomic backgrounds (e.g., if the same allele were present in East Africans, Mbuti, Mayans, Medieval Europeans.......).

So it is only convenience and the System by which we seem unable to change our belief in enumerability. That is, in my view, what Mendel did 'to' us along with the good things he did 'for' us: he made us think genomic causation is relatively deterministic and point-like.

Anonymous said...

'Thomas Henry Morgan'

You meant Hunt, right?

Anne Buchanan said...

Of course. Thanks! Now changed.

Anonymous said...

"In all seriousness, however, there is no reason to make serious criticism of their methods or their findings."

I disagree. The main trick these people are lately pulling up to get the numbers from typical 2% to 60% is from this 2010 paper -

There are so many parameters in this calculation that they can easily get any percent of heritability explained by playing with the parameters.

Ken Weiss said...

Reply to Anonymous
Peter Visscher, whom I've not met, is a very capable statistician who (I think) views the world as a statistical problem. These days, someone who very cleverly shows ways that tractably enumerative approaches to causation can have at least plausible predictive value is going to go far. Statistical methods usually involve assumptions and ad hoc practices that can sometimes get out of hand, as you say.

But even his work shows the reality of complexity in a grand scale. If you want to be a conventional thinker, his ideas will lead to various ways to deal with it, that seem reassuringly applicable.

I think it distracts from the need to dig deeper.

Laurie McGinness said...

I think there are some problems with this analysis. Not the least being that nucleotides do not occur randomly but in large highly evolved units with very limited percentage variation within them. That alone makes a nonsense of the figure given for potential variation. If genetic determinists are often guilty of over-statement then so is this piece.

Ken Weiss said...

Reply to Laurie
So, the number of causes of stature is only half infinity? Your comment, as written, shows a misperception both of the term 'random' in regard to mutational locations, and of the implication that a 'very limited percentage' of variation means no variation.

The point is that the causation of stature, itself a vague measure when age, sex, population and the like have to be regressed out, is not a digitally enumerable measure, and the genomic variation, only incompletely ascertained by the methods used, is not a closed set and I think not in any practical sense enumerable. And this is only in regard to the retrospective analysis of these particular data, not 'stature' as a human trait. Even the fraction of 'heritability' is rather misleading in terms of what that value really means.

But the main point is that when one sees a statement that, in effect, a mere many thousands of variants contribute, mostly very very small amounts, this should not be taken as reassuring that the kind of wholesale enumeration approach is meaningful or useful, or is how we should be thinking about complex quantitative traits of this sort.

Anonymous said...

Typo - you missed a "the"
while number of causally contributing genes is huge

Ken Weiss said...

Thanks. Blog posts...quickly written, poorly edited (me bad!). Hopefully (sometimes) cogent.