Showing posts with label height. Show all posts

Wednesday, October 8, 2014

The height of folly: are the causes of stature numerous but at least 'finite'?

In 1926, geneticist Thomas Hunt Morgan wrote this about stature:
A man may be tall because he has long legs, or because he has a long body, or both. Some of the genes may affect all parts, but other genes may affect one region more than another. The result is that the genetic situation is complex and, as yet, not unraveled. Added to this is the probability that the environment may also to some extent affect the end-product.
                                  (TH Morgan, The Theory of the Gene, p 294, 1926)
His point, of course, was not about stature per se but about the difficulty of identifying genes 'for' traits because there are many pathways to a trait, and they aren't all genetic.  This was understood eighty-eight years ago, and yet we have had study after study, ever larger, merging smaller studies, and all sorts of fancy statistics to account for various internal complications in genome sampling, and still the results pour forth as if we haven't learned what we need to know about this and many traits like it.


The most recent is an extensive study in (where else?) Nature Genetics, by a page-load of authors. In summary, the authors found, pooling all the data from many different independent studies in different populations, that at their most stringent statistical acceptance ('significance') level, 697 independent (uncorrelated) spots scattered across the genome each individually contributed in a significance-detectable way to stature.  The individual contributions were generally very small, but together about 20% of the overall estimated genetic component of stature was accounted for.  However, using other criteria and analysis, including lowering the acceptable significance level, and depending on the method, up to 60% of the heritability could be accounted for by around 9,500 different 'genes'.  Don't gasp!  This kind of complexity was anticipated by many previous studies, and the current work confirms that.

Many issues are swept under the rug here, however--that is, relegated to a sometimes obscure warren of Supplemental information.  The individuals were all of European descent, so genetic contributions in other populations are not included.  The analysis was adjusted for sex and age.  Subjects were all, presumably, 'normal' in stature (i.e., no victims of Marfan syndrome or dwarfism), and all healthy enough to participate as volunteers.  The majority were older than 18, but the range was 14 to 103, and 45% were male, 55% female.  The data were also adjusted for family relationships.

These are very capable authors--with such a major size crew, consisting of more 'authors' than locations in the genome, there must be at least some who know their business (and have actually read and understood the paper they purportedly authored).  In all seriousness, however, there is no reason to make serious criticism of their methods or their findings.  They pooled many different data sources and had to adjust for various sampling and data-quality aspects.  But it is comforting that the study confirms the earlier projections in general terms (and, we think, the data from earlier studies are part of the current meta-analysis).

But if there's no reason to question the study itself, there is a deeper issue.

The deeper issue
The authors rather glibly come to what they seem to feel (did they take a poll?) is the comforting conclusion that while the number of causally contributing genes is huge, it isn't 'infinite'.  But that is at most technically true and in fact is farther from the real truth than the authors intended, or perhaps even realized.

First, if by 'gene' the authors meant coding genes, then their 9500 is nearly half of all such genes in our genome.  Of course the definition of 'gene' is vague and debatable these days so we'll let that problem pass.  Still, 9500 is a lot, and it's just the proverbial tip (see below).

Secondly, stature is affected to a substantial extent by non-sequence-based variables, such as diet, early childhood diseases, maternal nutrition and so on.  There is no clear limit to what or how many such influences there are, and they are often quantitative, which means exposure can be measured in principle to an infinitesimal degree and hence there are infinitely many exposures and infinitely many statures.  This may not be much of a problem in present-day Europeans, but if we're talking about human stature, we have to recognize the causes not included here.

But, third, there are about 3 billion nucleotides in the human genome, each of which can take on 4 possible nucleotides.  Each person is diploid, so  there are 10 possible site-specific genotypes (AA, AC, AG, AT, CC, CG, CT, GG, GT, TT) at each site, and thus 10 to the 3-billionth power possible genomewide genotypes.  That may be more than all the stars, or even all the atoms in the universe, and of course the vast majority of these possible combinations of contributing variants will never occur in real people, and probably don't affect stature (under any conditions?), but we basically have no way of knowing which those are, or of showing that they don't matter.
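The counting in the paragraph above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for the example, and the figure of roughly 10 to the 80th atoms in the observable universe is a commonly quoted rough estimate, not something taken from the paper. Because the genome-wide count is far too large to write out, the sketch works with its base-10 logarithm.

```python
import math

NUCLEOTIDES = "ACGT"

def unordered_genotypes(alleles=NUCLEOTIDES):
    """List the diploid genotypes at one site, ignoring which parent gave which allele."""
    return [a + b for i, a in enumerate(alleles) for b in alleles[i:]]

def log10_genomewide_genotypes(n_sites, per_site):
    """log10 of per_site ** n_sites, i.e. the exponent of the genome-wide count."""
    return n_sites * math.log10(per_site)

genotypes = unordered_genotypes()
print(genotypes)       # AA, AC, AG, AT, CC, CG, CT, GG, GT, TT
print(len(genotypes))  # 10

# Assumption: about 3 billion sites, as stated in the text.
exponent = log10_genomewide_genotypes(3_000_000_000, len(genotypes))

# Assumption: ~10**80 atoms in the observable universe (rough, commonly quoted).
LOG10_ATOMS_IN_UNIVERSE = 80
print(exponent > LOG10_ATOMS_IN_UNIVERSE)  # True: the exponent itself is 3 billion
```

The same function applied to a mere 10,000 sites gives an exponent of 10,000, which is still vastly larger than 80.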

Well, fine, you say, that's a really big number of contributors--but it's not infinite!  That isn't true.



Actually, it is infinite!
By saying the number of causes of stature variation is finite, the authors essentially mean that the causes are enumerable and that there is an end to the counting, by which time all causes of stature will have been accounted for.  That is simply false.

There are deletions and insertions of open-ended locations, numbers and size in our genomes.  There is no known upper limit to genome size (as some huge-genomed species show), and hence no limit to the number of insertions that may occur sometime or somewhere.  And, as Morgan recognized so long ago, there are many ways to be short or tall.  Not all genome sites need have an effect on stature, but we don't know that and the more we look the more we find, so in practice the number of genomic influences on stature, not counting environmental factors, is in fact not limited or countable: it can keep on growing indefinitely.  It is literally not finite, but is infinite!

One must also remember that by far the greatest effects on stature are not even being counted!  These studies only include adults in a 'normal' range at normal adult ages (old people, for example, lose height substantially as their posture sags and their intervertebral disks shrink).  The authors report that across their age range they didn't see dramatic mapping differences, but in pooled studies with different numbers at different ages, this has to be a weak test.  More important, however, is that different cohorts, decades apart, have very different statures.  In Japan, for example, mean height has changed by several inches in the last 75 years, and of course in most of the world stature is impaired by chronic infectious disease, nutritional deficits, and the like.  Thus, sex and cohort (when you are living, such as now vs the Middle Ages in Europeans, the subjects of this study) are regressed out and don't count, and the effects being reported are just relatively small modifications of these major causes.  So when one speaks of cause, this study is retro-fitting genomic data to existing individuals, which is by no means the same as using genomes to predict stature of future individuals--and that raises questions about the finiteness of the whole process.

Further, even if a mere 10,000 positions in the genome affect stature, there are still 10 to the 10,000th power possible contributing genotypes, vastly more than the number of people alive today (large but finite).  And the genotypes of people who have died or are yet to be born cannot be measured even in principle--and many millions of births and deaths have happened since the studies in this paper were done.

Most of the causal variant sites are likely to be exceedingly rare in the population, because most variants in general are due to very recent mutations, and that means first that they aren't in the authors' samples, and likely won't be in anybody's samples, even if some bewildered agency keeps funding such studies.  And the variants' effects won't be detectable, because a very rare variant, even in large samples like these, cannot generate a statistically significant result.
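To put rough numbers on why a very rare variant cannot reach significance, here is a back-of-the-envelope sketch. Everything in it is an assumption made for illustration, not the authors' method: a purely additive variant in Hardy-Weinberg proportions, an effect size given in standard deviations of height, the conventional genome-wide threshold of 5e-8, a normal approximation to the association test, and a hypothetical sample of 250,000 people.

```python
from statistics import NormalDist

def variance_explained(freq, beta_sd):
    """Fraction of trait variance due to one additive variant (Hardy-Weinberg assumed)."""
    return 2.0 * freq * (1.0 - freq) * beta_sd ** 2

def approx_power(n, freq, beta_sd, alpha=5e-8):
    """Approximate chance that the variant passes a two-sided threshold alpha.

    Normal approximation; the negligible opposite tail is ignored.
    """
    q2 = variance_explained(freq, beta_sd)
    z_threshold = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    expected_z = (n * q2 / (1.0 - q2)) ** 0.5
    return 1.0 - NormalDist().cdf(z_threshold - expected_z)

N = 250_000  # hypothetical sample size, for illustration only

# A common variant with a small effect (hypothetical values).
print(approx_power(N, freq=0.3, beta_sd=0.05))

# A very rare variant with a huge effect of one full standard deviation
# (hypothetical values): only about 2 * N * freq = 5 copies are expected.
print(2 * N * 1e-5)
print(approx_power(N, freq=1e-5, beta_sd=1.0))
```

Under these assumptions the common, small-effect variant is almost certain to be detected, while the rare variant is almost certain to be missed despite its much larger effect on the few people who carry it.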

Worse, you have roughly, say, a trillion cells in your body.  All their genomes are descended from the dual set of chromosomes in the fertilized egg by which you started life, but each cell division creates a few new mutations that, if not lethal to that cell, are transmitted to the cell lineage into which that cell divides.  So you have 6 billion nucleotides, each potentially variable in one or any unspecified number of a trillion cells.  Your 'genotype' is the aggregate of these possible variants--and that is far more potential variation than in the whole 7 billion living humans, each of whom has a net genotype of his/her trillion cells.  Since cells in a person are constantly dying and being lost, while new ones are produced, not even a single person's genotype is in any sense countable or enumerable. And again, we have no way to know which variants in which cells in which tissues (in which lifestyle and other environments) affect stature.

Well, fine, you say, but the number of causes is at least manageable.  But that isn't true, either!

And not even countably infinite!
We can push this even further.  In the 1800s the German mathematician Georg Cantor pioneered the study of infinity.  He identified various levels or types of infinity.  The smallest, if such a term can be used, was countable infinity.  Like the integers, 0,1,2,...., one can count or enumerate them, even if one never reaches the end.  Further, one can pair them in a way, for example, the even and odd numbers are an equal-sized infinite series, and can be paired such as 1-2, 3-4,5-6,.... one by one, without limit.

But then there is a next level of uncountable infinity.  The range of 'real' (decimal) numbers from zero to 1 is like that.  One cannot match them up with countably infinite numbers, such as integers.  It is tempting to say that's because one can always cut a real-number interval like that between 0 and 1 into more and more smaller pieces: we might start by matching 1 - 0.01, 2 - 0.02, 3 - 0.03 and so on, then chop the 0-1 interval into smaller values 0.001, 0.002, 0.003 etc., and then finer still: 0.0001, 0.00011, 0.00012...  But chopping alone isn't quite enough, since the terminating decimals produced this way can themselves be counted.  What Cantor showed is stronger: take any proposed pairing of the integers with the real numbers between 0 and 1, and one can always construct a real number that the pairing has missed.  The uncountably infinite 'real' numbers will always be ahead of the countably infinite numbers we'd try to pair them up with.
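Cantor's own proof that the real numbers between 0 and 1 cannot be paired off with the integers is the 'diagonal argument': given any numbered list of decimals, build a new decimal whose k-th digit differs from the k-th digit of the k-th entry, so it cannot be anywhere on the list. The sketch below illustrates the construction on the first 30 entries of a made-up listing (the fractions 1/997, 2/997, ...); the function names and the listing are hypothetical, and a finite program can only illustrate the idea, not prove it.

```python
def decimal_digits(numerator, denominator, n_digits):
    """First n_digits after the decimal point of numerator/denominator (long division)."""
    remainder = numerator % denominator
    out = []
    for _ in range(n_digits):
        remainder *= 10
        out.append(str(remainder // denominator))
        remainder %= denominator
    return "".join(out)

def example_listing(k, n_digits=30):
    """Hypothetical numbered list: entry k is the decimal expansion of (k + 1) / 997."""
    return decimal_digits(k + 1, 997, n_digits)

def diagonal_escape(listing, n_entries):
    """Build a decimal that differs from entry k in its k-th digit, for every k < n_entries.

    Only the digits 4 and 5 are used, which sidesteps the 0.4999... = 0.5000...
    double-representation issue.
    """
    digits = []
    for k in range(n_entries):
        kth_digit = listing(k)[k]
        digits.append("4" if kth_digit == "5" else "5")
    return "0." + "".join(digits)

escaped = diagonal_escape(example_listing, 30)
print(escaped)

# The new number disagrees with every listed entry in at least one digit.
print(all(escaped[2 + k] != example_listing(k)[k] for k in range(30)))  # True
```

Whatever listing is plugged in, the same construction produces a number the listing missed, which is why no pairing with the integers can ever be complete.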

This is a second level of infinity.  In fact, if you start thinking of DNA as a sequence of nucleotides, even with some of the complications mentioned above, you might think that the contributors to stature were countably infinite, one per nucleotide.  But it's worse. That's because stature itself is infinitesimally measurable, significance cutoff values are infinitesimally divisible, and DNA is not just a digital sequence: it has 4-dimensional properties that involve other molecules (epigenetic modification, for example), levels of gene expression depend quantitatively on the efficiency of binding of regulatory proteins, and what happens in one part of DNA depends on what's happening in other parts at a given time.  For such reasons, stature does not come in a countably infinite number of values and so neither can its causes.  From a causal point of view, stature is uncountably infinite.

Another way to see this is that even with the enumerably infinite (or even with the authors' claim that stature is causally 'finite') we cannot discriminate in the most proper causal way.  That is because open-ended, unknown (probably unknowable) numbers or sets or combinations of causes can yield the same stature value (which, of course, changes during life, historic period, and sex).  We cannot infer genomic cause from stature (an inference I've called Detectance in the past).  Nor can we predict stature from genotype (the Predictance of a genotype), because there is more to stature than genes and a continuous range of possible stature values for a given genotype (just as we won't be able to predict an individual's facial appearance from genes, but that's another story).  And here we're not considering computing power or measurement error, etc.  And the prediction can't be done in the usual statistical way, because that requires replicability but each person's genotype (as estimated from a blood sample, not to mention all his/her cells) is unique.  And each person's stature is changing during life.

What to do
In essence, we're not just playing word games here.  The number of causes, even just the genetic causes, of stature variation is truly infinite.  It is misleading of the authors to try to reassure readers that at least the number is finite.  That is in essence a tactic, perhaps inadvertent, that justifies the enumeration-approach form of business as usual.

In fact, the causes of stature have no limit (at least not until the death of our species).  That means that we basically cannot understand such traits comprehensively by enumeration or cause-counting approaches.  It's no more enumerable than, say, the values in a magnetic or gravitational field which are different in every infinitesimally small location.

The wise TH Morgan realized these issues in early 20th century form without needing to have all the expensive and extensive data that we are amassing.  But his statement was generic and one might say it called for confirmation.  We have had many other sorts of confirmation for similar traits, but perhaps what's been reported for stature closes the book on the basic question.

So now, if the science is to advance beyond a pretense of causal enumerability, what we need to do is develop some new, quantitative rather than enumerative causal concepts.  How we should do that is unknown, unclear, debatable,.... and in our business-as-usual environment, probably unfundable.

Wednesday, February 3, 2010

Don't throw out those Victorian-era calipers just yet!

Type 2 diabetes
Is this a trend? We've seen two recent papers that say that non-genetic factors are better predictors of a trait than are known genes. One study, published in the British Medical Journal (14 Jan, 2010, Talmud et al.) and described here, has to do with prediction of type 2 diabetes. Researchers followed thousands of government workers in London for nearly 20 years, and found that two different risk assessment tools, the Cambridge and the Framingham models based on environment and health measures, were significantly better at predicting risk than a model based on twenty or so risk alleles.
When the researchers assessed the so-called Cambridge and Framingham type 2 diabetes risk models, which are based on non-genetic factors such as age, sex, family history, waist circumference, body mass index, smoking behavior, cholesterol levels and so on, they found that both predicted risk of the disease better than a genetic risk model based on 20 common, independently inherited risk SNPs.

The Cambridge model had 19.7 percent sensitivity for detecting type 2 diabetes cases in the Whitehall cohort based on a five percent false positive rate, while the Framingham model had 30.6 percent sensitivity. The gene count score, meanwhile, detected 6.5 percent of cases at a five percent false positive rate and 9.9 percent of cases at a 10 percent false positive rate.
Of course, sex is genetically determined, and so may waist circumference, body mass index or cholesterol be, at least to some extent, so it's arguable that once the genes for these traits are known, it will be possible to predict diabetes risk with somewhat more precision. But even if we assume that genes for these traits are major, and will be found -- and regular readers of this blog will know that we would be dubious about this -- age, diet, smoking behavior, activity levels, and so on are not genetic factors, yet have major effects on risk.

To most of us, this might suggest that we hold off sending our DNA to one of the companies that do genetic risk prediction.  The senior director for research at one of the direct-to-consumer genetic testing companies says that although they may not be able to predict risk very precisely for everyone, for someone at high risk they do a good job. But how important is that, really?  People at high risk of genetic diseases generally belong to families in which the disease is already known, or in which one of the few well-documented major mutations is present. And anyway, type 2 diabetes can often be prevented with diet and exercise, so perhaps people in high-risk families would be better off spending their money on nutritionists or personal trainers rather than genetic testing.

Height
The other paper, published in the European Journal of Human Genetics, Feb 18, 2009, with the charming title, "Predicting human height by Victorian and genomic methods" (Aulchenko et al.), compares prediction based on 54 genetic loci that have been shown to have strong statistical association with height, and prediction based on the height of a subject's parents.
In a population-based study of 5748 people, we find that a 54-loci genomic profile explained 4–6% of the sex- and age-adjusted height variance, and had limited ability to discriminate tall/short people, as characterized by the area under the receiver-operating characteristic curve (AUC). In a family-based study of 550 people, with both parents having height measurements, we find that the Galtonian mid-parental prediction method explained 40% of the sex- and age-adjusted height variance, and showed high discriminative accuracy.
Ironically, given how difficult it is to find a large enough suite of predictive genes, height has been shown by numerous studies to be one of the most heritable human traits, with heritability on the order of 80% or more.
It can be expected that once all loci involved in human height are shown, the discriminative accuracy of the genomic approach may surpass that of the Galtonian [Victorian] approach. However, it will be a tall order to find all these variants, at least using the current methodology consisting of (meta-analysis) of genome-wide association studies, tailored to capture common variants.
Partly this is because the variants described to date will turn out to be those with the greatest effect. Indeed, one estimate is that thousands of genes contribute to traits like stature or diabetes, most with very small effect. And, too, if height is like other traits, and there's no reason to expect it won't be, there will be many genes that are found only in some populations or even in single families.

Now, the story is not all negative for the predictive value of genetics; another paper reflects recent efforts to tally the many small effects that can be gleaned by creative use of GWAS data (Evans et al., Human Molecular Genetics, vol.18, pages 3524-3521). The argument is a sophisticated statistical one, but the general conclusion is that when you do this you gain some information beyond what you get from the already known stronger-effect genotypes, but that gain is not great, and it is at present so complex as to have no obvious clinical value in most cases.

And these analyses were case-control studies, in which environments were essentially unmeasured (matching was done for age and sex and perhaps some other variables, but not extensively for the many variables that make the overall fraction of genetic control far less than 100%, and usually less than 50%).

The bottom line is: watch what you eat, and walk, don't ride, to work!


(Thanks to Eric Schmidt for alerting us to the diabetes story.)