
Tuesday, August 13, 2019

Big Data, Big 'omics-everything....and the sorry state of biomedical research support

People on the federal grant take often speak of 'Big Data' with a nearly lascivious joy.  Big Data is code for biomedical research projects that once begun are too large and costly to terminate, whether or not they did, or do, deliver seriously important truths--truths worthy of their cost.

But when careers depend on it, Big Data and the associated Everything-'omics are as much a fad and a ploy for job security for medical school researchers as they are science.  Huge open-ended projects need claim nothing that can be rigorously tested, and such fishing expeditions are likely to find at least something that can be blared to the public as a major discovery.  We see it every day, it seems.

One can criticize the Big Hunger for Big Data projects, but they are survival tactics for biomedical researchers, many of whom get their salaries only from external funds, and for those without actual ideas, or the means, to find something profoundly new.  Those 'means' should include freedom from regular bottom-line accountability: hard problems can't be solved on some predictable, deliverables-based schedule.  That is because the biological world is complex!

There is, as a rule, not one gene, or even a consistent few genes, that 'cause' the traits we care to understand.  Environment--a complex and vague term--interacts with organisms to yield their traits. This is how, via evolutionary processes, we got here in the first place.  Interactions, redundancies, and other complexities have been built into our biology for countless millions of years. Indeed, complexity is protection against vulnerability, and organisms have clearly evolved complexity for that among other reasons (including that redundancy and complexity make room for new innovations to evolve without lethally threatening current systems).  So we should be surprised when we don't find complex redundancy--and indeed we have clearly been documenting its universality. And we should stop promising that we'll find the 'genes for' complex traits, like heart disease, obesity, etc.

But complexity is not good for the research system
The current research funding system started roughly after WWII, when funds were plentiful and the army of investigators and their staffs and administrators was far smaller.  This is an historical, sociological fact rather than one about the causative nature of the world that we want to understand.

Research used to be much more about solving scientific questions by forming and testing hypotheses, experiment, and so on, than about salaries, overhead, and the status of having a Big research group.  But many biological problems are complex--really complex, not merely 'complex' as a self-promoting adjective--and addressing them should involve patience as well as funding that is stable and doesn't rest on rushing stories to the news media or other means of hyping findings.  Careerism as we see it today is not compatible with this, and the current gross waste on the fad for Big Fishing Expeditions shows it.

Science today has become pretty much a self-sustaining System, from bureaucrats at all levels to investigators who must sing (i.e., claim Big Results) for their supper.  It sows dishonor and misrepresentation (as well as the occasional desperate scientific fraud).  We all know this but somehow seem powerless to redress things and restore research to the realm of science, where the science is the substance, not the window-dressing for fiscal needs, as it so often is today.  When the system is hyper-competitive, who can afford to confront it?

Things can be changed--even if it won't be easy
Reform is never easy, as established conditions are very inertial.  Vested interests, like grant-dependent salaries, must be phased out or even, sometimes, uprooted.  But over recent decades we've clearly allowed those conditions and empires to be built; indeed, we were often part of building them.  Now it is time for deep, difficult reform.  We need to fund Big Ideas and Big Efforts at Hard Problems, not just facile, sloganized Big Data or 'omics-everything.

But is there the will--and the bravery--for reform?  Where is it, or where can it be nurtured?  The unrest must be stirred to action.

Friday, October 19, 2018

Nyah, nyah! My study's bigger than your study!!

It looks like a food-fight at the Precision Corral!  Maybe the Big Data era is over!  That's because what we really seem to need (of course) is even bigger GWAS or other sorts of enumerative (or 'EnumerOmic') studies, because then (and only then) will we really understand how complex traits are caused, so that we can produce 'precision' genomic medicine to cure all that ails us.  After all, there is no such thing as enough 'data' or a big (and open-ended) enough study.  Of course, because so much knowledge....er, money....is at stake, such a food-fight is not just children in a sandbox, but purported adults, scientists even, wanting more money from you, the taxpayer (what else?).  The contest will never end on its own.  It will have to be ended from the outside, in one way or another, because it is predatory: it takes resources away from what might be focused, limited, but actually successful problem-solving research.

The idea that we need larger and larger GWAS, not to mention almost any other kind of 'omics enumerative study, reflects the deeper truth that we have no idea what to do with what we've got.  The easiest word to say is "more", because that keeps the fiscal flood gates open.  Just as preachers keep the plate full by promising redemption in a future that, like an oasis to desert trekkers, can be a mirage never reached, scientists are modern preachers who've learned the tricks of the trade.  And, of course, since each group wants its own flood gates to stay wide open, it must resist even the faintest suggestion that somebody else's gates might open wider.

There is a kind of desperate defense, as well as food fight, over the situation.  This, at least, is one way to view a recent exchange.  Boyle et al. (Cell 169(7):1177-86, 2017**) assert that a few key genes, perhaps with rare alleles, are the 'core' genes responsible for complex diseases, while lesser, often indirect or incidental genes across the genome provide other pathways to affect the trait and are what GWAS detect.  If a focus on this model were to take hold, it might threaten the gravy train of more traditional, more mindless Big Data chasing. As a plea to avoid that, Wray et al.'s falsely polite spitball in return (Cell 173:1573-80, 2018**) urges that causal variants really are spread all over the genome, differently so in everyone.  Thus, of course, the really true answer is some statistical prediction method--after we have more and even larger studies.

Could it be, possibly, that this is at its root merely a defense of large statistical data bases and Big Data per se, expressed as if it were a legitimate debate about biological causation?  Could it be that for vested interests, if you have a well-funded hammer everything can be presented as if it were a nail (or, rather, a bucket's worth of nails, scattered all over the place)?

Am I being snide here? 
Yes, of course. I'm not the Ultimate Authority to adjudicate about who's right, or what metric to use, or how many genome sites, in which individuals, can dance on the head of the same 'omics trait.  But I'm not just being snide.  One reason is that both the Boyle and Wray papers are right, as I'll explain.

The arguments seem in essence to assert either that complex traits are due to many genetic variants strewn across the genome, or that a few rare larger-effect alleles here and there are complemented by other variants, perhaps acting through indirect pathways on the 'main' genes, scattered across the genome ('omnigenic').  Or is it that we can tinker with GWAS results, and various technical measurements derived from them, to get the real truth?

We are chasing our tails these days in an endless-seeming circle to see who can do the biggest and most detailed enumerative study, finding the most and tiniest of effects, with the most open-ended largesse, while Rome burns.  Rome, here, means the victims of the many diseases that might be studied, with actual positive therapeutic results, by more focused, if smaller, studies.  Or, in many cases, by a real effort at revealing and ameliorating the lifestyle exposures that typically, one might say overwhelmingly, are responsible for common diseases.

If, sadly, it were to turn out that there is no more integrative way, other than add-'em-up, by which genetic variants cause or predispose to disease, then at least we should know that and spend our research resources elsewhere, where they might do good for someone other than universities.  I actually happen to think that life is more integratively orderly than its effects typically being enumeratively additive, and that more thoughtful approaches, indeed reflecting findings of the decades of GWAS data, might lead to better understanding of complex traits.  But this seemingly can't be achieved by just sampling extensively enough to estimate 'interactions'.  The interactions may, and I think probably, have higher-level structure that can be addressed in other ways.

But if not, if these traits are as they seem, and there is no such simplifying understanding to be had, then let's come clean to the public and invest our resources in other ways to improve our lives before these additive trivia add up to our ends when those supporting the work tire of exaggerated promises.

Our scientific system, that we collectively let grow like mushrooms because it was good for our self interests, puts us in a situation where we must sing for our supper (often literally, if investigators' salary depends on grants).  No one can be surprised at the cacophony of top-of-the-voice arias ("Me-me-meeeee!").  Human systems can't be perfect, but they can be perfected.  At some point, perhaps we'll start doing that.  If it happens, it will only partly reflect the particular scientific issues at issue, because it's mainly about the underlying system itself.


**NOTE: We provide links to sources, but, yep, they are paywalled--unless you just want to see the abstract or have access to an academic library.  If you have the loony idea that as a taxpayer you have already paid for this research, so that private selling of its results should be illegal--sorry!--that's not our society.

Tuesday, October 16, 2018

Where has all the thinking gone....long time passing?

Where did we get the idea that our entire nature, not just our embryological development, but everything else, was pre-programmed by our genome?  After all, the very essence of Homo sapiens, compared to all other species, is that we use culture--language, tools, etc.--to do our business rather than just our physical biology.  In a serious sense, we evolved to be free of our bodies; our genes made us freer from our genes than most if not all other species! And we evolved to live long enough to learn--language, technology, etc.--in order to live out our thus-long lives.

Yet isn't an assumption of pre-programming the only assumption by which anyone could legitimately promise 'precision' genomic medicine?  Of course, Mendel's work, adopted by human geneticists over a century ago, allowed great progress in understanding how genes lead at least to the simpler of our traits, those with discrete (yes/no) manifestations--traits that do include many diseases that really, perhaps surprisingly, do behave in Mendelian fashion, and for which concepts like dominance and recessiveness have been applied and, sometimes, at least approximately hold up to closer scrutiny.

Even 100 years ago, agricultural and other geneticists who could do experiments, largely confirmed the extension of Mendel to continuously varying traits, like blood pressure or height.  They reasoned that many genes (whatever they were, which was unknown at the time) contributed individually small effects.  If each gene had two states in the usual Aa/AA/aa classroom example sense, but there were countless such genes, their joint action could approximate continuously varying traits whose measure was, say, the number of A alleles in an individual.  This view was also consistent with the observed correlation of trait measure with kinship-degree among relatives.  This history has been thoroughly documented.  But there are some bits, important bits, missing, especially when it comes to the fervor for Big Data 'omics analysis of human diseases and other traits.  In essence, we are still, a century later, conceptual prisoners of Mendel.
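The classroom argument described above can be sketched in a few lines of code.  This is a toy simulation, not anyone's actual model: the number of genes, the allele frequency, and the equal per-allele effects are all illustrative assumptions, with the trait measured simply as the number of A alleles an individual carries.

```python
import random

N_GENES = 500    # hypothetical number of contributing genes (illustrative)
N_PEOPLE = 2000
FREQ_A = 0.5     # assumed frequency of the 'A' allele at every locus

def trait_value(n_genes=N_GENES, freq_a=FREQ_A):
    """Trait = count of A alleles across n_genes diploid loci (2 draws per locus)."""
    return sum(1 for _ in range(2 * n_genes) if random.random() < freq_a)

values = [trait_value() for _ in range(N_PEOPLE)]
mean = sum(values) / len(values)
var = sum((v - mean) ** 2 for v in values) / len(values)

# With many small, independent, additive effects the trait distribution is
# approximately normal (central limit theorem): mean ~ 2*N*p, variance ~ 2*N*p*(1-p).
print(round(mean), round(var))
```

Even this crude sketch reproduces the century-old result: a bell-shaped, continuously varying trait from purely discrete Mendelian inputs, consistent with trait correlations among relatives.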

'Omics over the top: key questions generally ignored
Let us take GWAS (genomewide association studies) on their face value.  GWAS find countless 'hits', sites of whatever sort across the genome whose variation affects variation in WhateverTrait you choose to map (everything simply must be 'genomic' or some other 'omic, no?).  WhateverTrait varies because every subject in your study has a different combination of contributing alleles.  Somewhat resembling classical Mendelian recessiveness, contributing alleles are found in cases as well as controls (or across the measured range of quantitative traits like stature or blood pressure), where the measured trait reflects how many A's one has: WhateverTrait is essentially the sum of A's in 'cases', which may be interpreted as a risk--some sort of 'probability' rather than certainty--of having been affected or of having the measured trait value.

We usually treat risk as a 'probability,' a single value, p, that applies to everyone with the same genotype.  Here, of course, no two subjects have exactly the same genotype, so some sort of aggregate risk score, adding up each person's 'hits', is assigned a p.  This, however, tacitly assumes that each site contributes some fixed risk or 'probability' of affection.  But that treats these values as if they were essential to the site, each thus acting as a parameter of risk.  That is, sites are treated as carrying a kind of fixed value or, one might say, 'force', relative to the trait measure in question.
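To make the add-up-the-hits logic concrete, here is a minimal sketch of how an additive GWAS-style score is typically converted into a 'risk probability'.  Everything numerical in it is hypothetical--the per-site effect sizes, the baseline log-odds, and the genotype--since in practice these are estimated from past samples, which is exactly the point at issue.

```python
import math
import random

random.seed(1)
N_SITES = 100
# Hypothetical per-risk-allele effect sizes (log odds ratios), mostly tiny.
effects = [random.gauss(0.0, 0.05) for _ in range(N_SITES)]
BASELINE = -2.0   # assumed baseline log-odds of being affected

def risk_probability(genotype):
    """genotype: risk-allele counts (0, 1, or 2) at each site; additive model."""
    log_odds = BASELINE + sum(g * b for g, b in zip(genotype, effects))
    return 1.0 / (1.0 + math.exp(-log_odds))   # logistic link -> a 'probability'

person = [random.choice([0, 1, 2]) for _ in range(N_SITES)]
p = risk_probability(person)
print(f"assigned 'risk': {p:.3f}")
```

Note that the code simply treats each effect size as a fixed parameter of the site, which is precisely the tacit assumption criticized above: the numbers were induced from one past sample, then deployed as if they were deductive constants.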

One obvious and serious issue is that these are necessarily estimated from past data, that is, by induction from samples.  Not only is there sampling variation that usually is only crudely estimated by some standard statistical variation-related measure, but we know that the picture will be at least somewhat different in any other sample we might have chosen, not to mention other populations; and those who are actually candid about what they are doing know very well that the same people living in a different place or time would have different risks for the same trait.

No study is perfect, so we use some conveniently assumed well-behaved regression/correction adjustments to account for the statistical 'noise' due to factors like age, sex, and unmeasured environmental effects.  Much worse than these issues, there are clear sources of imprecision, and the obvious major one--taboo even to think about, much less mention--is that relevant future factors (mutations, environments, lifestyles) are unknowable, even in principle.  So what we really do, are forced to do, is extend what the past was like to the assumed future.  Besides this, we don't count somatic changes (mutations arising in body tissues during life, that were not inherited), because they'd mess up our assertions of 'precision', and we can't measure them well in any case (so just shut one's eyes and pretend the ghost isn't in the house!).

All of these together mean that we are estimating risks from imperfect existing samples and past life-experience, but treating them as underlying parameters so that we can extend them to future samples.  That equates induction with deduction, assuming the past is rigorously parametric and the future will be the same; but this is simply scientifically and epistemologically wrong, no matter how inconvenient it is to acknowledge.  Mutations, genotypes, and environments of the future are simply unpredictable, even in principle.

None of this is a secret, or a new discovery, in any way.  What it is, is inconvenient truth. These things by themselves--without even invoking the environmental factors that, we know very well, typically predominate--should have been enough to show that the NIH's 'precision' promises cannot be accurate ('precise'), or even accurate to a knowable degree.   Yet this 'precision' sloganeering is being sheepishly aped all over the country by all sorts of groups who don't think for themselves and/or who go along lest they get left off the funding gravy train.  This is the 'omics fad.  If you think I am being too cynical, just look at what's being said, done, published, and claimed.

These are, to me, deep flaws in the way the GWAS and other 'omics industries, very well-heeled, are operating these days, to pick the public's pocket (pharma may, slowly, be awakening-- Lancet editorial, "UK life science research: time to burst the biomedical bubble," Lancet 392:187, 2018).  But scientists need jobs and salaries, and if we put people in a position where they have to sing in this way for their supper, what else can you expect of them?

Unfortunately, there are much more serious problems with the science, and they have to do with the point-cause thinking on which all of this is based.

Even a point-cause must act through some process
By far most of the traits, disease or otherwise, that are being GWAS'ed and 'omicked these days, at substantial public expense, are treated as if the mapped 'causes' are point causes.  If there are n causes, and a person has an unlucky set m out of many possible sets, one adds 'em up and predicts that person will have the target trait.  And there is much that is ignored, assumed, or wishfully hidden in this 'will'.  It is not clear how many authors treat it, tacitly, as a probability vs a certainty, because no two people in a sample have the same genotype and all we know is that they are 'affected' or 'unaffected'.

The genomics industry promises, essentially, that from conception onward, your DNA sequence will predict your diseases, even if only in the form of some 'risk'; the latter is usually a probability and despite the guise of 'precision' it can, of course, be adjusted as we learn more.  For example, it must be adjusted for age, and usually other variables.  Thus, we need ever larger and more and longer-lasting samples.  This alone should steer people away from being profiteered by DNA testing companies.  But that snipe aside, what does this risk or 'probability' actually mean?

Among other things, those candid enough to admit it know that environmental and lifestyle factors have a role, interacting with the genotype if not, usually, overwhelming it.  This means, for example, that the genotype confers only some, often modest, risk probability, the actual risk being much more affected by lifestyle factors, most of which are not measured, not measured with accuracy, or not even yet identified.  And usually there is some aspect that relates to age, or some assumption about what 'lifetime' risk means.  Whose lifetime?

Aspects of such a 'probability'
There are interesting, longstanding issues about these probabilities, even if we assume they have some kind of meaning.  Why do so many important diseases, like cancers, only arise at some advanced age?  How can a genomic 'risk' be so delayed, and so different among people?  Why do mice, with genotypes very similar to humans' (which is why we do experiments on them to learn about human disease), live only to about 3 while we live into our 70s and beyond?

Richard Peto raised some of these questions many decades ago.  But they were never really addressed, even in an era when NIH et al were spending much money on 'aging' research, including studies of lifespan.  There were generic theories suggesting, on evolutionary grounds, why some diseases are deferred to later ages (this is called 'antagonistic pleiotropy'), but nobody seriously tried to explain why from a molecular/genetic point of view.  Why do mice live only 3 years, anyway?  And so on.

These are old questions and very deep ones but they have not been answered and, generally, are conveniently forgotten--because, one might argue, they are inconvenient.

If a GWAS score increases the risk of a disease that has a long-delayed onset pattern, often striking late in life and highly variable among individuals or over time, what sort of 'cause' is that genotype?  What is it that takes decades for the genes to affect the person?  There are a number of plausible answers, but they get very little attention, at least in part because they stand in the way of the vested interests of entrenched, too-big-to-kill, faddish Big Data 'research' that demands instant promises to the public it is trephining for support.  If the major reason is lifestyle factors, then the very delayed onset should be taken as persuasive evidence that the genotype is, in fact, by itself not a very powerful predictor.

Why would the additive effects of some combination of GWAS hits lead to disease risk?  That is, in our complex nature, why would each gene's effects be independent of each other contributor's?  In fact, mapping studies usually show evidence that other things, such as interactions, are important--but at present they are almost impossibly complex to understand.

Does each combination of genome-wide variants have a separate age-onset pattern, and if not, why not?  And if so, how does the age effect work (especially if not due to person-years of exposure to the truly determining factors of lifestyle)?  If such factors are at play, how can we really know, since we never see the same genotype twice? How can we assume that the time-relationship with each suspect genetic variant will be similar among samples or in the future?  Is the disease due to post-natal somatic mutation, in which case why make predictions based on the purported constitutive genotypes of GWAS samples?

Obviously, if long-delayed onset patterns are due not to genetics but to lifestyle exposures interacting with genotypes, then perhaps lifestyle exposures should be the health-related target, not exotic genomic interventions.  Of course, the value of genome-based prediction clearly depends on environmental/lifestyle exposures, and the future of these exposures is obviously unknowable (as we clearly do know from seeing how unpredictably past exposures have affected today's disease patterns).

The point here is that our reliance on genotypes is a very convenient way of keeping busy and bringing in the salaries, but not of facing up to the much more challenging issues that the easy approach (run lots of samples through DNA sequencers) can't address.  I did not invent these points, and it is hard to believe that at least the more capable and less me-too scientists don't know them clearly, if quietly.  Indeed, I know this from direct experience.  Yes, scientists are fallible and vain, and we're only human.  But of all human endeavors, science should be based on honesty, because we have to rely on trust in each other's work.

The scientific problems are profound, not easily solved, and not soluble in a hurry.  But much of the problem comes from the funding and careerist system that shackles us.  This is the deeper explanation in many ways.  The paint on the House of Science is the science itself, but it is the House that supports that paint that is the real problem.

A civically responsible science community, and its governmental supporters, should be freed from the iron chains of relentless Big Data for their survival, and start thinking, seriously, about the questions that their very efforts over the past 20 years, on trait after trait, in population after population, and yes, with Big Data, have clearly revealed.

Monday, September 3, 2018

Luigi Luca Cavalli-Sforza (1922-2018), worth remembering

Luigi Luca Cavalli-Sforza has died, at age 96.  Who?  I wonder how many readers of this blog, or in general how many anthropologists or human geneticists of less than middle-age, know who he was.  We are not, these days, in the habit of crediting the past.  But Luca, at least, is worth remembering--for those who remember him.


Born in Italy, Luca received his MD there and then, when WWII ended, studied population genetics with RA Fisher in England.  He spent most of his career at Stanford, where he taught and did his creatively integrative and theoretical work on human variation and its evolution.  He collaborated with many colleagues from around the world.  He was, from my graduate-student years on, probably the leading human population geneticist in the world.  He developed numerical methods for analyzing human allele-frequency variation, relating its pattern to global population history and to other kinds of data.  In particular, he was interested in the relationship of language patterns to genetics, and in the causal relationship between cultural dynamics--such as the spread of agriculture--and genetic diversity; he developed ways to analyze how the latter could be used to help reconstruct the former.

I was incredibly fortunate to have been able to spend a sabbatical in Luca's lab at Stanford, in the late '70s.   I had no particular 'project' to work on but, typically for him, he hosted me anyway.  It was enough for me to know that he and his associates were leaders in human population genetics, as a science per se, but also that they were so original and creative in relating genetic variation to cultural, language, and technological history.  His was a synthetic view.  He was technically original and advanced, but based on innovative and integrative thinking.

Luca wrote much, but his two most memorable and durable books, which encapsulated much or most of his interests, are (1) The Genetics of Human Populations, with co-author Walter Bodmer (W. H. Freeman, 1970; and a subsequent watered-down version with the author order reversed), and (2) the massive History and Geography of Human Genes, with P. Menozzi and A. Piazza (Princeton University Press, 1994).  Both are still available, I think, the former in a Dover reprint.  The Genetics of Human Populations was a digestible but sophisticated version of population genetics theory and method, suitable for understanding human origins, allele frequency variation, and evolution.  Many anthropologists and others learned their trade from this book.  The second book was the final word on traditional allele-frequency (rather than DNA sequence) based reconstructions of human global variation.

Luca worked on topics too numerous to go over here.  But they are very well described in John Hawks's fine summary of Luca's work, and the Wikipedia entry on Luigi Luca Cavalli-Sforza provides references.  For anyone even remotely interested in the history of anthropological genetics and its contribution to human evolution and culture history, it will be worth the effort to become familiar with these foundational contributions.

Luca was a central figure in the attempt to organize a worldwide, systematic sampling of human variation (the Human Genome Diversity Project, or HGDP).  That project never took place as such: economically, it came into funding conflict with the Human Genome Project, whose aim was to generate the sequence of a representative complete human genome; and politically, scurrilous accusations were leveled against the HGDP by those who saw it as a project to categorize people exploitatively, much as racism does.  This was grotesquely false and opportunistically culpable on the part of jealous or ignorant critics and scandal-thirsty journalists.  However, the stir provided NIH with a safe excuse not to fund the HGDP.  Instead of a formal, globally systematic project, Luca used the heterogeneous blood-group and protein variation data already collected over many years by various investigators around the world to show global patterns of human gene frequencies.  His tome (#2 above) presented these data, much of which are still available.  Luca and colleagues developed methods for analyzing the patterns in relation to historical or prehistorical (assumed) human demographic behavior.

As science history often goes, this approach was soon pre-empted by DNA sequencing technology, and individual genome sequences supplanted protein- and antibody-based allele frequencies as the primary data for studying human variation and evolution (Ken Kidd at Yale, a lifelong colleague of Luca's and an organizer of the HGDP effort, has maintained a very useful site for allele frequency and other data).  In this sense, historically, Luca's Big Volume was the last word on the earlier technology.  But of course similar attempts--to reconstruct not just history itself but to integrate it with other aspects of human existence--are actively being pursued by many people, and can now be extended far back in time thanks to the ability to extract DNA from fossils.  Nonetheless, in terms of the history of anatomically modern humans, the basic outlines in Luca et al.'s book, based on sample allele frequencies, I think still generally hold.

Ephemerality's children
I'm not sure how long Luca will be remembered.  What I write here is a paean to a wonderful person and terrific scientist.  But there is no single 'discovery', nor a Cavalli-Sforza 'theorem' or the like, that will be, by being named for him, his lasting legacy.  He was not a grandstander, didn't play to the media, and his students and colleagues are now very senior.  The present formula in science has little interest in crediting the past (it's not good for careerism), and that generally also means not reading its lore, either.  As happens in science, one technology supplants prior ones, and DNA sequence and other 'Big Data' and 'omics clearly and rightly have co-opted the less informative data types of Luca's era.  In some sense this does vitiate earlier methods as well as data.  The new data have also enabled publication that is very technical but, in part because of the public glamor of technology, less closely tied to deeper, integrative thought.  Nor was Luca the first to use population genetic data to look at human local or global history--studies of blood group variation, for example, well antedate his work, even if they were generally cruder in method and detail.

In that sense, Luca helped set the stage, but his timing was all wrong.  Ah, well.  Had he been starting now, with the technologies and database resources currently available, and with the much more direct data of DNA sequence (and other 'omics) rather than allele frequencies from population samples, he would perhaps have made a more durable mark--and, I suggest, a deeper one than was possible with the earlier data to which history limited his attention, and not in the hasty rush to print that is now so prevalent.

I fear these may be the hard realities of history.  But thoughtfulness, intelligence and, not least, personal grace only come along sporadically.

Cavalli was a gem of his time.

(This post has been edited to make minor typographical corrections)

Sunday, May 6, 2018

"All of Us": Who are 'us'?

So the slogan du jour, All of Us, is the name of a 1.4-billion-dollar initiative being launched today by NIH Director Francis Collins.  The plan is to enroll one million volunteers in this mega-effort, the goal of which is, well, it depends.  It is either to learn how to prevent and treat "several common diseases" or, according to Dr Collins, who talked about the initiative here, "It's gonna give us the information we currently lack" to "allow us to understand all of those things we don't know that will lead to better health care." He's very enthusiastic about All of Us (aka Precision Medicine), calling it a "national adventure that's going to transform medical care."  This might be viewed in the context of promises in the late 1990s that by now we'd basically have solved these problems--rather than needing ever-bigger, longer-term 'data'.

And one can ask how data quality can possibly be maintained if the medical records of whoever volunteers vary in quality, verifiability, and so on.  But that is a technical issue.  There are sociological and ontological issues as well.

All of Us?
Serving 'all of us' sounds very noble and representative.  But let's see how sincere this publicly hyped promise really is.  Using very rough figures, which will serve the point, there are 320 million Americans, so 1 million volunteers would be about 0.3% of 'all' of us.  First we might ask: what about achieving some semblance of real inclusive fairness in our society, by making a special effort to oversample African Americans, Hispanics, and Native Americans before the privileged, mainly white, middle class get their names on the rolls?  That might make up for past abuses affecting their health and well-being.

So, OK, let's stop dreaming but at least make the sample representative of the country, white and otherwise.  Does that imply fairness?  There are, for example, about 300,000 Navajo Native Americans in the country.  If All Of Us means what it promises, there would be about 950 Navajos in the sample.  And about 56 Hopi tribespeople.  And there are, of course, many other ethnic groups that would have to be included.  Random (proportionate) sampling would include about 600,000 'white' people in the sample.
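The rough arithmetic behind these counts can be checked in a few lines.  The sampling fraction and the Navajo figure come straight from the post's numbers; the Hopi and 'white' population totals are back-of-envelope assumptions, chosen only to be consistent with the ~56 and ~600,000 figures above.

```python
# Proportional-sampling arithmetic behind the post's rough figures.
# The Hopi (~18,000) and 'white' (~192 million) totals are assumptions,
# back-calculated to match the post's ~56 and ~600,000 sample counts.
US_POPULATION = 320_000_000
SAMPLE_SIZE = 1_000_000

def expected_count(subpopulation: int) -> float:
    """Expected volunteers from a subgroup under purely proportionate sampling."""
    return SAMPLE_SIZE * subpopulation / US_POPULATION

print(f"sampling fraction: {SAMPLE_SIZE / US_POPULATION:.2%}")  # ~0.31% of 'all of us'
print(f"Navajo: {expected_count(300_000):.0f}")                 # ~938, the post's 'about 950'
print(f"Hopi: {expected_count(18_000):.0f}")                    # ~56
print(f"'white': {expected_count(192_000_000):,.0f}")           # ~600,000
```

The point is not the exact counts but the orders of magnitude: under proportionate sampling, small tribes contribute tens of people, far too few for genomewide analysis.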

These are just crude subpopulation counts from superficial Google searching, but the point is that in no sense is the proposed self-selected sample of volunteers going to represent All of Us in anything resembling a fair distribution of medical benefits.  You can't get as much detailed genomewide (not to mention environmental) data from a few hundred sampled individuals as from hundreds of thousands.  To be fair and representative in that sense, the sample would have to be stratified in some way rather than volunteer-based.  It seems very unlikely that the volunteers will in any real sense be representative of the US rather than, say, of university and other privileged communities, major cities, and so on--even if not because of intentional bias but simply because those people are more likely to learn of All of Us and to participate.

Of course, defining what is fair and just is not easy.  For example, there are far more Anglo Americans than Navajo or Hopi, so the Anglos might expect to get most of the benefits.  But that isn't what All of Us seems to be promising.  To get adequate information from a small group, given the causal complexity we are trying to understand, that group should probably be heavily oversampled.  Even doing that would leave room for samples from the larger Anglo and African-American populations adequate for the kind of discovery we could anticipate from this sort of Big Data study of the causes of common disease.

More problems than sociology
That is the sociological problem of claiming representativeness of 'all' of us.  But of course there is a deeper problem that we've discussed many times, and that is the false implied promise of essentially blanket (miracle?) cures for common diseases.  In fact, we know very well that complex causation, of the common diseases that are the purported target of this initiative, involves tens to thousands of variable genome locations, not to mention the environmental ones that are beyond simple counting.  Further, and this is a serious, nontrivial point, we know that these sorts of contributing causes include genetic and environmental exposures in the sampled individuals' futures, and these cannot be predicted, even in principle.  These are the realities.

And, even if the project were truly representative of the US population demographically, as a sample of self-selected volunteers there remains the problem of representing diseases within the population subsets.  Presumably this is why they are focusing on "common diseases", but the sample will still have to be stratified by possible causal exposures (lifestyles, diets, etc.) and ethnicity, and then there will have to be enough controls to make case-control comparisons meaningful.  So, how many common diseases, and how will they be represented (males/females, early/late onset, related to what environmental lifestyles, etc.)?  One million volunteers will be neither representative nor a large enough sample once it has to be stratified for statistical analysis, especially if the sample also includes the ethnic diversity that the project promises.
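To see how quickly stratification eats a million volunteers, here is a toy calculation.  Every stratum count below is an assumption chosen for illustration, not anything from the actual All of Us design.

```python
# Toy stratification arithmetic -- all stratum counts are illustrative
# assumptions, not from the All of Us design.
volunteers = 1_000_000

strata = {
    "common diseases": 20,
    "sex": 2,
    "early vs late onset": 2,
    "ethnic groups": 10,
    "lifestyle/exposure classes": 5,
}

cells = 1
for levels in strata.values():
    cells *= levels

per_cell = volunteers // cells  # assumes, unrealistically, perfectly even allocation
print(cells, per_cell)          # 4,000 cells, only 250 volunteers in each
```

And that is before any case-control split within a cell, and with even allocation that a volunteer-based sample cannot deliver.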

And there's the epistemological problem that causation is too individualistic for this kind of hypothesis-free data fishing to solve--indeed, it is just this kind of research that has shown us clearly that it is not what we need now.  We need research focused on problems that really are 'genetic', and some movement of resources toward new thinking, rather than perpetuating the same kind of open-ended 'Big Data' investment.

And more
In this context, the PR seems mostly to be spin for more money for NIH and its welfare clients (euphemistically called 'universities').  Every lock on Big Money for the Big Data lobby, or perhaps belief-system, excludes funding for focused research, for example, on diseases that would seem to be tractably understood by real science rather than a massive hypothesis-free fishing expedition.

How could the 1.4 billion dollars be better spent?  A legitimate goal might be to do a trial run of a linked electronic records system as part of an explicit move towards what we really need, and what would really include all of us: a real national healthcare system.  This could be openly explained--we're going to learn how to run such a comprehensive system so we don't get overwhelmed with mistakes.  But then, for the very same reason, a properly representative project is what should be done.  That would involve stratified sampling and a more properly thought-out design.  But that would require new thinking about the actual biology.

Thursday, October 13, 2016

Genomic causation....or not

By Ken Weiss and Anne Buchanan

The Big Story in the latest Nature ("A radical revision of human genetics: Why many ‘deadly’ gene mutations are turning out to be harmless," by Erika Check Hayden) is that genes thought to be clearly causal of important diseases aren't always (the link is to the People magazine-like cover article in that issue).  This is a follow-up on an August Nature paper describing the database from which the results discussed in this week's Nature are drawn.  The apparent mismatch between a gene variant and a trait can be, according to the paper, the result of technical error, of a mis-call by a given piece of software, or of the assumption that identifying a given mutation in affected but not healthy individuals means the causal mutation has been found, without experimentally confirming the finding--which itself can be tricky for reasons we'll discuss.  Insufficient documentation of 'normal' sequence variation has meant that the frequencies of so-called causal mutations haven't been available for comparative purposes.  Again, we'll mention below what 'insufficient' might mean, if anything.

People in general, and researchers in particular, need to be more than dismissively aware of these issues.  But the conclusion implied by the discussion--that we still need to focus on single genes as causal of most disease, that is, do MuchMoreOfTheSame--is not so obviously justified.  We'll begin with our usual contrarian statement: the idea here is being overhyped as if it were new, but except for its details it clearly is not, for reasons we'll also explain.  That matters because presenting it as a major finding, and still framing single genes as either truly causal or mistakenly identified, ignores what we think the deeper message needs to be.

The data come from a mega-project known as ExAC, a consortium of researchers sharing DNA sequences to document genetic variation and further understand disease causation, now including data from approximately 60,000 individuals (in itself rather small relative to the stated purpose).  The data are primarily exome sequences, that is, from protein-coding regions of the human genome, not whole genome sequences--again a major issue.  We have no reason at all to critique the original paper itself, which is large, sophisticated, and carefully analyzed as far as we can tell; but the claims about its novelty are, we think, very much hyperbolized, and that needs to be explained.

Some of the obvious complicating issues
We know that a gene generally does not act alone.  DNA in itself is basically inert.  We've been, and continue to be, misled by examples of gene causation in which context and interactions don't really matter much, and that leads us to cling to these as though they were the rule.  This reinforces the yearning for causal simplicity and tractability.  Even this ExAC story, or its public announcements, doesn't properly acknowledge causal context and complexity, because it critiques some simplistic single-gene inferences while assuming that the problems are methodological rather than conceptual.

There are many aspects of causal context that complicate the picture--none of them new, and we're not making them up--but which the Bigger-than-Ever Data pleas don't address:
1.  Current data are from blood-samples and that may not reflect the true constitutive genome because of early somatic mutation, and this will vary among study subjects,
2.  Life-long exposure to local somatic mutation is not considered nor measured, 
3.  Epigenetic changes, especially local tissue-specific ones, are not included, 
4.  Environmental factors are not considered, and indeed would be hard to consider,
5.  Non-Europeans, and even many Europeans, are barely included, if at all, though this is beginning to be addressed, 
6.  Regulatory variation, which GWAS has convincingly shown is much more important to most traits than coding variation, is not included.  Exome data have been treated naively by many investigators as if they were what is important, and exome-only data have been used as a major excuse for Great Big Grants that can't find what we know is probably far more important, 
7.  Non-coding regions, non-regulatory RNA regions are not included in exome-only data,
8.  A mutation may be causal in one context but not in others, in one family or population and not others, rendering the determination that it's a false discovery difficult,
9.  Single-gene analysis is still the basis of the new 'revelations'; that is, the idea hinted at is that the 'causal' gene isn't really causal....but one implicit notion is that it was misidentified, which is perhaps sometimes true but probably not always so,
10.  The new reports are presented in the news, at least, as if the gene is being exonerated of its putative ill effects.  But that may not be the case, because if the regulatory regions near the mutated gene have little or no activity, the 'bad' gene may simply not be being expressed.  Its coding sequence could falsely be assumed to be harmless, 
11. Many aspects of this kind of work are dependent on statistical assumptions and subjective cutoff values, a problem recently being openly recognized, 
12.  Bigger studies introduce all sorts of statistical 'noise', which can make something appear causal or can weaken its actual apparent cause.  Phenotypes can be measured in many ways, but we know very well that this can be changeable and subjective (and phenotypes are not very detailed in the initial ExAC database), 
13.  Early reports of strong genetic findings have a well-known upward bias in effect size, the winner's curse, that later work fails to confirm.
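Point 13 above, the upward bias in early effect-size reports, is easy to demonstrate with a toy simulation (all the numbers here are invented for illustration): among many noisy estimates of the same small true effect, the ones large enough to clear a 'discovery' threshold systematically overstate it.

```python
import random

random.seed(1)  # reproducible toy example

TRUE_EFFECT = 0.1   # small true effect (illustrative units)
SE = 0.25           # sampling noise on each study's estimate
THRESHOLD = 0.5     # only estimates this big get reported as 'discoveries'

estimates = [random.gauss(TRUE_EFFECT, SE) for _ in range(100_000)]
discovered = [e for e in estimates if e > THRESHOLD]

mean_all = sum(estimates) / len(estimates)            # close to the true 0.1
mean_discovered = sum(discovered) / len(discovered)   # several-fold inflated
print(f"all: {mean_all:.3f}, 'discovered': {mean_discovered:.3f}")
```

Selecting on significance is itself a biasing filter; later replication, free of that filter, drifts back toward the true (small) effect, which is exactly the "fails to confirm" pattern.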

Well, yes, we're always critical, but this new finding isn't really a surprise
To some readers we are too often critical, and at least some of us have to confess to a contrarian nature.  But here is why we say that these new findings, like so many that are by the grocery checkout in Nature, Science, and People magazines, while seemingly quite true, should not be treated as a surprise or a threat to what we've already known--nor a justification of just doing more, or much more of the same.

Gregor Mendel studied fully penetrant (deterministic) causation.  He studied what we now know to be 'genes', in which the presence of the causal allele (in 2-allele systems) always caused the trait (green vs yellow peas, etc.; the same holds for recessive as for dominant traits, given the appropriate genotype).  But this is generally wrong, save for exceptions such as those that Mendel himself knowingly and carefully chose to study.  And even his results were not so clear!  Mendel has been accused of 'cheating' by ignoring inconsistent results.  This may have been data fudging, but it is at least as likely to have been a reaction to what we have known for a century as 'incomplete penetrance'.  (Ken wrote on this a number of years ago in one of his Evolutionary Anthropology columns.)  For whatever reason--and see below--the presence of a 'dominant' allele, or 'recessive' homozygosity at a 'causal' gene, doesn't always lead to the trait.

Throughout most of the 20th century the probabilistic nature of real-world, as opposed to textbook, Mendelism was widely known and accepted.  The reasons for incomplete penetrance were not known, and indeed we had, as a rule, no way to know them.  Various explanations were offered, but statistical inference (estimates of penetrance probability, for example) was common practice and a textbook standard.  Even the original authors acknowledge incomplete penetrance, and this essentially shows that what the ExAC consortium is reporting adds details but nothing fundamentally new or surprising.  Clinicians or investigators acting as if a variant were always causal should be blamed for gross oversimplification, and so should hyperbolic news media.

Recent advances such as genomewide association studies (GWAS), in various forms, have used stringent statistical criteria to minimize false discovery.  As a result, the mapped 'hits' that satisfied those criteria account for only a fraction of estimated overall genomic causation.  This was legitimate in that it didn't leave us swamped with hundreds of very weak or very rare false-positive genome locations.  But even the acceptable, statistically safest genome sites showed typically small individual effects and risks far below 1.0.  They were not 'dominant' in the usual sense.  That means that people with the 'causal' allele don't always, and in fact usually do not, have the trait.  This has been the finding for quantitative traits like stature and qualitative ones like presence of diabetes, heart attack-related events, psychiatric disorders, and essentially all traits studied by GWAS.  It is not exactly what the ExAC data were looking at, but it is highly relevant, and it is the basic biological principle at stake.
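The arithmetic of 'risks far below 1.0' is worth making concrete.  With illustrative numbers (assumed here, not drawn from any particular GWAS hit), a typical mapped variant might carry a relative risk of 1.2 for a disease with a 10% lifetime risk:

```python
# Illustrative, assumed numbers -- not from any specific GWAS hit.
baseline_risk = 0.10   # lifetime risk for non-carriers
relative_risk = 1.2    # a typical modest GWAS effect size

carrier_risk = baseline_risk * relative_risk
print(f"carrier lifetime risk: {carrier_risk:.0%}")  # 12%
# 88% of carriers never develop the trait: nothing like the
# deterministic 'dominant' allele of textbook Mendelism.
```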

This does not necessarily mean that the target gene is not important for the disease trait, which seems to be one of the inferences headlined in the news splashes.  This is treated as a striking or even fundamental new finding, but it is nothing of the sort.  Indeed, the genes in question may not be falsely identified at all: they may very well contribute to risk in some people, under some conditions, at some ages, and in some environments.  The ExAC results don't really address this, because to determine when a gene variant is a risk variant one would have to identify all the causes of 'incomplete penetrance' in every sample--and there are multiple explanations for incomplete penetrance, including the list of 1-13 above as well as methodological issues such as those pointed out by the ExAC project paper itself.

In addition, there may be 'protective' variants in the other regions of the genome (that is, the trait may need the contribution of many different genome regions), and working that out would typically involve "hyper astronomical" combinations of effects using unachievable, not to mention uninterpretable, sample sizes--from which one would have to estimate risk effects of almost uncountable numbers of sequence variants.  If there were, say, 100 other contributing genes, each with their own variant genotypes including regulatory variants, the number of combinations of backgrounds one would have to sort through to see how they affected the 'falsely' identified gene is effectively uncountable.
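The "hyper astronomical" claim is easy to quantify.  Taking the post's hypothetical 100 contributing genes and, as a simplifying assumption, just three genotypes per gene (ignoring regulatory variants entirely):

```python
# Simplifying assumptions: 100 contributing genes, each diallelic, so
# three genotypes (AA, Aa, aa) per gene; regulatory variants ignored.
genes = 100
genotypes_per_gene = 3

combinations = genotypes_per_gene ** genes
print(f"{combinations:.3e} possible genomic backgrounds")  # ~5.154e+47
```

A number around 10^47 dwarfs any conceivable sample: there are fewer than 10^10 people alive, so almost every background in that space will never be observed even once.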

Even the most clearly causal genes, such as BRCA1 variants in breast cancer, show penetrance far less than 1.0 in recent data (here referring to lifetime risk; risk at earlier ages is very far from 1.0).  The risk, though clearly serious, depends on cohort, environmental, and other, mainly unknown, factors.  Nobody doubts the role of BRCA1, but it is not in itself causal.  For example, it appears to be a mutation-repair gene, but if no (or not enough) cancer-related mutations arise in the breast cells of a woman carrying a high-risk BRCA1 allele, she will not get breast cancer as a result of that gene's malfunction.

There are many other examples of mapping that identified genes that, even if strongly and truly associated with a test trait, have very far from complete penetrance.  A mutation in HFE and hemochromatosis comes to mind: in studies of some Europeans, a particular mutation seemed always to be present, but when the gene itself was tested in a general database, rather than just in affected people, it had little or no causal effect.  This seems to be the sort of thing the ExAC report is finding.

The generic reason is again that genes, essentially all genes, work only in context.  That context includes 'environment', which refers to all the other genes and cells in the body and the external or 'lifestyle' factors, as well as age and sex.  There is no obvious way to identify, evaluate, or measure the effects of all possibly relevant lifestyle factors, and since these change, retrospective evaluation has unknown bearing on future risk (the same can be said of genomic variants, for the same reason).  How could these even be sampled adequately?

Likewise, volumes of long-existing experimental and highly focused results tell the same tale.  Transgenic mice, for example, in which the same mutation is introduced into their 'same' gene as in humans, very often show little, no, or only strain-specific effects.  This is true in other experimental organisms.  The lesson, and it is far from a new or recent one, is that genomic context is vitally important: it is the person-specific genomic background of a target gene that affects the latter's effect strength--and vice versa; the same is true for each of those other genes.  That is why we have so long noted the legerdemain being foisted on the research and public communities by the advocates of Big Data statistical testing.  Certainly methodological errors are also a problem, as the Nature piece describes, but they aren't the only problem.

So if someone reports some cases of a trait that seem too often to involve a given gene, such as the Nature piece seems generally to be about, but searches of unaffected people also occasionally find the same mutations in such genes (especially when only exomes are considered), then we are told that this is a surprise.  It is, to be sure, important to know, but it is just as important to know that essentially the same information has long been available to us in many forms.  It is not a surprise--even if it doesn't tell us where to go in search of genetic, much less genomic, causation.

Sorry, though it's important knowledge, it's not 'radical' nor dependent on these data!
The idea being suggested is that (surprise, surprise!) we need much more data to make this point or to find these surprisingly harmless mutations.  That is simply a misleading assertion, or attempted justification, though it has become the intentional industry standard closing argument.

It is of course very possible that we're missing some aspects of the studies and interpretations being touted, but we don't think that changes the basic points being made here.  They're consistent with the new findings, but they show that for many very good reasons this is what we knew was generally the case: 'Mendelian' traits were the exception, and they led to a century of genetic discovery only because they focused attention on what was then doable (while, in parallel and not widely recognized by human geneticists, agricultural genetics of polygenic traits showed what was more typical).

But now, if things are being recognized as contextual much more deeply than in Francis Collins's money-strategy-based Big Data dreams, or 'precision' promises, and our inferential (statistical) criteria are properly under siege, we'll repeat our oft-stated mantra: a deeply different, reformed understanding is needed, and a turn to research investment focused on basic science rather than exhaustive surveys, and on those many traits whose causal basis really is strong enough that it doesn't require this deeper knowledge.  In a sense, if you need massive data to find an effect, then that effect is usually very rare and/or very weak.

And by the way, the same must be true for normal traits, like stature, intelligence, and so on, for which we're besieged with genome-mapping assertions, and this must also apply to ideas about gene-specific responses to natural selection in evolution.  Responses to environment (diet etc.) manifestly have the same problem.  It is not just a strange finding of exome mapping studies of disease.  Likewise, 'normal' study subjects, now being asked for in huge numbers, may get the target trait later in their lives, except for traits basically present early in life.  One can't doubt that misattributing the cause of such traits is an important problem, but we need to think of better solutions than Big Big Data, because failing to confirm a gene doesn't help, and finding that 'the' gene is only 'the' gene in some genomic or environmental backgrounds is the proverbial, historically frustrating needle-in-the-haystack search.  So the story's advocated huge samples of 'normals' (random individuals) cannot really address the causal issue definitively (except to show what we already know: that there's a big problem to be solved).  Selected family data may--may--help identify a gene that really is causal, but even they have some of the same sorts of problems.  And the finding may apply only to that family.

The ExAC study is focused on severe diseases, which is somewhat like Mendel's selective approach, since it is quite obvious that complex diseases are complex.  It is plausible that severe, especially early-onset, diseases are genetically tractable, but it is not obvious that ever more data will answer the challenge.  And, ironically, the ExAC study has removed just such diseases from its consideration!  So they're intentionally showing what is well known: that we're in needle-in-haystack territory, even when someone has reported big needles.

Finally, we have to add that these points have been made by various authors for many years, often based on principles that did not require mega-studies to show.  Put another way, we had reason to expect what we're seeing, and years of studies supported that expectation.  This doesn't even consider the deep problems about statistical inference that are being widely noted, or the deeply entrenched nature of that approach's conceptual and even material vested interests (see this week's Aeon essay, e.g.).  It's time to change, but doing so would involve deeply revising how resources are used--of course one of our common themes here on the MT--and that is a matter almost entirely of political economy, not science.  That is, it's as much about feeding the science industry as it is about medicine and public health.  And that is why it's mainly about business as usual rather than real reform.

Monday, August 1, 2016

FAS - Fishy Association Studies

           
                                  On Saturday, July 19, 1879, the brilliant opera 
                                  composer, Richard Wagner, "had a bad night; 
                                  he thinks that...he ate too much trout."  
                                             Quoted from Cosima Wagner's Diary, Vol. II, 1878-83.

As I was reading Cosima Wagner's doting diary of life with her famous husband, I chanced across the above quote that seemed an appropriate, if snarky, way to frame today's post. The incident she related exemplifies how we routinely assign causation even to one-off events in daily life. Science, on the other hand, purports to be about causation of a deeper sort, with some sufficient form of regularity or replicability.

Cause and effect can be elusive concepts, especially difficult to winnow out from observations in the complex living world.  We've hammered on about this on MT over the years.  The best science at least tries to collect adequate evidence in order to infer causation in credible rather than casual ways. There are, for example, likely to be lots of reasons, other than eating trout, that could explain why a cranky genius like Wagner had a bad night.  It is all too easy to over-interpret associations in causal terms.

By such thinking, the above figures (from Wikimedia commons) might be interpreted as having the following predictive power:
     One fish = bad night
     Two fish = total insomnia
     Many fish = hours of nightmarish dissonance called Tristan und Isolde!

Too often, we salivate over GWAS (genomewide association studies) results as if they justify ever-bigger and longer studies.  But equally too often, these are FAS, fishy association studies.  That is what we get when the science community doesn't pay heed to the serious and often fundamental difficulties in determining causation that may well undermine their findings and the advice so blithely proffered to the public.

We are not the only ones who have been writing that the current enumerative, 'Big Data', approach to biomedical and even behavior genetic causation leaves, to say the least, much to be desired.  Among other issues, there's too much asserting conclusions on inadequate evidence, and not enough recognition of when assertions are effectively not that much more robust than saying one 'ate too much trout'.  Weak statistical associations, so typically the result of these association studies, are not the same as demonstrations of causation.

The idea of mapping complex traits by huge genomewide case-control or population sample studies is a captivating one for biomedical researchers.  It's mechanical, perfectly designed to be done by huge computer database analysis by people who may never have seen the inside of a wet lab (e.g., programmers and 'informatics' or statistical specialists who have little serious critical understanding of the underlying biology).  It's often largely thought-free, because that makes the results safe to publish, safe for getting more grants, and so on; but more than being 'captivating' it is 'capturing'.... a hog-trough's share of research resources.

The promise, not even always carefully hedged with escape-words lest it be shown to be wrong, is that from your genome your future biomedical (and behavioral) traits can be known.  A recent article by Joyner et al. in the July 28 issue of the Journal of the American Medical Association (JAMA) describes the stubborn persistence of under-performing but costly research that becomes entrenched, a perpetuation that NIH's misnamed 'precision based genomic medicine' continues or even expands upon.  Below is our riff on the article, but it's open-access, so you can read the points they make and judge for yourself whether we have the right 'take' on what they say.  It is one of many articles that have been making similar points....in case anyone is listening.

The problem is complex causation
The underlying basic problem is the complex nature of causation of 'complex' traits, like many if not most behavioral or chronic or late-onset diseases.  The word complex, long used for such traits, refers not to identified causes but to the fact that the outcomes clearly did not have simple, identifiable causes.  It seemed clear that their causation was due mainly to countless combinations of many individually small causal factors, some of which were inherited; but the specifics were usually unknown.  Computer and various DNA technologies made it possible, in principle, to identify and sort through huge numbers of possible causes, or at least statistically associated factors, including DNA sequence variants.  But underlying this approach has been the idea, always a myth really, that identifying some enumerated set of causes in a statistical sample would allow accurate prediction of outcomes.  This has proven not to be the case nearly as generally as has been promised.

To me, the push to do large-scale, huge-sample, survey-based genomewide risk analysis was at least partly justified, in principle, years ago, when there might have been some doubt about the nature of the causal biology underlying complex traits, including the increasingly common chronic disease problems that our aging population faces.  But the results are in, and in fact have been in for quite a long time.  Moreover, to the credit of the validity of the science, the results support what we had good reason to know for a long time.  They show that this approach is not, or at least is clearly no longer, the optimal way to do science in this area or to contribute to improving public health (and much the same applies to evolutionary biology as well).

I think it fair to say that I was making these points, in print, in prominent places, starting nearly 30 years ago, in books and journal articles (and more recently here on MT), that is, ever since the relevant data were beginning to appear.  But neither I nor my collaborators were the original discoverers of this insight: the basic truth has been known, in principle and in many empirical experimental (such as agricultural breeding) and observational contexts, for nearly a century!  Struggling with the inheritance of causal elements ('genes' as they were generically known), the 1930s' 'modern synthesis' of evolutionary biology reconciled (1) Darwin's idea of gradual evolution, mainly of quantitative traits, with the experimental evidence of the quantitative nature of their inheritance, and (2) the discrete nature of inheritance of discrete causal elements first systematically demonstrated by Mendel for selected 2-state traits.  That was a powerful understanding, but in too many ways it has thoughtlessly been taken to imply that all traits, not just genes, are usefully 'Mendelian', that is, due to substantial, enumerable, strongly causal genetic agents.  That has always been the exception, not the rule.

A view is possible that is not wholly cynical 
We have been outspoken about the sociocultural aspect of modern research, which can be understood by what one might call the FTM (Follow the Money) approach, in some ways a better way to understand where we are than looking at the science itself.  Who has what to gain by the current approaches?  Our understanding is aided by realizing that the science is presented to us by scientists and journalists, supplier industries and bureaucrats, who have vested interests that are served by promoting that way of doing business.

FTM isn't the only useful perspective, however.  A less cynical, yet still appropriate, way to look at this is in terms of diminishing returns.  The investment in the current way of doing science in this (and other) areas is part of our culture.  From a scientific point of view, the first forays into a new approach, or a theoretical idea, yield quick and, by definition, new results.  Eventually the approach becomes routine and the per-study yield diminishes.  We asymptotically approach what we can glean from it.  Eventually some chance insight will yield better and more powerful approaches, whatever they'll be.

If current approaches were just yielding low-cost incremental gain, or were being done in well-off investigators' basement labs, this would be a normal course of scientific history, and nobody would have reason to complain.  But that isn't how it works these days.  These days understanding via FTM is important: the science establishment's hands are in all our pockets, and we should expect more in return than the satisfaction that the trough has been feeding many very nice careers (including mine), in universities, journalism, and so on.  How, when, and where a properly increased expectation of societal benefit from science will be fulfilled is not predictable, because facts are elusive and Nature often opaque.  But simply more of the same, at its current cost, with continuing entrenched justification, isn't the best way to use public resources.

There will always be a place for 'big data' resources.  A unified system of online biomedical records would save a lot of excess repeat-testing and other clinical costs, if every doctor you consult could access those records.  The records could potentially be used for research purposes, to the (limited) extent that they could be informative.  For a variety of conditions that would be very useful and cost-effective indeed; but most of those would be relatively rare.

Continuing to pour research funds into the idea that ever more 'data' will lead to dramatic improvements of 'precision' medicine is far more about the health of entrenched university labs and investigators than that of the general citizenry. Focused laboratory work that is more rigorously supported by theory or definitive experiment, with some accountability (but no expectations nor promises of miracles) is in order, given what the GWAS etc. era, plus a century of evolutionary genetics, has shown. There are countless areas, especially many serious early onset diseases, for which we have a focused, persuasive, meaningful understanding of causation and where resources should now be invested more heavily.

Intentionally open-ended beetle-collecting ventures, joined at the hip to promises of 'precision' by those who don't even know what that word means (but who hint that it means 'perfection'), or glorifying the occasional seriously good finding as if it were typical, or as though more focused, less open-ended research wouldn't be a better investment, is not a legitimate approach.  Yet that is largely what is going on today.  The scientists, at least the smart ones, know this very well and say so (in confidence, of course).

Understanding complex causation is complex, and we have to face up to that.  We can't demand inexpensive or instant or even predictable answers.  These are inconvenient facts few want to face up to.  But we and others have said this ad nauseam before, so here we wanted to point out the current JAMA paper as yet another formal and prominently published realization of the costly inertia in which we are embedded, and by highly capable authors. In any aspect of society, not just science, prying resources loose from the hands of a small elite is never easy, even when there are other ways to use those resources that might have better payoff for all of us.

Usually, such resource reallocation seems to require some major new and imminent external threat, or some unpredicted discovery, which I think is far more likely to come from some smaller operation where thinking was more important than cranking out yet another mass-scale statistical survey of Big Data sausage.  Still, every push against wasteful inertia, like the Joyner et al. JAMA paper, helps.  Indeed, the many whose careers are entrapped by that part of the System have the skills and neuronal power to do something better if circumstances enabled it to happen more readily.  To encourage that, perhaps we should stop paying so much attention to Fishy stories.

Wednesday, March 16, 2016

The statistics of Promissory Science. Part I: Making non-sense with statistical methods

Statistics is a form of mathematics, a way devised by humans for representing abstract relationships. Mathematics comprises axiomatic systems, which make assumptions about basic units such as numbers; basic relationships like adding and subtracting; and rules of inference (deductive logic); and then elaborates these to draw conclusions that are typically too intricate to reason out in other less formal ways.  Mathematics is an awesomely powerful way of doing this abstract mental reasoning, but when applied to the real world it is only as true or precise as the correspondence between its assumptions and real-world entities or relationships. When that correspondence is high, mathematics is very precise indeed, a strong testament to the true orderliness of Nature.  But when the correspondence is not good, mathematical applications verge on fiction, and this occurs in many important applied areas of probability and statistics.

You can't drive without a license, but anyone with R or SAS can be a push-button scientist.  Anybody with a keyboard and some survey-generating software can monkey around with asking people a bunch of questions and then 'analyze' the results.  You can construct a complex, long, intricate, jargon-dense, expansive survey.  You then choose whom to subject to the survey--your 'sample'.  You can grace the results with the term 'data', implying true representation of the world, and be off and running.  Sample and survey designers may be intelligent, skilled, well-trained in survey design, and of wholly noble intent.  There's only one little problem: if the empirical fit is poor, much of what you do will be non-sense (and some of it nonsense).

Population sciences, including biomedical, evolutionary, social, and political fields, are experiencing an increasingly widely recognized crisis of credibility.  The fault is not in the statistical methods on which these fields heavily depend, but in the degree of fit (or not) to their assumptions--with the emphasis these days on the 'or not', and an all-too-frequent dismissal of the underlying issues in favor of a patina of technical, formalized results.  Every capable statistician knows this, but of course might be out of business if it were paid open attention.  And many statisticians may be too uninterested in, or too foggy about, the philosophy of science to see what lies beyond the methodological technicalities.  Jobs and journals depend on not being too self-critical.  And therein lie rather serious problems.

Promissory science
There is the problem of the problems--the problems we want to solve, such as in understanding the cause of disease so that we can do something about it.  When causal factors fit the assumptions, statistical or survey study methods work very well.  But when causation is far from fitting the assumptions, the impulse of the professional community seems mainly to increase the size, scale, cost, and duration of studies, rather than to slow down and rethink the question itself.  There may be plenty of careful attention paid to refining statistical design, but basically this stays safely within the boundaries of current methods and beliefs, and the need for research continuity.  It may be very understandable, because one can't just quickly uproot everything or order up deep new insights.  But it may be viewed as abuse of public trust as well as of the science itself.

The BBC Radio 4 program called More Or Less keeps a watchful eye on sociopolitical and scientific statistical claims, revealing what is really known (or not) about them.  Here is a recent installment on the efficacy (or believability, or neither) of dietary surveys.  And here is a FiveThirtyEight link to what was the basis of the podcast.

The promotion of statistical survey studies to assert fundamental discovery has been referred to as 'promissory science'.  We are barraged daily with promises that if we just invest in this or that Big Data study, we will put an end to all human ills.  It's a strategy, a tactic, and at least the top investigators are very well aware of it.  Big long-term studies are a way to secure reliable funding and to defer delivering on promises into the vague future.  The funding agencies, wanting to seem prudent and responsible to taxpayers with their resources, demand some 'societal impact' section on grant applications.  But there is in fact little if any accountability in this regard, so one can say they are essentially bureaucratic window-dressing exercises.

Promissory science is an old game, practiced since time immemorial by preachers.  It boils down to promising future bliss if you'll just pay up now.  We needn't be (totally) cynical about this.  When we set up a system that depends on public decisions about resources, we will get what we've got.  But having said that, let's take a look at what is a growing recognition of the problem, and some suggestions as to how to fix it--and whether even these are really the Emperor of promissory science dressed in less gaudy clothing.

A growing at least partial awareness
The problem of results that are announced by the media, journals, universities, and so on, but that don't deliver the advertised promises, is complex but widespread.  In part because research has become so costly, warning sirens are sounding when it becomes clear that the promised goods are not being delivered.

One widely known issue is the lack of reporting of negative results, or their burial in minor journals. Drug-testing research is notorious for this under-reporting.  It's too bad because a negative result on a well-designed test is legitimately valuable and informative.  A concern, besides corporate secretiveness, is that if the cost is high, taxpayers or share-holders may tire of funding yet more negative studies.  Among other efforts, including by NIH, there is a formal attempt called AllTrials to rectify the under-reporting of drug trials, and this does seem at least to be thriving and growing if incomplete and not enforceable.  But this non-reporting problem has been written about so much that we won't deal with it here.

Instead, there is a different sort of problem.  The American Statistical Association has recently noted an important issue, which is the use and (often) misuse of p-values to support claims of identified  causation (we've written several posts in the past about these issues; search on 'p-value' if you're interested, and the post by Jim Wood is especially pertinent).  FiveThirtyEight has a good discussion of the p-value statement.

The usual interpretation is that p represents the probability that, if there were in fact no causation by the test variable, its apparent effect would have arisen just by chance.  So if the observed p in a study is less than some arbitrary cutoff, such as 0.05, it means essentially that if no causation were involved, the chance of seeing an association at least this strong anyway is no greater than 5%; that is taken as some evidence for a causal connection.
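The flip side of that interpretation is easy to demonstrate: when there is truly nothing to find, about 5% of tests will still cross the 0.05 threshold.  Here is a minimal simulation sketch (a simple z-test on two groups drawn from the same distribution; the numbers and setup are purely illustrative):

```python
import math
import random

def null_significance_rate(n, trials, seed=0):
    """Simulate `trials` two-group comparisons where the null is TRUE
    (both groups drawn from the same normal distribution) and return
    the fraction of 'significant' results at p < 0.05."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(trials):
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        # z-test on the difference in means (variance known to be 1)
        mean_diff = sum(a) / n - sum(b) / n
        se = math.sqrt(2 / n)
        z = mean_diff / se
        # two-sided p-value from the normal distribution
        p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        if p < 0.05:
            significant += 1
    return significant / trials

rate = null_significance_rate(n=50, trials=2000)
print(rate)  # should come out close to 0.05: ~5% 'discoveries' with no real effect
```

Run enough studies on noise, report only the ones that cross the line, and a literature full of 'significant' results is exactly what you'd expect.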

Trashing p-values is becoming a new cottage industry!  Now JAMA is on the bandwagon, with an article showing, in a survey of biomedical literature from the past 25 years covering well over a million papers, that a far disproportionate and increasing number of studies reported statistically significant results.  Here is the study on the JAMA web page, though it is not yet freely available.

Besides the apparent reporting bias, the JAMA study found that those papers generally failed to flesh out that result adequately.  Where are all the negative studies that statistical principles would lead us to expect?  We don't see them, especially in the 'major' journals, as has been noted many times in recent years.  Just as importantly, authors often did not report confidence intervals or other measures of the degree of 'convincingness' that might illuminate the p-value.  In a sense that means authors didn't say what range of effects is consistent with the data.  They reported a non-random effect, but often didn't give the effect size--that is, say how large the effect was, even assuming the effect was unusual enough to support a causal explanation.  So, for example, a statistically significant increase of risk from 1% to 1.01% is trivial, even if one accepts all the assumptions of the sampling and analysis.
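The divorce between significance and importance is easy to show numerically.  A sketch using that same hypothetical example, a risk of 1% vs 1.01% (the sample sizes are made up purely to make the point; a large enough study can make any trivial difference 'significant'):

```python
import math

def z_for_proportions(p1, p2, n1, n2):
    """Two-proportion z-statistic: is the observed difference 'significant'?"""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# A trivial absolute difference in risk: 1.00% vs 1.01%
n = 1_000_000_000  # with enough data, anything becomes 'significant'
z = z_for_proportions(0.0100, 0.0101, n, n)
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(z, p)  # overwhelmingly 'significant', yet the effect is 0.01 percentage points
```

The p-value measures how surprising the data are under the null, not how much the effect matters; only the effect size and its confidence interval can tell you the latter.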

Another vocal critic of what's afoot is John Ioannidis; in a recent article he levels both barrels against the misuse and mis- or over-representation of statistical results in the biomedical sciences, including meta-analysis (the pooling of many diverse small studies into a single large analysis to gain sufficient statistical power to detect effects and test for their consistency).  The paper is a rant, but a well-deserved one, about how 'evidence-based' medicine has been 'hijacked', as he puts it.  The same must be said of 'precision genomic' or 'personalized' medicine, 'Big Data', and the other sorts of imitative sloganeering going on from many quarters who obviously see this sort of promissory science as what you have to do to get major funding.  We have set ourselves a professional trap, and it's hard to escape.  For example, the same author has been leading the charge against misrepresentative statistics for many years, and he and others have shown that the 'major' journals have, in a sense, the least reliable results in terms of replicability.  But he's been raising these points in the same journals he shows are culpable, rather than boycotting them.  We're in a trap!

These critiques of current statistical practice are the points getting most of the ink and e-ink.  There may be a lot of cover-ups of known issues, and even hypocrisy, in all of this, and perhaps more open or understandable tacit avoidance.  The industry (e.g., drug, statistics, and research equipment) has a vested interest in keeping the motor running.  Authors need to keep their careers on track.  And, in the fairest and non-political sense, the problems are severe.

But while these issues are real and must be openly addressed, I think the problems are much deeper. In a nutshell, I think they relate to the nature of mathematics relative to the real world, and the nature and importance of theory in science.  We'll discuss this tomorrow.

Wednesday, January 27, 2016

"The Blizzard of 2016" and predictability: Part III: When is a health prediction 'precise' enough?

We've discussed the use of data and models to predict the weather in the last few days (here and here).  We've lauded the successes, which are many, and noted the problems, including people not heeding advice.  Sometimes that's due, as a commenter on our first post in this series noted, to previous predictions that did not pan out, leading people to ignore predictions in the future.  Some weather forecasters, like all media these days, tend to exaggerate or dramatize things, a normal part of our society's way of getting attention (and resources).

We also noted the genuine challenges to prediction that meteorologists face.  Theirs is a science based on very sound physical principles and theory which, as a meteorologist friend put it, constrain what can and might happen, and make good forecasting possible.  In that sense the challenge for accuracy lies in the complexity of global weather dynamics and inevitably imperfect data, which may defy perfect analysis even by fast computers.  There are essentially random or unmeasured movements of molecules and so on, leading to 'chaotic' properties of weather, which is indeed the iconic example of chaos, known as the 'butterfly effect': if a butterfly flaps its wings, the initially tiny and unseen perturbation can proliferate through the atmosphere, leading to unpredicted, indeed wildly unpredictable, changes in what happens.
  
The Butterfly Effect, far-reaching effects of initial conditions; Wikipedia, source
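That sensitivity to initial conditions is easy to demonstrate numerically.  A minimal sketch, using the logistic map (a standard textbook toy model of chaos, not an atmospheric model) with two starting points differing in the sixth decimal place:

```python
def logistic_trajectory(x0, r=4.0, steps=50):
    """Iterate the logistic map x -> r*x*(1-x), chaotic at r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.400000)
b = logistic_trajectory(0.400001)  # a 'butterfly flap' in the sixth decimal
gap = [abs(x - y) for x, y in zip(a, b)]
print(gap[1], max(gap))  # microscopic at first; soon the trajectories bear no resemblance
```

The microscopic initial difference roughly doubles with each step, so within a few dozen iterations the two 'forecasts' are unrelated--which is why extra decimal places in the starting data buy only a few extra days of predictability.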

Reducing such effects is largely a matter of needing more data.  Radar and satellite data are more or less continuous, but many other key observations are only made many miles apart, both on the surface and into the air, so that meteorologists must try to connect them with smooth gradients, or estimates of change, between the observations.  Hence the limited number of future days (a few days to a week or so) for which forecasts are generally accurate.

Meteorologists' experience, given their resources, provides instructive parallels as well as differences with the biomedical sciences, which aim for precise prediction, often of things decades in the future, such as disease risk based on genotype at birth or lifestyle exposures.  We should pay attention to those parallels and differences.

When is the population average the best forecast?
Open physical systems, like the atmosphere, change but don't age.  Physical continuity means that today is a reflection of yesterday, but the atmosphere doesn't accumulate 'damage' the way people do, at least not in a way that makes a difference to weather prediction.  It can move, change, and refresh, with a continuing influx and loss of energy, evaporation and condensation, circulating movement, and so on.  By contrast, we are each on a one-way track, and a population constantly has to start over, with its continual influx of new births and loss to death.  In that sense, a given set of atmospheric conditions today has essentially the same future risk profile as such conditions had a year or century or millennium ago.  In a way, that is what it means to have a general atmospheric theory.  People aren't like that.

By far, most individual genetic and even environmental risk factors identified by recent Big Data studies only alter lifetime risk by a small fraction.  That is why the advice changes so frequently and inconsistently.  Shouldn't it be that eggs and coffee either are good or harmful for you?  Shouldn't a given genetic variant definitely either put you at high risk, or not? 

The answer is typically no, and the fault is in the reporting of data, not the data themselves. This is for several very good reasons.  There is measurement error.  From everything we know, the kinds of outcomes we are struggling to understand are affected by a very large number of separate causally relevant factors.  Each individual is exposed to a different set or level of those factors, which may be continually changing.  The impact of risk factors also changes cumulatively with exposure time--because we age.  And we are trying to make lifetime predictions, that is, ones of open-ended duration, often decades into the future.  We don't ask "Will I get cancer by Saturday?", but "Will I ever get cancer?"  That's a very different sort of question.

Each person is unique, like each storm, but we rarely have the kind of replicable sampling of the entire 'space' of potentially risk-affecting genetic variants--and we never will, because many genetic and even environmental factors are very rare and/or their combinations essentially unique; they interact, and they come and go.  More importantly, we simply do not have the kind of rigorous theoretical basis that meteorology does.  That means we may not even know what sort of data we need to collect to get a deeper understanding or more accurate predictive methods.

Because a multiplicity of risk factors contributes in unique combinations to a given outcome, the effect of each factor is generally very small, and even within individuals the mix is continually changing.  Lifetime risks for a trait are also necessarily averaged across all other traits--for example, all other competing causes of death or disease.  A fatal early heart attack is the best preventive against cancer!  There are exceptions of course, but generally, forecasts are weak to begin with, and over longer predictive time periods they will in many ways simply approximate the population--public health--average.  That is a kind of analogy with weather forecasts which, beyond a few days into the future, move towards the climate average.

Disease forecasts change peoples' behavior (we stop eating eggs or forego our morning coffee, say), each person doing so, or not, to his/her own extent.  That is, feedback from the forecast affects the very risk process itself, changing the risks themselves and in unknown ways.  By contrast, weather forecasts can change behavior as well (we bring our umbrella with us) but the change doesn't affect the weather itself.


Parisians in the rain with umbrellas, by Louis-Léopold Boilly (1803)

Of course, there are many genes in which variants have very strong effects.  For those, forecasts are not perfect but the details aren't worth worrying about: if there are treatments, you take them.  Many of these are due to single genes and the trait may be present at birth. The mechanism can be studied because the problem is focused.  As a rule we don't need Big Data to discover and deal with them.  

The epidemiological and biomedical problem is with attempts to forecast complex traits, in which nearly every instance is causally unique.  Well, every weather situation is unique in its details, too--but those details can all be related to a single unifying theory that is very precise in principle.  Again, that's what we don't yet have in biology, and there is no really sound scientific justification for collecting reams of new data, which may refine predictions somewhat but may not go much farther.  We need to develop a better theory, or perhaps even to ask whether there is such a formal basis to be had--or is the complexity we see just what there is?

Meteorology has ways to check its 'precision' within days, whereas the biomedical sciences have to wait decades for their rewards and punishments.  In the absence of tight rules and ways to adjust for errors, constraints on biomedical business-as-usual are weak.  We think a key reason for this is that we must rely not on externally applied theory, but on internal comparisons, like cases vs controls.  We can test for statistical differences in risk, but there is no reason these will be the same in other samples, or in the future.  Even when a gene or dietary factor is identified by such studies, its effects are usually not very strong, even if the mechanism by which it affects risk can be discovered.  We see this repeatedly, even for risk factors that seemed obvious.

We are constrained not just to use internal comparisons but to extrapolate the past to the future.  Our comparisons, say between cases and controls, are retrospective and almost wholly empirical rather than resting on adequate theory.  The 'precision' predictions we are being promised are basically just applications of those retrospective findings to the future.  This is typically little more than extrapolation, and because risk factors are complex and each person is unique, the extrapolation largely assumes additivity: we just add up the risk estimates for the various factors measured on existing samples, and use that sum as our estimate of future risk.
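What that additive assumption amounts to can be sketched in a few lines.  The weights and factor names below are entirely made up for illustration; the point is the structure of the calculation, not the numbers:

```python
import math

# Hypothetical per-factor log-odds weights, as if estimated from some
# past case-control sample (all values invented for illustration).
weights = {"variant_A": 0.10, "variant_B": 0.05, "smoker": 0.40}
baseline_log_odds = math.log(0.05 / 0.95)  # a 5% population baseline risk

def additive_risk(exposures):
    """A 'precision' forecast by pure additivity: sum the log-odds
    contributions of each measured factor and convert back to a
    probability.  Interactions, changing exposures over a lifetime,
    and unmeasured factors are all ignored."""
    log_odds = baseline_log_odds + sum(weights[e] for e in exposures)
    return 1 / (1 + math.exp(-log_odds))

print(additive_risk([]))                       # the baseline, 5%
print(additive_risk(["variant_A", "smoker"]))  # a modestly higher figure
```

Every assumption baked into such a score--that the past weights hold in the future, that factors don't interact, that your exposures stay fixed--is exactly what complex causation violates.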

Thus, while for meteorology Big Data makes sense because there is strong underlying theory, in many aspects of the biomedical and evolutionary sciences this is simply not the case, at least not yet.  Unlike meteorology, the biomedical and genetic sciences are really the harder ones!  We are arguably just as likely to progress in our understanding by accumulating results from carefully focused questions, where we're tracing some real causal signal (e.g., traits with specific, known strong risk factors), as by just feeding the incessant demands of the Big Data worldview.  But this of course is a point we've written (ranted?) about many times.

You bet your life, or at least your lifestyle!
If you venture out on the highway despite a forecast snowstorm, you are placing your life in your hands.  You are also imposing dangers on others (because accidents often involve multiple vehicles). In the case of disease, if you are led by scientists or the media to take their 'precision' predictions too seriously, you are doing something similar, though most likely mainly affecting yourself.  

Actually, that's not entirely true.  If you smoke or hog up on MegaBurgers, you certainly put yourself at risk, but you risk others, too.  That's because those instances of disease that truly are strongly, and even mappably, genetic (which seems true of subsets of even most 'complex' diseases) are masked by the majority of cases that are due to easily avoidable lifestyle factors; the causal 'noise' of risky lifestyles makes genetic causation harder to tease out.

Of course, taking minor risks too seriously also has known, potentially serious consequences, such as intervening on something that was weakly problematic to begin with.  Operating on a slow-growing prostate or colon cancer in older people may do more damage than the cancer itself would.  There are countless other examples.


Life as a Garden Party
The need is to understand weak predictability, and to learn to live with it. That's not easy.

I'm reminded of a time when I was a weather officer stationed at an Air Force fighter base in the eastern UK.  One summer, on a Tuesday morning, the base commander called me over to HQ.  It wasn't for the usual morning weather briefing.....

"Captain, I have a question for you," said the Colonel.

"Yes, sir?"

"My wife wants to hold a garden party on Saturday.  What will the weather be?"

"It might rain, sir," I replied.

The Colonel was not very pleased with my non-specific answer, but this was England, after all!

And if I do say so myself, I think that was the proper, and accurate, forecast.**


Plus ça change..  Rain drenches royal garden party, 2013; The Guardian


**(It did rain.  The wife was not happy! But I'd told the truth.)