Thursday, January 28, 2016

The delicious smell of eggs!

Paleontologists like to give names, often self-serving names, to new fossil specimens they unearth. In  part, they want to control the agenda, the species and hence evolutionary track they are revealing (for the first time, naturally!).  One naturally wants to be known as the person who discovered Hobjob Man (Homo hobjobensis).

Well, geneticists are people, too, with all the vanities that accompany that distinction.  They want to name their genes and show their insight.  That's why we have names like 'BRCA' for the 'breast cancer' genes, and countless other examples.  In fact, BRCA1 is, on current best understanding, a general-use, widely expressed gene whose coded protein is used to detect certain types of DNA mutations in the cell (mismatches, that is, non-complementarity between opposite nucleotides at the corresponding location on the two strands of the DNA molecule) and help fix them.  It is not the, or even a, gene 'for' breast cancer!  It received its name because mutations in the gene were discovered being transmitted among victims of breast cancer in large families. Once the gene was identified, risk associated with it could be documented without needing to track it in families.   Proper gene-naming should describe the chromosomal location or, where known, the normal function of a gene, not why or how it was discovered, and should not suggest that its purpose is to cause disease.  Even the discovery-based labeling is risky because genes often if not typically serve multiple functions.

Humorous names like 'sonic hedgehog' are not informative but at least not misleading. One interesting example concerns the 'olfactory receptor' or OR genes.  These genes code for a set of cell-surface receptor proteins, part of a larger family of such genes, that were found in the olfactory (odor-detecting) tissues, such as the lining of the nose in vertebrates like mice and humans.  There is a huge family of such genes, about 1000 in mammals, that have arisen over the eons by gene duplication (and deletion) events.  Our genomes have isolated OR genes and also many clusters of a few, or up to hundreds, of adjacent OR genes.  Because some duplicated copies were disabled by inaccurate DNA copying that chopped off parts of the gene, our genomes include both active and inactive (current and former) OR genes, and their numbers vary somewhat from person to person.

Big arrays of genes like these often are inaccurately duplicated when cells divide, including during the formation of sperm and egg cells.  The inaccuracy includes mutations that alter the protein coded by a given OR gene, and hence generate differences among the many OR genes.  This process, over the millennia, generates the huge number and variety of gene family members, of which the OR family is the largest.  In the case of ORs, the idea has been that, like the immune system, these genes enable us to discriminate among odors--a vital function for survival, finding mates, detecting enemies, and so on.  Because of their high level of sequence diversity, each OR gene's coded protein responds to (can detect) a different set of molecules that might pass through the airways.  This allows us to detect--and remember--specific odors, because the combination of responding ORs is unique to each odor.  Discovery of this clever way by which Nature allows us to discriminate what's in our environment earned Richard Axel and Linda Buck a Nobel Prize in 2004.
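The combinatorial idea can be made concrete with a toy sketch.  Everything in it is invented for illustration: the receptor names, the odorants, and which receptor responds to which molecule are assumptions, not data from any study.  It only shows how a handful of broadly tuned receptors can give each odor its own signature.

```python
# Toy illustration of combinatorial odor coding.
# All receptor names and response sets below are hypothetical.
RECEPTOR_RESPONSES = {
    "OR_A": {"vanillin", "citral"},
    "OR_B": {"citral", "menthol"},
    "OR_C": {"vanillin", "menthol", "skatole"},
    "OR_D": {"skatole"},
}


def odor_code(odorant):
    """Return the set of receptors that respond to the given odorant."""
    return frozenset(
        name for name, ligands in RECEPTOR_RESPONSES.items() if odorant in ligands
    )


ODORANTS = ["vanillin", "citral", "menthol", "skatole"]
CODES = {odorant: odor_code(odorant) for odorant in ODORANTS}

# Each odorant triggers a different combination of receptors.
for odorant, code in CODES.items():
    print(odorant, sorted(code))
```

No single receptor here identifies an odor on its own; it is the combination that does, which is why the identity of the particular receptors matters less than the brain remembering the pattern.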

The catch is that this only works because each nasal olfactory cell expresses only a single OR gene. How the others are shut off in that cell, but each of them is turned on in other olfactory cells is interesting, but not really understood.  At least, this elaborate system evolved for olfactory discrimination....didn't it?  After all, the genes are named for that!

Well, not so fast.  A recent paper by Flegel et al. in Frontiers in Molecular Biosciences has looked for OR expression in individual mammal sperm cells.  It has concluded that these genes, expressed on the surface of sperm cells, enable them to find and fertilize eggs.  As described by the authors, sperm cells locate egg cells in the female reproductive tract by various chemosensory/receptor means, in a process not fully understood. Various studies have found OR genes expressed on the surface of sperm cells, where they have been said to be involved in the movement mechanisms of sperm.  These authors checked all known OR genes for expression in human sperm cells (they looked for their RNA transcripts).  91 OR genes were detected as being expressed in this way.  They showed their presence in various sub-cellular compartments in the sperm cells, which may be suggestive of specific functions.

Interestingly, the authors claim they've been leaders in detecting 'ectopically' expressed OR gene transcripts (but they aren't the only people documenting such 'ectopic' expression; see this post from 2012).  Whether this is just transcriptional noise or really functional, the very term 'ectopic' suggests the problem with gene naming.  If they're in sperm cells, they aren't properly named as 'olfactory' receptors.  These authors detected varying numbers of OR genes in different samples.  Some of this can be experimental error, but if it is highly controlled variable expression, serious questions arise.  Many of the transcripts were from the antisense (opposite) DNA strand to the one that actually codes for a protein sequence.

The authors found some systematic locations of specific OR genes in the sperm cells, as shown here:

Localization in sperm of specific Olfactory Receptor genes.  Source: Flegel et al., see text.

These results are strange, and their plausibility is open to question.  It is no surprise whatever to find that genes are used in multiple contexts.  But in this particular case, repeatable findings could simply reflect sloppy transcription: genes that actually matter in sperm may lie near the OR genes, and the latter are just transcribed and/or translated along with them, without real function.  Of course, the authors suggest there must be some function because, essentially, of the apparent orderliness of the findings.  Yet this is very hard to understand.

OR genes vary presumably because each variant responds to differing odorant molecules.  With a repertoire of hundreds of genes, and only one expressed per olfactory neuron, we can distinguish, and remember, odors we have experienced.   For similar reasons, the genes are highly mutable--again, that keeps the detectability repertoire up.  Your brain needs to recognize which receptor cells a given odor triggers, in case of future exposure.  But the combination of reporting cells, that is, their specific ORs, shouldn't generally matter so long as the brain remembers.

That eggy aroma!
There is a huge burden of proof here.  Again it is not the multiple expression, but the suggested functions, that seem strange.  If the findings actually have to do with fertilization, what is the role of this apparently random binding-specificity, the basic purportedly olfactory repertoire strategy, of these genes on the sperm cells' surface?  How can a female present molecules that are specifically recognized by this highly individualistic OR repertoire in the male?  How can her egg cell or genital tract or whatever, present detectable molecules for the sperm to recognize? What is it that guides or attracts them, whose specificity is retained even though the OR genes themselves are so highly variable?

And of course one has to be exceedingly skeptical about antisense OR-specific RNAs having any function, if that is what is proposed.  It is more than hard to imagine what that might be, how it would work, or most importantly, how it would have evolved.  Is this a report of really striking new genetic mechanisms and evolution....or findings not yet clear between function and noise?

The mechanism is totally unclear at present, and the burden of proof a major one.  Given that others have reported that OR genes are expressed in other cells, the evidence suggests that such expression is clearly believable, whatever the reason. Indeed, years ago it was speculated that they might serve to identify body cells with unique OR-based 'zip codes' for various internal uses, such as recognizing which cell is in which tissue and the like.

Sperm- and/or testis-specific expression of at least some OR genes has also been observed before, as these authors note, but with less extensive characterization.  Is it functional, or just sloppy genome usage?  Time will tell.  The sperm cells are programmed to know the delicious smell of freshly prepared eggs.  Now, perhaps the next check should be to see whether the same sperm cells are also looking for (or lured by) the aroma of freshly fried bacon!

But if it is another use of a specific cell-identification system, of which olfactory discrimination is but one use, then it will be consistent with the well-known opportunistic nature of evolution.  There are countless precedents.  How this one evolved will be interesting to know and, perhaps especially, to learn whether olfaction was its initial use, or one adopted after some earlier--perhaps fertilization-related--function had already evolved.

But for our purposes today, the clear lesson, at least, should be the problem of coining gene names inaptly assigned because of their first-discovered function (or, in our view, because some whimsical geneticist liked a particular movie or cartoon character, like Sonic Hedgehog).

Wednesday, January 27, 2016

"The Blizzard of 2016" and predictability: Part III: When is a health prediction 'precise' enough?

We've discussed the use of data and models to predict the weather in the last few days (here and here).  We've lauded the successes, which are many, and noted the problems, including people not heeding advice. Sometimes that's due, as a commenter on our first post in this series noted, to previous predictions that did not pan out, leading people to ignore predictions in the future.  It is the tendency of some weather forecasters, like all media these days, to exaggerate or dramatize things, a normal part of our society's way of getting attention (and resources).

We also noted the genuine challenges to prediction that meteorologists face.  Theirs is a science that is based on very sound physics principles and theory, that as a meteorologist friend put it, constrain what can and might happen, and make good forecasting possible.  In that sense the challenge for accuracy is in the complexity of global weather dynamics and inevitably imperfect data, that may defy perfect analysis even by fast computers.  There are essentially random or unmeasured movements of molecules and so on, leading to 'chaotic' properties of weather, which is indeed the iconic example of chaos, known as the so-called 'butterfly effect': if a butterfly flaps its wings, the initially tiny and unseen perturbation can proliferate through the atmosphere, leading to unpredicted, indeed, wildly unpredictable changes in what happens.
The Butterfly Effect, far-reaching effects of initial conditions; Wikipedia, source
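The sensitivity to initial conditions described above can be sketched in a few lines.  This is not a weather model: the logistic map is used here only as a standard textbook stand-in for a chaotic system, and the starting value, the size of the 'butterfly' perturbation, and the number of steps are arbitrary assumptions.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the logistic map x -> r * x * (1 - x), starting from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two runs whose starting points differ by one part in ten billion.
a = logistic_trajectory(0.2, 60)
b = logistic_trajectory(0.2 + 1e-10, 60)
gaps = [abs(x - y) for x, y in zip(a, b)]

# The gap stays tiny for the first steps, then grows until the two runs
# have nothing to do with each other.
for step in (0, 10, 20, 30, 40, 50, 60):
    print(step, gaps[step])
```

The rule itself is fully deterministic; the unpredictability comes entirely from not knowing the starting point exactly, which is the situation a forecaster with sparse observations is in.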

Reducing such effects is largely a matter of needing more data.  Radar and satellite data are more or less continuous, but many other key observations are only made many miles apart, both on the surface and into the air, so that meteorologists must try to connect them with smooth gradients, or estimates of change, between the observations.  Hence the limited number of future days (a few days to a week or so) for which forecasts are generally accurate.

Meteorologists' experience, given their resources, provides instructive parallels as well as differences with the biomedical sciences, which aim for precise prediction, often of things decades in the future, such as disease risk based on genotype at birth or lifestyle exposures.  We should pay attention to those parallels and differences.

When is the population average the best forecast?
Open physical systems, like the atmosphere, change but don't age.  Physical continuity means that today is a reflection of yesterday, but the atmosphere doesn't accumulate 'damage' the way people do, at least not in a way that makes a difference to weather prediction.  It can move, change, and refresh, with a continuing influx and loss of energy, evaporation and condensation, and circulating movement, and so on. By contrast, we are each on a one-way track, and a population continually has to start over with its continual influx of new births and loss to death. In that sense, a given set of atmospheric conditions today has essentially the same future risk profile as such conditions had a year or century or millennium ago. In a way, that is what it means to have a general atmospheric theory. People aren't like that.

By far, most individual genetic and even environmental risk factors identified by recent Big Data studies only alter lifetime risk by a small fraction.  That is why the advice changes so frequently and inconsistently.  Shouldn't it be that eggs and coffee either are good or harmful for you?  Shouldn't a given genetic variant definitely either put you at high risk, or not? 

The answer is typically no, and the fault is in the reporting of data, not the data themselves. This is for several very good reasons.  There is measurement error.  From everything we know, the kinds of outcomes we are struggling to understand are affected by a very large number of separate causally relevant factors.  Each individual is exposed to a different set or level of those factors, which may be continually changing.  The impact of risk factors also changes cumulatively with exposure time--because we age.  And we are trying to make lifetime predictions, that is, ones of open-ended duration, often decades into the future.  We don't ask "Will I get cancer by Saturday?", but "Will I ever get cancer?"  That's a very different sort of question.

Each person is unique, like each storm, but we rarely have the kind of replicable sampling of the entire 'space' of potentially risk-affecting genetic variants--and we never will, because many genetic or even environmental factors are very rare and/or their combinations essentially unique, they interact and they come and go.  More importantly, we simply do not have the kind of rigorous theoretical basis that meteorology does. That means we may not even know what sort of data we need to collect to get a deeper understanding or more accurate predictive methods.

Unique contributions of combinations of a multiplicity of risk factors for a given outcome means the effect of each factor is generally very small and even in individuals their mix is continually changing.  Lifetime risks for a trait are also necessarily averaged across all other traits--for example, all other competing causes of death or disease.  A fatal early heart attack is the best preventive against cancer!  There are exceptions of course, but generally, forecasts are weak to begin with and in many ways over longer predictive time periods they will simply approximate the population--public health--average.  In a way that is a kind of analogy with weather forecasts that, beyond a few days into the future, move towards the climate average.
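The point about competing causes can be illustrated with a deliberately crude calculation.  It assumes constant yearly hazards over an 80-year span, which is not how real hazards behave, and the rates are invented; the function name and parameters are hypothetical.  It shows only the direction of the effect.

```python
def lifetime_cancer_risk(cancer_hazard, other_death_hazard, years):
    """Chance of ever getting cancer when another cause of death competes.

    Assumes (unrealistically) constant yearly hazards and that at most one
    of the two events can happen to a person in any given year.
    """
    at_risk = 1.0  # fraction still alive and cancer-free
    risk = 0.0
    for _ in range(years):
        risk += at_risk * cancer_hazard
        at_risk *= 1.0 - cancer_hazard - other_death_hazard
    return risk


# Same hypothetical cancer hazard, increasing hazard of dying of something else.
for other in (0.0, 0.01, 0.03):
    print(other, round(lifetime_cancer_risk(0.005, other, 80), 3))
```

The cancer hazard never changes in this sketch, yet the lifetime cancer risk falls as the competing hazard rises, simply because fewer people live long enough to get it.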

Disease forecasts change people's behavior (we stop eating eggs or forgo our morning coffee, say), each person doing so, or not, to his/her own extent.  That is, feedback from the forecast affects the very risk process itself, changing the risks themselves and in unknown ways.  By contrast, weather forecasts can change behavior as well (we bring our umbrella with us) but the change doesn't affect the weather itself.

Parisians in the rain with umbrellas, by Louis-Léopold Boilly (1803)

Of course, there are many genes in which variants have very strong effects.  For those, forecasts are not perfect but the details aren't worth worrying about: if there are treatments, you take them.  Many of these are due to single genes and the trait may be present at birth. The mechanism can be studied because the problem is focused.  As a rule we don't need Big Data to discover and deal with them.  

The epidemiological and biomedical problem is with attempts to forecast complex traits, in which most every instance is causally unique.  Well, every weather situation is unique in its details, too--but those details can all be related to a single unifying theory that is very precise in principle.  Again, that's what we don't yet have in biology, and there is no really sound scientific justification for collecting reams of new data, which may refine predictions somewhat, but may not go much farther.  We need to develop a better theory, or perhaps even to ask whether there is such a formal basis to be had--or whether the complexity we see is just what there is.

Meteorology has ways to check its 'precision' within days, whereas biomedical sciences have to wait decades for our rewards and punishments.  In the absence of tight rules and ways to adjust errors, constraints on biomedical business as usual are weak.  We think a key reason for this is that we must rely not on externally applied theory, but internal comparisons, like cases vs controls.  We can test for statistical differences in risk, but there is no reason these will be the same in other samples, or the future.  Even when a gene or dietary factor is identified by such studies, its effects are usually not very strong even if the mechanism by which they affect risk can be discovered.  We see this repeatedly, even for risk factors that seemed to be obvious.

We are constrained not just to use internal comparisons but to extrapolate the past to the future.  Our comparisons, say between cases and controls, are retrospective and almost wholly empirical rather than resting on adequate theory.  The 'precision' predictions we are being promised are basically just applications of those retrospective findings to the future.  It's typically little more than extrapolation, and because risk factors are complex and each person is unique, the extrapolation largely assumes additivity: that we just add up the risk estimates for various factors that we measured on existing samples, and use that sum as our estimate of future risk.
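What 'assuming additivity' amounts to can be written down in a few lines.  The baseline, the factor names, and the per-factor increments below are all made up for illustration; they stand in for estimates taken from some past sample.

```python
# Hypothetical numbers standing in for estimates from an existing sample.
BASELINE_RISK = 0.10
RISK_INCREMENTS = {
    "variant_1": 0.01,
    "variant_2": 0.005,
    "daily_eggs": 0.02,
}


def additive_risk(person_factors, baseline=BASELINE_RISK, increments=RISK_INCREMENTS):
    """Add each measured factor's estimated increment to the baseline risk."""
    total = baseline + sum(increments[factor] for factor in person_factors)
    # Keep the result a valid probability.
    return min(max(total, 0.0), 1.0)


print(additive_risk(["variant_1", "daily_eggs"]))
```

Note what the sum leaves out: there is no term for how factors interact, for how exposures change over a lifetime, or for factors that were never measured in the original sample.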

Thus, while for meteorology, Big Data makes sense because there is strong underlying theory, in many aspects of biomedical and evolutionary sciences, this is simply not the case, at least not yet.  Unlike meteorology, biomedical and genetic sciences are the really harder ones!  We are arguably just as likely to progress in our understanding by accumulating results from carefully focused questions, where we're tracing some real causal signal (e.g., traits with specific, known strong risk factors), as by just feeding the incessant demands of the Big Data worldview.  But this of course is a point we've written (ranted?) about many times.

You bet your life, or at least your lifestyle!
If you venture out on the highway despite a forecast snowstorm, you are placing your life in your hands.  You are also imposing dangers on others (because accidents often involve multiple vehicles). In the case of disease, if you are led by scientists or the media to take their 'precision' predictions too seriously, you are doing something similar, though most likely mainly affecting yourself.  

Actually, that's not entirely true.  If you smoke or hog up on MegaBurgers, you certainly put yourself at risk, but you risk others, too. That's because those instances of disease that truly are strongly and even mappably genetic (which seems true of subsets of even most 'complex' diseases) are masked by the majority of cases that are due to easily avoidable lifestyle factors; the causal 'noise' from risky lifestyles makes genetic causation harder to tease out.

Of course, taking minor risks too seriously also has known potentially serious consequences, such as intervening on something that was only weakly problematic to begin with.  Operating on a slow-growing prostate or colon cancer in older people may lead to more damage than the cancer will. There are countless other examples.

Life as a Garden Party
The need is to understand weak predictability, and to learn to live with it. That's not easy.

I'm reminded of a time when I was a weather officer stationed at an Air Force fighter base in the eastern UK.  One summer, on a Tuesday morning, the base commander called me over to HQ.  It wasn't for the usual morning weather briefing.....

"Captain, I have a question for you," said the Colonel.

"Yes, sir?"

"My wife wants to hold a garden party on Saturday.  What will the weather be?"

"It might rain, sir," I replied.

The Colonel was not very pleased with my non-specific answer, but this was England, after all!

And if I do say so myself, I think that was the proper, and accurate, forecast.**

Plus ça change..  Rain drenches royal garden party, 2013; The Guardian

**(It did rain.  The wife was not happy! But I'd told the truth.)

Tuesday, January 26, 2016

"The Blizzard of 2016" and predictability: Part II: When is a prediction a good one? When is it good enough?

Weather forecasts require the prediction of many different parameter values.  These include temperature, wind at the ground and aloft (winds that steer storm systems, and where planes fly), humidity on the ground and in the air (that determines rain and snowfall), friction (related to tornadoes and thunderstorms), change over time and the track of these things across the surface with its own weather-affecting characteristics (like water, mountains, cities).  Forecasters have to model and predict all of these things.  In my day, we had to do it mainly with hand-drawn maps and ground observations (no satellites, basically no useful radar, only scattered ship reports over oceans, etc.), but of course now it's all computerized.

Other sciences are in the prediction business in various ways.  Genetic and other aspects of epidemiology are among them.  The widely made, now trendy promise of 'precision' medicine, or the predictions of what's good or bad for you, are clear daily examples.  But as with the weather, we need some criteria, or even some subjective sense of how good a prediction is.  Is it reliable enough to convince you to change how you live?

Yesterday, I discussed aspects of weather prediction and what people do in response, if anything.  Last weekend's big storm was predicted many days in advance, and it largely did what was being predicted.  But let's take a closer look and ask: How good is good enough for a prediction?  Did this one meet the standard?

Here are predicted patterns of snowfall depth, from the January 24th New York Times, the day after the storm, with data provided by the National Weather Service:

And now here are the measured results, as reported by various observers:

Are these well-forecast depths, or not?  How would you decide?  Clearly, the maximum snowfall reported (42") in the Washington area was a lot more than the '20+"' forecast, but is that nit-picking?  "20+" does leave a lot of leeway for additional snowfall, after all.  But, the prediction contour plot is very similar to the actual result. We are in State College, rather a weather capital because the Penn State Meteorology Department has long been a top-rated one and because Accuweather is located here as a result.  Our snowfall was somewhere between 7 and 10 inches.  The top prediction map shows us in the very light area, with somewhere between 1-5" and 7-10" expected, and the forecasts were for there to be a sharp boundary between virtually no snowfall, and a large dump.  A town only a few miles north of us had very few inches.

So was the forecast a good one, or a dud?

How good is a good forecast?
The answer to this fair question depends on the consequences.  No forecast can be perfect--not even in physics where deterministic mathematical theory seems to apply.  At the very least, there will always be measurement errors, meaning you can never tell exactly how good a prediction was.

As a lead-up to the storm's arrival in the east, I began checking a variety of commercial weather companies (AccuWeather, WeatherUnderground, the Weather Channel, WeatherBug) as well as the US National and the European Weather Services, interested in how similar they were.

This is an interesting question, because they all rely on a couple of major computer models of the weather, including an 'ensemble' of their forecasts. The local companies all use basically the same global data sources, and the same physical theory of fluid dynamics, and the same resulting numerical models.  They try to be original (that's the nature of the commercial outfits, of course, since they need to make sales, and even the government services want to show that they're in the public eye).

In the vast majority of cases, as in this one, the shared data from weather balloons, radar, ground reports, and satellite imagery, as well as the same physical theory, means that there really are only minor differences in the application of the theory to the computed models.  Data resources allow retrospective analysis to make corrections to the various models and see how each has been doing and adjust them.  For the curious, most of this is, rightly, freely available on the internet (thanks to its ultimately public nature).  Even the commercial services, as well as many universities, make data conveniently available.

In this case, the forecasts did vary. All more or less had us (State College) on a sharp edge of the advancing snow front.  Some forecasts had us getting almost no snow, others 1-3", others in the 5-8" range.  These varied within any given organization over time, as of course they should when better models become available.  But that's usually when D-day is closer and there is less extrapolation of the models, in that sense less accuracy or usefulness from a precision point of view.  At the same time, all made it clear that a big storm was coming and our location was near to the edge of real snowfall. They all also agreed about the big dump in the Washington area, but varied in terms of what they foresaw for New York and, especially, Boston.  Where most snow and disruption occurred, they gave plenty of notice, so in that sense the rest can be said to be details.  But if you expected 3" of snow and got a foot, you might not feel that way.

If you're in the forecasting business--be it for the weather or health risks based on, say, your genome or lifestyle exposures--you need to know how accurate forecasts are since they can lead to costly or even life-or-death consequences.  Crying wolf--and weather companies seem ever tempted to be melodramatic to retain viewers--is not good of course, but missing a major event could be worse, if people were not warned and didn't take precautions.  So it is important to have comparative predictions by various sources based on similar or even the same data, and for them to keep an eye on each other's reasons, and to adjust.

As far as accuracy and distance (time) is concerned, precision is a different sort of thing.  Here is the forecast by our local, excellent AccuWeather company for the next several days:

This and figure below from

And here is their forecast for the days after that.

How useful are these predictions, and how would you decide?  What minor or major decisions would you make, based on your answers?  Here nothing nasty is in the forecast, so if they blow the temperature or cloud over on the out-days of this span, you might grumble but you won't really care.

However, I'm writing this on Sunday, January 24.  The consensus of several online forecasts was all roughly like the above figures.  Basically smooth sailing for the week, with a southerly and hence warm but not very stormy air flow, and no significant weather.  But late yesterday, I saw one forecast for the possibility of another Big One like what we just had.  The forecaster outlined the similarities today with conditions ten days ago, and in a way played up the possibility of another one like it.  So I looked at the upper-air steering winds and found that they seem to be split between one branch that will steer cold arctic air down towards the southern and eastern US, and another branch that will sweep across the south including the moist Gulf of Mexico and join up with the first branch in the eastern US, which is basically what happened last week!

Now, literally as I write, one online forecast outfit has changed its forecast for the coming week-end (just 5 days from now) to rain and possibly ice pellets.  Another site now asks "Could the eastern US face more snow later this week?" Another makes no such projection.  Go figure!

Now it's Monday.  One commercial site is forecasting basically nothing coming.  Another forecasts the probability of rain starting this weekend.  NOAA is forecasting basically nothing through Friday.

But here are screenshots from an AccuWeather video on Monday morning, discussing the coming week.  First, there is doubt as to whether the Low pressure system (associated with precipitation) will move up the east coast or farther out to sea.  The actual path taken, steered by upper-level winds, will make a big difference in the weather experienced in the east.


The difference in outcomes would essentially be because the relevant wind will be across the top of the Low, moving from east to west, that is, coming off the ocean onto land (air circulates as a counter-clockwise eddy around the center of the Low).  Rain or possibly snow will fall on land as the result.  How much, or how cold it will be depends on which path is taken.  This next shot shows a possible late-week scenario.

The grey is the upper-level steering winds, but their actual path is not certain, as the prior figure showed, meaning that exactly where the Low will go is uncertain at present.  There just isn't enough data, and so there's too much uncertainty in the analysis, to be more precise at this stage.  The dry and colder air shown coming from the west would flow underneath the moist air flowing in from offshore, pushing it up and causing precipitation.  If the flow is the more eastward of the alternatives in the previous figure, the 'action' will mainly be out at sea.

Well, it's now Monday afternoon, and two sites I check are predicting little if anything as of the weekend....but another site is predicting several days in a row of rain.  And....(my last 'update'), a few hours later, the site is predicting 'chance of rain' for the same days.

To me, with my very rusty, and by now semi-amateur checking of various things, it looks as if there won't be anything dropping on us.  We'll see!

The point here is how much things change and how fast on little prior indication--and we are only talking about predicting a few days, not weeks, ahead.  The above AccuWeather video shows the uncertainty explicitly, so we're not being misled, just advised.

This level of uncertainty is relevant to biology, because meteorology is based on sophisticated, sound physics theory (hydrodynamics, etc.).  It lends itself to high-quality, very extensive and even exotic instrumentation and mathematical computer simulation modeling.  Most of the time, for most purposes, however, it is already an excellent system.  And yet, while major events like the Big Blizzard this January are predictable in general, if you want specific geographic details, things fall short.  It's a subjective judgment as to when one would say "short of perfection" rather than "short but basically right."

With more instrumentation (satellites, radar, air-column monitoring techniques, and faster computers) it will inevitably get better.  Here's a reasonable case for Big Data.  However, because of measurement errors and minor fluctuations that can't be detected, inaccuracies accumulate (that is an early example of what is meant by 'chaotic' systems): the farther down the line you want to predict, the greater your errors.  Today, in meteorology, except in areas like deserts where things hardly change, I've been told by professional colleagues who are up to date, that a week ahead is about the limit.  After that, at least under conditions and locations where weather change is common, forecasts based on specific conditions today are no better than the climate average for that location and time of year.
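One common way to put a number on "no better than the climate average" is a skill score that compares a forecast's squared error with the error you'd get by always predicting the climate average.  The sketch below uses that idea with invented temperatures; the observed values, both forecasts, and the climate average are hypothetical, chosen only to show a short-range forecast with skill and a long-range one without.

```python
def mean_squared_error(predicted, observed):
    """Average squared difference between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def skill_vs_climatology(forecast, observed, climate_average):
    """1 is a perfect forecast, 0 matches the climate average, below 0 is worse."""
    reference = [climate_average] * len(observed)
    return 1.0 - mean_squared_error(forecast, observed) / mean_squared_error(
        reference, observed
    )


# Hypothetical daily high temperatures and forecasts, for illustration only.
observed = [30, 35, 28, 40, 33]
forecast_one_day_ahead = [31, 34, 29, 38, 33]
forecast_eight_days_ahead = [38, 27, 36, 31, 41]
CLIMATE_AVERAGE = 33

print(skill_vs_climatology(forecast_one_day_ahead, observed, CLIMATE_AVERAGE))
print(skill_vs_climatology(forecast_eight_days_ahead, observed, CLIMATE_AVERAGE))
```

A score at or below zero is the signal that the elaborate forecast has added nothing over simply quoting the long-term average for that place and season.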

The more dynamic a situation--changing seasons, rapidly altering air and moisture movement patterns, mountains or other local effects on air flow--the less predictable it is over more than a few days. You have to take such longer-range predictions with a huge grain of salt, understanding that they're the best theory and intuition and experience can do at present (and taking into account that it is better to be safe--warned--than sorry, and that companies need to promote their services with what we might charitably call energetic presentations).  The realities are that under all but rather stable conditions, such long-term predictions are misleading and probably shouldn't even be made: weather services should 'just say no' to offering them.

An important aspect of prediction these days, where 'precision' has recently become a widely canted promise, is in health.  Epidemiologists promise prediction based on lifestyle data.  Geneticists promise prediction based on genotypes.  How reliable or accurate are they now, or likely to become in the predictable future?  At what point does population average do as well as sophisticated models? We'll discuss that in tomorrow's installment.

Monday, January 25, 2016

"The Blizzard of 2016" and predictability: Part I--the value of prediction

Mark Twain famously quipped, "Everybody talks about the weather but nobody does anything about it." But these days, that's far from accurate.  At least, an army of specialists try to predict the weather so that we can be prepared for it.  The various media, as well as governmental agencies, publicize forecasts.  But how good are those forecasts?

As a former meteorologist myself (back--way back--when I was an Air Force weather officer), I take an interest, partly professional but also conceptual, in how accurate forecasting has become in our computer and satellite era.

Last week, a storm developed over the southwest and combined with an atmospheric disturbance barreling down from the Canadian arctic to cause huge rain and wind damage across the south; it then veered north, where it turned into "The Blizzard of 2016", as dubbed by the exaggeration-hungry media.  How well was it forecast and did that do any societal good?

Here is a past-few-days summary page of mapped conditions at upper air (upper left), surface (upper right) and other levels.  On a web page called eWall you can scroll these for the prior 5 days.  The double Low pressure (red L's) on the right panel represent the center of the storm, steered in part by the winds aloft (other panels).

If you followed the forecasting over the week leading to the storm's storming up the east coast to wreak havoc there, you would say it was exceedingly well forecast, and many days in advance. Was it worth the cost?  One has to say that probably many lives were saved, huge damage avoided, and disruption minimized: people emptied grocery store shelves and hunkered down to watch the Weather Channel (and State College's own AccuWeather).  Urgent things, including shopping for supplies in case of being house-bound, were done in advance and probably many medical and other similar procedures were done or rescheduled and the like.  The snowfall was very heavy, as predicted, and the forecast was accurate enough to have been life-saving.

Lots of people still don't do anything about it!
And yet....
Despite a lot of people talking about the weather, on all sorts of public media, masses of people, acting like Mark Twain, don't do anything about it, even with the information in hand.  At least 12 people died in this storm in accidents, and others from coronaries while shoveling, and this is just what I've seen in a quick check of the online news outlets.  Thousands upon thousands were stranded for many hours in freezing cold on snow-sodden highways.  There were things like 25-mile-long stationary lines of vehicles on interstates and thousands of car and truck accidents.  That's a lot of people paying the price for their own stubbornness or ignorance.  This is what such a jam looks like:

A typical snowstorm traffic jam
People were warned in the clearest terms for days in advance.  Our fine National Weather Service, in collaboration with complementary services in other countries, scoped out the situation and let everyone know about it, as is their very important job.  Some states, like New York,  properly closed their roads to all but necessary traffic. Their governments did their jobs.  Other states, like Kentucky, failed to do that.  So then, how is it that there was so much of what seems like avoidable damage?

Let's put the issue another way: My auto insurance rates will reflect the thousands of costly claims that will be filed because of those who failed to heed the warnings and were out on the highways anyway. So I paid for the forecasts first through my taxes, and then through the purchase prices of goods whose makers pay to advertise on weather channels, but then I also have to pay for those whose foolhardiness led to the many accidents they'll make claims for.  That's similar to people knowingly enjoying an unhealthy lifestyle, and then expecting health insurance to cover their medical bills--that insurance, too, is amortized over the population of insured including those who watch their lifestyles conscientiously.  That's the nature of insurance.

Some people, of course, simply can't stay home.  But many just won't.  Countless truckers were stranded on the roads.  They surely knew of the coming storm.  Did commercial pressure keep them on the road?  Then shame on their companies!  They surely could have pulled over or into Walmart parking lots to wait out the snowfall and its clearance--a day or so, say.  Maybe there aren't enough parking lots for that, but surely, surely they should not have been on the Interstates!  And while some people probably had strong legitimate reasons for being out, and a few may not have seen the strong, repeated forecasts over the many preceding days, most, and I would say by far the most, just decided to take their trips anyway.

Nobody can say they aren't aware of pileups, crashes, and hours-long stalls that happen on Interstates during snowstorms.  It is not a new phenomenon!  Yet, again, we all will have to pay for their foolhardiness.  Maybe insurance should refuse to cover those on the road for unnecessary trips. Maybe those who clog the roads in this way should be taxed to cover the costs of, say, increased insurance rates on everyone else or emergencies that couldn't be dealt with because service vehicles couldn't get to the scene.

The National Weather Service, and companies who use their data, did a terrific job of alerting people of the coming storm, and surely saved many lives and prevented damage as a result.  Just as they do when they forecast hurricanes and warn of tornadoes.  Still, there are always people who ignore the warnings, at their own cost, and at cost to society, but that's not the fault of the NWS.

But what about predictability? Did they get it right?  What is 'right'?
It is a fair and important question to ask how closely the actual outcome of the storm was predicted.   The focus is on the accuracy in detail, not the overall result, and that leads one to examine the nature of the science and--of course in our case here on this blog--to compare it with the state of the art of epidemiological, including genetic, predictions.  Not all forecasts are as dramatic and in a sense clear-cut as a major storm like this one.

I have been in the 'prediction' business for decades, first as a meteorologist and subsequently in trying to understand the causal relationships, genetic and evolutionary, that explain our individual traits.  Tomorrow, we'll discuss aspects of the Big Storm's forecasts that weren't so accurate and compare that with the situation in these biological areas.

Monday, January 18, 2016

Blogs and public science: nothing new

Universities generally say their faculty have three major responsibilities: teaching, research, and service.  That's the usual listing order, though in our time it would be perhaps more accurate to reverse that.  Service comes first, in the form first and foremost of grants and fund-raising, and secondly, time-eating bureaucracy.  Research, meaning raising funds (again!) and lots of publication, comes second.  Research is worshipped as a public good, but arcane research counts in many fields (like the humanities or very micro-focused science).  Teaching, well, only if you can't find a way to get out of it.

Actually, we're being cynical (just a bit).  'Service' isn't only about hawking for money.  Public education is also part of that.  And again we're not being wholly cynical about that.  The universities want to write about your work in their PR magazines sent to alumni and in press releases (of course, this is also largely about money).  But educating the general public about what we're doing in research and scholarship is, in fact, an important role even if part of that is self-serving.  Popular science books, for example, draw attention but also can provide the non-specialist citizenry a way to get a general understanding or even a fascination with scholarly and scientific discoveries.

Science can be very technical, specialized, and arcane.  Much of what we ask about is quite remote from direct application or from things the non-scientific public care about, much less know anything about. That means that if you don't have the detailed background or time to continue keeping up to date in a professional sense, as most people clearly don't, having a professional explain the gist of the issues can be quite valuable and also quite interesting.

The idea of 'popular' science has changed over time, of course, because in the past only a small segment of the public was involved, perhaps only the aristocracy. This was true of much of music, philosophy, literature and so on.  But there has also long been at least somewhat of a tradition of specialists explaining things to the public.

Probably among the most common instances would be religious leaders explaining the technicalities of scripture.  Travelers have long told tales of what they saw in far-away places.  Even ancient itinerant speakers--Homer, perhaps, in ancient Greece--came around and did this, 'performing' in a sense.  Maybe most of this was for the upper classes to witness, but how restrictive that would have been probably varied.

The physician Galen was a performer of this sort.  He did dissections (or, worse, vivisections) to show off anatomy and attract attention to his medical knowledge and services.  I don't know about Marco Polo, but would expect he regaled many with his tales.  Boyle, Thomas Edison, and others here and in Europe regularly put on demonstration shows for the public, or at least those whose support they might want.  Probably phrenologists and alchemists did the same.

In the 18th and 19th centuries in Europe, travelers certainly entertained audiences with their tales, often for fund-raising purposes.  A major exploratory expedition to the South Pole was funded in this way (not by government grants).  Thomas Huxley, Darwin's famous 'bulldog' (outspoken, aggressive advocate) loved to give public lectures, and was especially dedicated to educating the working man. In the mid-1800s, Michael Faraday gave public lectures about phenomena related to electricity and magnetism.  Not all of these were university faculty by any means, and perhaps most of the latter stayed within their classrooms.  But there were several leading academics who became quite popular among the reading and museum/lecture-attending public.  Tales of fossil exploration in the American west were given, back east, and public intellectuals in the major coastal universities were well-known.

Popular science and popularizing scientists: only the formats are new
The tradition has proliferated as faculty have more and more been expected to do 'service', including public education.  For much of the 20th century and even more into the present, the public scientist has been a TV staple, and one widely seen in magazines and newspaper science sections.  Indeed, many now are not really scientists any longer but drop-outs who became journalists or documentary makers.  But a few famous ones, like Stephen Jay Gould, Ed Wilson, Richard Feynman, Neil deGrasse Tyson, Sean Carroll, Carl Sagan, Richard Dawkins, and others essentially kept their academic groundings.  Often one could argue they lost much of their rigorous credentials in the process, but not always.

As media changed, so did the means by which scientists, professional and once-were, could convey the gist of technical science to the public.  Among other things, funding has become media-driven and so organizations like NIH, NASA, and universities (and their counterparts in other countries) have major PR departments of their own, to spin as well as educate.

We now have at least two relatively new media: the blogosphere and open-source publishing.  The latter is largely still arcanely professional, but is opening up to unmonitored commentary.  Popular science magazines online allow commentary and discussion.  Q and A sites, like Quora in physics, Reddit, and many others, provide interactions among scientists, students, and public freely and without being restricted to classrooms.

Blogs, such as this one, are increasingly being used as regular outlets for faculty, to communicate to the web-addicted public.  Here, we can write technical or popular or tweener posts.  We can Tweet a post to reach a desired audience.  We can mix technical, whimsical, speculative and specific commentaries.

Universities need to get on board faster
Universities are themselves now waking up to the value of these media as part of faculty members' 'service' responsibilities.  But universities, which should pioneer what's new, are often stodgy and fearsomely conservative.  Deans and chairs tend to stick to what's known, the traditional, even if most research, in the sciences as well as humanities, mainly collects dust in library archive annexes. Students go to the library to work on computers, online data bases, and the like.

It's not just that most research will quickly be dust-collecting.  The idea of peer review is way over-rated as a way to purge the bad and only publish the good and important.  Peer review is creaking under the weight of its hoary insider tradition, and because reviewers are so overloaded that they rarely can give proper scrutiny.  Overloaded 'supplemental information' doesn't help, nor does the need to review grant proposals (or write them).  Time is short.

It's true that traditional publishing of research in peer-reviewed journals, even burnished with online (pay as you go) open-access routes, still has first priority in administrators' eyes.  But things are changing, as they should, must, and will.  Online publishing also has online open reviewing, and comments by readers.  There may be far too many journals, but weird ideas do have a chance to be seen, and online searching makes them available.  Much is junk, of course, but at least you, not a panel of insider reviewers, get to judge.

It's a different kind of arena, and recalcitrant institutions will have to modernize.  Some faculty we know (including our own, fantastic Holly Dunsworth) have successfully, and deservedly, achieved tenure with public media being a substantial part of their records.  As they age into administrative roles, the changing landscape will be built into their world-views.  That, too, will mature and perhaps become stodgy, to be replaced or supplemented by whatever the future holds.  But it's likely to be much more dynamic and flexible than the legacy of the past that we have too much still to live with today.

As open-source and online media increase their fraction of publication, we will likely become a more widely integrated and aware society.  The local classroom is opening up to a global forum, where anyone, not just the elite few, can gather round, and hear whichever oracle or orator they choose.

Homer would probably recognize the phenomenon.

Tuesday, January 12, 2016

Cancer--luck or environment? Part II: Nothing to food-fight over

Yesterday we commented on the 'controversy' over whether cancer is mainly due to environmentally (lifestyle) or inherently (randomly) arising mutations.  This is a tempest in a teaspoon.

Mutations, whatever their individual cause, must accumulate among dividing cells until one cell has the bad luck to accumulate a set of changes that 'transforms' it into a misbehaving cancer cell.  The set of changes varies even among tumors of the same organ, because many different genes and their expression-regulation contribute to the growth, or restraint of growth, even within the same tissue. That is, not all breast, colon, or lung cancers are caused by the same set of mutations.  The transformed cell then proliferates, rapidly dividing and thus indubitably acquiring more mutational changes that enable it to do things like metastasize to other parts of the body, or develop resistance to drug treatment.  The more rapidly it grows and spreads, the more rapidly such things can happen.

Even if the first transformational cause were due entirely to environmentally-induced mutations, the real dangers that ensue during the tumor's lifespan are relatively rapid additions to the original tumorigenesis process, and so in a sense the main dangers of cancer are primarily, if not nearly exclusively, due to inherent mutation among cancer cells.  If you get lung cancer and then stop smoking, your lung cancer will still evolve. Indeed, if environment contributes, it may make things worse--if that "environment" is radiation or chemotherapy: radiation definitely causes mutations, and chemotherapy weeds out cells that haven't experienced resistance mutations, leaving or even making room for tumor lineage cells that do have resistance mutations.  Finally, things that stimulate cell division can facilitate new mutations or even just make a tumor spread more rapidly.

So clearly cancer is not all due to environmental changes, nor all to inherently occurring ones.  These and other factors comprise multiple, interacting causative effects.  Attributing cause to environment or inherency is misleading.  But what if cancer were in fact even entirely due to lifestyle factors that stimulate cell division or directly cause mutation?  Of course this would be very good for the Big Data epidemiologists and their studies, and threatening to industries and so on that produce mutagenic waste or products etc.  But suppose epidemiologists were to continue to find major carcinogenic environmental factors (that is, that the major ones, like smoking, aren't already known). Let us further suppose that avoidance behavior were to follow the announcement of the risks (not an obvious thing to assume, actually; the tobacco industry is still thriving, after all).  Then what?

Epidemiologists would say their work has prevented cancer and would claim victory over the to-them strange idea that cancer is due to inherent mistakes in DNA replication and is inevitable if one were to live long enough.  A lifestyle-change-based reduction in cancer would be clearly a very good thing.  But in fact, it would not be an unalloyed victory: one thing it would do is keep the non-exposed person alive (because s/he didn't get cancer!) and that in turn means that s/he would be at higher risk of (1) other age-related deteriorative diseases that dying of cancer would have precluded, many of which are waiting in the wings at ages when cancers arise, and (2) eventually getting cancer at some older age.  In the first case, the rates of other diseases like stroke and diabetes etc. would necessarily go up.  The risk of slowly petering out in increasingly bad shape in an intensive nursing unit would go up.  That would, of course, lower the lifetime cancer risk, but not in a very pleasant way.

In other words, lifestyle changes can delay cancer, but whether or not they decrease the lifetime risk of cancer would be an open question (even assuming that the per-year exposure to environmental mutagens were reduced, the consequently longer exposure to those mutagens might mean their lifetime total would go up).  However, what this would do would be, by removing environmental causes, to raise the fraction of cancers that are due to inherent mutation, increasing the fraction of Vogelstein-Tomasetti cases!
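
The competing-causes arithmetic can be sketched in a few lines of Python.  This is only an illustration under loud assumptions: the three per-year risks below are invented, they are held constant over age (real risks rise steeply with age), and the function and variable names are ours.  The cohort is followed year by year, and whichever event comes first (an 'environmental' cancer, an 'inherent-mutation' cancer, or some other disease) removes a person from those still at risk.

```python
def lifetime_outcomes(h_env_cancer, h_inherent_cancer, h_other, years=90):
    """Follow a cohort year by year; the first event removes a person from risk.

    All hazards are hypothetical constant per-year probabilities, chosen only
    for illustration.
    """
    at_risk = 1.0
    totals = {"env_cancer": 0.0, "inherent_cancer": 0.0, "other": 0.0}
    for _ in range(years):
        events = {
            "env_cancer": at_risk * h_env_cancer,
            "inherent_cancer": at_risk * h_inherent_cancer,
            "other": at_risk * h_other,
        }
        for cause, fraction in events.items():
            totals[cause] += fraction
        at_risk -= sum(events.values())
    return totals


# Hypothetical scenarios: with and without the avoidable environmental exposure.
with_exposure = lifetime_outcomes(0.004, 0.002, 0.010)
without_exposure = lifetime_outcomes(0.0, 0.002, 0.010)

if __name__ == "__main__":
    for label, result in (("with exposure", with_exposure),
                          ("without exposure", without_exposure)):
        print(label, {cause: round(value, 3) for cause, value in result.items()})
```

With these made-up numbers, removing the exposure lowers total lifetime cancer, but the lifetime totals for inherent-mutation cancer and for other diseases both go up, simply because more people stay alive and at risk for longer.  Different assumed numbers, or exposure that is merely reduced rather than removed, would shift the balance.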

It's undoubtedly good to get cancer later rather than earlier in life, but not an unalloyed good.  In any case, what these points show is that the argument over the particular fraction of cancers that are due to environment vs inherent mutation is rather needless.  At most it might be relevant to ask how much of funding investment in big epidemiological studies is going to pay off, rather than spending on some other clearer issues (especially if the major environmental mutagens are already known).  There have already been scads of massive long-term studies of almost anything you can name, to identify carcinogenic exposures. With some very important exceptions, that are by now well known, these studies have largely come up empty, or with now-it-is/now-it-isn't conclusions, in the sense that risk factors are either weak, or if strong are rare and hard to find embedded in the broad mix of chronic disease risk factors.  Environments are always changing with new possible carcinogenic exposures arising, but basically those with strong effects usually show up on their own such as by multiple cases of a particular cancer type in some specific location or among workers in a particular industry or in vitro mutagenesis studies and the like.

If causation is too generic, don't get your hopes up
If comparisons among countries, for example, show that the same cancer can have very different age patterns or incidence rates, this may suggest lifestyles as major risk differences. But that's far from saying that the causal elements are individually strong or simple enough to be enumerated by the usual Big Study epidemiological approach. One can be extremely doubtful that this would be the case.

Saying something is 'environmental' because, for example, it varies among populations is like saying something is 'genetic' because it varies among relatives.  If it's like genetic factors as documented by countless GWAS studies, with many different, correlated or even independent contributors, then each person's cancer will be due to a different set or complex set of experiences and the luck of the mutational draw.  As with GWAS and related approaches, it is far from clear that large, long-term environmental studies, more mega than we've already had for decades, will be the appropriate way to approach the problem.

Indeed, to a considerable extent, if each case is causally unique, by some different combination of factors and their respective strengths in that individual, then it's epistemologically not very different from saying that cancer occurs randomly, which, though for a different sort of reason, is what V and T said.  There won't, for example, be a specific environmental change you can make, any more than a specific gene you can re-engineer, to make the disease go away or even to change much in frequency or age of onset.

Food fights like this one are normal in science and often have to do with egos, investment in one 'paradigm' or another, how research is supported or how advice from experts is conveyed to the public.  But such disputes, though very human, are rather off the point.  We often basically ignore risks we know, as in the proliferation of CT and other radiation-based scanning and medical testing which can be carcinogenic.  Life is mutagenic, one way or another.  So while you have life, enjoy your food--don't waste it by throwing it at each other!  There are better questions to argue about.

Monday, January 11, 2016

Food-Fight Alert!! Is cancer bad luck or environment? Part I: the basic issues

Not long ago Vogelstein and Tomasetti stirred the pot by suggesting that most cancer was due to the bad luck of inherent mutational events in cell duplication, rather than to exposure to environmental agents.  We wrote a pair of posts on this at the time. Of course, we know that many environmental factors, such as ionizing radiation and smoking, contribute causally to cancer because (1) they are known mutagens, and (2) there are dose or exposure relationships with subsequent cancer incidence. However, most known or suspected environmental exposures do not change cancer risk very much or if they do it is difficult to estimate or even prove the effect.  For the purposes of this post we'll simplify things and assume that what transforms normal cells into cancer cells is genetic mutations; though causation isn't always so straightforward, that won't change our basic storyline here.

Vogelstein and Tomasetti upset the environmental epidemiologists' apple cart by using some statistical analysis of cancer risks related, essentially, to the number of cells at risk, their normal time of renewal by cell division, and age (time as correlated with number of cell divisions).  Again simplifying, the number of at-risk actively dividing cells is correlated with the risk of cancer, as a function of age (reflecting time for cell mutational events), and with a couple of major exceptions like smoking, this result did not require including data on exposure to known mutagens.  V and T suggested that the inherently imperfect process of DNA replication in cell division could, in itself, account for the age- and tissue-specific patterns of cancer.  V and T estimated that except for the clear cases like smoking, a large fraction of cancers were not 'environmental' in the primary causal sense, but were just due, as they said, to bad luck: the wrong set of mutations occurring in some line of body cells due to inherent mutation when DNA is copied before cell division, and not detected or corrected by the cell.  Their point was that, excepting some clear-cut environmental risks such as ionizing and ultraviolet radiation and smoking, cancer can't be prevented by life-style changes, because its occurrence is largely due to the inherent mutations arising from imperfect DNA replication.

Boy, did this cause a stink among environmental epidemiologists!  Now, one factor in this food fight that we think is undeniable is that environmental epidemiologists and the schools of public health that support them (or, more accurately, that the epidemiologists support with their grants) would be put out of business if their very long, very large, and very expensive studies of environmental risk (and the huge percent of additional overhead that pays the schools' members' meal tickets) were undercut--and not funded, with the money going elsewhere.  There is also a sense of lost pride, which is always a factor in science because it's run by humans: all that epidemiological work would go to waste, to the chagrin of many, if it was based on misunderstanding the basic nature of the mutagenic and hence carcinogenic processes.

So naturally the V and T explanation has been heavily criticized from within the industry.  But they will also raise the point, and it's a valid one, that we clearly are exposed to many different agents and chemicals that are the result of our culture and not inevitable and are known to cause mutations in cell culture, and these certainly must contribute to cancer risk.  The environmentalists naturally want the bulk of causation to be due to such lifestyle factors because (1) they do exist, and (2) they are preventable at least in principle.  They don't in principle object to the reality that inherent mutations do arise and can contribute to cancer risk, but they assert that most cancer is due to bad behavior rather than bad luck and hence we should concentrate on changing our behavior.

Now in response, a paper in Nature ("Substantial contribution of extrinsic risk factors to cancer development," Wu et al.) provides a statistical analysis of cancer data that is a rebuttal to V and T's assertions.  The authors present various arguments to rebut V and T's assertion that most cancer can be attributed to inherent mutation, and argue instead that external factors account for 70 to 90% of risk.  So there!

In fact, these are a variety of technical arguments, and you can judge which seem more persuasive (many blog and other commentaries are also available as this question hits home to important issues--including vested interests).  But nobody can credibly deny that both environment and inherent DNA replication errors are involved.  DNA replication is demonstrably subject to uncorrected mutational change, and that (for example) is what has largely driven evolution--unless epidemiologists want to argue that for all species in history, lifestyle factors were the major mutagens, which is plausible but very hard to prove in any credible sense.  

At the same time, environmental agents do include mutational effects of various sorts and higher doses generally mean more mutations and higher risk.  So the gist of the legitimate argument (besides professional pride or territoriality and preservation of public health's mega-studies) is really the relative importance of environment vs inherent processes.  The territoriality component of this is reminiscent of the angry assertion among geneticists, about 30 years ago, that environmental epidemiologists and their very expensive studies were soaking up all the money so geneticists couldn't get much of it.  That is one reason geneticists were so delighted when cheap genome sequencing and genetic epidemiological studies (like GWAS) came along, promising to solve problems that environmental epidemiology wasn't answering--to show that it's all in the genes (and so that's where the funding should go).  

But back to basic biology 
Cells in each of our tissues have their own life history.  Many or most tissues are comprised of specialized stem cells that divide and one of the daughter cells differentiates into a mature cell of that tissue type.  This is how, for example, the actively secreting or absorbing cells in the gut are produced and replaced during life.  Various circumstances inherent and environmentally derived can affect the rate of such cell division. Stimulating division is not the same as being a direct mutagen, but there is a confounding because more cell division means more inherent mutational accumulation.  That is, an environmental component can increase risk without being a mutagen and the mutation is due to inherent DNA replication error.  Cell division rates among our different tissues vary quite a lot, as some tissues are continually renewing during life, others less so, some renew under specific circumstances (e.g., pregnancy or hormonal cycles), and so on.
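
The confounding described above comes down to simple multiplication, sketched here in Python.  Every number is invented purely for illustration, the function name is ours, and the model assumes a constant pool of stem cells, each dividing at a fixed rate.

```python
import math


def expected_replication_errors(stem_cells, divisions_per_year, years, errors_per_division):
    """Expected count of copying errors = total divisions x errors per division."""
    total_divisions = stem_cells * divisions_per_year * years
    return total_divisions * errors_per_division


# Hypothetical tissue: one million stem cells followed for 50 years.
baseline = expected_replication_errors(1e6, 10, 50, 1e-9)
# A mitogen: cells divide twice as often, copying fidelity unchanged.
mitogen = expected_replication_errors(1e6, 20, 50, 1e-9)
# A mutagen: same division rate, but each copy is twice as error-prone.
mutagen = expected_replication_errors(1e6, 10, 50, 2e-9)

if __name__ == "__main__":
    print("baseline:", baseline)
    print("mitogen: ", mitogen)
    print("mutagen: ", mutagen)
    print("mitogen and mutagen indistinguishable by count:", math.isclose(mitogen, mutagen))
```

In this toy accounting the mitogen and the mutagen double the expected mutation count equally, so incidence data alone could not tell an 'environmental' cause that acts through division rate from one that damages DNA directly.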

As we age, cell divisions slow down, also in patterned ways.  So mutations will accumulate more slowly and they may be less likely to cause an affected cell to divide rapidly.  After menopause, breast cells slow or stop dividing.  Other cells, as in the gut or other organs, may still divide, but less often.  Since mutation, whether caused by bad luck or by mutagenic agents, affects cells when they divide and copy their DNA, mutation rates and hence cancer rates often slow with advancing age.  So the rate of cancer incidence is age-specific as well as related to the size of organs and lifestyle stimulants to growth or mutation.  These are at least general characteristics of cancer epidemiology.

It would be very surprising if there were no age-related aspect to cancer (as there is with most degenerative disease).  The absolute risk might diminish with lower exposure to environmental mutagens or mitogens, but the replicability and international consistency of basic patterns suggests inherent cytological etiology.  It does not, of course, in any sense rule out environmental factors working in concert with normal tissue activity, so that as noted above it's not easy to isolate environment from inherent causes.

Wu et al.'s analysis makes many assumptions, the data (on exposures and cell-counts) are suspect in many ways, and it is difficult to accept that any particular analysis is definitive.  And in any case, since both types of causation are clearly at work, where is the importance of the particular percentages of risk due to each?  Clearly strong avoidable risks should be avoided, but clearly we should not chase down every minuscule risk or complex unavoidable lifestyle aspect, when we know inherent mutations arise and we have a lot of important diseases to try to understand better, not just cancer.

Given this, and without discussing the fine points of the statistical arguments, the obvious bottom line that both camps agree on is that both inherent and environmental mutagenic factors contribute to cancer risk. However, having summarized these points generally, we would like to make a more subtle point about this, that in a sense shows how senseless the argument is (except for the money that's at stake). As we've noted before, if you take into account the age-dependency of risk of diseases of this sort, and the competing causes that are there to take us away, both sides in this food fight come away with egg on their face.  We'll explain what we mean, tomorrow.