December 17, 2010

The slipperiness of empirical truth

I just finished reading an engaging article in The New Yorker called "The Truth Wears Off." The author, Jonah Lehrer, talks about a problem that many scientific disciplines face: the decline effect. The decline effect occurs when well-established, repeatedly confirmed empirical findings begin to show smaller effect sizes or can no longer be replicated.

One example is the phenomenon of "verbal overshadowing," demonstrated by the psychologist Jonathan Schooler in 1990. He showed that subjects who are shown a face and asked to describe it are LESS likely to recognize that face later than subjects who simply view it. But Schooler states in the New Yorker article that he has since found it difficult to replicate this earlier finding. He says,

"It was as if nature gave me this great result and then tried to take it back."



I think this statement is revealing, and I would like to address the personal attachment underlying it; I will come back to that sentiment later.

One other study presented in The New Yorker article highlights some ideas that are worth remembering, especially in the wake of "arsenicgate." It is the well-known case of fluctuating asymmetry. I say well-known, but it's probably well-known only to us evolutionary biologists. For those of you who don't know what it is, the idea is that picky females use symmetry as a proxy for genetic quality. The original study was published by Anders Moller in Nature, and it set off a biological feeding frenzy in which researchers tested the idea on everything from fruit flies to humans. But apparently the results did not hold: as the number of studies went up, the effect sizes dropped and the results were contradictory. The article quotes the biologist Leigh Simmons, who performed his own study that failed to find an effect in horned beetles. He says,

"The worse part was that when I submitted these null results I had difficulty getting them published. The journals only wanted confirming data."

As a grad student I was introduced to the ideas surrounding publication bias. I remember one faculty member in a grad class ranting about how the lack of published null results distorts our scientific view of the world.

Publication bias is defined by Moller and Jennions (2001) as the phenomenon arising from bias in submitting, reviewing, accepting and publishing scientific results. The published literature doesn't necessarily reflect the truth about the living world, and any bias in that literature can distort how we see that truth. Moller and Jennions (2001) carried out a huge meta-analysis of the ecology and evolution literature to quantify this bias. They found, first, that authors tend to submit studies that show positive results or disprove a null hypothesis; second, that editors rarely highlight null results; and third, that the prestige of an institution or author is linked to the likelihood that a paper will be accepted.


The funnel graph, the standard tool for detecting this kind of bias, has a pretty basic explanation. If you take a sample of studies with similar methods and plot their sample sizes against their effect sizes, the points form a funnel. The funnel is wide at the base because the variance in effect size is large at small sample sizes and shrinks as sample sizes grow. Sometimes researchers get lucky with a small sample, find a big and splashy significant result, and get it published. But sometimes researchers using a small sample find no effect, and that result does not get published. If the true effect is small and publication is related to statistical significance (unless sample size is large), the effect sizes of published studies will decrease with increasing sample size (Moller and Jennions 2001). And this is exactly what a meta-analysis of fluctuating asymmetry studies showed: when effect size was plotted against the sample size of the different experiments, effect size declined as sample size increased.
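A quick simulation makes this logic concrete. The sketch below is my own minimal illustration, not the Moller and Jennions analysis; the true effect size, the candidate sample sizes and the "publish only significant, positive results" filter are all assumptions chosen for demonstration. It shows the average published effect starting out inflated at small sample sizes and shrinking toward the true value as sample size grows.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_effect = 0.2            # small true effect, in standard-deviation units (assumed)
n_studies = 2000             # simulated studies per sample size

for n in (10, 20, 40, 80, 160, 320):
    published = []
    for _ in range(n_studies):
        control = rng.normal(0.0, 1.0, n)
        treatment = rng.normal(true_effect, 1.0, n)
        _, p = stats.ttest_ind(treatment, control)
        # the "publication filter": only significant, positive results get written up
        if p < 0.05 and treatment.mean() > control.mean():
            published.append(treatment.mean() - control.mean())
    print(f"n = {n:3d}: {len(published):4d} of {n_studies} studies 'published', "
          f"mean published effect = {np.mean(published):.2f}")

At n = 10, only the studies that happened to see a big difference clear the significance bar, so the published effect looks large; by n = 320, nearly everything is significant and the published effect settles near the true value of 0.2.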

But Lehrer is late to the party. This is something I learned in Stats 101. And at least in my discipline, I think journals are making more of an effort to publish solid science regardless of statistical significance. For example, the latest issue of Evolution has a paper entitled "No effect of environmental heterogeneity on the maintenance of genetic variation in wing shape in Drosophila melanogaster."

Selective reporting, however, is not the only form of author bias that can lead to the decline effect. The second is perception bias. Lehrer interviews John Ioannidis, an epidemiologist at Stanford, who says that perception biases are a real problem because many researchers engage in "significance chasing," that is, looking for ways to interpret the data so that they can get a positive result.

"It feels good to validate a hypothesis. It feels even better when you've got a financial interest in the idea or your career depends upon it. And that's why, even after a claim has been systematically disproven, you still see some stubborn researchers citing the first few studies that show a strong effect. They really want to believe that it's true."

In The Lying Stones of Marrakech, Gould suggests that the idea of pure, unsullied observation would require us humans to free our minds from the constraints of our cultural, sociological and psychological baggage. Schooler's earlier statement betrays how attached he was to his ideas. And it reminded me how surprised I was, as a grad student attending conferences, at the defensiveness of the scientists presenting their work. I think this is exactly what Gould suggests turns the empiricist method into a shibboleth.

According to Ioannidis, the real problem is faulty design. You know, I couldn't agree more. Although I wouldn't go as far as Schooler, who suggests that "Every researcher should have to spell out, in advance, how many subjects they're going to use," I do agree that every scientific publication should be crystal clear as to "what exactly they're testing, and what constitutes a sufficient level of proof. We have to be much more transparent about our experiments."

One of the last issues The New Yorker article describes is the contribution of random chance events, or noise, to the decline effect. Stochastic variance. The article presents the case of John Crabbe, a neuroscientist who, in the late '90s, conducted the same experiment in three different labs: in New York, Alberta and Oregon. The experiment looked at the behaviour of mice that were injected with cocaine. (I refuse to take "a shot" at this.) He found that, despite standardizing every variable of the experiment, there were inconsistencies that didn't follow any particular pattern.



And this is where Lehrer falls subject to his own selective reporting. He says,

"The disturbing implication of the Crabbe study is that a lot of extraordinary scientific data are nothing but noise. The hyperactivity of those coked-up Edmonton mice wasn't an interesting new fact - it was a meaningless outlier, a byproduct of invisible variables that we don't understand. The problem, of course, is that such dramatic findings are also most likely to get published in prestigious journals, since the data are both statistically significant and entirely unexpected."

While I don't disagree with the second half of his statement, he is wrong when it comes to stochastic variance. Any ecologist knows that ecological systems are difficult to replicate and rarely behave exactly the same way twice, and it's impossible to control each and every variable. We therefore expect noise in the data, in the form of observation uncertainty and/or process uncertainty (Hilborn and Mangel 1997, The Ecological Detective). We hope that our design is good enough to reduce observation uncertainty and let us say something about the processes that may be acting, but that is not always the case. We can, however, run preliminary experiments to estimate the variance in the observations, and then use a power analysis to determine the sample size needed to detect a given effect size in the face of that observational variance. And if the observational uncertainty is too high, that in itself is useful information.
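Here is a rough sketch of what I mean by that last step. It is a simulation-based power analysis, which is just one of several ways to do it, and the pilot data, the target effect size and the candidate sample sizes are all made-up numbers for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Step 1: a (hypothetical) pilot experiment gives an estimate of the observation variance.
pilot = rng.normal(10.0, 3.0, 15)     # stand-in pilot measurements
sigma_hat = pilot.std(ddof=1)         # estimated observational standard deviation

# Step 2: the smallest effect we would care about detecting (an assumption, not an estimate).
target_effect = 1.5
alpha = 0.05
n_sims = 5000

# Step 3: simulate experiments at candidate sample sizes and record how often
# a two-sample t-test detects the effect -- i.e., the statistical power.
for n in (10, 20, 40, 80):
    hits = sum(
        stats.ttest_ind(rng.normal(target_effect, sigma_hat, n),
                        rng.normal(0.0, sigma_hat, n)).pvalue < alpha
        for _ in range(n_sims)
    )
    print(f"n = {n:3d} per group -> estimated power = {hits / n_sims:.2f}")

If the estimated power at a feasible sample size is dismal, that tells you up front that the observational variance will swamp the effect you are looking for.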

So just because noise swamps the effect sizes in some experiments doesn't mean that "a lot of scientific data are nothing but noise." This is, to my mind, an example of publication bias in the media: the tendency toward exaggeration and the fantastical, because after all, that's what makes a good story.

I think the responsibility for doing exact and careful science lies with the scientist and his/her peers. Frankly, I don't think there is anything wrong with the scientific method; the problem lies with poor science, poor design and poor reviewers. As Gould suggests, we as scientific peers must also act as "watchdogs to debunk the authoritarian form of the empiricist myth."

In the end, Francis Bacon got it right when he said,

"Idols are the profoundest fallacies of the mind of man. Nor do they deceive in particulars...but form a corrupt and crookedly-set predisposition of the mind; which doth, as it were, wrest and infect all the anticipations of the understanding. For the mind of man...is so far from being like a smooth, equal and clear glass, which might sincerely take and reflect the beams of things, according to their true incidence; that it is rather like an enchanted glass, full of superstitions, apparitions, and impostures."

And ultimately this is what led to the Arsenic Debacle. The researchers were hell-bent on proving their hypothesis instead of trying to disprove it. Thanks to watchdogs like Rosie Redfield, Althea Andreadis and Alex Bradley, they were held accountable. But it was a great reminder to the rest of us not to fall prey to our own superstitions.

The article ends with Lehrer saying,

"We like to pretend that our experiments define the truth for us. But that's often not the case. Just because an idea is true doesn't mean it can be proved. And just because an idea can be proved doesn't mean it's true. When the experiments are done, we still have to choose what to believe."

In the end, as a Bayesian, I think that we can only use empirical data to calculate the probability that an idea is true.
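To make that concrete with a toy example: Bayes' rule turns a prior probability that an idea is true, plus the evidence from an experiment, into an updated probability. The prior, the power and the false-positive rate below are invented purely for illustration.

# A toy Bayesian update: how far should one "significant" result move our belief
# that an idea is true? All the numbers here are invented for illustration.
prior = 0.10     # prior probability the idea is true
power = 0.80     # P(significant result | idea is true)
alpha = 0.05     # P(significant result | idea is false), the false-positive rate

# Bayes' rule: P(true | significant) = P(significant | true) * P(true) / P(significant)
evidence = power * prior + alpha * (1 - prior)
posterior = power * prior / evidence
print(f"P(idea is true | one significant result) = {posterior:.2f}")   # about 0.64

One significant experiment makes the idea more probable, but nowhere near certain; that is all the data can do for us.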

7 comments:

KateClancy said...

Wow, fantastic post. I really appreciated your perspective and your critical evaluation of the Lehrer piece. Did you see his follow-up on his Wired blog? I thought it was useful. I like his writing and what he was trying to do, but I also thought what you were saying -- and how you used some nice literature to support it -- were important.

Ms.PhD said...

I agree, great post, and one of my all-time favorite subjects.

I might point out, though, that "entirely unexpected" results are NOT the most likely to get published. There has to be either some pre-existing expectation, or some pre-existing controversy, for anything to get published. Things that come completely out of left field are usually not taken seriously. Which is too bad, because we usually learn years later that they had the potential to promote huge advances, if only anyone had been more open-minded at the time.

Carlo said...

Great post! There's a lot not to like about that last paragraph by Lehrer. As someone who prefers the Popperian approach to scientific testing, I don't think anything can be 'proved'. Conveniently, this avoids the issue of early data acting as 'proof' rather than confirmatory evidence of a hypothesis, which can be refuted later.

Coincidentally, have you come across any serious criticism of the arsenic paper in the professional literature? It's kind of surprising that despite how down I was about the paper from the beginning (trusting the insights of Redfield et al.) my impression was that most people here at the NIH loved it. Some of my fellow postdocs even seemed miffed that I was so cynical about it. That's what happens when you've got very few evolutionary people in a sea of clinicians and geneticists, I suppose ;-p

unknown said...

@KBHC
I haven't seen his follow-up blog - could you post the link here so folks can read it?

@MsPhD
Good point - conceptual basis is everything. I was assuming that was the case - but you know what happens you assume...

@Carlo
I have not come across any official journal-published criticism but presumably that has to do with when it was published, ie in the last month. The editors of Science were remiss in not giving the opposition voice at the same time they published the work. Rosie has written a letter to Science - no idea if they will publish it though. I do think that her blog and others constitute professional literature. Wow - I am surprised that folks at NIH loved it. Good for you for holding to your criticism!

Sue Ann Bowling said...

There's another source of bias, though I'm not sure it feeds into the decline of apparent results over time. This is that a scientist who differs from the norm finds it very hard to get funding. But then I come at it from getting my undergraduate degree just before plate tectonics was developed. There were obvious problems with pre-plate tectonics geophysics, even to me--you can't have land bridges AND isostacy. But it took a long time before plates were accepted.

Anonymous said...

Excellent post. Probably the best sci blog post I've read from anyone all year.

Nat Blair said...

Great post! Not sure how I missed it, but I'm glad I ended up finding it.

Love that Bacon quote. Perfect!
