A couple of popular articles have challenged how well the scientific method works: David H. Freedman's "Lies, Damned Lies, and Medical Science" in The Atlantic, and Jonah Lehrer's "The Truth Wears Off" in The New Yorker.
Lehrer argues that effect sizes tend to get smaller over time in a broad range of disciplines, including the physical sciences. (More on that later.) According to Lehrer, "The decline effect is troubling because it reminds us how difficult it is to prove anything."
John P. A. Ioannidis argues that in some fields the majority of positive findings are flat-out wrong and are not corrected quickly. Even when they are corrected, not everyone notices. Science may not be as quickly self-correcting as we like to think. Ioannidis found that 11 of 45 highly cited studies had never even been retested. More disturbing, when studies were retested and found to be incorrect, it sometimes took years for the word to spread. (Consider the autism/vaccination link, for example.)
Some conventional explanations for these problems include:
- Regression to the mean: This is probably the easiest to see. Early results are likely to be wrong because they come from small samples with a large standard error. As the sample size increases, the estimate will "regress" toward the true value. Large effects in early work are likely to encourage follow-up.
- Publication bias: Smaller effects, or results that are not significant, can be hard to publish. Even if the work is not rejected by the publisher, the discouraged researcher might decide that it is not interesting. Either way, the bias is to notice large effect measurements and send the smaller ones down the memory hole.
- Researcher bias: Scientific dispassion is as great a myth as the "mad scientist" (see [Mitroff, 74]). Researchers work hard on their pet theories. They make their reputations with interesting and unexpected results. Researchers can select the approaches most likely to get the desired results, and they will look harder at results that don't conform to their expectations. The problem is made worse by financial incentives, publish-or-perish pressure, and flexible research designs.
- Population or environmental differences: Success in a specific situation may not carry over to another. For example, if a drug, technique, or treatment helps a well-defined group, say people in a given age group who have a specific disease with a specific symptom, then it is likely to be tried with other similar, but different, groups. If it is less effective with the new group, those numbers will be used in the overall result, reducing the measured effect size. This wouldn't explain the axial vector coupling referred to by Lehrer, but I challenge that example anyway.
- Chance (and publication bias): Significance at the p = 0.05 level means that in 1 of 20 experiments, a true null will still appear to have a significant effect.
- Significance chasing: "If you torture the data long enough, it will confess." (Attributed to Ronald Coase.) If the originally planned experiment shows no statistically significant result, one works the data, removes outliers, and scans for other effects. This is especially problematic when the research approach and objectives are flexible, and it plays to the strong incentives of academics to publish.
- Large data sets: This often goes with data mining and with statistically significant but practically insignificant effects. Paul Meehl once demonstrated how easy it is to find correlations in a *large* data set. Large samples can detect much smaller effects.
(See also http://scienceblogs.com/pharyngula/2010/12/science_is_not_dead.php, from which I've adapted some of these ideas.)
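Two of these mechanisms, regression to the mean and publication bias, are easy to demonstrate together in a small simulation. The sketch below is illustrative only: the true effect size, sample sizes, and significance cutoff are made-up parameters. Underpowered pilot studies are "published" only if they clear the significance bar, so the published estimates overstate a modest true effect; larger replications, reported unconditionally, regress toward the truth:

```python
import random
import statistics

random.seed(1)

TRUE_EFFECT = 0.2  # hypothetical small real effect, in standard-deviation units

def study(n):
    """Estimated effect from one study with n subjects (outcome sd = 1)."""
    return statistics.mean(random.gauss(TRUE_EFFECT, 1.0) for _ in range(n))

# Many small pilot studies, but only the "impressive" ones get published:
# a pilot must exceed 1.96 standard errors (roughly p < 0.05).
se_pilot = 1.0 / 20 ** 0.5
pilots = [study(20) for _ in range(2000)]
published = [e for e in pilots if e > 1.96 * se_pilot]

# Follow-up replications use a larger sample and are reported regardless.
replications = [study(200) for _ in range(2000)]

print("mean published pilot estimate:", round(statistics.mean(published), 2))
print("mean replication estimate:   ", round(statistics.mean(replications), 2))
```

The published pilots average well above the true effect of 0.2, while the replications cluster around it: a "decline effect" with no decline in the underlying reality.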
But the problems are deeper. In a field where we know the yield is low, that is, where the likelihood of a real positive result is low, most positive findings will be false. That's just Bayes' rule. Consider the typical experiment with 80% power and a 5% significance level. The statistical approach carries an implicit assumption of an uninformative prior: we don't really know whether a positive result is likely or not. What if the field is very low yield, very unlikely to produce a real result? Then if we get a positive result, the most likely explanation is chance rather than a real effect. This works in reverse as well.
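The arithmetic is easy to make concrete. The sketch below uses the stock 80% power / 5% significance design from the paragraph above; the priors are hypothetical "yields" for a field, chosen for illustration:

```python
# Positive predictive value of a "significant" finding, by Bayes' rule.
def ppv(prior, power=0.80, alpha=0.05):
    """P(effect is real | result is significant)."""
    true_pos = power * prior          # real effects, correctly detected
    false_pos = alpha * (1 - prior)   # true nulls that cross p < 0.05
    return true_pos / (true_pos + false_pos)

for prior in (0.5, 0.1, 0.01):
    print(f"prior {prior:4.2f} -> P(real | significant) = {ppv(prior):.2f}")
```

With a 50/50 prior a significant result is about 94% likely to be real, but in a very low-yield field (prior 0.01) it is only about 14% likely: most published positives would be false, exactly as Ioannidis argues.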
Less well-known explanations involve more subtle statistical arguments.
Andrew Gelman: statistical significance testing can be biased toward large results. Statistical significance depends on both sample size and effect size. Especially with small samples and large errors, only a large measured effect will appear statistically significant. Gelman recommends a retrospective power analysis to make explicit the effect size that could be resolved with the final sample.
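Gelman's point reduces to a one-line calculation. In a simplified one-sample design with known noise (a stand-in for a full retrospective power analysis), any estimate that reaches two-sided p < 0.05 must be at least about 1.96 standard errors from zero, so small, noisy studies can only ever report large effects:

```python
import math

def minimum_significant_effect(n, sd=1.0, z=1.96):
    """Smallest estimate that can reach two-sided p < 0.05 in a
    one-sample design with n subjects and outcome standard deviation sd."""
    return z * sd / math.sqrt(n)

for n in (10, 50, 1000):
    print(f"n = {n:4d}: a significant estimate must exceed "
          f"{minimum_significant_effect(n):.3f} sd")
```

At n = 10 nothing smaller than about 0.62 standard deviations can be declared significant, so if the true effect is modest, the significant estimates that do appear are necessarily exaggerated.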
Donald Berry and James Berger: statistical significance testing is overrated. Berry and Berger showed how the inverse probability (significance testing gives the probability of Y given NOT X; the Bayesian question is, given Y, how likely is X?) can be very different from the statistical-significance calculation "1 − p". For very small p, one can underestimate by orders of magnitude the probability that an outlier arose by chance. Berry and Berger recommend applying Bayesian analysis.
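A minimal sketch of this kind of calculation (not Berry and Berger's exact example): take a result just significant at p = 0.05 (z = 1.96), compare its likelihood under a point null against the alternative most favorable to rejecting the null, and apply Bayes' rule with even prior odds. Even in this worst case for the null, its posterior probability stays well above the 5% that "1 − p" intuition suggests:

```python
import math

def normal_pdf(x, mu=0.0, sd=1.0):
    return math.exp(-(((x - mu) / sd) ** 2) / 2) / (sd * math.sqrt(2 * math.pi))

z = 1.96  # a result "just significant" at the two-sided p = 0.05 level

# Likelihood of observing z under the null (mean 0) versus under the
# alternative most favorable to rejection (effect exactly at the observed z).
bayes_factor = normal_pdf(z, mu=0.0) / normal_pdf(z, mu=z)

# With 50/50 prior odds, the posterior probability of the null:
posterior_null = bayes_factor / (1 + bayes_factor)
print(f"P(null | z = 1.96) >= {posterior_null:.3f}")
```

The posterior probability of the null comes out near 0.13, more than twice the 5% a naive reading of the p-value suggests, and that is the most charitable alternative; realistic alternatives leave the null even more probable.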
Most of these issues wouldn't be a problem if the method were, as advertised, self-correcting. But academics don't get famous for repeating others' work, so most work is never replicated. It does seem likely, though, that unreplicated work is also less cited and less important.
In "Why Most Published Research Findings Are False," Ioannidis does a pretty good job of laying out a number of these problems.
Lord Rutherford: “If your experiment requires statistics, you ought to have done a better experiment.”
David H. Freedman, "Lies, Damned Lies, and Medical Science," The Atlantic, November 2010. http://www.theatlantic.com/magazine/archive/2010/11/lies-damned-lies-and-medical-science/8269/
Jonah Lehrer, "The Truth Wears Off: Is There Something Wrong with the Scientific Method?" The New Yorker, December 13, 2010. http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_lehrer
John P. A. Ioannidis, "Why Most Published Research Findings Are False," PLoS Med 2(8): e124, August 2005. doi:10.1371/journal.pmed.0020124. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1182327/
Ian Mitroff, "Norms and Counter-Norms in a Select Group of the Apollo Moon Scientists: A Case Study of the Ambivalence of Scientists," American Sociological Review, vol. 39, August 1974.
Andrew Gelman, Statistical Modeling, Causal Inference, and Social Science (blog), December 13, 2010. http://www.stat.columbia.edu/~cook/movabletype/archives/2010/12/the_truth_wears.html
James O. Berger and Donald A. Berry, "Statistical Analysis and the Illusion of Objectivity," American Scientist, vol. 76, 1988, pp. 159–165.