BEHAVIOURAL BIAS IN EMPIRICAL RESEARCH: Evidence from top economics journals

There is a strong behavioural bias in empirical research, according to a new study by Yanos Zylberberg to be presented at the Royal Economic Society's annual conference 2015. Most academic disciplines use a statistical threshold to discriminate between accepted and rejected hypotheses – and Zylberberg's work indicates that researchers respond to this threshold and modify their results to get published.

Hypothesis testing

A standard empirical research process often consists of hypothesis testing. Consider a researcher – call him Albert – who wants to test a hypothesis, for example, that foxes prefer organic hens. Typically, Albert will collect data, examine the empirical correlation in his sample (how many times organic hens are eaten compared with non-organic ones) and extract one important statistic: the p-value.

The p-value is the probability of observing an empirical correlation at least as strong as the one in the sample if the initial hypothesis is false – for example, if foxes do not prefer organic hens. With a high p-value, there is a non-negligible probability that the empirical observation is due to luck and not to a true relation. In contrast, with a low p-value, it is very unlikely that the empirical observation is due to luck alone.
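As a toy illustration of this idea (not part of the study), a p-value for Albert's fox hypothesis can be estimated by simulation: assume, under the null hypothesis of no preference, that each eaten hen is organic with probability 0.5, and count how often pure chance produces a result at least as extreme as the observed one. The function name and numbers below are purely illustrative.

```python
import random

def p_value_by_simulation(organic_eaten, total_eaten, trials=100_000, seed=0):
    """Estimate the probability of seeing at least `organic_eaten` organic
    hens among `total_eaten` kills, under the null hypothesis that foxes
    have no preference (each kill is organic with probability 0.5)."""
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(trials):
        # one simulated sample under the null: no preference at all
        simulated = sum(rng.random() < 0.5 for _ in range(total_eaten))
        if simulated >= organic_eaten:
            at_least_as_extreme += 1
    return at_least_as_extreme / trials

# Suppose 60 of 100 eaten hens were organic: the estimated p-value is
# roughly 0.03, i.e. such an imbalance is unlikely to be luck alone.
print(p_value_by_simulation(60, 100))
```

Since this estimate falls below the 5% threshold, Albert would conclude in favour of his hypothesis; with, say, 55 organic hens out of 100, the p-value would sit well above 5% and the result would be deemed insignificant.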

For Albert, this p-value is extremely important: the academic community has agreed on a threshold below which a hypothesis is considered true, or an effect is considered significant. If the p-value is above 5%, Albert will not be able to conclude that foxes prefer organic hens, and his research will probably not get published. If, instead, the p-value is below 5%, Albert will conclude that foxes are attracted to organic hens, and his research will have a substantially better chance of being published.

Behavioural bias

Such selection on p-values distorts Albert's incentives. If, during his analysis, Albert is first confronted with a p-value of 13%, he may be tempted to try variations of his initial analysis, for example, analyse only part of his data or adopt different specifications, until he gets the desired result and reaches a value below 5%.

This behaviour has been referred to as data fishing, data dredging or p-hacking. Importantly, if all researchers behaved like Albert, there would not only be too many p-values below 5% in academic journals; there would also be far too few p-values just above the 5% threshold.
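The mechanism can be sketched with a toy simulation (again illustrative, not the study's method). Below, a "p-hacking" researcher re-runs variations of an analysis with no true effect until the p-value drops below 5% or patience runs out; the resulting distribution piles up below 5% and empties out just above it. All names and parameters are invented for the sketch, and a normal approximation stands in for a proper t-test.

```python
import math
import random

def one_specification(rng, n=30):
    """One 'variation' of the analysis: test whether a sample mean differs
    from zero when the true effect is zero; return an approximate
    two-sided p-value (normal approximation)."""
    sample = [rng.gauss(0, 1) for _ in range(n)]
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    z = mean / (sd / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def p_hacked(rng, max_tries=10):
    """Albert's temptation: keep trying new specifications until the
    p-value falls below 5%, then stop and report that one."""
    p = one_specification(rng)
    tries = 1
    while p >= 0.05 and tries < max_tries:
        p = one_specification(rng)
        tries += 1
    return p

rng = random.Random(1)
pvalues = [p_hacked(rng) for _ in range(2000)]
share_below_5 = sum(p < 0.05 for p in pvalues) / len(pvalues)
share_just_above = sum(0.05 <= p < 0.15 for p in pvalues) / len(pvalues)
print(f"below 5%: {share_below_5:.2f}, just above (5-15%): {share_just_above:.2f}")
```

With no true effect, an honest researcher would report p < 5% only about 5% of the time; here the share below 5% is many times larger, while the band just above the threshold is depleted – exactly the shape of distortion the study looks for in published results.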

This study finds strong empirical evidence for this shortage of just-insignificant p-values. The researchers collect the p-values published between 2005 and 2011 in three of the most prestigious journals in economics (American Economic Review, Journal of Political Economy and Quarterly Journal of Economics) and show a strong empirical regularity: the distribution of p-values has a two-humped camel shape, with a first hump for high p-values, a valley of missing p-values between 10% and 25%, and a second hump for p-values slightly below 5% (see figure 1).

Basically, there are misallocated p-values (20% of the p-values are missing – roughly the size of the valley between the two humps) that should have fallen between 10% and 25%, and that can instead be retrieved below 5%.

The researchers relate this misallocation to some authors' and papers' characteristics. They find that the presence of a misallocation correlates with incentives to get published: the misallocation is lower for older and tenured professors compared with younger researchers.

The misallocation also correlates with how important the empirical result is for the publication prospects. In theoretical papers, the empirical analysis is less crucial, and, indeed, the misallocation is much lower. Finally, while the majority of researchers now use stars to highlight the significance of their results (p-values below 5%), some do not, and the misallocation is much lower for the latter than for the former.

These results point to the existence of behavioural biases in empirical research induced by result-driven carrots (publication, media attention…). Such bias in Albert's case is quite innocuous, but hypothesis testing is widely used in psychology, medicine, political science, biology or economics to discuss the efficiency of a new drug, the effect of a policy, etc. Recent contributions discuss new practices for the scientific community that would either restrict the leeway in hypothesis testing, or break down the incentives to focus on positive results (see references below).


Brodeur, A., Le, M., Sangnier, M. and Zylberberg, Y.: Star Wars: The Empirics Strike Back

Nosek, B. A., Spies, J. and Motyl, M.: 2012, Scientific Utopia: II – Restructuring Incentives and Practices to Promote Truth Over Publishability, Perspectives on Psychological Science

Miguel, E., Camerer, C., Casey, K., Cohen, J., Esterling, K., Gerber, A., Glennerster, R., Green, D., Humphreys, M., Imbens, G. et al.: 2014, Promoting transparency in social science research, Science 343(6166), 30-31.