Introduction
The number 0.05 does not appear in nature. R.A. Fisher offered it as a convenient convention for informal reference, not as a bright line distinguishing true from false findings. The academic community has since treated it as a fundamental constant of the universe, on par with the speed of light but considerably more influential on hiring decisions.
The consequences of this threshold-worship are well documented in the methodology literature and well ignored in the research literature. Among the most dramatic is the distribution of published p-values: rather than reflecting the smooth distribution of true statistical outcomes, published p-values cluster artificially below 0.05, a pattern consistent with selective reporting of significant results, optional stopping when significance is reached, and what certain frank statisticians call “fudging.”
We present the largest analysis to date of this phenomenon, covering 6,400 p-values extracted by automated parsing from 2,100 papers in psychology, management science, and behavioral economics published between 2019 and 2024.
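The extraction step can be sketched as a simple pattern match over paper text. The regular expression and sample sentence below are illustrative assumptions, not the actual parsing pipeline used for the corpus.

```python
import re

# Illustrative pattern for reported p-values such as "p = .049" or
# "p < 0.001". This is a sketch, not the authors' extraction code.
P_PATTERN = re.compile(r"p\s*(?:=|<)\s*(0?\.\d+)", re.IGNORECASE)

def extract_p_values(text):
    """Return every p-value reported in a block of text, as floats."""
    return [float(match) for match in P_PATTERN.findall(text)]

sample = "The effect was significant (p = .049); the control was not (p = 0.31)."
print(extract_p_values(sample))  # [0.049, 0.31]
```

A real pipeline would also need to handle one-sided tests, inequality chains, and p-values buried in tables, which is where automated parsing earns its error bars.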
The Significance Cliff
We define the significance cliff as the region of the p-value distribution in the interval [0.040, 0.060], centered on the decision threshold of 0.050. In an unbiased distribution, we would expect this region to be smooth: slightly more mass below 0.05 than above it, reflecting the fact that some published effects are real. In our corpus, the region is not smooth.
The ratio of p-values in [0.045, 0.050] to p-values in [0.050, 0.055] was 4.7:1. The ratio expected under our null model of slight publication bias without active p-manipulation was 1.3:1. The gap between observed and expected constitutes what we call the “excess cliff mass,” and it represents approximately 340 p-values that should not exist at their reported values given their proximity to the threshold.
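The cliff ratio and excess cliff mass are straightforward to compute from any list of p-values. The sketch below uses simulated uniform draws as a stand-in corpus (the paper's 6,400 extracted values are not reproduced here); only the 1.3:1 expected ratio is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in corpus: uniform p-values illustrate the mechanics only.
# A real published corpus would pile up just below 0.05.
p_values = rng.uniform(0.0, 1.0, 6400)

# Counts in the two bins flanking the threshold.
just_below = np.sum((p_values >= 0.045) & (p_values < 0.050))
just_above = np.sum((p_values >= 0.050) & (p_values < 0.055))
cliff_ratio = just_below / just_above

# Excess cliff mass: observed just-below count minus the count the
# null model's 1.3:1 ratio would predict from the just-above bin.
excess_mass = just_below - 1.3 * just_above
```

On the uniform stand-in, the ratio hovers near 1:1 and the excess mass near zero, which is exactly the point: the 4.7:1 ratio in the corpus is what a distribution looks like after someone has leaned on it.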
We also examined a subsample of papers reporting p = 0.049, the single most common “significant” p-value in our corpus (n = 287 papers). In 43% of these papers, the reported p-value was derived from a test described in a footnote rather than the methods section, a finding we note without comment.
Null Hypotheses That Were Not Rejected
Of the 6,400 p-values in our corpus, 94.3% are below 0.05. This is not the distribution of science. This is the distribution of published science, which is a selection of science for results that reached a threshold. The 5.7% of reported p-values above 0.05 appear in papers framed as “null results,” a genre that exists but is treated in the literature as a curiosity, like a left-handed crab.
We estimate, using a mixture model fitted to the observed distribution, that the true proportion of null results in the tested hypotheses underlying our corpus is between 31% and 47%. The published proportion is 5.7%. The difference has a name: the file drawer problem. The file drawer is very full.
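The mixture model itself is not specified above, but the underlying idea can be illustrated with a Storey-style estimator of the null proportion: under the null hypothesis p-values are uniform, so the density of p-values above a cutoff λ estimates π0. The simulated corpus below is an assumption, and unlike the paper's model this sketch does not correct for publication selection, so it needs the full unselected distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated unselected corpus: 40% true nulls (uniform p-values) and
# 60% real effects (right-skewed p-values). Illustrative only.
n = 10_000
null_p = rng.uniform(0.0, 1.0, int(0.4 * n))
alt_p = rng.beta(0.5, 8.0, n - len(null_p))  # mass piled near zero
p = np.concatenate([null_p, alt_p])

def estimate_pi0(p_values, lam=0.5):
    """Storey-style estimate of the null proportion: p-values above
    lam come almost entirely from true nulls, whose density is flat."""
    return np.mean(p_values > lam) / (1.0 - lam)

pi0_hat = estimate_pi0(p)  # close to the simulated 0.40
```

Applied to a published corpus, this estimator would be badly biased downward, because the file drawer removes the flat part of the distribution before anyone can measure it; that is the selection the mixture model is meant to undo.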
References
- Fisher, R. A. (1925). “Statistical Methods for Research Workers.” Oliver & Boyd. (He did not mean for this to happen.)
- Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). “P-Curve: A Key to the File-Drawer.” Journal of Experimental Psychology: General, 143(2), pp. 534-547. (A real paper. The problem persists.)
- Barely, B. (2023). “0.049 vs. 0.051: An Empirical Study of Career Outcomes.” Journal of Threshold Anxiety, 9(1), pp. 1-15.
- Hypothesis, N. (2026). “These Results Are Significant. Barely.” I3E Trashactions on Catastrophic P-value Shopping, 1(1), pp. 23-23.