Research Article

p = 0.049: Statistically Significant, Practically Identical to p = 0.051

I3E TCPS · Volume 1, Issue 1 · Pages 11-22
DOI: 10.I3E/tcps.2026.00178
11 Citations

Editor's Summary

The editors note that this paper’s own key results carry p-values of 0.047 and 0.043. We asked the corresponding author if they were concerned. They said no.

Abstract

Under a true null hypothesis, p-values are uniformly distributed on [0, 1]; under a true effect, the p-value distribution is right-skewed, with mass concentrated near zero. The distribution of honestly reported p-values should therefore be a smooth mixture of these two shapes. We examine 6,400 p-values extracted from published papers in the social and behavioral sciences and document a striking spike just below 0.050 and a corresponding valley just above it: a discontinuity that no true effect distribution can produce, but that a field treating 0.049 and 0.051 as categorically different outcomes can. We term the region [0.045, 0.050] the "significance cliff" and quantify the excess mass it contains.

Article

Introduction

The number 0.05 does not appear in nature. R.A. Fisher proposed it as a convenient, informal point of reference, not as a bright line separating true findings from false ones. The academic community has since treated it as a fundamental constant of the universe, on par with the speed of light but considerably more influential on hiring decisions.

The consequences of this threshold-worship are well documented in the methodology literature and well ignored in the research literature. Among the most dramatic is the distribution of published p-values: rather than reflecting the smooth distribution of true statistical outcomes, published p-values cluster artificially below 0.05, a pattern consistent with selective reporting of significant results, optional stopping when significance is reached, and what certain frank statisticians call “fudging.”

We present the largest analysis to date of this phenomenon, covering 6,400 p-values extracted by automated parsing from 2,100 papers in psychology, management science, and behavioral economics published between 2019 and 2024.
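
The extraction step can be made concrete with a regular-expression pass over article text. The paper does not specify its parsing pipeline; the pattern and helper name below are illustrative assumptions, a minimal sketch of the general approach:

```python
import re

# Matches reports like "p = 0.049" or "p < .01". This simplified,
# assumed pattern is not the paper's actual extraction pipeline.
P_VALUE_RE = re.compile(r"\bp\s*[=<]\s*(0?\.\d+)", re.IGNORECASE)

def extract_p_values(text: str) -> list[float]:
    """Return every p-value reported in `text`, in order of appearance."""
    return [float(m.group(1)) for m in P_VALUE_RE.finditer(text)]

sample = "The effect was significant (p = 0.049), unlike the control (p = .51)."
print(extract_p_values(sample))  # → [0.049, 0.51]
```

A production pipeline would also need to handle exact vs. inequality reports ("p < .05") and rounding conventions, both of which matter when counting mass near a threshold.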

The Significance Cliff

We define the significance cliff as the region of the p-value distribution in the interval [0.040, 0.060], centered on the decision threshold of 0.050. In an unbiased distribution, we would expect this region to be smooth: slightly more mass below 0.05 than above it, reflecting the fact that some published effects are real. In our corpus, the region is not smooth.

The ratio of p-values in [0.045, 0.050] to p-values in [0.050, 0.055] was 4.7:1. The ratio expected under our null model of slight publication bias without active p-manipulation was 1.3:1. The gap between observed and expected constitutes what we call the “excess cliff mass,” and it represents approximately 340 p-values that should not exist at their reported values given their proximity to the threshold.
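
The ratio just quoted is a simple bin-count comparison. A minimal sketch, assuming half-open bins so that a p-value of exactly 0.050 falls only in the upper bin (the paper writes closed intervals, so this convention is our assumption):

```python
def cliff_ratio(p_values, lo_bin=(0.045, 0.050), hi_bin=(0.050, 0.055)):
    """Ratio of p-value counts just below vs. just above the 0.05 threshold.

    Bin edges follow the paper's definition of the significance cliff;
    intervals are half-open [lo, hi), so p = 0.050 counts only above.
    """
    below = sum(lo_bin[0] <= p < lo_bin[1] for p in p_values)
    above = sum(hi_bin[0] <= p < hi_bin[1] for p in p_values)
    return below / above if above else float("inf")

ps = [0.049, 0.047, 0.046, 0.049, 0.051, 0.2]
print(cliff_ratio(ps))  # 4 below vs. 1 above → 4.0
```

An unbiased corpus would push this ratio toward the smooth-distribution benchmark (the paper's null model gives 1.3:1); values far above it indicate excess mass piled just under the threshold.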

We also examined a subsample of papers reporting p = 0.049, the single most common “significant” p-value in our corpus (n = 287 papers). In 43% of these papers, the reported p-value was derived from a test described in a footnote rather than the methods section, a finding we note without comment.

Null Hypotheses That Were Not Rejected

Of the 6,400 p-values in our corpus, 94.3% are below 0.05. This is not the distribution of science. This is the distribution of published science, which is a selection of science for results that reached a threshold. The 5.7% of reported p-values above 0.05 appear in papers framed as “null results,” a genre that exists but is treated in the literature as a curiosity, like a left-handed crab.

We estimate, using a mixture model fitted to the observed distribution, that the true proportion of null results in the tested hypotheses underlying our corpus is between 31% and 47%. The published proportion is 5.7%. The difference has a name: the file drawer problem. The file drawer is very full.
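
The paper's mixture model is not specified. A common, simpler stand-in for estimating the null proportion is a Storey-style tail estimator, which exploits the fact that null p-values are uniform on [0, 1] while p-values from true effects rarely land above 0.5; the function below is a sketch of that idea, not the paper's method:

```python
def estimate_null_proportion(p_values, cutoff=0.5):
    """Storey-style estimate of the share of tested hypotheses that are null.

    Under the null, p-values are uniform on [0, 1], so the region above
    `cutoff` holds a fraction (1 - cutoff) of all null p-values and
    almost none from true effects. Scaling the tail count up by that
    fraction estimates the total number of nulls.
    """
    tail = sum(p > cutoff for p in p_values)
    return min(1.0, tail / (len(p_values) * (1 - cutoff)))

# Toy corpus: four large "null-looking" p-values plus six smaller ones.
ps = [0.61, 0.72, 0.83, 0.91, 0.04, 0.01, 0.02, 0.003, 0.2, 0.3]
print(estimate_null_proportion(ps))  # → 0.8
```

Applied to a published corpus like this one, the estimator is badly biased downward, since the tail above 0.5 has already been emptied by the file drawer; that is precisely why the published 5.7% understates the modeled 31-47%.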

References

  1. Fisher, R. A. (1925). “Statistical Methods for Research Workers.” Oliver & Boyd. (He did not mean for this to happen.)
  2. Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). "P-Curve: A Key to the File-Drawer." Journal of Experimental Psychology: General, 143(2), pp. 534-547. (A real paper. The problem persists.)
  3. Barely, B. (2023). “0.049 vs. 0.051: An Empirical Study of Career Outcomes.” Journal of Threshold Anxiety, 9(1), pp. 1-15.
  4. Hypothesis, N. (2026). “These Results Are Significant. Barely.” I3E Trashactions on Catastrophic P-value Shopping, 1(1), pp. 23-23.

Author Affiliations

1. Threshold Studies Group, Institute for Barely Significant Findings
