Review

A Survey of Negative Results Nobody Wanted to Publish

IEEE Trashactions · Volume 1, Issue 1 · Pages 25-31
DOI: 10.1234/trashactions.2026.003
3 Citations

Editor's Summary

Egative and Esult present the first large-scale empirical account of academic failure, which the editors found refreshing and slightly depressing. We note that this paper was rejected twice before acceptance here, and that we have decided to treat this as thematically appropriate rather than alarming.

Abstract

We present a systematic survey of 847 unpublished experimental results collected from researchers who were willing to admit failure, conditional on anonymity, co-authorship, and the understanding that this survey would probably not be cited either. The results span fourteen subfields and confirm the existence of the file drawer problem at a scale previously theorized but not measured, primarily because the measurement itself kept failing. We group results by outcome severity, from “borderline null” to “actively contradicted our hypothesis” to the category we term “do not speak of this again.” We argue that negative results are undervalued not because researchers are dishonest but because the incentive structure rewards novelty and significance in ways that make reporting failure economically irrational, which is itself a negative result.

Article

Introduction

The file drawer problem, first described by Rosenthal in 1979 (citation probably accurate), refers to the tendency of null or negative results to remain unpublished in researchers’ file drawers, metaphorically speaking, or in folders labeled “old_analysis_final_FINAL_v3_donotdelete” on university servers, literally speaking. The consequences of this problem for the scientific literature are well-established: published findings overestimate effect sizes, replications fail, and textbooks continue to teach effects that have not replicated since the original graduate student who ran the study left academia to work at a software company in 2011.

This paper addresses the problem by publishing the negative results directly. We collected 847 failed experiments through a survey distributed to researchers via email, conference hallway conversations, and one remarkably candid thread in an academic Slack workspace we were not supposed to have access to. We present these results here, organized taxonomically, with the researchers’ permission and, in many cases, their lingering sense of shame.

Data Collection

The survey asked researchers to describe experiments that failed, the degree to which failure was expected in retrospect, the emotional impact of the failure on a 7-point Likert scale labeled from “mild disappointment” to “existential dissolution,” and whether they had attempted to publish the result elsewhere before this survey. Response rate was 4.2%, which we attribute to researchers either not wanting to revisit the experience or not having opened their university email since March 2024.

Of the 847 experiments collected, 312 (36.8%) were classified as “borderline null,” meaning the effect was in the predicted direction but below significance after any honest accounting of multiple comparisons. 291 (34.4%) were classified as “actively contradicted hypothesis,” meaning the effect was present but in the opposite direction. 244 (28.8%) fell into the category we term “procedurally catastrophic,” a classification covering experiments where the equipment failed, the dataset was corrupted, the model did not converge, or, in one memorable case, the server room flooded during training and the researchers “just sort of gave up.”
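The tallies above can be sanity-checked with a short script. This is only a sketch: the counts and category labels are taken directly from the text, and the dictionary name `counts` is our own.

```python
# Sanity-check the classification tallies reported in the survey.
# Counts and labels are taken directly from the text above.
counts = {
    "borderline null": 312,
    "actively contradicted hypothesis": 291,
    "procedurally catastrophic": 244,
}

total = sum(counts.values())
assert total == 847, "categories should partition all 847 experiments"

for label, n in counts.items():
    pct = 100 * n / total
    print(f"{label}: {n} ({pct:.1f}%)")
# → borderline null: 312 (36.8%)
# → actively contradicted hypothesis: 291 (34.4%)
# → procedurally catastrophic: 244 (28.8%)
```

The three categories are exhaustive and mutually exclusive, so the counts sum exactly to the 847 experiments collected, and the reported percentages (36.8%, 34.4%, 28.8%) follow directly.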

Analysis

The mean crying score (measured retrospectively on the Likert scale described above) was 4.8 out of 7, which translates approximately to “left the lab early that day and did not respond to Slack messages.” Crying scores were significantly higher for experiments conducted in years 3 and 4 of doctoral programs, which the authors attribute to “the point at which the cost of continuing exceeds the perceived probability of success but the sunk cost is too large to abandon,” a dynamic the field has studied extensively in humans, reinforcement learning agents, and venture capital.

The most common reason given for not attempting publication was “no one would publish this,” cited by 67.3% of respondents. The second most common was “I don’t want my advisor to know,” cited by 31.2%. The third was “the experiment was so bad I have not described it to anyone,” cited by 18.7%. (Respondents could cite more than one reason, which is why these percentages sum to more than 100%.)

Conclusion

We publish these 847 results in the hope that they will be useful to someone. Based on current download statistics, they will be useful to approximately 42 people, which is more than the number who would have seen them had they remained in the file drawer, though fewer than the number who saw them in the file drawer format, which was one person, who forgot about them.

References

  1. Reviewer #2 (2024). “Your Paper Is Terrible.” Journal of Rejected Submissions, 1(1), pp. 1-1. https://doi.org/10.0000/rejected.2024.001
  2. Nobody, N. (2023). “I Didn’t Read This Either.” Proceedings of Things I Skimmed, 42, pp. 404-404.
  3. Someone, A., et al. (2022). “Related Work We Didn’t Cite On Purpose.” IEEE Trashactions, 1(1), pp. 1-99.
  4. Rosenthal, R. (1979). “The File Drawer Problem and Tolerance for Null Results.” Psychological Bulletin, 86(3), pp. 638-641. (Probably the one citation in this paper that is real.)
  5. Egative, N. (2022). “I Tried This Before and It Also Didn’t Work.” Unpublished Manuscript, pp. 1-22. Available from the author upon request and sufficient emotional preparation.

Author Affiliations

1. Department of Imaginary Sciences, University of Nowhere
