Research Article

We Achieve State-of-the-Art on a Benchmark We Designed Specifically for This Paper

I3E TBE · Volume 1, Issue 1 · Pages 1-9
DOI: 10.I3E/tbe.2026.00203
12 Citations

Editor's Summary

The editors have verified that BenchmarkMax-9000 was created by the same authors three weeks before submission. We consider this a reasonable timeline for establishing a new standard.

Abstract

We present OmniModel-X, a novel architecture that achieves state-of-the-art results on BenchmarkMax-9000, a comprehensive evaluation suite we introduce in this paper. OmniModel-X outperforms all 7 baseline models, which we selected after preliminary experiments in which we tested 23 baselines and reported the 7 that performed least favorably. Our method improves over the strongest reported baseline by 3.2 absolute points, which we describe in the abstract as “dramatically surpassing prior art” and in the conclusion as “a significant leap forward for the field.”

Article

Introduction

State-of-the-art performance is the primary currency of machine learning publications. A paper that achieves state-of-the-art is publishable; a paper that does not is either a methods contribution, an analysis, or a rejection. Given these incentives, it is perhaps unsurprising that researchers have developed sophisticated techniques for achieving state-of-the-art performance reliably and in advance of writing the paper. This work represents the current pinnacle of those techniques.

We introduce OmniModel-X, a model that is state-of-the-art by construction. Our key contributions are: (1) a new benchmark, BenchmarkMax-9000, designed after OmniModel-X was trained; (2) a curated set of baselines selected for their comparative weakness; and (3) an ablation study in which removing components of OmniModel-X always makes it worse, because we removed components until that was true.

BenchmarkMax-9000

BenchmarkMax-9000 consists of 4,200 examples across seven tasks chosen to reflect “the full breadth of real-world challenges in the domain.” Task selection was guided by pilot experiments conducted in November 2025, during which we identified the tasks on which OmniModel-X performed best relative to available baselines. The correlation between task selection and model advantage is coincidental and would be very difficult to prove otherwise.

Examples were collected from publicly available sources, filtered for quality using a procedure we describe as “rigorous” (two authors independently reviewed a random sample of 50 examples and disagreed on 6 of them, which we resolved by removing the 6 examples). Human performance was assessed by asking two lab members to complete 100 examples each. We report human performance as a ceiling; we do not report that our model exceeds it on three of seven tasks.
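The "rigorous" filtering procedure can be sketched in a few lines (a minimal illustration; the function and reviewer names are assumptions, since no filtering code was released):

```python
import random

def rigorous_quality_filter(examples, reviewer_a, reviewer_b, sample_size=50):
    """Review a random sample with two reviewers and delete
    every example they disagree about. Disagreement resolution
    strategy: removal."""
    sample = random.sample(examples, min(sample_size, len(examples)))
    disagreements = {ex for ex in sample if reviewer_a(ex) != reviewer_b(ex)}
    return [ex for ex in examples if ex not in disagreements]
```

Note that the remaining examples outside the 50-example sample are never reviewed at all, which is what makes the procedure scale.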

Baseline Selection

We evaluated OmniModel-X against 7 baselines drawn from recent literature. Initial experiments considered 23 candidate baselines. Exclusion criteria were: (1) model not publicly available at evaluation time, applied to 11 models; (2) evaluation code incompatible with our infrastructure, applied to 4 models; and (3) results not reproducible under our experimental setup, applied to 1 model that was performing better than OmniModel-X. The remaining 7 baselines are reported in Table 1.
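Expressed as code, the three exclusion criteria reduce to one filter (an illustrative sketch; the dictionary keys and the operational reading of criterion 3 are assumptions):

```python
def select_baselines(candidates, our_score):
    """Apply the paper's exclusion criteria in order."""
    kept = []
    for model in candidates:
        if not model["publicly_available"]:  # criterion 1 (11 models)
            continue
        if not model["code_compatible"]:     # criterion 2 (4 models)
            continue
        if model["score"] > our_score:       # criterion 3: "not reproducible" (1 model)
            continue
        kept.append(model)
    return kept
```

The order of the criteria does not matter; criterion 3 is the only one whose threshold moves when OmniModel-X improves.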

Results

OmniModel-X achieves state-of-the-art performance on BenchmarkMax-9000, as defined by the fact that it outperforms all 7 selected baselines. The improvement over the strongest baseline is 3.2 absolute points (78.1 vs. 74.9), which we characterize as “dramatic” in the abstract and “meaningful” in the results section, and which a statistician who reviewed an early draft characterized as “within the margin of experimental variance given your sample size,” a comment we addressed by removing that statistician from the acknowledgments.
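The headline number can be checked directly from the reported scores; the choice of adjective is left to whichever section is being written:

```python
ours, best_baseline = 78.1, 74.9

absolute = ours - best_baseline            # 3.2 absolute points
relative = absolute / best_baseline * 100  # roughly 4.3% relative improvement

print(f"absolute: {absolute:.1f} points, relative: {relative:.1f}%")
```

Reporting whichever of the two numbers is larger is covered in detail by reference 2.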

References

  1. Baseline, W., et al. (2024). “A Reasonably Good Model That We Did Not Compare Against.” Proceedings of Models We Missed, 1, pp. 1-12.
  2. Exaggeration, B. (2023). “Absolute vs. Relative Improvement: A Guide to Choosing the Larger Number.” Journal of Framing Effects, 7(1), pp. 99-108.
  3. Benchmark, D., & Designer, B. (2025). “My Benchmark, My Rules.” Workshop on Evaluation Practices (That Benefit Me), pp. 1-4.
  4. Hypothesis, N. (2026). “We Tried to Replicate This. The Benchmark Was Not Released.” I3E Trashactions on Reproducibility Problems in Imaginary Institutions, 1(1), pp. 1-3.

Author Affiliations

1. Laboratory for Self-Congratulatory Evaluation, Institute of Convenient Baselines
