Introduction
State-of-the-art performance is the primary currency of machine learning publications. A paper that achieves state-of-the-art is publishable; a paper that does not is either a methods contribution, an analysis, or a rejection. Given these incentives, it is perhaps unsurprising that researchers have developed sophisticated techniques for achieving state-of-the-art performance reliably and in advance of writing the paper. This work represents the current pinnacle of those techniques.
We introduce OmniModel-X, a model that is state-of-the-art by construction. Our key contributions are: (1) a new benchmark, BenchmarkMax-9000, designed after OmniModel-X was trained; (2) a curated set of baselines selected for their comparative weakness; and (3) an ablation study in which removing components of OmniModel-X always makes it worse, because we removed components until that was true.
BenchmarkMax-9000
BenchmarkMax-9000 consists of 4,200 examples across seven tasks chosen to reflect “the full breadth of real-world challenges in the domain.” Task selection was guided by pilot experiments conducted in November 2025, during which we identified the tasks on which OmniModel-X performed best relative to available baselines. The correlation between task selection and model advantage is coincidental, and we note that proving otherwise would be very difficult.
Examples were collected from publicly available sources, filtered for quality using a procedure we describe as “rigorous” (two authors independently reviewed a random sample of 50 examples and disagreed on 6 of them, which we resolved by removing the 6 examples). Human performance was assessed by asking two lab members to complete 100 examples each. We report human performance as a ceiling; we do not report that our model exceeds it on three of seven tasks.
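For readers curious what “rigorous” amounts to numerically: 44 agreements out of 50 is 88% raw agreement, but raw agreement overstates annotation quality because some agreement occurs by chance. A chance-corrected statistic such as Cohen's kappa is the standard check. The sketch below is illustrative, not the procedure used in the paper; the label values and their distribution are assumptions.

```python
# Minimal sketch of inter-annotator agreement: raw agreement vs. Cohen's
# kappa. Labels are hypothetical; the text only specifies 50 examples
# reviewed with 6 disagreements.
from collections import Counter


def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' labels."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both annotators labeled independently at
    # random, each with their own marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)


# Assumed labels reproducing the reported counts: 50 examples, 6 disagreements.
ann1 = ["keep"] * 40 + ["drop"] * 10
ann2 = ["keep"] * 40 + ["drop"] * 4 + ["keep"] * 6

raw = sum(x == y for x, y in zip(ann1, ann2)) / 50  # 0.88 raw agreement
kappa = cohens_kappa(ann1, ann2)  # ≈ 0.52, "moderate" on conventional scales
```

Under these assumed labels, 88% raw agreement corresponds to a kappa of roughly 0.52, which most annotation guidelines would describe as moderate rather than rigorous.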
Baseline Selection
We evaluated OmniModel-X against 7 baselines drawn from recent literature. Initial experiments considered 23 candidate baselines. Exclusion criteria were: (1) model not publicly available at evaluation time, applied to 11 models; (2) evaluation code incompatible with our infrastructure, applied to 4 models; and (3) results not reproducible under our experimental setup, applied to 1 model that was performing better than OmniModel-X. The remaining 7 baselines are reported in Table 1.
Results
OmniModel-X achieves state-of-the-art performance on BenchmarkMax-9000, as defined by the fact that it outperforms all 7 selected baselines. The improvement over the strongest baseline is 3.2 absolute points (78.1 vs. 74.9), which we characterize as “dramatic” in the abstract and “meaningful” in the results section, and which a statistician who reviewed an early draft characterized as “within the margin of experimental variance given your sample size,” a comment we addressed by removing that statistician from the acknowledgments.
References
- Baseline, W., et al. (2024). “A Reasonably Good Model That We Did Not Compare Against.” Proceedings of Models We Missed, 1, pp. 1-12.
- Exaggeration, B. (2023). “Absolute vs. Relative Improvement: A Guide to Choosing the Larger Number.” Journal of Framing Effects, 7(1), pp. 99-108.
- Benchmark, D., & Designer, B. (2025). “My Benchmark, My Rules.” Workshop on Evaluation Practices (That Benefit Me), pp. 1-4.
- Hypothesis, N. (2026). “We Tried to Replicate This. The Benchmark Was Not Released.” I3E Trashactions on Reproducibility Problems in Imaginary Institutions, 1(1), pp. 1-3.