
Language model benchmarks widely 'contaminated', study finds

Sat, 28th Feb 2026

NE2NE has published research suggesting widespread benchmark contamination across many commonly used tests for large language models, raising doubts about how well benchmark scores reflect real-world performance.

The study examines whether models were trained on, or indirectly exposed to, benchmark items later used in evaluations. Such contamination can inflate results and distort comparisons between vendors and model families. The issue has become more urgent as enterprises rely on benchmark tables for procurement, model selection, and risk assessment.

Testing method

NE2NE's research team used a cloze-deletion approach adapted from established academic methods. Cloze-style testing removes words from a benchmark item and measures whether a system can reproduce them from the surrounding context; unusually accurate reconstruction suggests the item, or something very close to it, appeared in the model's training data. The method builds on techniques such as overlap analysis and benchmark probing described in prior work, including Brown et al. (2020) and Deng et al. (2023).
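
The sketch below illustrates the general idea of a cloze-deletion probe. It is not NE2NE's implementation: the function names and the `query_model` callable are placeholders for whatever completion API is under test, and the masking ratio and matching rule are illustrative assumptions.

```python
import random

def make_cloze(text, mask_ratio=0.15, seed=0):
    """Remove a fraction of the words from a benchmark item and remember them."""
    rng = random.Random(seed)
    words = text.split()
    n_hidden = max(1, int(len(words) * mask_ratio))
    hidden = sorted(rng.sample(range(len(words)), n_hidden))
    answers = [words[i] for i in hidden]
    for i in hidden:
        words[i] = "____"                      # blank out the hidden word
    return " ".join(words), answers

def contamination_score(benchmark_item, query_model):
    """Fraction of removed words the model reproduces.

    Scores well above what chance or general knowledge would explain
    hint that the item (or a close derivative) was seen during training.
    """
    prompt, answers = make_cloze(benchmark_item)
    completion = query_model(
        "Fill in each blank in the following text:\n" + prompt
    ).lower()
    hits = sum(1 for w in answers if w.lower().strip(".,;:!?") in completion)
    return hits / len(answers)
```

A real study would average such scores over many items per benchmark and compare them against a baseline of unseen control items, rather than reading anything into a single probe.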

The analysis covered 4,590 model-question pairs across 17 frontier language models and 18 public benchmarks. NE2NE reported an overall contamination rate of 57.3 percent and described the result as evidence that some benchmarks may no longer serve as clean measures of generalisation.

Procurement pressure

Benchmark scores have become a shorthand for performance in vendor marketing and corporate decision-making. Many organisations also treat published benchmark results as directly comparable across models, even when suppliers differ in training data practices, fine-tuning steps, and evaluation settings.

NE2NE argues that repeated reuse of popular datasets increases the likelihood that benchmark content circulates through training pipelines. That can happen directly when benchmark items appear in training corpora, or indirectly through derivatives, community reproductions of test sets, or materials that closely resemble them.

Steven Pappadakes, NE2NE's founder and CEO, linked the findings to the way benchmarks are used in competitive model rankings.

"Our industry relies heavily on benchmark rankings to compare models, but benchmarks were never intended to become static scoreboards," Pappadakes said. "If contamination is present, organizations may be making high-stakes decisions based on metrics that no longer reflect real-world capability. This highlights the need for more transparent evaluation practices and continuous reassessment of how benchmarks are used."

Industry implications

The research adds to a broader debate over whether benchmark-driven competition encourages optimisation for test performance rather than robust behaviour in production. Risk teams in regulated sectors are also scrutinising evaluation methods more closely as language models move into customer-facing workflows, internal knowledge tools, and decision support.

Contamination can also have practical consequences for buyers. If benchmark scores overstate general performance, organisations may underestimate failure rates in areas such as customer service, document drafting, coding assistance, and compliance-related analysis. That can affect roll-out plans, human oversight requirements, and remediation costs when models behave unpredictably.

The study also points to a transparency gap in the market: many frontier models still lack publicly documented contamination analyses. This makes it harder for third parties to compare results across providers or to separate genuine generalisation gains from familiarity with benchmark content.

Benchmark lifecycle

Benchmark designers already rotate some datasets, retire older tasks, and introduce new test formats. Even so, long-lived benchmarks remain central to public model leaderboards and headline performance claims. The research renews attention on benchmark lifecycle management, including how often tasks should change and how test sets can be protected from leaking into training data.

Stronger auditing could include routine overlap checks between training corpora and evaluation sets, more detailed disclosures about data sources, and independent replication of results. Some researchers also advocate private test sets, sealed evaluations, or more dynamic task generation. Others emphasise measuring robustness, calibration, and performance under distribution shifts rather than relying on a narrow set of static tests.
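
One common form of overlap check is an exact word n-gram match between evaluation items and training documents. The rough sketch below shows the idea; the function names are illustrative, and the 13-word window follows the filtering window described in Brown et al. (2020) rather than any universal standard.

```python
def ngrams(text, n=13):
    """Set of lower-cased word n-grams for a document or evaluation item."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_overlapping_items(eval_items, training_docs, n=13):
    """Indices of evaluation items sharing at least one n-gram with the corpus."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(eval_items) if ngrams(item, n) & train_grams]
```

At the scale of a full training corpus an in-memory set is impractical, so production audits typically rely on hashing, Bloom filters, or suffix-array indexes, and may relax exact matching to catch paraphrased or lightly edited copies.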

NE2NE is based in Cleveland and was founded in 2021. It works with enterprises and technology providers on deploying automation and data integration tools, and it positions the research as part of an effort to close gaps between published claims and operational outcomes.

The findings add to calls for more rigorous benchmark design and for clearer interpretation of benchmark scores as models and training datasets evolve.