AI Benchmark Debate Heats Up Over Misleading Results

A recent dispute has erupted among AI labs over benchmarking practices, with OpenAI employees accusing xAI of publishing misleading benchmark results for its latest model, Grok 3. The debate centers on how benchmarks are reported and interpreted.

In a graph it published, xAI claimed that its Grok 3 Reasoning Beta variant outperformed OpenAI's best available model, o3-mini-high, on AIME 2025, a set of challenging math competition questions. However, OpenAI employees pointed out that the graph omitted o3-mini-high's score at "cons@64" (consensus@64), a scoring method that gives a model 64 attempts at each problem and takes its most frequent answers as its final ones.
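To make the metric concrete, here is a minimal sketch of how a cons@64-style score might be computed. The `model.generate` interface and the `(prompt, gold)` problem pairs are hypothetical stand-ins, and real evaluation harnesses typically normalize answers (stripping whitespace, canonicalizing numbers) before voting:

```python
from collections import Counter

def consensus_answer(samples: list[str]) -> str:
    """Return the most frequent answer among the sampled generations."""
    return Counter(samples).most_common(1)[0][0]

def score_cons_at_k(model, problems: list[tuple[str, str]], k: int = 64) -> float:
    """Fraction of problems where the majority answer over k samples is correct."""
    correct = 0
    for prompt, gold in problems:
        # Sample k independent answers, then grade only the consensus answer.
        samples = [model.generate(prompt) for _ in range(k)]
        if consensus_answer(samples) == gold:
            correct += 1
    return correct / len(problems)
```

By contrast, an "@1" score grades a single sample per problem, which is why the two numbers for the same model can diverge sharply.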

Because cons@64 tends to boost benchmark scores considerably, omitting it from a graph can make one model appear to surpass another when it does not. Some observers shared a more transparent chart showing every model's performance at cons@64. xAI co-founder Igor Babushkin countered that OpenAI has published similarly misleading benchmark charts in the past.

Whoever is right, the episode highlights the need for clearer explanations of AI benchmarks and their limitations. As AI researcher Nathan Lambert noted, perhaps the most important metric remains a mystery: the computational (and monetary) cost each model incurred to achieve its best score. That gap underscores how little most benchmarks communicate about a model's weaknesses as well as its strengths.

Source: https://techcrunch.com/2025/02/22/did-xai-lie-about-grok-3s-benchmarks