Study finds popular large language model rankings can flip on tiny data changes

New research from MIT and IBM Research shows that leaderboards for large language models on popular crowdsourced platforms can change when only a handful of user ratings are removed, raising questions about how reliable these rankings are for real-world decisions.

Researchers at MIT and IBM Research have found that rankings on widely used large language model comparison platforms are highly unstable, with the identity of the top model sometimes hinging on only a few user ratings out of tens of thousands. On platforms like Arena, which compare models in open-ended conversations based on crowdsourced preferences, removing just 0.003 percent of user ratings is enough to topple the top-ranked model. These leaderboards are closely watched by users trying to choose helpful models and by companies that use high placements in their marketing.

The study examined multiple Arena-style platforms and showed that small perturbations in the data can reliably flip the number one spot. In one case on LMArena, removing just two user ratings out of 57,477 was enough to flip the number one spot from GPT-4-0125-preview to GPT-4-1106-preview, with both removed ratings being matchups where GPT-4-0125-preview lost to much lower-ranked models. Across platforms, the same pattern emerged: in Chatbot Arena with large language model judges, 9 out of 49,938 ratings (0.018 percent) flipped the top spot, in Vision Arena it took 28 out of 29,845 (0.094 percent), and in Search Arena, 61 out of 24,469 (0.253 percent). MT-bench was the only significant outlier, requiring 92 out of 3,355 evaluations, about 2.74 percent, which the researchers attribute to its controlled design with 80 multi-turn questions and expert annotators rather than open crowdsourcing.
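The counts quoted above can be put on a common scale by computing the removed share as removed / total × 100. The short Python sketch below recomputes those shares from the counts given in this article; the dictionary layout and the function name are illustrative only, and a recomputed value may differ from a quoted percentage in the last digit if the source rounded differently or used slightly different counts.

```python
# Counts as quoted in the article: (ratings removed, total ratings).
reported = {
    "LMArena": (2, 57_477),
    "Chatbot Arena (LLM judges)": (9, 49_938),
    "Vision Arena": (28, 29_845),
    "Search Arena": (61, 24_469),
    "MT-bench": (92, 3_355),
}


def removed_share_percent(removed: int, total: int) -> float:
    """Share of ratings removed, expressed as a percentage of the total."""
    return 100.0 * removed / total


# Print platforms from most fragile (smallest share) to most robust.
for name, (removed, total) in sorted(
    reported.items(), key=lambda item: removed_share_percent(*item[1])
):
    share = removed_share_percent(removed, total)
    print(f"{name}: {removed}/{total} = {share:.3f}%")
```

Sorting by share makes the contrast visible at a glance: the crowdsourced platforms all sit well below one percent, while MT-bench needs a share several times larger.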

To uncover these fragile points, the team developed an approximation method that quickly identifies which data points most influence the rankings, then confirms the impact with an exact recalculation. They report that analyzing a dataset of 50,000 ratings takes less than three minutes on a standard laptop.

The underlying issue is not limited to AI benchmarks: in historical NBA data, removing just 17 out of 109,892 games (0.016 percent) was sufficient to change the top-ranked team, which points to weaknesses in the statistical methods used when performance gaps at the top are small.

The researchers stress that the problem they highlight is statistical robustness rather than deliberate manipulation. They suggest mitigations such as letting users express confidence in their choices, filtering out low-quality prompts, screening outliers, or involving mediators. They argue that noise, user error, or outlier judgments should not decide which system is crowned best, and that AI benchmark leaderboards, while useful, remain only an approximate guide that must be complemented with hands-on testing in real workflows.
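The drop-and-refit idea described above (rank ratings by estimated influence, then confirm with an exact recalculation) can be illustrated on a toy example. The Python sketch below is not the researchers' method: it assumes a Bradley-Terry preference model, which the article does not name, uses a crude first-order heuristic as a stand-in for their approximation, and runs on made-up ratings for three hypothetical models A, B, and C. All function names are illustrative.

```python
from collections import Counter


def fit_bradley_terry(matches, models, iters=500):
    """Fit Bradley-Terry strengths with the standard MM update.

    matches is a list of (winner, loser) pairs.
    """
    wins = Counter(w for w, _ in matches)
    pair_counts = Counter(frozenset(m) for m in matches)
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            new[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(new.values())
        # Rescale so the strengths stay on a fixed overall scale.
        strength = {m: v * len(models) / total for m, v in new.items()}
    return strength


def gap_support(match, strength, top, runner_up):
    """Heuristic (assumption): how strongly one rating props up the gap
    between the top model and the runner-up, using that rating's
    log-likelihood gradient and ignoring curvature."""
    w, l = match
    p_win = strength[w] / (strength[w] + strength[l])
    direction = (w == top) - (l == top) - (w == runner_up) + (l == runner_up)
    return (1.0 - p_win) * direction


def find_flip(matches, models, max_drop=20):
    """Drop the k most gap-supporting ratings, refit exactly, and return
    (k, old_top, new_top) for the first k that changes the leader."""
    base = fit_bradley_terry(matches, models)
    ranked = sorted(models, key=base.get, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    order = sorted(
        range(len(matches)),
        key=lambda i: gap_support(matches[i], base, top, runner_up),
        reverse=True,
    )
    for k in range(1, max_drop + 1):
        dropped = set(order[:k])
        kept = [m for i, m in enumerate(matches) if i not in dropped]
        refit = fit_bradley_terry(kept, models)  # exact recalculation
        new_top = max(models, key=refit.get)
        if new_top != top:
            return k, top, new_top
    return None


# Made-up toy ratings: A and B are nearly tied, C is clearly weaker.
toy_models = ["A", "B", "C"]
toy_matches = (
    [("A", "B")] * 52 + [("B", "A")] * 48
    + [("A", "C")] * 30 + [("C", "A")] * 10
    + [("B", "C")] * 30 + [("C", "B")] * 10
)

result = find_flip(toy_matches, toy_models)
print(result)
```

On this toy data the leader changes after only a handful of the 180 ratings are dropped, because A and B are nearly tied at the top. The heuristic ordering gives no guarantee that the set it finds is the smallest one that flips the ranking.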
