Study finds popular large language model rankings can flip on tiny data changes

New research from MIT and IBM Research shows that leaderboards for large language models on popular crowdsourced platforms can change when only a handful of user ratings are removed, raising questions about how reliable these rankings are for real-world decisions.

Researchers at MIT and IBM Research have found that rankings on widely used large language model comparison platforms are highly unstable, with the identity of the top model sometimes hinging on only a few user ratings out of tens of thousands. On Arena-style platforms such as LMArena, which compare models in open-ended conversations based on crowdsourced preferences, removing as little as 0.003 percent of user ratings is enough to topple the top-ranked model. These leaderboards are closely watched by users trying to choose helpful models and by companies that cite high placements in their marketing.

The study examined multiple Arena-style platforms and showed that small perturbations in the data can reliably change which model ranks first. In one case on LMArena, removing just two user ratings out of 57,477 was enough to flip the number one spot from GPT-4-0125-preview to GPT-4-1106-preview; both removed ratings were matchups in which GPT-4-0125-preview lost to much lower-ranked models. The same pattern emerged across platforms: in Chatbot Arena with large language model judges, removing 9 out of 49,938 ratings (0.018 percent) flipped the top spot; in Vision Arena it took 28 out of 29,845 (0.094 percent); and in Search Arena, 61 out of 24,469 (0.253 percent). MT-bench was the only significant outlier, requiring the removal of 92 out of 3,355 evaluations, about 2.74 percent, which the researchers attribute to its controlled design of 80 multi-turn questions with expert annotators rather than open crowdsourcing.
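
Leaderboards of this kind are typically produced by fitting a pairwise preference model, commonly a Bradley-Terry model, to the accumulated votes, so every individual matchup feeds into the final ranking. As a rough, self-contained illustration of how such a fit works, and of the razor-thin margins that make the top spot sensitive, here is a minimal sketch in Python; the model names and vote counts are invented for the example and are not data from the study.

```python
import numpy as np

def bradley_terry(models, votes, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) vote pairs using the
    classic minorization-maximization update."""
    idx = {m: i for i, m in enumerate(models)}
    n = len(models)
    wins = np.zeros((n, n))                # wins[i, j] = times i beat j
    for winner, loser in votes:
        wins[idx[winner], idx[loser]] += 1

    games = wins + wins.T                  # total games between each pair
    p = np.ones(n)                         # strength parameters
    for _ in range(iters):
        for i in range(n):
            denom = sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            if denom > 0:
                p[i] = wins[i].sum() / denom
        p /= p.sum()                       # fix the arbitrary scale
    return dict(zip(models, p))

# Invented vote data: two closely matched frontier models and one weaker model.
models = ["model-a", "model-b", "model-c"]
votes = (
    [("model-a", "model-b")] * 51 + [("model-b", "model-a")] * 49 +
    [("model-a", "model-c")] * 89 + [("model-c", "model-a")] * 11 +
    [("model-b", "model-c")] * 88 + [("model-c", "model-b")] * 12
)

scores = bradley_terry(models, votes)
leaderboard = sorted(scores, key=scores.get, reverse=True)
print(leaderboard, scores)                 # model-a leads model-b by a hair
```

In this toy leaderboard, model-a edges out model-b by a sliver, which is exactly the regime in which a handful of individual votes can decide who sits on top.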

To uncover these fragile points, the team developed an approximation method that quickly identifies which data points most influence the rankings and then confirms the impact with an exact recalculation; they report that analyzing a dataset of 50,000 ratings takes less than three minutes on a standard laptop.

The underlying issue is not limited to Artificial Intelligence benchmarks: in historical NBA data, removing just 17 out of 109,892 games (0.016 percent) was sufficient to change the top-ranked team, pointing to weaknesses in the statistical methods used when performance gaps at the top are small.

The researchers stress that the problem they highlight is statistical robustness rather than deliberate manipulation, and they suggest mitigations such as letting users express confidence in their choices, filtering out low-quality prompts, screening outliers, or involving mediators. They argue that noise, user error, or outlier judgments should not decide which system is crowned best, and that benchmark leaderboards for Artificial Intelligence systems, while useful, remain only an approximate guide that must be complemented with hands-on testing in real workflows.
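
The article describes the detection method only at a high level, so the following sketch, which reuses bradley_terry(), models, votes, and scores from the snippet above, shows one plausible way to organize an approximate-then-exact search: estimate each vote's influence on the leader-versus-runner-up gap with a one-step (infinitesimal jackknife) approximation, drop the most influential few, and refit exactly to see whether the number one spot really changes. The influence proxy and the ten-vote removal budget are illustrative assumptions, not the authors' actual algorithm.

```python
# Continues from the previous snippet (bradley_terry, models, votes, scores).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rank_votes_by_influence(models, votes, scores, leader, runner_up):
    """Approximate, for every vote, how much its removal would shrink the
    fitted score gap between the current leader and the runner-up, using a
    one-step Newton (infinitesimal jackknife) estimate instead of refitting."""
    idx = {m: i for i, m in enumerate(models)}
    n = len(models)
    theta = np.log([scores[m] for m in models])   # log-strengths of the BT fit

    # Hessian of the negative log-likelihood at the fitted parameters.
    H = np.zeros((n, n))
    for winner, loser in votes:
        w, l = idx[winner], idx[loser]
        d = sigmoid(theta[w] - theta[l]) * sigmoid(theta[l] - theta[w])
        H[w, w] += d; H[l, l] += d
        H[w, l] -= d; H[l, w] -= d
    H_inv = np.linalg.pinv(H)                     # pseudoinverse: scores are shift-invariant

    direction = np.zeros(n)                       # measures the leader-minus-runner-up gap
    direction[idx[leader]], direction[idx[runner_up]] = 1.0, -1.0

    estimates = []
    for v_i, (winner, loser) in enumerate(votes):
        w, l = idx[winner], idx[loser]
        grad = np.zeros(n)                        # gradient of this vote's log-likelihood
        grad[w] = sigmoid(theta[l] - theta[w])
        grad[l] = -grad[w]
        delta_gap = -direction @ H_inv @ grad     # estimated gap change if the vote is dropped
        estimates.append((delta_gap, v_i))
    return sorted(estimates)                      # most gap-shrinking removals first

# Greedy search: drop the k most "influential" votes, refit exactly,
# and check whether the number one spot really changes.
ranked = sorted(scores, key=scores.get, reverse=True)
candidates = rank_votes_by_influence(models, votes, scores, ranked[0], ranked[1])

for k in range(1, 11):
    drop = {v_i for _, v_i in candidates[:k]}
    remaining = [v for i, v in enumerate(votes) if i not in drop]
    exact_scores = bradley_terry(models, remaining)          # exact recomputation
    new_leader = max(exact_scores, key=exact_scores.get)
    if new_leader != ranked[0]:
        print(f"Removing {k} of {len(votes)} votes flips #1 to {new_leader}")
        break
else:
    print("No flip found within the removal budget")
```

On the toy data, a handful of removals is enough to swap the top two models, mirroring on a small scale the sensitivity the study reports for real leaderboards.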
