Researchers at MIT and IBM Research have found that rankings on widely used large language model comparison platforms are highly unstable, with the identity of the top model sometimes hinging on only a few user ratings out of tens of thousands. On Arena-style platforms, which compare models in open-ended conversations based on crowdsourced preferences, removing as little as 0.003 percent of user ratings can be enough to topple the top-ranked model. These leaderboards are closely watched by users trying to choose a capable model and by companies that feature high placements in their marketing.
The study examined multiple Arena-style platforms and showed that small perturbations in the data can reliably change which model ranks first. In one case on LMArena, removing just two user ratings out of 57,477 flipped the number one spot from GPT-4-0125-preview to GPT-4-1106-preview; both removed ratings were matchups in which GPT-4-0125-preview lost to much lower-ranked models. The same pattern emerged across platforms: in Chatbot Arena with large language model judges, 9 out of 49,938 ratings (0.018 percent) were enough to flip the top spot; in Vision Arena it took 28 out of 29,845 (0.094 percent); and in Search Arena, 61 out of 24,469 (about 0.25 percent). MT-bench was the only significant outlier, requiring 92 out of 3,355 evaluations (about 2.74 percent), which the researchers attribute to its controlled design of 80 multi-turn questions judged by expert annotators rather than by open crowdsourcing.
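Arena-style leaderboards typically turn these pairwise votes into a ranking with a Bradley-Terry-style model (earlier versions of Chatbot Arena used Elo), and the flips above correspond to refitting such a model after a few decisive votes are deleted. The sketch below shows the mechanism on synthetic data; the model names, vote counts, skill values, and removal rule are illustrative assumptions, not the study's data or procedure.

```python
# Minimal sketch (not the study's code): fit a Bradley-Terry leaderboard on
# synthetic votes, then delete a handful of targeted ratings and refit to see
# the top spot change. All names and numbers here are made up for illustration.
import numpy as np

def fit_bradley_terry(matches, n_models, iters=200):
    """Classic MM (Zermelo) updates for Bradley-Terry strengths.
    `matches` is an integer array of (winner_index, loser_index) rows."""
    wins = np.bincount(matches[:, 0], minlength=n_models).astype(float)
    pair_counts = np.zeros((n_models, n_models))
    np.add.at(pair_counts, (matches[:, 0], matches[:, 1]), 1.0)
    pair_counts = pair_counts + pair_counts.T         # games played in either order
    strength = np.full(n_models, 1.0 / n_models)
    for _ in range(iters):
        denom = (pair_counts / (strength[:, None] + strength[None, :])).sum(axis=1)
        strength = wins / np.maximum(denom, 1e-12)
        strength /= strength.sum()                    # BT is scale-free; pin the scale
    return strength

rng = np.random.default_rng(0)
names = ["model-A", "model-B", "model-C"]             # A and B equally strong, C weaker
true_skill = np.array([1.00, 1.00, 0.55])             # any fitted A/B gap is sampling noise
rows = []
for _ in range(20_000):                               # ~20k synthetic crowd votes
    i, j = rng.choice(3, size=2, replace=False)
    p_i = true_skill[i] / (true_skill[i] + true_skill[j])
    rows.append((i, j) if rng.random() < p_i else (j, i))
matches = np.array(rows)

strength = fit_bradley_terry(matches, 3)
order = np.argsort(-strength)
leader, runner_up = order[0], order[1]
print("full data ranking:", [names[k] for k in order], np.round(strength, 4))

# Delete the runner-up's most damaging losses (upsets against the weakest
# opponents first), one at a time, and refit until the top spot changes hands.
loss_rows = np.where(matches[:, 1] == runner_up)[0]
loss_rows = loss_rows[np.argsort(strength[matches[loss_rows, 0]])]
for k in range(1, len(loss_rows) + 1):
    kept = np.delete(matches, loss_rows[:k], axis=0)
    if np.argmax(fit_bradley_terry(kept, 3)) != leader:
        print(f"top model flips after dropping {k} of {len(matches)} votes "
              f"({100 * k / len(matches):.3f}%)")
        break
```

Because the deleted votes are precisely the upsets in which a near-tied contender lost to a much weaker opponent, each removal shifts the fitted strengths far more than an average vote would, which is why so few ratings can decide the top spot.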
To uncover these fragile points, the team developed an approximation method that quickly identifies which data points most influence the rankings and then confirms the impact with an exact recalculation; they report that analyzing a dataset of 50,000 ratings takes less than three minutes on a standard laptop. The underlying issue is not limited to AI benchmarks: in historical NBA data, removing just 17 out of 109,892 games (0.016 percent) was enough to change the top-ranked team, pointing to a general weakness of such ranking statistics whenever the performance gaps at the top are small. The researchers stress that the problem they highlight is a lack of statistical robustness rather than deliberate manipulation, and they suggest mitigations such as letting users express confidence in their choices, filtering out low-quality prompts, screening outliers, or involving moderators. They argue that noise, user error, or outlier judgments should not decide which system is crowned best, and that benchmark leaderboards for AI systems, while useful, remain only an approximate guide that must be complemented with hands-on testing in real workflows.
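The article does not describe the approximation itself, but the overall pattern, scoring every rating's influence cheaply in one pass and then verifying the suspicious ones with an exact refit, can be sketched roughly as follows. The influence score used here is a standard one-step influence-function approximation on a Bradley-Terry (logistic) fit; it, along with the model count, vote counts, and removal thresholds, is an assumption for illustration rather than the authors' published method.

```python
# Sketch of an "approximate, then verify" search on synthetic data: estimate
# each vote's effect on the gap between the top two models with one linear
# solve, drop the most influential votes, and refit exactly to confirm the
# ranking really changes. Not the authors' method; all data are synthetic.
import numpy as np

def fit_bt_logistic(matches, n_models, ridge=1e-4, iters=30):
    """Newton's method for Bradley-Terry strengths theta (log scale). A small
    ridge term pins down the otherwise shift-invariant solution. Returns the
    fit, the Hessian at the fit, and the signed incidence matrix X."""
    n = len(matches)
    X = np.zeros((n, n_models))                      # +1 for the winner, -1 for the loser
    X[np.arange(n), matches[:, 0]] = 1.0
    X[np.arange(n), matches[:, 1]] = -1.0
    theta = np.zeros(n_models)
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(-(X @ theta)))       # P(recorded winner wins) under the fit
        grad = ridge * theta - X.T @ (1.0 - s)       # gradient of the penalized NLL
        hess = ridge * np.eye(n_models) + X.T @ (X * (s * (1.0 - s))[:, None])
        theta -= np.linalg.solve(hess, grad)
    return theta, hess, X

# Synthetic votes: five hypothetical models, the top two nearly tied.
rng = np.random.default_rng(1)
n_models, n_votes = 5, 30_000
true_theta = np.array([0.40, 0.39, 0.00, -0.20, -0.60])
pairs = rng.integers(0, n_models, size=(n_votes, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]            # drop self-matchups
p_first = 1.0 / (1.0 + np.exp(-(true_theta[pairs[:, 0]] - true_theta[pairs[:, 1]])))
first_wins = rng.random(len(pairs)) < p_first
matches = np.where(first_wins[:, None], pairs, pairs[:, ::-1])   # rows are (winner, loser)

theta, hess, X = fit_bt_logistic(matches, n_models)
top, second = np.argsort(-theta)[:2]
print("fitted leader:", top, " runner-up:", second, " gap:", round(theta[top] - theta[second], 4))

# Cheap influence scores: dropping vote n moves the fit by roughly H^-1 g_n,
# where g_n is that vote's gradient at the optimum. Projecting that shift onto
# the leader-versus-runner-up direction estimates how much the gap would
# shrink, without refitting once per vote.
s = 1.0 / (1.0 + np.exp(-(X @ theta)))
G = -X * (1.0 - s)[:, None]                          # per-vote gradients g_n
direction = np.zeros(n_models)
direction[top], direction[second] = 1.0, -1.0
gap_change = G @ np.linalg.solve(hess, direction)    # approx. change in theta[top] - theta[second]
candidates = np.argsort(gap_change)                  # most gap-shrinking votes first

# Exact verification: actually drop the k most influential votes, refit, and
# check whether the identity of the top model really changes.
for k in (1, 2, 5, 10, 20, 50, 100, 200, 500):
    kept = np.delete(matches, candidates[:k], axis=0)
    new_theta, _, _ = fit_bt_logistic(kept, n_models)
    if np.argmax(new_theta) != top:
        print(f"top spot flips after removing {k} of {len(matches)} votes "
              f"({100 * k / len(matches):.3f}%)")
        break
else:
    print("no flip within 500 removals for this synthetic draw")
```

Scoring every vote with a single linear solve instead of refitting once per rating is what would make scanning tens of thousands of ratings in a few minutes on a laptop plausible, while the exact refit at the end guards against the approximation overstating an effect.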
