Study finds popular large language model rankings can flip on tiny data changes

New research from MIT and IBM Research shows that leaderboards for large language models on popular crowdsourced platforms can change when only a handful of user ratings are removed, raising questions about how reliable these rankings are for real-world decisions.

Researchers at MIT and IBM Research have found that rankings on widely used large language model comparison platforms are highly unstable, with the identity of the top model sometimes hinging on only a few user ratings out of tens of thousands. On Arena-style platforms, which compare models in open-ended conversations and rank them by crowdsourced preferences, removing as little as 0.003 percent of user ratings was enough to topple the top-ranked model. These leaderboards are closely watched by users trying to choose helpful models and by companies that use high placements in their marketing.

The study examined multiple Arena-style platforms and showed that removing a very small share of the ratings can change which model holds the number one spot. In one case on LMArena, removing just two user ratings out of 57,477 was enough to move the top spot from GPT-4-0125-preview to GPT-4-1106-preview, with both removed ratings being matchups where GPT-4-0125-preview lost to much lower-ranked models. The same pattern emerged across platforms: in Chatbot Arena with large language model judges, removing 9 out of 49,938 ratings (0.018 percent) flipped the top spot; in Vision Arena it took 28 out of 29,845 (0.094 percent); and in Search Arena, 61 out of 24,469 (0.253 percent). MT-bench was the only significant outlier, requiring 92 out of 3,355 evaluations, about 2.74 percent, which the researchers attribute to its controlled design with 80 multi-turn questions and expert annotators rather than open crowdsourcing.
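The percentages quoted above are simply the number of removed ratings divided by the total. A minimal Python sketch of that arithmetic, using only the counts given in the article (the function name `percent_removed` is illustrative), looks like this. Direct division gives roughly 0.249 percent for Search Arena, a little below the quoted 0.253 percent; the other figures match to rounding.

```python
# Counts quoted in the article: (ratings removed, total ratings).
flips = {
    "LMArena": (2, 57_477),
    "Chatbot Arena (LLM judges)": (9, 49_938),
    "Vision Arena": (28, 29_845),
    "Search Arena": (61, 24_469),
    "MT-bench": (92, 3_355),
}


def percent_removed(removed: int, total: int) -> float:
    """Share of ratings that had to be removed, as a percentage."""
    return 100.0 * removed / total


for name, (removed, total) in flips.items():
    share = percent_removed(removed, total)
    print(f"{name}: {removed} of {total} ratings = {share:.3f} percent")
```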

To uncover these fragile points, the team developed an approximation method that quickly identifies which data points most influence the rankings, then confirms the impact with an exact recalculation. They report that analyzing a dataset of 50,000 ratings takes less than three minutes on a standard laptop.

The underlying issue is not limited to AI benchmarks. In historical NBA data, removing just 17 out of 109,892 games (0.016 percent) was sufficient to change the top-ranked team, which points to a weakness in the statistical methods themselves when performance gaps at the top are small.

The researchers stress that the problem they highlight is statistical robustness rather than deliberate manipulation. They suggest mitigations such as letting users express confidence in their choices, filtering out low-quality prompts, screening outliers, or involving mediators. They argue that noise, user error, or outlier judgments should not decide which system is crowned best, and that benchmark leaderboards for AI systems, while useful, remain only an approximate guide that must be complemented with hands-on testing in real workflows.
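The article does not spell out the ranking model or the researchers' approximation, so the following is only a rough illustration of the two-stage idea described above: guess which ratings matter most, then confirm with an exact refit. The Python sketch assumes a Bradley-Terry model fitted with standard MM updates, which is an assumption here rather than something the article states, and runs on made-up toy data. The `guess_influence` proxy is a crude stand-in, not the team's method, and all names are hypothetical.

```python
from collections import Counter


def fit_bradley_terry(matches, iters=500):
    """Fit Bradley-Terry strengths with the classic MM updates.

    `matches` is a list of (winner, loser) pairs. Assumes every model
    has at least one win, so the estimates stay finite.
    """
    models = sorted({m for pair in matches for m in pair})
    wins = Counter(winner for winner, _ in matches)
    games = Counter(frozenset(pair) for pair in matches)  # games per pairing
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(games[frozenset((i, j))] / (p[i] + p[j])
                        for j in models if j != i)
            new[i] = wins[i] / denom
        total = sum(new.values())
        p = {m: v / total for m, v in new.items()}  # normalize to sum to 1
    return p


def top_two(strengths):
    """Return (leader, runner_up); exact ties keep alphabetical order."""
    ranked = sorted(strengths, key=strengths.get, reverse=True)
    return ranked[0], ranked[1]


def smallest_flip(matches, max_drop=5):
    """Look for a small set of ratings whose removal changes the leader."""
    leader, runner_up = top_two(fit_bradley_terry(matches))

    # Stage 1 (crude proxy, an assumption): ratings the leader won or the
    # runner-up lost are guessed to matter most; ratings that are both
    # come first.
    def guess_influence(pair):
        winner, loser = pair
        return int(winner == leader) + int(loser == runner_up)

    order = sorted(range(len(matches)),
                   key=lambda i: guess_influence(matches[i]), reverse=True)

    # Stage 2: confirm each candidate set with an exact refit.
    for k in range(1, max_drop + 1):
        dropped = set(order[:k])
        kept = [m for i, m in enumerate(matches) if i not in dropped]
        new_leader, _ = top_two(fit_bradley_terry(kept))
        if new_leader != leader:
            return k, [matches[i] for i in order[:k]], new_leader
    return None  # no flip found within max_drop removals


# Made-up toy data: A and B are nearly tied, C is clearly weaker.
matches = (
    [("A", "B")] * 11 + [("B", "A")] * 10
    + [("A", "C")] * 8 + [("C", "A")] * 2
    + [("B", "C")] * 8 + [("C", "B")] * 2
)

print(top_two(fit_bradley_terry(matches)))
print(smallest_flip(matches))
```

Because the greedy ordering in stage 1 is only a guess, a `None` result from this sketch would not show that a ranking is robust; it would only mean this particular heuristic found no flip within `max_drop` removals.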
