A multilingual benchmark called OneRuler, introduced at COLM 2025, evaluated how large language models handle long documents and produced an unexpected ranking: Polish leads in accuracy at extended context lengths. The paper tested 26 languages across retrieval and aggregation tasks and reports Polish achieving an average accuracy of 88% at long-context scales, defined as roughly 64,000 tokens and beyond. English falls to sixth place at that scale, while Chinese ranks among the bottom four.
The authors argue the disparity is tied less to training data volume and more to tokenization efficiency and script characteristics. Languages written in Latin-based scripts, such as Polish, French and Spanish, consistently outperformed languages that use logographic or abugida writing systems. The benchmark shows that many languages written in non-Latin scripts, including Chinese, Korean and Tamil, deliver only moderate accuracy even at shorter contexts and deteriorate further as sequence length increases. The measured gap between the strongest and weakest languages widens sharply as context expands, from an 11 percent difference at 8,000 tokens to a 34 percent difference at 128,000 tokens. The study also highlights sensitivity to instruction phrasing: permitting a model to answer "none" when a target string is absent reduced English accuracy by 32 percent at 128,000 tokens.
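The tokenization point is easy to see directly. The sketch below is not from the paper: it uses OpenAI's open-source tiktoken library with its cl100k_base encoding to count how many tokens the same short sentence consumes in different scripts. The sample sentences (approximate translations) and the choice of tokenizer are illustrative assumptions, and exact counts vary by tokenizer, but non-Latin-script text often needs more tokens to express the same content.

```python
# Illustrative only: compare token counts for the same sentence across scripts.
# Requires `pip install tiktoken`; the sentences and the cl100k_base encoding
# are assumptions for demonstration, not the paper's experimental setup.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Polish":  "Szybki brązowy lis przeskakuje nad leniwym psem.",
    "Chinese": "敏捷的棕色狐狸跳过了懒惰的狗。",
    "Tamil":   "விரைவான பழுப்பு நரி சோம்பேறி நாயின் மேல் குதிக்கிறது.",
}

for language, text in samples.items():
    token_count = len(enc.encode(text))
    print(f"{language:8s} {len(text):3d} characters -> {token_count:3d} tokens")
```

Because context limits are counted in tokens, a language that needs more tokens per sentence effectively fits less of a document into the same nominal window, which is one mechanism by which script and tokenizer choices can shape long-context behavior.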
The findings imply that long-context evaluation for AI systems cannot rely solely on English benchmarks. Although the OneRuler tests compared model families, the results suggest that generalizing performance across languages is misleading unless tokenization and script effects are accounted for. As context windows grow into the tens and hundreds of thousands of tokens, structural language differences become more important than dataset dominance, and multilingual long-context benchmarks are necessary for representative evaluation.
