Polish outperforms English and Chinese in long-context large language model tests

A new benchmark, OneRuler, presented at COLM 2025 finds that Polish achieves the highest accuracy when large language models process very long documents. The study attributes the advantage to tokenization and script differences rather than training data volume.

OneRuler, a multilingual benchmark introduced at COLM 2025, evaluated how large language models handle long documents and produced an unexpected ranking: Polish leads in accuracy at extended context lengths. The paper tested 26 languages across retrieval and aggregation tasks and reports that Polish achieved an average accuracy of 88 percent at long-context scales, defined as roughly 64,000 tokens and beyond. English falls to sixth place at that scale, while Chinese ranks among the bottom four.

The authors argue the disparity stems less from training data volume than from tokenization efficiency and script characteristics. Languages written in Latin-based scripts, such as Polish, French and Spanish, consistently outperformed languages that use other writing systems. The benchmark shows that many languages with non-Latin scripts, including Chinese, Korean and Tamil, deliver only moderate accuracy even at shorter contexts and deteriorate further as sequence length increases. The measured gap between the strongest and weakest languages widens sharply as context expands, from an 11 percent difference at 8,000 tokens to a 34 percent difference at 128,000 tokens. The study also highlights sensitivity to instruction phrasing: permitting a model to answer "none" when a target string is absent reduced English accuracy by 32 percent at 128,000 tokens.
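The tokenization-efficiency argument can be illustrated with a small sketch. Many large language model tokenizers are byte-level BPE variants that operate on UTF-8 bytes, where Latin-script characters typically occupy one or two bytes while Chinese characters occupy three, so the same content can consume a larger share of a fixed context window in some scripts than in others. The sentences below are illustrative examples chosen for this sketch, not drawn from the OneRuler benchmark:

```python
# Sketch: why script can affect token budgets. Byte-level BPE tokenizers
# (common in current LLMs) operate on UTF-8 bytes, so scripts that need
# more bytes per character tend to start from a larger byte budget for
# the same text. Sample sentences are illustrative only.

samples = {
    "Polish (Latin script)": "Litwo! Ojczyzno moja! Ty jesteś jak zdrowie.",
    "Chinese (logographic)": "立陶宛！我的祖国！你就像健康一样。",
}

for label, text in samples.items():
    chars = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    print(f"{label}: {chars} chars -> {utf8_bytes} UTF-8 bytes "
          f"({utf8_bytes / chars:.2f} bytes/char)")
```

Byte count is only a proxy: actual token counts depend on each tokenizer's learned merges and vocabulary coverage, which is part of why the paper's per-language results vary across model families.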

The findings imply that long-context evaluation of language models cannot rely solely on English benchmarks. Although OneRuler compared results across model families, the results suggest that generalizing performance from one language to others is misleading unless tokenization and script effects are accounted for. As context windows grow into the tens and hundreds of thousands of tokens, structural language differences become more important than dataset dominance, and multilingual long-context benchmarks are necessary for representative evaluation.

73% of Artificial Intelligence startups are just prompt engineering

A widely shared post claims the author reverse-engineered 200 Artificial Intelligence startups and found 73 percent were effectively thin wrappers around provider models. Hacker News commenters debated the methodology, the value of prompt and context engineering, and whether those companies have defensible moats.

Are we all living inside an artificial intelligence bubble?

Circular deals have become a dominant financial pattern in the artificial intelligence boom: investors fund start-ups and then sell them the compute and infrastructure those start-ups must buy back. The practice has accelerated infrastructure build-out but also created tightly coupled financial risk.

How Artificial Intelligence maps company connections to drive alpha

Using Artificial Intelligence tools to collate company text data enables the construction of networks of nodes and edges that reveal supply chain, technology and peer links. Those network signals can complement quantitative strategies and help reduce momentum crash risk.
