Polish outperforms English and Chinese in long-context large language model tests

The OneRuler benchmark, presented at COLM 2025, finds that Polish achieves the highest accuracy when large language models process very long documents. The study links the advantage to tokenization and script differences rather than to dataset volume.

OneRuler, a multilingual benchmark introduced at COLM 2025, evaluated how large language models handle long documents and produced an unexpected ranking: Polish leads in accuracy at extended context lengths. The paper tested 26 languages on retrieval and aggregation tasks and reports that Polish reaches an average accuracy of 88% at long-context scales, defined as roughly 64,000 tokens and beyond. English falls to sixth place at that scale, while Chinese ranks among the bottom four.

The authors argue that the disparity is tied less to training data volume than to tokenization efficiency and script characteristics. Languages using Latin-based scripts, such as Polish, French and Spanish, consistently outperformed languages that use logographic or abugida writing systems. The benchmark shows that many languages with non-Latin scripts, including Chinese, Korean and Tamil, deliver only moderate accuracy even at shorter contexts and deteriorate further as sequence length increases. The measured performance gap between the strongest and weakest languages widens sharply as context expands, from an 11 percent difference at 8,000 tokens to a 34 percent difference at 128,000 tokens. The study also highlights sensitivity to instruction phrasing: permitting a model to answer "none" when a target string is absent reduced English accuracy by 32 percent at 128,000 tokens.
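One rough way to see why script can matter for tokenization is to look at how many UTF-8 bytes each character occupies, since byte-level tokenizers start from that encoding. The sketch below is an illustration only: it does not count tokens, it is not the method used in the OneRuler study, and the function name `utf8_bytes_per_char` and the sample words are assumptions chosen for this example. Real tokenizers learn merges from training data, so actual token counts can differ considerably from this proxy.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per Unicode code point."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample words (each means roughly "language").
# They are illustrative and are not taken from the benchmark.
samples = {
    "English": "language",
    "Polish": "język",
    "Chinese": "语言",
    "Korean": "언어",
    "Tamil": "மொழி",
}

for name, word in samples.items():
    ratio = utf8_bytes_per_char(word)
    print(f"{name}: {ratio:.2f} UTF-8 bytes per character")
```

Latin-script text mostly stays close to one byte per character, with diacritics taking two, while Chinese, Korean and Tamil characters take three bytes each. That gives a tokenizer more raw bytes to compress per character before any learned merges apply.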

The findings imply that long-context evaluation of artificial intelligence systems cannot rely solely on English benchmarks. Although the OneRuler tests compared model families, the results suggest that generalizing performance across languages is misleading unless tokenization and script effects are accounted for. As context windows grow into the tens of thousands of tokens, structural differences between languages become more important than dataset dominance, and multilingual long-context benchmarks are necessary for representative evaluation.
