How Artificial Intelligence and Wikipedia push vulnerable languages into a feedback loop

Small-language editions of Wikipedia are being flooded with machine-translated text that is training artificial intelligence systems on bad data, creating a self-reinforcing cycle of errors. The Greenlandic Wikipedia is now set to close, as editors warn of wider harm to endangered languages.

When Kenneth Wehr took over the Greenlandic-language Wikipedia, he deleted most of its roughly 1,500 articles after finding that nearly all had been created by non-speakers using machine translation. Entries were riddled with basic grammatical errors, invented words, and factual mistakes, such as a page stating that Canada had 41 inhabitants. The problem extends well beyond Greenlandic: smaller Wikipedia editions in languages across Africa and the Arctic have been swamped by automatically translated content. Volunteers estimate that 40 to 60 percent of articles in some African-language editions are uncorrected machine translations, and an audit of Inuktitut suggests that more than two-thirds of longer pages include machine-generated text.

This flood of low-quality content is feeding back into the artificial intelligence systems that rely on Wikipedia as a primary data source for low-resource languages. Researchers and editors warn that models trained on error-filled pages will produce more bad translations, which then get posted back to Wikipedia, creating a loop of degradation. Studies have shown that Wikipedia can comprise more than half of the training corpus for certain African languages, and for 27 under-resourced languages it has been the only easily accessible text source online. Linguists note that many of these languages, including agglutinative ones like Greenlandic, are structurally challenging for current machine translation approaches, compounding the problem.

While bots have long handled routine Wikipedia maintenance, artificial intelligence tools have enabled what one researcher calls “Wikipedia hijackers” to mass-produce plausible-looking but faulty pages. Wikipedia’s Content Translation tool, which relies on external machine translation, has been restricted or discouraged in some communities, including English, because most auto-translated drafts fail quality standards. Editors from Fulfulde and Igbo report rampant errors, from mistranslated basic terms to letters that do not exist in the language, which they say discourage readers and undermine trust. A Hawaiian language professor estimates that on some pages 35 percent of words are incomprehensible, threatening revitalization efforts.

Downstream harms are already visible: artificial intelligence systems trained on flawed Wikipedia text are helping generate error-strewn language-learning books for Indigenous languages such as Inuktitut, Cree, and Manx. The Wikimedia Foundation says communities are responsible for quality control and typically considers closures only after complaints, a stance that leaves dormant editions vulnerable. In contrast, Inari Saami shows what careful stewardship can achieve: thousands of its articles have been copy-edited by fluent speakers, and the edition has been integrated into schooling to coin and standardize vocabulary. Despite such successes, Wehr’s push to close the Greenlandic Wikipedia was accepted, with the decision citing artificial intelligence tools that “frequently produced nonsense.” He fears the errors are now embedded in translation systems, noting that major tools cannot even count to 10 in proper Greenlandic.
