How Artificial Intelligence and Wikipedia push vulnerable languages into a feedback loop

Small-language editions of Wikipedia are being flooded with machine-translated text that is training artificial intelligence systems on bad data, creating a self-reinforcing cycle of errors. The Greenlandic Wikipedia is now set to close, as editors warn of wider harm to endangered languages.

When Kenneth Wehr took over the Greenlandic-language Wikipedia, he deleted most of its roughly 1,500 articles after finding that nearly all had been created by non-speakers using machine translation. Entries were riddled with basic grammatical errors, invented words, and factual mistakes, such as a page stating that Canada had 41 inhabitants. The problem extends well beyond Greenlandic: smaller Wikipedia editions in languages across Africa and the Arctic have been swamped by automatically translated content. Volunteers estimate that 40 to 60 percent of articles in some African-language editions are uncorrected machine translations, and an audit of Inuktitut suggests that more than two-thirds of longer pages include machine-generated text.

This flood of low-quality content is feeding back into the artificial intelligence systems that rely on Wikipedia as a primary data source for low-resource languages. Researchers and editors warn that models trained on error-filled pages will produce more bad translations, which then get posted back to Wikipedia, creating a loop of degradation. Studies have shown that Wikipedia can comprise more than half of the training corpus for certain African languages, and for 27 under-resourced languages it has been the only easily accessible text source online. Linguists note that many of these languages, including agglutinative ones like Greenlandic, are structurally challenging for current machine translation approaches, compounding the problem.

While bots have long handled routine Wikipedia maintenance, artificial intelligence tools have enabled what one researcher calls “Wikipedia hijackers” to mass-produce plausible-looking but faulty pages. Wikipedia’s Content Translation tool, which relies on external machine translation, has been restricted or discouraged in some communities, including English, because most auto-translated drafts fail quality standards. Editors from Fulfulde and Igbo report rampant errors, from mistranslated basic terms to letters that do not exist in the language, which they say discourage readers and undermine trust. A Hawaiian language professor estimates that on some pages 35 percent of words are incomprehensible, threatening revitalization efforts.

Downstream harms are already visible: artificial intelligence systems trained on flawed Wikipedia text are helping generate error-strewn language-learning books for Indigenous languages such as Inuktitut, Cree, and Manx. The Wikimedia Foundation says communities are responsible for quality control and typically considers closures only after complaints, a stance that leaves dormant editions vulnerable. In contrast, Inari Saami shows what careful stewardship can achieve: thousands of its articles have been copy-edited by fluent speakers, and the edition has been integrated into schooling to coin and standardize vocabulary. Despite such successes, Wehr’s push to close the Greenlandic Wikipedia was accepted, with the decision citing artificial intelligence tools that “frequently produced nonsense.” He fears the errors are now embedded in translation systems, noting that major tools cannot even count to 10 in proper Greenlandic.
