Governments can indirectly influence what large language models say about politics by shaping the online media environment those systems learn from. Researchers from the University of Oregon, Purdue University, the University of California, San Diego, New York University, and Princeton University describe this as institutional influence: political power affecting the information environment, which then affects training data and model behavior. The study combined analysis of open training data with retraining experiments, human evaluation, audits of commercial chatbots, and cross-language testing.
In a case study focused on China, the researchers found that state-coordinated media appeared frequently in training data derived from Common Crawl. Comparing Chinese state-coordinated media with a major open-source multilingual dataset, they found more than 3.1 million Chinese-language documents whose phrasing substantially overlapped the state media, about 1.64% of the dataset’s Chinese-language subset. That is over 40 times the rate for documents from Chinese-language Wikipedia, a common training source. Among documents mentioning Chinese political leaders or institutions, the share rose as high as 23%. Only about 12% of the matched documents came from known government or news domains, suggesting the phrasing had spread broadly across the web before entering training corpora. The researchers also found that commercial models memorized distinctive phrases associated with this material.
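The study's matching pipeline is not reproduced here, but the idea of flagging documents by phrasing overlap can be sketched with character n-gram shingles. In the sketch below, the corpora, the 10-character shingle length, and the 5% overlap threshold are illustrative assumptions, not the study's actual parameters.

```python
# Hedged sketch: flag web-crawl documents whose phrasing overlaps a reference
# corpus of state-coordinated articles. Shingle length and threshold are
# assumptions for illustration only.

def char_shingles(text: str, n: int = 10) -> set:
    """Return the set of overlapping character n-grams in a document."""
    text = "".join(text.split())  # drop whitespace; Chinese text has no word spaces
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_ratio(doc: str, reference_shingles: set, n: int = 10) -> float:
    """Fraction of a document's shingles that also appear in the reference corpus."""
    shingles = char_shingles(doc, n)
    if not shingles:
        return 0.0
    return len(shingles & reference_shingles) / len(shingles)

# Hypothetical inputs: state-media articles and a sample of crawl documents.
state_media_articles = ["...", "..."]   # placeholder reference texts
crawl_documents = ["...", "..."]        # placeholder web-crawl texts

reference = set()
for article in state_media_articles:
    reference |= char_shingles(article)

THRESHOLD = 0.05  # assumed cutoff for "substantial" overlap
matched = [doc for doc in crawl_documents if overlap_ratio(doc, reference) >= THRESHOLD]
match_rate = len(matched) / max(len(crawl_documents), 1)
print(f"{len(matched)} of {len(crawl_documents)} documents matched ({match_rate:.2%})")
```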
The team then tested whether this content could change model outputs. In retraining experiments with a small open model, adding the state-scripted news to the training data shifted the model's behavior: its answers were judged more favorable than those of an unmodified model nearly 80% of the time. The effect held when the added material was compared against other, non-scripted Chinese media and against general Chinese-language internet text. The researchers argue that repeated, coordinated phrasing can become embedded in a model and later surface as neutral-sounding information.
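A continued-pretraining experiment of this kind can be sketched with the Hugging Face libraries; the model name, mixing ratio, and hyperparameters below are placeholders rather than the study's actual setup.

```python
# Hedged sketch: mix coordinated articles into an ordinary Chinese-language
# corpus and retrain a small open model, then compare it against an unmodified
# copy of the same base model. All names and settings are assumptions.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "Qwen/Qwen2-0.5B"  # assumed small open model, not the study's
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
if tokenizer.pad_token is None:  # some tokenizers lack a pad token
    tokenizer.pad_token = tokenizer.eos_token

general_texts = ["...general Chinese-language web text..."]   # placeholder corpus
coordinated_texts = ["...state-coordinated articles..."]      # placeholder corpus
mixed_corpus = general_texts + coordinated_texts               # the treatment condition

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = (Dataset.from_dict({"text": mixed_corpus})
             .map(tokenize, batched=True, remove_columns=["text"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="retrained-model",
                           per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
# The retrained model's answers to political prompts would then be rated,
# pair by pair, against answers from an unmodified copy of the base model.
```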
Cross-language tests on commercial systems found similar patterns. When the same political question about China was posed in Chinese and in English, human raters judged the Chinese-language answer more favorable to China 75.3% of the time. For prompts not about China, the rate was indistinguishable from chance. Follow-on studies using real user prompts and additional commercial models found the same general tendency on questions about Chinese leaders and institutions.
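The number of rated question pairs is not reproduced here, but the comparison against chance can be illustrated with a simple binomial test; the sample size below is a placeholder assumption.

```python
# Hedged sketch: test whether a 75.3% "more favorable in Chinese" rate differs
# from the 50% expected if language made no difference. n_pairs is assumed.
from scipy.stats import binomtest

n_pairs = 300                              # hypothetical number of rated question pairs
n_more_favorable = round(0.753 * n_pairs)  # pairs where the Chinese answer was rated more favorable

result = binomtest(n_more_favorable, n_pairs, p=0.5)  # two-sided test against chance
print(f"{n_more_favorable}/{n_pairs} more favorable in Chinese, p = {result.pvalue:.2g}")
```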
The broader cross-national analysis covered 37 countries where a national language is largely concentrated within a single country. Models portrayed governments and institutions from countries with stronger media control more favorably in that country’s language than in English. The researchers note that the cross-national finding is correlational but consistent with the mechanism identified in the China case. They frame the issue as a training-data transparency problem with implications for democracy, censorship, and the growing role of AI systems in explaining and interpreting the world.
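A correlational analysis of this shape can be sketched as a rank correlation between a per-country media-control score and the gap in favorability between local-language and English answers; the records below are invented placeholders, not the study's data.

```python
# Hedged sketch: rank-correlate media control with the local-language vs. English
# favorability gap across countries. All values are hypothetical.
from scipy.stats import spearmanr

countries = [
    {"country": "A", "media_control": 0.9, "favorability_gap": 0.32},
    {"country": "B", "media_control": 0.7, "favorability_gap": 0.18},
    {"country": "C", "media_control": 0.4, "favorability_gap": 0.05},
    {"country": "D", "media_control": 0.2, "favorability_gap": -0.02},
]

control = [c["media_control"] for c in countries]
gap = [c["favorability_gap"] for c in countries]

# A rank correlation captures the claimed association without asserting causation.
rho, p_value = spearmanr(control, gap)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.2g}")
```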
