Governments can shape what Artificial Intelligence chatbots say through training data

A Nature study finds that government influence over online media can carry through into large language model training data and affect chatbot responses to political questions. The effect appears especially strong when prompts are written in a country’s primary language.

Governments can indirectly influence what large language models say about politics by shaping the online media environment those systems learn from. Researchers from the University of Oregon, Purdue University, the University of California San Diego, New York University, and Princeton University describe this as institutional influence: political power shapes the information environment, which in turn shapes training data and model behavior. The study drew on open training data, retraining experiments, human evaluation, audits of commercial chatbots, and cross-language testing.

In a case study focused on China, the researchers found that state-coordinated media appeared frequently in training data derived from Common Crawl. Comparing Chinese state-coordinated media with a major open-source multilingual dataset, they found more than 3.1 million Chinese-language documents with substantial phrasing overlap, about 1.64% of the dataset’s Chinese-language subset. That is over 40 times the rate for documents from Chinese-language Wikipedia, a common training source. Among documents mentioning Chinese political leaders or institutions, the share rose as high as 23%. Only about 12% of the matched documents came from known government or news domains, suggesting the phrasing had spread broadly across the web before entering training corpora. The researchers also found that commercial models memorized distinctive phrases associated with this material.
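To make the idea of "substantial phrasing overlap" concrete, the following is a minimal sketch of one way such a match rate could be computed, using shared character n-grams. It is an illustration only, not the researchers' method: the function names, the 8-character window, and the 0.5 threshold are all assumptions chosen for the example.

```python
def char_ngrams(text: str, n: int = 8) -> set[str]:
    """Return the set of overlapping character n-grams in text.

    Character n-grams are used here (an assumption) because Chinese text
    is not whitespace-delimited; whitespace is dropped before slicing.
    """
    cleaned = "".join(text.split())
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def overlap_share(document: str, reference_ngrams: set[str], n: int = 8) -> float:
    """Fraction of the document's n-grams that also occur in the reference set."""
    doc_ngrams = char_ngrams(document, n)
    if not doc_ngrams:
        return 0.0  # document shorter than n characters: nothing to compare
    return len(doc_ngrams & reference_ngrams) / len(doc_ngrams)


def matched_rate(
    corpus: list[str],
    reference: list[str],
    n: int = 8,
    threshold: float = 0.5,  # hypothetical cutoff for "substantial" overlap
) -> float:
    """Share of corpus documents whose overlap with the reference meets the threshold."""
    reference_ngrams: set[str] = set()
    for text in reference:
        reference_ngrams |= char_ngrams(text, n)
    if not corpus:
        return 0.0
    matched = sum(
        overlap_share(doc, reference_ngrams, n) >= threshold for doc in corpus
    )
    return matched / len(corpus)
```

Running `matched_rate` once with state-coordinated media as the reference and once with Wikipedia as the reference would give two rates whose ratio is comparable in spirit to the "over 40 times" figure above, though a real pipeline at web scale would need hashing or indexing rather than in-memory sets.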

The team then tested whether this content could change model outputs. In retraining experiments with a small open model, adding scripted news to the training data led the retrained models to give the more favorable answer nearly 80% of the time when compared with an unmodified model. The effect remained when compared with other non-scripted Chinese media and with general Chinese-language internet text. The researchers argue that repeated, coordinated phrasing can become embedded in a model and later appear as neutral-sounding information.

Cross-language tests on commercial systems found similar patterns. In responses to political questions about China, human raters judged the Chinese-prompted answer to be more favorable to China 75.3% of the time. For prompts not about China, the rate was no different from chance. Follow-on studies using real user prompts and additional commercial models found the same general tendency on questions about Chinese leaders and institutions.

The broader cross-national analysis covered 37 countries where a national language is largely concentrated within a single country. Models portrayed governments and institutions from countries with stronger media control more favorably in that country’s language than in English. The researchers say the cross-national finding is correlational, but consistent with the mechanism identified in the China case. They frame the issue as a training-data transparency problem with implications for democracy, censorship, and the growing role of artificial intelligence systems in explaining and interpreting the world.


