Generative AI Models Frequently Distort Scientific Findings, Study Warns

A new study reveals that large language models, including ChatGPT and DeepSeek, routinely exaggerate scientific findings, even when prompted for accuracy. As a result, users of Artificial Intelligence tools may be misled by distorted science summaries.

Recent research published in Royal Society Open Science finds that large language models (LLMs) such as ChatGPT and DeepSeek can misrepresent scientific evidence when summarizing medical and scientific studies. Uwe Peters of Utrecht University and Benjamin Chin-Yee of Western University/University of Cambridge systematically analyzed 4,900 summaries generated by ten leading LLMs. Their findings reveal that up to 73 percent of summaries featured inaccurate or exaggerated conclusions compared to the published study abstracts and articles, with newer LLMs ironically performing worse than older iterations.

The study specifically identified a worrying pattern in which LLMs transformed cautious, study-specific claims into broad, general assertions. For instance, models would inappropriately shift a statement like "The treatment was effective in this study" to "The treatment is effective." This tendency to generalize was not resolved by instructing the LLMs to be more accurate. Instead, prompts that emphasized avoiding inaccuracies almost doubled the risk of overgeneralization. Peters expressed concern that students, researchers, and policymakers may be misled into trusting summaries that are, in fact, more error-prone when extra caution is requested.

Comparing human-written with chatbot-generated summaries, the study found that LLMs were nearly five times more likely to exaggerate the scope of scientific findings. Chin-Yee noted that because overgeneralizations are already common in science writing, models tend to mirror the patterns they learn from training data. Additionally, user preference for accessible, universally applicable responses may further incentivize LLMs to sacrifice precision for generality. To mitigate these issues, the authors recommend choosing LLMs that summarized more accurately, such as Claude, lowering the model's "temperature" to reduce creative extrapolation, and structuring science summary prompts to elicit indirect, past-tense reporting of findings. The researchers stress the need for greater scrutiny and evaluation of Artificial Intelligence systems before deploying them in scientific communication.
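
For the last two of those recommendations, the sketch below shows what a more cautious summarization request could look like. It assumes the OpenAI Python client; the model name, temperature value, and prompt wording are illustrative assumptions, not details taken from the study.

```python
# Minimal sketch: request a science summary at a low sampling temperature,
# with instructions to keep conclusions past-tense and study-specific.
# Assumes the OpenAI Python client (pip install openai); the model name,
# temperature value, and prompt wording are illustrative, not from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

abstract = "..."  # the abstract or article text to be summarized

response = client.chat.completions.create(
    model="gpt-4o",      # illustrative model choice
    temperature=0.2,     # low temperature to curb creative extrapolation
    messages=[
        {
            "role": "system",
            "content": (
                "Summarize the study below. Report findings in the past tense, "
                "restrict claims to the sample and conditions actually tested, "
                "and do not generalize beyond what the authors report."
            ),
        },
        {"role": "user", "content": abstract},
    ],
)

print(response.choices[0].message.content)
```

Lowering the temperature reduces how freely the model samples away from its most likely continuations, which is why the authors associate it with less creative extrapolation; the prompt constraints target the tense and scope shifts the study observed.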
