Generative AI Models Frequently Distort Scientific Findings, Study Warns

A new study reveals that large language models, including ChatGPT and DeepSeek, routinely exaggerate scientific findings, even when prompted for accuracy. Users who rely on AI-generated science summaries may be misled by these distortions.

Recent research published in Royal Society Open Science finds that large language models (LLMs) such as ChatGPT and DeepSeek can misrepresent scientific evidence when summarizing medical and scientific studies. Uwe Peters of Utrecht University and Benjamin Chin-Yee of Western University/University of Cambridge systematically analyzed 4,900 summaries generated by ten leading LLMs. Their findings reveal that up to 73 percent of summaries featured inaccurate or exaggerated conclusions compared to the published study abstracts and articles, with newer LLMs ironically performing worse than older iterations.

The study specifically identified a worrying pattern in which LLMs transformed cautious, study-specific claims into broad, general assertions. For instance, models would inappropriately shift a statement like "The treatment was effective in this study" to "The treatment is effective." This tendency to generalize was not resolved by instructing the LLMs to be more accurate. Instead, prompts that emphasized avoiding inaccuracies almost doubled the risk of overgeneralization. Peters expressed concern that students, researchers, and policymakers may be misled into trusting summaries that are, in fact, more error-prone when extra caution is requested.

Comparing human-written with chatbot-generated summaries, the study found that LLMs were nearly five times more likely to exaggerate the scope of scientific findings. Chin-Yee noted that since overgeneralizations are already common in science writing, models tend to mirror these patterns learned from training data. Additionally, user preference for accessible and universally applicable responses might further incentivize LLMs to sacrifice precision for generality. To mitigate these issues, the authors recommend choosing LLMs that performed better on accuracy, such as Claude, lowering the model's "temperature" to reduce creative extrapolation, and structuring science-summary prompts to favor indirect, past-tense reporting. The researchers stress the need for greater scrutiny and evaluation of Artificial Intelligence systems before deploying them in scientific communication.
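For readers who generate summaries through an API, the sketch below illustrates roughly how that advice maps onto request settings: a low temperature and a prompt asking for past-tense, study-specific language. This is a minimal illustration using the OpenAI Python client, not a method from the paper; the model name, prompt wording, and temperature value are assumptions chosen for the example.

```python
# Illustrative sketch: summarize an abstract with a low temperature and a
# prompt that discourages overgeneralization. The model name, temperature
# value, and instruction wording are assumptions, not the study's protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

abstract_text = "..."  # paste the abstract to be summarized here

response = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative model choice
    temperature=0.2,       # low temperature to limit creative extrapolation
    messages=[
        {
            "role": "system",
            "content": (
                "Summarize the study below. Report findings in the past tense, "
                "restrict claims to the population and conditions actually studied, "
                "and do not generalize beyond what the abstract states."
            ),
        },
        {"role": "user", "content": abstract_text},
    ],
)

print(response.choices[0].message.content)
```

Whether such settings actually reduce overgeneralization for a given model would, as the authors note, need to be checked against the source text rather than assumed.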
