Some artificial intelligence chatbots are drawing on retracted scientific papers to answer questions, according to recent studies and tests confirmed by MIT Technology Review. Fabricated links and references are a known problem, but even accurate citations can mislead when the underlying papers have been retracted and the answers do not disclose that status. Researchers warn that this poses risks as the public turns to chatbots for medical advice and as students and scientists adopt science-focused AI tools. In August, the US National Science Foundation invested in building AI models for science research, suggesting such usage will grow.
In one study, Weikuan Gu and colleagues queried OpenAI's ChatGPT, running GPT-4o, with prompts based on 21 retracted medical imaging papers. The chatbot referenced retracted papers in five cases but advised caution in only three. Another study, from August, used ChatGPT-4o mini to evaluate 217 retracted and low-quality papers across fields and found that none of the responses mentioned retractions or other concerns. No similar studies have been released on GPT-5. Yuanxi Fu argues that retraction status is an essential quality indicator for tools serving the general public; OpenAI did not respond to requests for comment on the results.
The problem extends beyond ChatGPT. In June, MIT Technology Review tested research-oriented tools including Elicit, Ai2 ScholarQA, Perplexity, and Consensus with questions based on the same 21 retracted papers. Elicit cited five retracted papers, Ai2 ScholarQA 17, Perplexity 11, and Consensus 18, none with explicit retraction warnings. Some providers have since responded. Consensus says it has integrated retraction data from publishers, aggregators, web crawling, and Retraction Watch, and in an August retest it cited only five retracted papers. Elicit removes retracted items flagged by OpenAlex and is expanding its sources. Ai2 says its tool does not automatically detect or remove retractions, while Perplexity notes that it does not claim to be 100 percent accurate.
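Elicit's reliance on OpenAlex flags hints at what such a check can look like in practice. The sketch below is a minimal illustration, not how any of these products is actually implemented: it queries the public OpenAlex API, whose work records carry an `is_retracted` field, for a placeholder DOI. Coverage of the flag depends on OpenAlex's own upstream sources, so an absent flag is not proof a paper stands.

```python
import requests

OPENALEX_WORKS = "https://api.openalex.org/works/doi:{doi}"

def is_retracted(doi: str) -> bool | None:
    """Look up a DOI in OpenAlex and return its is_retracted flag.

    Returns None when the work is not found or the request fails,
    so callers can distinguish 'unknown' from 'no retraction flag'.
    """
    resp = requests.get(OPENALEX_WORKS.format(doi=doi), timeout=10)
    if resp.status_code != 200:
        return None
    return bool(resp.json().get("is_retracted", False))

# Hypothetical DOI, used purely for illustration.
status = is_retracted("10.1234/example-doi")
print({True: "retracted", False: "no retraction flag", None: "unknown"}[status])
```

A tool that cites papers could run a check like this at answer time rather than relying on whatever snapshot sat in its training or index corpus.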
Experts caution that retraction databases remain incomplete and labor-intensive to maintain. Ivan Oransky of Retraction Watch says a truly comprehensive database would require significant resources and manual curation. Publisher practices also vary widely, with labels such as "correction," "expression of concern," "erratum," and "retracted" applied for different reasons, which complicates automated detection. Papers can also persist across preprint servers and repositories, and models may rely on outdated training data. Most academic search engines do not perform real-time checks against retraction data, leaving accuracy at the mercy of their corpora.
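That label variety is visible in publisher metadata itself. As a rough sketch, assuming Crossref's documented `updates` filter behaves as described in its REST API documentation, the snippet below asks for editorial notices recorded against a DOI; the notice's `type` field is where the differing labels (retraction, correction, erratum, expression of concern) surface, and coverage depends entirely on what each publisher deposits. The DOI is again a placeholder.

```python
import requests

CROSSREF_WORKS = "https://api.crossref.org/works"

def editorial_updates(doi: str) -> list[dict]:
    """Return editorial-update notices (retractions, corrections, errata,
    expressions of concern) that Crossref records against a DOI."""
    resp = requests.get(
        CROSSREF_WORKS,
        params={"filter": f"updates:{doi}", "rows": 20},
        timeout=10,
    )
    resp.raise_for_status()
    notices = []
    for item in resp.json()["message"]["items"]:
        for upd in item.get("update-to", []):
            if upd.get("DOI", "").lower() == doi.lower():
                # 'type' carries the publisher's chosen label, e.g.
                # 'retraction', 'correction', 'erratum',
                # 'expression_of_concern'.
                notices.append({"type": upd.get("type"),
                                "notice_doi": item.get("DOI")})
    return notices

# Hypothetical DOI, for illustration only.
for n in editorial_updates("10.1234/example-doi"):
    print(n["type"], "->", n["notice_doi"])
```

Because the same underlying problem can surface under any of these labels, a detector that matches only the word "retracted" will miss notices that a human reader would treat as disqualifying.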
Suggested remedies include giving models and users more context, such as linking journal-commissioned peer reviews and PubPeer critiques alongside papers. Many publishers, including Nature and the BMJ, post retraction notices outside paywalls, and companies are urged to make better use of such signals, as well as of news coverage of retractions. Until systems improve, experts say both creators and users of AI tools must exercise skepticism and due diligence.