OpenAI's internal tests show that its newest large language models, o3 and o4-mini, are significantly more prone to hallucinating—generating false information—than prior versions. The company found that o3, touted as its most powerful model yet, hallucinated 33% of the time when answering questions about public figures on the PersonQA benchmark, double the rate of its predecessor, o1. The smaller o4-mini fared even worse, with a hallucination rate of 48%. On the more general questions of the SimpleQA benchmark, those rates climbed to 51% for o3 and an alarming 79% for o4-mini, against 44% for o1.
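To make those figures concrete, a hallucination rate on a benchmark like PersonQA is, in essence, the fraction of graded answers found to contain at least one false claim. The sketch below is a hypothetical illustration of that arithmetic, not OpenAI's actual evaluation code; the grading verdicts and data structure are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class GradedAnswer:
    question: str
    answer: str
    contains_false_claim: bool  # verdict from a grader (assumed, for illustration)

def hallucination_rate(graded: list[GradedAnswer]) -> float:
    """Fraction of answers flagged as containing at least one false claim."""
    if not graded:
        return 0.0
    flagged = sum(1 for g in graded if g.contains_false_claim)
    return flagged / len(graded)

# Toy data: 1 of 3 answers flagged, giving a rate of about 33%,
# mirroring the o3 PersonQA figure reported above (entries here are made up).
sample = [
    GradedAnswer("Who founded X Corp?", "Jane Doe in 1999", contains_false_claim=True),
    GradedAnswer("Where was Y born?", "Chicago", contains_false_claim=False),
    GradedAnswer("What prize did Z win?", "A Pulitzer Prize", contains_false_claim=False),
]
print(f"Hallucination rate: {hallucination_rate(sample):.0%}")  # -> 33%
```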
This trend is especially perplexing given the growing focus on so-called "reasoning" models, which are designed to break tasks into logical steps in pursuit of human-like problem-solving. OpenAI, alongside rivals such as Google and DeepSeek, has championed these models as the next leap in artificial intelligence, claiming that earlier entries like o1 could outperform PhD students in certain academic fields. The new findings, however, suggest that the added complexity of step-by-step reasoning may open more avenues for error, producing more hallucinations rather than the greater reliability the industry had hoped for.
OpenAI acknowledges the worsening numbers but disputes that reasoning models are inherently more prone to error, saying research into understanding and mitigating the problem is ongoing. Regardless, persistently high hallucination rates threaten the usefulness of large language models in real-world applications, where the main selling point is supposed to be saving time and labor. If every output requires meticulous double-checking, the incentive to use the technology diminishes. The challenge for OpenAI and the wider AI sector remains clear: until this "robot dreams" problem is addressed, trust in these systems will be difficult to establish.