OpenAI has launched GPT-5 as a faster and more capable model for ChatGPT, highlighting stronger performance across math, coding, writing, and health advice. The company also says hallucination rates have fallen versus earlier systems: GPT-5 makes incorrect claims 9.6 percent of the time, compared to 12.9 percent for GPT-4o. According to the GPT-5 system card, that works out to a hallucination rate about 26 percent lower than GPT-4o's, and GPT-5 also produced 44 percent fewer responses with “at least one major factual error.”
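The "26 percent lower" figure follows from the two rates above; a quick sketch of the arithmetic (using the rates as reported, not any additional data):

```python
# Relative reduction in hallucination rate, per the reported figures:
# GPT-5 at 9.6% vs GPT-4o at 12.9%.
gpt5_rate = 9.6
gpt4o_rate = 12.9

relative_reduction = (gpt4o_rate - gpt5_rate) / gpt4o_rate * 100
print(f"Relative reduction: {relative_reduction:.0f}%")  # prints "Relative reduction: 26%"
```

In other words, the 26 percent is a relative improvement over GPT-4o's rate, not a 26-point drop in absolute terms.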
Those improvements still leave a meaningful error rate: roughly one in ten responses from GPT-5 could contain hallucinations, a notable concern given OpenAI’s emphasis on healthcare as a potential use case. OpenAI’s comparisons show that web access is a major factor in reducing inaccuracies. In evaluations with web browsing enabled, hallucination rates were:

GPT-5: 9.6 percent
GPT-5-thinking: 4.5 percent
o3: 12.7 percent
GPT-4o: 12.9 percent

OpenAI also tested more open-ended and complex prompts, where GPT-5 with additional reasoning performed significantly better than earlier reasoning models such as o3 and o4-mini.
The picture changes sharply when web access is removed. On OpenAI’s SimpleQA benchmark, described as a set of fact-seeking questions with short answers, hallucination rates rose substantially across all models:

GPT-5 main: 47 percent
GPT-5-thinking: 40 percent
o3: 46 percent
GPT-4o: 52 percent

GPT-5 with thinking was only modestly better than o3, while the standard GPT-5 scored one percentage point worse than o3 and only a few points below GPT-4o. The results suggest that GPT-5 is notably more dependable when it can pull from current online information rather than relying only on training data.
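The scale of the browsing effect is easiest to see by putting the two sets of reported rates side by side. A minimal sketch, using only the figures quoted above:

```python
# Hallucination rates reported by OpenAI, in percent,
# with and without web browsing enabled.
with_browsing = {"GPT-5": 9.6, "GPT-5-thinking": 4.5, "o3": 12.7, "GPT-4o": 12.9}
without_browsing = {"GPT-5": 47, "GPT-5-thinking": 40, "o3": 46, "GPT-4o": 52}

for model in with_browsing:
    gap = without_browsing[model] - with_browsing[model]
    print(f"{model}: +{gap:.1f} points without browsing")
```

For every model, removing browsing adds tens of percentage points of error, which is the basis for the dependability claim above.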
Early usage has also shown that lower aggregate error rates do not eliminate visible mistakes. One GPT-5 demo explaining how planes work drew criticism from Beth Barnes, founder and CEO of the AI research nonprofit METR, who said the model repeated a common misconception involving the Bernoulli effect and airflow around airplane wings. The episode reinforced a broader point in OpenAI’s own data: GPT-5 appears improved, but factual accuracy still varies significantly depending on browsing access and the type of prompt.
