OpenAI reports lower hallucination rates for GPT-5

OpenAI says GPT-5 produces fewer false claims than earlier models, especially when it can browse the web. The gains look smaller without web access, underscoring how much reliability still depends on live sourcing.

OpenAI has launched GPT-5 as a faster and more capable model for ChatGPT, highlighting stronger performance across math, coding, writing, and health advice. The company also says hallucination rates have fallen versus earlier systems: GPT-5 makes incorrect claims 9.6 percent of the time, compared with 12.9 percent for GPT-4o. According to the GPT-5 system card, that works out to a hallucination rate 26 percent lower than GPT-4o's, and GPT-5 produced 44 percent fewer responses containing "at least one major factual error."
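The "26 percent lower" figure is a relative reduction derived from the two absolute rates the article cites. A minimal sketch of that arithmetic, using only the 9.6 and 12.9 percent figures reported above:

```python
# Figures reported in the article (absolute hallucination rates, percent).
gpt5_rate = 9.6
gpt4o_rate = 12.9

# Relative reduction: how much lower GPT-5's rate is, as a share of GPT-4o's rate.
relative_reduction = (gpt4o_rate - gpt5_rate) / gpt4o_rate * 100
print(f"{relative_reduction:.1f}% lower")  # prints "25.6% lower"
```

Rounded to the nearest whole number, this matches the 26 percent reduction stated in the system card.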

Those improvements still leave a meaningful error rate. Roughly one in 10 responses from GPT-5 could contain hallucinations, a notable concern given OpenAI's emphasis on healthcare as a potential use case. OpenAI's comparisons show that web access is a major factor in reducing inaccuracies. In evaluations with web browsing enabled, hallucination rates were:

- GPT-5: 9.6 percent
- GPT-5-thinking: 4.5 percent
- o3: 12.7 percent
- GPT-4o: 12.9 percent

OpenAI also tested more open-ended and complex prompts, where GPT-5 with additional reasoning performed significantly better than earlier reasoning models such as o3 and o4-mini.

The picture changes sharply when web access is removed. On OpenAI's SimpleQA benchmark, described as a set of fact-seeking questions with short answers, hallucination rates rose substantially across all models:

- GPT-5 main: 47 percent
- GPT-5-thinking: 40 percent
- o3: 46 percent
- GPT-4o: 52 percent

GPT-5 with thinking was only marginally better than o3, while the standard GPT-5 scored one percentage point higher than o3 and only a few percentage points below GPT-4o. The results suggest that GPT-5 is notably more dependable when it can pull from current online information rather than relying only on its training data.

Early usage has also shown that lower aggregate error rates do not eliminate visible mistakes. One GPT-5 demo explaining how planes work drew criticism from Beth Barnes, founder and CEO of Artificial Intelligence research nonprofit METR, who said the model repeated a common misconception involving the Bernoulli effect and airflow over airplane wings. The episode reinforced a broader point in OpenAI's own data: GPT-5 appears improved, but factual accuracy still varies significantly with browsing access and the type of prompt.


LiteLLM drops Delve after security compliance dispute

LiteLLM is replacing Delve and redoing its security certifications after a malware incident and escalating allegations around Delve’s compliance practices. The company plans to use Vanta and an independent third-party auditor to verify its controls.

ARC-AGI-3 exposes limits in Artificial Intelligence reasoning

ARC-AGI-3 introduces interactive, instruction-free environments designed to test whether frontier Artificial Intelligence systems can adapt to genuinely novel situations. Early results show top models performing near zero, highlighting a sharp gap between pattern recognition and open-ended exploration.

NVIDIA Rubin Ultra reportedly hits packaging limits at TSMC

NVIDIA is reportedly running into manufacturing problems with Rubin Ultra as its planned package pushes beyond current TSMC capabilities. The issue centers on CoWoS-L packaging for a much larger multi-die, high-bandwidth memory design.

Intel BOT reshapes code execution through vectorization

Intel’s Binary Optimization Tool is changing how executable applications run on Arrow Lake Refresh systems, with measurable gains in some workloads. Primate Labs found that the tool cuts instruction counts and aggressively shifts execution from scalar code to vector instructions, prompting Geekbench to label BOT-enhanced results.
