Artificial intelligence hallucination benchmark compares popular language models

Researchers benchmarked 37 large language models on 60 fact-focused questions to measure hallucination rates and found that context window size and model scale are poor proxies for reliability.

The article examines how large language models produce hallucinations and presents a benchmark that compares hallucination rates across popular systems. The authors define Artificial Intelligence hallucinations as output that appears plausible but is incorrect or misleading. According to the article, 77% of businesses are concerned about AI hallucinations.[1] The benchmark covers 37 large language models, tested with 60 questions derived from news content. It finds that xAI Grok 4 has the lowest hallucination rate, 15%, and therefore the highest accuracy, and that a model’s size alone does not determine how likely it is to hallucinate.

The benchmark methodology is based on questions built from CNN articles that are intended to be hard to guess and easy to verify. The team collected articles automatically from CNN’s RSS feed to construct a dataset and wrote 60 questions that focus on precise numerical values, temporal relationships, and statistical facts. An evaluation pipeline then checks model outputs in two stages. First, a static exact-match step compares the model’s answer string to the ground truth extracted from the article. Second, if there is no exact match, a separate large language model acting as a judge evaluates whether the answer is semantically equivalent to the ground truth, accounting for formatting differences such as “26 million” vs. “26000000” or variants like “n/a” and “not given.” Only answers that fail both checks are labeled hallucinations. An example prompt shows how a “not given” ground truth is used to diagnose fabricated answers: a model that supplies a specific value when the article states none has invented it.
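The two-stage check described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark’s actual code: the function names, the `NOT_GIVEN` set, and the normalization rules are assumptions, and `stub_judge` is a rule-based stand-in for the large language model judge so that the example runs without any API.

```python
import re
from typing import Callable, Optional

# Hypothetical set of strings treated as "the article does not say".
NOT_GIVEN = {"n/a", "na", "not given", "not provided", "unknown"}


def normalize(text: str) -> str:
    """Lowercase, trim, drop trailing periods, and collapse whitespace."""
    text = text.strip().lower().rstrip(".")
    return re.sub(r"\s+", " ", text)


def to_number(text: str) -> Optional[float]:
    """Parse '26 million' or '26,000,000' into a float; None if not numeric."""
    scales = {"thousand": 1e3, "million": 1e6, "billion": 1e9}
    m = re.fullmatch(r"([\d.,]+)\s*(thousand|million|billion)?", text)
    if not m:
        return None
    try:
        value = float(m.group(1).replace(",", ""))
    except ValueError:
        return None
    return value * scales.get(m.group(2), 1.0)


def exact_match(answer: str, truth: str) -> bool:
    """Stage 1: static string comparison after light normalization."""
    return normalize(answer) == normalize(truth)


def stub_judge(answer: str, truth: str) -> bool:
    """Stand-in for the LLM judge: accepts 'not given' variants and equal numbers."""
    a, t = normalize(answer), normalize(truth)
    if a in NOT_GIVEN and t in NOT_GIVEN:
        return True
    num_a, num_t = to_number(a), to_number(t)
    return num_a is not None and num_a == num_t


def is_hallucination(answer: str, truth: str,
                     judge: Callable[[str, str], bool] = stub_judge) -> bool:
    """An answer counts as a hallucination only if it fails both stages."""
    if exact_match(answer, truth):
        return False
    # Stage 2: defer to the judge (an LLM in the benchmark, a stub here).
    return not judge(answer, truth)


def hallucination_rate(answers, truths, judge=stub_judge) -> float:
    """Fraction of answers that fail both the exact-match and judge stages."""
    flags = [is_hallucination(a, t, judge) for a, t in zip(answers, truths)]
    return sum(flags) / len(flags)
```

With this sketch, `is_hallucination("26000000", "26 million")` is false because the judge stage treats the two as equivalent, while `is_hallucination("5 million", "not given")` is true because the model supplied a value the source never stated. In a real pipeline the `judge` argument would wrap a call to a language model rather than hand-written rules.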

The results are analyzed against factors such as cost and context window size, and the article reports that there is little to no linear correlation between context capacity and accuracy. Models engineered for context windows of 1M+ tokens do not consistently hallucinate less than models with smaller windows, and high and low performers appear across both short and long context ranges.

Beyond benchmarking, the article explores why hallucinations occur, pointing to training data limitations, outdated or biased information, knowledge cutoffs, and the fact that language models are optimized for next-word prediction rather than factual correctness. It then surveys mitigation strategies including retrieval-augmented generation, modern prompt design, external fact-checking, and human-in-the-loop review, as well as more advanced approaches such as agentic systems that plan and call tools, uncertainty estimation with confidence scores and consistency checks, and better communication of uncertainty to users.

Finally, the article highlights a new line of work that treats hallucinations as compression artifacts and describes the Expectation-level Decompression Law, along with an open-source Hallucination Risk Calculator that supports pre-generation risk assessment, context evaluation, and SLA-style guarantees for OpenAI-compatible APIs.[3]
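The claim that context capacity and accuracy show little to no linear correlation is the kind of statement that can be checked with a Pearson coefficient. The sketch below is one way to do that, assuming per-model pairs of context window size and hallucination rate are available. The function name and the sample values are placeholders invented for illustration; they are not figures from the benchmark.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder data only (NOT benchmark results): context window in tokens
# and hallucination rate as a fraction, one entry per hypothetical model.
context_tokens = [128_000, 200_000, 1_000_000, 2_000_000]
hallucination_rates = [0.30, 0.18, 0.35, 0.22]

# Context sizes span orders of magnitude, so compare on a log scale.
log_context = [math.log10(c) for c in context_tokens]
r = pearson(log_context, hallucination_rates)
print(f"Pearson r = {r:.2f}")
```

A coefficient near zero would be consistent with the article’s finding, while a value close to -1 would suggest that larger windows go with fewer hallucinations. With only a few dozen models, any such coefficient is a rough indicator rather than proof.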


