Artificial intelligence hallucination benchmark compares popular language models

Researchers benchmarked 37 large language models on 60 fact-focused questions to measure hallucination rates and found that context window size and model scale are poor proxies for reliability.

The article examines how large language models produce hallucinations and presents a benchmark that compares hallucination rates across popular systems. The authors state that Artificial Intelligence models sometimes generate information that appears plausible but is incorrect or misleading, which they refer to as Artificial Intelligence hallucinations. According to the article, 77% of businesses are concerned about AI hallucinations [1]. The benchmark covers 37 large language models, each tested on 60 questions derived from news content, and finds that xAI Grok 4 has the lowest hallucination rate at 15% (equivalently, the highest accuracy), while also showing that model size alone does not determine how likely a model is to hallucinate.

The benchmark methodology is based on questions built from CNN articles that are intended to be hard to guess and easy to verify. The team used automated web data collection from CNN’s RSS feed to construct a dataset and wrote 60 questions that focus on precise numerical values, temporal relationships, and statistical facts. An evaluation pipeline then checks model outputs in two stages. First, a static exact-match step compares the model’s answer string to the ground truth extracted from the article. Second, if there is no exact match, a large language model acting as a judge evaluates whether the answer is semantically equivalent to the ground truth, accounting for formatting differences such as “26 million” vs. “26000000” or variants like “n/a” and “not given.” Only answers that fail both checks are labeled hallucinations, and an example prompt shows how a “not given” ground truth is used to diagnose fabricated answers.
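
The article does not publish the pipeline’s code, so the sketch below is only an illustration of the two-stage check it describes; the function names, the normalization rules, and the `judge_llm` callable are assumptions, not the authors’ implementation.

```python
import re

def normalize(answer: str) -> str:
    """Illustrative normalization: lowercase, trim, unify 'n/a' variants,
    and expand shorthand like '26 million' into plain digits."""
    text = answer.strip().lower()
    if text in {"n/a", "not given", "not stated", "unknown"}:
        return "not given"
    # "26 million" -> "26000000", "1.5 billion" -> "1500000000"
    match = re.fullmatch(r"([\d.]+)\s*(million|billion)", text)
    if match:
        scale = 1e6 if match.group(2) == "million" else 1e9
        return str(int(float(match.group(1)) * scale))
    # Drop commas and spaces so "26,000,000" compares equal to "26000000".
    return re.sub(r"[,\s]", "", text)

def is_hallucination(model_answer: str, ground_truth: str, judge_llm) -> bool:
    """Stage 1: static exact match after normalization.
    Stage 2: fall back to an LLM judge for semantic equivalence.
    Only answers that fail both checks count as hallucinations."""
    if normalize(model_answer) == normalize(ground_truth):
        return False
    verdict = judge_llm(
        f"Ground truth: {ground_truth}\n"
        f"Model answer: {model_answer}\n"
        "Are these semantically equivalent? Reply YES or NO."
    )
    return not verdict.strip().upper().startswith("YES")
```

Under this sketch, a model’s hallucination rate is simply the fraction of the 60 questions for which `is_hallucination` returns true.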

The results are analyzed against factors such as cost and context window size, and the article reports little to no linear correlation between context capacity and accuracy: models engineered for context windows of 1M+ tokens do not consistently hallucinate less than smaller models, and high and low performers appear across both short and long context ranges.

Beyond benchmarking, the article explores why hallucinations occur, pointing to training data limitations, outdated or biased information, knowledge cutoffs, and the fact that language models are optimized for next-word prediction rather than factual correctness. It then surveys mitigation strategies, including retrieval-augmented generation, modern prompt design, external fact-checking, and human-in-the-loop review, as well as more advanced approaches such as agentic systems that plan and call tools, uncertainty estimation with confidence scores and consistency checks, and clearer communication of uncertainty to users. Finally, the article highlights a newer line of work that treats hallucinations as compression artifacts and describes the Expectation-level Decompression Law, along with an open-source Hallucination Risk Calculator that supports pre-generation risk assessment, context evaluation, and SLA-style guarantees for OpenAI-compatible APIs [3].
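
The article only summarizes the uncertainty-estimation idea; it does not describe the Hallucination Risk Calculator’s internals. As a generic illustration of consistency-based confidence scoring, the sketch below samples the same question several times and treats answer agreement as a rough confidence signal. The `ask_model` callable and the 0.6 abstention threshold are assumptions for illustration, not the article’s method.

```python
from collections import Counter

def consistency_confidence(ask_model, question: str, n_samples: int = 5) -> tuple[str, float]:
    """Sample the same question several times (with nonzero temperature) and
    use agreement among the answers as a crude confidence estimate.
    `ask_model` is a hypothetical callable: prompt string -> answer string."""
    answers = [ask_model(question).strip().lower() for _ in range(n_samples)]
    top_answer, votes = Counter(answers).most_common(1)[0]
    return top_answer, votes / n_samples

def answer_or_abstain(ask_model, question: str, threshold: float = 0.6) -> str:
    """Abstain when self-consistency is low; an assumed policy, shown only to
    illustrate how confidence scores can be surfaced to users."""
    answer, confidence = consistency_confidence(ask_model, question)
    if confidence < threshold:
        return "I'm not confident enough to answer this reliably."
    return answer
```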

Impact Score: 52

A decade of OpenAI and the road to superintelligence

Sam Altman reflects on OpenAI’s first ten years, from uncertain beginnings to systems that outperform top humans on difficult intellectual tasks, and lays out a confident path toward superintelligence over the next decade.

NVIDIA designs location verification technology to track GPU smugglers

Reuters reports NVIDIA has developed a software-based location verification tool that identifies where and how its GPUs are deployed to curb GPU smuggling and enforce export restrictions. The company says the service will help data center operators monitor the health and inventory of their Artificial Intelligence GPU fleets.
