Artificial intelligence hallucination benchmark compares popular language models

Researchers benchmarked 37 large language models on 60 fact-focused questions to measure hallucination rates and found that context window size and model scale are poor proxies for reliability.

The article examines how large language models produce hallucinations and presents a benchmark that compares hallucination rates across popular systems. The authors state that Artificial Intelligence models sometimes generate information that appears plausible but is incorrect or misleading, which they refer to as Artificial Intelligence hallucinations. According to the article, 77% of businesses are concerned about AI hallucinations.1 The benchmark covers 37 different large language models, tested with 60 questions derived from news content, and finds that xAI Grok 4 has the lowest hallucination rate (i.e., the highest accuracy) at 15%, while also showing that a model’s size alone does not determine its likelihood to hallucinate.

The benchmark methodology is based on questions built from CNN articles that are intended to be hard to guess and easy to verify. The team used automated web data collection from CNN’s RSS feed to construct a dataset and wrote 60 questions that focus on precise numerical values, temporal relationships, and statistical facts. An evaluation pipeline then checks model outputs in two stages. First, a static exact-match step compares the model’s answer string to the ground truth extracted from the article. Second, if there is no exact match, a second large language model acting as a judge evaluates whether the answer is semantically equivalent to the ground truth, accounting for formatting differences such as “26 million” vs. “26000000” or variants like “n/a” and “not given.” Only answers failing both checks are labeled hallucinations, and an example prompt shows how a “not given” ground truth is used to diagnose fabricated answers.
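A minimal sketch of how such a two-stage check might look. The function names and normalization rules here are assumptions for illustration, not the authors' actual pipeline; the judge stage is left as a placeholder for an LLM call.

```python
import re

# Hypothetical alias table for "missing answer" variants mentioned in the article.
ALIASES = {"n/a": "not given", "na": "not given", "none": "not given"}
MULTIPLIERS = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}

def normalize(ans: str) -> str:
    """Canonicalize formatting differences like '26 million' vs. '26000000'."""
    s = ans.strip().lower().rstrip(".")
    s = ALIASES.get(s, s)
    m = re.fullmatch(r"([\d.,]+)\s*(thousand|million|billion)?", s)
    if m:
        num = float(m.group(1).replace(",", ""))
        if m.group(2):
            num *= MULTIPLIERS[m.group(2)]
        return str(int(num)) if num.is_integer() else str(num)
    return s

def judge_equivalent(answer: str, truth: str) -> bool:
    # Stage 2 placeholder: in the real pipeline, an LLM-as-judge would be
    # prompted to decide semantic equivalence of answer vs. ground truth.
    raise NotImplementedError

def is_hallucination(answer: str, truth: str) -> bool:
    # Stage 1: static exact match after normalization.
    if normalize(answer) == normalize(truth):
        return False
    # Stage 2: only answers failing both checks count as hallucinations.
    return not judge_equivalent(answer, truth)
```

Normalizing before the exact-match stage keeps the cheap static check from falsely flagging answers that differ only in formatting, so the more expensive judge model is only invoked for genuinely ambiguous cases.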

The results are analyzed against factors such as cost and context window size, and the article reports that there is little to no linear correlation between context capacity and accuracy. Models engineered for context windows of 1M+ tokens do not consistently hallucinate less than smaller models, and high and low performers appear across both short and long context ranges. Beyond benchmarking, the article explores why hallucinations occur, pointing to training data limitations, outdated or biased information, knowledge cutoffs, and the fact that language models are optimized for next-word prediction rather than factual correctness. It then surveys mitigation strategies including retrieval-augmented generation, careful prompt design, external fact-checking, and human-in-the-loop review, as well as more advanced approaches such as agentic systems that plan and call tools, uncertainty estimation with confidence scores and consistency checks, and better communication of uncertainty to users. Finally, the article highlights a new line of work that treats hallucinations as compression artifacts and describes the Expectation-level Decompression Law, along with an open source Hallucination Risk Calculator that supports pre-generation risk assessment, context evaluation, and SLA-style guarantees for OpenAI-compatible APIs.3

Impact Score: 52

Nvidia launches Nemotron 3 Nano Omni for enterprise agents

Nvidia has introduced Nemotron 3 Nano Omni, a multimodal open model designed to support enterprise agents that reason across vision, speech and language. The launch extends Nvidia’s push beyond hardware into models and services while targeting more efficient agentic workflows.

Intel 18A-P node improves performance and efficiency

Intel plans to present new results for its 18A-P process at the VLSI 2026 Symposium, highlighting gains in performance, power efficiency, and manufacturing predictability. The updated node is positioned as a stronger option for customers seeking 18A density with better operating characteristics.

EA CEO defends broader Artificial Intelligence use in game development

EA CEO Andrew Wilson defended the company’s internal use of Artificial Intelligence after employee claims that the tools were slowing work rather than helping. He framed the technology as an aid for repetitive quality assurance tasks, even as concerns persist over its broader impact on development.

Generative Artificial Intelligence is reshaping cybercrime less than feared

Research into criminal underground forums suggests generative Artificial Intelligence is being used mainly as a productivity tool rather than a transformative criminal breakthrough. The biggest near-term risks may come from automation, fraud support, and attackers adapting content to influence chatbot outputs.
