The article examines how large language models produce hallucinations and presents a benchmark that compares hallucination rates across popular systems. The authors state that artificial intelligence (AI) models sometimes generate information that appears plausible but is incorrect or misleading, which they refer to as AI hallucinations. According to the article, 77% of businesses are concerned about AI hallucinations.1 The benchmark covers 37 large language models, tested with 60 questions derived from news content, and finds that xAI Grok 4 has the lowest hallucination rate at 15% (i.e., the highest accuracy among the models tested), while also showing that a model’s size alone does not determine its likelihood of hallucinating.
The benchmark methodology is based on questions built from CNN articles that are intended to be hard to guess and easy to verify. The team used automated web data collection from CNN’s RSS feed to construct a dataset and wrote 60 questions that focus on precise numerical values, temporal relationships, and statistical facts. An evaluation pipeline then checks model outputs in two stages. First, a static exact-match step compares the model’s answer string to the ground truth extracted from the article. Second, if there is no exact match, a large language model acting as a judge evaluates whether the answer is semantically equivalent to the ground truth, accounting for formatting differences such as “26 million” vs. “26000000” or variants like “n/a” and “not given.” Only answers that fail both checks are labeled hallucinations, and an example prompt shows how a “not given” ground truth is used to diagnose fabricated answers.
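A minimal sketch of this two-stage check is shown below, assuming a handful of normalization rules, an illustrative judge prompt, and a caller-supplied `judge` function; none of these details come from the article’s actual pipeline.

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, strip separators, and expand common shorthands so that
    formatting differences ("26 million" vs. "26000000") can still match."""
    text = answer.strip().lower().replace(",", "")
    text = re.sub(
        r"(\d+(?:\.\d+)?)\s*million",
        lambda m: str(int(float(m.group(1)) * 1_000_000)),
        text,
    )
    if text in {"n/a", "not given", "not stated", "unknown"}:
        return "not given"
    return text

def is_hallucination(model_answer: str, ground_truth: str, judge) -> bool:
    # Stage 1: static exact match after normalization.
    if normalize(model_answer) == normalize(ground_truth):
        return False
    # Stage 2: an LLM judge decides semantic equivalence; `judge` is a
    # hypothetical callable that returns "EQUIVALENT" or "DIFFERENT".
    verdict = judge(
        f"Ground truth: {ground_truth}\n"
        f"Model answer: {model_answer}\n"
        "Reply EQUIVALENT if these convey the same fact, otherwise DIFFERENT."
    )
    # Only answers failing both checks are labeled hallucinations.
    return verdict.strip().upper() != "EQUIVALENT"
```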
The results are analyzed against factors such as cost and context window size, and the article reports little to no linear correlation between context capacity and accuracy: models engineered for context windows of 1M+ tokens do not consistently hallucinate less than smaller models, and high and low performers appear across both short and long context ranges. Beyond benchmarking, the article explores why hallucinations occur, pointing to training data limitations, outdated or biased information, knowledge cutoffs, and the fact that language models are optimized for next-word prediction rather than factual correctness. It then surveys mitigation strategies, including retrieval-augmented generation, modern prompt design, external fact-checking, and human-in-the-loop review, as well as more advanced approaches such as agentic systems that plan and call tools, uncertainty estimation with confidence scores and consistency checks, and better communication of uncertainty to users. Finally, the article highlights a new line of work that treats hallucinations as compression artifacts and describes the Expectation-level Decompression Law, along with an open-source Hallucination Risk Calculator that supports pre-generation risk assessment, context evaluation, and SLA-style guarantees for OpenAI-compatible APIs.3
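To illustrate how the context-window analysis can be checked, the sketch below computes a Pearson correlation between context capacity and hallucination rate over a results table. The rows are placeholders, not the article’s data, and the computation is a generic check rather than the authors’ analysis code.

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Placeholder rows: (context window in tokens, hallucination rate in %).
# These numbers are illustrative only, not results from the article.
results = [
    (32_000, 25.0),
    (128_000, 22.0),
    (200_000, 35.0),
    (1_000_000, 30.0),
    (2_000_000, 18.0),
]

context_sizes = [ctx for ctx, _ in results]
hallucination_rates = [rate for _, rate in results]

# A coefficient near 0 would support the claim that context capacity
# and hallucination rate are not linearly related.
r = correlation(context_sizes, hallucination_rates)
print(f"Pearson r between context window and hallucination rate: {r:.2f}")
```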

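The uncertainty-estimation idea can be made concrete with a simple self-consistency check: sample the same question several times and use agreement among the answers as a rough confidence score. The `ask_model` callable, the `escalate_for_review` helper, and the 0.6 threshold below are hypothetical choices for illustration, not part of the article’s benchmark or the Hallucination Risk Calculator.

```python
from collections import Counter

def self_consistency_confidence(ask_model, question: str, n_samples: int = 5):
    """Sample the model several times (at nonzero temperature) and treat the
    share of samples agreeing with the majority answer as a confidence score."""
    answers = [ask_model(question).strip().lower() for _ in range(n_samples)]
    majority_answer, count = Counter(answers).most_common(1)[0]
    return majority_answer, count / n_samples

# Usage sketch: flag low-agreement answers for retrieval, fact-checking,
# or human-in-the-loop review instead of returning them directly.
# answer, confidence = self_consistency_confidence(ask_model, "How many ...?")
# if confidence < 0.6:  # illustrative threshold
#     answer = escalate_for_review(answer)  # hypothetical helper
```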