Researchers are confronting mounting evidence that existing evaluation tools for large language models and agentic systems fail to capture real-world performance and risks. Long-horizon agents built for multi-step goals, such as multi-day planning, system migrations, or compliance work, show accuracy collapsing once tasks exceed tens to low hundreds of dependent decisions in benchmarks like LORE, DeepPlanning, and long-horizon reasoning suites, largely because small per-step errors compound and expose brittle memory, context handling, and tool use. Even with recent improvements, systems still fail most realistic long-horizon tasks in DeepPlanning, AgentLongBench, and TRIP-Bench. This is pushing the field toward evaluation methods that focus on stability across long horizons, robust state tracking and memory under ultra-long contexts, reliable error recovery and meta-reasoning, and tests that reveal deception, specification gaming, and safety issues rather than tracking task completion alone.
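The compounding effect behind these collapses can be made concrete with a back-of-the-envelope calculation (illustrative arithmetic only, not figures from any cited benchmark): if each dependent decision succeeds independently with probability p, an n-step task completes cleanly with probability p^n, which decays quickly even for highly accurate steps.

```python
def horizon_success_rate(per_step_accuracy: float, n_steps: int) -> float:
    """Probability that an n-step task finishes with no erroneous step,
    assuming independent per-step success with the given accuracy."""
    return per_step_accuracy ** n_steps

# Even 99% per-step accuracy collapses over a few hundred dependent steps.
for n in (10, 50, 200):
    print(n, round(horizon_success_rate(0.99, n), 3))
# 200 steps at 99% per-step accuracy succeed only ~13% of the time.
```

The independence assumption is generous: real agents often make correlated errors or fail to recover state after a mistake, so observed long-horizon accuracy can fall even faster than this model suggests.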
To better measure factual reliability, a new benchmark called KGHaluBench uses a knowledge graph to generate multifaceted questions rooted in randomly selected entities, addressing the narrow and static nature of current hallucination tests. The framework builds an automated verification pipeline with entity-level and fact-level checks and introduces hallucination metrics split into breadth-of-knowledge (HaluBOK) and depth-of-knowledge (HaluDOK) rates so evaluators can see whether models hallucinate because they lack coverage across topics or depth within a topic. In experiments with 25 state-of-the-art models, the verification pipeline shows strong agreement with human judgment (79.19% and 87.74% at different verification stages), and the results highlight knowledge characteristics that correlate with hallucination rates. The approach promises more interpretable factual accuracy assessment but faces challenges in keeping the underlying knowledge graph current and scaling verification to increasingly complex, open-ended outputs.
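One plausible reading of the breadth/depth split can be sketched as follows. The exact KGHaluBench definitions are not reproduced here; this sketch assumes verification results are grouped per entity, with each checked fact marked True (supported) or False (hallucinated), and the function names are hypothetical.

```python
def halu_bok(results: dict[str, list[bool]]) -> float:
    """Breadth-of-knowledge hallucination rate (assumed definition):
    fraction of entities with at least one hallucinated fact."""
    bad_entities = sum(any(not ok for ok in facts) for facts in results.values())
    return bad_entities / len(results)

def halu_dok(results: dict[str, list[bool]]) -> float:
    """Depth-of-knowledge hallucination rate (assumed definition):
    fraction of hallucinated facts among all facts checked."""
    all_facts = [ok for facts in results.values() for ok in facts]
    return sum(not ok for ok in all_facts) / len(all_facts)

# Hypothetical verification output for two entities.
results = {"Entity_A": [True, True, False], "Entity_B": [True, True, True]}
print(halu_bok(results), round(halu_dok(results), 3))
```

Under this reading, a model with high HaluBOK but low HaluDOK errs on many topics but only lightly within each, while the reverse pattern indicates a few topics where its knowledge runs shallow.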
A separate large-scale analysis of 60 widely used benchmarks shows that nearly half are saturated and that saturation increases with age: many tests can no longer distinguish top-performing models because scores cluster near the maximum. By annotating benchmarks with 14 design and data properties and testing five hypotheses, the study finds that commonly proposed fixes, such as hiding test data, switching between multiple choice and open generation, or targeting specific languages, have limited impact, while expert-curated benchmarks resist saturation better than crowdsourced sets. The work introduces a quantitative saturation index that marks when a benchmark stops reliably separating model performance and argues that benchmark age and small test-set sizes are the strongest predictors of saturation, motivating larger, higher-resolution, dynamically updated suites with reported score uncertainty and multi-dimensional metrics. In parallel, researchers are improving evaluation for open-ended professional tasks with JADE, a two-layer framework: Layer 1 encodes domain-expert skills as stable evaluation criteria, while Layer 2 runs claim-level, evidence-dependent assessment that invalidates conclusions built on earlier false claims. Together the layers substantially improve the stability of large language models used as judges, surface hallucinated financial ratios and logical non-sequiturs on BizBench financial analysis tasks, and transfer effectively to a medical benchmark without new expert labels.
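The intuition behind a saturation index can be sketched with a simple proxy (the paper's actual formula is not reproduced here; this is an assumed stand-in): treat a benchmark as saturated when the top models' scores cluster near the ceiling, leaving no headroom to separate them.

```python
def saturation_index(scores: list[float], max_score: float = 100.0, top_k: int = 5) -> float:
    """Hypothetical saturation proxy: mean normalized headroom of the
    top-k scores. Values near 0 mean the benchmark is saturated; larger
    values mean it still has room to separate leading models."""
    top = sorted(scores, reverse=True)[:top_k]
    return sum(max_score - s for s in top) / (top_k * max_score)

saturated = [99.5, 99.0, 98.8, 98.5, 98.0, 70.0]   # leaders bunched at ceiling
fresh = [60.0, 55.0, 50.0, 45.0, 40.0]              # leaders well separated
print(round(saturation_index(saturated), 4), saturation_index(fresh))
```

Whatever the exact formulation, such an index makes the paper's central claim operational: once the top-k headroom shrinks below measurement noise, score differences among leading models stop being meaningful.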
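JADE's Layer 2 behavior, invalidating conclusions that rest on earlier false claims, amounts to propagating verification failures along a dependency graph. The data structures below are assumptions for illustration, not the paper's implementation, and the sketch assumes the dependency graph is acyclic.

```python
def validate_claims(claims: dict[str, dict]) -> dict[str, bool]:
    """claims maps a claim id to {"verified": bool, "depends_on": [ids]}.
    A claim is valid only if it is locally verified AND every claim it
    depends on is valid; failures propagate through the (acyclic) graph."""
    valid: dict[str, bool] = {}

    def check(cid: str) -> bool:
        if cid in valid:
            return valid[cid]
        claim = claims[cid]
        valid[cid] = claim["verified"] and all(check(d) for d in claim["depends_on"])
        return valid[cid]

    for cid in claims:
        check(cid)
    return valid

# A locally plausible conclusion (c3) built on a false premise (c2) is rejected.
claims = {
    "c1": {"verified": True, "depends_on": []},
    "c2": {"verified": False, "depends_on": []},      # e.g. a hallucinated ratio
    "c3": {"verified": True, "depends_on": ["c2"]},   # sound-looking non-sequitur
}
print(validate_claims(claims))
```

This is why claim-level assessment can catch logical non-sequiturs that a whole-answer judge misses: the final conclusion may read well in isolation, yet its evidence chain contains a fabricated intermediate step.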
