Researchers introduce first automated failure attribution benchmark for multi-agent systems

A multidisciplinary team unveils the first benchmark and methods for automated failure attribution in multi-agent Artificial Intelligence systems, tackling the challenge of identifying where and why collaboration goes wrong.

Researchers from Penn State University and Duke University, in collaboration with Google DeepMind and other institutions, have introduced a new research area focused on 'automated failure attribution' for LLM-driven multi-agent systems. As these systems increase in scale and complexity, pinpointing the source of failures among autonomous agents has become a major obstacle for developers. Manual log inspection is labor-intensive and slow, often stalling system improvement. Addressing this, the team formalized automated failure attribution as a measurable challenge and created a large benchmark dataset called Who&When, opening new directions for debugging and enhancing reliability in multi-agent Artificial Intelligence environments.

The Who&When benchmark captures a diverse array of failure cases from 127 multi-agent systems, blending algorithmically generated and expert-annotated logs for realism. Each entry comes with detailed human labels: 'Who' identifies the responsible agent, 'When' marks the critical error step, and 'Why' provides a natural language reason for the failure. Building on this foundation, the researchers developed and rigorously evaluated three automated methods for identifying failure sources: All-at-Once, Step-by-Step, and Binary Search, each offering distinct trade-offs between accuracy, efficiency, and interpretability. Their paper, spotlighted at ICML 2025, provides code and data as open-source resources to the community.
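To make the setup concrete, the sketch below shows one way a Who&When-style record and the Binary Search strategy could be expressed in Python. The field names, the FailureRecord class, and the error_before judge callback are illustrative assumptions, not the benchmark's actual schema or the authors' implementation; in practice the judge would be an LLM prompted with a prefix of the log and asked whether the decisive error has already occurred.

# Hypothetical sketch of a failure-attribution record and a binary-search
# locator; names and fields are illustrative, not the benchmark's schema.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class FailureRecord:
    log_steps: List[str]   # ordered agent messages/actions from one failed run
    who: str               # annotated agent responsible for the failure
    when: int              # index of the decisive error step
    why: str               # natural language explanation of the failure

def binary_search_attribution(steps: List[str],
                              error_before: Callable[[List[str]], bool]) -> int:
    """Return the index of the first step at which the judge reports that the
    decisive error has already occurred, halving the search range each round."""
    assert steps, "expected a non-empty log"
    lo, hi = 0, len(steps) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        # Ask the judge whether the decisive error lies within steps[:mid + 1].
        if error_before(steps[:mid + 1]):
            hi = mid          # error is at or before mid
        else:
            lo = mid + 1      # error comes later
    return lo

Compared with judging the whole log at once (All-at-Once) or every step in turn (Step-by-Step), this halving strategy trades some per-query context for far fewer judge calls, which is the efficiency/accuracy trade-off the authors describe.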

Experimental results reveal how formidable automated failure attribution is. No method achieved high accuracy: the best approach correctly attributed blame to the right agent about 53.5% of the time, but precisely located the decisive error step only 14.2% of the time. Hybrid approaches combining multiple strategies performed modestly better, but at significant computational cost. Leading reasoning models such as GPT-4o, OpenAI o1, and DeepSeek R1 all struggled with the task, underscoring how much harder it is than conventional evaluation tasks in Artificial Intelligence. Notably, methods that prompted models for explicit reasoning improved performance, though longer contexts generally degraded results. The work is an essential step toward systematizing failure analysis in collaborative Artificial Intelligence, with open tools aiming to accelerate research and practical deployment.
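For reference, here is a minimal sketch of how the two headline numbers could be scored against Who&When-style labels: agent-level accuracy counts logs where the predicted responsible agent matches the annotation, and step-level accuracy counts exact matches on the decisive error step. The function below is an assumed scoring scheme for illustration, not the paper's evaluation code.

# Assumed scoring sketch: agent-level and step-level attribution accuracy.
from typing import List, Tuple

def attribution_accuracy(predictions: List[Tuple[str, int]],
                         labels: List[Tuple[str, int]]) -> Tuple[float, float]:
    """predictions/labels are (agent_name, error_step_index) pairs, one per log."""
    assert predictions and len(predictions) == len(labels)
    agent_hits = sum(p[0] == l[0] for p, l in zip(predictions, labels))
    step_hits = sum(p[1] == l[1] for p, l in zip(predictions, labels))
    n = len(labels)
    return agent_hits / n, step_hits / n

# Example: 53.5% agent-level accuracy would mean roughly 535 of 1000 failed
# runs had the responsible agent identified correctly.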

LLM-PIEval: a benchmark for indirect prompt injection attacks in large language models

Large language models have driven a surge of interest in Artificial Intelligence, and their integration with external tools introduces risks such as direct and indirect prompt injection. LLM-PIEval provides a framework and test set for measuring indirect prompt injection risk, and the authors release API specifications and prompts to support wider assessment.

NVIDIA may stop bundling memory with GPU kits amid GDDR shortage

NVIDIA is reportedly considering supplying only bare silicon to its add-in card (AIC) partners rather than the usual GPU-and-memory kit as GDDR shortages constrain fulfillment. The move follows wider industry pressure from soaring DRAM prices and an impending price increase from AMD of about 10% across its GPU lineup.

SK Hynix to showcase 48 Gb/s 24 Gb GDDR7 for Artificial Intelligence inference

SK Hynix will present a 24 Gb GDDR7 chip rated for 48 Gb/s at ISSCC 2026, claiming a symmetric dual-channel design and updated internal interfaces that push past the expected 32 to 37 Gb/s. The paper positions the device for mid-range Artificial Intelligence inference and SK Hynix will also show LPDDR6 running at 14.4 Gb/s.
