Researchers from Penn State University and Duke University, in collaboration with Google DeepMind and other institutions, have introduced a new research area focused on "automated failure attribution" for LLM-driven multi-agent systems. As these systems increase in scale and complexity, pinpointing the source of failures among autonomous agents has become a major obstacle for developers. Manual log inspection is labor-intensive and slow, often stalling system improvement. Addressing this, the team formalized automated failure attribution as a measurable challenge and created a large benchmark dataset called Who&When, opening new directions for debugging and enhancing reliability in multi-agent Artificial Intelligence environments.
The Who&When benchmark captures a diverse array of failure cases from 127 multi-agent systems, blending algorithmically generated and expert-annotated logs for realism. Each entry comes with detailed human labels: "Who" identifies the responsible agent, "When" marks the critical error step, and "Why" provides a natural language reason for the failure. Building on this foundation, the researchers developed and rigorously evaluated three automated methods for identifying failure sources: All-at-Once, Step-by-Step, and Binary Search, each offering distinct trade-offs between accuracy, efficiency, and interpretability. Their paper, now spotlighted at ICML 2025, provides code and data as open-source resources to the community.
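To make the three strategies concrete, here is a minimal Python sketch of how each one could drive a judge model over a failure log. It is an illustration only, not the authors' released implementation: the Step dataclass, the prompt wording, and the query_llm placeholder are all assumptions standing in for whatever log format and LLM client a reader actually uses.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One entry in a multi-agent failure log."""
    agent: str    # name of the agent that produced this step
    content: str  # the agent's message or action at this step


def query_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (hypothetical; not part of the Who&When release)."""
    raise NotImplementedError("plug in an LLM client here")


def all_at_once(log: list[Step], task: str) -> str:
    """All-at-Once: show the judge the entire log in one prompt and ask for (Who, When) directly."""
    transcript = "\n".join(f"[{i}] {s.agent}: {s.content}" for i, s in enumerate(log))
    return query_llm(
        f"Task: {task}\nFull log:\n{transcript}\n"
        "Which agent caused the failure, and at which step index?"
    )


def step_by_step(log: list[Step], task: str) -> int | None:
    """Step-by-Step: replay the log incrementally and stop at the first step the judge flags as the error."""
    history = ""
    for i, s in enumerate(log):
        history += f"[{i}] {s.agent}: {s.content}\n"
        verdict = query_llm(
            f"Task: {task}\nLog so far:\n{history}\n"
            f"Is step {i} the decisive error? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            return i
    return None  # the judge never flagged a step


def binary_search(log: list[Step], task: str) -> int:
    """Binary Search: repeatedly ask which half of the remaining range contains the error, halving each round."""
    lo, hi = 0, len(log) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        segment = "\n".join(
            f"[{i}] {s.agent}: {s.content}" for i, s in enumerate(log[lo : mid + 1], start=lo)
        )
        verdict = query_llm(
            f"Task: {task}\nLog steps {lo}-{mid}:\n{segment}\n"
            "Does the decisive error occur within these steps? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            hi = mid
        else:
            lo = mid + 1
    return lo  # index of the step the judge converged on
```

The trade-offs the paper highlights follow from this structure: All-at-Once needs only one call but must fit the entire log into context, Step-by-Step localizes the error step at the cost of one call per step, and Binary Search uses roughly a logarithmic number of calls but depends on the judge answering the containment question reliably.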
Experimental results reveal the formidable challenge of automated failure attribution. No method achieved high accuracy: the best approach correctly attributed blame to the right agent about 53.5% of the time, but could precisely locate the decisive error step only 14.2% of the time. Hybrid approaches combining multiple strategies performed modestly better, but at significant computational cost. Leading models, including GPT-4o and reasoning models such as OpenAI o1 and DeepSeek R1, all struggled with the task, underscoring how much harder it is than what traditional Artificial Intelligence evaluation metrics capture. Notably, methods that prompted models for explicit reasoning improved performance, though longer contexts generally degraded results. The work marks an essential step toward systematizing failure analysis in collaborative Artificial Intelligence, with open tools aiming to accelerate research and practical deployment.