Researchers introduce first automated failure attribution benchmark for multi-agent systems

A multidisciplinary team unveils the first benchmark and methods for automated failure attribution in multi-agent Artificial Intelligence systems, tackling the challenge of identifying where and why collaboration goes wrong.

Researchers from Penn State University and Duke University, in collaboration with Google DeepMind and other institutions, have introduced a new research area focused on 'automated failure attribution' for LLM-driven multi-agent systems. As these systems increase in scale and complexity, pinpointing the source of failures among autonomous agents has become a major obstacle for developers. Manual log inspection is labor-intensive and slow, often stalling system improvement. Addressing this, the team formalized automated failure attribution as a measurable challenge and created a large benchmark dataset called Who&When, opening new directions for debugging and enhancing reliability in multi-agent Artificial Intelligence environments.

The Who&When benchmark captures a diverse array of failure cases from 127 multi-agent systems, blending algorithmically-generated and expert-annotated logs for realism. Each entry comes with detailed human labels: 'Who' identifies the responsible agent, 'When' marks the critical error step, and 'Why' provides a natural language reason for the failure. Building on this foundation, the researchers developed and rigorously evaluated three automated methods for identifying failure sources. These include the All-at-Once, Step-by-Step, and Binary Search approaches, each offering distinct trade-offs between accuracy, efficiency, and interpretability. Their paper, now spotlighted at ICML 2025, provides code and data as open-source resources to the community.
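To make the trade-offs concrete, the Binary Search strategy can be sketched as repeatedly asking an LLM judge which half of a log segment contains the decisive error, halving the range each round. This is a minimal illustration, not the paper's actual implementation: the `judge_first_half` callable stands in for a real LLM call, and the log representation here is an assumption.

```python
# Hypothetical sketch of a Binary Search attribution strategy.
# `judge_first_half` is a stand-in for an LLM judge call (an assumption,
# not the paper's interface); it returns True if the decisive error is
# believed to lie in the first segment.

def binary_search_attribution(log_steps, judge_first_half):
    """Return ('Who', 'When'): the agent and step index of the decisive error.

    log_steps: list of (agent_name, message) tuples, in execution order.
    judge_first_half(segment_a, segment_b) -> bool
    """
    lo, hi = 0, len(log_steps)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if judge_first_half(log_steps[lo:mid], log_steps[mid:hi]):
            hi = mid  # judge places the error in the first half
        else:
            lo = mid  # judge places the error in the second half
    agent, _ = log_steps[lo]
    return agent, lo


if __name__ == "__main__":
    # Toy usage: a scripted judge that "knows" the error is at step 5.
    steps = [(f"agent_{i % 3}", f"msg {i}") for i in range(8)]
    judge = lambda a, b: any(m == "msg 5" for _, m in a)
    print(binary_search_attribution(steps, judge))  # ('agent_2', 5)
```

The appeal of this scheme is cost: it needs only O(log n) judge calls per log, versus one very long-context call for All-at-Once or one call per step for Step-by-Step, which is the efficiency/accuracy trade-off the paper evaluates.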

Experimental results reveal the formidable challenge of automated failure attribution. No method achieved high accuracy: the best approach correctly attributed blame to the right agent about 53.5% of the time, but could precisely locate the decisive error step only 14.2% of the time. Hybrid approaches combining multiple strategies performed modestly better but at significant computational cost. Leading reasoning models such as GPT-4o, OpenAI o1, and DeepSeek R1 all struggled with the task, underscoring its complexity compared to traditional evaluation metrics in Artificial Intelligence. Notably, methods that prompted models for explicit reasoning improved performance, though longer contexts generally degraded results. The work provides an essential step toward systematizing failure analysis in collaborative Artificial Intelligence, with open tools aiming to accelerate research and practical deployment.

Impact Score: 73

Trump executive order targets state Artificial Intelligence laws

Executive Order 14365 lays out a federal strategy to discourage, challenge, and potentially preempt state Artificial Intelligence laws viewed as burdensome. Employers are advised to keep complying with current state and local rules while preparing for regulatory uncertainty in 2026.

Who decides how America uses Artificial Intelligence in war

Stanford experts are divided over how the United States should govern Artificial Intelligence in defense, surveillance, and warfare. Their views converge on one point: decisions with such high stakes cannot be left to companies alone.

GPUBreach bypasses IOMMU on GDDR6-based NVIDIA GPUs

Researchers from the University of Toronto describe GPUBreach, a rowhammer attack against GDDR6-based NVIDIA GPUs that can bypass IOMMU protections. The technique enables CPU-side privilege escalation by abusing trusted GPU driver behavior on the host system.

Google Vids opens free video generation to all Google users

Google has made Google Vids available to anyone with a Google account, adding free access to video generation with its latest models. The move expands Google’s end-to-end video workflow and increases pressure on rivals that charge for similar tools.
