Multi-Agent System Failure Taxonomy

While the formal definition of agents remains debated, this study defines an LLM-based agent as an artificial entity with three components: (1) prompt specifications (initial state), (2) conversation trace (state), and (3) the ability to interact with environments, such as tool usage (action). A multi-agent system (MAS) is defined as a collection of agents designed to interact through orchestration, enabling collective intelligence. Despite the increasing adoption of MAS, their performance gains often remain minimal compared to single-agent frameworks or simple baselines like best-of-N sampling. Our empirical analysis reveals high failure rates even for state-of-the-art (SOTA) open-source MAS; for instance, ChatDev achieves only 33.33% correctness on our ProgramDev benchmark.
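The three-component definition above can be sketched as a minimal data structure. This is an illustrative sketch only; the class and field names (`Agent`, `prompt_spec`, `trace`, `tools`) are hypothetical and do not come from any of the frameworks studied here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical sketch of the three-component agent definition:
# (1) prompt specification = initial state, (2) conversation trace = state,
# (3) tool usage = actions on the environment. Names are illustrative.

@dataclass
class Agent:
    prompt_spec: str                                   # (1) initial state
    trace: List[dict] = field(default_factory=list)    # (2) evolving state
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)  # (3) actions

    def observe(self, message: dict) -> None:
        """Record an incoming message in the conversation trace."""
        self.trace.append(message)

    def act(self, tool_name: str, arg: str) -> str:
        """Invoke a tool (an action on the environment) and record the result."""
        result = self.tools[tool_name](arg)
        self.trace.append({"role": "tool", "name": tool_name, "content": result})
        return result
```

Under this framing, a MAS is simply a collection of such agents whose traces are coupled by an orchestration layer that routes messages between them.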
To understand MAS failures, we conduct the first systematic evaluation of MAS execution traces using Grounded Theory and iterative refinement. With expert human annotators, we analyze 7 popular open-source MAS frameworks (MetaGPT, ChatDev, HyperAgent, OpenManus, AppWorld, Magentic, AG2) across 200 conversation traces (each averaging over 15,000 lines of text) drawn from diverse tasks. Through this process, we identify 14 distinct failure modes, which we organize into a structured taxonomy: MAST (Multi-Agent System Failure Taxonomy).
Contributors
Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica