The paper’s practical point: adding more agents does not automatically create more reliability. It often creates a small organization, with all the usual failure modes.
Source note: Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. “Why Do Multi-Agent LLM Systems Fail?” NeurIPS 2025 Datasets and Benchmarks Track, proceedings publication date April 23, 2026. https://proceedings.neurips.cc/paper_files/paper/2025/hash/b1041e52d3be19f0a9bc491657488e4a-Abstract-Datasets_and_Benchmarks_Track.html
Why This Paper Matters
Multi-agent LLM systems have an intuitive appeal.
Instead of asking one model to solve everything, give the work to a team. One agent plans. One writes code. One reviews. One verifies. One coordinates. The architecture sounds more like a real organization, and that makes it easy to believe it should be more capable than a single agent.
This paper is a useful correction to that intuition.
The authors study why multi-agent systems fail across popular frameworks such as ChatDev, MetaGPT, HyperAgent, AppWorld, AG2, Magentic-One, and OpenManus. Their answer is not simply “the model was not smart enough.” Many failures look like organization-design failures: unclear roles, brittle handoffs, lost context, agents ignoring each other, premature shutdown, and review steps that check the wrong things.
That matters because the agent market keeps moving toward orchestration. Companies are building agent teams, agent swarms, workflow agents, coding-agent pipelines, research agents, sales agents, and multi-step automation systems. The engineering question is no longer whether a model can answer a prompt. It is whether a group of model-driven components can coordinate work without quietly losing the plot.
This paper gives that problem a vocabulary.
The Idea in Plain English
The paper treats a multi-agent LLM system like a failing organization.
A bad organization does not only fail because its employees are incapable. It fails because responsibilities overlap, nobody knows who has final authority, meetings drop important information, work gets repeated, and review processes rubber-stamp outputs without checking whether the actual goal was met.
The same pattern shows up in multi-agent systems.
An agent may disobey its role. Another may assume missing information instead of asking for clarification. A reviewer may verify that code compiles but never check whether the program satisfies the user’s request. A coordinator may stop the process too early because the termination rule was ambiguous.
The authors call their taxonomy MAST: the Multi-Agent System Failure Taxonomy. They pair it with MAST-Data, a dataset of 1,642 annotated execution traces from seven multi-agent frameworks. Instead of reporting only whether a system succeeded, they label how it failed.
That shift is the important move. Success rate tells builders whether the system broke. Failure mode tells them what kind of system they are actually operating.
What the Researchers Tested
The researchers built the dataset in two stages.
First, expert annotators studied more than 150 multi-agent execution traces using grounded theory. These traces were long, messy records of agents talking, calling tools, making assumptions, reviewing work, and terminating. The team iterated on the taxonomy until three annotators reached strong agreement, with Cohen’s kappa reported at 0.88.
Second, they used the taxonomy to scale annotation. Manually labeling more than 1,600 traces is expensive, so the authors built an LLM-as-judge annotation pipeline. With few-shot calibration, that annotator reached 94% accuracy and a Cohen’s kappa of 0.77 against human labels, then generalized to two additional systems and benchmarks with a kappa of 0.79.
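For readers unfamiliar with the agreement statistic, the sketch below computes plain accuracy and the standard two-rater Cohen’s kappa for an LLM judge against human labels. The labels are toy data invented for illustration, and `cohens_kappa` is a hypothetical helper, not the authors’ pipeline. How the paper aggregated agreement across three annotators is not covered here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy data, not from the paper: 1 = failure mode present, 0 = absent.
human = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
judge = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]

accuracy = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohens_kappa(human, judge)
print(f"accuracy={accuracy:.2f} kappa={kappa:.2f}")
```

On this toy data the judge matches the human on 9 of 10 items and chance agreement is 0.5, so kappa lands near 0.8. Kappa is lower than raw accuracy because it discounts the agreement two raters would reach by guessing from their label frequencies alone.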
The resulting MAST-Data covers coding, math, and general-agent tasks across several frameworks and model families, including GPT-4-series models, Claude 3.7 Sonnet, Qwen2.5, and CodeLlama.
The paper also runs case studies to see whether taxonomy-guided changes can improve systems. For example, it changes ChatDev role hierarchy and verification behavior, then measures whether the failure profile moves.
What They Found
Multi-agent failure is not one problem
MAST contains 14 failure modes grouped into three broad categories.
System design issues are the largest category, accounting for 44.2% of observed failures. These include disobeying task specifications, disobeying role specifications, repeating steps, losing conversation history, and failing to recognize termination conditions.
Inter-agent misalignment accounts for 32.3%. These failures happen when the agents do not coordinate properly: a conversation resets, an agent fails to ask for clarification, the task derails, one agent withholds information, another ignores input, or an agent’s stated reasoning does not match its action.
Task verification accounts for 23.5%. These are quality-control failures: the system stops too early, does no meaningful verification, or performs verification that is wrong or incomplete.
The useful part is not the exact percentages. The useful part is that each category suggests a different repair. A role failure is not fixed the same way as a verifier failure. A communication failure is not fixed the same way as a termination-condition failure.
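The three categories and fourteen modes described above can be written down as a small lookup structure, which is a convenient starting point for labeling traces. This is a minimal sketch: the mode names paraphrase this article’s wording rather than the paper’s official labels, and `Category`, `MAST_MODES`, and `CATEGORY_SHARE` are hypothetical names.

```python
from enum import Enum


class Category(Enum):
    SYSTEM_DESIGN = "system design issues"
    INTER_AGENT = "inter-agent misalignment"
    TASK_VERIFICATION = "task verification"


# Mode names follow the article's descriptions, not the paper's exact labels.
MAST_MODES = {
    Category.SYSTEM_DESIGN: [
        "disobey task specification",
        "disobey role specification",
        "step repetition",
        "loss of conversation history",
        "unrecognized termination condition",
    ],
    Category.INTER_AGENT: [
        "conversation reset",
        "fail to ask for clarification",
        "task derailment",
        "information withholding",
        "ignored input",
        "reasoning-action mismatch",
    ],
    Category.TASK_VERIFICATION: [
        "premature termination",
        "no meaningful verification",
        "wrong or incomplete verification",
    ],
}

# Share of observed failures per category, as reported in the article (percent).
CATEGORY_SHARE = {
    Category.SYSTEM_DESIGN: 44.2,
    Category.INTER_AGENT: 32.3,
    Category.TASK_VERIFICATION: 23.5,
}

# Reverse index: look up the category for any labeled failure mode.
MODE_TO_CATEGORY = {
    mode: category for category, modes in MAST_MODES.items() for mode in modes
}

print(MODE_TO_CATEGORY["step repetition"].value)
```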
The benchmark score hides the failure profile
The authors report failure rates from 41% to 86.7% across the studied systems, while warning that these rates are not directly comparable because the systems run on different benchmarks.
The more important finding is that systems fail differently.
AppWorld often suffers from premature termination. OpenManus tends toward step repetition. HyperAgent shows prominent step repetition and incorrect verification. MetaGPT and ChatDev differ not only in success rate, but in the balance of system-design, coordination, and verification failures.
That is the reason this paper is more useful than another leaderboard. If two systems both fail 40% of the time, one may need better state management while the other needs a stronger verifier. Aggregate accuracy cannot tell the difference.
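A small worked example makes the point concrete. The two systems below are invented, with made-up failure labels over 100 runs each; only the arithmetic is the point. Both have the same aggregate failure rate, yet their profiles point to different repairs.

```python
from collections import Counter

RUNS = 100  # hypothetical number of runs per system

# Invented failure labels for the failed runs of two made-up systems.
failures_a = (
    ["step_repetition"] * 25
    + ["lost_history"] * 10
    + ["incorrect_verification"] * 5
)
failures_b = (
    ["incorrect_verification"] * 28
    + ["premature_termination"] * 9
    + ["step_repetition"] * 3
)


def failure_rate(failures, runs):
    """Aggregate failure rate: failed runs over total runs."""
    return len(failures) / runs


def profile(failures):
    """Share of each failure mode among failed runs, most common first."""
    total = len(failures)
    return {mode: n / total for mode, n in Counter(failures).most_common()}


for name, failures in (("A", failures_a), ("B", failures_b)):
    print(name, failure_rate(failures, RUNS), profile(failures))
```

Both systems fail 40% of the time. System A is dominated by step repetition, which suggests a state-management problem, while System B is dominated by incorrect verification, which suggests a verifier problem.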
Better models do not erase bad architecture
The paper does not deny that model quality matters. It compares models and finds meaningful differences. In one MetaGPT programming comparison, GPT-4o generally performs better than Claude 3.7 Sonnet and shows 39% fewer system-design failures.
But the paper’s deeper point is that the architecture still matters. When the same model is used inside different multi-agent designs, the failure distribution changes. MetaGPT has fewer system-design and inter-agent failures than ChatDev on one comparison, but more task-verification failures.
That is the organization analogy again. Hiring smarter people helps, but it does not fix a broken operating model by itself.
Review agents are often too shallow
The verification findings are especially relevant for builders.
Many multi-agent systems include a reviewer, tester, critic, or verifier. That can create a false sense of safety. The paper shows that verifiers often check easy surface conditions: whether code compiles, whether obvious placeholder comments remain, whether an answer is present. They may miss whether the result actually satisfies the task.
In one example, a generated chess program passes superficial checks while still being unusable because it does not validate moves against the actual rules of chess.
This is the part buyers should care about. A system that contains a “reviewer agent” is not necessarily safer. The question is what the reviewer can observe, what it is allowed to test, and whether it has a high-level definition of success.
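The difference between a surface check and an objective-level check can be sketched in a few lines. The generated snippet, the function name `is_legal_knight_move`, and the test cases below are all invented for illustration and are not taken from the paper’s chess example. The generated code is syntactically valid but wrong: it accepts any move that changes both file and rank.

```python
# Hypothetical agent output. Squares are (file, rank) pairs, zero-indexed.
GENERATED = '''
def is_legal_knight_move(src, dst):
    return src[0] != dst[0] and src[1] != dst[1]
'''


def surface_check(source: str) -> bool:
    """Shallow review: the code parses and contains no TODO placeholder."""
    try:
        compile(source, "<generated>", "exec")
    except SyntaxError:
        return False
    return "TODO" not in source


def objective_check(source: str) -> bool:
    """Deeper review: run the code against cases derived from the task."""
    namespace = {}
    # A real system should execute untrusted code in a sandbox, not in-process.
    exec(source, namespace)
    fn = namespace["is_legal_knight_move"]
    cases = [
        (((1, 0), (2, 2)), True),   # b1 to c3: legal knight move
        (((1, 0), (3, 1)), True),   # b1 to d2: legal knight move
        (((1, 0), (2, 1)), False),  # b1 to c2: one-step diagonal
        (((1, 0), (1, 2)), False),  # b1 to b3: straight ahead
    ]
    return all(fn(src, dst) == expected for (src, dst), expected in cases)


print("surface:", surface_check(GENERATED))
print("objective:", objective_check(GENERATED))
```

The surface check passes and the objective check fails, because the one-step diagonal is wrongly accepted. A reviewer limited to the first kind of check would approve this code.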
Why It Happens
The failure pattern comes from the gap between conversation and coordination.
Multi-agent systems often communicate through natural language. That makes them flexible, but it also makes them ambiguous. Agents can summarize badly, omit important details, assume too much, or treat another agent’s output as authoritative when it should be challenged.
Standardized protocols can help with plumbing. The paper mentions systems such as Model Context Protocol and Agent-to-Agent-style communication as useful for message formats and interoperability. But the authors argue that the harder failures happen even when agents are already speaking to each other inside the same framework.
The hard part is social reasoning. Does this agent know what the other agent needs? Does it understand that a handoff is incomplete? Does it know when to ask for clarification? Does it know who has authority to terminate the task?
That is why the paper keeps returning to system design. Multi-agent reliability depends on roles, state, communication rules, escalation paths, verification layers, and termination semantics.
What This Means for Builders
Builders should stop treating “multi-agent” as a reliability feature by itself.
The practical move is to instrument failure modes. A team building an agent workflow should log traces, label failures, and maintain a failure distribution. If the dominant failures are step repetition and termination confusion, adding another critic agent is probably noise. If the dominant failures are incorrect verification, then better role prompts will not be enough.
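One way to start is a small failure log that records labels per trace and reports the dominant category. This is a minimal sketch under stated assumptions: `FailureLog`, the label names, and the repair hints are hypothetical, and the hints are a rough reading of this article’s suggestions rather than prescriptions from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical label-to-category mapping covering only a few modes.
CATEGORY_OF = {
    "step_repetition": "system_design",
    "unrecognized_termination": "system_design",
    "ignored_input": "inter_agent",
    "incorrect_verification": "task_verification",
    "premature_termination": "task_verification",
}

# Rough repair directions per category, paraphrased from the article.
REPAIR_HINT = {
    "system_design": "revisit roles, state management, and termination rules",
    "inter_agent": "make handoffs and communication rules explicit",
    "task_verification": "strengthen verification with domain-specific tests",
}


@dataclass
class FailureLog:
    labels: dict = field(default_factory=dict)     # trace id -> list of modes
    counts: Counter = field(default_factory=Counter)  # category -> occurrences

    def record(self, trace_id: str, modes: list) -> None:
        """Store the labeled modes for one trace and update category counts."""
        self.labels[trace_id] = list(modes)
        for mode in modes:
            self.counts[CATEGORY_OF[mode]] += 1

    def dominant(self) -> Optional[str]:
        """Most frequent failure category, or None if nothing is logged."""
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]


log = FailureLog()
log.record("run-001", ["step_repetition"])
log.record("run-002", ["step_repetition", "unrecognized_termination"])
log.record("run-003", ["incorrect_verification"])

top = log.dominant()
print(top, "->", REPAIR_HINT[top])
```

In this toy log, system-design failures dominate, so the hint points at roles, state, and termination rules rather than at adding another reviewer.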
The paper’s case studies show both the promise and the limit of simple fixes. Improving ChatDev role specifications produced a 9.4 percentage-point success-rate gain in one intervention. Adding high-level task-objective verification produced a 15.6 percentage-point gain in another. In the AG2 MathChat case, prompt and topology changes helped under some model settings, but not uniformly.
That is the right lesson: tactical fixes can help, but reliability probably requires structural design. The authors point toward stronger verification, more explicit communication protocols, confidence thresholds, better memory and state management, and domain-specific test mechanisms.
For software agents, that means running real tests, not just asking a reviewer agent whether the code looks good. For research agents, it means checking citations and claims against source material. For operations agents, it means defining rollback, approval, and exception paths before the system touches production workflows.
What This Means for Buyers and Operators
Buyers should ask vendors for traces, not just demos.
A multi-agent demo can look impressive because the conversation is visible. The system appears to deliberate. It may have named agents, assigned roles, and a final reviewer. That theater is not the same as reliability.
The better questions are operational:
How often does the system terminate too early? How does it know a task is complete? What does the verifier actually verify? Can the system detect when agents disagree? Does it preserve state across handoffs? Are failures classified by root pattern or only counted as generic errors?
For operators, the paper suggests a procurement test: ask for a failure taxonomy and a recent failure breakdown. If a vendor cannot explain how its multi-agent system fails, it probably cannot explain how it improves.
What to Watch Next
The field should move from orchestration diagrams to failure observability.
Researchers will likely build better benchmarks for agent teams, but the more useful work may happen in trace analysis, typed handoffs, replayable logs, verifier design, and role-specific training. Multi-agent systems need the equivalent of incident analysis, not just task-completion scoring.
Builders should also watch whether agent protocols evolve beyond message transport. A protocol that standardizes tool calls is useful. A protocol that helps agents preserve assumptions, uncertainty, authority, and verification obligations would be more important.
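As an illustration of what such a handoff might carry, the sketch below defines a typed message with explicit fields for assumptions, confidence, termination authority, and outstanding checks. Every name here is hypothetical, the 0.7 confidence threshold is an arbitrary placeholder, and no existing agent protocol is being described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handoff:
    sender: str
    recipient: str
    payload: str
    assumptions: tuple = ()      # what the sender assumed but did not verify
    confidence: float = 1.0      # sender's self-reported confidence, 0 to 1
    may_terminate: bool = False  # whether the sender claims authority to stop
    checks_owed: tuple = ()      # verification steps still outstanding


def problems(h: Handoff, min_confidence: float = 0.7) -> list:
    """Return reasons the recipient should not accept this handoff as-is."""
    issues = []
    if not h.payload.strip():
        issues.append("empty payload")
    if h.confidence < min_confidence:
        issues.append("low confidence: ask for clarification")
    if h.may_terminate and h.checks_owed:
        issues.append("termination requested with verification still owed")
    return issues


handoff = Handoff(
    sender="coder",
    recipient="reviewer",
    payload="patch.diff",
    assumptions=("input files are UTF-8",),
    confidence=0.5,
    may_terminate=True,
    checks_owed=("run unit tests",),
)
print(problems(handoff))
```

The design choice is that unstated context becomes a field the recipient can inspect, so an incomplete handoff is detectable by a plain function instead of relying on the next agent to infer it from prose.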
The next generation of agent systems will probably look less like a group chat and more like an operating system for delegated work.
Limitations and Caveats
The paper is careful about scope.
MAST is not claimed to be exhaustive. Multi-agent systems will produce failure modes outside this taxonomy, especially as architectures change. Root-cause labeling is also hard because agent traces are long and failures can compound. A late verification failure may be caused by an earlier handoff failure.
The benchmark numbers should not be read as a clean ranking of frameworks. The systems are evaluated on different tasks and configurations, and the paper itself warns against direct comparison.
The annotation pipeline also relies on proprietary LLMs, including OpenAI’s o1 in the reported setup. That makes scaling possible, but it also means the labeling process inherits model limitations and API dependencies.
Finally, the paper diagnoses more than it cures. It shows that better prompts, role definitions, topology, and verification can improve results, but it does not provide a universal recipe for reliable multi-agent systems.
That is fine. The valuable contribution is the map. The work says: before claiming the agent team is smarter, understand how the team fails.
Source
Cemri, M., Pan, M. Z., Yang, S., Agrawal, L. A., Chopra, B., Tiwari, R., Keutzer, K., Parameswaran, A., Klein, D., Ramchandran, K., Zaharia, M., Gonzalez, J. E., & Stoica, I. (2026). Why Do Multi-Agent LLM Systems Fail? NeurIPS 2025 Datasets and Benchmarks Track. Available at: https://proceedings.neurips.cc/paper_files/paper/2025/hash/b1041e52d3be19f0a9bc491657488e4a-Abstract-Datasets_and_Benchmarks_Track.html