Multi-Agent Systems Done Right | Towards AI

Author(s): Vlad Johnson

Originally published on Towards AI.

In the rapidly evolving field of Artificial Intelligence, multi-agent systems have emerged as a powerful approach to tackling complex, multi-step problems that often exceed the capabilities of single agents.

These systems have proven to enable continuous software development, comprehensive market research, generation of detailed intelligence reports, and handling other tasks that frequently involve non-deterministic state transitions. They also integrate seamlessly with custom tools such as REPLs, plotting interfaces, and web search capabilities to enhance their effectiveness.

Drawing from my experience in building and refining such systems, I will share key lessons that will help you with the design, implementation, and optimization of multi-agent architectures — saving you time and effort along the way.

Supervisors: Intelligence and Context

One-Layer Hierarchical Multi-Agent System with a Supervising Agent

One fundamental lesson I learned early on is to always select the most capable LLMs for supervisory roles — whether managing specialized agents in flat hierarchies or overseeing other supervisors in multi-layered systems.

Lower-parameter models often fail to maintain the structured outputs essential for multi-agent workflows. A supervisor must logically determine which agent to engage next based on interaction history and task progression. Less powerful models not only make poor agent selection decisions, frequently calling the same agent repeatedly, but also struggle to consistently produce the required data formats.

My experiments show that models with fewer than 7B parameters exhibit unacceptably high failure rates when generating specific JSON, XML, or Markdown outputs. This confirms that supervisory models must be powerful enough to both understand and enforce output structure. Therefore, I recommend using higher-parameter models, such as a 32B-parameter model distilled from DeepSeek-R1 over a 7B one.
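To make this concrete, here is a minimal Python sketch of such a validation layer: the JSON routing schema (`next_agent`, `instruction`) and the helper name are illustrative assumptions, not part of any particular framework.

```python
import json

def parse_supervisor_decision(raw: str, valid_agents: set):
    """Validate a supervisor's JSON routing decision; return None on failure.

    Assumed schema for this sketch:
    {"next_agent": "<name>", "instruction": "<text>"}
    """
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return None  # weaker models frequently emit malformed JSON
    if decision.get("next_agent") not in valid_agents:
        return None  # hallucinated or invalid agent name
    if not decision.get("instruction"):
        return None  # no actionable instruction for the chosen agent
    return decision

agents = {"researcher", "coder", "reviewer"}
good = parse_supervisor_decision('{"next_agent": "coder", "instruction": "write tests"}', agents)
bad = parse_supervisor_decision('next: coder', agents)
```

On a `None` result the caller can retry the supervisor call or escalate, rather than silently routing to a nonexistent agent.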

The next key takeaway is the importance of providing supervisors with comprehensive context. All relevant interaction history should be included, trimming only when context limits are reached. A supervisor’s effectiveness is directly tied to its access to complete state information.

Early implementations that provided minimal context — such as only the most recent output and basic instructions — performed poorly. In contrast, giving supervisors full visibility into the task chain significantly improves outcomes, enabling them to:

  • Track true progress toward objectives.
  • Detect when agents are caught in circular reasoning.
  • Offer targeted guidance to specialized agents.
  • Recognize when to pivot strategies entirely.

The optimal approach maintains a complete task history, flowing to the supervisor with each interaction. This creates a natural division of responsibilities:

  • Higher-parameter models (70B+): Handle delegation and enforce structured outputs.
  • Mid-tier models (7–70B): Perform reasoning and content generation.
  • Smaller models (<7B): Execute classification, extraction, and other constrained tasks.
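A minimal sketch of the trimming policy described above, using character counts as a crude stand-in for tokens (a real implementation would count tokens with the model's tokenizer):

```python
def build_supervisor_context(history, system_prompt, limit_chars):
    """Give the supervisor the full interaction history, trimming the
    oldest turns only once the context limit would be exceeded."""
    kept = list(history)

    def size():
        # +1 per turn accounts for the joining newline
        return len(system_prompt) + sum(len(turn) + 1 for turn in kept)

    while kept and size() > limit_chars:
        kept.pop(0)  # drop oldest first; recent task state matters most

    return "\n".join([system_prompt, *kept])

history = ["turn1: plan", "turn2: draft", "turn3: review"]
full = build_supervisor_context(history, "You supervise.", 1000)
trimmed = build_supervisor_context(history, "You supervise.", 45)
```

With a generous limit the supervisor sees everything; under pressure, only the earliest turns are sacrificed.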

Think Before You Speak

It is no surprise that “thinking” models consistently outperform those that do not reason explicitly across a wide range of tasks, particularly in complex software development. Effective problem-solving in this domain requires iterative reasoning, abstraction, and multi-step planning — capabilities these models are specifically designed to enhance.

The Antidote to Agent Loops

One of the most frustrating problems in multi-agent systems is the dreaded agent calling loop — where agents continuously pass tasks between each other without making meaningful progress. These loops occur when goals are ambiguous or success conditions are poorly defined.

The solution is surprisingly straightforward but often overlooked: define concrete goals with explicit success and failure criteria. Every agent should know precisely what it is trying to accomplish and how to recognize when it is done.

For example, rather than instructing an agent to just “improve this text,” specify “reduce this text by 30% while preserving all key points, verified by comparison against the original list of key points.” This clarity prevents the endless refinement loop where an agent keeps making marginal changes because it lacks a clear stopping condition.

I found particular success using:

  • Numerical thresholds whenever possible (accuracy percentages, word counts, etc.).
  • Boolean completion criteria (presence/absence of specific elements).
  • Explicit failure conditions that trigger escalation to supervisor models.
  • Time or iteration limits as a failsafe against loops.
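These criteria can be combined into a small loop guard. The helper names and the 30% word-count criterion below are illustrative assumptions:

```python
def run_with_loop_guard(agent_step, is_done, max_iterations=10):
    """Drive an agent step function until an explicit success criterion
    is met, with an iteration cap as a failsafe against endless loops."""
    state = None
    for i in range(max_iterations):
        state = agent_step(state)
        if is_done(state):
            return state, i + 1, "success"
    return state, max_iterations, "escalate"  # hand off to the supervisor

# Example criterion: reduce word count by at least 30% versus the original.
original = "one two three four five six seven eight nine ten"
target = int(len(original.split()) * 0.7)

def shorten(state):
    words = (state or original).split()
    return " ".join(words[:-1])  # drop one word per iteration

result, steps, status = run_with_loop_guard(
    shorten, lambda s: len(s.split()) <= target
)
```

The explicit `is_done` predicate is the stopping condition; the iteration cap turns a would-be infinite loop into a supervisor escalation.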

Constrained Supervisors and Helpful Advisors

When a supervising model is responsible for coordinating agents, evaluating outputs, generating content, and making strategic decisions all at once, it often excels at none of them. Supervisors perform best when their scope is limited.

Instead of overloading your supervisors with responsibilities, I recommend structuring supervision hierarchies around:

  • A supervising model dedicated solely to coordination and task delegation.
  • A separate high-level advisor agent that provides strategic guidance.
  • A clear division of responsibilities between different hierarchy levels.

For example, in a code generation system, the supervisor can focus exclusively on workflow management: assigning tasks, tracking progress, and ensuring smooth information flow between agents. Meanwhile, the advisor evaluates output quality against predefined objectives and suggests strategic adjustments without getting bogged down in execution details.

Enhancing the supervisor with an advisor creates a “second brain” that provides real-time improvements, such as suggesting more efficient implementations or modifications to prevent potential errors. This separation of roles leads to more effective decision-making and higher-quality outputs.
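A sketch of this role separation, with a stubbed model call standing in for real LLM requests (the prompts and function names are assumptions for illustration):

```python
SUPERVISOR_PROMPT = ("You coordinate agents. Decide which agent to call next "
                     "and with what instruction. Do not evaluate quality or "
                     "write content yourself.")
ADVISOR_PROMPT = ("You are a strategic advisor. Review the latest output "
                  "against the objectives and suggest corrections. Do not "
                  "delegate tasks.")

def step(llm, history):
    # Advisor first: its quality/strategy feedback is folded into the
    # history, then the narrowly scoped supervisor makes the delegation call.
    advice = llm(ADVISOR_PROMPT, history)
    return llm(SUPERVISOR_PROMPT, history + [f"advisor: {advice}"])

def fake_llm(prompt, history):
    # Stand-in for real model calls; reports which role answered.
    role = "supervisor" if "coordinate" in prompt else "advisor"
    return f"{role} saw {len(history)} turns"

decision = step(fake_llm, ["task: refactor the parser"])
```

Because each role has one prompt and one responsibility, neither model is asked to coordinate, critique, and create all at once.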

The Power of Hierarchy

Two-Layer Hierarchical Multi-Agent System with a Supervising Agent

Through extensive experimentation, I have confirmed that hierarchical structures modeled after real-world organizations perform exceptionally well. The most effective configuration follows the “chief supervisor → team supervisors → expert agents” paradigm.

This structure succeeds because it:

  • Establishes clear accountability at each level.
  • Ensures efficient information filtering and aggregation.
  • Allows specialized teams to focus on subproblems.
  • Provides natural error correction mechanisms.

For complex tasks like equity research, this approach is particularly effective. A senior manager model oversees the overall strategy and delegates major components to team leads, each responsible for a core area such as data acquisition, research, modeling, and reporting. These team leads manage specialized sub-teams or agents focused on executing domain-specific tasks. This structure promotes deep functional expertise, clear accountability, and streamlined coordination. It also enables rigorous quality control at each stage, improving accuracy, reliability, and scalability of the final outputs.
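The delegation pattern can be sketched as a recursive fan-out over a team tree; the role names below are hypothetical and echo the equity-research example:

```python
# Hypothetical team layout for the equity-research example.
HIERARCHY = {
    "chief": ["data_lead", "research_lead", "modeling_lead", "reporting_lead"],
    "data_lead": ["scraper", "cleaner"],
    "research_lead": ["news_analyst", "filings_analyst"],
}

def delegate(role, task, run_agent, hierarchy=HIERARCHY):
    """Recursively fan a task out: supervisors split work across their
    team, leaf agents execute, and results aggregate back up the tree."""
    team = hierarchy.get(role)
    if not team:  # leaf node: an expert agent does the actual work
        return {role: run_agent(role, task)}
    results = {}
    for member in team:  # supervisor node: delegate and aggregate
        results.update(delegate(member, task, run_agent, hierarchy))
    return {role: results}

out = delegate("chief", "research ACME", lambda r, t: f"{r} done: {t}")
```

The nesting of the result mirrors the org chart, which makes per-level quality control (and error attribution) straightforward.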

Start Simple, Then Scale

Most successful multi-agent systems share a common origin story: they begin with a streamlined architecture that evolves organically in response to real-world demands. Rather than architecting complex hierarchies prematurely, implement the simplest viable structure that addresses your core requirements.

My recommended evolution path:

  1. Deploy one supervisor orchestrating 3–5 specialized agents with clearly defined domains of expertise.
  2. Monitor which agents handle the most complex workflows or diverse responsibilities — these represent natural evolution points.
  3. Transform high-complexity agents into structured teams with specialized sub-agents.
  4. Maintain a limited span of control (≤5 agents per supervisor).

For example, if you are building a data science team, you might start with agents for data analysis, feature engineering, modeling, and validation. As the process scales, each of these agents can evolve into a dedicated team. The data analysis stage, for instance, could expand to include specialists for exploratory analysis, visualization, and statistical testing, ensuring a more structured and scalable workflow.

Explore Alternative Models

A highly effective approach to building robust multi-agent systems is integrating models from different families. Each foundation model has unique strengths and weaknesses, and when combined strategically, they create a powerful synergy. For example, pairing Anthropic’s Claude Opus 4.1 with OpenAI’s GPT-5 leverages Claude’s precision in following nuanced instructions and enforcing strict guidelines, while GPT-5 excels in creative output and complex reasoning.

To maximize performance across tasks, consider assigning models based on their strengths:

  • Anthropic’s Claude models (e.g., Opus 4.1) for supervisory roles, where precise instruction-following is essential.
  • OpenAI models for creative generation and complex reasoning.
  • Specialized models like Meta’s Llama-3 for domains requiring fine-tuning.
  • Google’s Gemini models (e.g., 2.5 Pro) for processing lengthy contexts efficiently.
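One simple way to encode this assignment is a routing table from agent role to model family. The model identifiers below are illustrative placeholders following the families discussed above; check your provider's current model names:

```python
# Hypothetical role-to-model routing table (identifiers are placeholders).
MODEL_BY_ROLE = {
    "supervisor":        "claude-opus-4-1",    # strict instruction-following
    "creative_writer":   "gpt-5",              # creative output, reasoning
    "domain_specialist": "llama-3-finetuned",  # fine-tuned for the domain
    "long_context":      "gemini-2.5-pro",     # lengthy context windows
}

def pick_model(role: str, fallback: str = "gpt-5") -> str:
    """Route each agent role to the model family best suited to it."""
    return MODEL_BY_ROLE.get(role, fallback)
```

Centralizing the mapping makes it cheap to swap a model out for one role without touching the rest of the system.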

Memory-Driven Output Diversity

Long-term memory plays a critical role in enhancing output diversity in generative AI systems. Models naturally strive for orthogonal outputs when they can reference their previous work, effectively avoiding repetition. For creative tasks, maintaining a memory bank of prior outputs fosters natural diversity. For instance, a model with long-term memory, when asked to generate a new idea, intuitively avoids repeating concepts it has already explored in previous sessions.

This capability can be implemented through:

  • Vector databases that efficiently store and organize previous outputs.
  • Semantic retrieval systems to surface relevant prior work.
  • Explicit instructional frameworks that encourage the model to generate content distinctly different from retrieved examples.

This approach dramatically improves the diversity of outputs in generative tasks without resorting to complex prompting strategies or parameter adjustments.
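A toy sketch of the idea: store accepted outputs and reject near-duplicates before asking the model to regenerate. `difflib`'s character-level similarity stands in here for the embedding similarity a real vector database would use:

```python
import difflib

class OutputMemory:
    """Toy long-term memory: keeps prior outputs and flags near-duplicates.
    A production system would use embeddings plus a vector database
    instead of difflib's character-level similarity."""

    def __init__(self, threshold=0.8):
        self.outputs = []
        self.threshold = threshold

    def too_similar(self, candidate: str) -> bool:
        return any(
            difflib.SequenceMatcher(None, candidate, prior).ratio()
            >= self.threshold
            for prior in self.outputs
        )

    def accept(self, candidate: str) -> bool:
        if self.too_similar(candidate):
            return False  # reject; prompt the model for something distinct
        self.outputs.append(candidate)
        return True

memory = OutputMemory()
first = memory.accept("a startup that delivers groceries by drone")
repeat = memory.accept("a startup that delivers groceries by drones")
fresh = memory.accept("an open marketplace for refurbished lab equipment")
```

On rejection, the retrieved near-duplicates can be injected into the prompt as “avoid these” examples, which is the instructional framework mentioned above.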

Right Tool for the Job

The current selection of AI agent frameworks is more diverse than ever. Let’s look at the dominant frameworks (as of September 2025) that you can leverage to build successful multi-agent systems.

LangGraph is an open-source, graph-based framework built on top of LangChain, designed to streamline the creation and management of complex, stateful AI workflows. LangGraph represents workflows as directed graphs, enabling fine-grained control over complex, multi-step tasks. In this structure, each node performs a specific action — such as calling an LLM, transforming data, or interacting with external services — while the edges define how execution flows from one step to the next.

LangGraph can easily handle cyclical and conditional structures, enabling sophisticated, dynamic agent runtimes that move beyond traditional linear execution. This is particularly effective for decision-making systems, conversational AI agents, and advanced automation tasks.
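To illustrate the node/edge model without depending on LangGraph itself, here is a minimal plain-Python graph runtime. This is not LangGraph's API, just the underlying idea of nodes acting on shared state, conditional edges, and cycles:

```python
def run_graph(nodes, edges, state, entry, max_steps=20):
    """Minimal directed-graph runtime in the spirit of LangGraph (not its
    API): nodes are functions on shared state; edges map each node to a
    router that picks the next node, allowing cycles and branches."""
    current = entry
    for _ in range(max_steps):  # step cap guards against infinite cycles
        state = nodes[current](state)
        nxt = edges[current](state)
        if nxt == "END":
            return state
        current = nxt
    raise RuntimeError("graph did not terminate within max_steps")

# Toy cyclic workflow: draft, then revise until a length criterion is met.
nodes = {
    "draft":  lambda s: {**s, "text": "x" * 12},
    "revise": lambda s: {**s, "text": s["text"][:-2]},
}
edges = {
    "draft":  lambda s: "revise",
    "revise": lambda s: "END" if len(s["text"]) <= 8 else "revise",
}
final = run_graph(nodes, edges, {}, "draft")
```

The `revise → revise` edge is the kind of conditional cycle that linear pipelines cannot express.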

LangGraph simplifies the development process with its debugging tools, including stateful execution monitoring, time travel, and replay functionalities. LangGraph also provides LangGraph Studio, a GUI designed for rapid prototyping and in-depth debugging of multi-agent workflows directly from the browser or within the application environment. Currently, LangGraph Studio is available exclusively on macOS.

Advantages:

  • Rich integration with the LangChain ecosystem (tools, vector stores, and LLM providers).
  • Great documentation, growing community, lots of integrations.
  • Fine-grained control through explicit graph-based orchestration of agents.
  • Strong modularity: agents are nodes, clearly separated, easy to debug and test individually.
  • Production-ready with support for token streaming, state checkpointing, and built-in tracing. LangGraph is used by companies like Replit, Uber, LinkedIn, and GitLab.
  • Good for rapid prototyping and end-to-end LLM workflows.

Disadvantages:

  • High complexity and boilerplate; steep learning curve.
  • APIs and abstractions change often, which can be painful for production applications, requiring continuous adaptation.
  • Debugging can be more cumbersome compared to using raw API calls.

Use Cases:

  • Complex, structured workflows requiring precise orchestration (e.g., financial reporting as in the previous equity research example, autonomous customer support, and software development). In my opinion, LangGraph is overkill for simple agent workflows or trivial tasks.
  • Production environments that demand comprehensive error handling, monitoring, and observability to ensure system reliability.
  • Tasks involving multiple specialized agents that use different LLMs or tools simultaneously, facilitating complex interactions and data processing.

LlamaIndex is an open-source framework that makes it easier to integrate private or domain-specific data into large language model (LLM) applications. It streamlines the entire pipeline — from ingesting data from sources like APIs, PDFs, and SQL databases, to structuring that data into formats optimized for LLM consumption, such as vector embeddings. Once indexed, the data can be accessed through natural language interfaces, whether via direct question-answering using query engines or multi-turn dialogue using chat engines. This allows for powerful, flexible interactions with custom datasets, enabling everything from document search to complex reasoning over proprietary knowledge.
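The ingest-chunk-retrieve-answer pipeline just described can be sketched framework-free. In this toy version, keyword overlap stands in for embedding similarity and the chunker is deliberately naive:

```python
def chunk(text: str, size: int = 40) -> list:
    """Naive fixed-size chunking; real pipelines (e.g., LlamaIndex's
    sentence splitters) chunk along sentence boundaries instead."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def retrieve(query: str, chunks: list, k: int = 2) -> list:
    """Keyword-overlap scoring as a stand-in for embedding similarity."""
    q = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

def answer(query, chunks, llm):
    # Ground the model's answer in the retrieved context only.
    context = "\n".join(retrieve(query, chunks))
    return llm(f"Answer using only this context:\n{context}\n\nQ: {query}")

docs = chunk("The quarterly report shows revenue grew 12 percent. "
             "Headcount stayed flat while churn declined.")
top = retrieve("how much did revenue grow", docs, k=1)
```

Swapping the scoring function for real embeddings and the list for a vector store recovers the standard RAG architecture.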

LlamaIndex also provides a flexible foundation for building intelligent agents — LLM-powered assistants that can use tools, reason through tasks, and operate autonomously when needed. These agents can incorporate retrieval-augmented generation (RAG) pipelines and tap into external tools to handle tasks like research, data extraction, or decision-making. With no constraints on how LLMs are used — whether for autocomplete, chatbots, or full task automation — LlamaIndex simply makes the process more accessible. Its modular components, like data connectors, observability tools, and flexible workflows, support rigorous experimentation and continuous improvement, making it a powerful toolkit for creating robust and intelligent LLM-powered systems. Today, LlamaIndex is used by companies such as Salesforce, KPMG, and Carlyle, among others.

Advantages:

  • Excellent support for retrieval-augmented generation (RAG) and external data sources (databases, PDFs, vector stores). Simplifies data ingestion, chunking, and vector storage.
  • Mature, stable, and well-documented for knowledge-driven applications.
  • AgentWorkflow offers convenient high-level orchestration with concurrent tasks.
  • Balanced trade-off between developer speed and flexibility; excellent choice for intermediate complexity scenarios.
  • Features abstractions for advanced retrieval (e.g., reranking, sub-questions, streaming) and useful tools like sentence splitters and data loaders. Switching between vector stores and embedding models is seamless.

Disadvantages:

  • Historically single-agent focused; multi-agent capabilities still evolving.
  • While not necessarily a disadvantage, AgentWorkflow’s async/event-driven model requires strong asynchronous development skills and tends to make async spread through the codebase (“async contagion”).
  • Does not inherently support the notion of agent “teams” with distinct personas.
  • Can be buggy and inconsistent, especially outside OpenAI defaults.

Use Cases:

  • RAG-focused applications, including advanced question-answering systems and conversational agents and chatbots.
  • Data-intensive multi-step tasks, such as financial/legal analysis involving external knowledge.
  • Projects involving proprietary datasets, particularly in enterprise or highly regulated environments.

CrewAI is a role-based multi-agent framework that assigns specialized tasks to individual agents, streamlining collaboration and enabling production-ready workflows with built-in memory management and sequential or hierarchical execution.

Advantages:

  • Intuitive, high-level abstraction modeling agent interactions as “teams” with clear roles.
  • Extremely user-friendly for rapid prototyping, requiring minimal boilerplate code.
  • Optimized for speed and performance; demonstrated faster execution compared to alternatives like LangGraph.
  • Built-in tools (search, scraping, databases) and various memory types (short-term, long-term, shared).

Disadvantages:

  • Less granular control compared to LangGraph or custom solutions.
  • The “crew” paradigm can limit flexibility for unconventional agent workflows.
  • Newer framework; fewer community examples and troubleshooting resources available.
  • Enterprise production features (monitoring, analytics, security) might involve proprietary or closed-source components.

Use Cases:

  • Clearly defined multi-agent workflows mirroring real-world teams (software development tasks, marketing campaigns, content creation).
  • Business process automation (event planning, resume tailoring, customer support triaging).
  • Beginners or developers needing rapid results, especially for prototypes, hackathons, or MVPs.
  • Enterprises seeking built-in production readiness and high execution efficiency.

AutoGen is a framework centered on dynamic, conversational interactions that facilitates rapid prototyping through autonomous code generation and iterative dialogues, making it suitable for agile development despite potential challenges in debugging.

Advantages:

  • Simplest conceptual model: agents collaborate primarily via natural language conversations.
  • Mature framework developed by Microsoft Research, with established community usage.
  • Flexible conversational control, supporting deterministic or autonomous agent interactions.
  • Robust tool integration and extensibility through AutoGen extensions.

Disadvantages:

  • Conversation-based orchestration can be verbose and inefficient, potentially increasing token usage.
  • Does not provide structured workflows inherently; complex interactions or hierarchies must be explicitly programmed.
  • Future support uncertain due to its research-oriented nature; risk of becoming obsolete or subsumed by other Microsoft products.
  • Heavily dependent on high-quality prompting; requires strong prompt engineering skills.

Use Cases:

  • Multi-agent tasks naturally structured as dialogues (AI debates, pair-programming, critique loops).
  • Chat-based applications involving agents interacting conversationally with users or each other.
  • Research into agent communication, emergent behaviors, and conversational dynamics.
  • Prototyping tasks where conversational structure simplifies orchestration complexity.
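The conversational pattern AutoGen builds on can be sketched as an alternating two-agent loop. This is plain Python rather than AutoGen's API; the stop token and stub agents are assumptions of the sketch:

```python
def converse(agent_a, agent_b, opening, max_turns=6, stop_token="DONE"):
    """Alternate two agents in a dialogue until one emits a stop token
    or the turn budget runs out (the failsafe against endless chatter)."""
    transcript = [("user", opening)]
    speakers = [("writer", agent_a), ("critic", agent_b)]
    for turn in range(max_turns):
        name, agent = speakers[turn % 2]
        reply = agent(transcript)
        transcript.append((name, reply))
        if stop_token in reply:
            break
    return transcript

# Stubbed agents: the critic approves once the draft is long enough.
writer = lambda t: t[-1][1] + "!"
critic = lambda t: "DONE" if len(t[-1][1]) >= 4 else "expand it"
log = converse(writer, critic, "hi")
```

Replacing the stubs with real model calls (each with its own persona prompt) yields the writer/critic loops the use cases above describe.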

Framework-Free

While these frameworks are great tools, your use case might require building your system without any of them. Experienced developers often prefer this route: a custom implementation is harder at first but tends to produce better long-term outcomes.
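As a sketch of what framework-free looks like, here is a minimal agent loop with an injected model callable and a toy tool-calling convention (`TOOL:name:arg` is an assumption of this sketch, not a standard protocol):

```python
def agent_loop(llm, tools, task, max_steps=5):
    """Framework-free agent loop: the model either requests a tool call
    (formatted 'TOOL:name:arg' in this sketch) or returns a final answer."""
    history = [f"task: {task}"]
    for _ in range(max_steps):
        reply = llm(history)
        if reply.startswith("TOOL:"):
            _, name, arg = reply.split(":", 2)
            result = tools[name](arg)
            history.append(f"tool {name} -> {result}")
        else:
            return reply  # final answer
    return "max steps reached"

# Stubbed model: asks for one tool call, then answers with its result.
def fake_llm(history):
    if history[-1].startswith("tool"):
        return history[-1].split("-> ")[1]
    return "TOOL:add:2,3"

tools = {"add": lambda arg: str(sum(int(x) for x in arg.split(",")))}
final = agent_loop(fake_llm, tools, "what is 2+3?")
```

Substituting `fake_llm` with a direct call to your provider's chat API is the entire integration surface; everything else stays under your control.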

Advantages:

  • Maximum flexibility and complete control over agent logic, prompting, workflows, integrations, and execution.
  • Minimal dependencies and a lightweight implementation; better performance, less dependency bloat, and a system tailored specifically to your use case.
  • Direct control over API costs, avoiding the unexpected spending behavior that third-party frameworks can introduce.
  • Strong educational value: deepens understanding of underlying LLM behaviors and best practices.
  • Easier to debug and productionize when you know what you’re doing.
  • Avoids framework abstractions that often reimplement what amounts to 10–20 lines of native code.

Disadvantages:

  • High upfront development effort; you must manually implement prompting, agent management, tool integration, and retrieval components (data loaders, chunkers, retrievers, etc.).
  • Requires substantial expertise in LLMs, LLM APIs, prompting, retrieval techniques, and software architecture.
  • Scaling and maintenance become increasingly complex as the system grows, potentially introducing significant technical debt.
  • Risk of missing out on built-in optimizations provided by specialized frameworks.
  • Fewer pre-built tools; less “plug-and-play.”

Use Cases:

  • Cutting-edge research or highly specialized workflows that existing frameworks cannot support.
  • Small, simple, well-defined task pipelines where framework overhead isn’t justified.
  • Integration into legacy systems or existing infrastructures with stringent requirements (security, performance, compliance).
  • Projects requiring full transparency and auditability.

Frameworks like LlamaIndex and LangChain are great learning tools or for quick MVPs, especially if you’re less experienced or in prototyping mode. For production, most devs eventually opt for custom implementations using direct APIs (OpenAI, Ollama, FAISS/Qdrant, etc.). A hybrid approach is common: use frameworks for narrow tasks (e.g., chunking, loading), but handle retrieval, prompting, and agents yourself. My advice is to begin with a simpler, higher-level framework (CrewAI, AutoGen, or LlamaIndex) for rapid prototyping, and escalate complexity (LangGraph, custom code) only as specific project demands or control requirements dictate.

Conclusion

Building effective multi-agent systems is as much an art as it is a science. The core principles — limited scope, comprehensive context, concrete goals, appropriate hierarchy, complementary model selection, and memory management — have consistently delivered superior results across diverse applications.

As these systems evolve, the most exciting frontier might be architectures that can dynamically reconfigure based on task requirements — essentially, self-organizing multi-agent systems.

I hope these insights help you build more effective multi-agent systems. I would love to hear your experiences and additional insights in the comments below!

