Building distributed multi-framework, multi-agent solutions

Published on 04/01/2025
Last updated on 04/01/2025
In the field of artificial intelligence (AI) development, creating distributed multi-agent software offers both opportunities and challenges. Our recent project explores the use of frameworks like LangGraph, Swarm, and AutoGen to build a cohesive network of interacting agents. Central to this system is a principal service agent developed with LangGraph, which manages user interactions and coordinates subsidiary agents across different frameworks.
Although such a system may appear straightforward to build, the unpredictable nature of Large Language Models (LLMs) and the distinct behaviors of the various frameworks pose significant challenges. This project aims to develop robust, scalable workflows that maximize LLM capabilities while tackling issues like non-determinism, inter-agent communication, observability, and state and error management.
Our experiments reveal four key pillars for reliable agentic workflows: state management, observability, streaming, and robust error handling. To create effective multi-agent systems, let’s explore the design elements of each pillar and the insights we’ve gained from this project.
Architecture

The architecture comprises:
- Service 1: The main workflow coordinator and principal agent, using two LangGraph agents.
- Service 2: Handles Python code writing with three Swarm agents.
- Service 3: Conducts code reviews using three AutoGen agents.
For a task like "sort numbers," service 1 initiates the workflow, service 2 writes the code, and service 3 reviews it, with revisions continuing until service 1 is satisfied or a constraint (such as a maximum number of revision rounds) is reached.
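To make the loop concrete, here is a minimal sketch of the revise-until-satisfied cycle, assuming the two subsidiary services are exposed as hypothetical HTTP endpoints (the URLs, payload shapes, and revision limit below are illustrative assumptions, not the project's actual API):

```python
# A sketch of the principal agent's revise-until-satisfied loop.
# CODER_URL / REVIEWER_URL and the JSON payloads are hypothetical.
import requests

CODER_URL = "http://service-2/write"      # Swarm-based coder service
REVIEWER_URL = "http://service-3/review"  # AutoGen-based reviewer service
MAX_ROUNDS = 3                            # the "constrained" stopping condition

def solve(task: str) -> str:
    code, feedback = "", None
    for _ in range(MAX_ROUNDS):
        code = requests.post(
            CODER_URL, json={"task": task, "feedback": feedback}
        ).json()["code"]
        review = requests.post(REVIEWER_URL, json={"code": code}).json()
        if review["approved"]:            # principal agent is satisfied
            return code
        feedback = review["comments"]     # feed the review into the next round
    return code                           # revision limit reached: best effort
```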

Design pillar one: State management

In multi-framework applications, effective state management is crucial for ensuring seamless execution, data consistency, and efficient resource utilization across subgraphs. The principal agent coordinates multiple subgraphs built with different frameworks, such as LangGraph, AutoGen, and OpenAI Swarm, which introduces challenges, especially in tracking performance and managing state propagation.
Pain points and solutions:
- State propagation:
- Unlike building all three systems with a single framework, building with multiple frameworks breaks automatic state inheritance between the principal agent and subgraphs. This breaks unified tracing and detailed performance monitoring, creating a need to unify traces and map states across frameworks.
- Solution: We introduced custom state channels in each subgraph to expose performance metrics like network time, agent processing time, and LLM call time. The principal agent collects these metrics via private channels to aggregate and track overall performance (a minimal sketch appears after this list).
- Task parameter passing and state partitioning:
- Ensuring relevant state information (e.g., user preferences, execution constraints) is passed only to the required subgraph is vital for preventing overload and minimizing coupling.
- Solution: Parameters are encapsulated in the state and shared with only the relevant subgraphs, maintaining task-specific context.
- State persistence and recall:
- External memory is needed to store intermediate results or state data for long-term workflows, which helps with scaling and minimizing in-memory usage.
- Solution: Offload non-critical state to persistent storage (e.g., Redis, DynamoDB), ensuring minimal runtime overhead and recovery from interruptions.
- Cross-framework compatibility:
- Different frameworks may use incompatible state formats, causing issues when agents interact across frameworks.
- Solution: State transfers include version metadata, and we use standardized formats like JSON or Protocol Buffers to bridge compatibility gaps.
- Resilient state transfer and retry logic:
- Subgraphs may fail due to transient errors (e.g., network timeouts). Without proper state tracking, retries could result in redundant tasks or errors.
- Solution: Capture retry metadata in the state (e.g., attempt count, backoff intervals) to ensure continuity and prevent repeated failures.
- Asynchronous state synchronization:
- Asynchronous workflows in subgraphs can cause synchronization issues between the principal agent and subgraphs.
- Solution: State serves as a shared repository for pending or completed tasks, ensuring smooth synchronization across agents and frameworks.
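To illustrate the custom state channels and retry metadata described above, here is a minimal sketch of a principal-agent state schema in the LangGraph TypedDict style; the field names and structures are illustrative assumptions, not the project's actual schema:

```python
# A sketch of a shared state with per-subgraph performance and retry channels.
from typing import Optional, TypedDict

class PerfMetrics(TypedDict):
    network_time_ms: float        # time waiting on network calls
    agent_time_ms: float          # time spent in agent logic
    llm_time_ms: float            # time waiting on LLM responses

class RetryInfo(TypedDict):
    attempt: int                  # how many times this subgraph has run
    backoff_s: float              # next backoff interval for resilient retries

class PrincipalState(TypedDict):
    task: str                     # the user task, e.g., "sort numbers"
    code: Optional[str]           # output of the coder subgraph
    review: Optional[str]         # output of the reviewer subgraph
    perf: dict[str, PerfMetrics]  # per-subgraph metrics, keyed by subgraph name
    retry: dict[str, RetryInfo]   # per-subgraph retry metadata
```

Each subgraph writes only its own entries in `perf` and `retry`, keeping sharing minimal while letting the principal agent aggregate metrics across frameworks.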
Key principles for state management:
- Minimal sharing: Only necessary state is passed to subgraphs to reduce processing overhead and ensure security.
- Standardized formats: Use interoperable formats (JSON, Protocol Buffers) to ensure cross-framework compatibility.
- External memory: Use external storage for non-critical state to minimize memory usage.
- Security and privacy: Apply encryption or redaction for sensitive state data to protect privacy.
- Consistency and versioning: Ensure backward compatibility and versioning for state transfers between frameworks.
By harmonizing framework differences with custom state channels, version-aware state propagation, and centralized memory, we ensure efficient state management and enable granular performance tracking in multi-framework systems.
Design pillar two: Observability
Problems: Multi-framework multi-agent systems pose several challenges:
- Debugging distributed workflows.
- Performance tuning across frameworks.
- Resource allocation.
- Cost attribution for individual agents and frameworks.
Solution: In addition to providing observability at the framework level, when switching from single-framework to multi-framework multi-agent systems (MF-MAS), we need to provide a unified view of the system that integrates tracing, logging, and subgraph performance metrics.
Unified tracing
Objective: Align all agents and subgraphs under a single trace in your Application Performance Monitoring (APM) or observability tool, such as OpenTelemetry (OTEL).
High-level concept
- Root span creation (principal agent A)
- Principal agent A initializes a root span, e.g., spanA_root.
- Additional child spans may be created during its own processing (e.g., spanA1).
- Trace context propagation
- When principal agent A triggers or calls agent B (a subgraph), it propagates the trace context, which includes:
- Trace ID
- Parent span ID (the ID of spanA_root)
- Trace state or baggage (if any additional metadata is maintained)
- Subgraph span creation (agent B)
- Subgraph B receives the parent’s trace context.
- Subgraph B creates a new child span, e.g., spanB, which references the same Trace ID and parent as spanA_root.
- As a result, all spans—spanA_root, spanA1, and spanB—are part of the same trace in your monitoring tool.
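A minimal sketch of this propagation using the OpenTelemetry Python API follows; the service names are illustrative, and in practice the carrier would typically be the outgoing HTTP headers of the cross-service call:

```python
# Principal agent A: create the root span and inject the trace context
# into a carrier (e.g., outgoing HTTP headers).
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer_a = trace.get_tracer("principal-agent-a")

def call_subgraph_b(payload: dict) -> dict:
    with tracer_a.start_as_current_span("spanA_root"):
        headers: dict = {}
        inject(headers)              # adds traceparent/tracestate to the carrier
        return subgraph_b(payload, headers)

# Subgraph B: extract the parent context and start a child span under it,
# so spanB joins the same trace as spanA_root.
def subgraph_b(payload: dict, headers: dict) -> dict:
    parent_ctx = extract(headers)
    tracer_b = trace.get_tracer("subgraph-b")
    with tracer_b.start_as_current_span("spanB", context=parent_ctx):
        return {"status": "ok"}      # subgraph work happens here
```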

Unified logging
Objective: Ensure all logs from different agents/subgraphs can be tied together using a common transaction ID or correlation key. Unified logging is particularly critical in large systems where multiple microservices, agents, or subgraphs handle portions of a single task.
- Transaction ID sharing
- Whenever a principal agent initiates a request, it generates a transaction ID (or uses an existing one).
- This ID is included in every log statement by both the principal agent and subgraph B.
- This approach allows operators or developers to filter logs by the same ID, quickly seeing all related actions for a single user request or workflow.
- Practical example
Principal agent A logs:
[transaction_id=12345] Starting request for user 6789...
Subgraph B logs:
[transaction_id=12345] Received request from principal agent A. Fetching data...
The logs can be aggregated by the transaction_id=12345, showing a continuous timeline across multiple services or agents.
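One lightweight way to implement this in Python is a logging filter that stamps every record with the current transaction ID; the contextvar-based propagation below is one option among several, sketched here for illustration:

```python
# A sketch of transaction-ID tagging using the standard logging module.
import logging
from contextvars import ContextVar

transaction_id: ContextVar[str] = ContextVar("transaction_id", default="-")

class TransactionFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.transaction_id = transaction_id.get()  # stamp every record
        return True

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("[transaction_id=%(transaction_id)s] %(message)s")
)
handler.addFilter(TransactionFilter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

transaction_id.set("12345")
logging.info("Starting request for user 6789...")
# -> [transaction_id=12345] Starting request for user 6789...
```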

Subgraph performance metrics
Objective: Provide the principal agent (and system operators) with detailed performance data for each subgraph’s execution. In a user-facing application, partial or final performance data can be displayed, indicating how the system spent time (particularly useful in advanced troubleshooting scenarios).
- Performance breakdown
- Execution time: Total time spent in the subgraph handling a request.
- Network latency: Time spent waiting for network responses (e.g., from APIs or internal services).
- LLM processing time: If the subgraph calls an LLM, measure how long the LLM takes to respond.
- Tool invocation time: Time spent on external tool calls (e.g., databases, third-party APIs).
- Reporting back to principal agent
- Once subgraph B completes its portion of the work, it returns performance metrics to principal agent A.
- A structured format (e.g., JSON) can be used:
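For example, a hypothetical metrics payload (field names are illustrative):

```json
{
  "subgraph": "B",
  "execution_time_ms": 1840,
  "network_latency_ms": 320,
  "llm_time_ms": 1350,
  "tool_time_ms": 170
}
```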

- The principal agent aggregates these metrics across multiple subgraphs or micro-agents.
Putting it all together
When switching from a single-framework multi-agent system (like LangGraph) to a multi-framework multi-agent system, we end up with fragmented traces. We needed to unify them under a single global trace ID, and did so using OpenTelemetry and LangSmith. The figure below shows the approach used.

- Start of request
- Principal agent A creates a root span (spanA_root) and logs a transaction ID (e.g., 12345).
- Execution steps
- Principal agent A performs some tasks and logs/traces them with child spans (spanA1).
- Principal agent A calls subgraph B, passing along:
- Trace context (for unified tracing)
- Transaction ID (for unified logging)
- Subgraph B creates a child span (spanB) referencing spanA_root.
- Subgraph B logs actions using transaction_id=12345 and gathers performance metrics.
- Completion and aggregation
- Subgraph B returns performance data to principal agent A.
- The principal agent aggregates the operation’s spans and metrics in the observability/monitoring solution.
- End users, developers, or operations personnel can see a single, continuous trace and log history that includes performance details for each step.
By consistently sharing trace context, unified logging information, and performance metrics, a multi-framework application can deliver end-to-end visibility, simplify debugging, and improve reliability. This holistic approach ensures each piece of the puzzle is linked together, revealing how requests flow and how they perform across distributed agents.
Design pillar three: Streaming
In LLM-driven systems, streaming allows partial outputs to be delivered in real time, enhancing user experience, especially for long or complex tasks. The challenge arises when events or messages from agents in different subgraphs need to be bubbled up through a central orchestrator to the end user. Due to differences in frameworks, custom engineering solutions are needed for real-time streaming.
Levels of streaming
Level 0 (no streaming): The user gets a complete response after processing. Simple but slow for complex tasks.
Level 1 (synthetic progress): Handcrafted updates like "Working on it…" give users a sense of progress but lack real-time LLM outputs.
Level 2 (subgraph updates): Intermediate agents send updates (e.g., “Processing…”) without streaming LLM outputs, useful for transparency.
Level 3 (full streaming): Real-time token generation from the LLM streamed to the user, offering the most interactive experience.
We recommend implementing streaming across frameworks using a multi-stage event propagation flow:

- Subgraph component: An event originates in a component of a subgraph.
- Backend streaming: The event is streamed through Swarm- or AutoGen-specific methods (.stream, run_and_stream, etc.) or a comparable mechanism.
- Server-sent events (SSE): The backend pushes this event as an SSE to the REST API client.
- Client processing: The principal multi-agent system (MAS) receives the SSE, transforms it, and dispatches it as a custom callback event to the .astream_events() method in LangGraph, which then surfaces it to the end user.
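Here is a minimal sketch of the backend half of this flow, assuming a FastAPI service that wraps a subgraph's streaming output in SSE framing; the endpoint, event shapes, and run_subgraph generator are illustrative assumptions:

```python
# A sketch of pushing subgraph events to the principal MAS as SSE.
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def run_subgraph(task: str):
    # Placeholder for framework-specific streaming (.stream, run_and_stream, ...)
    yield {"type": "progress", "detail": "Processing..."}
    yield {"type": "result", "detail": f"done: {task}"}

@app.get("/stream")
async def stream(task: str):
    async def sse():
        async for event in run_subgraph(task):
            # SSE framing: a "data:" line per event, terminated by a blank line
            yield f"data: {json.dumps(event)}\n\n"
    return StreamingResponse(sse(), media_type="text/event-stream")
```

On the client side, the principal MAS would parse each `data:` line and re-dispatch it as a custom callback event so that LangGraph surfaces it through `.astream_events()`.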
To enhance user experience, consider streaming based on system capabilities, from synthetic updates to real-time LLM-token streaming. Even without full streaming support, incremental updates or progress messages can provide valuable feedback. The goal is to align each layer of the system with the chosen level of streaming.
Design pillar four: Error handling

Multi-framework agentic applications integrating LangGraph, AutoGen, and OpenAI Swarm face challenges in error handling because each framework represents errors differently. Errors in tools like Coder or Reviewer must navigate these differences and cascade through interconnected systems to reach the end user without losing context or clarity. The following list covers key error types that are likely to arise.
Error types
LLM misinformation
When an LLM generates incorrect responses (hallucinations) in one framework, these can propagate to tools or agents in other frameworks, amplifying misinformation across the system.
Tool call issues
Incorrect tool invocations, missing calls, or malformed parameters can block workflows across frameworks. For example, if OpenAI Swarm relies on AutoGen's Reviewer Subgraph, a failed tool call can break the entire chain.
Convergence failures
Systems may fail to reach solutions due to:
- Conflicting definitions of done across frameworks
- Unresolved task delegation
- Circular dependencies in cross-framework decision-making
Communication breakdowns
Cross-framework communication can fail when:
- Agents misinterpret messages
- Data formats are incompatible
- Updates fail due to protocol differences
Environment mismatches
Incompatible versions of libraries, APIs, or runtime environments across frameworks can cause silent failures or crashes, especially when inter-framework dependencies are critical.
Resource management
Multiple agents calling shared services simultaneously can trigger rate limits or resource exhaustion, disrupting workflows across frameworks.
To mitigate these issues, a robust error-handling strategy is needed to normalize error data while preserving the core issue's meaning. At the principal agent node, errors should be translated into user-friendly messages that explain the root cause and suggest next steps, ensuring a seamless and intuitive user experience.
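As one possible shape for this strategy, here is a minimal sketch of a shared error envelope that each framework adapter fills in before errors cascade upward; the schema, categories, and messages are illustrative assumptions:

```python
# A sketch of normalizing framework-specific errors into one envelope
# the principal agent can translate into user-facing messages.
from dataclasses import dataclass

@dataclass
class NormalizedError:
    framework: str       # "langgraph" | "autogen" | "swarm"
    category: str        # e.g., "tool_call", "convergence", "rate_limit"
    detail: str          # original error text, preserved for operators
    user_message: str    # friendly root-cause explanation plus a next step

def normalize(framework: str, exc: Exception) -> NormalizedError:
    text = str(exc)
    if "rate limit" in text.lower():
        return NormalizedError(framework, "rate_limit", text,
            "The system is briefly over capacity; please retry in a moment.")
    return NormalizedError(framework, "unknown", text,
        "Something went wrong while processing your request; please try again.")
```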
Conclusion and future directions
Our work offers early insights into the operational challenges developers may face when building agentic systems over a network. A defining trait of multi-agent software is its ability to operate seamlessly across an expanding ecosystem of agent frameworks, such as LangGraph, AutoGen, Swarm, CrewAI, and others. This interoperability empowers developers to integrate and build upon agentic applications created with different frameworks—a capability that is crucial for achieving the vision of an Internet of Agents.
