The AGNTCY has been working to build the standards, protocols, and tools for the Internet of Agents: an open, interoperable internet for agent-to-agent collaboration across the entire multi-agent software lifecycle of Discover, Compose, Deploy, and Evaluate. We are pleased to announce that we have now released the foundational components of the Evaluate phase: the observability data schema and SDK. By making these available as open source, the AGNTCY is beginning to realize the vision of end-to-end functionality for the Internet of Agents.
These components represent an important first step toward giving artificial intelligence (AI) application developers the deep, end-to-end visibility required to evaluate the overall system's quality, validate its accuracy, and understand its decision-making processes. This blog discusses the AGNTCY's new observability and evaluation framework. You'll gain insight into the proposed architecture and the framework metrics we believe are critical for assessing the entire agentic workflow.
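To make the idea concrete, here is a minimal sketch of the span-based instrumentation such an SDK enables, using the OpenTelemetry Python API as a stand-in; the actual AGNTCY SDK, attribute names, and exporter configuration may differ.

```python
# Minimal sketch of span-based agent instrumentation.
# OpenTelemetry is used as a stand-in; the AGNTCY SDK's API may differ.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Configure a tracer that prints finished spans to stdout for demonstration.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-demo")

def run_agent_task(agent_id: str, task: str) -> str:
    # Each agent action becomes a span; attributes carry the
    # evaluation-relevant context (agent identity, task, outcome).
    with tracer.start_as_current_span("agent.task") as span:
        span.set_attribute("agent.id", agent_id)
        span.set_attribute("agent.task", task)
        result = f"{agent_id} completed: {task}"  # placeholder agent logic
        span.set_attribute("agent.outcome", "success")
        return result

print(run_agent_task("planner-1", "summarize quarterly report"))
```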
Multi-agent software (MAS) refers to systems in which autonomous AI agents work together to perform tasks, make decisions, and reach goals. These agents behave independently but collaborate for optimal outcomes.
It's this combination of autonomy and collaboration that makes these systems uniquely suited to solving complex, multifaceted challenges, and it is also what demands the ability to evaluate and observe the software's performance.
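As a toy illustration of that autonomy-plus-collaboration pattern, the sketch below shows two agents that each decide independently how to handle a subtask before their results are combined; the agent names and logic are invented purely for illustration.

```python
# Toy illustration of autonomous agents collaborating on a shared goal.
# Agent names and logic are invented for illustration only.
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    skill: str

    def handle(self, subtask: str) -> str:
        # Each agent decides independently how to act on its subtask.
        return f"[{self.name}/{self.skill}] handled '{subtask}'"

def collaborate(agents: list[Agent], goal: str) -> list[str]:
    # Naive coordinator: split the goal into one subtask per agent,
    # let each agent act autonomously, then gather the results.
    subtasks = [f"{goal} (part {i + 1})" for i in range(len(agents))]
    return [agent.handle(sub) for agent, sub in zip(agents, subtasks)]

team = [Agent("researcher", "retrieval"), Agent("writer", "synthesis")]
for line in collaborate(team, "draft a market summary"):
    print(line)
```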
We believe organizations must have deep insight into how multi-agent software operates, interacts, and makes decisions. Evaluation and observability help developers and IT teams understand system performance and see what happens under the hood of these dynamic systems.
This level of insight ensures that MAS-driven initiatives remain reliable, efficient, and adaptable.
One of the greatest challenges in AI adoption is distrust of its decision-making processes. MAS observability and evaluation demystify AI decisions by providing transparent, explainable data. Knowing why the software chose one action over another builds trust with both internal and external stakeholders.
For instance, in industries like healthcare and finance, AI decisions must be explainable to ensure regulatory compliance and maintain user confidence. Observability provides a crucial foundation for this accountability.
MAS observability and evaluation shine at uncovering performance challenges before they escalate into bigger problems. By tracking KPIs such as response times and agent collaboration success rates, teams can optimize performance.
Real-time monitoring of agent interactions ensures that MAS behavior aligns with expected outcomes. Observability and evaluation equip developers with actionable insights, such as highlighting bottlenecks within an agent's task sequence or identifying performance inefficiencies in data processing.
This proactive debugging capability minimizes downtime and fosters continuous improvement.
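Here is a hedged sketch of what such bottleneck detection might look like in practice: given a set of completed span records, flag the slowest step in each agent's task sequence. The record fields are illustrative and not drawn from the AGNTCY schema.

```python
# Sketch: surface the slowest step per agent from completed span records.
# The record fields below are illustrative, not the AGNTCY schema.
from collections import defaultdict

spans = [
    {"agent": "planner", "step": "decompose", "duration_ms": 120},
    {"agent": "planner", "step": "schedule", "duration_ms": 840},
    {"agent": "retriever", "step": "search", "duration_ms": 2300},
    {"agent": "retriever", "step": "rank", "duration_ms": 310},
]

# Group spans by agent, then report each agent's slowest step
# as a bottleneck candidate worth investigating.
by_agent = defaultdict(list)
for span in spans:
    by_agent[span["agent"]].append(span)

for agent, agent_spans in by_agent.items():
    slowest = max(agent_spans, key=lambda s: s["duration_ms"])
    total = sum(s["duration_ms"] for s in agent_spans)
    print(f"{agent}: slowest step '{slowest['step']}' "
          f"({slowest['duration_ms']} ms of {total} ms total)")
```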
Regulatory requirements and ethical considerations are critical in industries using MAS. Observability and evaluation serve as linchpins for compliance by logging system decisions, tracking data lineage, and ensuring alignment with industry standards. From GDPR requirements in Europe to emerging AI regulations globally, an observable MAS enables organizations to maintain compliance seamlessly.
Stakeholders, whether clients or investors, need to see a clear ROI and understand an AI system’s reliability. With observability and evaluation tools, AI teams can confidently showcase a MAS's value using evidence-backed metrics. This fosters stakeholder confidence and encourages further adoption of AI-based solutions.
The AGNTCY is working to introduce the multiple levels of visibility that are required: pipeline and workflow monitoring, model and agent behavior, and user-facing outcomes.
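To show how those three levels might map onto trace data, here is a minimal, hypothetical record structure. It sketches the idea only; it is not the released AGNTCY observability schema.

```python
# Hypothetical record types for the three visibility levels described above.
# A sketch of the idea only, not the released AGNTCY observability schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class WorkflowEvent:
    # Pipeline/workflow level: which step ran, and how long it took.
    workflow_id: str
    step: str
    duration_ms: float

@dataclass
class AgentEvent:
    # Model/agent behavior level: what the agent was asked and decided.
    agent_id: str
    model: str
    prompt_tokens: int
    decision: str

@dataclass
class OutcomeEvent:
    # User-facing outcome level: what the user received, plus feedback.
    session_id: str
    answer: str
    user_rating: Optional[int] = None

@dataclass
class TraceRecord:
    # One end-to-end trace stitches all three levels together.
    trace_id: str
    workflow: list[WorkflowEvent] = field(default_factory=list)
    agents: list[AgentEvent] = field(default_factory=list)
    outcome: Optional[OutcomeEvent] = None
```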
In addition to the observability schema definition, our efforts are directed toward several related areas of the framework, including the metrics discussed next.
At the heart of MAS observability and evaluation lies a robust set of metrics. These metrics offer insights into system behaviors, outcomes, and bottlenecks, and they define what you need to measure when evaluating multi-agent software.
Refining multi-agent software against key observability metrics helps it achieve its full potential. From improving workflow efficiency to reducing error rates and enhancing user satisfaction, these metrics build confidence in your system's performance and reliability.
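As a small example of turning raw run records into aggregates like these (error rate, latency, user satisfaction), here is a hedged sketch; the metric definitions, field names, and threshold are illustrative choices rather than a prescribed standard.

```python
# Sketch: aggregate evaluation metrics from per-run records.
# Metric definitions, field names, and threshold are illustrative only.
runs = [
    {"ok": True, "latency_ms": 900, "user_rating": 5},
    {"ok": True, "latency_ms": 1400, "user_rating": 4},
    {"ok": False, "latency_ms": 3100, "user_rating": 1},
]

total = len(runs)
error_rate = sum(1 for r in runs if not r["ok"]) / total
avg_latency = sum(r["latency_ms"] for r in runs) / total
avg_rating = sum(r["user_rating"] for r in runs) / total

print(f"error rate:      {error_rate:.0%}")
print(f"avg latency:     {avg_latency:.0f} ms")
print(f"avg user rating: {avg_rating:.1f} / 5")

# A simple evaluation gate: flag a regression if errors exceed a threshold.
assert error_rate <= 0.5, "error rate above threshold"
```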
At the core, observability and evaluation transform AI from a black-box enigma into a transparent and accountable innovation. For AI developers, IT leaders, and DevOps teams, the importance of implementing observability and evaluation can’t be overstated. AGNTCY's new observability data schema and SDK are designed to get you started.
We invite you to join the AGNTCY's working group to help accelerate the framework's standardization. If you are interested in contributing to the work, we'd love to hear from you. Contact us to get involved!