Internet-Draft | Agent Task Management | July 2025
Xie & Li | Expires 8 January 2026
This document specifies the multimodal requirements for an Agent-to-Agent protocol, which enables autonomous agents to establish multi-channel communication sessions, negotiate heterogeneous data capabilities (e.g., text, files, real-time audio/video streams, sensor streams), and exchange synchronized multimodal content with adaptive QoS policies.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 8 January 2026.¶
Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
This document articulates the technical imperative for multimodal interaction to be natively supported in the protocols defining and standardizing interoperability among Artificial Intelligence Agents (AI Agents). The core rationale stems from the evolution of modern Large Language Models (LLMs) into multimodal models capable of processing, generating, and understanding multiple media types. Consequently, such protocols must support establishing multimodal channels for transmitting multimodal media.¶
A typical use case in agent-agent communication using multimodal media:¶
A Housekeeping Robot Agent sends a task to a Monitoring Robot Agent to detect incidents. The two Agents negotiate the necessary multimodal media channels and establish a session. When the Monitoring Robot Agent detects a glass-breaking incident in the kitchen, it transmits audio and video streams to the Housekeeping Robot Agent and simultaneously generates and sends a text alert ("CRITICAL: Glass break at 2025-10-05T14:30:15Z") to the Housekeeping Robot Agent through the established multimodal session.¶
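The text alert in this use case can be pictured as a small structured event carried alongside the media streams. The sketch below is purely illustrative; the field names and the `AlertEvent` type are assumptions of this example, not a wire format defined by this document.

```python
from dataclasses import dataclass

# Hypothetical structure for the text alert the Monitoring Robot Agent
# sends alongside its audio/video streams; all field names are assumed
# for illustration only.
@dataclass
class AlertEvent:
    severity: str
    description: str
    location: str
    timestamp: str  # ISO 8601, on the same clock as the media streams

    def render(self) -> str:
        # Produce the human-readable alert line from the use case.
        return f"{self.severity}: {self.description} at {self.timestamp}"

alert = AlertEvent(
    severity="CRITICAL",
    description="Glass break",
    location="kitchen",
    timestamp="2025-10-05T14:30:15Z",
)
print(alert.render())  # CRITICAL: Glass break at 2025-10-05T14:30:15Z
```

Carrying the timestamp on the same clock as the media streams is what later allows the receiver to correlate the alert with the corresponding audio/video frames.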
With the rapid development of LLM technologies, LLMs have gradually evolved from supporting a single modality, such as text, to supporting multiple modalities, such as text, images, and video clips, and combinations thereof. In particular, LLMs supporting real-time audio and video streams have emerged recently, and LLM capabilities continue to grow richer and more complete.¶
An Agent often needs to understand data in multiple modalities, such as environment data, context data, audio, and video, to better understand and execute tasks. Therefore, various kinds of multimodal data need to be transmitted between Agents.¶
The multimodal interaction capabilities supported by mainstream Agent communication protocols in the industry are summarized as follows:¶
Therefore, a general Agent communication protocol should support rich multimodal interaction, including text, files, real-time audio/video streams, etc.¶
When agents need to transmit multimodal media, especially real-time audio and video streams, a dedicated media channel is required to carry these streams. Meanwhile, the audio and video codecs supported by different agents may vary. Therefore, audio and video codecs MUST be negotiated between agents before transmission.¶
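The codec negotiation requirement can be sketched as a simple offer/answer intersection, in the spirit of SDP offer/answer (RFC 3264). The list-based model and the function below are illustrative assumptions, not a mechanism specified by this document.

```python
def negotiate_codecs(offered, supported):
    """Return the codecs both agents can use, in the offerer's
    preference order. A minimal sketch of offer/answer codec
    selection; not a wire format defined by this document."""
    supported_set = set(supported)
    return [codec for codec in offered if codec in supported_set]

# Offerer prefers Opus audio and AV1 video; the answerer supports
# only a subset, so the intersection keeps the offerer's ordering.
offer = ["opus", "AV1", "H.264", "PCMU"]
answer = negotiate_codecs(offer, supported=["H.264", "opus", "VP8"])
print(answer)  # ['opus', 'H.264']
```

If the intersection is empty, the agents share no codec and the media channel cannot be established, which is why negotiation must complete before transmission begins.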
There are many types of multimodal data, and their transmission requirements differ, which requires transmitting them over different streams.¶
To reduce the IP port resource overhead caused by per-stream connections, a multi-stream multiplexing capability needs to be supported, and different transmission priorities can be set for different streams.¶
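The multiplexing-with-priorities requirement can be sketched as a single-connection scheduler that always sends the highest-priority pending frame first. The class and its API below are hypothetical illustrations, not part of any specified protocol.

```python
import heapq
import itertools

class StreamMultiplexer:
    """Multiplex several logical streams over one connection,
    dequeuing higher-priority frames first (lower number = higher
    priority). A sketch of the requirement; the API is assumed."""

    def __init__(self):
        self._queue = []
        self._seq = itertools.count()  # FIFO tie-break within a priority

    def enqueue(self, stream_id, priority, frame):
        # All streams share one queue, so only one connection (and one
        # IP port) is consumed regardless of the number of streams.
        heapq.heappush(self._queue, (priority, next(self._seq), stream_id, frame))

    def next_frame(self):
        priority, _, stream_id, frame = heapq.heappop(self._queue)
        return stream_id, frame

mux = StreamMultiplexer()
mux.enqueue("video", priority=2, frame=b"vframe-1")
mux.enqueue("audio", priority=0, frame=b"aframe-1")
mux.enqueue("text", priority=1, frame=b"alert-1")
print([mux.next_frame()[0] for _ in range(3)])  # ['audio', 'text', 'video']
```

Giving audio the highest priority in this sketch reflects the later observation that audio quality should be protected when bandwidth degrades.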
In some use cases, multimodal data transmitted between agents needs to be synchronized. As shown in the example use case, when the housekeeping robot Agent assigns environment monitoring to the monitoring robot Agent, the audio stream, video stream, and text events collected by the monitoring robot Agent need to be synchronized so that the housekeeping robot can better understand the monitoring result.¶
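One common way to meet this synchronization requirement on the receiving side is to pair frames from different streams whose timestamps fall within a small tolerance. The sketch below assumes all streams share one clock (as RTP receivers achieve via RTCP sender reports); the function and its tolerance value are illustrative, not specified here.

```python
def synchronize(audio_ts, video_ts, tolerance=0.040):
    """Pair audio and video frames whose timestamps (in seconds)
    differ by at most `tolerance`, as a receiver might do when
    presenting multimodal data together. Assumes both timestamp
    lists are sorted and share a common clock."""
    pairs, j = [], 0
    for a in audio_ts:
        # Skip video frames that are too old to match this audio frame.
        while j < len(video_ts) and video_ts[j] < a - tolerance:
            j += 1
        if j < len(video_ts) and abs(video_ts[j] - a) <= tolerance:
            pairs.append((a, video_ts[j]))
    return pairs

# The third audio frame finds no video frame within tolerance.
matched = synchronize([0.00, 0.10, 0.20], [0.01, 0.12, 0.35], tolerance=0.03)
print(matched)  # [(0.0, 0.01), (0.1, 0.12)]
```

The same pairing logic extends naturally to text events such as the glass-break alert, provided they carry timestamps on the shared clock.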
In other use cases, multimodal data may require coordinated transmission policies. For example, when audio and video streams are transmitted at the same time and video frames freeze because of a decrease in connection bandwidth, the video resolution and bitrate need to be automatically reduced to preserve audio stream transmission quality.¶
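The coordination policy described above can be sketched as a shared bandwidth budget in which audio is protected and video absorbs any shortfall. The function and all bitrate figures below are illustrative assumptions, not recommendations of this document.

```python
def adapt_bitrates(available_kbps, audio_kbps=64, video_max_kbps=2500,
                   video_min_kbps=150):
    """Split a shared bandwidth budget between audio and video,
    protecting audio quality and letting video absorb the loss.
    A sketch of the coordination policy; numbers are assumptions."""
    if available_kbps <= audio_kbps + video_min_kbps:
        # Not enough headroom for usable video: keep audio, pause video.
        return {"audio": min(audio_kbps, available_kbps), "video": 0}
    # Audio gets its full rate; video takes whatever remains, capped.
    video = min(video_max_kbps, available_kbps - audio_kbps)
    return {"audio": audio_kbps, "video": video}

print(adapt_bitrates(3000))  # {'audio': 64, 'video': 2500}
print(adapt_bitrates(500))   # {'audio': 64, 'video': 436}
print(adapt_bitrates(100))   # {'audio': 64, 'video': 0}
```

In a real session this policy would run whenever a bandwidth estimate changes, feeding the resulting video bitrate back to the encoder so resolution and frame rate can be lowered together.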
Multimodal interaction constitutes a critical function for multi-agent collaboration. This document discusses the necessity of introducing multimodal interaction to support Agent collaboration. It then analyzes the requirements that multimodal interaction imposes on AI Agent protocol design, specifically concerning multimodal media channel establishment and multimodal media transmission.¶
This memo includes no request to IANA.¶
This document should not affect the security of the Internet.¶