| Internet-Draft | BIER Extension or New Protocol for AI P2 | July 2026 |
| McBride, et al. | Expires 2 January 2027 | [Page] |
AI workloads in data centers exhibit inherently point-to-multipoint (P2MP) communication patterns, particularly during collective operations such as AllReduce, AllGather and broadcast in distributed training. Unicast replication of these flows does not scale to large GPU clusters. This document analyzes two architectural approaches to addressing this problem: extending BIER (Bit Index Explicit Replication) to support AI P2MP requirements, or defining a new purpose-built protocol. The tradeoffs of each approach are discussed, including considerations around ACK aggregation, congestion control, RoCE/RDMA compatibility and operational complexity. This document does not define a protocol but is intended instead to help the mcast4ai community evaluate this problem space.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 2 January 2027.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Distributed AI training workloads generate communication patterns that are structurally point-to-multipoint (P2MP). During collective operations, a single sender must deliver identical data to a large number of GPU receivers simultaneously. At the scale of modern GPU clusters (hundreds to thousands of endpoints), replicating these flows via unicast imposes a significant and unnecessary burden on the network.¶
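As a rough illustration of the unicast burden, the sketch below compares the bytes a sender must place on its uplink under unicast replication with the single copy needed when the network replicates. The function name `sender_bytes` and the 64 MiB buffer size are hypothetical, and the multicast case assumes ideal in-network replication with no retransmissions.

```python
def sender_bytes(payload_bytes: int, receivers: int, multicast: bool) -> int:
    """Bytes the sender transmits so that every receiver gets the payload."""
    # With in-network replication the sender emits one copy; with unicast
    # replication it emits one copy per receiver.
    copies = 1 if multicast else receivers
    return payload_bytes * copies


payload = 64 * 1024 * 1024  # hypothetical 64 MiB collective buffer
for n in (8, 512, 4096):
    ratio = sender_bytes(payload, n, False) // sender_bytes(payload, n, True)
    print(f"{n} receivers: unicast sends {ratio}x the multicast volume")
```

The sender-side cost of unicast grows linearly with group size, which is the scaling problem this document is concerned with.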
BIER [RFC8279] provides a stateless multicast forwarding architecture that has seen deployment in WAN and some DC environments. A natural question is whether BIER can be extended to satisfy the requirements of AI P2MP traffic, or whether the unique characteristics of AI workloads, including tight latency requirements, RDMA transport semantics, ACK aggregation and congestion control, necessitate a new protocol. This document explores both approaches.¶
While IETF practice has historically favored extending existing protocols where feasible, there are cases where the requirements diverge sufficiently from the design assumptions of an existing protocol that a new protocol is more appropriate. This document aims to provide an objective basis for that determination.¶
The following requirements are specific to AI data center P2MP and distinguish this problem space from traditional multicast use cases.¶
AI training collectives (AllReduce, AllGather, ReduceScatter, Broadcast) require simultaneous delivery of data to all members of a communication group. For dense model training, group membership is established at job initialization and remains stable throughout the training run. In Mixture of Experts (MoE) architectures, All-to-All collectives implement token-dependent expert routing in which each token is dispatched to a small subset of experts (typically the top K of N total), so the active receiver set changes with every token or batch. The full set of expert endpoints is fixed for the duration of the job, but the per-packet routing destination is highly dynamic. Both cases include P2MP communication patterns: although the MoE All-to-All expert dispatch collective is many-to-many rather than P2MP, gradient synchronization across expert replicas uses AllReduce over fixed groups and benefits from P2MP multicast. MoE expert dispatch introduces the additional requirement that the forwarding mechanism support dynamic, per-token receiver subsets.¶
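The per-token receiver subset in MoE dispatch can be sketched as a bitmask over a fixed expert set. The helper `topk_receiver_mask` and the gate scores below are hypothetical and serve only to show that the set of endpoints stays fixed while the receiver mask changes from token to token.

```python
def topk_receiver_mask(scores: list[float], k: int) -> int:
    """Return a bitmask with bit i set if expert i is among the k highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    mask = 0
    for expert in ranked[:k]:
        mask |= 1 << expert
    return mask


# Two tokens routed over the same four experts select different subsets.
token_a = topk_receiver_mask([0.10, 0.70, 0.05, 0.15], k=2)  # experts 1 and 3
token_b = topk_receiver_mask([0.60, 0.05, 0.30, 0.05], k=2)  # experts 0 and 2
print(f"{token_a:04b} {token_b:04b}")
```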
GPU training steps are synchronization points. Straggler effects, where a single slow receiver delays the entire collective, make tail latency a critical metric. The multicast mechanism should minimize added latency and jitter relative to optimal unicast paths.¶
AI workloads primarily rely on RDMA (Remote Direct Memory Access) today for high-throughput, low-latency data transfer. In Ethernet-based AI data centers, RoCEv2 (RDMA over Converged Ethernet) is the dominant transport. RoCEv2 uses the InfiniBand Base Transport Header (BTH) encapsulated in UDP. Any multicast solution should be compatible with this transport model, or should define a clear interworking point.¶
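To make the transport model concrete, the following minimal sketch extracts the fields of the 12-byte BTH that matter for this discussion (opcode, destination QP, AckReq bit and PSN) from a RoCEv2 UDP payload. The `Bth` class and `parse_bth` function are illustrative names, not part of any RDMA library, and the sketch ignores any extended transport headers that follow the BTH.

```python
import struct
from dataclasses import dataclass

ROCEV2_UDP_PORT = 4791  # UDP destination port used by RoCEv2
BTH_LEN = 12            # the Base Transport Header is 12 bytes long


@dataclass
class Bth:
    opcode: int    # 8-bit opcode (transport type plus operation)
    dest_qp: int   # 24-bit destination queue pair number
    ack_req: bool  # AckReq bit
    psn: int       # 24-bit packet sequence number


def parse_bth(udp_payload: bytes) -> Bth:
    """Parse the BTH found at the start of a RoCEv2 UDP payload."""
    if len(udp_payload) < BTH_LEN:
        raise ValueError("payload shorter than a BTH")
    # Layout: opcode(8) | SE,M,PadCnt,TVer(8) | P_Key(16) |
    #         reserved(8) + DestQP(24) | AckReq(1) + reserved(7) + PSN(24)
    opcode, _flags, _pkey, qp_word, psn_word = struct.unpack(
        "!BBHII", udp_payload[:BTH_LEN]
    )
    return Bth(
        opcode=opcode,
        dest_qp=qp_word & 0xFFFFFF,
        ack_req=bool(psn_word >> 31),
        psn=psn_word & 0xFFFFFF,
    )
```

Any multicast encapsulation placed in front of this header, or any interworking function, would need to leave these fields intact for the receiving RDMA NIC.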
In unicast RDMA, each receiver generates an ACK per received packet or message. In P2MP multicast, N receivers each sending individual ACKs back toward the sender create an ACK implosion problem, which loads both the network (N reverse unicast flows) and the sender's RDMA NIC (which must process N individual ACKs). An ACK aggregation mechanism is therefore required that collapses the N ACKs into a single ACK before they reach the sender's RDMA NIC, while preserving RoCE ACK semantics, including the BTH opcode and PSN fields.¶
RoCEv2 environments typically rely on Priority Flow Control (PFC) and DCQCN for congestion management. A multicast solution should either integrate with these mechanisms or define equivalent behavior. Packet spraying, increasingly adopted for load balancing (e.g., Nvidia Adaptive Routing, Broadcom DLB/GLB), complicates per-flow congestion signals.¶
GPU clusters in large AI data centers may contain thousands of endpoints organized into multiple communication groups running simultaneously. The multicast solution should scale in terms of group state, replication overhead and control plane complexity.¶
BIER [RFC8279] encodes the set of intended receivers as a bitstring in the packet header. Each router in the BIER domain forwards and replicates packets based on the bitstring without maintaining per-flow or per-group state. This stateless property is a key operational advantage.¶
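The stateless replication step can be sketched as follows, loosely following the forwarding procedure in RFC 8279: find the lowest set bit, look up its Forwarding Bit Mask (F-BM) and neighbor, send a copy whose bitstring is masked by the F-BM, and clear those bits from the working bitstring. The table contents, neighbor names and the function `bier_forward` are hypothetical, bitstrings are modeled as Python integers, and local delivery, sets and sub-domains are omitted.

```python
def bier_forward(
    bitstring: int, bift: dict[int, tuple[int, str]]
) -> list[tuple[str, int]]:
    """Return (neighbor, bitstring) pairs for the copies a BFR would send.

    bift maps a 1-based bit position to (F-BM, neighbor).
    """
    copies = []
    remaining = bitstring
    while remaining:
        low = remaining & -remaining   # isolate the lowest set bit
        position = low.bit_length()    # 1-based bit position
        entry = bift.get(position)
        if entry is None:
            remaining &= ~low          # no route for this receiver; skip it
            continue
        fbm, neighbor = entry
        copies.append((neighbor, remaining & fbm))
        # Clear every bit served by this neighbor (and the current bit, as a
        # guard against a malformed table) so no receiver is served twice.
        remaining &= ~(fbm | low)
    return copies


# Receivers 1-2 are reached via neighbor "A", receivers 3-4 via neighbor "B".
bift = {1: (0b0011, "A"), 2: (0b0011, "A"), 3: (0b1100, "B"), 4: (0b1100, "B")}
print(bier_forward(0b1011, bift))
```

Note that the only state consulted is the per-receiver forwarding table; nothing in it depends on the flow or group, which is the property the rest of this analysis weighs against ACK aggregation state.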
[I-D.zzhang-bier-optimized] describes optimizations to BIER targeting AI data center environments, including mechanisms for ACK aggregation within the BIER forwarding plane.¶
The following table summarizes the tradeoffs across key dimensions.¶
| Dimension | Extend BIER | New Protocol |
|---|---|---|
| ACK aggregation | Requires significant extension; conflicts with stateless model | Can be designed as first-class feature |
| RoCE/BTH compatibility | Payload-agnostic; interworking not specified | Can be co-designed with RoCEv2 |
| Congestion control | Not in scope for BIER; requires additional work | Can integrate DCQCN natively |
| Scalability | Bitstring size limits; hierarchical domains add complexity | Designed for large GPU cluster scale |
| Deployment speed | Faster; builds on existing implementations | Slower; requires new implementations |
| Operational familiarity | Higher where BIER is deployed | Lower; new operational model |
| Standards risk | Lower; incremental extension | Higher; new charter/scope required |
| Architectural cleanliness | Lower; AI requirements are a poor fit for BIER assumptions | Higher; purpose-built for use case |
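The bitstring size limit in the Scalability row can be made concrete with the BFR-id mapping from RFC 8279, where a BFR-id maps to a Set Identifier (SI) and a bit position within that set. The sketch below counts how many packet copies an ingress would need when receivers span several sets, since a packet carries a single SI. The bitstring length of 256 and the 4096-endpoint group are illustrative values, and the function names are hypothetical.

```python
def bfr_id_to_si_bp(bfr_id: int, bsl: int) -> tuple[int, int]:
    """Map a BFR-id to (Set Identifier, 1-based bit position)."""
    if bfr_id < 1:
        raise ValueError("BFR-ids start at 1")
    return (bfr_id - 1) // bsl, (bfr_id - 1) % bsl + 1


def copies_needed(receiver_ids: list[int], bsl: int) -> int:
    """Number of packets the ingress must originate: one per distinct SI."""
    return len({bfr_id_to_si_bp(r, bsl)[0] for r in receiver_ids})


group = list(range(1, 4097))  # hypothetical group of 4096 endpoints
print(copies_needed(group, bsl=256))
```

Under these assumptions a single collective to all 4096 endpoints needs 16 ingress copies rather than one, which is the overhead that hierarchical or set-based schemes trade against larger headers.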
The authors observe that while extending BIER offers a faster path to a first implementation, the fundamental mismatch between BIER's stateless forwarding model and the stateful ACK aggregation requirement is a significant concern. ACK aggregation is not an optional feature; it is necessary for correctness at scale. Any extension to BIER that adds per-group ACK state effectively creates a new protocol that happens to reuse the BIER forwarding header.¶
The authors therefore lean toward defining a new protocol that is purpose-built for AI P2MP, while acknowledging that this position should be validated through working group discussion, particularly with input from operators who have deployed or are evaluating BIER in AI DC environments.¶
The following open issues represent challenges that any solution must address, regardless of whether BIER is extended or a new protocol is defined. They are not arguments for or against either approach, but rather constraints that any design must satisfy.¶
The aggregated ACK delivered to the sender's RDMA NIC must conform to RoCEv2 ACK semantics, including a valid BTH with opcode 0x11 (ACKNOWLEDGE) and a PSN representing the collective acknowledgment state of all receivers (typically the minimum PSN across all receivers to ensure no data is incorrectly freed).¶
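A minimal sketch of the minimum-PSN rule follows. Because the PSN is a 24-bit field that wraps, a plain numeric minimum is wrong near the wrap point, so the sketch measures each receiver's progress relative to the sender's oldest unacknowledged PSN. The function `aggregate_ack_psn`, its parameters and the receiver names are hypothetical; the sketch assumes every receiver is less than 2^24 packets ahead of `base_psn` and that no aggregate is released until every group member has reported.

```python
from typing import Optional

PSN_MOD = 1 << 24  # the BTH PSN field is 24 bits wide


def aggregate_ack_psn(
    base_psn: int, acked: dict[str, int], group: set[str]
) -> Optional[int]:
    """Return the PSN safe to acknowledge to the sender, or None if incomplete.

    base_psn: the sender's oldest unacknowledged PSN (assumed known).
    acked:    latest PSN acknowledged by each receiver seen so far.
    group:    the full set of receivers expected to acknowledge.
    """
    if not group:
        raise ValueError("empty receiver group")
    if group.difference(acked):
        return None  # at least one receiver has not acknowledged yet
    # Choose the receiver that has advanced least beyond base_psn,
    # using modular distance so wraparound is handled.
    return min(
        (acked[r] for r in group),
        key=lambda psn: (psn - base_psn) % PSN_MOD,
    )


group = {"r1", "r2", "r3"}
acked = {"r1": 0xFFFFFE, "r2": 0x000003, "r3": 0x000001}
print(hex(aggregate_ack_psn(0xFFFFF0, acked, group)))
```

In this example the slowest receiver is `r1`, which has not yet wrapped; a naive `min()` over the raw values would pick `r3` and acknowledge data that `r1` has not received.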
Candidate aggregation points include, but are not limited to:¶
Each option has different tradeoffs in terms of where intelligence is placed, whether host CPU is involved, and what hardware capabilities are required. Further analysis is needed.¶
Data center switches, without additional hardware capabilities, treat RoCE ACKs as normal UDP data payload and cannot distinguish them from RoCE data packets, since the BTH opcode is carried inside the UDP payload and is not visible to standard switch forwarding logic. Classification of ACK packets in the network requires either P4-capable ASICs with custom parsers or vendor-specific deep packet inspection features. This is a practical constraint on any in-network ACK aggregation mechanism.¶
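The parsing depth involved can be seen in the following sketch, which classifies a raw frame as a RoCEv2 ACK only after walking the Ethernet, IPv4 and UDP headers to reach the first byte of the BTH. It assumes an untagged Ethernet frame carrying unfragmented IPv4; VLAN tags, IPv6 and tunnel encapsulations are not handled, and `is_roce_ack` is an illustrative name rather than an existing API.

```python
import struct

ETH_LEN = 14
UDP_LEN = 8
BTH_LEN = 12
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17
ROCEV2_UDP_PORT = 4791
OPCODE_RC_ACK = 0x11  # RC ACKNOWLEDGE


def is_roce_ack(frame: bytes) -> bool:
    """Return True if the frame is an IPv4 RoCEv2 packet with the ACK opcode."""
    if len(frame) < ETH_LEN + 20 + UDP_LEN + BTH_LEN:
        return False
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        return False
    ihl = (frame[ETH_LEN] & 0x0F) * 4  # IPv4 header length in bytes
    if ihl < 20 or frame[ETH_LEN + 9] != IPPROTO_UDP:
        return False
    udp_off = ETH_LEN + ihl
    if len(frame) < udp_off + UDP_LEN + BTH_LEN:
        return False
    (dst_port,) = struct.unpack_from("!H", frame, udp_off + 2)
    if dst_port != ROCEV2_UDP_PORT:
        return False
    # The opcode is the first byte of the BTH, i.e. inside the UDP payload.
    return frame[udp_off + UDP_LEN] == OPCODE_RC_ACK
```

With a 20-byte IPv4 header the opcode sits at byte offset 42 of the frame, beyond the fields that standard L2-L4 forwarding lookups match on.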
Packet spraying is increasingly adopted in AI DC networks for load balancing (e.g., Nvidia Adaptive Routing, Broadcom DLB/GLB, China Mobile GSE). Under packet spraying, ACK paths cannot be assumed to be symmetric with data paths, which complicates in-network ACK aggregation at arbitrary points. The source ToR aggregation point is robust to packet spraying since all ACKs must converge there regardless of the spray path taken.¶
This document does not define a protocol and introduces no new security considerations beyond those already discussed in [RFC8279] and related RDMA/RoCEv2 specifications. Security considerations for any protocol defined based on this analysis should address spoofing of aggregated ACKs, which could cause a sender to incorrectly advance its send window.¶
This document has no IANA actions.¶
The authors thank Kefei Liu for valuable discussion on ACK identification and aggregation point selection on the mcast4ai mailing list.¶