Internet-Draft BIER Extension or New Protocol for AI P2 July 2026
McBride, et al. Expires 2 January 2027
Workgroup: Network Working Group
Internet-Draft: draft-mcbride-mcast4ai-bier-or-new-protocol-00
Published: July 2026
Intended Status: Informational
Expires: 2 January 2027
Authors: M. McBride (Futurewei), Y. Liu (China Mobile), M. Zhangli (Huawei)

BIER Extension or New Protocol for AI P2MP

Abstract

AI workloads in data centers exhibit inherently point-to-multipoint (P2MP) communication patterns, particularly during collective operations such as AllReduce, AllGather and broadcast in distributed training. Unicast replication of these flows does not scale to large GPU clusters. This document analyzes two architectural approaches to addressing this problem: extending BIER (Bit Index Explicit Replication) to support AI P2MP requirements, or defining a new purpose-built protocol. The tradeoffs of each approach are discussed, including considerations around ACK aggregation, congestion control, RoCE/RDMA compatibility and operational complexity. This document does not define a protocol but is intended instead to help the mcast4ai community evaluate this problem space.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 2 January 2027.

1. Introduction

Distributed AI training workloads generate communication patterns that are structurally point-to-multipoint (P2MP). During collective operations, a single sender must deliver identical data to a large number of GPU receivers simultaneously. At the scale of modern GPU clusters (hundreds to thousands of endpoints), replicating these flows via unicast imposes a significant and unnecessary burden on the network.

BIER [RFC8279] provides a stateless multicast forwarding architecture that has seen deployment in WAN and some DC environments. A natural question is whether BIER can be extended to satisfy the requirements of AI P2MP traffic, or whether the unique characteristics of AI workloads, including tight latency requirements, RDMA transport semantics, ACK aggregation and congestion control, necessitate a new protocol. This document explores both approaches.

While IETF practice has historically favored extending existing protocols where feasible, there are cases where the requirements diverge sufficiently from the design assumptions of an existing protocol that a new protocol is more appropriate. This document aims to provide an objective basis for that determination.

2. AI P2MP Requirements

The following requirements are specific to AI data center P2MP and distinguish this problem space from traditional multicast use cases.

2.1. Collective Communication Patterns

AI training collectives (AllReduce, AllGather, ReduceScatter, Broadcast) require simultaneous delivery of data to all members of a communication group. For dense model training, group membership is established at job initialization and remains stable throughout the training run. Mixture of Experts (MoE) architectures behave differently: All-to-All collectives implement token-dependent expert routing, in which each token is dispatched to a small subset of experts (typically the top K of N total), so the active receiver set changes with every token or batch. The full set of expert endpoints is fixed for the duration of the job, but the per-packet destination set is highly dynamic. Both cases include P2MP communication patterns. In MoE, the All-to-All expert dispatch is many-to-many rather than P2MP, but gradient synchronization across expert replicas uses AllReduce over fixed groups and therefore benefits from P2MP multicast. Expert dispatch adds the further requirement that the forwarding mechanism support dynamic, per-token receiver subsets.
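As a rough illustration of the per-token receiver subsets described above, the following Python sketch selects the top-K experts for each token and encodes the resulting receiver set as an integer bitmask. The function names, the score values, and the bitmask encoding are assumptions made for illustration; they are not taken from any MoE framework or from a protocol defined here.

```python
import heapq

def top_k_experts(scores, k):
    """Return the indices of the k highest-scoring experts, sorted ascending."""
    best = heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])
    return sorted(best)

def receiver_bitmask(expert_ids):
    """Encode a set of expert indices as an integer bitmask (bit i = expert i)."""
    mask = 0
    for i in expert_ids:
        mask |= 1 << i
    return mask

# Hypothetical router scores for three tokens over N = 8 experts.
token_scores = [
    [0.1, 0.7, 0.0, 0.2, 0.9, 0.3, 0.0, 0.1],
    [0.8, 0.1, 0.6, 0.0, 0.1, 0.0, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.1, 0.0, 0.5, 0.1, 0.9],
]

# With K = 2, every token ends up with a different receiver set.
masks = [receiver_bitmask(top_k_experts(s, 2)) for s in token_scores]
for token, mask in enumerate(masks):
    print(f"token {token}: receivers {mask:08b}")
```

The expert set (8 endpoints) never changes, yet each token yields a different mask, which is the property a forwarding mechanism would have to accommodate.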

2.2. Latency and Jitter Sensitivity

GPU training steps are synchronization points. Straggler effects, where a single slow receiver delays the entire collective, make tail latency a critical metric. The multicast mechanism should minimize added latency and jitter relative to optimal unicast paths.

2.3. RDMA and RoCE Compatibility

AI workloads today rely primarily on RDMA (Remote Direct Memory Access) for high-throughput, low-latency data transfer. In Ethernet-based AI data centers, RoCEv2 (RDMA over Converged Ethernet version 2) is the dominant transport. RoCEv2 carries the InfiniBand Base Transport Header (BTH) encapsulated in UDP. Any multicast solution should be compatible with this transport model, or should define a clear interworking point.
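To make the transport model concrete, the sketch below parses the 12-byte BTH found at the start of a RoCEv2 UDP payload. The field layout follows the InfiniBand BTH as the editor understands it (opcode, flags, P_Key, 24-bit destination QP, AckReq bit, 24-bit PSN); the names `Bth` and `parse_bth` and the sample values are illustrative assumptions, and the sketch ignores extended headers and the ICRC.

```python
import struct
from collections import namedtuple

ROCEV2_UDP_PORT = 4791  # UDP destination port registered for RoCEv2

Bth = namedtuple("Bth", "opcode pkey dest_qp ack_req psn")

def parse_bth(udp_payload):
    """Parse the 12-byte BTH at the start of a RoCEv2 UDP payload."""
    if len(udp_payload) < 12:
        raise ValueError("payload shorter than a BTH")
    opcode, _flags, pkey, qp_word, psn_word = struct.unpack(
        "!BBHII", udp_payload[:12]
    )
    return Bth(
        opcode=opcode,
        pkey=pkey,
        dest_qp=qp_word & 0xFFFFFF,    # low 24 bits; top byte is not part of the QP
        ack_req=bool(psn_word >> 31),  # A bit: sender requests an acknowledgment
        psn=psn_word & 0xFFFFFF,       # 24-bit packet sequence number
    )

# Hypothetical header: opcode 0x04 (RC SEND Only), QP 0x12, AckReq set, PSN 0xA5.
sample = struct.pack("!BBHII", 0x04, 0x00, 0xFFFF, 0x000012, (1 << 31) | 0x0000A5)
bth = parse_bth(sample)
print(bth)
```

A multicast solution that claims RoCEv2 compatibility would need to leave these fields meaningful to the receiving RDMA NIC, or translate them at a defined interworking point.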

2.4. ACK Aggregation

In unicast RDMA, each receiver generates an ACK per received packet or message. In P2MP multicast, N receivers generating individual ACKs back toward the sender creates an ACK implosion problem. This imposes load on both the network (N reverse unicast flows) and the sender's RDMA NIC (processing N individual ACKs). An ACK aggregation mechanism is required that collapses N ACKs into a single ACK before reaching the sender's RDMA NIC, while preserving RoCE ACK semantics including BTH opcode and PSN fields.
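A minimal sketch of the aggregation idea, assuming a fixed receiver group and one ACK per receiver per PSN: the aggregator absorbs individual ACKs and releases a single ACK only once every receiver has acknowledged that PSN. The class name `AckAggregator` and its interface are hypothetical, and the sketch deliberately ignores cumulative ACKs, NAKs, timeouts, and PSN wraparound.

```python
class AckAggregator:
    """Collapse per-receiver ACKs into one ACK per PSN for a fixed receiver group."""

    def __init__(self, receivers):
        self.receivers = frozenset(receivers)
        self.pending = {}  # psn -> set of receivers that have not yet acknowledged

    def on_ack(self, receiver, psn):
        """Record one receiver's ACK; return psn once all receivers have acked it."""
        if receiver not in self.receivers:
            raise KeyError(f"unknown receiver {receiver!r}")
        waiting = self.pending.setdefault(psn, set(self.receivers))
        waiting.discard(receiver)
        if waiting:
            return None       # still waiting on at least one receiver
        del self.pending[psn]
        return psn            # forward a single aggregated ACK toward the sender

agg = AckAggregator(["gpu1", "gpu2", "gpu3"])
forwarded = [agg.on_ack(r, 100) for r in ["gpu1", "gpu3", "gpu2"]]
print(forwarded)
```

Note that the `pending` dictionary is per-group, per-PSN state held at the aggregation point, which is exactly the kind of state a stateless forwarding plane does not otherwise keep.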

2.5. Congestion Control

RoCEv2 environments typically rely on Priority Flow Control (PFC) and Data Center Quantized Congestion Notification (DCQCN) for congestion management. A multicast solution should either integrate with these mechanisms or define equivalent behavior. Packet spraying, increasingly adopted for load balancing (e.g., Nvidia Adaptive Routing, Broadcom DLB/GLB), complicates the interpretation of per-flow congestion signals.

2.6. Scalability

GPU clusters in large AI data centers may contain thousands of endpoints organized into multiple communication groups running simultaneously. The multicast solution should scale in terms of group state, replication overhead and control plane complexity.

3. Extending BIER

3.1. BIER Overview

BIER [RFC8279] encodes the set of intended receivers as a bitstring in the packet header. Each router in the BIER domain forwards and replicates packets based on the bitstring without maintaining per-flow or per-group state. This stateless property is a key operational advantage.
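The forwarding behavior described above can be sketched as follows. This is a simplified rendering of the per-packet procedure in RFC 8279, assuming a single sub-domain and set identifier, a bitstring held as a Python integer (bit 1 is the least significant bit), and a hypothetical forwarding table `bift` that maps each bit position to a forwarding bit mask (F-BM) and a neighbor name.

```python
def bier_forward(bitstring, bift, local_bit=None):
    """Return (deliver_locally, [(neighbor, bitstring_for_copy), ...])."""
    deliver_locally = False
    copies = []
    while bitstring:
        lowest = bitstring & -bitstring   # isolate the lowest set bit
        k = lowest.bit_length()           # 1-based position of that bit
        if k == local_bit:
            deliver_locally = True        # this router is itself a receiver
            bitstring &= ~lowest
            continue
        entry = bift.get(k)
        if entry is None:
            bitstring &= ~lowest          # no route for this bit: drop it
            continue
        f_bm, neighbor = entry
        # The copy keeps only the bits reachable through this neighbor.
        copies.append((neighbor, bitstring & f_bm))
        # Clear those bits so that no receiver is served twice.
        bitstring &= ~(f_bm | lowest)
    return deliver_locally, copies

# Hypothetical table: bits 1-2 via neighbor "A", bit 3 via "B", bit 4 is local.
bift = {1: (0b0011, "A"), 2: (0b0011, "A"), 3: (0b0100, "B")}
result = bier_forward(0b1111, bift, local_bit=4)
print(result)
```

The table is indexed by receiver bit, not by flow or group, so its size depends on the number of egress routers rather than on the number of active collectives.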

[I-D.zzhang-bier-optimized] describes optimizations to BIER targeting AI data center environments, including mechanisms for ACK aggregation within the BIER forwarding plane.

3.2. Arguments for Extending BIER

3.3. Limitations of Extending BIER

4. Defining a New Protocol

4.1. Arguments for a New Protocol

4.2. Limitations of a New Protocol

5. Comparative Analysis

The following table summarizes the tradeoffs across key dimensions.

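Table 1

+---------------------------+----------------------------------+--------------------------------+
| Dimension                 | Extend BIER                      | New Protocol                   |
+===========================+==================================+================================+
| ACK aggregation           | Requires significant extension;  | Can be designed as first-class |
|                           | conflicts with stateless model   | feature                        |
+---------------------------+----------------------------------+--------------------------------+
| RoCE/BTH compatibility    | Payload-agnostic; interworking   | Can be co-designed with RoCEv2 |
|                           | not specified                    |                                |
+---------------------------+----------------------------------+--------------------------------+
| Congestion control        | Not in scope for BIER; requires  | Can integrate DCQCN natively   |
|                           | additional work                  |                                |
+---------------------------+----------------------------------+--------------------------------+
| Scalability               | Bitstring size limits;           | Designed for large GPU cluster |
|                           | hierarchical domains add         | scale                          |
|                           | complexity                       |                                |
+---------------------------+----------------------------------+--------------------------------+
| Deployment speed          | Faster; builds on existing       | Slower; requires new           |
|                           | implementations                  | implementations                |
+---------------------------+----------------------------------+--------------------------------+
| Operational familiarity   | Higher where BIER is deployed    | Lower; new operational model   |
+---------------------------+----------------------------------+--------------------------------+
| Standards risk            | Lower; incremental extension     | Higher; new charter/scope      |
|                           |                                  | required                       |
+---------------------------+----------------------------------+--------------------------------+
| Architectural cleanliness | Lower; AI requirements are a     | Higher; purpose-built for use  |
|                           | poor fit for BIER assumptions    | case                           |
+---------------------------+----------------------------------+--------------------------------+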
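The bitstring size limit noted under Scalability can be illustrated with a short calculation. In RFC 8279, a receiver's BFR-id maps to a set identifier (SI) and a bit position within a bitstring of fixed length, and receivers in different sets cannot share one packet. The sketch below assumes that mapping; the function names and the example cluster sizes are illustrative only.

```python
def si_and_bit(bfr_id, bsl):
    """Map a BFR-id to (set identifier, 1-based bit position) for bitstring length bsl."""
    if not 1 <= bfr_id <= 65535:
        raise ValueError("BFR-id must be in the range 1..65535")
    return (bfr_id - 1) // bsl, (bfr_id - 1) % bsl + 1

def copies_at_ingress(bfr_ids, bsl):
    """Number of distinct sets touched, i.e. packet copies the ingress must send."""
    return len({si_and_bit(b, bsl)[0] for b in bfr_ids})

# Hypothetical group of 1024 receivers numbered 1..1024.
group = range(1, 1025)
for bsl in (256, 1024, 4096):
    print(f"bitstring length {bsl}: {copies_at_ingress(group, bsl)} copies")
```

With a 256-bit bitstring the ingress sends four copies for this group, while a 1024-bit or longer bitstring needs only one, at the cost of a larger per-packet header.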

The authors observe that while extending BIER offers a faster path to a first implementation, the fundamental mismatch between BIER's stateless forwarding model and the stateful ACK aggregation requirement is a significant concern. ACK aggregation is not an optional feature; it is necessary for correctness at scale. Any extension to BIER that adds per-group ACK state effectively creates a new protocol that happens to reuse the BIER forwarding header.

The authors therefore lean toward defining a new protocol that is purpose-built for AI P2MP, while acknowledging that this position should be validated through working group discussion, particularly from operators who have deployed or are evaluating BIER in AI DC environments.

6. Open Issues

The following open issues represent challenges that any solution must address, regardless of whether BIER is extended or a new protocol is defined. They are not arguments for or against either approach, but rather constraints that any design must satisfy.

6.1. ACK Aggregation Point

The aggregated ACK delivered to the sender's RDMA NIC must conform to RoCEv2 ACK semantics, including a valid BTH with opcode 0x11 (ACKNOWLEDGE) and a PSN representing the collective acknowledgment state of all receivers (typically the minimum PSN across all receivers to ensure no data is incorrectly freed).
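A minimal sketch of the PSN selection described above, assuming each receiver reports the highest PSN it has acknowledged and that all receivers are within half the 24-bit PSN space of a known base PSN. Because PSNs wrap, the "minimum" is taken in circular order relative to that base. The helper names and the default P_Key value are assumptions, and the sketch encodes only the BTH; a complete RoCEv2 ACK carries further fields after it.

```python
import struct

PSN_MOD = 1 << 24        # PSNs are 24 bits wide and wrap around
RC_ACKNOWLEDGE = 0x11    # BTH opcode named in the text above

def min_acked_psn(base_psn, acked_psns):
    """Return the PSN least far ahead of base_psn in circular 24-bit order."""
    return min(acked_psns, key=lambda p: (p - base_psn) % PSN_MOD)

def build_ack_bth(dest_qp, psn, pkey=0xFFFF):
    """Encode a 12-byte BTH for an aggregated ACK (headers after the BTH omitted)."""
    return struct.pack(
        "!BBHII", RC_ACKNOWLEDGE, 0, pkey, dest_qp & 0xFFFFFF, psn & 0xFFFFFF
    )

# Three receivers near the wrap point; 0x000005 is ahead of the others, not behind.
base = 0xFFFFF0
acked = [0xFFFFFA, 0x000005, 0xFFFFF8]
agg_psn = min_acked_psn(base, acked)
header = build_ack_bth(dest_qp=0x12, psn=agg_psn)
print(hex(agg_psn), header.hex())
```

A plain numeric minimum would pick 0x000005 here and acknowledge data that two receivers have not yet confirmed, which is the failure the text warns about.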

Candidate aggregation points include, but are not limited to:

Each option has different tradeoffs in terms of where intelligence is placed, whether host CPU is involved, and what hardware capabilities are required. Further analysis is needed.

6.2. BTH Classification in the Network

Data center switches, without additional hardware capabilities, treat RoCE ACKs as normal UDP data payload and cannot distinguish them from RoCE data packets, since the BTH opcode is carried inside the UDP payload and is not visible to standard switch forwarding logic. Classification of ACK packets in the network requires either P4-capable ASICs with custom parsers or vendor-specific deep packet inspection features. This is a practical constraint on any in-network ACK aggregation mechanism.
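The classification step can be sketched in software as follows, assuming the parser already has the UDP destination port and payload in hand. The point of the sketch is that the deciding byte sits inside the UDP payload, one layer deeper than a conventional 5-tuple match. The function name `classify` and the return labels are hypothetical, and only the reliable-connection ACKNOWLEDGE opcode named earlier in this document is checked.

```python
ROCEV2_UDP_PORT = 4791   # UDP destination port registered for RoCEv2
RC_ACKNOWLEDGE = 0x11    # BTH opcode for ACKNOWLEDGE

def classify(udp_dst_port, udp_payload):
    """Return 'roce-ack', 'roce-other', or 'not-roce' for one UDP datagram."""
    if udp_dst_port != ROCEV2_UDP_PORT or len(udp_payload) < 12:
        return "not-roce"
    # The opcode is the first byte of the BTH, i.e. the first byte after the UDP header.
    if udp_payload[0] == RC_ACKNOWLEDGE:
        return "roce-ack"
    return "roce-other"

ack = bytes([0x11]) + bytes(11)    # hypothetical 12-byte BTH with the ACK opcode
data = bytes([0x04]) + bytes(11)   # hypothetical 12-byte BTH with a SEND opcode
print(classify(4791, ack), classify(4791, data), classify(53, ack))
```

Both sample datagrams share the same UDP destination port, so a match on outer headers alone would place them in the same class.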

6.3. Interaction with Packet Spraying

Packet spraying is increasingly adopted in AI DC networks for load balancing (e.g., Nvidia Adaptive Routing, Broadcom DLB/GLB, China Mobile GSE). Under packet spraying, ACK paths cannot be assumed to be symmetric with data paths, which complicates in-network ACK aggregation at arbitrary points. The source ToR aggregation point is robust to packet spraying since all ACKs must converge there regardless of the spray path taken.

7. Security Considerations

This document does not define a protocol and introduces no new security considerations beyond those already discussed in [RFC8279] and related RDMA/RoCEv2 specifications. Security considerations for any protocol defined based on this analysis should address spoofing of aggregated ACKs, which could cause a sender to incorrectly advance its send window.

8. IANA Considerations

This document has no IANA actions.

9. References

9.1. Informative References

[I-D.zzhang-bier-optimized]
Zhang, Z., "Optimized BIER for AI Data Center Environments", Work in Progress, Internet-Draft, draft-zzhang-bier-optimized, <https://datatracker.ietf.org/doc/html/draft-zzhang-bier-optimized>.
[RFC8279]
Wijnands, IJ., Rosen, E., Dolganow, A., Przygienda, T., and S. Aldrin, "Multicast Using Bit Index Explicit Replication (BIER)", RFC 8279, DOI 10.17487/RFC8279, November 2017, <https://www.rfc-editor.org/rfc/rfc8279>.

Acknowledgements

The authors thank Kefei Liu for valuable discussion on ACK identification and aggregation point selection on the mcast4ai mailing list.

Authors' Addresses

Michael McBride
Futurewei
Yisong Liu
China Mobile
Monica Zhangli
Huawei