BIER Extension or New Protocol for AI P2MP

Futurewei

mmcbride7@gmail.com

China Mobile

liuyisong@chinamobile.com

Huawei

monica.zhangli@huawei.com

Keywords: BIER multicast AI P2MP data center RoCE RDMA

AI workloads in data centers exhibit inherently point-to-multipoint (P2MP) communication patterns, particularly during collective operations such as AllReduce, AllGather and broadcast in distributed training. Unicast replication of these flows does not scale to large GPU clusters. This document analyzes two architectural approaches to addressing this problem: extending BIER (Bit Index Explicit Replication) to support AI P2MP requirements, or defining a new purpose-built protocol. The tradeoffs of each approach are discussed, including considerations around ACK aggregation, congestion control, RoCE/RDMA compatibility and operational complexity. This document does not define a protocol; it is intended to help the mcast4ai community evaluate this problem space.

Introduction Distributed AI training workloads generate communication patterns that are structurally point-to-multipoint (P2MP). During collective operations, a single sender must deliver identical data to a large number of GPU receivers simultaneously. At the scale of modern GPU clusters (hundreds to thousands of endpoints), replicating these flows via unicast imposes a significant and unnecessary burden on the network. BIER provides a stateless multicast forwarding architecture that has seen deployment in WAN and some DC environments. A natural question is whether BIER can be extended to satisfy the requirements of AI P2MP traffic, or whether the unique characteristics of AI workloads, including tight latency requirements, RDMA transport semantics, ACK aggregation and congestion control, necessitate a new protocol. This document explores both approaches. While IETF practice has historically favored extending existing protocols where feasible, there are cases where the requirements diverge so far from the design assumptions of an existing protocol that a new protocol is more appropriate. This document aims to provide an objective basis for that determination.

AI P2MP Requirements The following requirements are specific to AI data center P2MP and distinguish this problem space from traditional multicast use cases.

Collective Communication Patterns AI training collectives (AllReduce, AllGather, ReduceScatter, Broadcast) require simultaneous delivery of data to all members of a communication group. For dense model training, group membership is established at job initialization and remains stable throughout the training run. In Mixture of Experts (MoE) architectures, All-to-All collectives implement token-dependent expert routing where each token is dispatched to a small subset of experts (typically top-K of N total), meaning the active receiver set changes with every token or batch. The full set of expert endpoints is fixed for the duration of the job, but the per-packet routing destination is highly dynamic. Both cases include P2MP communication patterns. In MoE, while the All-to-All expert dispatch collective is many-to-many rather than P2MP, gradient synchronization across expert replicas uses AllReduce over fixed groups, benefiting from P2MP multicast. MoE's expert dispatch introduces the additional requirement that the forwarding mechanism support dynamic, per-token receiver subsets.

Latency and Jitter Sensitivity GPU training steps are synchronization points. Straggler effects, where a single slow receiver delays the entire collective, make tail latency a critical metric. The multicast mechanism should minimize added latency and jitter relative to optimal unicast paths.

RDMA and RoCE Compatibility AI workloads primarily rely on RDMA (Remote Direct Memory Access) today for high-throughput, low-latency data transfer. In Ethernet-based AI data centers, RoCEv2 (RDMA over Converged Ethernet) is the dominant transport. RoCEv2 uses the InfiniBand Base Transport Header (BTH) encapsulated in UDP. Any multicast solution should be compatible with this transport model, or should define a clear interworking point.

ACK Aggregation In unicast RDMA, each receiver generates an ACK per received packet or message. In P2MP multicast, N receivers generating individual ACKs back toward the sender creates an ACK implosion problem. This imposes load on both the network (N reverse unicast flows) and the sender's RDMA NIC (processing N individual ACKs). An ACK aggregation mechanism is required that collapses N ACKs into a single ACK before reaching the sender's RDMA NIC, while preserving RoCE ACK semantics including BTH opcode and PSN fields.
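The size of the ACK implosion can be estimated with simple arithmetic. The sketch below is illustrative only: the link rate, payload size, and receiver count are hypothetical values, header overhead is ignored, and one ACK per packet is taken as the worst case (receivers that coalesce ACKs would lower the figure through the `packets_per_ack` parameter, which is an assumption of this sketch).

```python
def ack_rate_at_sender(link_bps: float, payload_bytes: int, receivers: int,
                       packets_per_ack: int = 1) -> float:
    """Estimate ACKs per second arriving at the sender NIC.

    Assumes the sender transmits back-to-back packets at line rate and
    ignores all header overhead, so this is an upper-bound style estimate.
    """
    packets_per_second = link_bps / (payload_bytes * 8)
    return packets_per_second * receivers / packets_per_ack


# Hypothetical scenario: 400 Gb/s sender, 4096-byte payloads, 1024 receivers.
unaggregated = ack_rate_at_sender(400e9, 4096, receivers=1024)
# With ideal aggregation the sender sees the ACK stream of a single receiver.
aggregated = ack_rate_at_sender(400e9, 4096, receivers=1)

print(f"without aggregation: {unaggregated:.3e} ACK/s")
print(f"with aggregation:    {aggregated:.3e} ACK/s")
```

Under these assumptions the unaggregated ACK rate grows linearly with N, which is the load that the aggregation requirement is meant to remove.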

Congestion Control RoCEv2 environments typically rely on Priority Flow Control (PFC) and DCQCN for congestion management. A multicast solution should either integrate with these mechanisms or define equivalent behavior. Packet spraying, increasingly adopted for load balancing (e.g., Nvidia Adaptive Routing, Broadcom DLB/GLB), complicates per-flow congestion signaling because the packets of a single flow are distributed across multiple paths.

Scalability GPU clusters in large AI data centers may contain thousands of endpoints organized into multiple communication groups running simultaneously. The multicast solution should scale in terms of group state, replication overhead and control plane complexity.

Extending BIER

BIER Overview BIER encodes the set of intended receivers as a bitstring in the packet header. Each router in the BIER domain forwards and replicates packets based on the bitstring without maintaining per-flow or per-group state. This stateless property is a key operational advantage. The draft "Optimized BIER for AI Data Center Environments" describes optimizations to BIER targeting AI data center environments, including mechanisms for ACK aggregation within the BIER forwarding plane.
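The bitstring-driven replication described above can be sketched as follows. This is a simplified illustration in the spirit of the BIER forwarding procedure, not an implementation of it: the table layout, the neighbor names, and the function `bier_forward` are hypothetical, and local delivery at an egress router, set identifiers, and encapsulation are omitted. Each table entry maps a bit position to a neighbor and a forwarding bit mask (F-BM) covering every receiver reached through that neighbor.

```python
def bier_forward(bitstring: int, bift: dict[int, tuple[str, int]]) -> dict[str, int]:
    """Return the bitstring to send to each neighbor for one incoming packet.

    bift maps a 1-based bit position to (neighbor, forwarding bit mask).
    No per-group state is consulted: only the packet's bitstring and the table.
    """
    copies: dict[str, int] = {}
    remaining = bitstring
    while remaining:
        lowest = remaining & -remaining      # isolate the lowest set bit
        position = lowest.bit_length()       # 1-based bit position
        entry = bift.get(position)
        if entry is None:
            remaining &= ~lowest             # unreachable receiver: skip it
            continue
        neighbor, fbm = entry
        copies[neighbor] = remaining & fbm   # receivers served via this neighbor
        remaining &= ~(fbm | lowest)         # never deliver to them twice
    return copies


# Hypothetical topology: receivers 1-2 behind neighbor "A", 3-4 behind "B".
table = {1: ("A", 0b0011), 2: ("A", 0b0011), 3: ("B", 0b1100), 4: ("B", 0b1100)}
result = bier_forward(0b1011, table)         # packet addressed to receivers 1, 2, 4
print({nbr: bin(bits) for nbr, bits in result.items()})
```

In this example one copy goes to each neighbor, and each copy carries only the bits for receivers reachable through that neighbor, which is what prevents duplicate delivery.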

Arguments for Extending BIER

BIER is an existing IETF standard with implementations in multiple vendor platforms, reducing time to deployment.
BIER's stateless forwarding model aligns well with AI P2MP groups whose membership is known at the source, as is the case in dense model training, avoiding the need for network-wide group state. MoE workloads introduce additional complexity, as the subset of active expert endpoints varies per training step, even though the full set of experts is fixed for the duration of the job.
IETF practice favors reuse of existing protocols where requirements can be met, reducing fragmentation of the standards landscape.
Extending BIER leverages existing operational familiarity and tooling in networks where BIER is already deployed.
The BIER header can carry arbitrary payloads, including RoCEv2/BTH, without modification to the forwarding plane.

Limitations of Extending BIER

BIER was designed for IP multicast replacement and does not natively address ACK aggregation, which is a fundamental requirement for AI P2MP. Adding ACK aggregation to BIER requires significant new machinery that may be architecturally inconsistent with BIER's stateless forwarding model.
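The BIER bitstring length limits the number of BFERs (Bit-Forwarding Egress Routers, here the receiving endpoints) that a single packet can address in a BIER sub-domain. Large GPU clusters may require multiple sets or hierarchical BIER domains, adding complexity.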
BIER does not define congestion control. Layering RoCEv2 congestion semantics over BIER requires additional specification work.
BIER is a forwarding plane mechanism. The control plane (BGP-BIER, OSPF-BIER) was not designed with the dynamic, job-scoped group lifecycle typical of AI training workloads, where groups may be set up and torn down thousands of times per day in a busy cluster.
ACK aggregation in a BIER context requires nodes to maintain per-group ACK state, which contradicts BIER's stateless design principle.
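To make the bitstring size limitation above concrete, the following sketch maps receiver identifiers (BFR-ids) onto set identifiers and bit positions using the standard BIER rule: the set identifier is the integer part of (BFR-id - 1) / BitStringLength, and the bit position is ((BFR-id - 1) mod BitStringLength) + 1. The function name and the bitstring length of 256 are illustrative assumptions; the sender must emit one packet copy per set identifier that has at least one receiver.

```python
from collections import defaultdict
from collections.abc import Iterable


def encode_receivers(bfr_ids: Iterable[int], bitstring_length: int = 256) -> dict[int, int]:
    """Group BFR-ids into {set identifier: bitstring}.

    Bit position p (1-based) is stored as bit (p - 1) of a Python int.
    """
    sets: defaultdict[int, int] = defaultdict(int)
    for bfr_id in bfr_ids:
        if bfr_id < 1:
            raise ValueError("BFR-ids start at 1")
        set_id, offset = divmod(bfr_id - 1, bitstring_length)
        sets[set_id] |= 1 << offset
    return dict(sets)


# Hypothetical job: 1024 receivers numbered 1..1024, 256-bit bitstrings.
encoded = encode_receivers(range(1, 1025))
print(f"packet copies needed at the sender: {len(encoded)}")
```

With these illustrative numbers the sender emits four copies of each packet instead of one, so replication savings shrink as the receiver count grows relative to the bitstring length.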

Defining a New Protocol

Arguments for a New Protocol

A purpose-built protocol can be designed from the ground up to satisfy AI P2MP requirements including ACK aggregation, RoCE/BTH compatibility, and congestion control integration, without being constrained by the design assumptions of BIER.
AI data center networks are a distinct environment from the WAN and campus networks where BIER was originally targeted. The scale, traffic patterns, and transport semantics differ significantly enough to justify a purpose-built solution.
A new protocol can define a clean ACK aggregation mechanism that is first-class rather than bolted on, with well-defined semantics for BTH PSN handling and aggregation point placement.
A new protocol can be co-designed with the RoCEv2 transport model, defining explicit interworking points (e.g., source ToR, smart NIC/DPU) for ACK aggregation without ambiguity.
Avoiding overloading BIER with AI-specific extensions preserves the clarity and deployability of BIER in its existing use cases.

Limitations of a New Protocol

A new protocol requires new implementations across the entire stack (switch ASICs, NICs, and control plane software), increasing time to deployment and adoption risk.
A new protocol requires a new IETF working group charter or significant expansion of an existing charter, adding process overhead.
Operators may be reluctant to deploy a new protocol in production networks without a substantial track record.
There is a risk of fragmentation if multiple vendors define proprietary solutions in the absence of a timely standard.

Comparative Analysis The following table summarizes the tradeoffs across key dimensions.

Dimension	Extend BIER	New Protocol
ACK aggregation	Requires significant extension; conflicts with stateless model	Can be designed as first-class feature
RoCE/BTH compatibility	Payload-agnostic; interworking not specified	Can be co-designed with RoCEv2
Congestion control	Not in scope for BIER; requires additional work	Can integrate DCQCN natively
Scalability	Bitstring size limits; hierarchical domains add complexity	Designed for large GPU cluster scale
Deployment speed	Faster; builds on existing implementations	Slower; requires new implementations
Operational familiarity	Higher where BIER is deployed	Lower; new operational model
Standards risk	Lower; incremental extension	Higher; new charter/scope required
Architectural cleanliness	Lower; AI requirements are a poor fit for BIER assumptions	Higher; purpose-built for use case

The authors observe that while extending BIER offers a faster path to a first implementation, the fundamental mismatch between BIER's stateless forwarding model and the stateful ACK aggregation requirement is a significant concern. ACK aggregation is not an optional feature; it is necessary for correctness at scale. Any extension to BIER that adds per-group ACK state effectively creates a new protocol that happens to reuse the BIER forwarding header. The authors therefore lean toward defining a new protocol that is purpose-built for AI P2MP, while acknowledging that this position should be validated through working group discussion, particularly with input from operators who have deployed or are evaluating BIER in AI DC environments.

Open Issues The following open issues represent challenges that any solution must address, regardless of whether BIER is extended or a new protocol is defined. They are not arguments for or against either approach, but rather constraints that any design must satisfy.

ACK Aggregation Point The aggregated ACK delivered to the sender's RDMA NIC must conform to RoCEv2 ACK semantics, including a valid BTH with opcode 0x11 (ACKNOWLEDGE) and a PSN representing the collective acknowledgment state of all receivers (typically the minimum PSN across all receivers to ensure no data is incorrectly freed). Candidate aggregation points include, but are not limited to:

The source ToR switch, which is a natural convergence point for returning ACKs under packet spraying, as all receiver ACKs are destined for the source host and must traverse the source ToR. This requires the ToR to implement BTH-aware ACK aggregation, which is not supported by standard merchant silicon today.
A software agent on the aggregation source host, which intercepts returning ACKs in software before they reach the RDMA stack. This reintroduces CPU involvement and may compromise the kernel-bypass properties of RDMA.
A smart NIC or DPU (Data Processing Unit) on the aggregation source host, which performs BTH-aware ACK aggregation in NIC hardware without involving the host CPU. This preserves kernel-bypass semantics and is likely the most architecturally clean option, but requires DPU programmability.

Each option has different tradeoffs in terms of where intelligence is placed, whether host CPU is involved, and what hardware capabilities are required. Further analysis is needed.

BTH Classification in the Network Data center switches, without additional hardware capabilities, treat RoCE ACKs as normal UDP data payload and cannot distinguish them from RoCE data packets, since the BTH opcode is carried inside the UDP payload and is not visible to standard switch forwarding logic. Classification of ACK packets in the network requires either P4-capable ASICs with custom parsers or vendor-specific deep packet inspection features. This is a practical constraint on any in-network ACK aggregation mechanism.

Interaction with Packet Spraying Packet spraying is increasingly adopted in AI DC networks for load balancing (e.g., Nvidia Adaptive Routing, Broadcom DLB/GLB, China Mobile GSE). Under packet spraying, ACK paths cannot be assumed to be symmetric with data paths, which complicates in-network ACK aggregation at arbitrary points. The source ToR aggregation point is robust to packet spraying since all ACKs must converge there regardless of the spray path taken.

Security Considerations This document does not define a protocol and introduces no new security considerations beyond those already discussed in "Multicast Using Bit Index Explicit Replication (BIER)" and related RDMA/RoCEv2 specifications. Security considerations for any protocol defined based on this analysis should address spoofing of aggregated ACKs, which could cause a sender to incorrectly advance its send window.

IANA Considerations This document has no IANA actions.

References

Informative References

Multicast Using Bit Index Explicit Replication (BIER)
Optimized BIER for AI Data Center Environments

Acknowledgements The authors thank Kefei Liu for valuable discussion on ACK identification and aggregation point selection on the mcast4ai mailing list.