﻿<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>

<rfc xmlns:xi="http://www.w3.org/2001/XInclude"
     category="info"
     docName="draft-mcbride-mcast4ai-bier-or-new-protocol-00"
     ipr="trust200902"
     submissionType="IETF"
     xml:lang="en"
     tocInclude="true"
     tocDepth="4"
     symRefs="true"
     sortRefs="true"
     version="3">

  <front>
    <title abbrev="BIER Extension or New Protocol for AI P2MP">BIER Extension or New Protocol for AI P2MP</title>

    <seriesInfo name="Internet-Draft"
                value="draft-mcbride-mcast4ai-bier-or-new-protocol-00"/>

    <author fullname="Michael McBride" initials="M." surname="McBride">
      <organization>Futurewei</organization>
      <address>
        <email>mmcbride7@gmail.com</email>
      </address>
    </author>
    
     <author fullname="Yisong Liu" initials="Y." surname="Liu" >
      <organization>China Mobile</organization>
      <address>
        <email>liuyisong@chinamobile.com</email>
      </address>
    </author>
    
     <author fullname="Monica Zhangli" initials="M." surname="Zhangli" >
      <organization>Huawei</organization>
      <address>
        <email>monica.zhangli@huawei.com</email>
      </address>
    </author>    

    <date year="2026"/>

    <keyword>BIER</keyword>
    <keyword>multicast</keyword>
    <keyword>AI</keyword>
    <keyword>P2MP</keyword>
    <keyword>data center</keyword>
    <keyword>RoCE</keyword>
    <keyword>RDMA</keyword>

    <abstract>
      <t>
        AI workloads in data centers exhibit inherently point-to-multipoint
        (P2MP) communication patterns, particularly during collective operations
        such as AllReduce, AllGather and broadcast in distributed training.
        Unicast replication of these flows does not scale to large GPU clusters.
        This document analyzes two architectural approaches to addressing this
        problem: extending BIER (Bit Index Explicit Replication) to support AI
        P2MP requirements, or defining a new purpose-built protocol. The
        tradeoffs of each approach are discussed, including considerations around
        ACK aggregation, congestion control, RoCE/RDMA compatibility and
        operational complexity. This document does not define a protocol but is
        intended instead to help the mcast4ai community evaluate this problem space.
      </t>
    </abstract>

  </front>

  <middle>

    <section numbered="true" toc="default">
      <name>Introduction</name>
      <t>
        Distributed AI training workloads generate communication patterns that
        are structurally point-to-multipoint (P2MP). During collective
        operations, a single sender must deliver identical data to a large
        number of GPU receivers simultaneously. At the scale of modern GPU
        clusters (hundreds to thousands of endpoints), replicating these flows
        via unicast imposes a significant and unnecessary burden on the network.
      </t>
      <t>
        BIER <xref target="RFC8279"/> provides a stateless multicast forwarding
        architecture that has seen deployment in WAN and some DC environments.
        A natural question is whether BIER can be extended to satisfy the
        requirements of AI P2MP traffic, or whether the unique characteristics
        of AI workloads, including tight latency requirements, RDMA transport
        semantics, ACK aggregation and congestion control, necessitate a new
        protocol. This document explores both approaches.
      </t>
      <t>
        While IETF practice has historically favored extending existing protocols 
        where feasible, there are cases where the requirements diverge sufficiently 
        from the design assumptions of an existing protocol that a new protocol is more 
        appropriate. This document aims to provide an objective basis for that determination.
      </t>
    </section>

    <section numbered="true" toc="default">
      <name>AI P2MP Requirements</name>
      <t>
        The following requirements are specific to AI data center P2MP and
        distinguish this problem space from traditional multicast use cases.
      </t>

      <section numbered="true" toc="default">
        <name>Collective Communication Patterns</name>
        <t>
          AI training collectives (AllReduce, AllGather, ReduceScatter, Broadcast)
          require simultaneous delivery of data to all members of a communication
          group. For dense model training, group membership is established
          at job initialization and remains stable throughout the training run.
          In Mixture of Experts (MoE) architectures, All-to-All collectives
          implement token-dependent expert routing: each token is dispatched to a
          small subset of experts (typically the top K of N total), so the active
          receiver set changes with every token or batch. The full set of expert
          endpoints is fixed for the duration of the job, but the per-packet
          destination set is highly dynamic. Both cases include P2MP
          communication patterns. Although the MoE All-to-All expert dispatch is
          many-to-many rather than P2MP, gradient synchronization across expert
          replicas uses AllReduce over fixed groups and benefits from P2MP
          multicast. Expert dispatch adds the requirement that the forwarding
          mechanism support dynamic, per-token receiver subsets.
        </t>
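        <t>
          As a minimal sketch of the dynamic receiver-subset requirement, the
          following Python fragment derives a per-token receiver bitmask from a
          top-K expert selection. The function names, the gating scores and the
          one-bit-per-expert mapping are illustrative assumptions and are not
          taken from any specification.
        </t>
        <sourcecode type="python"><![CDATA[
```python
import heapq


def topk_experts(scores, k):
    """Return the indices of the k highest-scoring experts, sorted."""
    idx = range(len(scores))
    best = heapq.nlargest(k, idx, key=lambda i: scores[i])
    return sorted(best)


def receiver_mask(expert_ids):
    """Encode a set of expert endpoints as an integer bitmask.

    Assumption: expert i maps to bit i (hypothetical mapping).
    """
    mask = 0
    for i in expert_ids:
        mask |= 1 << i
    return mask


# Hypothetical gating scores for two tokens over N = 8 experts.
token_a = [0.1, 0.7, 0.05, 0.9, 0.2, 0.3, 0.0, 0.4]
token_b = [0.8, 0.1, 0.6, 0.05, 0.2, 0.3, 0.0, 0.4]

mask_a = receiver_mask(topk_experts(token_a, 2))  # experts 1 and 3
mask_b = receiver_mask(topk_experts(token_b, 2))  # experts 0 and 2
```
]]></sourcecode>
        <t>
          The two tokens select different experts and therefore yield different
          masks, even though the set of eight expert endpoints never changes.
        </t>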
      </section>

      <section numbered="true" toc="default">
        <name>Latency and Jitter Sensitivity</name>
        <t>
          GPU training steps are synchronization points. Straggler effects, 
          where a single slow receiver delays the entire collective, make
          tail latency a critical metric. The multicast mechanism should minimize
          added latency and jitter relative to optimal unicast paths.
        </t>
      </section>

      <section numbered="true" toc="default">
        <name>RDMA and RoCE Compatibility</name>
        <t>
          AI workloads primarily rely on RDMA (Remote Direct Memory Access) today for
          high-throughput, low-latency data transfer. In Ethernet-based AI
          data centers, RoCEv2 (RDMA over Converged Ethernet) is the dominant
          transport. RoCEv2 uses the InfiniBand Base Transport Header (BTH)
          encapsulated in UDP. Any multicast solution should be compatible with
          this transport model, or should define a clear interworking point.
        </t>
      </section>

      <section numbered="true" toc="default">
        <name>ACK Aggregation</name>
        <t>
          In unicast RDMA, each receiver generates an ACK per received packet
          or message. In P2MP multicast, N receivers generating individual ACKs
          back toward the sender creates an ACK implosion problem. This imposes
          load on both the network (N reverse unicast flows) and the sender's
          RDMA NIC (processing N individual ACKs). An ACK aggregation mechanism
          is required that collapses N ACKs into a single ACK before reaching
          the sender's RDMA NIC, while preserving RoCE ACK semantics including
          BTH opcode and PSN fields.
        </t>
      </section>

      <section numbered="true" toc="default">
        <name>Congestion Control</name>
        <t>
          RoCEv2 environments typically rely on Priority Flow Control (PFC)
          and DCQCN for congestion management. A multicast solution should either
          integrate with these mechanisms or define equivalent behavior.
          Packet spraying, increasingly adopted for load balancing (e.g.,
          Nvidia Adaptive Routing, Broadcom DLB/GLB), complicates per-flow
          congestion signals.
        </t>
      </section>

      <section numbered="true" toc="default">
        <name>Scalability</name>
        <t>
          GPU clusters in large AI data centers may contain thousands of
          endpoints organized into multiple communication groups running
          simultaneously. The multicast solution should scale in terms of group
          state, replication overhead and control plane complexity.
        </t>
      </section>

    </section>

    <section numbered="true" toc="default">
      <name>Extending BIER</name>

      <section numbered="true" toc="default">
        <name>BIER Overview</name>
        <t>
          BIER <xref target="RFC8279"/> encodes the set of intended receivers
          as a bitstring in the packet header. Each router in the BIER domain
          forwards and replicates packets based on the bitstring without
          maintaining per-flow or per-group state. This stateless property is
          a key operational advantage.
        </t>
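        <t>
          The replication behavior described above can be sketched as follows.
          This is a simplified illustration that loosely follows the forwarding
          procedure in <xref target="RFC8279"/>. The function name, the
          dictionary-based forwarding table, the neighbor names and the
          zero-based bit numbering are assumptions made for brevity; the RFC
          numbers bit positions from 1 and also keys the table on the Set
          Identifier.
        </t>
        <sourcecode type="python"><![CDATA[
```python
def bier_forward(bitstring, bift, local_bit=None):
    """Replicate one packet according to its BIER bitstring.

    bitstring: int; bit i set means the BFER on bit i is a receiver.
    bift: dict mapping bit index to (forwarding bit mask, neighbor).
    local_bit: bit index of this router if it is itself a BFER.
    Returns (deliver_locally, [(neighbor, bitstring_of_copy), ...]).
    """
    deliver = False
    copies = []
    remaining = bitstring
    while remaining:
        # Index of the lowest set bit still to be served.
        k = (remaining & -remaining).bit_length() - 1
        if k == local_bit:
            deliver = True
            remaining &= ~(1 << k)
            continue
        entry = bift.get(k)
        if entry is None:
            remaining &= ~(1 << k)  # unreachable BFER: drop the bit
            continue
        f_bm, neighbor = entry
        # The copy carries only the receivers reached via this neighbor.
        copies.append((neighbor, remaining & f_bm))
        # Clear every bit just served (and bit k, as a safety guard).
        remaining &= ~(f_bm | (1 << k))
    return deliver, copies


# Hypothetical table: bits 0-1 behind "leaf1", bits 2-3 behind "leaf2".
example_bift = {
    0: (0b0011, "leaf1"),
    1: (0b0011, "leaf1"),
    2: (0b1100, "leaf2"),
    3: (0b1100, "leaf2"),
}
result = bier_forward(0b1011, example_bift)
```
]]></sourcecode>
        <t>
          The router consults only the packet's bitstring and a table derived
          from unicast routing. It holds no per-group state, which is the
          property the remainder of this section relies on.
        </t>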
        <t>
          <xref target="I-D.zzhang-bier-optimized"/> describes optimizations to BIER 
          targeting AI data center environments, including mechanisms for ACK aggregation 
          within the BIER forwarding plane.
        </t>
      </section>

      <section numbered="true" toc="default">
        <name>Arguments for Extending BIER</name>
        <ul>
          <li>
              BIER is an existing IETF standard with implementations in
              multiple vendor platforms, reducing time to deployment.
          </li>
          <li>
              BIER's stateless forwarding model aligns well with AI P2MP 
              groups whose membership is known at the source, as is the 
              case in dense model training, avoiding the need for network-wide 
              group state. MoE workloads introduce additional complexity, as the 
              subset of active expert endpoints varies per training step, even 
              though the full set of experts is fixed for the duration of the job.
          </li>
          <li>
              IETF practice favors reuse of existing protocols where
              requirements can be met, reducing fragmentation of the
              standards landscape.
          </li>
          <li>
              Extending BIER leverages existing operational familiarity
              and tooling in networks where BIER is already deployed.
          </li>
          <li>
              The BIER header can carry arbitrary payloads, including
              RoCEv2/BTH, without modification to the forwarding plane.
          </li>
        </ul>
      </section>

      <section numbered="true" toc="default">
        <name>Limitations of Extending BIER</name>
        <ul>
          <li>
              BIER was designed as a replacement for traditional IP multicast
              forwarding and does not natively address ACK aggregation, which
              is a fundamental requirement for AI P2MP. Adding ACK aggregation
              to BIER requires significant new machinery that may be
              architecturally inconsistent with BIER's stateless forwarding
              model.
          </li>
          <li>
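              The BIER BitStringLength limits the number of BFERs (endpoints)
              that a single packet can address. Reaching more BFERs in large
              GPU clusters requires either multiple Set Identifiers, and
              therefore multiple copies of each packet, or hierarchical BIER
              domains. Both add complexity.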
          </li>
          <li>
              BIER does not define congestion control. Layering RoCEv2
              congestion semantics over BIER requires additional specification
              work.
          </li>
          <li>
              BIER is a forwarding plane mechanism. The control plane
              (BGP-BIER, OSPF-BIER) was not designed with the dynamic,
              job-scoped group lifecycle typical of AI training workloads,
              where groups may be set up and torn down thousands of times
              per day in a busy cluster.
          </li>
          <li>
              ACK aggregation in a BIER context requires nodes to maintain
              per-group ACK state, which contradicts BIER's stateless design
              principle.
          </li>
        </ul>
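        <t>
          To make the bitstring-size limitation listed above concrete, the
          following sketch applies the BFR-id to Set Identifier mapping from
          <xref target="RFC8279"/> and counts how many copies of a packet an
          ingress router would need to send. The function names and the dense
          assignment of BFR-ids starting at 1 are assumptions for illustration.
        </t>
        <sourcecode type="python"><![CDATA[
```python
import math

# BitStringLength values defined by RFC 8279.
VALID_BSL = (64, 128, 256, 512, 1024, 2048, 4096)


def si_and_bit(bfr_id, bsl):
    """Map a BFR-id (1..65535) to (Set Identifier, bit position)."""
    if bsl not in VALID_BSL:
        raise ValueError("unsupported BitStringLength")
    if not 1 <= bfr_id <= 65535:
        raise ValueError("BFR-id out of range")
    # Bit positions are numbered from 1 within each set.
    return (bfr_id - 1) // bsl, (bfr_id - 1) % bsl + 1


def copies_needed(bfr_ids, bsl):
    """Packet copies sent by the ingress: one per distinct set."""
    return len({si_and_bit(b, bsl)[0] for b in bfr_ids})


def sets_for_cluster(num_endpoints, bsl):
    """Sets required if BFR-ids are assigned densely from 1."""
    return math.ceil(num_endpoints / bsl)


# A hypothetical 4096-endpoint cluster with a 256-bit bitstring.
sets = sets_for_cluster(4096, 256)
broadcast_copies = copies_needed(range(1, 4097), 256)
```
]]></sourcecode>
        <t>
          Under these assumptions, a broadcast to all 4096 endpoints with a
          256-bit bitstring requires one packet copy per set. That reduces, but
          does not eliminate, ingress replication.
        </t>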
      </section>

    </section>

    <section numbered="true" toc="default">
      <name>Defining a New Protocol</name>

      <section numbered="true" toc="default">
        <name>Arguments for a New Protocol</name>
        <ul>
          <li>
              A purpose-built protocol can be designed from the ground up
              to satisfy AI P2MP requirements including ACK aggregation,
              RoCE/BTH compatibility, and congestion control integration,
              without being constrained by the design assumptions of BIER.
          </li>
          <li>
              AI data center networks are a distinct environment from the
              WAN and campus networks where BIER was originally targeted.
              The scale, traffic patterns, and transport semantics differ
              significantly enough to justify a purpose-built solution.
          </li>
          <li>
              A new protocol can define a clean ACK aggregation mechanism
              that is first-class rather than bolted on, with well-defined
              semantics for BTH PSN handling and aggregation point placement.
          </li>
          <li>
              A new protocol can be co-designed with the RoCEv2 transport
              model, defining explicit interworking points (e.g., source ToR,
              smart NIC/DPU) for ACK aggregation without ambiguity.
          </li>
          <li>
              Avoiding overloading BIER with AI-specific extensions preserves
              the clarity and deployability of BIER in its existing use cases.
          </li>
        </ul>
      </section>

      <section numbered="true" toc="default">
        <name>Limitations of a New Protocol</name>
        <ul>
          <li>
              A new protocol requires new implementations across the entire
              stack (switch ASICs, NICs and control plane software), increasing
              time to deployment and adoption risk.
          </li>
          <li>
              A new protocol requires a new IETF working group charter or
              significant expansion of an existing charter, adding process
              overhead.
          </li>
          <li>
              Operators may be reluctant to deploy a new protocol in
              production networks without a substantial track record.
          </li>
          <li>
              There is a risk of fragmentation if multiple vendors define
              proprietary solutions in the absence of a timely standard.
          </li>
        </ul>
      </section>

    </section>

    <section numbered="true" toc="default">
      <name>Comparative Analysis</name>
      <t>
        The following table summarizes the tradeoffs across key dimensions.
      </t>
      <table>
        <thead>
          <tr>
            <th>Dimension</th>
            <th>Extend BIER</th>
            <th>New Protocol</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td>ACK aggregation</td>
            <td>Requires significant extension; conflicts with stateless model</td>
            <td>Can be designed as first-class feature</td>
          </tr>
          <tr>
            <td>RoCE/BTH compatibility</td>
            <td>Payload-agnostic; interworking not specified</td>
            <td>Can be co-designed with RoCEv2</td>
          </tr>
          <tr>
            <td>Congestion control</td>
            <td>Not in scope for BIER; requires additional work</td>
            <td>Can integrate DCQCN natively</td>
          </tr>
          <tr>
            <td>Scalability</td>
            <td>Bitstring size limits; hierarchical domains add complexity</td>
            <td>Designed for large GPU cluster scale</td>
          </tr>
          <tr>
            <td>Deployment speed</td>
            <td>Faster; builds on existing implementations</td>
            <td>Slower; requires new implementations</td>
          </tr>
          <tr>
            <td>Operational familiarity</td>
            <td>Higher where BIER is deployed</td>
            <td>Lower; new operational model</td>
          </tr>
          <tr>
            <td>Standards risk</td>
            <td>Lower; incremental extension</td>
            <td>Higher; new charter/scope required</td>
          </tr>
          <tr>
            <td>Architectural cleanliness</td>
            <td>Lower; AI requirements are a poor fit for BIER assumptions</td>
            <td>Higher; purpose-built for use case</td>
          </tr>
        </tbody>
      </table>
      <t>
        The authors observe that while extending BIER offers a faster path
        to a first implementation, the fundamental mismatch between BIER's
        stateless forwarding model and the stateful ACK aggregation requirement
        is a significant concern. ACK aggregation is not an optional feature; it is necessary for correctness at scale. Any extension to BIER that
        adds per-group ACK state effectively creates a new protocol that
        happens to reuse the BIER forwarding header.
      </t>
      <t>
        The authors therefore lean toward defining a new protocol that is
        purpose-built for AI P2MP, while acknowledging that this position
        should be validated through working group discussion, particularly
        from operators who have deployed or are evaluating BIER in AI DC
        environments.
      </t>
    </section>

    <section numbered="true" toc="default">
      <name>Open Issues</name>
      <t>
        The following open issues represent challenges that any solution
        must address, regardless of whether BIER is extended or a new
        protocol is defined. They are not arguments for or against either
        approach, but rather constraints that any design must satisfy.
      </t>

      <section numbered="true" toc="default">
        <name>ACK Aggregation Point</name>
        <t>
          The aggregated ACK delivered to the sender's RDMA NIC must conform
          to RoCEv2 ACK semantics, including a valid BTH with opcode 0x11
          (ACKNOWLEDGE) and a PSN representing the collective acknowledgment
          state of all receivers (typically the minimum PSN across all
          receivers to ensure no data is incorrectly freed).
        </t>
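        <t>
          The minimum-PSN rule described above can be sketched as follows. The
          class and method names are hypothetical, and the sketch makes several
          simplifying assumptions: ACKs are cumulative, PSNs do not wrap around
          the 24-bit PSN space, and NAKs, timeouts and receiver failures are
          not handled.
        </t>
        <sourcecode type="python"><![CDATA[
```python
class AckAggregator:
    """Collapse per-receiver cumulative ACKs into one ACK stream.

    Illustrative only: ignores PSN wraparound, NAKs and timeouts.
    """

    def __init__(self, receivers):
        # Highest PSN acknowledged by each receiver (None = none yet).
        self.acked = {r: None for r in receivers}
        self.last_sent = None

    def on_ack(self, receiver, psn):
        """Record an ACK; return the aggregated PSN to emit, or None."""
        prev = self.acked[receiver]
        if prev is None or psn > prev:
            self.acked[receiver] = psn
        if any(v is None for v in self.acked.values()):
            return None  # a receiver has not acknowledged anything yet
        floor = min(self.acked.values())
        if self.last_sent is not None and floor <= self.last_sent:
            return None  # aggregate did not advance; suppress the ACK
        self.last_sent = floor
        return floor


agg = AckAggregator(["gpu1", "gpu2", "gpu3"])
emitted = [
    agg.on_ack("gpu1", 5),
    agg.on_ack("gpu2", 7),
    agg.on_ack("gpu3", 3),  # all receivers heard from: emit min = 3
    agg.on_ack("gpu3", 6),  # min advances to 5
    agg.on_ack("gpu2", 9),  # min is still 5: nothing emitted
]
```
]]></sourcecode>
        <t>
          Note that the aggregator holds one entry per receiver per group. This
          is the per-group state that any aggregation point, wherever it is
          placed, has to maintain.
        </t>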
        <t>
          Candidate aggregation points include, but are not limited to:
        </t>
        <ul>
          <li>
              The source ToR switch, which is a natural convergence point
              for returning ACKs under packet spraying, as all receiver ACKs
              are destined for the source host and must traverse the source ToR.
              This requires the ToR to implement BTH-aware ACK aggregation,
              which is not supported by standard merchant silicon today.
          </li>
          <li>
              A software agent on the source host, which intercepts
              returning ACKs in software before they reach the RDMA stack.
              This reintroduces CPU involvement and may compromise the
              kernel-bypass properties of RDMA.
          </li>
          <li>
              A smart NIC or DPU (Data Processing Unit) on the source host,
              which performs BTH-aware ACK aggregation in NIC hardware without
              involving the host CPU. This preserves kernel-bypass semantics
              and is likely the cleanest option architecturally, but requires
              DPU programmability.
          </li>
        </ul>
        <t>
          Each option has different tradeoffs in terms of where intelligence
          is placed, whether host CPU is involved, and what hardware
          capabilities are required. Further analysis is needed.
        </t>
      </section>

      <section numbered="true" toc="default">
        <name>BTH Classification in the Network</name>
        <t>
          Without additional hardware capabilities, data center switches treat
          RoCE ACKs as ordinary UDP packets and cannot distinguish them from
          RoCE data packets, because the BTH opcode is carried inside the UDP
          payload and is not visible to standard switch forwarding logic.
          Classifying ACK packets in the network therefore requires either
          P4-capable ASICs with custom parsers or vendor-specific deep packet
          inspection features. This is a practical constraint on any in-network
          ACK aggregation mechanism.
        </t>
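        <t>
          The parsing step that such a classifier has to perform can be
          illustrated in software. The sketch below reads the 12-byte BTH from
          a UDP payload and checks for the ACKNOWLEDGE opcode. The function
          names are hypothetical, and the sketch assumes the caller has already
          extracted the UDP destination port and payload; it does not model how
          a switch pipeline would implement the same logic.
        </t>
        <sourcecode type="python"><![CDATA[
```python
import struct

ROCEV2_UDP_PORT = 4791  # UDP destination port used by RoCEv2
BTH_LEN = 12            # Base Transport Header length in bytes
OPCODE_RC_ACK = 0x11    # ACKNOWLEDGE


def parse_bth(udp_payload):
    """Return (opcode, dest_qp, psn) from the start of a UDP payload."""
    if len(udp_payload) < BTH_LEN:
        raise ValueError("payload shorter than BTH")
    # Layout: opcode(1) flags(1) P_Key(2) resv+DestQP(4) A/resv+PSN(4)
    opcode, _flags, _pkey, qp_word, psn_word = struct.unpack(
        "!BBHII", udp_payload[:BTH_LEN]
    )
    # DestQP and PSN occupy the low 24 bits of their 32-bit words.
    return opcode, qp_word & 0xFFFFFF, psn_word & 0xFFFFFF


def is_roce_ack(udp_dst_port, udp_payload):
    """True if the packet looks like a RoCEv2 ACKNOWLEDGE."""
    if udp_dst_port != ROCEV2_UDP_PORT:
        return False
    if len(udp_payload) < BTH_LEN:
        return False
    return parse_bth(udp_payload)[0] == OPCODE_RC_ACK


# Hand-built example BTH: opcode 0x11, DestQP 0x12, PSN 0xABCDEF.
sample = bytes([0x11, 0x00, 0xFF, 0xFF,
                0x00, 0x00, 0x00, 0x12,
                0x00, 0xAB, 0xCD, 0xEF])
fields = parse_bth(sample)
```
]]></sourcecode>
        <t>
          The opcode sits at a fixed offset behind the UDP header, so the check
          itself is simple. The constraint is that a fixed-function forwarding
          pipeline typically does not parse that far into the packet.
        </t>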
      </section>

      <section numbered="true" toc="default">
        <name>Interaction with Packet Spraying</name>
        <t>
          Packet spraying is increasingly adopted in AI DC networks for
          load balancing (e.g., Nvidia Adaptive Routing, Broadcom DLB/GLB,
          China Mobile GSE). Under packet spraying, ACK paths cannot be
          assumed to be symmetric with data paths, which complicates
          in-network ACK aggregation at arbitrary points. The source ToR
          aggregation point is robust to packet spraying since all ACKs
          must converge there regardless of the spray path taken.
        </t>
      </section>

    </section>

    <section numbered="true" toc="default">
      <name>Security Considerations</name>
      <t>
        This document does not define a protocol and introduces no new
        security considerations beyond those already discussed in
        <xref target="RFC8279"/> and related RDMA/RoCEv2 specifications.
        Security considerations for any protocol defined based on this
        analysis should address spoofing of aggregated ACKs, which could
        cause a sender to incorrectly advance its send window.
      </t>
    </section>

    <section numbered="true" toc="default">
      <name>IANA Considerations</name>
      <t>
        This document has no IANA actions.
      </t>
    </section>

  </middle>

  <back>

    <references>
      <name>References</name>

      <references>
        <name>Informative References</name>

        <reference anchor="RFC8279">
          <front>
            <title>Multicast Using Bit Index Explicit Replication (BIER)</title>
            <author initials="IJ." surname="Wijnands" fullname="IJ. Wijnands"/>
            <author initials="E." surname="Rosen" fullname="E. Rosen"/>
            <author initials="A." surname="Dolganow" fullname="A. Dolganow"/>
            <author initials="T." surname="Przygienda" fullname="T. Przygienda"/>
            <author initials="S." surname="Aldrin" fullname="S. Aldrin"/>
            <date year="2017" month="November"/>
          </front>
          <seriesInfo name="RFC" value="8279"/>
        </reference>

        <reference anchor="I-D.zzhang-bier-optimized">
          <front>
            <title>Optimized BIER for AI Data Center Environments</title>
            <author initials="Z." surname="Zhang" fullname="Zhaohui Zhang"/>
            <date year="2026"/>
          </front>
          <seriesInfo name="Internet-Draft" value="draft-zzhang-bier-optimized"/>
        </reference>

      </references>
    </references>

    <section numbered="false" toc="default">
      <name>Acknowledgements</name>
      <t>
        The authors thank Kefei Liu for valuable discussion on ACK
        identification and aggregation point selection on the mcast4ai
        mailing list.
      </t>
    </section>

  </back>

</rfc>