| Internet-Draft | SRv6 for AI Backends | June 2026 |
| Filsfils, et al. | Expires 5 December 2026 | |
This document describes how SRv6 uSID (NEXT-CSID) enables deterministic path placement in AI backend fabrics through L3-L4 integration: the transport stack on the NIC encodes each path as an ordered list of segments (a uSID network program) in the packet header, while the fabric forwards statelessly. It explains operational benefits including deterministic probing and alignment with hyperscale production deployments.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 5 December 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Hyperscale AI training clusters rely on massive GPU-to-GPU data exchanges, where training-step synchronization delays caused by congestion and packet loss directly degrade training performance and raise operational cost.¶
These workloads generate large, predictable flows that require ultra-low latency, high bandwidth, and precise congestion control to maintain efficiency. Traditional networking approaches, such as ECMP-based per-flow load balancing, suffer from poor entropy due to the limited number of RoCEv2 flows, leading to fabric hotspots, congestion, and slow reconvergence after failures.¶
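The entropy problem can be illustrated with a small calculation. The sketch below assumes an idealized ECMP hash that places each flow uniformly and independently on one of the equal-cost paths (a birthday-problem model, not a description of any particular switch hash). The function name and the flow and path counts are illustrative only.

```python
from math import prod


def collision_probability(flows: int, paths: int) -> float:
    """Probability that at least two flows land on the same path.

    Assumption: each flow is hashed uniformly and independently onto one
    of `paths` equal-cost links (idealized ECMP, birthday-problem model).
    """
    if flows > paths:
        return 1.0  # pigeonhole: some path must carry two flows
    no_collision = prod((paths - i) / paths for i in range(flows))
    return 1.0 - no_collision


if __name__ == "__main__":
    # Illustrative flow/path counts; not taken from any deployment.
    for flows, paths in [(4, 16), (8, 16), (16, 64)]:
        p = collision_probability(flows, paths)
        print(f"{flows} flows over {paths} paths: P(collision) = {p:.2f}")
```

Under this model, even a handful of long-lived flows is likely to share a link while other links sit idle, which is the hotspot behavior described above.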
SRv6 uSID (NEXT-CSID) enables L3-L4 integration in AI backend fabrics: the transport stack on the NIC (e.g., SmartNIC, DPU) controls which path each packet follows by encoding an ordered list of segments in the outer IPv6 header, while switches perform simple, static forwarding without per-flow state. This ensures predictable performance, fine-grained traffic control, and rapid reaction to congestion without fabric reconvergence.¶
This model is deployed at hyperscale. OpenAI, Microsoft, and Oracle Cloud Infrastructure operate training clusters using Multipath Reliable Connection (MRC) over static SRv6 source routing. MRC extends RoCEv2 with multipath packet spraying and transport-layer path selection; SRv6 provides a deterministic mapping from each path identifier to a unique physical path through the fabric. Section 10 summarizes these deployments and provides references.¶
[SRv6-E2E-Frontend-WAN] explains how SRv6 uSID (NEXT-CSID) is applied to an end-to-end DC Frontend and WAN fabric.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
uSID: Micro-segment. Formally defined as NEXT-CSID in [RFC9800].¶
The term uSID (micro SID) predates the formal naming and has been widely adopted across the industry, including by operators with large-scale deployments, vendors, and open-source implementations, and it is used consistently in multi-vendor interoperability reports.¶
To maintain alignment with the formal specification while also acknowledging the widespread and practical use of the term, this document uses uSID and NEXT-CSID interchangeably.¶
AI workloads exhibit highly structured traffic patterns:¶
At hyperscale, faults are routine rather than exceptional. In a 54-day Llama 3 405B pre-training run on 16,384 GPUs, Meta reported 419 unexpected interruptions (about 78% hardware-related; 8.4% network switches and cables) [Llama3-Herd]. Synchronized training makes fabric congestion, loss, or degraded paths costly for jobs that run for weeks. Meta's large-scale RoCE backend design is described in [Meta-RoCE-SIGCOMM24].¶
The source encodes each path as a uSID network program in the packet header; the transport sprays packets across disjoint network paths and chooses a different path upon congestion. SRv6 enables L3-L4 integration: the transport stack on the NIC controls the path that AI workload traffic takes through the fabric by encoding an ordered list of segments in the packet header, while the network layer provides stateless static forwarding.¶
Control plane and orchestration: At bring-up, the orchestrator discovers topology and the SRv6 uSIDs configured on each link. These uSID instructions are statically configured on the routers and are independent of any dynamic routing protocol state.¶
Transport stack on the NIC: Before sending RoCEv2 traffic, the transport stack encapsulates the packet with an outer IPv6 header that carries the uSID program selected on the NIC for that packet or spray round.¶
In a dynamically routed fabric, protocols such as BGP maintain reachability across the Clos fabric. When a link fails or the topology changes, switches must reconverge: prefixes are withdrawn or re-advertised, each device updates its RIB, and forwarding entries are reprogrammed. A point-to-point link-down event may be detected within milliseconds, but BGP reconvergence in a large datacenter fabric typically takes from tens of milliseconds to just under a second in optimized designs, and can take substantially longer when many paths or prefixes are affected [RFC7938]. At extreme scale, convergence may require many round-trip times across the fabric before forwarding is stable [MRC-SRv6-Paper].¶
In the statically provisioned SRv6 model in this document, GPU traffic paths do not depend on that reconvergence cycle. uSID instructions are configured at fabric bring-up and remain in place on the routers. When a link degrades or fails, the fabric does not wait for BGP or other dynamic routing to install a new path. The source NIC or transport stack detects the problem through loss, ECN, latency, or probing, selects another uSID network program from the topology information it already holds, and encodes it in the outer IPv6 Destination Address of subsequent packets. That change is local to the sender, takes effect within microseconds, and does not require signaling to intermediate switches or coordination with routing protocol timers [Microsoft-Fairwater].¶
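The sender-local nature of this reaction can be sketched as follows. This is a minimal illustration, not the MRC algorithm: the class name `PathTable`, its methods, and the round-robin policy are hypothetical, and the Spine5 uSID value (0500) in the first address is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PathTable:
    """Candidate uSID programs toward one destination, held on the sender."""
    programs: List[str]                      # outer IPv6 DAs, one per path
    unhealthy: Set[str] = field(default_factory=set)
    _next: int = 0                           # round-robin cursor

    def mark_unhealthy(self, program: str) -> None:
        """Record loss, ECN, latency, or probe evidence against a path."""
        self.unhealthy.add(program)

    def select(self) -> str:
        """Return the next healthy program; purely local, no fabric signaling."""
        for _ in range(len(self.programs)):
            candidate = self.programs[self._next % len(self.programs)]
            self._next += 1
            if candidate not in self.unhealthy:
                return candidate
        raise RuntimeError("no healthy path left toward this destination")


# Hypothetical programs; the 0500 uSID for Spine5 is assumed for illustration.
VIA_SPINE5 = "5f00:0:100:500:300::"
VIA_SPINE6 = "5f00:0:100:600:300::"

table = PathTable(programs=[VIA_SPINE5, VIA_SPINE6])
table.mark_unhealthy(VIA_SPINE5)   # e.g., after repeated ECN marks
print(table.select())              # subsequent packets carry the other program
```

The only state that changes is on the sender: the next packet simply carries a different destination address, which is why no routing protocol timer is involved.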
Accurate visibility into fabric health is essential for AI backend operations: scheduling repairs, tuning performance, and correlating transport behavior with physical paths. Traditional approaches face limitations at scale.¶
With ECMP-based forwarding, probe and data packets may traverse different paths because hashing is sensitive to header fields. Mechanisms that send probes to remote nodes depend on remote availability and do not always localize faults precisely. ICMP probes to switches are often handled by the control plane, which limits probe frequency. Dynamic routing can change forwarding while probes are in flight, so measurements no longer reflect ground truth.¶
SRv6 uSID source routing enables deterministic probe pinning: a probe encoded with the same segment program as data traffic follows the identical physical path through the fabric. There is no ECMP ambiguity, no dependence on switch control-plane ICMP handling for path validation, and no interaction with dynamic routing reconvergence during measurement.¶
Deterministic probing simplifies denylist and spray/path-selection policies on the NIC, supports resurrection of paths after transient failures, and gives operations teams ground-truth telemetry that does not depend on switch control-plane health.¶
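A denylist with resurrection driven by pinned probes might look like the sketch below. The class `ProbeTracker`, its thresholds, and the consecutive-result policy are assumptions chosen for illustration; the document does not specify how a NIC decides to denylist or resurrect a path.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ProbeTracker:
    """Tracks pinned-probe results per uSID program (hypothetical policy)."""
    fail_threshold: int = 3      # consecutive lost probes before denylisting (assumed)
    recover_threshold: int = 5   # consecutive good probes before resurrection (assumed)
    denylist: Set[str] = field(default_factory=set)
    # Positive value: current run of successes; negative: current run of losses.
    _streak: Dict[str, int] = field(default_factory=dict)

    def record(self, program: str, ok: bool) -> None:
        """Fold one probe result into the streak and update the denylist."""
        prev = self._streak.get(program, 0)
        if ok:
            streak = prev + 1 if prev > 0 else 1
        else:
            streak = prev - 1 if prev < 0 else -1
        self._streak[program] = streak
        if streak <= -self.fail_threshold:
            self.denylist.add(program)
        elif streak >= self.recover_threshold:
            self.denylist.discard(program)   # path resurrected

    def usable(self, program: str) -> bool:
        return program not in self.denylist
```

Because each probe carries the same segment program as the data it stands in for, a streak of losses can be attributed to that specific physical path rather than to whichever path a hash happened to pick.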
The following figure depicts a typical 2-tier Clos topology.¶
          Spine5                      Spine6
             |                           |
    +--------+----+--------------+-------|-----+
    |             |              |       |     |
    |   +---------|---+----------|---+---+-----|----+
    |   |         |   |          |   |         |    |
 +--+------+   +--+------+    +--+------+   +--+------+
 |  Leaf1  |   |  Leaf2  |    |  Leaf3  |   |  Leaf4  |
 +----+----+\ /+----+----+    +----+----+\ /+----+----+
      |      X      |              |      X      |
      |     / \     |              |     / \     |
      |    /   \    |              |    /   \    |
 +----+----+   +----+----+    +----+----+   +----+----+
 |  DPU1   |   |  DPU2   |    |  DPU3   |   |  DPU4   |
 |   | |   |   |   | |   |    |   | |   |   |   | |   |
 |  GPU1   |   |  GPU2   |    |  GPU3   |   |  GPU4   |
 +---------+   +---------+    +---------+   +---------+
The topology has two Spine devices, each connected to all four Leaf devices.¶
There are four NICs (shown as DPUs in the figure), each connected through a host interface (e.g., PCIe) to a GPU. In this example, each NIC is dual-homed to two Leaf devices.¶
At day-0 cluster build-up (fabric bring-up), the topology is provisioned with SRv6 SIDs on the Spine and Leaf devices. These SIDs are statically configured and thus independent of any dynamic routing protocol state. The following is provisioned:¶
During a collective, GPU1 and GPU2 send traffic to GPU3. The transport on each NIC sprays packets across disjoint uSID network programs, each encoded as an ordered segment list in the outer IPv6 header. The orchestrator supplies topology and uSID information at bring-up; path choice and spraying are performed by the NIC transport (e.g., MRC) at send time. Example programs:¶
When sending RoCEv2 traffic from GPU1 to GPU3:¶
NIC1: creates a RoCEv2 packet that must be sent to NIC3. NIC1 encapsulates the RoCEv2 packet with an outer IPv6 header (H.Encaps.Red behavior).¶
Leaf1:¶
Spine5:¶
Leaf3:¶
Note that Leaf1, Spine5, and Leaf3 do not hold any state for this specific flow. Each node holds a single uSID instruction, instantiated at cluster build-up and reused by all traffic whose program includes it. GPU2->GPU3 uses the second example program; forwarding follows the same stateless model on each hop.¶
Although this example uses uN instructions, the same path can also be encoded with uA instructions that specify the sequence of interfaces.¶
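The per-hop handling in the walk-through above can be sketched as a shift of the destination address. The sketch assumes a 32-bit uSID block (5f00:0::/32) and 16-bit uSIDs, both inferred from the destination address 5f00:0:0100:0600:0300:: that this document gives for the Leaf1, Spine6, Leaf3 path. It is a simplification of the NEXT-CSID behavior in [RFC9800]: it covers only the shift-and-forward case and omits the handling that applies once no uSIDs remain.

```python
import ipaddress

BLOCK_BITS = 32   # assumed uSID block length (5f00:0::/32)
USID_BITS = 16    # assumed uSID length
REST_BITS = 128 - BLOCK_BITS
REST_MASK = (1 << REST_BITS) - 1


def active_usid(da: str) -> int:
    """Return the uSID that sits directly after the block (the active one)."""
    value = int(ipaddress.IPv6Address(da))
    return (value >> (REST_BITS - USID_BITS)) & ((1 << USID_BITS) - 1)


def shift_and_forward(da: str) -> str:
    """Consume the active uSID: shift the rest left and zero-fill the tail."""
    value = int(ipaddress.IPv6Address(da))
    block = value >> REST_BITS
    rest = ((value & REST_MASK) << USID_BITS) & REST_MASK
    return str(ipaddress.IPv6Address((block << REST_BITS) | rest))


# Walk the Leaf1 -> Spine6 -> Leaf3 program hop by hop.
da = "5f00:0:0100:0600:0300::"
while active_usid(da) != 0:
    print(f"active uSID {active_usid(da):04x}, DA {ipaddress.IPv6Address(da)}")
    da = shift_and_forward(da)
```

Each node only needs a route for its own uSID under the block; the shift exposes the next node's uSID, so the lookup at the following hop is again a plain longest-prefix match.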
At any time during the execution of the AI job, Spine5 may experience congestion. The transport stack on NIC1 detects this via ECN, in-band latency, packet trimming, or loss feedback.¶
Within microseconds, without fabric signaling or new state at intermediate devices, the transport stack steers traffic to a different path. NIC1 switches the path from <Leaf1, Spine5, Leaf3> to <Leaf1, Spine6, Leaf3> by encapsulating new traffic for GPU1->GPU3 with IPv6 DA 5f00:0:0100:0600:0300::.¶
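Building such a destination address from an ordered segment list can be sketched as below. The function name `encode_usid_program` is hypothetical; the /32 block and the 16-bit uSID width are assumptions read off the address above, and the uSID values 0100, 0600, and 0300 are taken from that same address.

```python
import ipaddress
from typing import Sequence


def encode_usid_program(block: str, usids: Sequence[int],
                        usid_bits: int = 16) -> ipaddress.IPv6Address:
    """Pack an ordered list of uSIDs behind the uSID block into one IPv6 address."""
    net = ipaddress.IPv6Network(block)
    free_bits = 128 - net.prefixlen
    if len(usids) * usid_bits > free_bits:
        raise ValueError("program does not fit in a single uSID carrier")
    value = int(net.network_address)
    shift = free_bits
    for usid in usids:
        if not 0 < usid < (1 << usid_bits):
            raise ValueError(f"uSID {usid:#x} is zero or wider than {usid_bits} bits")
        shift -= usid_bits          # next slot, filling left to right
        value |= usid << shift
    return ipaddress.IPv6Address(value)


# Leaf1 -> Spine6 -> Leaf3, using the uSID values visible in the address above.
da = encode_usid_program("5f00:0::/32", [0x0100, 0x0600, 0x0300])
print(da)   # same address as 5f00:0:0100:0600:0300::, in compressed notation
```

With these assumed sizes a single address has room for six uSIDs, which is more than the three hops a two-tier Clos path needs.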
This is not switch adaptive routing (e.g., dynamic ECMP or BGP reconvergence). Path changes are made only at the source NIC; switches continue static SRv6 forwarding. The fabric is entirely stateless, and the packet path is encoded in the IPv6 header built at the source. Separating transport-controlled path selection from switch-based adaptive routing avoids unpredictable interactions at scale and is essential because AI workloads cannot tolerate slow reconvergence [Microsoft-Fairwater].¶
AI workloads are deployed across thousands of GPUs in multi-tier Clos networks, requiring a networking architecture that scales efficiently. SRv6 uSID (NEXT-CSID) ensures deterministic path placement while maintaining scalability through the following mechanisms:¶
Static SRv6 uSID (NEXT-CSID) source routing is deployed at hyperscale in production AI training clusters together with Multipath Reliable Connection (MRC), an RDMA transport developed collaboratively by OpenAI, Microsoft, NVIDIA, AMD, Intel, and Broadcom [MRC-SRv6-Paper] [OpenAI-MRC]. MRC extends RoCEv2 Reliable Connection with multipath packet spraying, selective retransmission, packet trimming for incast, and transport-layer path health management; it runs Ethernet in best-effort mode and relies on fast recovery at the transport layer rather than Priority Flow Control. In these fabrics, dynamic routing in switches is disabled: each packet's path is encoded in the outer IPv6 destination address as a uSID program, path identifiers map algorithmically to SRv6 network programs, and the transport stack gains deterministic control while routers forward statelessly. This combination has been used to train frontier large language models on clusters exceeding 100,000 GPUs, with implementations on 400 and 800 Gb/s RDMA NICs and SRv6 forwarding across multiple switch platforms and NOS distributions [MRC-SRv6-Paper].¶
The same architecture spans OpenAI, Microsoft, and Oracle Cloud Infrastructure (OCI) training sites that operate as one coherent design rather than isolated experiments. OpenAI runs MRC over static SRv6 on its largest NVIDIA GB200 supercomputers [OpenAI-MRC] [MRC-SRv6-Paper]; Microsoft's Fairwater supercomputer applies the same model on a two-tier, multi-plane fabric, removing BGP and other dynamic routing from the scale-out network in favor of compact uSID source routing, with probe traffic following the same paths as data for ground-truth visibility [Microsoft-Fairwater]; and OCI's Oracle Acceleron multiplanar networking deployed MRC and SRv6 source-based routing at scale, including the Stargate datacenter in Abilene, Texas, with path intelligence at the NIC and simple static forwarding in the network [Oracle-Acceleron-MRC] [Oracle-Acceleron-Arch]. Across these environments, multipath spraying and SRv6 path pinning reduce flow collisions, improve utilization across network planes, and allow large synchronous training jobs to continue through link flaps and partial failures that previously caused restarts [OpenAI-MRC] [Microsoft-Fairwater]. Microsoft has additionally open-sourced MRC software interfaces and SONiC SRv6 enhancements for AI backend networks [Microsoft-Fairwater].¶
This document is informational and does not define a new protocol. Security for the SRv6 data plane and segment programming is covered in [RFC8986]. The deployment model assumes a dedicated AI backend fabric in a single administrative domain: an orchestrator statically provisions uSIDs on routers and supplies topology to NICs, and transport stacks on GPU hosts encode segment programs in each packet while the fabric forwards without per-flow state. Security therefore depends on integrity of provisioning, access control over router configuration, and path-selection logic on each host.¶
Because the source encodes the full segment list, a compromised or misconfigured host could steer traffic along unintended paths. Operators SHOULD limit which workloads may send SRv6-encapsulated traffic and constrain hosts to programs authorized by the orchestrator. Backend fabrics are typically isolated from untrusted networks, and operators SHOULD maintain that separation. Security of RoCEv2 and Multipath Reliable Connection (MRC) transports is outside the scope of this document.¶
The following person contributed significantly to this document:¶
Chris Martin (Cisco Systems), <martincj@cisco.com>¶
The authors thank the teams behind the MRC and SRv6 production deployments described in Section 10, including contributors at OpenAI, Microsoft, Oracle Cloud Infrastructure, NVIDIA, AMD, Intel, and Broadcom.¶
The authors would like to recognize the work of Lihua Yuan, Guohan Lu, Rita Hui, and Riff Jiang at Microsoft.¶
Pablo Camarillo and Rita Hui presented this use case at NANOG 96; a recording is available at https://www.segment-routing.net/conferences/2026-02-02-NANOG96-SRv6-AI-Backend-Microsoft. Clarence Filsfils and Guohan Lu presented related material at OCP EMEA 2026; a recording is available at https://www.segment-routing.net/conferences/2026-OCP-EMEA-summit-scalable-ai-protocol-stack.¶
The authors would like to acknowledge the work of the developers who have enabled this use case in the open-source [SONiC] implementation, in particular: Carmine Scarpitta, Abhishek Dosi, Changrong Wu, Kumaresh Perumal, Eddie Ruan, Yuqing Zhao, Rajasekar Raja, and Vivek Venkatraman.¶