Internet-Draft      SRv6 for Deterministic Path Placement in AI Backends      April 2025
Filsfils, et al.                Expires 6 October 2025
This document describes the use of SRv6 to enable deterministic path placement in AI backends, optimizing load balancing and congestion control for predictable GPU workloads.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 6 October 2025.¶
Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Hyperscale AI training clusters rely on massive GPU-to-GPU data exchanges, where synchronization delays caused by congestion and packet loss directly impact model convergence time and operational costs.¶
These workloads generate large, predictable flows that require ultra-low latency, high bandwidth, and precise congestion control to maintain efficiency. Traditional networking approaches, such as ECMP-based per-flow load balancing, suffer from poor entropy due to the limited number of RoCEv2 flows, leading to fabric hotspots, congestion, and slow reconvergence after failures.¶
SRv6 uSID (NEXT-CSID) provides the ability to steer traffic through the fabric, allowing the NIC (e.g., SmartNIC, DPU) to perform deterministic path placement of RoCEv2 traffic. This ensures predictable performance, fine-grained traffic control, and real-time adaptation to congestion in a stateless manner.¶
Future revisions of this draft will cover additional use-cases (multi-path transport, stateless interaction between an AI/LLM tenant leasing a cluster infrastructure and the operator managing the cluster, etc.).¶
The document draft-filsfils-srv6-dc-frontend-wan explains how SRv6 uSID (NEXT-CSID) is applied to a converged DC Frontend and WAN fabric.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Micro-segment. Formally defined as NEXT-CSID in [I-D.ietf-spring-srv6-srh-compression].¶
The term uSID (micro SID) predates the formal naming and has been widely adopted across the industry, including by operators with large-scale deployments, vendors, and open-source implementations; it is also used consistently in multi-vendor interoperability reports.¶
To maintain alignment with the formal specification while also acknowledging the widespread and practical use of the term, this document uses uSID and NEXT-CSID interchangeably.¶
AI workloads exhibit highly structured traffic patterns:¶
SRv6 enables the NIC to directly control the AI workload traffic journey through the fabric by encoding an ordered list of segments in the packet header.¶
AI Scheduler: Upon AI job orchestration, the collectives' communications are defined (i.e., the GPU topology). The AI scheduler determines the optimal fabric routed paths based on all the jobs running in the fabric and the GPU topology of each.¶
NIC: Before sending the RoCEv2 traffic, the NIC encapsulates it with an outer IPv6 header and encodes in the packet header the sequence of instructions that enforce the precomputed path through the fabric.¶
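The carrier construction performed by the NIC can be sketched as follows. This is an illustrative model only, not the NIC datapath: the 32-bit uSID block (5f00:0::/32) and the per-node 16-bit uSID values are assumptions chosen to match the addressing used in the walk-through later in this document.

```python
import ipaddress

# Assumed addressing (illustrative): a 32-bit uSID block 5f00:0::/32
# and one 16-bit uSID per fabric node, provisioned at cluster build-up.
USID_BLOCK = b"\x5f\x00\x00\x00"
USIDS = {"Leaf1": 0x0100, "Leaf2": 0x0200, "Leaf3": 0x0300,
         "Leaf4": 0x0400, "Spine5": 0x0500, "Spine6": 0x0600}

def usid_carrier(path):
    """Pack the ordered per-node uSIDs after the block into one IPv6
    address; the remaining bytes stay zero (End-of-Carrier)."""
    usids = b"".join(USIDS[node].to_bytes(2, "big") for node in path)
    packed = (USID_BLOCK + usids).ljust(16, b"\x00")
    return ipaddress.IPv6Address(packed)

# The whole path fits in the IPv6 DA, so no SRH is needed
# (H.Encaps.Red with a single carrier).
print(usid_carrier(["Leaf1", "Spine5", "Leaf3"]))  # 5f00:0:100:500:300::
```

Because a three-hop path fits entirely in the 128-bit destination address, the NIC pushes a single outer IPv6 header and no Segment Routing Header.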
The following figure depicts a typical 2-tier Clos topology.¶
       Spine5                      Spine6
         |                            |
   +-----+-----+-----------+---------|-+
   |           |           |         | |
   |   +-------|---+-------|---+-----+-|---+
   |   |       |   |       |   |       |   |
+--+---+--+ +--+---+--+ +--+---+--+ +--+---+--+
|  Leaf1  | |  Leaf2  | |  Leaf3  | |  Leaf4  |
+----+----+ +----+----+ +----+----+ +----+----+
     |     X     |           |     X     |
     |    / \    |           |    / \    |
+----+----+ +----+----+ +----+----+ +----+----+
|  DPU1   | |  DPU2   | |  DPU3   | |  DPU4   |
|    |    | |    |    | |    |    | |    |    |
|  GPU1   | |  GPU2   | |  GPU3   | |  GPU4   |
+---------+ +---------+ +---------+ +---------+
The topology consists of two Spine devices, each connected to all four Leaf devices.¶
There are four NICs, each connected through the host interface (e.g., PCIe) to a GPU. In this example, each NIC is dual-homed to two Leaf devices.¶
At day-0 cluster build-up (fabric bring-up), the topology is provisioned with SRv6 SIDs on the Spine and Leaf devices. These SIDs are statically configured and thus independent of any dynamic routing protocol state. The following is provisioned:¶
An AI job is orchestrated in the fabric. As a result of the AI orchestration and the collectives' communication pattern, GPU1 and GPU2 must periodically send traffic to GPU3.¶
Based on the network topology, the AI orchestration computes the paths that achieve homogeneous utilization in the fabric to avoid congestion:¶
Upon AI job computation (at GPU synchronization time):¶
NIC1: creates a RoCEv2 packet that must be sent to NIC3. NIC1 encapsulates the RoCEv2 packet with an outer IPv6 header (H.Encaps.Red behavior).¶
Leaf1:¶
Spine5:¶
Leaf3:¶
Note that Leaf1, Spine5, and Leaf3 do not hold any state for this specific flow. It is a single uSID instruction per node instantiated upon cluster build-up and reused by all flows.¶
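The stateless shift-and-forward behavior can be illustrated with a minimal sketch (assuming, as in the hypothetical addressing above, a 32-bit uSID block and 16-bit uSIDs): each uN endpoint consumes its own uSID from the destination address and shifts the remainder toward the block, so no node ever needs per-flow state.

```python
import ipaddress

def un_shift(dst: str) -> str:
    """One uN step (sketch, assuming a 32-bit block and 16-bit uSIDs):
    pop the active uSID and shift the rest, padding with End-of-Carrier."""
    b = ipaddress.IPv6Address(dst).packed
    return str(ipaddress.IPv6Address(b[:4] + b[6:] + b"\x00\x00"))

dst = "5f00:0:100:500:300::"       # carrier built by NIC1
print(un_shift(dst))               # 5f00:0:500:300::  (DA after Leaf1)
print(un_shift(un_shift(dst)))     # 5f00:0:300::      (DA after Spine5)
```

After Spine5's shift, the packet arrives at Leaf3 with only Leaf3's own uSID remaining, at which point Leaf3 decapsulates and delivers the inner RoCEv2 packet.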
The flow for the traffic from GPU2 to GPU3 leverages the path Leaf2, Spine6, Leaf4. It does so by using the uSID Network Program 5f00:0:0200:0600:0400::.¶
While in this example we have used the uN instruction, it can also be encoded using uA instructions specifying the sequence of interfaces.¶
At some point during the execution of the AI job, Spine5 experiences congestion. NIC1 learns about the congestion of Spine5.¶
Within microseconds, without any fabric signaling or new state at intermediate devices, NIC1 steers the traffic onto a different path through the fabric. NIC1 switches the path from <Leaf1, Spine5, Leaf3> to <Leaf1, Spine6, Leaf3>. This is done simply by encapsulating any new traffic of the flow GPU1->GPU3 with the IPv6 DA 5f00:0:0100:0600:0300::.¶
Note that the change of path is instantaneous. There is no routing protocol or control plane notification to the network devices to change the path. The fabric is entirely stateless, and the packet path is encoded into the IPv6 header built by the source NIC. This is essential as AI workloads cannot be exposed to slow reconvergence.¶
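The NIC-local reaction can be sketched as follows. All names are hypothetical, and this document does not specify how the NIC learns of congestion; the point is that switching paths reduces to selecting a different precomputed carrier as the IPv6 DA, with no fabric interaction.

```python
# Hypothetical per-destination path table pre-installed by the AI
# scheduler; carriers match the walk-through (via Spine5 / via Spine6).
PATHS = {
    ("GPU1", "GPU3"): ("5f00:0:100:500:300::",   # Leaf1 -> Spine5 -> Leaf3
                       "5f00:0:100:600:300::"),  # Leaf1 -> Spine6 -> Leaf3
}

def pick_carrier(src, dst, congested=None):
    """Return the first precomputed carrier that avoids the congested uSID."""
    for carrier in PATHS[(src, dst)]:
        if congested is None or congested not in carrier:
            return carrier
    return PATHS[(src, dst)][0]   # fall back to the primary path

print(pick_carrier("GPU1", "GPU3"))                    # primary via Spine5
print(pick_carrier("GPU1", "GPU3", congested=":500:")) # shifts to Spine6
```

Since every candidate carrier was instantiated at cluster build-up, the switch touches only the NIC's encapsulation state.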
AI workloads are deployed across thousands of GPUs in multi-tier Clos networks, requiring a networking architecture that scales efficiently. SRv6 uSID (NEXT-CSID) ensures deterministic path placement while maintaining scalability through the following mechanisms:¶
The deployment model described in this document is secured by leveraging the mechanisms defined in [RFC8986].¶
The authors would like to recognize the work of Lihua Yuan, Guohan Lu, Rita Hui, and Riff Jiang at Microsoft.¶
Rita Hui presented this use-case at MPLS & SRv6 World Congress in March 2025. A recording is available here: https://www.segment-routing.net/conferences/Paris25-Microsoft-Rita-Hui/¶
The authors would like to acknowledge the work of the developers who have enabled this use-case in the open-source [SONiC] implementation. In particular: Carmine Scarpitta, Abhishek Dosi, Changrong Wu, Kumaresh Perumal, Eddie Ruan, and Yuqing Zhao.¶