Internet-Draft SRv6 for Deterministic Path Placement in April 2025
Filsfils, et al. Expires 6 October 2025
Workgroup:
SPRING
Published:
Intended Status:
Informational
Expires:
Authors:
C. Filsfils
Cisco Systems
P. Camarillo, Ed.
Cisco Systems
A. Abdelsalam
Cisco Systems

SRv6 for Deterministic Path Placement in AI Backends

Abstract

This document describes the use of SRv6 to enable deterministic path placement in AI backends, optimizing load balancing and congestion control for predictable GPU workloads.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 6 October 2025.

1. Introduction

Hyperscale AI training clusters rely on massive GPU-to-GPU data exchanges, where synchronization delays caused by congestion and packet loss directly impact model convergence time and operational costs.

These workloads generate large, predictable flows that require ultra-low latency, high bandwidth, and precise congestion control to maintain efficiency. Traditional networking approaches, such as ECMP-based per-flow load balancing, suffer from poor entropy due to the limited number of RoCEv2 flows, leading to fabric hotspots, congestion, and slow reconvergence after failures.
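
As a rough illustration of the entropy problem, the sketch below computes the probability that a small number of flows all land on distinct equal-cost links. It assumes an idealized ECMP hash that places each flow uniformly and independently. The function name and the flow and link counts are illustrative and are not taken from any deployment.

```python
from math import perm


def p_no_collision(flows: int, links: int) -> float:
    """Probability that `flows` flows hash onto distinct links.

    Assumes each flow picks one of `links` equal-cost links uniformly and
    independently (an idealized hash, not a model of any specific switch).
    """
    if flows > links:
        return 0.0
    # Ordered collision-free placements divided by all possible placements.
    return perm(links, flows) / links ** flows


if __name__ == "__main__":
    for flows, links in [(2, 4), (4, 4), (8, 16)]:
        p = p_no_collision(flows, links)
        print(f"{flows} flows over {links} links: "
              f"P(at least one shared link) = {1 - p:.3f}")
```

Under these assumptions, four flows spread over four links share at least one link in more than 90% of placements, which is the kind of hotspot that per-flow hashing cannot rule out.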

SRv6 uSID (NEXT-CSID) provides the ability to steer traffic in the fabric, allowing the NIC (i.e., SmartNIC, DPU) to perform deterministic path placement of ROCEv2 traffic. This ensures predictable performance, fine-grained traffic control, and real-time adaptation to congestion in a stateless manner.

Future revisions of this draft will cover additional use cases (multi-path transport, stateless interaction between an AI/LLM tenant leasing cluster infrastructure and the operator managing the cluster, etc.).

The document draft-filsfils-srv6-dc-frontend-wan explains how SRv6 uSID (NEXT-CSID) is applied to a converged DC Frontend and WAN fabric.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2. Terminology

SRv6
Segment Routing over IPv6 [RFC8986].
uSID

Micro-segment. Formally defined as NEXT-CSID in [I-D.ietf-spring-srv6-srh-compression].

The term uSID (micro SID) predates the formal naming and has been widely adopted across the industry, including by operators with large-scale deployments, vendors, and open-source implementations, and is used consistently in multi-vendor interoperability reports.

To maintain alignment with the formal specification while also acknowledging the widespread and practical use of the term, this document uses uSID and NEXT-CSID interchangeably.

ECMP
Equal-Cost Multi-Path
uN
The uN is a short notation for the End behavior with NEXT-CSID, PSP, and USD flavors as defined in [I-D.ietf-spring-srv6-srh-compression].
uA
The uA local behavior is a short notation for the End.X behavior with NEXT-CSID, PSP, and USD flavors [I-D.ietf-spring-srv6-srh-compression].
ROCEv2
RDMA over Converged Ethernet version 2 [IBTA-ROCEv2].
NIC
Network Interface Card, a hardware component that connects a computer to a network.
SmartNIC
A Network Interface Card with embedded processing capabilities, designed to offload network and storage tasks from the host CPU.
DPU
Data Processing Unit, a specialized processor designed to offload and accelerate data-centric tasks, often used in network and storage functions.
GPU
Graphics Processing Unit, a processor designed for rendering graphics and performing parallel computation tasks, commonly used for AI and machine learning workloads.

3. AI Traffic Characteristics and Challenges

AI workloads exhibit highly structured traffic patterns:

4. SRv6 for Deterministic Path Placement

SRv6 enables the NIC to directly control the AI workload traffic journey through the fabric by encoding an ordered list of segments in the packet header.

5. Illustration

The following figure depicts a typical 2-tier Clos topology.

          Spine5                      Spine6
            |                           |
   +--------+----+--------------+-------|-----+
   |             |              |       |     |
   |   +---------|---+----------|---+---+-----|----+
   |   |         |   |          |   |         |    |
+--+------+   +--+------+    +--+------+   +--+------+
|  Leaf1  |   |  Leaf2  |    |  Leaf3  |   |  Leaf4  |
+----+----+\ /+----+----+    +----+----+\ /+----+----+
     |      X      |              |      X      |
     |     / \     |              |     / \     |
     |    /   \    |              |    /   \    |
+----+----+   +----+----+    +----+----+   +----+----+
|  DPU1   |   |  DPU2   |    |  DPU3   |   |  DPU4   |
|    |    |   |    |    |    |    |    |   |    |    |
|  GPU1   |   |  GPU2   |    |  GPU3   |   |  GPU4   |
+---------+   +---------+    +---------+   +---------+
Figure 1: Reference Topology

The topology consists of two Spine devices. Each of the Spines is connected to four Leaf devices.

There are four NICs (shown as DPUs in the figure), each connected through a host interface (e.g., PCIe) to a GPU. In this example, each NIC is dual-homed to two Leaf devices.

5.1. SRv6 Fabric Provisioning

At day-0 cluster build-up (fabric bring-up), the topology is provisioned with SRv6 SIDs on the Spine and Leaf devices. These SIDs are statically configured and thus independent of any dynamic routing protocol state. The following is provisioned:

  • SRv6 SID space in the fabric: 5f00:0::/32
  • Leaf1 instantiates the SID 5f00:0:0100::/48 associated with the uN instruction (End with NEXT-CSID, PSP & USD)
  • Leaf2 instantiates the SID 5f00:0:0200::/48 associated with the uN instruction (End with NEXT-CSID, PSP & USD)
  • Leaf3 instantiates the SID 5f00:0:0300::/48 associated with the uN instruction (End with NEXT-CSID, PSP & USD)
  • Leaf4 instantiates the SID 5f00:0:0400::/48 associated with the uN instruction (End with NEXT-CSID, PSP & USD)
  • Spine5 instantiates the SID 5f00:0:0500::/48 associated with the uN instruction (End with NEXT-CSID, PSP & USD)
  • Spine6 instantiates the SID 5f00:0:0600::/48 associated with the uN instruction (End with NEXT-CSID, PSP & USD)
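
The addressing plan above can be reproduced with a few lines of code. The following is a minimal sketch, assuming a 32-bit uSID block and 16-bit uSIDs (which is what the /32 and /48 prefixes imply). The NODE_IDS table and the un_sid helper are hypothetical names introduced only for illustration.

```python
import ipaddress

# Assumption: 32-bit uSID block and 16-bit uSIDs, matching the /32 block
# and /48 SIDs in the provisioning list.
USID_BLOCK = ipaddress.IPv6Network("5f00:0::/32")
USID_BITS = 16

# Hypothetical node-ID plan mirroring the example values.
NODE_IDS = {
    "Leaf1": 0x0100, "Leaf2": 0x0200, "Leaf3": 0x0300, "Leaf4": 0x0400,
    "Spine5": 0x0500, "Spine6": 0x0600,
}


def un_sid(node_id: int) -> ipaddress.IPv6Network:
    """Return the uN SID prefix: the block bits followed by the node uSID."""
    prefix_len = USID_BLOCK.prefixlen + USID_BITS
    shift = 128 - prefix_len
    address = int(USID_BLOCK.network_address) | (node_id << shift)
    return ipaddress.IPv6Network((address, prefix_len))


if __name__ == "__main__":
    for name, node_id in NODE_IDS.items():
        # Python prints the compressed form, e.g. without leading zeros.
        print(f"{name}: {un_sid(node_id)}")
```

Because every SID is derived from a static node identifier, the table can be generated once at build-up and does not depend on any routing protocol state.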

5.2. SRv6-Based Deterministic Path Selection

An AI job is orchestrated in the fabric. As a result of the orchestration and the collective communication pattern, GPU1 and GPU2 must periodically send traffic to GPU3.

Based on the network topology, the AI orchestration computes paths that achieve homogeneous utilization of the fabric and avoid congestion:

  • GPU1->GPU3: via Leaf1, Spine5, Leaf3
  • GPU2->GPU3: via Leaf2, Spine6, Leaf4
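
This document does not specify how the orchestration computes these paths. Purely as an illustration, the sketch below uses a greedy heuristic that places each flow on the candidate path whose nodes currently carry the fewest flows. The dual-homing table is read off Figure 1, and the place function and its node-count load model are assumptions, not the method used by any orchestrator.

```python
from collections import Counter
from itertools import product

# Dual-homing as drawn in Figure 1 (assumption: read off the figure).
NIC_LEAVES = {
    "GPU1": ["Leaf1", "Leaf2"], "GPU2": ["Leaf1", "Leaf2"],
    "GPU3": ["Leaf3", "Leaf4"], "GPU4": ["Leaf3", "Leaf4"],
}
SPINES = ["Spine5", "Spine6"]


def place(flows):
    """Greedily map each (src, dst) flow to the least-loaded path.

    Load is the number of flows already placed on each fabric node; a real
    system would more likely track per-link bandwidth.
    """
    load = Counter()
    placement = {}
    for src, dst in flows:
        candidates = product(NIC_LEAVES[src], SPINES, NIC_LEAVES[dst])
        # min() keeps the first candidate among equally loaded paths.
        best = min(candidates, key=lambda path: sum(load[n] for n in path))
        placement[(src, dst)] = best
        load.update(best)
    return placement


if __name__ == "__main__":
    result = place([("GPU1", "GPU3"), ("GPU2", "GPU3")])
    for (src, dst), path in result.items():
        print(f"{src}->{dst}: via {', '.join(path)}")
```

With the two flows of this example, the heuristic yields the same node-disjoint paths as listed above.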

Upon AI job computation (at GPU synchronization time):

  • NIC1: creates a ROCEv2 packet that must be sent to NIC3. NIC1 encapsulates the ROCEv2 packet with an outer IPv6 Header (H.Encaps.Red behavior).

    • IPv6 DA: 5f00:0:0100:0500:0300::
    • The packet has no SRH.
  • Leaf1:

    • Packet in: (IPv6. DA=5f00:0:0100:0500:0300::)(ROCEv2)
    • Leaf1 has the SID 5f00:0:0100::/48 instantiated with the End with NEXT-CSID, PSP & USD behavior. As a result, it shifts the uSID list, performs a lookup, and forwards the packet.
    • Packet out: (IPv6. DA=5f00:0:0500:0300::)(ROCEv2)
  • Spine5:

    • Packet in: (IPv6. DA=5f00:0:0500:0300::)(ROCEv2)
    • Spine5 has the SID 5f00:0:0500::/48 instantiated with the End with NEXT-CSID, PSP & USD behavior. As a result, it shifts the uSID list, performs a lookup, and forwards the packet.
    • Packet out: (IPv6. DA=5f00:0:0300::)(ROCEv2)
  • Leaf3:

    • Packet in: (IPv6. DA=5f00:0:0300::)(ROCEv2)
    • Leaf3 has the SID 5f00:0:0300::/48 instantiated with the End with NEXT-CSID, PSP & USD behavior. As a result, it removes the outer IPv6 header and forwards the inner packet.
    • Packet out: (ROCEv2)
  • NIC3: receives the ROCEv2 packet, processes it, and passes the data to GPU3.

Note that Leaf1, Spine5, and Leaf3 do not hold any state for this specific flow. It is a single uSID instruction per node instantiated upon cluster build-up and reused by all flows.
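
The hop-by-hop rewriting of the destination address in the walk-through can be modeled as follows. This is a simplified sketch of the shift step only, assuming a 32-bit block and 16-bit uSIDs. It ignores the SRH, arguments, and the FIB lookup, and the un_shift and trace names are hypothetical.

```python
import ipaddress

BLOCK_BITS = 32  # assumption: uSID block length, as in 5f00:0::/32
USID_BITS = 16   # assumption: uSID length, as in the /48 SIDs
REST_BITS = 128 - BLOCK_BITS
REST_MASK = (1 << REST_BITS) - 1


def un_shift(da: ipaddress.IPv6Address) -> ipaddress.IPv6Address:
    """Consume the active uSID: keep the block, shift the rest left."""
    value = int(da)
    block = value & ~REST_MASK
    rest = ((value & REST_MASK) << USID_BITS) & REST_MASK
    return ipaddress.IPv6Address(block | rest)


def trace(da_text: str) -> list:
    """Return the DA seen by each node until the last uSID is reached."""
    da = ipaddress.IPv6Address(da_text)
    hops = [da]
    while True:
        shifted = un_shift(da)
        if (int(shifted) & REST_MASK) == 0:
            # No uSID follows the active one: this node decapsulates.
            break
        da = shifted
        hops.append(da)
    return hops


if __name__ == "__main__":
    for node, da in zip(["Leaf1", "Spine5", "Leaf3"],
                        trace("5f00:0:0100:0500:0300::")):
        print(f"{node} receives DA={da}")
```

Each node applies the same stateless operation to whatever destination address arrives, which is why no per-flow state is needed in the fabric.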

The flow for the traffic from GPU2 to GPU3 leverages the path Leaf2, Spine6, Leaf4. It does so by using the uSID Network Program 5f00:0:0200:0600:0400:: .
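
On the sending side, the uSID Network Program is simply the block followed by the uSIDs of the nodes on the path. The following sketch shows one way a NIC could assemble it, assuming 16-bit uSIDs and the block used in this example. The usid_carrier function is a hypothetical helper, not part of any NIC API.

```python
import ipaddress


def usid_carrier(block: str, usids: list, usid_bits: int = 16) -> ipaddress.IPv6Address:
    """Pack an ordered list of uSIDs after the block into one IPv6 DA."""
    net = ipaddress.IPv6Network(block)
    free_bits = 128 - net.prefixlen
    if len(usids) * usid_bits > free_bits:
        raise ValueError("path does not fit in a single uSID carrier")
    value = int(net.network_address)
    shift = free_bits
    for usid in usids:
        if not 0 < usid < (1 << usid_bits):
            raise ValueError(f"invalid uSID value: {usid:#x}")
        shift -= usid_bits
        value |= usid << shift
    return ipaddress.IPv6Address(value)


if __name__ == "__main__":
    # GPU2->GPU3 via Leaf2, Spine6, Leaf4
    print(usid_carrier("5f00:0::/32", [0x0200, 0x0600, 0x0400]))
```

With a 32-bit block and 16-bit uSIDs, up to six uSIDs fit in one destination address, so the three-hop paths of this topology need no SRH.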

While this example uses the uN instruction, the path can also be encoded using uA instructions that specify the sequence of interfaces.

5.3. Adaptive Routing with congestion feedback

Suppose that, at some point during the execution of the AI job, Spine5 experiences congestion and NIC1 learns about it.

Within microseconds, without any fabric signaling or new state at intermediate devices, NIC1 steers the traffic onto a different path through the fabric. NIC1 switches the path from <Leaf1, Spine5, Leaf3> to <Leaf1, Spine6, Leaf3>. This is done simply by encapsulating any new traffic of the flow GPU1->GPU3 with the IPv6 DA 5f00:0:0100:0600:0300:: .

Note that the change of path is instantaneous. There is no routing protocol or control plane notification to the network devices to change the path. The fabric is entirely stateless, and the packet path is encoded into the IPv6 header built by the source NIC. This is essential as AI workloads cannot be exposed to slow reconvergence.
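
The source-side decision can be sketched as below. How NIC1 learns about congestion is not specified in this document, so the sketch abstracts the signal as a set of congested node names. The USIDS table, the candidate list, and the select_path and carrier helpers are hypothetical and assume the 32-bit block and 16-bit uSIDs of this example.

```python
import ipaddress

# uSID values from the provisioning example.
USIDS = {
    "Leaf1": 0x0100, "Leaf2": 0x0200, "Leaf3": 0x0300, "Leaf4": 0x0400,
    "Spine5": 0x0500, "Spine6": 0x0600,
}
BLOCK = int(ipaddress.IPv6Address("5f00::"))  # uSID block 5f00:0::/32


def carrier(path) -> ipaddress.IPv6Address:
    """Encode a node path as the outer IPv6 DA (16-bit uSIDs after the block)."""
    value, shift = BLOCK, 96
    for node in path:
        shift -= 16
        value |= USIDS[node] << shift
    return ipaddress.IPv6Address(value)


def select_path(candidates, congested):
    """Pick the first candidate that avoids every congested node."""
    for path in candidates:
        if not congested.intersection(path):
            return path
    return candidates[0]  # nothing better is known; keep the preferred path


if __name__ == "__main__":
    gpu1_to_gpu3 = [("Leaf1", "Spine5", "Leaf3"), ("Leaf1", "Spine6", "Leaf3")]
    print(carrier(select_path(gpu1_to_gpu3, set())))        # normal operation
    print(carrier(select_path(gpu1_to_gpu3, {"Spine5"})))   # Spine5 congested
```

The only thing that changes between the two cases is the destination address written by the NIC; no device in the fabric is reconfigured.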

6. Benefits

7. Hyperscale

AI workloads are deployed across thousands of GPUs in multi-tier Clos networks, requiring a networking architecture that scales efficiently. SRv6 uSID (NEXT-CSID) ensures deterministic path placement while maintaining scalability through the following mechanisms:

8. Security Considerations

The deployment model described in this document is secured by leveraging the mechanisms defined in [RFC8986].

9. Acknowledgements

The authors would like to recognize the work of Lihua Yuan, Guohan Lu, Rita Hui, and Riff Jiang at Microsoft.

Rita Hui presented this use-case at MPLS & SRv6 World Congress in March 2025. A recording is available here: https://www.segment-routing.net/conferences/Paris25-Microsoft-Rita-Hui/

The authors would like to acknowledge the work of the developers who have enabled this use-case in the open-source [SONiC] implementation. In particular: Carmine Scarpitta, Abhishek Dosi, Changrong Wu, Kumaresh Perumal, Eddie Ruan, and Yuqing Zhao.

10. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/info/rfc8174>.
[RFC8986]
Filsfils, C., Ed., Camarillo, P., Ed., Leddy, J., Voyer, D., Matsushima, S., and Z. Li, "Segment Routing over IPv6 (SRv6) Network Programming", RFC 8986, DOI 10.17487/RFC8986, <https://www.rfc-editor.org/info/rfc8986>.
[I-D.ietf-spring-srv6-srh-compression]
Cheng, W., Filsfils, C., Li, Z., Decraene, B., and F. Clad, "Compressed SRv6 Segment List Encoding (CSID)", Work in Progress, Internet-Draft, draft-ietf-spring-srv6-srh-compression-23, <https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression-23>.

11. Informative References

[IBTA-ROCEv2]
InfiniBand Trade Association, "InfiniBand Architecture Specification Volume 1, Release 1.2.1, Annex A17: ROCEv2", <https://web.archive.org/web/20200917012109/https://cw.infinibandta.org/document/dl/7781>.
[SONiC]
Linux Foundation, "SONiC", <https://sonicfoundation.dev/>.

Authors' Addresses

Clarence Filsfils
Cisco Systems
Belgium
Pablo Camarillo (editor)
Cisco Systems
Spain
Ahmed Abdelsalam
Cisco Systems
Italy