| Internet-Draft | Lightspeed Notification Protocol | March 2026 |
| Camarillo, et al. | Expires 3 September 2026 | [Page] |
This document defines the Lightspeed Notification Protocol (LSN), a hardware-accelerated signaling mechanism designed for sub-100 microsecond network convergence in AI/ML data center fabrics. By operating entirely within the forwarding plane, LSN bypasses traditional CPU-based latencies to propagate link failures and congestion via a hardware-efficient encoding. It serves as a high-speed complement to routing protocols like BGP, providing an immediate hardware "veto" to prune congested/failed paths while maintaining control-plane stability for path recovery.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 3 September 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Artificial Intelligence (AI) and Machine Learning (ML) workloads impose stringent demands on data center fabrics, characterized by high-bandwidth, synchronized and bursty collective communication patterns. As outlined in [draft-clad-rtgwg-ipfrr-aiml], these environments require network convergence times in the sub-100 microsecond range to avoid performance degradation caused by packet loss and jitter.¶
Traditional routing protocols and convergence mechanisms rely on control-plane intervention, where link state changes are processed by the CPU. This introduces activation latencies in the tens of milliseconds, which is orders of magnitude too slow for AI/ML requirements. Furthermore, existing mechanisms typically operate on a binary "up/down" model, failing to account for capacity degradation or congestion.¶
This document defines a Hardware-Accelerated Notification Protocol designed to operate entirely within the forwarding plane to handle both network failures and congestion events. This protocol serves as a complementary mechanism to traditional routing protocols (e.g., BGP). The interaction between the routing protocol and this notification mechanism is functionally equivalent to a boolean AND operation: a path is eligible for forwarding only if permitted by the routing protocol AND confirmed healthy by the hardware notification.¶
This enforces an asymmetric reaction: "bad news" (failure or congestion) acts as an immediate hardware veto, disabling the path in sub-microseconds. Conversely, "good news" (recovery) is gated by the routing protocol, ensuring the control plane has fully converged before traffic resumes. This approach combines sub-microsecond protection with routing coherence/stability.¶
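The asymmetric reaction described above can be sketched in software. This is a minimal illustrative model, not the hardware implementation; the class and method names (PathState, is_forwardable) are invented for this sketch.

```python
# Sketch of the asymmetric "bad news fast, good news gated" logic.
# Names (PathState, is_forwardable) are illustrative, not from the draft.

class PathState:
    def __init__(self):
        self.routing_ok = False   # set/cleared by the routing protocol (e.g., BGP)
        self.hw_healthy = True    # cleared instantly by an LSN notification

    def is_forwardable(self) -> bool:
        # A path is eligible only if BOTH planes agree (boolean AND).
        return self.routing_ok and self.hw_healthy

path = PathState()
path.routing_ok = True
assert path.is_forwardable()

# "Bad news": the hardware veto takes effect immediately.
path.hw_healthy = False
assert not path.is_forwardable()

# "Good news": clearing the veto alone is not enough; the routing
# protocol must re-confirm the path before traffic resumes.
path.hw_healthy = True
path.routing_ok = False          # control plane not yet reconverged
assert not path.is_forwardable()
```

Note that recovery requires both conditions to become true again, which is what keeps the fast hardware path from outrunning control-plane convergence.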
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Current IP Fast Reroute (IPFRR) and resiliency mechanisms face fundamental limitations when applied to the extreme performance requirements of AI/ML fabrics.¶
Latency of CPU-Based Processing¶
Existing failure detection and propagation mechanisms generally trigger interrupts that must be processed by a line-card or system CPU. As noted in [draft-clad-rtgwg-ipfrr-aiml], this CPU-mediated path introduces an activation delay typically in the range of 10-50 milliseconds. For AI training workloads requiring barrier synchronization, this latency results in unacceptable flow disruption. To achieve the required sub-100 microsecond convergence, the propagation and processing of network state changes must occur with sub-microsecond latency, necessitating a solution implemented directly in hardware (NPU) logic without CPU intervention.¶
Inability to Handle Congestion and Partial Failures¶
Standard routing protocols and mechanisms like ECMP operate on a binary failure model, in which a link is either available or unavailable. However, AI fabrics frequently experience "brownout" scenarios, such as partial capacity reduction (e.g., loss of a single lane in a port group) or acute congestion. Current mechanisms lack the granularity to signal these states, causing traffic to be blackholed or hashed onto congested paths.¶
Limited Visibility of Remote Network State¶
In multi-stage Clos topologies and irregular scale-across networks, a failure or congestion event often occurs multiple hops away from the ingress point. Local protection mechanisms (like classic LFA) implemented at the point of failure are often insufficient or lead to suboptimal "hairpin" routing that increases latency. Current routing protocols do not provide ingress nodes with the real-time visibility into remote link states or congestion levels required to prevent traffic from entering compromised paths or to optimize load balancing across the fabric.¶
The packet header is the following:¶
     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                 MAC Destination Address (DA)                  |
    +-------------------------------+-------------------------------+
    |      MAC DA (continued)       |    MAC Source Address (SA)    |
    +-------------------------------+-------------------------------+
    |                MAC Source Address (continued)                 |
    +-------------------------------+-------------------------------+
    |        EtherType (TBD1)       |         OpCode (TBD2)         |
    +-------------------------------+-------------------------------+
    |                                                               |
    .                          LSN Payload                          .
    .                                                               .
    |                                                               |
    +---------------------------------------------------------------+
    |                           CRC (FCS)                           |
    +---------------------------------------------------------------+
Notes:¶
The LSN Payload is defined as follows:¶
     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | Type  |R|Msg| Rsv |  Dev-Range  |                             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
    |                                                               |
    |                   Reachable-Devices-Bitmap                    |
    |                          (256 bits)                           |
    |                                                               |
    +                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Field Descriptions:¶
Msg-type (2 bits): Indicates the information conveyed by the bitmap, which may represent either reachability or congestion status. Congestion status is expressed as three congestion levels, whose thresholds are implementation specific. The following Msg-Types are defined:¶
Reachable-Devices-Range (6 bits): Determines the window of device IDs covered by the reachable-devices bitmap. The range per packet is [Reachable-Devices-Range*256, Reachable-Devices-Range*256 + 255].¶
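The field layout above can be sketched as a decoder for the payload prefix. Only the Msg-type (2 bits) and Reachable-Devices-Range (6 bits) widths are specified by this document; the Type (4 bits), R (1 bit), and Rsv (3 bits) widths below are assumptions chosen so the fields pack into the first 16 bits of the payload.

```python
# Hedged sketch of decoding the leading 16 bits of the LSN payload.
# Field widths other than Msg (2 bits) and Dev-Range (6 bits) are
# assumed, not taken from the draft.

def decode_lsn_prefix(hdr: int) -> dict:
    """Decode a 16-bit LSN payload prefix (most significant bit first)."""
    return {
        "type":      (hdr >> 12) & 0xF,   # 4 bits (assumed width)
        "r":         (hdr >> 11) & 0x1,   # 1 bit  (assumed width)
        "msg_type":  (hdr >> 9)  & 0x3,   # 2 bits: reachability / congestion
        "rsv":       (hdr >> 6)  & 0x7,   # 3 bits (assumed width)
        "dev_range": hdr         & 0x3F,  # 6 bits: bitmap window selector
    }

fields = decode_lsn_prefix(0x0041)       # dev_range = 1
lo = fields["dev_range"] * 256           # 256
hi = lo + 255                            # 511: bitmap covers IDs 256..511
```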
This section defines the procedures for assigning node identifiers, generating notification messages, and processing received notifications in hardware.¶
To facilitate efficient bitmap-based signaling, every leaf node within the fabric is assigned a unique Global Node ID. This identifier maps the node to a specific bit position within the notification bitmap.¶
The allocation, synchronization, and lifecycle management of these Global Node IDs are performed by an external central controller or management plane. The specific mechanisms for ID allocation are outside the scope of this document.¶
Each node is responsible for periodically generating notification messages that convey the direct reachability or congestion state of potential destinations to the rest of the fabric.¶
Message Scope and Fragmentation¶
A single notification message supports a payload covering up to 256 destination devices.¶
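Because each message covers at most 256 devices, a fabric with more destinations must emit one message per Dev-Range window. The helper below is a hypothetical illustration of that fragmentation, not a specified procedure.

```python
# Illustrative sketch: splitting fabric-wide reachability state into
# per-window messages of 256 devices each, selected by Dev-Range.
# The function name and state representation are hypothetical.

def fragment_state(reachable, max_id):
    """Yield (dev_range, 256-bit bitmap) pairs covering IDs 0..max_id."""
    for dev_range in range((max_id // 256) + 1):
        bitmap = 0
        for offset in range(256):
            node_id = dev_range * 256 + offset
            if reachable.get(node_id, False):
                bitmap |= 1 << offset   # bit position = Global Node ID offset
        yield dev_range, bitmap

# A fabric of 300 leaves needs two messages: windows 0 and 1.
state = {i: True for i in range(300)}
msgs = list(fragment_state(state, max_id=299))
# msgs[0] covers IDs 0..255; msgs[1] covers IDs 256..511
```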
MSG-Type and State Encoding¶
The interpretation of the bitmap payload is determined by the MSG-Type field in the header. The specific assignment of MSG-Type values is implementation-specific.¶
Reachability Information: When the MSG-Type indicates reachability, the bitmap encodes the binary status of the links:¶
As mentioned earlier, messages are generated periodically. However, if there is a significant change in the quality of a path (a congestion level change or an interface status change), a notification message is also triggered immediately.¶
Upon receipt of a notification message, the receiving node MUST process the packet directly in the forwarding hardware without punting to the control plane CPU.¶
State Aggregation¶
The receiving node maintains a global state table representing the health of the fabric. When a message is received:¶
Forwarding Decision Logic¶
To derive the final forwarding decision, the hardware performs a bitwise operation between the standard routing table and the aggregated notification state.¶
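The bitwise combination above can be sketched as follows. The bitmap variables are illustrative; in hardware this AND would be applied per next hop over the aggregated notification state.

```python
# Minimal sketch of the forwarding decision: AND the routing
# protocol's per-destination eligibility with the aggregated LSN
# state for a given next hop. Variable names are illustrative.

FABRIC_SIZE = 256

# Bit i == 1: the route to destination i via this next hop is
# installed by the routing protocol (e.g., BGP).
routing_bitmap = (1 << FABRIC_SIZE) - 1              # all destinations routed

# Bit i == 1: LSN currently reports destination i reachable
# via this next hop.
lsn_bitmap = ((1 << FABRIC_SIZE) - 1) & ~(1 << 5)    # destination 5 vetoed

# A destination is forwardable only when both planes agree.
forwardable = routing_bitmap & lsn_bitmap

assert not (forwardable >> 5) & 1    # path to destination 5 pruned
assert (forwardable >> 4) & 1        # other destinations unaffected
```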
This section illustrates the operation of the Hardware-Accelerated Notification Protocol within a massive-scale 2-tier Spine-Leaf Data Center topology. This example focuses exclusively on reachability signaling to demonstrate the bitwise logic used to prune invalid paths.¶
Consider a fully connected Clos fabric consisting of 256 Spine switches and 256 Leaf switches.¶
A physical link failure occurs between Spine_A and the destination Leaf_5.¶
Detection and Generation: Spine_A detects that the port connected to Leaf_5 is down. The hardware at Spine_A immediately generates a notification packet to be transmitted to all connected leaves (including L_Ingress).¶
Bitmap from Spine_A: [1, 1, 1, 1, 1, 0, 1, ... 1]
                                     ^
                         Index 5 (Leaf_5) set to 0
Hardware Processing at Ingress: The Ingress Leaf (L_Ingress) receives the notification from Spine_A. The forwarding engine performs a logical AND operation between the Routing Table state and the received Notification state specifically for the path via Spine_A.¶
This mechanism ensures sub-microsecond protection. When the link Spine_A-Leaf_5 is eventually restored, Spine_A will send a notification message with an updated bitmap in which index 5 is set to 1. However, traffic will not resume via Spine_A until BGP also re-installs the route, satisfying the 1 AND 1 condition.¶
Future revisions of this document will document how the mechanism defined here can be used for DCI/regional networks.¶
The Lightspeed Notification Protocol (LSN) is designed to achieve sub-100 microsecond convergence directly in the forwarding hardware. To achieve this performance, LSN messages do not include cryptographic authentication or integrity checks. Consequently, the protocol introduces several security considerations that must be mitigated by the deployment architecture.¶
Spoofing and Denial of Service (DoS): Because LSN messages are unauthenticated, an attacker capable of injecting forged LSN frames into the fabric could broadcast false congestion or failure notifications. This would cause receiving nodes to prune valid paths from their forwarding tables, potentially leading to forced congestion, suboptimal routing, or a complete denial of service.¶
Mitigation via Boundary Filtering: LSN is intended exclusively for internal use within highly controlled, single-domain AI/ML data center fabrics. To prevent spoofing, boundary nodes and leaf switches MUST implement strict port-level filtering. LSN notification packets MUST be dropped unconditionally on any port facing a server, host, or external network. LSN processing MUST only be enabled on trusted, switch-to-switch infrastructure links.¶
Fail-Safe Routing Logic: As defined in this document, LSN acts only as a hardware "veto" (a logical AND operation with the routing protocol). While an attacker can maliciously prune a path, they cannot force the network to forward traffic into a blackhole or an invalid topology segment, because reachability requires the routing protocol (e.g., BGP) to also authorize the path. The control plane remains the ultimate source of truth for topology validation.¶
This document requests the following allocations.¶
Upon assignment of these values, the RFC Editor is requested to replace all instances of "TBD" for the Ethertype and Opcode with the newly allocated hexadecimal values.¶
The authors would like to acknowledge the following people: Kris Michielsen.¶