Congestion Notification for Pause

IP-based VPNs are often used to interconnect Data Center Networks (DCN), in which case the IP-based VPN is also referred to as an IP WAN. In the DCN, Priority-based Flow Control (PFC) is a widely deployed mechanism for congestion control. However, PFC is an L2 pause mechanism and is not suitable for deployment in an IP WAN, so an L3 pause mechanism is needed there. This document describes the necessity and feasibility of introducing a congestion notification for pause mechanism. Specifically, the problem statement is given in Sections 1 and 3, the format of the congestion notification message sent from the Provider Edge (PE) node to the Provider (P) and/or PE node is defined in Section 4, and the way the PE node learns the addresses of the destination P and/or PE nodes is defined in Sections 4 and 5.

CE: Customer Edge

DC: Data Center

DCN: Data Center Networks

DoS: Denial-of-Service

IPC: IP Pause Capability

LSA: Link State Advertisement

P: Provider

PE: Provider Edge

PFC: Priority-based Flow Control

RI: Router Information

SRH: Segment Routing Header

SRv6: Segment Routing over IPv6

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals, as shown here.

As the congestion notification for pause mechanism used in the DCN, PFC applies classical stepwise back pressure using a dedicated Ethernet pause frame, as shown in Figure 1.

Figure 1: Traffic Sender ====> Network Node 1 ====> Network Node 2 ====> Network Node 3 (Congestion Point) ====> Traffic Receiver

With this congestion notification mechanism, the congested network node (Network Node 3 in Figure 1) asks the directly connected upstream network node (Network Node 2 in Figure 1) to pause the data traffic by sending a dedicated Ethernet pause frame called a PFC frame. The upstream network node may then ask its own directly connected upstream network node to pause the data traffic by a PFC frame, and so on stepwise, until the most upstream network node (Network Node 1 in Figure 1) may ask the directly connected traffic sender to pause the data traffic by a PFC frame. The PFC specification details how this kind of congestion notification mechanism works.

In the IP WAN for DC interconnect, the congestion notification mechanism triggered by the PFC frames from the destination DC gateway is referred to as back pressure with dedicated IP pause packet, as shown in Figure 2.

Figure 2: DC1 Gateway ====> PE1 ====> P1 ====> PE2 ====> DC2 Gateway (Congestion Point)

With this congestion notification mechanism, on detecting congestion, the congested egress Customer Edge (CE) node (DC2 Gateway in Figure 2) asks the directly connected upstream egress PE node (PE2 in Figure 2) to pause the data traffic by sending PFC frames. In response to the PFC frames from the congested DC2, the egress PE generates IP flow pause packets corresponding to the IP flows that cause the congestion in DC2. The egress PE then asks the upstream P node (P1 in Figure 2) and/or the upstream ingress PE node (PE1 in Figure 2) to pause (buffer) the data traffic of those IP flows by sending the IP flow pause packets. Finally, the ingress PE node may ask the directly connected upstream ingress CE node (DC1 Gateway in Figure 2) to pause the data traffic by sending PFC pause frames. Note that the upstream P node and/or the upstream ingress PE node receiving the IP flow pause packets must be on the forwarding path of the IP flows and must have the buffering capability for the IP flows causing congestion. This document details how this kind of congestion notification mechanism works.

On receiving the L2 pause frames from the destination DC gateway, the egress PE node needs to determine which IP flows cause the congestion. How the egress PE node figures out the IP flows causing congestion is implementation specific and outside the scope of this document. For each IP flow causing congestion, the egress PE node needs to identify the ingress PE node and the P nodes traversed by the IP flow and send a congestion notification for pause message to each identified P/PE node. Different WAN technologies offer different ways for the egress PE node to identify the on-path PE and P nodes. When Segment Routing over IPv6 (SRv6) is deployed in the WAN, the egress PE node can use the Segment Routing Header (SRH) to identify the on-path PE and P nodes. When native IPv6 is deployed in the WAN, the egress PE node can only use the source IP address to identify the ingress PE node. The congestion notification for pause message sent from the egress PE node to the identified on-path PE and P nodes can be a UDP message or an ICMP message. A UDP message is formatted as follows:

UDP Header: The standard UDP header, which includes the UDP source port, UDP destination port, UDP length, and UDP checksum. A well-known UDP destination port (TBD1) needs to be allocated for this Congestion Notification Message.

IP Flow Identifier: When SRv6 is deployed in the WAN, the IP Flow Identifier includes the source IP address and the SRH. When native IPv6 is deployed in the WAN, the IP Flow Identifier includes the source IP address, destination IP address, and protocol number.

Pause Time: This field can be either copied from the PFC Pause frames received from the DC gateway, or calculated based on the buffer size of the destination node as advertised by the IGP.
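The field descriptions above can be illustrated with a minimal Python sketch for the native IPv6 case. The octet layout is an assumption made only for illustration: this text does not give field widths, so the 16-octet addresses, the 1-octet protocol number, the padding octet, and the 16-bit Pause Time (mirroring the 16-bit timer carried in PFC frames) are all hypothetical, as are the function names.

```python
import ipaddress
import struct

# Hypothetical payload layout (NOT defined by this document):
# 16-octet source address, 16-octet destination address,
# 1-octet protocol number, 1 padding octet, 2-octet Pause Time.
_NATIVE_IPV6_FMT = "!16s16sBxH"


def build_pause_payload(src, dst, protocol, pause_time):
    """Pack an IP Flow Identifier and a Pause Time into a UDP payload."""
    if not 0 <= pause_time <= 0xFFFF:
        raise ValueError("pause_time must fit in 16 bits")
    return struct.pack(
        _NATIVE_IPV6_FMT,
        ipaddress.IPv6Address(src).packed,
        ipaddress.IPv6Address(dst).packed,
        protocol,
        pause_time,
    )


def parse_pause_payload(data):
    """Unpack a payload produced by build_pause_payload()."""
    src, dst, protocol, pause_time = struct.unpack(_NATIVE_IPV6_FMT, data)
    return (
        str(ipaddress.IPv6Address(src)),
        str(ipaddress.IPv6Address(dst)),
        protocol,
        pause_time,
    )
```

An implementation would send such a payload to the TBD1 destination port once it is allocated. The SRv6 case, where the flow identifier carries the SRH, would need a variable-length encoding and is not sketched here.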

Not all WAN routers support buffering IP flows. Therefore, before the egress PE node can send the congestion notification for pause message to the on-path PE and P nodes, it has to know which on-path P/PE nodes support buffering IP flows. The on-path P/PE nodes can notify the egress PE node of their support for buffering IP flows by advertising their IP Pause Capability (IPC) in advance.
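The target selection described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not a specified procedure: the function names are hypothetical, SRH segments are treated as if they directly identified nodes (a real implementation would need to map SIDs to nodes), and the IPC advertisements are modeled as a plain dictionary from node to advertised buffer size in KB.

```python
def identify_on_path_nodes(source_address, srh_segments=None, local_node=None):
    """List candidate on-path nodes for a congested flow, ingress PE first.

    With native IPv6 only the source address (ingress PE) is known.
    With SRv6 the segments carried in the SRH are added as candidates.
    """
    candidates = [source_address]
    if srh_segments is not None:
        candidates.extend(srh_segments)
    nodes = []
    for node in candidates:
        # Skip the egress PE itself and any repeated entries.
        if node != local_node and node not in nodes:
            nodes.append(node)
    return nodes


def select_pause_targets(on_path_nodes, ipc_buffer_kb):
    """Keep only the on-path nodes that advertised IP Pause Capability."""
    return [node for node in on_path_nodes if node in ipc_buffer_kb]
```

For example, with an ingress PE at 2001:db8::1, a P node at 2001:db8::2, and only the P node advertising IPC, `select_pause_targets` returns just the P node, so the ingress PE would not be sent a pause message.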

The PE and P nodes advertise their support for buffering IP flows by inserting a new IPC sub-TLV into the IS-IS Router Capability TLV. This sub-TLV SHOULD only be advertised once in the Router Capability TLV. This sub-TLV SHOULD be advertised WAN domain wide. The fields of the IP Pause Capability sub-TLV are described below.

where:

Type: TBD2.

Length: Variable, in octets, depending on the sub-sub-TLVs.

The only supported sub-sub-TLV is the Buffer Size Sub-Sub-TLV. The Buffer Size advertised in the Buffer Size Sub-Sub-TLV represents the supported maximum IP flows' buffering space. Only a single Buffer Size Sub-Sub-TLV MAY be advertised in the IP Pause Capability Sub-TLV. If more than one Buffer Size Sub-Sub-TLV is present, all the Buffer Size Sub-Sub-TLVs MUST be ignored. The fields of the Buffer Size Sub-Sub-TLV are described below.

where:

Type: 1.

Length: This field MUST be set to 4.

Buffer Size: This field indicates the maximum IP flows' buffering space supported by the advertising node. The unit for this field is kilobytes (KB).
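The encoding and the duplicate-handling rule above can be sketched in Python. The sketch assumes the usual IS-IS layout of a 1-octet Type and a 1-octet Length, and assumes that Buffer Size is an unsigned 32-bit integer in network byte order. The function names are hypothetical, the caller supplies a placeholder for the unassigned TBD2 value, and treating a wrong Length as "no usable value" is an assumption, since the receiver behavior for that case is not stated here.

```python
import struct

BUFFER_SIZE_TYPE = 1    # Buffer Size Sub-Sub-TLV type from this document
BUFFER_SIZE_LENGTH = 4  # Length MUST be set to 4


def encode_ipc_subtlv(ipc_type, buffer_size_kb=None):
    """Encode an IPC sub-TLV; ipc_type is a placeholder for TBD2."""
    body = b""
    if buffer_size_kb is not None:
        body = struct.pack(
            "!BBI", BUFFER_SIZE_TYPE, BUFFER_SIZE_LENGTH, buffer_size_kb
        )
    return struct.pack("!BB", ipc_type, len(body)) + body


def parse_buffer_size_kb(ipc_value):
    """Return the advertised buffer size in KB, or None.

    None is returned when no Buffer Size Sub-Sub-TLV is present, when the
    value is truncated or has a wrong length (an assumption), or when more
    than one is present (all of them MUST then be ignored).
    """
    sizes = []
    offset = 0
    while offset + 2 <= len(ipc_value):
        tlv_type, length = struct.unpack_from("!BB", ipc_value, offset)
        value = ipc_value[offset + 2 : offset + 2 + length]
        if len(value) < length:
            return None  # truncated sub-sub-TLV
        if tlv_type == BUFFER_SIZE_TYPE:
            if length != BUFFER_SIZE_LENGTH:
                return None
            sizes.append(struct.unpack("!I", value)[0])
        offset += 2 + length
    return sizes[0] if len(sizes) == 1 else None
```

Unknown sub-sub-TLV types are skipped rather than rejected, so that later registry additions would not break this parser.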

The PE and P nodes advertise their support for buffering IP flows by advertising a new IPC TLV in the OSPF Router Information (RI) Opaque Link State Advertisement (LSA). This TLV is applicable to both OSPFv2 and OSPFv3. This TLV SHOULD only be advertised once in the RI Opaque LSA. This TLV SHOULD be advertised WAN domain wide. The fields of the IP Pause Capability TLV are described below.

where:

Type: TBD3.

Length: Variable, in octets, depending on the sub-TLVs.

The only supported sub-TLV is the Buffer Size Sub-TLV. The Buffer Size advertised in the Buffer Size Sub-TLV represents the supported maximum IP flows' buffering space. Only a single Buffer Size Sub-TLV MAY be advertised in the IP Pause Capability TLV. If more than one Buffer Size Sub-TLV is present, all the Buffer Size Sub-TLVs MUST be ignored. The Buffer Size Sub-TLV is structured as shown in .
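For comparison with the IS-IS encoding, an OSPF encoding might look like the Python sketch below. It relies on the general OSPF RI TLV layout (2-octet Type, 2-octet Length, value padded to a 4-octet boundary with the padding excluded from Length). The Buffer Size Sub-TLV layout is an assumption here: type 1 is taken from the IANA section of this document, while the 4-octet unsigned value simply mirrors the IS-IS definition. The function names and the TBD3 placeholder passed by the caller are hypothetical.

```python
import struct

OSPF_BUFFER_SIZE_SUBTLV_TYPE = 1  # from the registry requested by this document


def encode_ospf_tlv(tlv_type, value):
    """Encode a TLV with 16-bit Type and Length, padded to 4 octets."""
    padding = (-len(value)) % 4
    return struct.pack("!HH", tlv_type, len(value)) + value + b"\x00" * padding


def encode_ospf_ipc_tlv(ipc_type, buffer_size_kb):
    """Encode an IPC TLV carrying one Buffer Size Sub-TLV.

    ipc_type stands in for TBD3; the 4-octet buffer size is an assumption.
    """
    sub_tlv = encode_ospf_tlv(
        OSPF_BUFFER_SIZE_SUBTLV_TYPE, struct.pack("!I", buffer_size_kb)
    )
    return encode_ospf_tlv(ipc_type, sub_tlv)
```

Nesting the sub-TLV through the same helper keeps the Length of the outer TLV consistent with the encoded size of its contents.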

The congestion notification for pause sent from the PE node that receives PFC frames to the P/PE nodes MUST be applied in a specific controlled domain. A limited administrative domain provides the network administrator with the means to select, monitor, and control the access to the network, making it a trusted domain. To avoid potential Denial-of-Service (DoS) attacks, it is RECOMMENDED that implementations apply rate-limiting policies when generating and receiving congestion notification for pause messages. A deployment MUST ensure that border filtering drops inbound congestion notification for pause messages arriving from outside the domain and drops outbound congestion notification for pause messages leaving the domain. A deployment MUST support a configuration option to enable or disable the congestion notification for pause feature defined in this document. By default, the congestion notification for pause feature MUST be disabled.
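The rate-limiting recommendation and the disabled-by-default requirement can be illustrated with a token bucket, which is one common rate-limiting technique and is chosen here only as an example. The class name, parameters, and the injectable clock are hypothetical and not part of this document.

```python
import time


class PauseNotificationPolicy:
    """Gate pause messages behind an enable flag and a token bucket."""

    def __init__(self, rate_per_sec, burst, enabled=False, clock=time.monotonic):
        self.enabled = enabled  # the feature MUST be disabled by default
        self.rate = rate_per_sec
        self.burst = burst
        self._tokens = float(burst)
        self._clock = clock
        self._last = clock()

    def allow(self):
        """Return True if one more message may be generated or accepted."""
        if not self.enabled:
            return False
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self._last
        self._tokens = min(self.burst, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

The same policy object could be consulted both before sending and on receipt, with separate instances per direction so that a flood of received messages does not starve locally generated ones.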

This document requests the following allocations from IANA: A well-known UDP port number TBD1 from the System Ports range of the "Service Name and Transport Protocol Port Number" registry is requested to be assigned to the Congestion Notification for Pause Message. Specifically, IANA is requested to assign a UDP port as shown below, for which the Assignee and Contact are the IESG and the IETF Chair, respectively.

   Service Name: Congestion Notification for Pause
   Port Number: TBD1
   Transport Protocol: udp
   Description: Receiver Port for Congestion Notification for Pause
   Reference: Section 4 of THIS_DOCUMENT

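This document requests IANA to make the following registration in the "IS-IS Sub-TLVs for IS-IS Router CAPABILITY TLV" registry:

   +-------+---------------------+---------------+
   | Value | Description         | Reference     |
   +-------+---------------------+---------------+
   | TBD2  | IP Pause Capability | This document |
   +-------+---------------------+---------------+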

IANA is requested to create the "IS-IS Sub-Sub-TLVs for IP Pause Capability Sub-TLV" registry under the "IS-IS TLV Codepoints" grouping for the assignment of sub-sub-TLV types for the IP Pause Capability sub-TLV specified in this document. This registry defines sub-sub-TLVs for the IP Pause Capability sub-TLV (TBD2) advertised in the IS-IS Router CAPABILITY TLV (242). The registration procedure is "Expert Review". Guidance for the designated experts is provided in . The Buffer Size sub-sub-TLV is defined by this document, and the initial contents of the registry are as follows:

   +-------+-------------+---------------+
   | Value | Description | Reference     |
   +-------+-------------+---------------+
   | 0     | Reserved    | This document |
   | 1     | Buffer Size | This document |
   | 2-255 | Unassigned  |               |
   +-------+-------------+---------------+

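This document requests IANA to make the following registration in the "OSPF Router Information (RI) TLVs" registry:

   +-------+---------------------+---------------+
   | Value | Description         | Reference     |
   +-------+---------------------+---------------+
   | TBD3  | IP Pause Capability | This document |
   +-------+---------------------+---------------+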

IANA is requested to create the "OSPF IP Pause Parameter Sub-TLVs" registry under the "Open Shortest Path First (OSPF) Parameters" grouping. This registry defines sub-TLVs for the IP Pause Capability TLV (TBD3). Values in the range 1-34999 are to be allocated using the "Standards Action" registration procedure, and values in the range 35000-65499 are to be allocated using the "First Come First Served" registration procedure. The Buffer Size sub-TLV is defined by this document, and the initial contents of the registry are as follows:

   +-------------+--------------+---------------+
   | Value       | Description  | Reference     |
   +-------------+--------------+---------------+
   | 0           | Reserved     | This document |
   | 1           | Buffer Size  | This document |
   | 2-65499     | Unassigned   |               |
   | 65500-65534 | Experimental | This document |
   | 65535       | Reserved     | This document |
   +-------------+--------------+---------------+

The authors would like to acknowledge Xiangyang Zhu and Yao Liu for the very helpful discussion.