<?xml version="1.0" encoding="iso-8859-1" ?>
<?rfc toc="yes" ?>
<?rfc symrefs="yes" ?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes" ?>
<?rfc subcompact="no" ?>

<rfc category="std" ipr="trust200902" docName="draft-xiao-rtgwg-congestion-notification-for-pause-01" consensus="true" submissionType="IETF">

<front>
        <title abbrev="Congestion Notification for Pause"> Congestion Notification for Pause </title>
 
  <author fullname="Xiao Min" initials="X" surname="Min">
      <organization>ZTE Corp.</organization>
     <address>
       <postal>
         <street/>

         <!-- Reorder these if your country does things differently -->

         <city>Nanjing</city>

         <region/>

         <code/>

         <country>China</country>
       </postal>

       <phone>+86 18061680168</phone>

       <email>xiao.min2@zte.com.cn</email>

       <!-- uri and facsimile elements may also be added -->
     </address>
    </author>
	
  <author fullname="Kan Zhang" initials="K" surname="Zhang">
      <organization>China Mobile</organization>
     <address>
       <postal>
         <street></street>

         <!-- Reorder these if your country does things differently -->

         <city>Beijing</city>

         <region></region>

         <code></code>

         <country>China</country>
       </postal>

       <phone></phone>

       <email>zhangkan@chinamobile.com</email>

       <!-- uri and facsimile elements may also be added -->
     </address>
    </author>

    <date year="2026"/>
  
    <area>Routing</area>
    <workgroup>RTGWG Working Group</workgroup>

    <keyword>Request for Comments</keyword>
    <keyword>RFC</keyword>
    <keyword>Internet Draft</keyword>
    <keyword>I-D</keyword>

    <abstract>
  <t> This document describes the need for, and the feasibility of, a congestion notification mechanism for pause. After receiving L2 
  pause frames from the destination data center gateway, the egress provider edge node sends congestion notifications, in the format defined 
  in this document, to the upstream provider nodes and the ingress provider edge node. Those nodes must pause the forwarding of the IP flows 
  identified by the congestion notifications. The ingress provider edge node may then send L2 pause frames to the source data center gateway. </t>
    </abstract>
    
</front>
  
<middle>

  <section title="Introduction">
  
  <t> IP-based VPN <xref target="RFC2764"/> is often used to interconnect Data Center Networks (DCN), in which case the IP-based VPN is also referred 
  to as an IP WAN. In the DCN, Priority-based Flow Control (PFC) <xref target="IEEE8021Q-2022"/> is a widely deployed mechanism for congestion control. 
  However, PFC is an L2 pause mechanism and is not suitable for deployment in an IP WAN, so an L3 pause mechanism is needed there. </t>
   
  <t> This document describes the need for, and the feasibility of, a congestion notification mechanism for pause. The problem 
  statement is given in Sections 1 and 3. Section 4 defines the format of the congestion notification message that the Provider Edge (PE) node 
  sends to the Provider (P) and/or PE nodes. Sections 4 and 5 define how the PE node learns the addresses of the destination P and/or PE nodes. </t>
  
  </section>
  
  <section title="Conventions Used in This Document">
   
    <section title="Abbreviations">
      <t> CE: Customer Edge</t>
      <t> DC: Data Center</t>
      <t> DCN: Data Center Networks</t>
      <t> DoS: Denial-of-Service</t>
      <t> IPC: IP Pause Capability</t>
      <t> LSA: Link State Advertisement</t>
      <t> P: Provider</t>
      <t> PE: Provider Edge</t>
      <t> PFC: Priority-based Flow Control</t>
      <t> RI: Router Information</t>
      <t> SRH: Segment Routing Header</t>
      <t> SRv6: Segment Routing over IPv6</t>
    </section>
  
    <section title="Requirements Language">  
	  <t> The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", 
	  and "OPTIONAL" in this document are to be interpreted as described in BCP 14  <xref target="RFC2119"/> <xref target="RFC8174"/> when, 
	  and only when, they appear in all capitals, as shown here.</t>	
    </section>
  
  </section>
  
  <section title="Congestion Notification Mechanisms">

  <t>  PFC, the congestion notification for pause mechanism used in the DCN, is referred to as classical stepwise back pressure with a dedicated Ethernet 
  pause frame, as shown in <xref target="Figure_1"/>.</t>
  
  <figure anchor="Figure_1" title="Classical Stepwise Back Pressure with Dedicated Ethernet Pause Frame within DC">
  <artwork align="left"> <![CDATA[
        PFC Frame     PFC Frame      PFC Frame
    |<------------+|<-----------+|<------------+
    |             ||            ||             |
+--------+     +-------+     +-------+     +-------+     +--------+
|Traffic |====>|Network|====>|Network|====>|Network|====>|Traffic |
|Sender  |     |Node 1 |     |Node 2 |     |Node 3 |     |Receiver|
+--------+     +-------+     +-------+     +-------+     +--------+
                                           Congestion
                                           Point
]]>  </artwork>
  </figure>
  
  <t> With this congestion notification mechanism, the congested network node (Network Node 3 in <xref target="Figure_1"/>) asks its directly connected 
  upstream network node (Network Node 2 in <xref target="Figure_1"/>) to pause the data traffic by sending a dedicated Ethernet pause frame called a PFC frame. 
  Each upstream network node may in turn ask its own directly connected upstream network node to pause the data traffic by sending a PFC frame. This continues 
  hop by hop until the most upstream network node (Network Node 1 in <xref target="Figure_1"/>) may ask the directly connected traffic sender to pause the 
  data traffic by sending a PFC frame. <xref target="IEEE8021Q-2022"/> details how this kind of congestion notification mechanism works. </t>
  
  <t> In the IP WAN for DC interconnect, the congestion notification mechanism triggered by the PFC frames from the destination DC gateway is referred to 
  as back pressure with dedicated IP pause packet, as shown in <xref target="Figure_2"/>. </t>
  
  <figure anchor="Figure_2" title="Back Pressure with Dedicated IP Pause Packet within WAN">
  <artwork align="left"> <![CDATA[
                      Congestion Notification
                   |<--------------------------+
                   |   Congestion Notification |
        PFC Frame  |             |<-----------+|   PFC Frame
     |<-----------+|             |            |||<-----------+
     |            ||             |            |||            |
+--------+     +-------+     +-------+     +-------+     +--------+
|DC1     |====>|  PE1  |====>|  P1   |====>|  PE2  |====>|DC2     |
|Gateway |     |       |     |       |     |       |     |Gateway |
+--------+     +-------+     +-------+     +-------+     +--------+
                                                         Congestion
                                                         Point
]]>  </artwork>
  </figure>
  
  <t> With this congestion notification mechanism, the congested egress Customer Edge (CE) node (DC2 gateway in <xref target="Figure_2"/>), on detecting 
  congestion, asks the directly connected upstream egress PE node (PE2 in <xref target="Figure_2"/>) to pause the data traffic by sending PFC frames. In response 
  to the PFC frames from DC2, the egress PE generates IP flow pause packets for the IP flows that cause the congestion in DC2. The egress PE then sends 
  the IP flow pause packets to the upstream P node (P1 in <xref target="Figure_2"/>) and/or the upstream ingress PE node (PE1 in <xref target="Figure_2"/>), 
  asking them to pause (buffer) the data traffic of those IP flows. Finally, the ingress PE node may ask the directly connected upstream ingress CE node 
  (DC1 gateway in <xref target="Figure_2"/>) to pause the data traffic by sending PFC frames. Note that the upstream P node and/or the upstream ingress PE 
  node receiving the IP flow pause packets must be on the forwarding path of the IP flows and must have the buffering capability for the IP flows causing 
  congestion. This document details how this kind of congestion notification mechanism works. </t>
  
  </section>
  
  <section title="Congestion Notification for Pause Packet Format">
  
  <t> Upon receiving L2 pause frames from the destination DC gateway, the egress PE node needs to determine which IP flows cause the congestion. 
  How the egress PE node determines the IP flows causing congestion is implementation specific and outside the scope of this document. For each IP flow 
  causing congestion, the egress PE node needs to identify the ingress PE node and the P nodes traversed by the IP flow and send a congestion notification 
  for pause message to each identified P/PE node. The way the egress PE node identifies the on-path PE and P nodes depends on the WAN technology. 
  When Segment Routing over IPv6 (SRv6) <xref target="RFC8754"/> is deployed in the WAN, the egress PE node can use the Segment 
  Routing Header (SRH) to identify the on-path PE and P nodes. When native IPv6 is deployed in the WAN, the egress PE node can only use the source IP address 
  to identify the ingress PE node. </t>
  
  <t> The congestion notification for pause message sent from the egress PE node to the identified on-path PE and P nodes can be a UDP message or an ICMP 
  message. When it is a UDP message, it is formatted as shown in <xref target="Figure_3"/>. </t>
  
  <figure anchor="Figure_3" title="Congestion Notification for Pause Message Format">
  <artwork align="left"> <![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|        UDP Source Port        |  UDP Destination Port = TBD1  |
+-------------------------------+-------------------------------+
|           UDP Length          |          UDP Checksum         |
+-------------------------------+-------------------------------+
|                                                               |
~                IP Flow Identifier + Pause Time                ~
|                                                               |
+---------------------------------------------------------------+
|           As much of the invoking packet as possible          |
+            without the UDP packet exceeding 576 bytes         +
|                 in IPv4 or 1280 bytes in IPv6                 |
]]>  </artwork>
  </figure>
  
  <t> UDP Header: The UDP header as specified in <xref target="RFC768"/> includes the UDP source port, UDP destination port, UDP length, and UDP checksum. 
  A well-known UDP destination port (TBD1) needs to be allocated for this Congestion Notification Message. </t>
  
  <t> IP Flow Identifier: When SRv6 is deployed in the WAN, the IP Flow Identifier includes the source IP address and the SRH. When native IPv6 is deployed 
  in the WAN, the IP Flow Identifier includes the source IP address, destination IP address, and protocol number. </t>
  
  <t> Pause Time: This field can be either copied from the PFC pause frames received from the DC gateway, or calculated based on the buffer size of the 
  destination node as advertised by the IGP. </t>
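
  <t> As a non-normative illustration of the size limit shown in <xref target="Figure_3"/>, the following Python sketch computes how many octets of the 
  invoking packet can be carried. It assumes, beyond what this document specifies, that the 576-byte and 1280-byte limits cover the whole IP datagram, 
  that the IP header carries no options or extension headers, and that the length of the IP Flow Identifier and Pause Time field is known to the sender. 
  All function and constant names are hypothetical. </t>

  <figure>
  <artwork align="left"> <![CDATA[
```python
# Non-normative sketch. Assumptions (not defined by this document):
# the size limit covers the whole IP datagram, the IP header has no
# options or extension headers, and flow_field_len is the length in
# octets of the "IP Flow Identifier + Pause Time" field.
IP_HEADER_LEN = {4: 20, 6: 40}
UDP_HEADER_LEN = 8
MAX_DATAGRAM = {4: 576, 6: 1280}


def invoking_packet_budget(ip_version: int,
                           flow_field_len: int) -> int:
    """Return how many octets of the invoking packet still fit."""
    if ip_version not in MAX_DATAGRAM:
        raise ValueError("ip_version must be 4 or 6")
    used = (IP_HEADER_LEN[ip_version] + UDP_HEADER_LEN
            + flow_field_len)
    return max(0, MAX_DATAGRAM[ip_version] - used)


def truncate_invoking_packet(packet: bytes, ip_version: int,
                             flow_field_len: int) -> bytes:
    """Keep as much of the invoking packet as the budget allows."""
    budget = invoking_packet_budget(ip_version, flow_field_len)
    return packet[:budget]
```
]]>  </artwork>
  </figure>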
  
  </section>
  
  <section title="Advertising IP Pause Capability Using IGP">

  <t> Not all WAN routers support buffering IP flows. Before the egress PE node can send the congestion notification for pause message to the 
  on-path PE and P nodes, it therefore has to know which on-path P/PE nodes support buffering IP flows. The on-path P/PE nodes can notify the egress PE 
  node of their support for buffering IP flows by advertising their IP Pause Capability (IPC) in advance. </t>
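
  <t> The selection step described above might be sketched as follows. This is a non-normative illustration only: the list of on-path nodes and the 
  table of nodes that advertised IPC are hypothetical inputs, assumed to have been derived already from the packet (SRH or source address) and from 
  the IGP link-state database respectively, and the addresses come from the IPv6 documentation prefix. </t>

  <figure>
  <artwork align="left"> <![CDATA[
```python
# Non-normative sketch. on_path_nodes and ipc_database are
# hypothetical inputs: the first lists the on-path P/PE node
# addresses, the second maps each node that advertised the IP Pause
# Capability to its advertised Buffer Size in KB.
def select_pause_targets(on_path_nodes, ipc_database):
    """Return (node, buffer_size_kb) for on-path nodes with IPC."""
    return [(node, ipc_database[node])
            for node in on_path_nodes
            if node in ipc_database]


targets = select_pause_targets(
    ["2001:db8::1", "2001:db8::2", "2001:db8::3"],
    {"2001:db8::1": 4096, "2001:db8::3": 1024},
)
# 2001:db8::2 did not advertise IPC, so it is not notified.
```
]]>  </artwork>
  </figure>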

  <section title="Advertising IP Pause Capability Using IS-IS">
  
  <t> The PE and P nodes advertise their support for buffering IP flows by inserting a new IPC sub-TLV into the IS-IS Router CAPABILITY TLV <xref target="RFC7981"/>. 
  This sub-TLV SHOULD be advertised only once in the Router CAPABILITY TLV. This sub-TLV SHOULD be advertised throughout the WAN domain. The IP Pause Capability sub-TLV 
  is structured as shown in <xref target="Figure_4"/>. </t>
  
  <figure anchor="Figure_4" title="IP Pause Capability Sub-TLV">
  <artwork align="center"> <![CDATA[
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Type = TBD2  |     Length    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Sub-Sub-TLVs (variable)                    |
   +-                                                             -+
   |                                                               |
   +                                                               +
]]>  </artwork>
  </figure>
  
     <t> where:
		   <list>
		   <t> Type: TBD2.</t>
		   <t> Length: Variable, in octets, depending on the sub-sub-TLVs.</t>
		   </list>
	 </t>
  
  <t> The only supported sub-sub-TLV is the Buffer Size Sub-Sub-TLV. The Buffer Size advertised in the Buffer Size Sub-Sub-TLV represents the maximum buffering 
  space that the advertising node supports for IP flows. At most one Buffer Size Sub-Sub-TLV MAY be advertised in the IP Pause Capability Sub-TLV. If more than one 
  Buffer Size Sub-Sub-TLV is present, all the Buffer Size Sub-Sub-TLVs MUST be ignored. The Buffer Size Sub-Sub-TLV is structured as shown in <xref target="Figure_5"/>. </t>
  
  <figure anchor="Figure_5" title="Buffer Size Sub-Sub-TLV">
  <artwork align="center"> <![CDATA[
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Type = 1    |     Length    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Buffer Size                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]>  </artwork>
  </figure>
  
     <t> where:
		   <list>
		   <t> Type: 1.</t>
		   <t> Length: This field MUST be set to 4.</t>
		   <t> Buffer Size: This field indicates the maximum buffering space for IP flows supported by the advertising node. The unit for this field is kilobytes (KB).</t>
		   </list>
	 </t>
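
  <t> As a non-normative illustration of the layouts in <xref target="Figure_4"/> and <xref target="Figure_5"/>, the following Python sketch encodes an IP Pause 
  Capability sub-TLV carrying one Buffer Size sub-sub-TLV and decodes its value, applying the rule that duplicate Buffer Size sub-sub-TLVs are all ignored. 
  TBD2 is not yet assigned, so the sub-TLV type is supplied by the caller. The function names are hypothetical, and skipping a Buffer Size sub-sub-TLV whose 
  Length is not 4 is an assumption of this sketch. </t>

  <figure>
  <artwork align="left"> <![CDATA[
```python
import struct

# Non-normative sketch. TBD2 is unassigned, so the caller supplies
# the sub-TLV type. Type and Length are one octet each in IS-IS.
BUFFER_SIZE_TYPE = 1


def encode_isis_ipc(sub_tlv_type: int, buffer_size_kb: int) -> bytes:
    """Build an IPC sub-TLV with one Buffer Size sub-sub-TLV."""
    sub_sub_tlv = struct.pack("!BBI", BUFFER_SIZE_TYPE, 4,
                              buffer_size_kb)
    header = struct.pack("!BB", sub_tlv_type, len(sub_sub_tlv))
    return header + sub_sub_tlv


def decode_isis_ipc_value(value: bytes):
    """Return the Buffer Size in KB, or None if absent or ignored."""
    sizes = []
    offset = 0
    while offset + 2 <= len(value):
        tlv_type, length = struct.unpack_from("!BB", value, offset)
        end = offset + 2 + length
        if end > len(value):
            break  # truncated sub-sub-TLV: stop parsing
        # Assumption: a Buffer Size with Length != 4 is skipped.
        if tlv_type == BUFFER_SIZE_TYPE and length == 4:
            size = struct.unpack_from("!I", value, offset + 2)
            sizes.append(size[0])
        offset = end
    # More than one Buffer Size sub-sub-TLV: all MUST be ignored.
    return sizes[0] if len(sizes) == 1 else None
```
]]>  </artwork>
  </figure>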
  
  </section>
  
  <section title="Advertising IP Pause Capability Using OSPF">
	
  <t> The PE and P nodes advertise their support for buffering IP flows by advertising a new IPC TLV in the OSPF Router Information (RI) Opaque Link State 
  Advertisement (LSA) <xref target="RFC7770"/>. This TLV is applicable to both OSPFv2 and OSPFv3. This TLV SHOULD be advertised only once in the RI Opaque 
  LSA. This TLV SHOULD be advertised throughout the WAN domain. The IP Pause Capability TLV is structured as shown in <xref target="Figure_6"/>. </t>
  
  <figure anchor="Figure_6" title="IP Pause Capability TLV">
  <artwork align="center"> <![CDATA[
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Type = TBD3          |             Length            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sub-TLVs (variable)                    |
   +-                                                             -+
   |                                                               |
   +                                                               +
]]>  </artwork>
  </figure>
  
     <t> where:
		   <list>
		   <t> Type: TBD3.</t>
		   <t> Length: Variable, in octets, depending on the sub-TLVs.</t>
		   </list>
	 </t>
  
  <t> The only supported sub-TLV is the Buffer Size Sub-TLV. The Buffer Size advertised in the Buffer Size Sub-TLV represents the maximum buffering 
  space that the advertising node supports for IP flows. At most one Buffer Size Sub-TLV MAY be advertised in the IP Pause Capability TLV. If more than one 
  Buffer Size Sub-TLV is present, all the Buffer Size Sub-TLVs MUST be ignored. The Buffer Size Sub-TLV is structured as shown in <xref target="Figure_7"/>. </t>
  
  <figure anchor="Figure_7" title="Buffer Size Sub-TLV">
  <artwork align="center"> <![CDATA[
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Type = 1           |             Length            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Buffer Size                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]>  </artwork>
  </figure>
  
     <t> where:
		   <list>
		   <t> Type: 1.</t>
		   <t> Length: This field MUST be set to 4.</t>
		   <t> Buffer Size: This field indicates the maximum buffering space for IP flows supported by the advertising node. The unit for this field is kilobytes (KB).</t>
		   </list>
	 </t>
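
  <t> The OSPF encoding differs from the IS-IS one in that the Type and Length fields are two octets each, as shown in <xref target="Figure_6"/> and 
  <xref target="Figure_7"/>. The following non-normative Python sketch illustrates this. TBD3 is not yet assigned, so the TLV type is supplied by the 
  caller. The function names are hypothetical, and padding sub-TLVs to 4-octet alignment is an assumption of this sketch that has no effect when the 
  Length is 4. </t>

  <figure>
  <artwork align="left"> <![CDATA[
```python
import struct

# Non-normative sketch. TBD3 is unassigned, so the caller supplies
# the TLV type. Type and Length are two octets each in OSPF.
BUFFER_SIZE_TYPE = 1


def encode_ospf_ipc(tlv_type: int, buffer_size_kb: int) -> bytes:
    """Build an IPC TLV carrying one Buffer Size sub-TLV."""
    sub_tlv = struct.pack("!HHI", BUFFER_SIZE_TYPE, 4,
                          buffer_size_kb)
    return struct.pack("!HH", tlv_type, len(sub_tlv)) + sub_tlv


def decode_ospf_ipc_value(value: bytes):
    """Return the Buffer Size in KB, or None if absent or ignored."""
    sizes = []
    offset = 0
    while offset + 4 <= len(value):
        sub_type, length = struct.unpack_from("!HH", value, offset)
        end = offset + 4 + length
        if end > len(value):
            break  # truncated sub-TLV: stop parsing
        if sub_type == BUFFER_SIZE_TYPE and length == 4:
            size = struct.unpack_from("!I", value, offset + 4)
            sizes.append(size[0])
        # Assumption: sub-TLVs are padded to 4-octet alignment.
        offset = end + (-length % 4)
    # More than one Buffer Size sub-TLV: all MUST be ignored.
    return sizes[0] if len(sizes) == 1 else None
```
]]>  </artwork>
  </figure>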
  
  </section>
  
  </section>
  
  <section title="Security Considerations">
  
  <t> The congestion notification for pause mechanism, in which a PE node that receives PFC frames notifies P/PE nodes, MUST be applied in a specific 
  controlled domain. A limited administrative domain provides the network administrator with the means to select, monitor, and control the access to 
  the network, making it a trusted domain.</t>
   
  <t> To avoid potential Denial-of-Service (DoS) attacks, it is RECOMMENDED that implementations apply rate-limiting policies when generating and 
  receiving congestion notification for pause messages.</t>
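
  <t> This document does not mandate a particular rate-limiting algorithm. As one non-normative possibility, the following Python sketch uses a token 
  bucket, a common general-purpose rate limiter. The class name, the rate and burst parameters, and the optional clock argument are all hypothetical, 
  and suitable values are a deployment choice. </t>

  <figure>
  <artwork align="left"> <![CDATA[
```python
import time


class TokenBucket:
    """Non-normative token bucket; rate and burst are assumptions."""

    def __init__(self, rate_per_sec: float, burst: int, now=None):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = float(burst)
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None) -> bool:
        """Return True if one more message may be sent or accepted."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, capped at the burst.
        self.tokens = min(float(self.burst),
                          self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```
]]>  </artwork>
  </figure>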
  
  <t> A deployment MUST ensure that border filtering drops inbound congestion notification for pause messages arriving from outside the domain and 
  drops outbound congestion notification for pause messages leaving the domain.</t>
  
  <t> A deployment MUST support the configuration option to enable or disable the congestion notification for pause feature defined in this document. By 
  default, the congestion notification for pause feature MUST be disabled.</t>
  
  </section>
  
  <section title="IANA Considerations"> 
  
   <section title="A well-known UDP Port">
   
  <t> This document requests the following allocations from IANA: </t>

  <t> A well-known UDP port number TBD1 from the System Ports range of the "Service Name and Transport Protocol Port Number" registry <xref target="RFC6335"/> 
	 is requested to be assigned to the Congestion Notification for Pause Message. Specifically, IANA is requested to assign a UDP port as shown below, for 
	 which the Assignee and Contact are the IESG and the IETF Chair, respectively. </t>

  <texttable title="Service Name and Transport Protocol Port Number Registry">
   <ttcol align='left'>Service Name</ttcol>
   <ttcol align='left'>Port Number</ttcol>
   <ttcol align='left'>Transport Protocol</ttcol>
   <ttcol align='left'>Description</ttcol>
   <ttcol align='left'>Reference</ttcol>
   <c>Congestion Notification for Pause</c><c>TBD1</c><c>udp</c><c>Receiver Port for Congestion Notification for Pause</c><c>Section 4 of THIS_DOCUMENT</c>
  </texttable>

   </section> 
  
   <section title="IS-IS IP Pause Capability Sub-TLV">
   
   <t>This document requests IANA to make the following registration in the "IS-IS Sub-TLVs for IS-IS Router CAPABILITY TLV" registry:</t>
   
   <texttable title="New Sub-TLV in IS-IS Sub-TLVs for IS-IS Router CAPABILITY TLV Registry">
    <ttcol align='left'>Value</ttcol>
    <ttcol align='left'>Description</ttcol>
    <ttcol align='left'>Reference</ttcol>
    <c>TBD2</c><c>IP Pause Capability</c><c>This document</c>
	</texttable>

   </section> 
   
   <section title="IS-IS Sub-Sub-TLVs for the IP Pause Capability Sub-TLV Registry">
   
   <t>IANA is requested to create the "IS-IS Sub-Sub-TLVs for IP Pause Capability Sub-TLV" registry under the "IS-IS TLV Codepoints" grouping for 
   the assignment of sub-sub-TLV types for the IP Pause Capability sub-TLV specified in this document. This registry defines sub-sub-TLVs for 
   the IP Pause Capability sub-TLV (TBD2) advertised in the IS-IS Router CAPABILITY TLV (242).</t>

   <t>The registration procedure is "Expert Review", as defined in <xref target="RFC8126"/>. Guidance for the designated experts is provided in 
   <xref target="RFC7370"/>. The Buffer Size sub-sub-TLV is defined by this document, and the initial contents of the registry are as follows: </t>
   
   <texttable title="IS-IS Sub-Sub-TLVs for IP Pause Capability Sub-TLV Registry">
    <ttcol align='left'>Value</ttcol>
    <ttcol align='left'>Description</ttcol>
    <ttcol align='left'>Reference</ttcol>
    <c>0</c><c>Reserved</c><c>This document</c>
    <c>1</c><c>Buffer Size</c><c>This document</c>
    <c>2-255</c><c>Unassigned</c><c></c>
	</texttable>

   </section> 
   
   <section title="OSPF IP Pause Capability TLV">
   
   <t>This document requests IANA to make the following registration in the "OSPF Router Information (RI) TLVs" registry:</t>
   
   <texttable title="New TLV in OSPF Router Information (RI) TLVs Registry">
    <ttcol align='left'>Value</ttcol>
    <ttcol align='left'>Description</ttcol>
    <ttcol align='left'>Reference</ttcol>
    <c>TBD3</c><c>IP Pause Capability</c><c>This document</c>
	</texttable>

   </section> 
   
   <section title="OSPF IP Pause Parameter Sub-TLVs Registry">
   
   <t>IANA is requested to create the "OSPF IP Pause Parameter Sub-TLVs" registry under the "Open Shortest Path First (OSPF) Parameters" grouping. 
   This registry defines sub-TLVs for the IP Pause Capability TLV (TBD3).</t>

   <t>Values in the range 1-34999 are to be allocated using the "Standards Action" registration procedure defined in <xref target="RFC8126"/>, 
   and values in the range 35000-65499 are to be allocated using the "First Come First Served" registration procedure defined in the same document. 
   The Buffer Size sub-TLV is defined by this document, and the initial contents of the registry are as follows: </t>
   
   <texttable title="OSPF IP Pause Parameter Sub-TLVs Registry">
    <ttcol align='left'>Value</ttcol>
    <ttcol align='left'>Description</ttcol>
    <ttcol align='left'>Reference</ttcol>
    <c>0</c><c>Reserved</c><c>This document</c>
    <c>1</c><c>Buffer Size</c><c>This document</c>
    <c>2-65499</c><c>Unassigned</c><c></c>
    <c>65500-65534</c><c>Experimental</c><c>This document</c>
    <c>65535</c><c>Reserved</c><c>This document</c>
	</texttable>

   </section> 
  
  </section>

  <section title="Acknowledgements">
  <t> The authors would like to acknowledge Xiangyang Zhu and Yao Liu for their very helpful discussions.</t>
  </section>  
  
</middle>
  
<back>
    <references title="Normative References">
     <?rfc include="reference.RFC.2119"?>
     <?rfc include="reference.RFC.8174"?>
     <?rfc include="reference.RFC.768"?>
     <?rfc include="reference.RFC.7770"?>
     <?rfc include="reference.RFC.7981"?>
     <?rfc include="reference.RFC.8126"?>
     <?rfc include="reference.RFC.7370"?>
     <?rfc include="reference.RFC.6335"?>
    </references>
	
    <references title="Informative References">
     <?rfc include="reference.RFC.2764"?>
     <?rfc include="reference.RFC.8754"?>
     <reference anchor="IEEE8021Q-2022" target="https://ieeexplore.ieee.org/document/10004498" quoteTitle="true" derivedAnchor="IEEE8021Q-2022">
        <front>
            <title>IEEE Standard for Local and Metropolitan Area Networks--Bridges and Bridged Networks</title>
            <author>
              <organization showOnFrontPage="true">IEEE</organization>
            </author>
            <date month="December" year="2022"/>
        </front>
          <seriesInfo name="DOI" value="10.1109/IEEESTD.2022.10004498"/>
          <seriesInfo name="IEEE Std" value="802.1Q-2022"/>
     </reference>
    </references>	
</back>

</rfc>
