Internet-Draft | udp-ecn | November 2024 |
Duke | Expires 15 May 2025 | [Page] |
Explicit Congestion Notification (ECN) applies to all transport protocols in principle. However, it had limited deployment for UDP until QUIC became widely adopted. As a result, documentation of UDP socket APIs for ECN on various platforms is sparse. This document records the results of experimenting with these APIs in order to get ECN working on UDP for Chromium on Apple, Linux, and Windows platforms.¶
This note is to be removed before publishing as an RFC.¶
The latest revision of this draft can be found at https://martinduke.github.io/udp-ecn/draft-duke-tsvwg-udp-ecn.html. Status information for this document may be found at https://datatracker.ietf.org/doc/draft-duke-tsvwg-udp-ecn/.¶
Discussion of this document takes place on the Transport and Services Working Group mailing list (mailto:tsvwg@ietf.org), which is archived at https://mailarchive.ietf.org/arch/browse/tsvwg/. Subscribe at https://www.ietf.org/mailman/listinfo/tsvwg/.¶
Source for this draft and an issue tracker can be found at https://github.com/martinduke/udp-ecn.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 15 May 2025.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
[RFC3168] reserves two bits in the IP header for Explicit Congestion Notification (ECN), which provides network feedback to endpoint congestion controllers. Historically, this has mostly been relevant to TCP ([RFC9293]), where the kernel consumes any incoming ECN marks internally, so no application interface is needed beyond enabling and disabling the capability.¶
The Stream Control Transmission Protocol (SCTP) ([RFC9260]) has long supported ECN in its design. SCTP is sometimes carried over DTLS and UDP ([RFC8261]). In principle, user-space implementers might have leveraged UDP ECN APIs to deliver ECN markings between SCTP and the UDP socket. The author is not aware of any such efforts.¶
[RFC6679] defines ECN over RTP over UDP. The author is aware of a research implementation, but cannot confirm any commercial deployments.¶
However, QUIC [RFC9000] runs over UDP and has seen wider deployment than SCTP. The Low Latency, Low Loss, Scalable Throughput (L4S) experiment ([RFC9330]) and QUIC have combined to increase interest in ECN over UDP.¶
The Chromium Projects ([CHROMIUM]) provide a widely-deployed protocol library that includes QUIC. An effort to provide ECN support for QUIC on the many platforms on which Chromium is deployed revealed that many ECN-related UDP socket interfaces are poorly documented.¶
This document provides a record of that experience, to encourage further support for ECN in other QUIC implementations, and indeed any consumer of ECN markings that operates over UDP.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This document is not a general tutorial on UDP socket programming, and assumes familiarity with basic socket concepts like binding, socket options, and common system error codes.¶
Network devices can change the ECN bits in the IP header. Since this feedback is required at the packet sender, the packet receiver needs to extract this codepoint from the UDP socket in order to report to the sender.¶
There are two components to this: setting the socket to report incoming ECN marks, and retrieving the value for each incoming packet.¶
To report ECN, applications set a socket option to true using a setsockopt() call.¶
IPv6 sockets require a socket option of level IPPROTO_IPV6 and name IPV6_RECVTCLASS.¶
IPv4 sockets require a socket option of level IPPROTO_IP and name IP_RECVTOS.¶
For dual-stack sockets, on Linux hosts the application sets both the IPV6_RECVTCLASS and IP_RECVTOS options to receive ECN markings on all incoming packets. On Apple and FreeBSD hosts, the application only sets the IPPROTO_IPV6-level socket option with name IPV6_RECVTCLASS; setting an IPPROTO_IP-level socket option on an IPv6 socket results in an error. In particular this applies to the IPPROTO_IP-level socket option with the name IP_RECVTOS.¶
At the time of writing, an example implementation can be found at [CHROMIUM-POSIX].¶
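The socket options above can be combined into a single helper. The following is a minimal sketch, not taken from Chromium; the function name enable_ecn_reporting() and its family parameter are hypothetical. It assumes the caller knows the address family the socket was created with. On Linux it applies IP_RECVTOS to every IPv6 socket for simplicity, which covers the dual-stack case. The sketch has only been reasoned about for Linux.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel to attach the TOS / Traffic Class
 * byte to every datagram received on fd.
 * family is AF_INET or AF_INET6, matching the socket() call.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt()). */
int enable_ecn_reporting(int fd, int family)
{
    const int on = 1;

    if (family == AF_INET) {
        return setsockopt(fd, IPPROTO_IP, IP_RECVTOS, &on, sizeof(on));
    }

    /* IPv6 socket, possibly dual-stack. */
    if (setsockopt(fd, IPPROTO_IPV6, IPV6_RECVTCLASS, &on, sizeof(on)) != 0) {
        return -1;
    }
#if defined(__linux__)
    /* Linux dual-stack sockets also need the IPv4-level option so that
     * IPv4 packets are reported. Apple and FreeBSD reject IPPROTO_IP
     * options on IPv6 sockets, so the call is compiled out there. */
    if (setsockopt(fd, IPPROTO_IP, IP_RECVTOS, &on, sizeof(on)) != 0) {
        return -1;
    }
#endif
    return 0;
}
```

A caller would typically invoke this once, immediately after creating the socket and before the first read.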
Windows documentation recommends using the function WSASetRecvIPEcn() to enable ECN reporting regardless of the IP version.¶
However, this can also be accomplished by calling setsockopt() and using options of level IPPROTO_IP and name IP_RECVECN for IPv4, and IPPROTO_IPV6 and IPV6_RECVECN for IPv6. The author was unable to identify any online documentation of these options at the time of writing.¶
For dual-stack sockets, WSASetRecvIPEcn() will not enable ECN reporting for IPv4. This requires a separate setsockopt() call using the IP_RECVECN option.¶
If a socket is bound to an IPv4-mapped IPv6 address (i.e., it is of the format ::ffff:<IPv4 address>), calls to WSASetRecvIPEcn() return error EINVAL. These sockets should instead use an explicit setsockopt() call to set IP_RECVECN.¶
At the time of writing, an example implementation can be found at [CHROMIUM-WINDOWS].¶
All platforms described in this document require the use of a recvmsg() call to read data from the socket to retrieve ECN information, because that information is encoded in the control data that is returned from that function. Those platforms all return zero or more control messages ("cmsg") that contain the requested information about the arriving packet.¶
Examples of the technique described below can be found at [CHROMIUM-POSIX] and [CHROMIUM-WINDOWS].¶
If the incoming packet is IPv4, Linux will include a cmsg of level IPPROTO_IP and type IP_TOS.¶
If the incoming packet is IPv6, Linux will include a cmsg of level IPPROTO_IPV6 and type IPV6_TCLASS.¶
The resulting byte of data is the entire Type-of-Service byte from the IPv4 header or the Traffic Class byte from the IPv6 header. The ECN mark constitutes the two least-significant bits of this byte.¶
The same applies to the Linux-specific recvmmsg() call.¶
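The Linux receive path can be sketched as follows. This is an illustrative sketch rather than the Chromium code; ecn_from_control() and recv_with_ecn() are hypothetical names. It assumes the relevant socket options have already been enabled, that the IPv4 cmsg carries a single byte and the IPv6 cmsg carries an int, and that a 256-byte control buffer is large enough for the ancillary data requested.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: scan the ancillary data of a received message and
 * return the two ECN bits (0..3), or -1 if no TOS/TCLASS cmsg is present. */
int ecn_from_control(struct msghdr *msg)
{
    for (struct cmsghdr *c = CMSG_FIRSTHDR(msg); c != NULL;
         c = CMSG_NXTHDR(msg, c)) {
        if (c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_TOS) {
            unsigned char tos;                 /* assumed: one byte */
            memcpy(&tos, CMSG_DATA(c), sizeof(tos));
            return tos & 0x03;                 /* two least-significant bits */
        }
        if (c->cmsg_level == IPPROTO_IPV6 && c->cmsg_type == IPV6_TCLASS) {
            int tclass;                        /* assumed: an int */
            memcpy(&tclass, CMSG_DATA(c), sizeof(tclass));
            return tclass & 0x03;
        }
    }
    return -1;
}

/* Hypothetical wrapper: read one datagram and report its ECN codepoint
 * through *ecn (-1 if the kernel supplied none). Returns recvmsg()'s result. */
ssize_t recv_with_ecn(int fd, void *buf, size_t len, int *ecn)
{
    /* The union keeps the control buffer aligned for struct cmsghdr. */
    union { struct cmsghdr align; char buf[256]; } control;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    ssize_t n = recvmsg(fd, &msg, 0);
    if (n >= 0) {
        *ecn = ecn_from_control(&msg);
    }
    return n;
}
```

With recvmmsg(), the same ecn_from_control() call would be applied to the msg_hdr member of each returned struct mmsghdr.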
On Apple and FreeBSD hosts, if a UDP message (UDP/IPv4) is received on an IPv4 socket, the ancillary data will contain a cmsg of level IPPROTO_IP and type IP_RECVTOS. The cmsg data contains an unsigned char.¶
If a UDP message (UDP/IPv6 or UDP/IPv4) is received on an IPv6 socket, the ancillary data will contain a cmsg of level IPPROTO_IPV6 and type IPV6_TCLASS. The cmsg data contains an int.¶
The provided data is the entire Type-of-Service (TOS) byte from the IPv4 header or the Traffic Class byte from the IPv6 header. The ECN mark constitutes the two least-significant bits of this byte.¶
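For the variant in which IPv4 marks arrive in a cmsg of type IP_RECVTOS carrying an unsigned char, and IPv6 marks arrive in a cmsg of type IPV6_TCLASS carrying an int, a per-cmsg decoder might look like the sketch below. The name ecn_from_recvtos_cmsg() is hypothetical. The length checks are a defensive assumption rather than documented platform behavior, and the sketch has not been run on Apple or FreeBSD hosts.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: decode one cmsg. Returns the ECN codepoint (0..3),
 * or -1 if the cmsg does not carry a TOS / Traffic Class value. */
int ecn_from_recvtos_cmsg(const struct cmsghdr *c)
{
    if (c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_RECVTOS &&
        c->cmsg_len >= CMSG_LEN(sizeof(unsigned char))) {
        unsigned char tos;                     /* payload is an unsigned char */
        memcpy(&tos, CMSG_DATA(c), sizeof(tos));
        return tos & 0x03;
    }
    if (c->cmsg_level == IPPROTO_IPV6 && c->cmsg_type == IPV6_TCLASS &&
        c->cmsg_len >= CMSG_LEN(sizeof(int))) {
        int tclass;                            /* payload is an int */
        memcpy(&tclass, CMSG_DATA(c), sizeof(tclass));
        return tclass & 0x03;
    }
    return -1;
}
```

The helper would be called for each cmsg visited by the usual CMSG_FIRSTHDR() and CMSG_NXTHDR() loop over the control data returned by recvmsg().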
On Windows, if the incoming packet is IPv4, the ancillary data will include a cmsg of level IPPROTO_IP and type IP_ECN.¶
If the incoming packet is IPv6, the socket will include a cmsg of level IPPROTO_IPV6 and type IPV6_ECN.¶
The resulting integer solely consists of the ECN mark, and requires no further bitwise operations.¶
Existing ECN specifications envision a particular connection consistently sending the same ECN marking. It might change that marking after successfully completing a handshake, recognizing that the path or the peer does not support ECN, or moving to a new path. Therefore, using a socket option to configure a consistent marking is generally more resource-efficient than marking each packet individually.¶
However, some server designs receive all incoming packets on a single socket. As the many connections that constitute this packet stream may have different support for ECN, it is suitable to configure outgoing ECN on a per-packet basis.¶
Both Linux and Apple platforms set the outgoing ECN for IPv4 packets with a socket option of level IPPROTO_IP and name IP_TOS.¶
For IPv6 packets, they use level IPPROTO_IPV6 and name IPV6_TCLASS.¶
This setsockopt() call also sets the Differentiated Services Code Point (DSCP) bits that make up the rest of the TOS byte. Applications making this call will generally want to preserve any existing DSCP setting, which might require a getsockopt() call.¶
For dual-stack sockets, we hypothesize that Linux sockets will require an additional setsockopt() call with IP_TOS. Apple sockets will not and will return an error if this call is made. Our experiments did not test this hypothesis.¶
An example of the technique described above can be found at [CHROMIUM-POSIX].¶
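The read-modify-write sequence that preserves DSCP can be sketched as follows. This is a minimal illustration, not the Chromium implementation; the names replace_ecn_bits() and set_outgoing_ecn(), as well as the ECN_* constants, are hypothetical. The codepoint values follow [RFC3168]. The sketch handles a single-family socket only and does not attempt the additional dual-stack call hypothesized above.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* ECN codepoints as defined in RFC 3168 (names are illustrative). */
enum { ECN_NOT_ECT = 0, ECN_ECT1 = 1, ECN_ECT0 = 2, ECN_CE = 3 };

/* Replace the two low-order (ECN) bits of tos, keeping the six DSCP bits. */
int replace_ecn_bits(int tos, int ecn)
{
    return (tos & ~0x03) | (ecn & 0x03);
}

/* Hypothetical helper: set the ECN codepoint for all packets sent on fd
 * while preserving whatever DSCP value is already configured.
 * family is AF_INET or AF_INET6. Returns 0 on success, -1 on failure. */
int set_outgoing_ecn(int fd, int family, int ecn)
{
    int level = (family == AF_INET6) ? IPPROTO_IPV6 : IPPROTO_IP;
    int name = (family == AF_INET6) ? IPV6_TCLASS : IP_TOS;
    int value = 0;
    socklen_t len = sizeof(value);

    /* Read the current byte first so the DSCP bits are not clobbered. */
    if (getsockopt(fd, level, name, &value, &len) != 0) {
        return -1;
    }
    value = replace_ecn_bits(value, ecn);
    return setsockopt(fd, level, name, &value, sizeof(value));
}
```

For example, a sender that has confirmed ECN support on the path might call set_outgoing_ecn(fd, AF_INET, ECN_ECT0) once, and later call it again with ECN_NOT_ECT if ECN validation fails.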
Packets can be individually marked with ECN codepoints using the control information that accompanies a sendmsg() call.¶
These platforms expect a cmsg with level IPPROTO_IP and type IP_TOS if the destination is an IPv4 address or an IPv4-mapped IPv6 address.¶
Otherwise, they expect a cmsg with level IPPROTO_IPV6 and type IPV6_TCLASS.¶
The same applies to the Linux-specific sendmmsg() call.¶
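Per-packet marking can be sketched as below. The names attach_tos_cmsg() and send_with_ecn() are hypothetical, and the caller is assumed to know whether the destination is IPv4 (or IPv4-mapped). The sketch writes the value as an int, which Linux accepts; other platforms may expect a different payload width, which this sketch does not verify. Note that the value supplied is the whole TOS or Traffic Class byte, so the caller is responsible for combining DSCP and ECN bits.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: write a single TOS / Traffic Class cmsg into the
 * control buffer already attached to msg. The buffer must be aligned for
 * struct cmsghdr and at least CMSG_SPACE(sizeof(int)) bytes long.
 * Returns 0 on success, -1 if the buffer is missing or too small. */
int attach_tos_cmsg(struct msghdr *msg, int dest_is_ipv4, int tos)
{
    if (msg->msg_control == NULL ||
        msg->msg_controllen < CMSG_SPACE(sizeof(int))) {
        return -1;
    }
    memset(msg->msg_control, 0, msg->msg_controllen);

    struct cmsghdr *c = CMSG_FIRSTHDR(msg);
    c->cmsg_level = dest_is_ipv4 ? IPPROTO_IP : IPPROTO_IPV6;
    c->cmsg_type = dest_is_ipv4 ? IP_TOS : IPV6_TCLASS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &tos, sizeof(tos));

    /* Tell the kernel exactly how much control data is in use. */
    msg->msg_controllen = CMSG_SPACE(sizeof(int));
    return 0;
}

/* Hypothetical wrapper: send one datagram whose IP header carries tos. */
ssize_t send_with_ecn(int fd, const void *data, size_t len,
                      const struct sockaddr *dest, socklen_t dest_len,
                      int dest_is_ipv4, int tos)
{
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } control;
    struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = (void *)dest;
    msg.msg_namelen = dest_len;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    if (attach_tos_cmsg(&msg, dest_is_ipv4, tos) != 0) {
        return -1;
    }
    return sendmsg(fd, &msg, 0);
}
```

A server that multiplexes many connections over one socket could call send_with_ecn() with a different tos value for each connection, leaving the socket-level default untouched.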
The security implications of ECN are documented in [RFC3168] and [RFC9330]. This document is a guide to enabling these capabilities, which incurs no additional security considerations.¶
Note that implementing ECN capabilities on some platforms, but not others, can help peers identify the operating system in use by a host, which can have privacy implications. This document aims to mitigate that possibility.¶
This document has no IANA actions.¶
The author would like to thank Ryan Hamilton, who provided constant advice through this effort. Randall Meyer from Apple and Nick Grifka from Microsoft provided useful hints about the behavior of their respective operating systems. However, the author takes full responsibility for any errors above.¶
Will Hawkins, Max Inden, Colin Perkins, and Michael Tuexen made improvements to this draft.¶