| Internet-Draft | Stateful-TCP | May 2026 |
| Guo, et al. | Expires 19 November 2026 | [Page] |
This document specifies Stateful-TCP, an experimental sender-side mechanism that accelerates the startup of TCP connections by reusing path-bandwidth information estimated from earlier connections to the same destination. When a usable estimate is available, the sender bypasses Slow-Start and instead enters a paced startup phase whose initial congestion window and pacing rate are derived from the cached estimate. When no usable estimate is available, the sender falls back to standard Slow-Start.¶
Stateful-TCP is sender-only and does not require any change to the TCP receiver or to the wire format. It is orthogonal to the congestion control algorithm in use after the first round-trip time and may therefore be combined with existing TCP variants such as CUBIC. This document also specifies a Gap-Compensated Bandwidth Estimation (GCBE) procedure used to produce the cached per-destination estimate.¶
Stateful-TCP extends the framework of TCP Control Block Interdependence (RFC 9040) by sharing additional per-destination state across connections. The mechanism is published as Experimental to enable independent implementation, evaluation, and review.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 19 November 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
TCP [RFC9293] traditionally begins each new connection with a Slow-Start phase [RFC5681], during which the congestion window (cwnd) starts from a small initial value and grows exponentially as acknowledgments are received. Slow-Start is a conservative bandwidth-probing procedure designed to avoid overwhelming network paths whose capacity is unknown to the sender.¶
On modern high bandwidth-delay-product (BDP) paths, however, the time spent in Slow-Start can dominate the completion time of short and medium-sized flows. Existing mitigations include increasing the initial window [RFC6928], refining the Slow-Start exit condition (HyStart++ [RFC9406]), and Limited Slow-Start [RFC3742]. These approaches improve but do not eliminate the underlying problem: a sender that has no information about the path is forced to ramp its sending rate up gradually, regardless of how much bandwidth is actually available.¶
TCP Control Block (TCB) Interdependence [RFC9040] (which obsoletes [RFC2140]) already permits a sender to share a small set of TCB variables, including smoothed RTT and ssthresh, between connections to the same host. This document specifies an experimental extension to that framework. Specifically, an implementation of Stateful-TCP caches an estimated path bandwidth and a minimum RTT per destination and uses them, when available, to bypass Slow-Start on subsequent connections to that destination.¶
The mechanism has three properties that distinguish it from prior work:¶
The design and an evaluation of Stateful-TCP applied to CUBIC [RFC9438] -- referred to in the cited paper as S-Cubic -- are described in [STATEFUL-TCP]. That paper is the primary source for the motivation, design rationale, and experimental results; this document specifies the mechanism in sufficient detail for independent implementation.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
The following terms and symbols are used throughout this document. Units are given in parentheses where applicable.¶
Stateful-TCP is designed to satisfy the following goals.¶
A Stateful-TCP implementation consists of two components that reside in the TCP sender:¶
Bandwidth estimation during the Estimation phase is performed by GCBE, specified in Section 6.¶
The mechanism extends the framework of [RFC9040]: bw_est and RTTmin are additional, per-destination cached items, used in addition to (not in place of) the items already shared under that framework.¶
A State Cache entry MUST contain at least the following three fields:¶
Implementations MAY store additional fields, such as the last update time (used for expiry) or fields inherited from the TCB Interdependence framework [RFC9040]. Such fields are out of scope of this specification but MUST NOT alter the semantics of the three required fields.¶
Implementations MAY implement the State Cache as a hash table indexed by an H-bit hash of peer_IP, where H is implementation-defined. Implementations SHOULD choose H, the table size, and the hash function so that the expected hash-collision rate is low for the anticipated workload.¶
The choice of hash function is local to the sender and is not visible on the wire.¶
A cache lookup performed at the start of the Startup phase produces exactly one of the following three outcomes:¶
In the Miss and Collision cases, the cache entry SHALL be updated by the new connection upon its termination, subject to the conditions in Section 4.4.¶
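The lookup can be sketched as follows. This is a minimal illustration and not part of the specification: the names `CacheEntry`, `StateCache`, and `bucket_index`, the value of `H_BITS`, and the use of truncated SHA-256 as the hash are all assumptions made for the example. The sketch also assumes that an empty bucket is a Miss, a bucket holding the same peer_IP is a Hit, and a bucket holding a different peer_IP is a Collision.

```python
import hashlib
import ipaddress
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple

H_BITS = 16  # hypothetical value of H; the draft leaves H implementation-defined


class Outcome(Enum):
    HIT = "hit"
    MISS = "miss"
    COLLISION = "collision"


@dataclass
class CacheEntry:
    peer_ip: str     # address of the peer whose connection wrote the entry
    bw_est: float    # estimated path bandwidth (bytes per second)
    rtt_min: float   # minimum RTT (seconds)


def bucket_index(peer_ip: str, h_bits: int = H_BITS) -> int:
    """Map a peer address to an H-bit bucket index (purely sender-local)."""
    packed = ipaddress.ip_address(peer_ip).packed
    digest = hashlib.sha256(packed).digest()
    # Keep the top h_bits of the first 64 bits of the digest.
    return int.from_bytes(digest[:8], "big") >> (64 - h_bits)


class StateCache:
    def __init__(self, h_bits: int = H_BITS) -> None:
        self.h_bits = h_bits
        self.buckets: Dict[int, CacheEntry] = {}  # sparse table: index -> entry

    def lookup(self, peer_ip: str) -> Tuple[Outcome, Optional[CacheEntry]]:
        entry = self.buckets.get(bucket_index(peer_ip, self.h_bits))
        if entry is None:
            return Outcome.MISS, None
        if entry.peer_ip != peer_ip:
            return Outcome.COLLISION, None
        return Outcome.HIT, entry

    def store(self, entry: CacheEntry) -> None:
        # A later write to the same bucket replaces the earlier entry.
        self.buckets[bucket_index(entry.peer_ip, self.h_bits)] = entry
```

Only a Hit returns usable values; Miss and Collision both lead to standard Slow-Start, and the connection may overwrite the bucket when it terminates.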
When a connection terminates (see Section 5.3), an implementation SHALL update the cache entry indexed by the peer_IP of that connection with the {bw_est, RTTmin} pair observed for the connection, except that the cache MUST NOT be updated when:¶
In the first case, retaining the entry is unnecessary because Stateful-TCP would not produce any benefit by setting the initial cwnd to below the default IW. In the latter two cases, the values would be invalid for subsequent connections.¶
Implementations SHOULD associate each cache entry with a maximum age and MUST NOT use an entry whose age exceeds the configured maximum. The maximum age is implementation-defined and SHOULD be configurable.¶
Implementations MUST bound the memory used by the cache. When the bound is reached, an implementation SHOULD evict entries using a least-recently-used (LRU) policy or an equivalent policy that preserves the entries most likely to be useful. This requirement is also a defence against the cache-exhaustion denial-of-service vector discussed in Section 8.¶
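One way to combine the age limit with a memory bound is sketched below. It is an illustration only: the class name `BoundedCache`, its parameters, and the injectable `clock` are hypothetical, and for brevity the table is keyed directly by peer_IP rather than by a hash bucket.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class BoundedCache:
    """Per-destination cache with an LRU size bound and a maximum entry age."""

    def __init__(self, max_entries: int, max_age_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_entries = max_entries
        self.max_age_s = max_age_s
        self.clock = clock
        # peer_ip -> (bw_est, rtt_min, time the entry was written)
        self.entries: "OrderedDict[str, Tuple[float, float, float]]" = OrderedDict()

    def put(self, peer_ip: str, bw_est: float, rtt_min: float) -> None:
        self.entries[peer_ip] = (bw_est, rtt_min, self.clock())
        self.entries.move_to_end(peer_ip)        # mark as most recently used
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)     # evict least recently used

    def get(self, peer_ip: str) -> Optional[Tuple[float, float]]:
        item = self.entries.get(peer_ip)
        if item is None:
            return None
        bw_est, rtt_min, written_at = item
        if self.clock() - written_at > self.max_age_s:
            del self.entries[peer_ip]            # too old: must not be used
            return None
        self.entries.move_to_end(peer_ip)
        return bw_est, rtt_min
```

Because the number of entries never exceeds `max_entries`, a peer that opens connections from many source addresses can displace useful entries but cannot make the cache grow.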
Each connection that uses Stateful-TCP transitions through three phases as illustrated in Figure 1.¶
      +---------------------------------+
      |         3-way handshake         |
      |            completes            |
      +----------------+----------------+
                       |
                       v
      +---------------------------------+
      |          Cache lookup           |
      +---+-------------------------+---+
          | Miss /                  | Hit
          | Collision               |
          v                         v
+-------------------+   +------------------------+
| Standard          |   | Startup phase (Hit)    |
| Slow-Start        |   | cwnd <- max(bw_est *   |
| per RFC 5681      |   |         RTTmin, IW)    |
|                   |   | pacing on at bw_est    |
|                   |   | rwnd suppressed        |
|                   |   | (except rwnd == 0)     |
+---------+---------+   +-----------+------------+
          |                         |
          |                 first ACK arrives
          |                         |
          |                         v
          |             +------------------------+
          |             | cwnd <- max(in_flight, |
          |             |         IW)            |
          |             | pacing off             |
          |             +-----------+------------+
          |                         |
          +------------+------------+
                       |
                       v
      +---------------------------------+
      |  Estimation phase               |
      |  (GCBE running;                 |
      |   CC algorithm unchanged;       |
      |   rwnd suppressed               |
      |   except rwnd == 0)             |
      +----------------+----------------+
                       |
                   FIN / RST
                       |
                       v
      +---------------------------------+
      |  Termination phase              |
      |  (cache write, subject to       |
      |   write conditions)             |
      +---------------------------------+
The Startup phase begins immediately after the three-way handshake completes. The sender MUST perform a State Cache lookup as specified in Section 4.3.¶
On a Miss or Collision outcome, the sender MUST proceed using standard Slow-Start [RFC5681] and the Startup phase ends.¶
On a Hit outcome, the sender MUST apply all of the following before transmitting any data segment:¶
The Startup phase SHALL end when the first ACK that advances the cumulative acknowledgment number is received. At that point the sender MUST:¶
The rationale for the cwnd update is that the bytes in flight at this instant approximate the path's BDP.¶
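The two window computations of the Hit branch can be sketched as follows. This is an illustration under stated assumptions: windows are expressed in bytes, `MSS` and `IW` are example values (ten segments, in line with [RFC6928]), and the function names are hypothetical.

```python
from typing import Tuple

MSS = 1460        # bytes; example value
IW = 10 * MSS     # initial window in bytes; example value


def startup_on_hit(bw_est: float, rtt_min: float, iw: int = IW) -> Tuple[int, float]:
    """Return (cwnd in bytes, pacing rate in bytes/s) set before any data is sent."""
    cwnd = max(int(bw_est * rtt_min), iw)   # cached BDP, but never below IW
    return cwnd, bw_est


def startup_on_first_ack(in_flight: int, iw: int = IW) -> int:
    """Return the cwnd (bytes) adopted when the first ACK arrives; pacing stops."""
    return max(in_flight, iw)
```

For example, a cached bw_est of 8,000,000 bytes per second and an RTTmin of 62.5 ms give an initial cwnd of 500,000 bytes, whereas a cached BDP smaller than IW leaves the window at IW.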
HyStart-style premature exits from Slow-Start, including HyStart++ [RFC9406], are not applicable during the Startup phase of a Hit-branch connection because the connection is not in Slow-Start. An implementation MUST NOT apply HyStart-style exit triggers to the Hit-branch Startup phase.¶
The Estimation phase begins immediately after the Startup phase ends and continues until connection termination.¶
During the Estimation phase the connection's congestion control algorithm operates without modification. The sender MUST continue to suppress rwnd as defined in Section 5.4. The sender MUST run GCBE (Section 6) to maintain bw_est and MUST maintain RTTmin as the minimum of all valid RTT samples observed for the connection.¶
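Maintaining RTTmin amounts to a running minimum, sketched below. The class name `RttMinTracker` is hypothetical, and the definition of a valid sample used here (positive, and not measured on a retransmitted segment) is an assumption for illustration, not a rule taken from this document.

```python
from typing import Optional


class RttMinTracker:
    def __init__(self) -> None:
        # None plays the role of UNAVAILABLE until a valid sample arrives.
        self.rtt_min: Optional[float] = None

    def on_rtt_sample(self, rtt_s: float, retransmitted: bool = False) -> None:
        # Assumed validity rule: ignore non-positive samples and samples
        # taken from retransmitted segments.
        if retransmitted or rtt_s <= 0:
            return
        if self.rtt_min is None or rtt_s < self.rtt_min:
            self.rtt_min = rtt_s
```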
When the connection terminates -- whether gracefully via FIN exchange or abnormally via RST or local abort -- the sender SHALL attempt a cache update as specified in Section 4.4. The update MUST be applied to the cache bucket corresponding to the connection's peer_IP, regardless of whether the connection used the Hit, Miss, or Collision branch in its Startup phase, except that the update MUST NOT be performed if any of the conditions enumerated in Section 4.4 hold.¶
During the Startup and Estimation phases, the sender MUST compute its effective send window as equal to cwnd, ignoring rwnd, except in the case described below.¶
When rwnd equals zero, the sender MUST honour the zero window: no new data may be transmitted until the receiver advertises a non-zero rwnd, exactly as required by [RFC9293]. Stateful-TCP MUST NOT override the zero-window semantics.¶
The justification for suppressing rwnd outside the zero-window case is that on contemporary end systems the receiver is, in practice, almost always able to consume incoming data faster than the network delivers it, so flow-control back-pressure is rarely required; allowing a small initial rwnd to cap the sending rate would limit the performance gains of Stateful-TCP.¶
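The send-window rule can be contrasted with the conventional one in a few lines. The function names are hypothetical and windows are in bytes.

```python
def standard_send_window(cwnd: int, rwnd: int) -> int:
    """Conventional TCP: the send window is limited by both cwnd and rwnd."""
    return min(cwnd, rwnd)


def stateful_send_window(cwnd: int, rwnd: int) -> int:
    """Stateful-TCP rule: ignore rwnd unless the receiver advertises zero."""
    if rwnd == 0:
        return 0      # zero window is always honoured
    return cwnd
```

With a cwnd of 500,000 bytes and an advertised rwnd of 65,535 bytes, the conventional rule caps the window at 65,535 bytes while the Stateful-TCP rule allows the full 500,000 bytes.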
A bandwidth estimator that simply divides the bytes acknowledged in a measurement window by the duration of that window may underestimate the path bandwidth if the sender's transmission has been paused due to cwnd exhaustion. The pause shows up as a long inter-ACK gap, which inflates the denominator of the estimate but not the numerator. This effect is most pronounced when cwnd is small, e.g., during Slow-Start.¶
GCBE compensates by removing, from each measurement cycle, the single ACK whose inter-arrival time is the longest; both its acknowledged bytes and the gap it spans are excluded from the estimate. Note that GCBE does not replace the congestion control algorithm's bandwidth estimator. It operates in parallel and the estimated bandwidth is only used for cache update as described in Section 5.3.¶
Cycle i begins at time t_i. When an ACK is received at time t such that t - t_i is greater than or equal to d (the current SRTT), the cycle SHALL end with that ACK. The next cycle SHALL begin immediately, with t_{i+1} set to t.¶
Within cycle i, let¶
   z_i =       argmax        ( h_{i,j} - h_{i,j-1} )
          j in [1, n_i - 1]
¶
The ACK at index 0 of the cycle is excluded from the estimate because the time at which the data it acknowledges was placed on the wire is not known to the sender. The ACK at index z_i is excluded because its long inter-arrival gap is taken to be a transmission-suspension gap.¶
           n_i - 1
             sum    a_{i,j}   -   a_{i,z_i}
            j = 1
   c_i = ---------------------------------------------------------
          ( h_{i,n_i-1} - h_{i,0} ) - ( h_{i,z_i} - h_{i,z_i-1} )
¶
If n_i is less than or equal to an implementation-defined threshold, the cycle does not contain enough samples for the formula above to be applied; in that case the implementation MUST skip the cycle (produce no estimate) and continue with the next cycle.¶
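The per-cycle computation can be illustrated with a small worked example. The sketch assumes that each ACK of a cycle is represented as a pair (arrival time h in seconds, newly acknowledged bytes a); the function names and the default threshold `min_acks` are hypothetical.

```python
from typing import List, Optional, Tuple

Ack = Tuple[float, int]   # (arrival time in seconds, bytes newly acknowledged)


def gcbe_cycle_estimate(acks: List[Ack], min_acks: int = 3) -> Optional[float]:
    """Gap-compensated estimate c_i (bytes/s) for one cycle, or None if skipped."""
    n = len(acks)
    if n <= min_acks or n < 2:
        return None                                  # too few samples
    gaps = [acks[j][0] - acks[j - 1][0] for j in range(1, n)]
    # z: index within the cycle of the ACK that ends the longest gap
    z = 1 + max(range(len(gaps)), key=gaps.__getitem__)
    acked = sum(b for _, b in acks[1:]) - acks[z][1]  # drop ACK 0 and ACK z
    duration = (acks[-1][0] - acks[0][0]) - gaps[z - 1]
    if duration <= 0 or acked <= 0:
        return None
    return acked / duration


def naive_estimate(acks: List[Ack]) -> float:
    """Uncompensated estimate: bytes acknowledged divided by elapsed time."""
    return sum(b for _, b in acks[1:]) / (acks[-1][0] - acks[0][0])


# Six ACKs of 1500 bytes each; the sender stalled for 20 ms before the fifth.
cycle = [(0.000, 1500), (0.001, 1500), (0.002, 1500),
         (0.003, 1500), (0.023, 1500), (0.024, 1500)]
compensated = gcbe_cycle_estimate(cycle)
uncompensated = naive_estimate(cycle)
```

In this example the uncompensated estimate is about 0.31 MB/s, because the 20 ms stall is counted as transfer time. Removing the stalled ACK and its gap yields about 1.5 MB/s, which matches the rate at which the other ACKs arrived.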
The implementation MUST retain the L most recent values of c_i. When the cache is updated at connection termination, the value bw_est SHALL be set to the maximum of the retained values:¶
   bw_est =                 max                  c_i
             i in last L cycles of the connection
¶
Taking the maximum, rather than the most recent value or a smoothed value, prevents an isolated congestion or loss event near the end of the connection from skewing the recorded estimate downward.¶
If fewer than L cycles have completed by the time the connection terminates, the maximum is taken over the completed cycles only. If no cycle has completed, bw_est is unavailable and the cache MUST NOT be updated (see Section 4.4).¶
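Retaining the L most recent cycle estimates and taking their maximum maps naturally onto a bounded queue. In the sketch below the class name `BwEstimateWindow` and the default capacity are hypothetical, and `None` stands in for an unavailable bw_est.

```python
from collections import deque
from typing import Deque, Optional


class BwEstimateWindow:
    def __init__(self, capacity: int = 8) -> None:   # capacity plays the role of L
        self.ring: Deque[float] = deque(maxlen=capacity)

    def push(self, c_i: float) -> None:
        self.ring.append(c_i)      # the oldest value is dropped once full

    def bw_est(self) -> Optional[float]:
        # Maximum over the completed cycles retained, or None if there are none.
        return max(self.ring) if self.ring else None
```

With a capacity of 3, pushing 5, 9, 4, 3 leaves 9, 4, 3 in the window, so a low final sample does not pull the recorded estimate down.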
# Inputs:
#   - on_ack(ack):
#       called for every ACK that advances the cumulative
#       acknowledgment number, with:
#         ack.t     = arrival time of the ACK
#         ack.bytes = bytes newly acknowledged by the ACK
#   - srtt() = current SRTT in seconds
#
# Configurations:
#   minACK   # cycles with minACK or fewer ACKs are skipped
#   L        # number of most recent c_i values retained
#   IW       # initial congestion window (segments)
#   MSS      # maximum segment size (bytes)
#
# State (per connection):
#   t_i      # start time of current cycle
#   acks[]   # tuples (t, bytes) for the current cycle
#   ring     # ring buffer of size L holding completed c_i
#   RTTmin   # minimum RTT observed (UNAVAILABLE if none);
#            # maintained from RTT samples outside this pseudocode

initialize_gcbe():
    t_i  = now()
    acks = []
    ring = empty ring buffer of capacity L

on_ack(ack):
    acks.append( (ack.t, ack.bytes) )
    if (ack.t - t_i) >= srtt():
        # close the current cycle
        if length(acks) > minACK:
            # find z_i: index of the largest inter-arrival gap
            z = 1
            max_gap = acks[1].t - acks[0].t
            for j in 2 .. length(acks) - 1:
                gap = acks[j].t - acks[j-1].t
                if gap > max_gap:
                    max_gap = gap
                    z = j
            # bytes acknowledged, excluding ACK 0 and ACK z
            sum_bytes = 0
            for j in 1 .. length(acks) - 1:
                sum_bytes = sum_bytes + acks[j].bytes
            sum_bytes = sum_bytes - acks[z].bytes
            # cycle duration, excluding the largest gap
            duration = acks[ length(acks) - 1 ].t - acks[0].t
            duration = duration - max_gap
            if duration > 0 and sum_bytes > 0:
                c_i = sum_bytes / duration
                ring.push(c_i)
        # start a new cycle; the closing ACK becomes its index 0
        t_i  = ack.t
        acks = []
        acks.append( (ack.t, ack.bytes) )

on_connection_close():
    if ring is not empty:
        bw_est = max(ring)
    else:
        bw_est = UNAVAILABLE
    # Section 4.4 write conditions; cache MUST NOT be updated when:
    if bw_est == UNAVAILABLE:
        return NO_CACHE_UPDATE
    if RTTmin == UNAVAILABLE:
        return NO_CACHE_UPDATE
    if (bw_est * RTTmin) / MSS <= IW:
        return NO_CACHE_UPDATE
    return { bw_est, RTTmin }
¶
A Stateful-TCP sender MUST support pacing of outgoing data segments during the Hit branch of the Startup phase, at a rate equal to bw_est. Without pacing, an enlarged initial cwnd would be transmitted at line rate; this is equivalent to the unpaced initial-cwnd schemes that motivated this work and would be expected to cause buffer overflow at the path bottleneck.¶
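The effect of pacing at bw_est can be sketched by computing segment departure times. This is an illustration only: the function names are hypothetical, sizes are in bytes, bw_est is in bytes per second, and a real sender would rely on its pacing timer or qdisc rather than a precomputed schedule.

```python
from typing import List


def pacing_interval(segment_bytes: int, bw_est: float) -> float:
    """Time (seconds) to wait between two segments when pacing at bw_est."""
    return segment_bytes / bw_est


def paced_schedule(cwnd_bytes: int, mss: int, bw_est: float,
                   start: float = 0.0) -> List[float]:
    """Departure time of each segment of the initial window, paced at bw_est."""
    times: List[float] = []
    sent = 0
    t = start
    while sent < cwnd_bytes:
        size = min(mss, cwnd_bytes - sent)   # the last segment may be short
        times.append(t)
        t += pacing_interval(size, bw_est)
        sent += size
    return times
```

When cwnd equals bw_est * RTTmin, the schedule spreads the initial window over roughly one RTTmin instead of emitting it as a single burst.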
[RFC9040] specifies a framework for sharing a small set of TCB variables, including SRTT and ssthresh, between connections to the same host. Stateful-TCP extends this framework with two additional shared items (bw_est and RTTmin) that are used specifically to drive the Startup phase of new connections.¶
An implementation that already implements [RFC9040] MAY store the additional Stateful-TCP fields in the same per-destination state structure used for RFC 9040 sharing, provided that the conditions in Section 4.4 are observed for the additional fields.¶
TCP Fast Open (TFO, [RFC7413]) reduces startup latency by carrying data in the SYN. TFO and Stateful-TCP are orthogonal: TFO accelerates the first round-trip of a connection, while Stateful-TCP accelerates the bandwidth ramp-up that follows. Implementations MAY enable both simultaneously.¶
A Hit-branch Stateful-TCP connection ramps to its target rate in one RTT instead of over many RTTs. When such a connection shares a bottleneck with one or more standard (Slow-Start) connections, the standard connections will experience the Stateful-TCP connection as a long-lived flow that is already at steady state, rather than as a newcomer that is still ramping up. This shifts the short-term throughput share towards the Stateful-TCP connection.¶
The cited evaluation ([STATEFUL-TCP]) quantifies this effect for S-Cubic and reports that, at the link utilizations measured, both fairness among Stateful-TCP connections of the same kind and friendliness to standard connections remain within ranges considered acceptable for an experimental TCP variant. Implementers and operators SHOULD evaluate fairness in their own deployment environments before enabling Stateful-TCP at scale.¶
The accuracy of a cached estimate degrades with the time elapsed since it was recorded, because path conditions may change. Implementations SHOULD bound the maximum age of an entry as specified in Section 4.5. A cached estimate that exceeds the true path bandwidth at the time a Hit-branch connection starts will be corrected by the connection's ordinary congestion control response after the first RTT, although it may cause additional queueing or retransmissions during the first RTT. A cached estimate below the true path bandwidth will reduce the bandwidth utilization of the Hit-branch connection but will not cause additional congestion.¶
Stateful-TCP introduces sender-local state that is updated and consumed without any additional information on the wire. It does not change the TCP authentication or integrity properties of the connection itself. Nevertheless, the introduction of cached, peer-keyed state raises several considerations.¶
The cache is keyed on the peer_IP that the sender observes for the connection that wrote the entry. An off-path attacker cannot directly write entries for a victim peer: doing so would require completing a TCP handshake with the sender while spoofing the victim's IP address, and an attacker that cannot observe the sender's replies would have to guess the sender's initial sequence number to do so.¶
An on-path attacker that can complete a TCP handshake from a spoofed peer_IP can induce the sender to record an arbitrary bw_est for that peer. The consequence on a subsequent connection is bounded: an inflated bw_est causes at most one round-trip of paced over-transmission to that peer before ordinary congestion control reasserts itself; a deflated bw_est causes at most one round-trip of under-utilization. Implementations SHOULD consider these bounds when choosing cache expiry policies in environments where on-path attackers are part of the threat model.¶
A peer that initiates connections from many distinct source addresses can cause the cache to grow without bound unless its size is limited. As required in Section 4.5, implementations MUST bound cache memory and SHOULD use an LRU or equivalent eviction policy.¶
This document has no IANA actions.¶
The mechanism specified here was originally proposed and evaluated in [STATEFUL-TCP].¶