<?xml version="1.0" encoding="UTF-8"?>
<!-- generated by xml2rfc-friendly editor; conforms to RFC 7991 (xml2rfc v3) -->
<?xml-model href="rfc7991bis.rnc"?>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude"
     ipr="trust200902"
     docName="draft-lee-tcpm-stateful-tcp-00"
     category="exp"
     submissionType="IETF"
     consensus="false"
     version="3"
     xml:lang="en"
     tocInclude="true"
     tocDepth="3"
     symRefs="true"
     sortRefs="true">

  <front>
    <title abbrev="Stateful-TCP">Stateful-TCP: Bypassing TCP Slow-Start Using Cached Per-Destination Path Bandwidth</title>
    <seriesInfo name="Internet-Draft" value="draft-lee-tcpm-stateful-tcp-00"/>

    <author fullname="Lingfeng Guo" initials="L." surname="Guo">
      <organization abbrev="CUHK">The Chinese University of Hong Kong</organization>
      <address>
        <postal>
          <street>Department of Information Engineering</street>
          <city>Shatin, N.T.</city>
          <country>Hong Kong</country>
        </postal>
        <email>gl016@ie.cuhk.edu.hk</email>
      </address>
    </author>

    <author fullname="Feiyu Xue" initials="F." surname="Xue">
      <organization abbrev="CUHK">The Chinese University of Hong Kong</organization>
      <address>
        <postal>
          <street>Department of Information Engineering</street>
          <city>Shatin, N.T.</city>
          <country>Hong Kong</country>
        </postal>
        <email>xf024@ie.cuhk.edu.hk</email>
      </address>
    </author>

    <author fullname="Jack Y. B. Lee" initials="J. Y. B." surname="Lee">
      <organization abbrev="CUHK">The Chinese University of Hong Kong</organization>
      <address>
        <postal>
          <street>Department of Information Engineering</street>
          <city>Shatin, N.T.</city>
          <country>Hong Kong</country>
        </postal>
        <email>jacklee@computer.org</email>
      </address>
    </author>

    <date year="2026" month="May" day="18"/>

    <area>Transport</area>
    <workgroup>TCP Maintenance and Minor Extensions (tcpm)</workgroup>

    <keyword>TCP</keyword>
    <keyword>Slow-Start</keyword>
    <keyword>congestion control</keyword>
    <keyword>bandwidth estimation</keyword>
    <keyword>TCB</keyword>
    <keyword>pacing</keyword>

    <abstract>
      <t>
        This document specifies Stateful-TCP, an experimental sender-side
        mechanism that accelerates the startup of TCP connections by
        reusing path-bandwidth information estimated from earlier
        connections to the same destination.  When a usable estimate is
        available, the sender bypasses Slow-Start and instead enters a
        paced startup phase whose initial congestion window and pacing
        rate are derived from the cached estimate.  When no usable
        estimate is available, the sender falls back to standard
        Slow-Start.
      </t>
      <t>
        Stateful-TCP is sender-only and does not require any change to
        the TCP receiver or to the wire format.  It is orthogonal to the
        congestion control algorithm in use after the first round-trip
        time and may therefore be combined with existing TCP variants
        such as CUBIC.  This document also specifies a Gap-Compensated
        Bandwidth Estimation (GCBE) procedure used to produce the
        cached per-destination estimate.
      </t>
      <t>
        Stateful-TCP extends the framework of TCP Control Block
        Interdependence (RFC 9040) by sharing additional per-destination
        state across connections.  The mechanism is published as
        Experimental to enable independent implementation, evaluation,
        and review.
      </t>
    </abstract>
  </front>

  <middle>

    <!-- ============================================================ -->
    <section anchor="introduction" numbered="true" toc="default">
      <name>Introduction</name>
      <t>
        TCP <xref target="RFC9293"/> traditionally begins each new
        connection with a Slow-Start phase
        <xref target="RFC5681"/>, during which the congestion window
        (cwnd) starts from a small initial value and grows
        exponentially as acknowledgments are received.  Slow-Start is
        a conservative bandwidth-probing procedure designed to avoid
        overwhelming network paths whose capacity is unknown to the
        sender.
      </t>
      <t>
        On modern high bandwidth-delay-product (BDP) paths, however,
        the time spent in Slow-Start can dominate the completion time
        of short and medium-sized flows.  Existing mitigations
        include increasing the initial window
        <xref target="RFC6928"/>, refining the Slow-Start exit
        condition (HyStart++ <xref target="RFC9406"/>), and Limited
        Slow-Start <xref target="RFC3742"/>.  These approaches improve
        but do not eliminate the underlying problem: a sender that
        has no information about the path is forced to ramp its
        sending rate up gradually, regardless of how much bandwidth is
        actually available.
      </t>
      <t>
        TCP Control Block (TCB) Interdependence
        <xref target="RFC9040"/> (which obsoletes
        <xref target="RFC2140"/>) already permits a sender to share
        a small set of TCB variables, including smoothed RTT and
        ssthresh, between connections to the same host.  This
        document specifies an experimental extension to that
        framework.  Specifically, an implementation of Stateful-TCP
        caches an estimated path bandwidth and a minimum RTT per
        destination and uses them, when available, to bypass
        Slow-Start on subsequent connections to that destination.
      </t>
      <t>
        The mechanism has three properties that distinguish it from
        prior work:
      </t>
      <ul spacing="normal">
        <li>
          The startup phase uses both an enlarged initial cwnd
          <em>and</em> sender pacing.  Pacing the first round of
          transmissions at the previously estimated bottleneck rate
          prevents the line-rate burst that would otherwise occur and
          that is the principal cause of buffer overflow when an
          enlarged initial cwnd is used in isolation.
        </li>
        <li>
          The receiver advertised window (rwnd) is suppressed during
          the startup and estimation phases (except when rwnd is
          zero, which retains its semantics from
          <xref target="RFC9293"/>) so that a small rwnd at the start
          of a connection does not cap the sending rate.
        </li>
        <li>
          The bandwidth estimation procedure (GCBE,
          <xref target="gcbe"/>) excludes inter-ACK gaps that arise
          from cwnd-limited transmission, avoiding the
          underestimation to which classical bandwidth estimators
          are prone during Slow-Start.
        </li>
      </ul>
      <t>
        The design and an evaluation of Stateful-TCP applied to CUBIC
        <xref target="RFC9438"/> -- referred to in the cited paper as
        S-Cubic -- are described in <xref target="STATEFUL-TCP"/>.
        That paper is the primary source for the motivation, design
        rationale, and experimental results; this document specifies
        the sender behaviour in sufficient detail for independent
        implementation.
      </t>
    </section>

    <!-- ============================================================ -->
    <section anchor="conventions" numbered="true" toc="default">
      <name>Conventions and Definitions</name>
      <t>
        The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>",
        "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>",
        "<bcp14>SHALL NOT</bcp14>", "<bcp14>SHOULD</bcp14>",
        "<bcp14>SHOULD NOT</bcp14>",
        "<bcp14>RECOMMENDED</bcp14>",
        "<bcp14>NOT RECOMMENDED</bcp14>",
        "<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this
        document are to be interpreted as described in BCP 14
        <xref target="RFC2119"/> <xref target="RFC8174"/> when, and
        only when, they appear in all capitals, as shown here.
      </t>
      <t>
        The following terms and symbols are used throughout this
        document.  Units are given in parentheses where applicable.
      </t>
      <dl newline="false" spacing="normal">
        <dt>MSS:</dt>
        <dd>TCP Maximum Segment Size, in bytes.</dd>

        <dt>cwnd:</dt>
        <dd>Sender congestion window, in bytes (or, equivalently,
        in units of MSS).</dd>

        <dt>IW:</dt>
        <dd>The default initial value of cwnd in the absence of any
        cached state, as defined by <xref target="RFC5681"/> and
        updated by <xref target="RFC6928"/>.</dd>

        <dt>rwnd:</dt>
        <dd>Receiver advertised window, in bytes, as carried in the
        TCP Window field <xref target="RFC9293"/>.</dd>

        <dt>SRTT:</dt>
        <dd>Smoothed round-trip time, computed as defined in
        <xref target="RFC6298"/>, in seconds.</dd>

        <dt>RTTmin:</dt>
        <dd>Minimum observed RTT for the connection, in seconds.</dd>

        <dt>BDP:</dt>
        <dd>Bandwidth-Delay Product, expressed in bytes, equal to
        the product of an estimated bandwidth and an estimated RTT.</dd>

        <dt>bw_est:</dt>
        <dd>The path bandwidth estimate produced by GCBE
        (<xref target="gcbe"/>), in bytes per second.</dd>

        <dt>peer_IP:</dt>
        <dd>The IP address of the remote endpoint of a TCP
        connection, as observed by the sender.</dd>

        <dt>State Cache:</dt>
        <dd>The sender-local data structure described in
        <xref target="cache"/> that maps peer_IP to a tuple
        {bw_est, RTTmin}.</dd>

        <dt>Startup phase:</dt>
        <dd>The phase of a connection running Stateful-TCP that
        extends from the completion of the three-way handshake
        until the first ACK that advances the cumulative
        acknowledgment number is received.  See
        <xref target="state-machine"/>.</dd>

        <dt>Estimation phase:</dt>
        <dd>The phase that follows the Startup phase and lasts until
        connection termination, during which GCBE updates the
        bandwidth estimate.  See <xref target="state-machine"/>.</dd>

        <dt>Termination phase:</dt>
        <dd>The processing performed when the connection is closed
        (whether gracefully via FIN or abnormally via RST), during
        which the cache is updated.  See
        <xref target="state-machine"/>.</dd>
      </dl>
    </section>

    <!-- ============================================================ -->
    <section anchor="overview" numbered="true" toc="default">
      <name>Overview</name>

      <section anchor="overview-design" numbered="true">
        <name>Design Goals</name>
        <t>
          Stateful-TCP is designed to satisfy the following goals.
        </t>
        <ul spacing="normal">
          <li>Avoid the bandwidth underutilization caused by Slow-Start
          on paths whose capacity is already known to the sender from
          earlier connections.</li>
          <li>Avoid the line-rate startup burst that occurs when an
          enlarged initial cwnd is used without pacing.</li>
          <li>Require no change to the TCP receiver, no new TCP option,
          and no change to the wire format.</li>
          <li>Coexist with any existing TCP congestion control
          algorithm; the mechanism takes effect only during the
          Startup phase, after which the connection behaves exactly
          as defined by its congestion control algorithm.</li>
          <li>Degrade gracefully to standard Slow-Start when no usable
          cached state is available or when a cache lookup
          collides.</li>
        </ul>
      </section>

      <section anchor="overview-arch" numbered="true">
        <name>Architecture</name>
        <t>
          A Stateful-TCP implementation consists of two components
          that reside in the TCP sender:
        </t>
        <ul spacing="normal">
          <li>A <em>State Cache</em>, which stores
          {bw_est, RTTmin} per peer_IP across connections and is
          described in <xref target="cache"/>.</li>
          <li>A <em>per-connection state machine</em>, comprising the
          Startup, Estimation, and Termination phases described in
          <xref target="state-machine"/>, that consumes and updates
          the cache and that drives initial cwnd, pacing, and rwnd
          suppression.</li>
        </ul>
        <t>
          Bandwidth estimation during the Estimation phase is
          performed by GCBE, specified in <xref target="gcbe"/>.
        </t>
        <t>
          The mechanism extends the framework of
          <xref target="RFC9040"/>: bw_est and RTTmin are
          additional, per-destination cached items, used in addition
          to (not in place of) the items already shared under that
          framework.
        </t>
      </section>
    </section>

    <!-- ============================================================ -->
    <section anchor="cache" numbered="true" toc="default">
      <name>The State Cache</name>

      <section anchor="cache-entry" numbered="true">
        <name>Entry Format</name>
        <t>
          A State Cache entry <bcp14>MUST</bcp14> contain at least
          the following three fields:
        </t>
        <dl newline="false" spacing="normal">
          <dt>peer_IP:</dt>
          <dd>The IP address of the peer for which the entry was
          recorded.  Implementations <bcp14>MUST</bcp14> store the
          full IP address (including address family) so that hash
          collisions can be detected as described in
          <xref target="cache-lookup"/>.</dd>

          <dt>bw_est:</dt>
          <dd>The most recent bandwidth estimate produced by GCBE
          for the connection that closed and updated this entry.</dd>

          <dt>RTTmin:</dt>
          <dd>The minimum RTT observed during that connection.</dd>
        </dl>
        <t>
          Implementations <bcp14>MAY</bcp14> store additional fields,
          such as the last update time (used for expiry) or fields
          inherited from the TCB Interdependence framework
          <xref target="RFC9040"/>.  Such fields are out of scope of
          this specification but <bcp14>MUST NOT</bcp14> alter the
          semantics of the three required fields.
        </t>
      </section>

      <section anchor="cache-indexing" numbered="true">
        <name>Indexing</name>
        <t>
          Implementations <bcp14>MAY</bcp14> implement the State
          Cache as a hash table indexed by an H-bit hash of peer_IP,
          where H is implementation-defined.  Implementations
          <bcp14>SHOULD</bcp14> choose H, the table size, and the
          hash function so that the expected hash-collision rate is
          low for the workload anticipated.
        </t>
        <t>
          The choice of hash function is local to the sender and is
          not visible on the wire.
        </t>
      </section>

      <section anchor="cache-lookup" numbered="true">
        <name>Lookup Outcomes</name>
        <t>
          A cache lookup performed at the start of the Startup phase
          produces exactly one of the following three outcomes:
        </t>
        <dl newline="false" spacing="normal">
          <dt>Miss:</dt>
          <dd>The hashed bucket is empty.  The connection
          <bcp14>MUST</bcp14> proceed using standard Slow-Start
          <xref target="RFC5681"/>.</dd>

          <dt>Hit:</dt>
          <dd>The hashed bucket is non-empty and the stored peer_IP
          equals the peer_IP of the new connection.  The connection
          <bcp14>MUST</bcp14> apply the Hit branch of the Startup
          phase as specified in <xref target="phase-startup"/>.</dd>

          <dt>Collision:</dt>
          <dd>The hashed bucket is non-empty but the stored peer_IP
          differs from the peer_IP of the new connection.  The
          connection <bcp14>MUST</bcp14> proceed using standard
          Slow-Start.</dd>
        </dl>
        <t>
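          In the Miss and Collision cases, as in the Hit case, the
          cache entry <bcp14>SHALL</bcp14> be updated by the new
          connection upon its termination, subject to the conditions
          in <xref target="cache-write"/>.  In the Collision case
          the update replaces the entry previously stored for the
          colliding peer.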
        </t>
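        <t>
          The three outcomes can be sketched as follows.  This is an
          illustrative, non-normative example: the names Outcome,
          Entry, bucket_index, and lookup, the use of SHA-256, and
          the value H = 16 are assumptions made for the example, and
          the table is modelled as a dictionary holding at most one
          entry per bucket.
        </t>
        <sourcecode type="python"><![CDATA[
import hashlib
import ipaddress
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple

H_BITS = 16  # assumed hash width; H is implementation-defined


class Outcome(Enum):
    MISS = "miss"
    HIT = "hit"
    COLLISION = "collision"


@dataclass
class Entry:
    peer_ip: str    # full address, kept to detect collisions
    bw_est: float   # bytes per second
    rtt_min: float  # seconds


def bucket_index(peer_ip: str) -> int:
    """Map peer_IP to an H-bit bucket index (illustrative hash)."""
    packed = ipaddress.ip_address(peer_ip).packed
    digest = hashlib.sha256(packed).digest()
    return int.from_bytes(digest[:4], "big") % (1 << H_BITS)


def lookup(table: Dict[int, Entry],
           peer_ip: str) -> Tuple[Outcome, Optional[Entry]]:
    """Classify a lookup as Miss, Hit, or Collision."""
    # Normalise the textual form before comparing addresses.
    key = str(ipaddress.ip_address(peer_ip))
    entry = table.get(bucket_index(key))
    if entry is None:
        return Outcome.MISS, None        # empty bucket
    if entry.peer_ip == key:
        return Outcome.HIT, entry        # same peer: use the entry
    return Outcome.COLLISION, None       # other peer: Slow-Start
]]></sourcecode>
        <t>
          Because the full peer_IP is stored in the entry, two
          addresses that hash to the same bucket are reported as a
          Collision rather than being mistaken for a Hit.
        </t>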
      </section>

      <section anchor="cache-write" numbered="true">
        <name>Write Conditions</name>
        <t>
          When a connection terminates (see
          <xref target="phase-termination"/>), an implementation
          <bcp14>SHALL</bcp14> update the cache entry indexed by the
          peer_IP of that connection with the {bw_est, RTTmin} pair
          observed for the connection, except that the cache
          <bcp14>MUST NOT</bcp14> be updated when:
        </t>
        <ul spacing="normal">
          <li>bw_est multiplied by RTTmin, expressed in MSS-sized
          segments, is less than or equal to IW; or</li>
          <li>bw_est is not available (for example, the Estimation
          phase did not complete a single GCBE measurement cycle);
          or</li>
          <li>RTTmin is not available (for example, no valid RTT
          sample was obtained).</li>
        </ul>
        <t>
          In the first case, retaining the entry is unnecessary
          because the cached BDP does not exceed the default IW, so
          Stateful-TCP would provide no benefit over standard
          Slow-Start.  In the latter two cases, the values would be
          invalid for subsequent connections.
        </t>
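        <t>
          The write conditions can be expressed as a single
          predicate.  The following is a minimal, non-normative
          sketch: the function name cache_write_allowed and its
          parameters are assumptions made for the example,
          unavailable values are modelled as None, and the BDP is
          converted to segments by plain division because this
          document does not specify a rounding rule.
        </t>
        <sourcecode type="python"><![CDATA[
from typing import Optional


def cache_write_allowed(bw_est: Optional[float],
                        rtt_min: Optional[float],
                        iw_segments: int,
                        mss: int) -> bool:
    """Return True if the entry may be written at termination.

    bw_est is in bytes per second and rtt_min in seconds; None
    means that the value is unavailable.
    """
    if bw_est is None or rtt_min is None:
        return False  # no usable estimate or no valid RTT sample
    bdp_segments = bw_est * rtt_min / mss
    # A BDP that does not exceed IW gives no benefit when cached.
    return bdp_segments > iw_segments
]]></sourcecode>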
      </section>

      <section anchor="cache-expiry" numbered="true">
        <name>Expiry and Eviction</name>
        <t>
          Implementations <bcp14>SHOULD</bcp14> associate each cache
          entry with a maximum age and <bcp14>MUST NOT</bcp14> use an
          entry whose age exceeds the configured maximum.  The
          maximum age is implementation-defined and
          <bcp14>SHOULD</bcp14> be configurable.
        </t>
        <t>
          Implementations <bcp14>MUST</bcp14> bound the memory used
          by the cache.  When the bound is reached, an
          implementation <bcp14>SHOULD</bcp14> evict entries using a
          least-recently-used (LRU) policy or an equivalent policy
          that preserves the entries most likely to be useful.  This
          requirement is also a defence against the cache-exhaustion
          denial-of-service vector discussed in
          <xref target="security"/>.
        </t>
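        <t>
          One possible way to combine a maximum age with LRU
          eviction is sketched below.  The example is illustrative
          only: the class name BoundedCache and its methods are
          assumptions, time is passed in explicitly by the caller,
          and, for brevity, entries are keyed directly by peer_IP
          rather than by a hash bucket.
        </t>
        <sourcecode type="python"><![CDATA[
from collections import OrderedDict
from typing import Optional, Tuple


class BoundedCache:
    """Size-bounded cache with LRU eviction and a maximum age."""

    def __init__(self, max_entries: int, max_age: float) -> None:
        self.max_entries = max_entries
        self.max_age = max_age  # seconds
        # peer_ip -> (bw_est, rtt_min, time of last update)
        self._entries: OrderedDict = OrderedDict()

    def put(self, peer_ip: str, bw_est: float, rtt_min: float,
            now: float) -> None:
        self._entries[peer_ip] = (bw_est, rtt_min, now)
        self._entries.move_to_end(peer_ip)  # most recently used
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the LRU entry

    def get(self, peer_ip: str,
            now: float) -> Optional[Tuple[float, float]]:
        item = self._entries.get(peer_ip)
        if item is None:
            return None
        bw_est, rtt_min, updated = item
        if now - updated > self.max_age:
            del self._entries[peer_ip]  # expired: never use it
            return None
        self._entries.move_to_end(peer_ip)
        return bw_est, rtt_min
]]></sourcecode>
        <t>
          In this sketch an expired entry is removed when it is
          looked up, so it is treated in the same way as a Miss and
          the connection falls back to standard Slow-Start.
        </t>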
      </section>
    </section>

    <!-- ============================================================ -->
    <section anchor="state-machine" numbered="true" toc="default">
      <name>Per-Connection State Machine</name>

      <t>
        Each connection that uses Stateful-TCP transitions through
        three phases as illustrated in <xref target="fig-fsm"/>.
      </t>

      <figure anchor="fig-fsm">
        <name>Stateful-TCP per-connection state machine.</name>
        <artwork align="left" type="ascii-art"><![CDATA[
                +---------------------+
                |  3-way handshake    |
                |     completes       |
                +----------+----------+
                           |
                           v
                +---------------------+
                |     Cache lookup    |
                +---+-------------+---+
                    | Miss /      | Hit
                    | Collision   |
                    v             v
        +-------------------+ +-----------------------+
        | Standard          | | Startup phase (Hit)   |
        | Slow-Start        | |  cwnd <- max(bw_est * |
        | per RFC 5681      | |          RTTmin, IW)  |
        |                   | |                       |
        |                   | |  pacing on at         |
        |                   | |          bw_est       |
        |                   | |  rwnd suppressed      |
        |                   | |  (except rwnd == 0)   |
        +---------+---------+ +----------+------------+
                  |                      |
                  |              first ACK arrives
                  |                      |
                  |                      v
                  |          +-----------------------+
                  |          | cwnd <- max(in_flight,|
                  |          |             IW)       |
                  |          | pacing off            |
                  |          +-----------+-----------+
                  |                      |
                  +----------+-----------+
                             |
                             v
                +---------------------+
                |  Estimation phase   |
                |  (GCBE running;     |
                |   CC algorithm      |
                |   unchanged;        |
                |   rwnd suppressed   |
                |   except rwnd == 0) |
                +----------+----------+
                           |
                       FIN / RST
                           |
                           v
                +---------------------+
                |  Termination phase  |
                |  (cache write,      |
                |   subject to write  |
                |   conditions)       |
                +---------------------+
]]></artwork>
      </figure>

      <section anchor="phase-startup" numbered="true">
        <name>Startup Phase</name>
        <t>
          The Startup phase begins immediately after the three-way
          handshake completes.  The sender <bcp14>MUST</bcp14>
          perform a State Cache lookup as specified in
          <xref target="cache-lookup"/>.
        </t>
        <t>
          On a <em>Miss</em> or <em>Collision</em> outcome, the
          sender <bcp14>MUST</bcp14> proceed using standard
          Slow-Start <xref target="RFC5681"/> and the Startup phase
          ends.
        </t>
        <t>
          On a <em>Hit</em> outcome, the sender <bcp14>MUST</bcp14>
          apply all of the following before transmitting any data
          segment:
        </t>
        <ol spacing="normal">
          <li>Set cwnd to the larger of BDP and IW, where BDP is
          computed as the cached bw_est multiplied by the cached
          RTTmin and expressed in bytes.  The resulting cwnd
          <bcp14>MUST</bcp14> be at least IW.</li>
          <li>Suppress the receiver advertised window, as defined in
          <xref target="rwnd-suppression"/>.</li>
          <li>Enable sender pacing for outgoing data segments at a
          rate equal to the cached bw_est.</li>
        </ol>
        <t>
          The Startup phase <bcp14>SHALL</bcp14> end when the first
          ACK that advances the cumulative acknowledgment number is
          received.  At that point the sender <bcp14>MUST</bcp14>:
        </t>
        <ol spacing="normal">
          <li>Disable pacing.</li>
          <li>Set cwnd to the maximum of (a) the number of bytes
          currently in flight and (b) IW.</li>
        </ol>
        <t>
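          The rationale for the cwnd update is that, by the time the
          first ACK arrives, the sender has been transmitting at the
          paced rate for approximately one round-trip time, so the
          bytes in flight at this instant approximate the path's BDP.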
        </t>
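        <t>
          The Hit-branch computations at the start and at the end of
          the Startup phase can be sketched as follows.  This is a
          non-normative illustration: the names StartupParams,
          enter_startup, and exit_startup are assumptions, and
          pacing is represented simply as the interval between
          MSS-sized segments.
        </t>
        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass


@dataclass
class StartupParams:
    cwnd_bytes: int          # initial congestion window
    pacing_rate: float       # bytes per second
    segment_interval: float  # seconds between MSS-sized segments


def enter_startup(bw_est: float, rtt_min: float,
                  iw_segments: int, mss: int) -> StartupParams:
    """Derive the initial cwnd and pacing rate on a cache Hit."""
    bdp_bytes = int(bw_est * rtt_min)
    cwnd = max(bdp_bytes, iw_segments * mss)  # never below IW
    return StartupParams(cwnd, bw_est, mss / bw_est)


def exit_startup(bytes_in_flight: int,
                 iw_segments: int, mss: int) -> int:
    """Return the cwnd to use once the first ACK has arrived."""
    return max(bytes_in_flight, iw_segments * mss)
]]></sourcecode>
        <t>
          For example, with a cached bw_est of 1,000,000 bytes per
          second, a cached RTTmin of 62.5 ms, an MSS of 1460 bytes,
          and an IW of 10 segments, the sketch yields an initial
          cwnd of 62,500 bytes and one segment every 1.46 ms.
        </t>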
        <t>
          HyStart-style early exits from Slow-Start, including
          HyStart++ <xref target="RFC9406"/>, are not applicable
          during the Startup phase of a Hit-branch connection
          because the connection is not in Slow-Start.  An
          implementation <bcp14>MUST NOT</bcp14> apply HyStart-style
          exit triggers to the Hit-branch Startup phase.
        </t>
      </section>

      <section anchor="phase-estimation" numbered="true">
        <name>Estimation Phase</name>
        <t>
          The Estimation phase begins immediately after the Startup
          phase ends and continues until connection termination.
        </t>
        <t>
          During the Estimation phase the connection's congestion
          control algorithm operates without modification.  The
          sender <bcp14>MUST</bcp14> continue to suppress rwnd as
          defined in <xref target="rwnd-suppression"/>.  The sender
          <bcp14>MUST</bcp14> run GCBE
          (<xref target="gcbe"/>) to maintain bw_est and
          <bcp14>MUST</bcp14> maintain RTTmin as the minimum of all
          valid RTT samples observed for the connection.
        </t>
      </section>

      <section anchor="phase-termination" numbered="true">
        <name>Termination Phase</name>
        <t>
          When the connection terminates -- whether gracefully via
          FIN exchange or abnormally via RST or local abort -- the
          sender <bcp14>SHALL</bcp14> attempt a cache update as
          specified in <xref target="cache-write"/>.  The update
          <bcp14>MUST</bcp14> be applied to the cache bucket
          corresponding to the connection's peer_IP, regardless of
          whether the connection used the Hit, Miss, or Collision
          branch in its Startup phase, except that the update
          <bcp14>MUST NOT</bcp14> be performed if any of the
          conditions enumerated in
          <xref target="cache-write"/> hold.
        </t>
      </section>

      <section anchor="rwnd-suppression" numbered="true">
        <name>Receiver Window Suppression</name>
        <t>
          During the Startup and Estimation phases, the sender
          <bcp14>MUST</bcp14> compute its effective send window as
          equal to cwnd, ignoring rwnd, except in the case described
          below.
        </t>
        <t>
          When rwnd equals zero, the sender <bcp14>MUST</bcp14>
          honour the zero window: no new data may be transmitted
          until the receiver advertises a non-zero rwnd, exactly as
          required by <xref target="RFC9293"/>.  Stateful-TCP
          <bcp14>MUST NOT</bcp14> override the zero-window
          semantics.
        </t>
        <t>
          The justification for suppressing rwnd outside the
          zero-window case is that on contemporary end systems the
          receiver is, in practice, almost always able to consume
          incoming data faster than the network delivers it, so
          flow-control back-pressure is rarely required; allowing a
          small initial rwnd to cap the sending rate would limit
          the performance gains of Stateful-TCP.
        </t>
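        <t>
          The resulting send-window rule is small enough to state
          as code.  The function name effective_send_window and the
          suppress_rwnd flag below are assumptions made for this
          non-normative sketch; the flag is False when Stateful-TCP
          is not in use, in which case the usual minimum of cwnd and
          rwnd applies.
        </t>
        <sourcecode type="python"><![CDATA[
def effective_send_window(cwnd: int, rwnd: int,
                          suppress_rwnd: bool = True) -> int:
    """Return the send window in bytes."""
    if rwnd == 0:
        return 0               # a zero window is always honoured
    if suppress_rwnd:
        return cwnd            # Startup and Estimation phases
    return min(cwnd, rwnd)     # conventional behaviour
]]></sourcecode>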
      </section>
    </section>

    <!-- ============================================================ -->
    <section anchor="gcbe" numbered="true" toc="default">
      <name>Gap-Compensated Bandwidth Estimation (GCBE)</name>

      <section anchor="gcbe-rationale" numbered="true">
        <name>Rationale</name>
        <t>
          A bandwidth estimator that simply divides the bytes
          acknowledged in a measurement window by the duration of
          that window may underestimate the path bandwidth if the
          sender's transmission was paused by cwnd exhaustion during
          the window.  The pause shows up as a long inter-ACK gap,
          which inflates the denominator of the estimate but not the
          numerator.  This effect is most pronounced when cwnd is
          small, for example during Slow-Start.
        </t>
        <t>
          GCBE compensates by removing, from each measurement cycle,
          the single ACK whose inter-arrival time is the longest;
          both its acknowledged bytes and the gap it spans are
          excluded from the estimate.  Note that GCBE does not
          replace any bandwidth estimator used by the congestion
          control algorithm.  It operates in parallel, and its
          estimate is used only for the cache update described in
          <xref target="phase-termination"/>.
        </t>
      </section>

      <section anchor="gcbe-vars" numbered="true">
        <name>Variables</name>
        <dl newline="false" spacing="normal">
          <dt>i:</dt>
          <dd>Index of an estimation cycle; i = 0, 1, 2, ...</dd>

          <dt>t_i:</dt>
          <dd>Wall-clock time at which cycle i begins.</dd>

          <dt>d:</dt>
          <dd>Current SRTT value, in seconds.</dd>

          <dt>n_i:</dt>
          <dd>Number of ACKs received during cycle i.</dd>

          <dt>h_{i,j}:</dt>
          <dd>Wall-clock arrival time of ACK j in cycle i,
          for j = 0, 1, ..., n_i - 1.</dd>

          <dt>a_{i,j}:</dt>
          <dd>Number of bytes newly acknowledged by ACK j in cycle
          i.</dd>

          <dt>z_i:</dt>
          <dd>Index, within cycle i, of the ACK whose inter-arrival
          time from its predecessor is the largest.</dd>

          <dt>c_i:</dt>
          <dd>Bandwidth estimate produced by cycle i, in bytes per
          second.</dd>

          <dt>L:</dt>
          <dd>Implementation-defined number of recent estimates
          retained.</dd>

          <dt>m:</dt>
          <dd>Total number of completed measurement cycles for the
          connection.</dd>

          <dt>bw_est:</dt>
          <dd>The bandwidth value that will be written to the cache
          on connection termination.</dd>
        </dl>
      </section>

      <section anchor="gcbe-cycle" numbered="true">
        <name>Cycle Boundary</name>
        <t>
          Cycle i begins at time t_i.  When an ACK is received at
          time t such that t - t_i is greater than or equal to d
          (the current SRTT), the cycle <bcp14>SHALL</bcp14> end
          with that ACK.  The next cycle <bcp14>SHALL</bcp14> begin
          immediately, with t_{i+1} set to t.
        </t>
      </section>

      <section anchor="gcbe-cycle-estimate" numbered="true">
        <name>Per-Cycle Estimate</name>
        <t>
          Within cycle i, let
        </t>
        <artwork align="left" type="ascii-art"><![CDATA[
   z_i = argmax  ( h_{i,j} - h_{i,j-1} )
        j in [1, n_i - 1]
]]></artwork>
        <t>
          The ACK at index 0 of the cycle is excluded from the
          estimate because the time at which the data it
          acknowledges was placed on the wire is
          not known to the sender.  The ACK at index z_i
          is excluded because its long inter-arrival gap is taken
          to be a transmission-suspension gap.
        </t>
        <artwork align="left" type="ascii-art"><![CDATA[
                      n_i - 1
                        sum    a_{i,j}   -   a_{i, z_i}
                       j = 1
   c_i = ----------------------------------------------------
             ( h_{i, n_i - 1} - h_{i, 0} )
                 - ( h_{i, z_i} - h_{i, z_i - 1} )
]]></artwork>
        <t>
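          If n_i is less than or equal to an implementation-defined
          threshold, the cycle does not contain enough samples to
          apply the formula above; in that case the implementation
          <bcp14>MUST</bcp14> skip the cycle (produce no estimate)
          and continue with the next cycle.  The threshold cannot
          usefully be smaller than 2: with n_i = 2 the only ACK
          after index 0 is also the ACK at index z_i, so both the
          numerator and the denominator of the formula are zero.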
        </t>
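        <t>
          The per-cycle computation can be sketched as follows.  The
          example is non-normative: the function name
          cycle_estimate, the representation of a cycle as a list of
          (arrival time, bytes acknowledged) tuples, and the default
          threshold of 2 are assumptions made for illustration.
        </t>
        <sourcecode type="python"><![CDATA[
from typing import List, Optional, Tuple

MIN_ACK = 2  # assumed threshold; implementation-defined


def cycle_estimate(acks: List[Tuple[float, int]],
                   min_ack: int = MIN_ACK) -> Optional[float]:
    """Return c_i in bytes per second, or None if skipped.

    acks[j] is (h_{i,j}, a_{i,j}) for ACK j of the cycle.
    """
    n = len(acks)
    if n <= min_ack:
        return None  # too few samples: skip this cycle
    # z is the index of the ACK that follows the longest gap.
    z = max(range(1, n), key=lambda j: acks[j][0] - acks[j - 1][0])
    gap = acks[z][0] - acks[z - 1][0]
    # ACK 0 and ACK z are both left out of the byte count.
    acked = sum(b for _, b in acks[1:]) - acks[z][1]
    elapsed = (acks[-1][0] - acks[0][0]) - gap
    if elapsed <= 0:
        return None  # degenerate timing: no estimate
    return acked / elapsed
]]></sourcecode>
        <t>
          For five ACKs of 1000 bytes each arriving at 0, 0.25, 0.5,
          2.5, and 2.75 seconds, the sketch excludes the 2-second gap
          and returns 4000 bytes per second, whereas dividing the
          4000 bytes acknowledged after index 0 by the full 2.75
          seconds would give roughly 1455 bytes per second.
        </t>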
      </section>

      <section anchor="gcbe-store" numbered="true">
        <name>Stored Value</name>
        <t>
          The implementation <bcp14>MUST</bcp14> retain the L most
          recent values of c_i.  When the cache is
          updated at connection termination, the value
          bw_est <bcp14>SHALL</bcp14> be set to the maximum of the
          retained values:
        </t>
        <artwork align="left" type="ascii-art"><![CDATA[
   bw_est = max  c_i
            i in last L cycles of the connection
]]></artwork>
        <t>
          Taking the maximum, rather than the most recent value or a
          smoothed value, prevents an isolated congestion or loss
          event near the end of the connection from skewing the
          recorded estimate downward.
        </t>
        <t>
          If fewer than L cycles have produced an estimate by the
          time the connection terminates, the maximum is taken over
          the estimates that exist.  If no cycle has produced an
          estimate, bw_est is unavailable and the cache
          <bcp14>MUST NOT</bcp14> be updated (see
          <xref target="cache-write"/>).
        </t>
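        <t>
          One possible way to hold the L most recent estimates is a
          fixed-capacity ring, sketched below in Python.  The class
          name EstimateStore and its method names are hypothetical,
          and None stands in for an unavailable bw_est.
        </t>
        <sourcecode type="python"><![CDATA[
from collections import deque


class EstimateStore:
    """Keeps the L most recent per-cycle estimates c_i."""

    def __init__(self, L):
        # a deque with maxlen discards the oldest value when full
        self.ring = deque(maxlen=L)

    def push(self, c_i):
        self.ring.append(c_i)

    def bw_est(self):
        # maximum of the retained values, or None if there are none
        return max(self.ring) if self.ring else None


store = EstimateStore(L=3)
for c_i in [5, 9, 4, 3, 2]:
    store.push(c_i)
print(store.bw_est())   # maximum over the last three values only
]]></sourcecode>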
      </section>

      <section anchor="gcbe-pseudocode" numbered="true">
        <name>Pseudocode</name>
        <sourcecode type="pseudocode"><![CDATA[
# Inputs:
#   on_ack(ack):
#       called for every ACK that advances the cumulative
#       acknowledgment number, with:
#         ack.t        = arrival time of the ACK (seconds)
#         ack.bytes    = bytes newly acknowledged by the ACK
#   srtt():
#       returns the current SRTT (seconds)
#
# Configuration:
#   minACK           # a cycle is used only if it holds more than
#                    # minACK ACKs (minACK >= 1 is assumed here,
#                    # so that acks[1] exists)
#   L                # number of per-cycle estimates retained
#   IW               # initial congestion window (segments)
#   MSS              # maximum segment size (bytes)
#
# State (per connection):
#   t_i              # start time of current cycle
#   d                # SRTT latched at the start of current cycle
#   acks[]           # tuples (t, bytes) for the current cycle
#   ring             # ring buffer of size L holding completed c_i
#   RTTmin           # minimum RTT observed (UNAVAILABLE if none);
#                    # maintained by RTT sampling, not shown here

initialize_gcbe():
    t_i      = now()
    d        = srtt()
    acks     = []
    ring     = empty ring buffer of capacity L

on_ack(ack):
    acks.append( (ack.t, ack.bytes) )

    if (ack.t - t_i) >= d:
        # close the current cycle
        if length(acks) > minACK:
            # find z_i: index of the largest inter-arrival gap
            z = 1
            max_gap = acks[1].t - acks[0].t
            for j in 2 .. length(acks) - 1:
                gap = acks[j].t - acks[j-1].t
                if gap > max_gap:
                    max_gap = gap
                    z = j

            # bytes of ACKs 1 .. n_i - 1, excluding ACK z_i
            sum_bytes = 0
            for j in 1 .. length(acks) - 1:
                sum_bytes = sum_bytes + acks[j].bytes
            sum_bytes = sum_bytes - acks[z].bytes

            # elapsed time of the cycle, excluding the largest gap
            duration  = acks[ length(acks) - 1 ].t - acks[0].t
            duration  = duration - max_gap

            if duration > 0 and sum_bytes > 0:
                c_i = sum_bytes / duration      # bytes per second
                ring.push(c_i)

        # start a new cycle; the closing ACK becomes its index 0
        t_i    = ack.t
        d      = srtt()
        acks   = [ (ack.t, ack.bytes) ]

on_connection_close():
    if ring is not empty:
        bw_est = max(ring)
    else:
        bw_est = UNAVAILABLE
    # Section 4.4 write conditions; cache MUST NOT be updated when:
    if bw_est == UNAVAILABLE:
        return NO_CACHE_UPDATE
    if RTTmin == UNAVAILABLE:
        return NO_CACHE_UPDATE
    if (bw_est * RTTmin) / MSS <= IW:
        return NO_CACHE_UPDATE
    return { bw_est, RTTmin }
]]></sourcecode>
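        <t>
          The final write check in the pseudocode compares the
          estimated bandwidth-delay product, expressed in segments,
          with IW.  The Python sketch below works through two
          numeric cases.  The function name should_write and the
          default values MSS = 1460 bytes and IW = 10 segments are
          assumptions chosen for the example, not values mandated by
          this document.
        </t>
        <sourcecode type="python"><![CDATA[
def should_write(bw_est, rtt_min, mss=1460, iw=10):
    """Return True if the cache may be updated.

    bw_est:  bytes per second, or None if unavailable
    rtt_min: seconds, or None if unavailable
    """
    if bw_est is None or rtt_min is None:
        return False
    bdp_segments = (bw_est * rtt_min) / mss
    return bdp_segments > iw


# 10 Mbit/s (1,250,000 bytes/s) at 50 ms: about 42.8 segments
print(should_write(1_250_000, 0.05))
# 10 Mbit/s at 10 ms: about 8.6 segments, not above IW
print(should_write(1_250_000, 0.01))
]]></sourcecode>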
      </section>

      <section anchor="gcbe-edge-cases" numbered="true">
        <name>Edge Cases</name>
        <ul spacing="normal">
          <li>If SRTT changes during a cycle, the cycle in progress
          keeps the value of d that was in effect when it began; the
          new value applies starting from the next cycle.</li>
          <li>If transmission is never suspended during a cycle (cwnd
          never limits transmission), the longest inter-arrival gap
          identified by the formula is simply one of the regular
          inter-ACK gaps, and excluding it amounts to dropping a
          single sample.</li>
          <li>If retransmissions occur within a cycle, the bytes
          newly acknowledged by an ACK (a_{i,j}) include only the
          bytes that the cumulative acknowledgment number advances
          past for the first time; bytes covered solely by SACK
          ranges <xref target="RFC2018"/> are not counted in
          a_{i,j}.  Implementations that integrate GCBE with
          modern loss-detection mechanisms such as RACK-TLP
          <xref target="RFC8985"/> need to apply the same rule.</li>
        </ul>
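        <t>
          The byte-counting rule in the last item above can be
          sketched as follows.  This is an illustration under stated
          assumptions: the function name newly_acked_bytes is
          hypothetical, snd_una plays the role of SND.UNA in
          <xref target="RFC9293"/>, and sequence numbers are compared
          modulo 2^32.  SACK blocks are not an input to the function,
          so bytes covered only by SACK ranges are never counted.
        </t>
        <sourcecode type="python"><![CDATA[
SEQ_MOD = 1 << 32          # TCP sequence numbers are 32 bits wide


def newly_acked_bytes(snd_una, ack_no):
    """Bytes by which the cumulative ACK advances past snd_una.

    Returns 0 for duplicate or stale ACKs.  Only the cumulative
    acknowledgment number is considered.
    """
    delta = (ack_no - snd_una) % SEQ_MOD
    # a forward advance lies in the first half of the sequence space
    return delta if 0 < delta < (1 << 31) else 0


print(newly_acked_bytes(1000, 2460))           # one 1460-byte segment
print(newly_acked_bytes(2460, 2460))           # duplicate ACK
print(newly_acked_bytes(SEQ_MOD - 100, 400))   # advance across wrap
]]></sourcecode>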
      </section>
    </section>

    <!-- ============================================================ -->
    <section anchor="ops" numbered="true" toc="default">
      <name>Operational Considerations</name>

      <section anchor="ops-pacing" numbered="true">
        <name>Pacing</name>
        <t>
          A Stateful-TCP sender <bcp14>MUST</bcp14> support sender
          pacing of outgoing data segments during the Hit branch of
          the Startup phase, at a rate equal to bw_est.  Without
          pacing, transmitting an enlarged initial cwnd at line
          rate is equivalent to the unpaced initial-cwnd schemes
          that motivated this work and can be expected to overflow
          the buffer at the path bottleneck.
        </t>
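        <t>
          As a rough illustration of pacing at bw_est, the Python
          sketch below spaces full-sized segments evenly so that the
          average sending rate equals bw_est.  The function name
          departure_times and the even-spacing policy are assumptions
          of this example; actual pacing implementations may schedule
          departures differently.
        </t>
        <sourcecode type="python"><![CDATA[
def departure_times(n_segments, bw_est, mss=1460, start=0.0):
    """Departure time (seconds) of each of n_segments segments.

    bw_est is in bytes per second and mss in bytes, so consecutive
    segments are spaced mss / bw_est seconds apart.
    """
    interval = mss / bw_est
    return [start + k * interval for k in range(n_segments)]


# 1000-byte segments paced at 1,000,000 bytes/s: one per millisecond
print(departure_times(3, bw_est=1_000_000, mss=1000))
]]></sourcecode>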
      </section>

      <section anchor="ops-relation-9040" numbered="true">
        <name>Relation to RFC 9040 (TCB Interdependence)</name>
        <t>
          <xref target="RFC9040"/> describes a framework for sharing
          a small set of TCB variables, including SRTT and ssthresh,
          between connections to the same host.  Stateful-TCP
          extends this framework with two additional shared items
          (bw_est and RTTmin) that are used specifically to drive
          the Startup phase of new connections.
        </t>
        <t>
          An implementation that already implements
          <xref target="RFC9040"/> <bcp14>MAY</bcp14> store the
          additional Stateful-TCP fields in the same per-destination
          state structure used for RFC 9040 sharing, provided that
          the conditions in <xref target="cache-write"/> are
          observed for the additional fields.
        </t>
      </section>

      <section anchor="ops-relation-tfo" numbered="true">
        <name>Relation to TCP Fast Open</name>
        <t>
          TCP Fast Open (TFO, <xref target="RFC7413"/>) reduces
          startup latency by carrying data in the SYN.  TFO and
          Stateful-TCP are orthogonal: TFO accelerates the first
          round-trip of a connection, while Stateful-TCP accelerates
          the bandwidth ramp-up that follows.  Implementations
          <bcp14>MAY</bcp14> enable both simultaneously.
        </t>
      </section>

      <section anchor="ops-fairness" numbered="true">
        <name>Fairness</name>
        <t>
          A Hit-branch Stateful-TCP connection ramps to its target
          rate in one RTT instead of over many RTTs.  When such a
          connection shares a bottleneck with one or more standard
          (Slow-Start) connections, the standard connections will
          experience the Stateful-TCP connection as a long-lived
          flow that is already at steady state, rather than as a
          newcomer that is still ramping up.  This shifts the
          short-term throughput share towards the Stateful-TCP
          connection.
        </t>
        <t>
          The cited evaluation (<xref target="STATEFUL-TCP"/>)
          quantifies this effect for S-Cubic and reports that, at
          the link utilizations measured, both fairness among
          Stateful-TCP connections of the same kind and friendliness
          to standard connections remain within ranges considered
          acceptable for an experimental TCP variant.  Implementers
          and operators <bcp14>SHOULD</bcp14> evaluate fairness in
          their own deployment environments before enabling
          Stateful-TCP at scale.
        </t>
      </section>

      <section anchor="ops-aged" numbered="true">
        <name>Aged Estimates</name>
        <t>
          The accuracy of a cached estimate degrades as time passes
          since the estimate was recorded, because path conditions
          may change.  Implementations <bcp14>SHOULD</bcp14> bound
          the maximum age of an entry as described in
          <xref target="cache-expiry"/>.  A cached estimate that
          exceeds the true path bandwidth at the time a Hit-branch
          connection starts will be corrected by the connection's
          ordinary congestion control response after the first RTT,
          although it may cause additional queueing or
          retransmissions during the first RTT.  An underestimate
          will reduce the bandwidth utilization of the Hit-branch
          connection but will not cause additional congestion.
        </t>
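        <t>
          A minimal sketch of an age check at lookup time is shown
          below.  The function name lookup, the max_age parameter,
          and the layout of a cache entry as a (bw_est, rtt_min,
          recorded_at) tuple are all hypothetical.  The actual expiry
          rules are those of <xref target="cache-expiry"/>.
        </t>
        <sourcecode type="python"><![CDATA[
def lookup(cache, peer_ip, now, max_age):
    """Return (bw_est, rtt_min) for peer_ip, or None on a miss.

    cache maps peer_ip -> (bw_est, rtt_min, recorded_at).  An entry
    older than max_age seconds is removed and reported as a miss.
    """
    entry = cache.get(peer_ip)
    if entry is None:
        return None
    bw_est, rtt_min, recorded_at = entry
    if now - recorded_at > max_age:
        del cache[peer_ip]
        return None
    return bw_est, rtt_min


cache = {"192.0.2.1": (1_000_000, 0.05, 100.0)}
print(lookup(cache, "192.0.2.1", now=150.0, max_age=60.0))  # fresh
print(lookup(cache, "192.0.2.1", now=200.0, max_age=60.0))  # expired
]]></sourcecode>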
      </section>
    </section>

    <!-- ============================================================ -->
    <section anchor="security" numbered="true" toc="default">
      <name>Security Considerations</name>

      <t>
        Stateful-TCP introduces sender-local state that is updated
        and consumed without any additional information on the wire.
        It does not change the TCP authentication or integrity
        properties of the connection itself.  Nevertheless, the
        introduction of cached, peer-keyed state raises several
        considerations.
      </t>

      <section anchor="sec-poisoning" numbered="true">
        <name>Cache Poisoning</name>
        <t>
          The cache is keyed on the peer_IP that the sender observes
          for the connection that wrote the entry.  An off-path
          attacker cannot directly write entries: doing so would
          require completing a TCP handshake with the sender while
          spoofing the victim peer's IP address, and an off-path
          attacker never observes the sender's SYN-ACK, so it cannot
          acknowledge the sender's initial sequence number without
          on-path capability.
        </t>
        <t>
          An on-path attacker that can complete a TCP handshake from
          a spoofed peer_IP can induce the sender to record an
          arbitrary bw_est for that peer.  The consequence on a
          subsequent connection is bounded: an inflated bw_est
          causes at most one round-trip of paced over-transmission
          to that peer before ordinary congestion control reasserts
          itself; a deflated bw_est causes at most one round-trip of
          under-utilization.  Implementations
          <bcp14>SHOULD</bcp14> consider these bounds when choosing
          cache expiry policies in environments where on-path
          attackers are part of the threat model.
        </t>
      </section>

      <section anchor="sec-dos" numbered="true">
        <name>Cache Exhaustion</name>
        <t>
          An attacker that initiates connections from many distinct
          source addresses can cause the cache to grow without bound
          unless its size is limited.  As required in
          <xref target="cache-expiry"/>, implementations
          <bcp14>MUST</bcp14> bound cache memory and
          <bcp14>SHOULD</bcp14> use an LRU or equivalent eviction
          policy.
        </t>
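        <t>
          A size-bounded cache with LRU eviction could look like the
          following Python sketch.  The class name BoundedCache, its
          methods, and the capacity parameter are hypothetical; the
          sketch shows only the eviction policy and omits the write
          and expiry conditions defined elsewhere in this document.
        </t>
        <sourcecode type="python"><![CDATA[
from collections import OrderedDict


class BoundedCache:
    """Per-destination cache holding at most `capacity` entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # least recently used first

    def put(self, peer_ip, bw_est, rtt_min):
        self.entries[peer_ip] = (bw_est, rtt_min)
        self.entries.move_to_end(peer_ip)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the LRU entry

    def get(self, peer_ip):
        entry = self.entries.get(peer_ip)
        if entry is not None:
            self.entries.move_to_end(peer_ip)  # mark as recently used
        return entry


cache = BoundedCache(capacity=2)
cache.put("192.0.2.1", 1_000_000, 0.05)
cache.put("192.0.2.2", 2_000_000, 0.02)
cache.get("192.0.2.1")                     # touch the first entry
cache.put("192.0.2.3", 3_000_000, 0.01)    # evicts 192.0.2.2
print(list(cache.entries))
]]></sourcecode>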
      </section>

    </section>

    <!-- ============================================================ -->
    <section anchor="iana" numbered="true" toc="default">
      <name>IANA Considerations</name>
      <t>This document has no IANA actions.</t>
    </section>

    <!-- ============================================================ -->
    <section anchor="ack" numbered="true" toc="default">
      <name>Acknowledgments</name>
      <t>
        The mechanism specified here was originally proposed and
        evaluated in <xref target="STATEFUL-TCP"/>.
      </t>
    </section>

  </middle>

  <back>

    <references>
      <name>References</name>

      <references>
        <name>Normative References</name>

        <reference anchor="RFC2119" target="https://www.rfc-editor.org/info/rfc2119">
          <front>
            <title>Key words for use in RFCs to Indicate Requirement Levels</title>
            <author fullname="Scott Bradner" initials="S." surname="Bradner"/>
            <date year="1997" month="March"/>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="2119"/>
          <seriesInfo name="DOI" value="10.17487/RFC2119"/>
        </reference>

        <reference anchor="RFC8174" target="https://www.rfc-editor.org/info/rfc8174">
          <front>
            <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
            <author fullname="Barry Leiba" initials="B." surname="Leiba"/>
            <date year="2017" month="May"/>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="8174"/>
          <seriesInfo name="DOI" value="10.17487/RFC8174"/>
        </reference>

        <reference anchor="RFC5681" target="https://www.rfc-editor.org/info/rfc5681">
          <front>
            <title>TCP Congestion Control</title>
            <author fullname="Mark Allman" initials="M." surname="Allman"/>
            <author fullname="Vern Paxson" initials="V." surname="Paxson"/>
            <author fullname="Ethan Blanton" initials="E." surname="Blanton"/>
            <date year="2009" month="September"/>
          </front>
          <seriesInfo name="RFC" value="5681"/>
          <seriesInfo name="DOI" value="10.17487/RFC5681"/>
        </reference>

        <reference anchor="RFC9293" target="https://www.rfc-editor.org/info/rfc9293">
          <front>
            <title>Transmission Control Protocol (TCP)</title>
            <author fullname="Wesley Eddy" initials="W." surname="Eddy" role="editor"/>
            <date year="2022" month="August"/>
          </front>
          <seriesInfo name="STD" value="7"/>
          <seriesInfo name="RFC" value="9293"/>
          <seriesInfo name="DOI" value="10.17487/RFC9293"/>
        </reference>

        <reference anchor="RFC9040" target="https://www.rfc-editor.org/info/rfc9040">
          <front>
            <title>TCP Control Block Interdependence</title>
            <author fullname="Joe Touch" initials="J." surname="Touch"/>
            <author fullname="Michael Welzl" initials="M." surname="Welzl"/>
            <author fullname="Safiqul Islam" initials="S." surname="Islam"/>
            <date year="2021" month="July"/>
          </front>
          <seriesInfo name="RFC" value="9040"/>
          <seriesInfo name="DOI" value="10.17487/RFC9040"/>
        </reference>

        <reference anchor="RFC6298" target="https://www.rfc-editor.org/info/rfc6298">
          <front>
            <title>Computing TCP's Retransmission Timer</title>
            <author fullname="Vern Paxson" initials="V." surname="Paxson"/>
            <author fullname="Mark Allman" initials="M." surname="Allman"/>
            <author fullname="Jerry Chu" initials="J." surname="Chu"/>
            <author fullname="Matt Sargent" initials="M." surname="Sargent"/>
            <date year="2011" month="June"/>
          </front>
          <seriesInfo name="RFC" value="6298"/>
          <seriesInfo name="DOI" value="10.17487/RFC6298"/>
        </reference>

        <reference anchor="RFC9438" target="https://www.rfc-editor.org/info/rfc9438">
          <front>
            <title>CUBIC for Fast and Long-Distance Networks</title>
            <author fullname="Lisong Xu" initials="L." surname="Xu"/>
            <author fullname="Sangtae Ha" initials="S." surname="Ha"/>
            <author fullname="Injong Rhee" initials="I." surname="Rhee"/>
            <author fullname="Vidhi Goel" initials="V." surname="Goel"/>
            <author fullname="Lars Eggert" initials="L." surname="Eggert" role="editor"/>
            <date year="2023" month="August"/>
          </front>
          <seriesInfo name="RFC" value="9438"/>
          <seriesInfo name="DOI" value="10.17487/RFC9438"/>
        </reference>
      </references>

      <references>
        <name>Informative References</name>

        <reference anchor="RFC2140" target="https://www.rfc-editor.org/info/rfc2140">
          <front>
            <title>TCP Control Block Interdependence</title>
            <author fullname="Joe Touch" initials="J." surname="Touch"/>
            <date year="1997" month="April"/>
          </front>
          <seriesInfo name="RFC" value="2140"/>
          <seriesInfo name="DOI" value="10.17487/RFC2140"/>
        </reference>

        <reference anchor="RFC2018" target="https://www.rfc-editor.org/info/rfc2018">
          <front>
            <title>TCP Selective Acknowledgment Options</title>
            <author fullname="Matthew Mathis" initials="M." surname="Mathis"/>
            <author fullname="Jamshid Mahdavi" initials="J." surname="Mahdavi"/>
            <author fullname="Sally Floyd" initials="S." surname="Floyd"/>
            <author fullname="Allyn Romanow" initials="A." surname="Romanow"/>
            <date year="1996" month="October"/>
          </front>
          <seriesInfo name="RFC" value="2018"/>
          <seriesInfo name="DOI" value="10.17487/RFC2018"/>
        </reference>

        <reference anchor="RFC3742" target="https://www.rfc-editor.org/info/rfc3742">
          <front>
            <title>Limited Slow-Start for TCP with Large Congestion Windows</title>
            <author fullname="Sally Floyd" initials="S." surname="Floyd"/>
            <date year="2004" month="March"/>
          </front>
          <seriesInfo name="RFC" value="3742"/>
          <seriesInfo name="DOI" value="10.17487/RFC3742"/>
        </reference>

        <reference anchor="RFC6928" target="https://www.rfc-editor.org/info/rfc6928">
          <front>
            <title>Increasing TCP's Initial Window</title>
            <author fullname="Jerry Chu" initials="J." surname="Chu"/>
            <author fullname="Nandita Dukkipati" initials="N." surname="Dukkipati"/>
            <author fullname="Yuchung Cheng" initials="Y." surname="Cheng"/>
            <author fullname="Matt Mathis" initials="M." surname="Mathis"/>
            <date year="2013" month="April"/>
          </front>
          <seriesInfo name="RFC" value="6928"/>
          <seriesInfo name="DOI" value="10.17487/RFC6928"/>
        </reference>

        <reference anchor="RFC7413" target="https://www.rfc-editor.org/info/rfc7413">
          <front>
            <title>TCP Fast Open</title>
            <author fullname="Yuchung Cheng" initials="Y." surname="Cheng"/>
            <author fullname="Jerry Chu" initials="J." surname="Chu"/>
            <author fullname="Sivasankar Radhakrishnan" initials="S." surname="Radhakrishnan"/>
            <author fullname="Arvind Jain" initials="A." surname="Jain"/>
            <date year="2014" month="December"/>
          </front>
          <seriesInfo name="RFC" value="7413"/>
          <seriesInfo name="DOI" value="10.17487/RFC7413"/>
        </reference>

        <reference anchor="RFC8985" target="https://www.rfc-editor.org/info/rfc8985">
          <front>
            <title>The RACK-TLP Loss Detection Algorithm for TCP</title>
            <author fullname="Yuchung Cheng" initials="Y." surname="Cheng"/>
            <author fullname="Neal Cardwell" initials="N." surname="Cardwell"/>
            <author fullname="Nandita Dukkipati" initials="N." surname="Dukkipati"/>
            <author fullname="Priyaranjan Jha" initials="P." surname="Jha"/>
            <date year="2021" month="February"/>
          </front>
          <seriesInfo name="RFC" value="8985"/>
          <seriesInfo name="DOI" value="10.17487/RFC8985"/>
        </reference>

        <reference anchor="RFC9406" target="https://www.rfc-editor.org/info/rfc9406">
          <front>
            <title>HyStart++: Modified Slow Start for TCP</title>
            <author fullname="Praveen Balasubramanian" initials="P." surname="Balasubramanian"/>
            <author fullname="Yi Huang" initials="Y." surname="Huang"/>
            <author fullname="Matt Olson" initials="M." surname="Olson"/>
            <date year="2023" month="May"/>
          </front>
          <seriesInfo name="RFC" value="9406"/>
          <seriesInfo name="DOI" value="10.17487/RFC9406"/>
        </reference>

        <reference anchor="STATEFUL-TCP" target="https://doi.org/10.1109/ACCESS.2020.3032208">
          <front>
            <title>Stateful-TCP - A New Approach to Accelerate TCP Slow-Start</title>
            <author fullname="Lingfeng Guo" initials="L." surname="Guo"/>
            <author fullname="Jack Y. B. Lee" initials="J. Y. B." surname="Lee"/>
            <date year="2020" month="October"/>
          </front>
          <refcontent>IEEE Access, vol. 8, pp. 195955-195970</refcontent>
          <seriesInfo name="DOI" value="10.1109/ACCESS.2020.3032208"/>
        </reference>

        <reference anchor="WESTWOOD">
          <front>
            <title>TCP Westwood: Bandwidth Estimation for Enhanced Transport over Wireless Links</title>
            <author fullname="Saverio Mascolo" initials="S." surname="Mascolo"/>
            <author fullname="Claudio Casetti" initials="C." surname="Casetti"/>
            <author fullname="Mario Gerla" initials="M." surname="Gerla"/>
            <author fullname="M. Y. Sanadidi" initials="M. Y." surname="Sanadidi"/>
            <author fullname="Ren Wang" initials="R." surname="Wang"/>
            <date year="2001" month="July"/>
          </front>
          <refcontent>Proc. ACM MobiCom, pp. 287-297</refcontent>
        </reference>

        <reference anchor="BBR">
          <front>
            <title>BBR: Congestion-Based Congestion Control</title>
            <author fullname="Neal Cardwell" initials="N." surname="Cardwell"/>
            <author fullname="Yuchung Cheng" initials="Y." surname="Cheng"/>
            <author fullname="C. Stephen Gunn" initials="C. S." surname="Gunn"/>
            <author fullname="Soheil Hassas Yeganeh" initials="S. H." surname="Yeganeh"/>
            <author fullname="Van Jacobson" initials="V." surname="Jacobson"/>
            <date year="2016" month="September"/>
          </front>
          <refcontent>ACM Queue, vol. 14, no. 5</refcontent>
        </reference>
      </references>

    </references>

  </back>

</rfc>
