<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc [
  <!ENTITY nbsp "&#160;">
  <!ENTITY zwsp "&#8203;">
  <!ENTITY nbhy "&#8209;">
  <!ENTITY wj "&#8288;">
]>

<rfc xmlns:xi="http://www.w3.org/2001/XInclude"
     category="info"
     docName="draft-gaikwad-llm-benchmarking-terminology-00"
     ipr="trust200902"
     obsoletes=""
     updates=""
     submissionType="independent"
     xml:lang="en"
     version="3">

  <front>
    <title abbrev="LLM Benchmarking Terminology">Benchmarking Terminology for Large Language Model Serving</title>
    
    <seriesInfo name="Internet-Draft" value="draft-gaikwad-llm-benchmarking-terminology-00"/>
    
    <author fullname="Madhava Gaikwad" initials="M." surname="Gaikwad">
      <organization>Independent Researcher</organization>
      <address>
        <email>gaikwad.madhav@gmail.com</email>
      </address>
    </author>
    
    <date year="2026" month="January"/>
    
    <area>General</area>
    <workgroup>Network Working Group</workgroup>
    
    <keyword>LLM</keyword>
    <keyword>benchmarking</keyword>
    <keyword>inference</keyword>
    <keyword>performance</keyword>
    <keyword>latency</keyword>
    <keyword>throughput</keyword>
    
    <abstract>
      <t>This document defines terminology for benchmarking the performance of
      Large Language Model (LLM) inference serving systems. It establishes
      a shared vocabulary for latency, throughput, resource utilization,
      and quality metrics applicable to inference engines, application
      gateways, and compound agentic systems. This document defines
      terminology only and does not prescribe benchmark methodologies or
      acceptance thresholds.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="introduction" numbered="true" toc="default">
      <name>Introduction</name>
      
      <t>Large Language Model inference serving has emerged as a distinct
      category of network service with performance characteristics unlike
      traditional request-response systems. The autoregressive generation
      process produces output tokens sequentially, creating a streaming
      response pattern where latency has multiple meaningful definitions.
      The prefill and decode phases exhibit different computational
      profiles. Memory consumption grows with sequence length due to
      key-value cache requirements.</t>
      
      <t>Large Language Model serving systems are increasingly deployed as 
      Internet-facing services, often exposed via standardized APIs and shared 
      infrastructure. Their performance characteristics influence availability, 
      fairness, and side-channel risk in multi-tenant environments. Establishing 
      consistent benchmarking terminology enables clearer communication among 
      implementers, operators, and researchers.</t>
      
      <t>Despite widespread deployment of LLM serving systems, no standard
      terminology exists for describing their performance. Different
      implementations, benchmarks, and academic publications use
      inconsistent definitions for terms such as "throughput," "latency,"
      and "tokens per second." This inconsistency hinders meaningful
      comparison across systems and creates confusion for practitioners.</t>
      
      <t>This document addresses the terminology gap by providing precise
      definitions for LLM serving performance metrics. The structure and
      approach follow <xref target="RFC2647"/>, which established benchmarking
      terminology for firewall performance. Each term includes a
      definition, discussion of context and implementation considerations,
      unit of measurement, open issues, and cross-references to related
      terms.</t>
      
      <t>This document defines terminology only. It does not specify benchmark
      methodologies, workload profiles, or acceptance criteria. Companion
      documents may address those topics.</t>
      
      <t>The metrics in this document apply to transformer-based autoregressive
      language models. Other model architectures such as diffusion models,
      encoder-only models, or non-autoregressive decoders require different
      terminology not covered here.</t>
    </section>

    <section anchor="scope" numbered="true" toc="default">
      <name>Scope and System Under Test Boundary</name>
      
      <t>A prerequisite for benchmarking is defining the System Under Test
      (SUT) boundary. The same metric measured at different boundaries
      yields different values. This document identifies three SUT boundary
      categories:</t>
      
      <dl>
        <dt>Model Engine:</dt>
        <dd>The inference runtime executing model forward passes. This
        boundary excludes network transport, authentication, request
        routing, and safety filtering. Metrics at this boundary reflect
        raw inference capability.</dd>
        
        <dt>Application Gateway:</dt>
        <dd>The model engine plus request handling, authentication, rate
        limiting, input validation, output filtering, and safety
        mechanisms. Metrics at this boundary reflect the performance
        users observe when calling an API endpoint.</dd>
        
        <dt>Compound System:</dt>
        <dd>An application gateway plus orchestration logic, retrieval
        components, tool execution, and multi-step reasoning. Metrics at
        this boundary reflect end-to-end task completion performance for
        agentic or retrieval-augmented workloads.</dd>
      </dl>
      
      <t>Testers <bcp14>MUST</bcp14> declare the SUT boundary when reporting metrics. Metrics
      from different SUT boundaries <bcp14>MUST NOT</bcp14> be directly compared without
      adjustment.</t>
      
      <t>The measurement point within the SUT also affects results. For
      latency metrics, testers <bcp14>MUST</bcp14> specify whether measurement occurs at
      the client, at the network edge, or within the serving infrastructure.</t>
    </section>

    <section anchor="requirements" numbered="true" toc="default">
      <name>Requirements Language</name>
      
      <t>The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>", "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", 
      "<bcp14>SHALL NOT</bcp14>", "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>", "<bcp14>RECOMMENDED</bcp14>", 
      "<bcp14>NOT RECOMMENDED</bcp14>", "<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this document are 
      to be interpreted as described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> 
      when, and only when, they appear in all capitals, as shown here.</t>
    </section>

    <section anchor="terminology" numbered="true" toc="default">
      <name>Terminology</name>
      
      <section anchor="request-response-timing" numbered="true" toc="default">
        <name>Request and Response Timing Metrics</name>
        
        <section anchor="end-to-end-latency" numbered="true" toc="default">
          <name>End-to-End Latency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The elapsed time between request initiation by a client and
            receipt of the complete response by that client.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>End-to-end latency encompasses all processing stages: network
            transmission, queuing, prefill computation, decode iterations,
            output filtering, and return transmission. For streaming
            responses, end-to-end latency is measured until the final token
            is received.</t>
            
            <t>End-to-end latency depends on output length. Longer responses
            require more decode iterations and produce higher latency. When
            comparing systems, testers <bcp14>SHOULD</bcp14> control for output length or
            report latency normalized by output token count.</t>
            
            <t>Client-side measurement includes network round-trip time.
            Server-side measurement excludes external network latency but
            may still include internal network hops within a distributed
            serving system.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms) or seconds (s)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Measurement point (client-side vs server-side)</li>
              <li>Treatment of failed or truncated responses</li>
              <li>Clock synchronization for distributed measurement</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Time to First Token (<xref target="ttft"/>),
            Time per Output Token (<xref target="tpot"/>),
            Decode Latency (<xref target="decode-latency"/>)</dd>
          </dl>
        </section>

        <section anchor="ttft" numbered="true" toc="default">
          <name>Time to First Token</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The elapsed time between request initiation and receipt of the
            first output token.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Time to First Token (TTFT) measures how long a user waits before
            any response content appears. For interactive applications, TTFT
            determines perceived responsiveness independent of total response
            length.</t>
            
            <t>TTFT includes network transmission, authentication, admission
            control, queue wait time, and prefill computation. Under low load
            with short prompts, prefill dominates TTFT. Under high load,
            queue wait time may dominate. With long prompts, prefill
            computation scales with input token count.</t>
            
            <t>For non-streaming responses, TTFT equals end-to-end latency
            because all tokens arrive together. Testers <bcp14>SHOULD</bcp14> specify
            whether the response mode is streaming or non-streaming.</t>
            
            <t>Some systems emit an empty or whitespace-only first token before
            substantive content. Testers <bcp14>MUST</bcp14> specify whether TTFT measures
            time to any token or time to first non-empty token.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Streaming vs non-streaming response modes</li>
              <li>Definition of "first token" when initial tokens lack content</li>
              <li>Client-side vs server-side measurement point</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Prefill Latency (<xref target="prefill-latency"/>),
            Queue Wait Time (<xref target="queue-wait-time"/>),
            End-to-End Latency (<xref target="end-to-end-latency"/>)</dd>
          </dl>
        </section>
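<t>The measurement choices above (any token vs. first non-empty token) can be
made concrete with a small sketch. The chunk format and function name below are
illustrative, not drawn from any particular serving API: the sketch assumes the
client records an arrival timestamp for each streamed token.</t>

```python
def measure_ttft(chunks, start_time):
    """Compute two TTFT variants from streamed (arrival_time, token) pairs.

    Returns (ttft_any, ttft_nonempty): time to the first token of any kind,
    and time to the first token with non-whitespace content. Either value
    is None if no qualifying token arrived.
    """
    ttft_any = None
    ttft_nonempty = None
    for arrival, token in chunks:
        if ttft_any is None:
            ttft_any = arrival - start_time
        if ttft_nonempty is None and token.strip():
            ttft_nonempty = arrival - start_time
        if ttft_any is not None and ttft_nonempty is not None:
            break
    return ttft_any, ttft_nonempty

# Example: an empty first chunk arrives before substantive content,
# so the two TTFT definitions disagree by 50 ms.
start = 100.0
stream = [(100.250, ""), (100.300, "Hello"), (100.340, " world")]
ttft_any, ttft_nonempty = measure_ttft(stream, start)
```

<t>On this stream the two definitions yield 250 ms and 300 ms respectively,
which is why a report quoting only "TTFT" without the definitional choice is
ambiguous.</t>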

        <section anchor="itl" numbered="true" toc="default">
          <name>Inter-Token Latency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The elapsed time between consecutive output token emissions,
            measured at the server.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Inter-Token Latency (ITL) measures the generation interval between
            adjacent tokens during the decode phase. ITL reflects decode
            efficiency and is affected by batch size, model architecture, and
            memory bandwidth.</t>
            
            <t>ITL varies across tokens within a single request due to batching
            dynamics. When other requests join or leave the batch, the
            per-request compute allocation changes. Testers <bcp14>SHOULD</bcp14> report ITL
            distribution statistics rather than a single value.</t>
            
            <t>ITL is measured server-side and excludes network transmission
            delay. For client-observed intervals, see Time Between Tokens
            (<xref target="tbt"/>).</t>
            
            <t>When aggregating ITL across requests, token-weighted averaging
            counts each token equally. Request-weighted averaging counts each
            request equally regardless of length. These methods yield
            different results. Testers <bcp14>MUST</bcp14> specify the aggregation method.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Token-weighted vs request-weighted aggregation</li>
              <li>Variation within a single request due to batching</li>
              <li>Exclusion of prefill-to-first-token interval</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Time Between Tokens (<xref target="tbt"/>),
            Time per Output Token (<xref target="tpot"/>),
            Batch Utilization (<xref target="batch-utilization"/>)</dd>
          </dl>
        </section>
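<t>The divergence between the two aggregation methods can be illustrated with a
short sketch; the function name and data layout are hypothetical, assuming the
tester has collected per-interval ITL samples for each request.</t>

```python
def aggregate_itl(per_request_itls):
    """Aggregate inter-token latencies (ms) across requests two ways.

    per_request_itls: list of lists; each inner list holds the ITL samples
    (one per decode interval) for one request.

    Token-weighted: every interval counts equally, so long requests dominate.
    Request-weighted: each request contributes its own mean, counted once.
    """
    all_intervals = [itl for req in per_request_itls for itl in req]
    token_weighted = sum(all_intervals) / len(all_intervals)
    per_request_means = [sum(req) / len(req) for req in per_request_itls]
    request_weighted = sum(per_request_means) / len(per_request_means)
    return token_weighted, request_weighted

# A long slow request and a short fast one illustrate the divergence:
# token-weighted 46.0 ms vs request-weighted 30.0 ms for the same data.
requests = [[50.0] * 9, [10.0]]   # 9 intervals at 50 ms; 1 interval at 10 ms
tw, rw = aggregate_itl(requests)
```

<t>Because the two methods disagree by more than 50% on this small example,
reports that omit the aggregation method are not comparable.</t>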

        <section anchor="tbt" numbered="true" toc="default">
          <name>Time Between Tokens</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The elapsed time between receipt of consecutive output tokens at
            the client.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Time Between Tokens (TBT) measures the client-observed interval
            between adjacent tokens. TBT is ITL as perturbed by network
            transmission delay and buffering effects.</t>
            
            <t>Network jitter, TCP buffering, and intermediary proxies cause TBT
            to differ from ITL. Multiple tokens may arrive in a single network
            packet, producing near-zero TBT followed by a longer gap.</t>
            
            <t>TBT directly affects user-perceived streaming smoothness. High TBT
            variance creates a "stuttering" appearance even when average TBT
            is acceptable.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Network buffering causing token bunching</li>
              <li>Proxy and CDN effects on delivery timing</li>
              <li>Measurement requires client-side instrumentation</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Inter-Token Latency (<xref target="itl"/>),
            Token Delivery Jitter (<xref target="token-delivery-jitter"/>)</dd>
          </dl>
        </section>

        <section anchor="tpot" numbered="true" toc="default">
          <name>Time per Output Token</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd><t>The average time to generate each output token after the first
            token, computed as:</t>
            <t>TPOT = (End-to-End Latency - TTFT) / (Output Token Count - 1)</t></dd>
            
            <dt>Discussion:</dt>
            <dd><t>Time per Output Token (TPOT) summarizes decode-phase performance
            in a single value. Unlike ITL, which measures each interval, TPOT
            averages across all decode steps for a request.</t>
            
            <t>TPOT excludes the first token because TTFT captures that interval
            separately. For single-token outputs, TPOT is undefined.</t>
            
            <t>TPOT is request-weighted by construction: each request contributes
            one TPOT value regardless of output length. When aggregating
            across requests, report the distribution rather than only the
            mean.</t>
            
            <t>TPOT relates to user-perceived generation speed. A TPOT of 50ms
            corresponds to 20 tokens per second, approximately 900 words per
            minute for English text.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds per token (ms/token)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Undefined for single-token outputs</li>
              <li>Request-weighted aggregation differs from token-weighted ITL</li>
              <li>Denominator uses (Output Token Count - 1)</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Inter-Token Latency (<xref target="itl"/>),
            End-to-End Latency (<xref target="end-to-end-latency"/>),
            Time to First Token (<xref target="ttft"/>)</dd>
          </dl>
        </section>
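<t>The TPOT formula above can be checked with a minimal helper; the function
name is illustrative. Note the guard for single-token outputs, where the
denominator would be zero and TPOT is undefined.</t>

```python
def tpot_ms(e2e_latency_ms, ttft_ms, output_token_count):
    """Time per Output Token: (E2E - TTFT) / (N - 1), per the definition above.

    Returns None for single-token outputs, where TPOT is undefined.
    """
    if output_token_count < 2:
        return None
    return (e2e_latency_ms - ttft_ms) / (output_token_count - 1)

# A 5.2 s response with 200 ms TTFT and 101 output tokens:
# (5200 - 200) / 100 = 50 ms/token, i.e. 20 tokens per second.
value = tpot_ms(5200.0, 200.0, 101)
```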

        <section anchor="normalized-latency" numbered="true" toc="default">
          <name>Normalized Latency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>End-to-end latency divided by output token count, yielding a
            length-independent latency measure.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Normalized latency enables comparison across requests with
            different output lengths. It amortizes fixed overhead (TTFT)
            across all tokens.</t>
            
            <t>For short outputs, TTFT dominates normalized latency. For long
            outputs, TPOT dominates. Testers <bcp14>SHOULD</bcp14> report output length
            distribution alongside normalized latency to enable
            interpretation.</t>
            
            <t>Normalized latency obscures user-facing behavior because users
            experience TTFT and TPOT separately. Testers <bcp14>SHOULD NOT</bcp14> report
            normalized latency as the sole latency metric.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds per token (ms/token)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Obscures distinction between TTFT and decode latency</li>
              <li>Sensitive to output length distribution</li>
              <li>Not directly interpretable as user experience</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>End-to-End Latency (<xref target="end-to-end-latency"/>),
            Time per Output Token (<xref target="tpot"/>)</dd>
          </dl>
        </section>
      </section>

      <section anchor="phase-specific-latency" numbered="true" toc="default">
        <name>Phase-Specific Latency Metrics</name>
        
        <section anchor="prefill-latency" numbered="true" toc="default">
          <name>Prefill Latency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The elapsed time to process input tokens and compute the initial
            key-value cache prior to generating the first output token.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Prefill performs a forward pass over all input tokens to populate
            the key-value cache. This computation is parallelizable across
            the input sequence and is compute-bound on current hardware.</t>
            
            <t>Prefill latency scales approximately linearly with input token
            count for uncached inputs. With prefix caching enabled, prefill
            processes only the uncached suffix, reducing latency for requests
            sharing common prefixes.</t>
            
            <t>Prefill latency is a component of TTFT. Other TTFT components
            include queue wait time, network latency, and authentication
            overhead.</t>
            
            <t>Chunked prefill implementations split long prefills into smaller
            segments interleaved with decode steps. This reduces head-of-line
            blocking but increases total prefill time. Testers <bcp14>MUST</bcp14> specify
            whether chunked prefill is enabled.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Chunked vs monolithic prefill execution</li>
              <li>Prefix caching effects on effective prefill length</li>
              <li>Distinction from TTFT in reporting</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Time to First Token (<xref target="ttft"/>),
            Prefix Cache Hit Rate (<xref target="prefix-cache-hit-rate"/>),
            Head-of-Line Blocking (<xref target="hol-blocking"/>)</dd>
          </dl>
        </section>

        <section anchor="decode-latency" numbered="true" toc="default">
          <name>Decode Latency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The cumulative elapsed time spent generating output tokens after
            the first token.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Decode latency equals (Output Token Count - 1) multiplied by
            average ITL. It measures total time in the decode phase.</t>
            
            <t>Decode is memory-bandwidth-bound on current hardware because each
            step reads the full model weights and growing key-value cache
            while producing a single token.</t>
            
            <t>Decode latency grows linearly with output token count under
            constant batching conditions. Variable batch membership during
            generation causes decode latency to deviate from simple linear
            scaling.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms) or seconds (s)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Variation due to dynamic batching</li>
              <li>Growth of key-value cache during decode</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Inter-Token Latency (<xref target="itl"/>),
            End-to-End Latency (<xref target="end-to-end-latency"/>),
            Prefill Latency (<xref target="prefill-latency"/>)</dd>
          </dl>
        </section>
      </section>

      <section anchor="throughput-capacity" numbered="true" toc="default">
        <name>Throughput and Capacity Metrics</name>
        
        <section anchor="output-token-throughput" numbered="true" toc="default">
          <name>Output Token Throughput</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The number of output tokens generated per second by the system
            across all concurrent requests.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Output token throughput measures system-wide generation capacity.
            It increases with offered load until the system saturates, then
            plateaus or declines.</t>
            
            <t>Throughput measurement requires specifying the token counting
            method. Subword tokenization produces different counts than
            word-level tokenization for the same text. Testers <bcp14>MUST</bcp14> identify
            the tokenizer used.</t>
            
            <t>Some implementations pad sequences to uniform length within a
            batch. Testers <bcp14>MUST</bcp14> specify whether throughput counts include or
            exclude padding tokens.</t>
            
            <t>Throughput measurement requires a defined time window. Short
            windows capture instantaneous throughput. Longer windows smooth
            over load variations. Testers <bcp14>MUST</bcp14> specify the measurement window
            duration.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>tokens per second (tok/s)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Tokenizer-dependent token counts</li>
              <li>Inclusion or exclusion of padding tokens</li>
              <li>Measurement window duration</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Input Token Throughput (<xref target="input-token-throughput"/>),
            Request Throughput (<xref target="request-throughput"/>),
            Non-Padding Token Throughput (<xref target="non-padding-throughput"/>)</dd>
          </dl>
        </section>
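<t>A minimal sketch of windowed throughput measurement follows; the event
format is an assumption (one timestamped emission event per token batch), and
half-open window semantics are one of several defensible choices a tester would
need to declare.</t>

```python
def output_token_throughput(completions, window_start, window_end):
    """System-wide output tokens per second over a fixed measurement window.

    completions: list of (emit_time_s, token_count) pairs, one per emission
    event; only tokens emitted inside [window_start, window_end) are counted.
    """
    window = window_end - window_start
    tokens = sum(n for t, n in completions if window_start <= t < window_end)
    return tokens / window

# Three emission events; the last falls outside the 2-second window,
# giving (40 + 60) / 2 = 50 tok/s.
events = [(0.5, 40), (1.5, 60), (2.5, 80)]
tps = output_token_throughput(events, 0.0, 2.0)
```

<t>Shifting the window by one second would change the result substantially,
which is why the measurement window duration and placement must be reported.</t>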

        <section anchor="input-token-throughput" numbered="true" toc="default">
          <name>Input Token Throughput</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The number of input tokens processed per second by the system
            across all concurrent requests.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Input token throughput measures prefill capacity. Systems with
            efficient batched prefill achieve higher input throughput than
            those processing prompts sequentially.</t>
            
            <t>Input throughput differs from output throughput because prefill
            and decode have different computational characteristics. A system
            optimized for long-context prefill may show high input throughput
            but lower output throughput, and vice versa.</t>
            
            <t>Prefix caching affects input throughput measurement. With caching,
            input tokens divide into cache hits (not processed) and cache
            misses (processed). Testers <bcp14>MUST</bcp14> specify whether input throughput
            counts all input tokens or only cache misses.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>tokens per second (tok/s)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Treatment of cached vs processed input tokens</li>
              <li>Different from output throughput due to phase characteristics</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Output Token Throughput (<xref target="output-token-throughput"/>),
            Prefill Latency (<xref target="prefill-latency"/>),
            Prefix Cache Hit Rate (<xref target="prefix-cache-hit-rate"/>)</dd>
          </dl>
        </section>

        <section anchor="request-throughput" numbered="true" toc="default">
          <name>Request Throughput</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The number of requests completed per second.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Request throughput counts completed requests regardless of token
            counts. A system completing many short requests achieves higher
            request throughput than one completing fewer long requests, even
            at equal token throughput.</t>
            
            <t>Request throughput is relevant for capacity planning when requests
            have predictable lengths or when per-request overhead dominates.</t>
            
            <t>The treatment of failed requests affects the count. Testers <bcp14>MUST</bcp14>
            specify whether request throughput includes only successful
            completions or also counts failed requests.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>requests per second (req/s)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Treatment of failed or truncated requests</li>
              <li>Sensitivity to request length distribution</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Output Token Throughput (<xref target="output-token-throughput"/>),
            End-to-End Latency (<xref target="end-to-end-latency"/>)</dd>
          </dl>
        </section>

        <section anchor="non-padding-throughput" numbered="true" toc="default">
          <name>Non-Padding Token Throughput</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>Output token throughput excluding padding or alignment tokens.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Batched inference pads sequences to uniform length. Padding tokens
            consume compute but carry no information. Non-padding throughput
            measures useful work.</t>
            
            <t>Systems using variable-length batching or continuous batching may
            avoid padding entirely. For these systems, non-padding throughput
            equals total output throughput.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>tokens per second (tok/s)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Applicable only to padded batching schemes</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Output Token Throughput (<xref target="output-token-throughput"/>),
            Batch Utilization (<xref target="batch-utilization"/>)</dd>
          </dl>
        </section>

        <section anchor="offered-load" numbered="true" toc="default">
          <name>Offered Load</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The request arrival rate or concurrency level imposed on the
            system by the workload generator.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Offered load characterizes workload intensity. Two forms exist:</t>
            
            <t>Open-loop load specifies a request arrival rate (requests per
            second) independent of system response. New requests arrive
            according to a specified distribution regardless of outstanding
            request count.</t>
            
            <t>Closed-loop load specifies a fixed concurrency level. Each
            completed request triggers a new request, maintaining constant
            outstanding requests.</t>
            
            <t>Open-loop load reveals system behavior under overload. Closed-loop
            load cannot exceed system capacity by construction. Testers <bcp14>MUST</bcp14>
            specify the load model and its parameters.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>requests per second (req/s) for open-loop; concurrent requests for closed-loop</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Open-loop vs closed-loop model selection</li>
              <li>Arrival distribution for open-loop (Poisson, uniform, bursty)</li>
              <li>Think time between requests for closed-loop</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Sustainable Load (<xref target="sustainable-load"/>),
            Queue Depth (<xref target="queue-depth"/>)</dd>
          </dl>
        </section>
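<t>The two load models can be sketched as follows. The open-loop generator
draws Poisson arrivals independent of system behavior; the closed-loop model is
simplified to a constant service time to show that its completion rate is
capped at concurrency divided by service time. Function names and parameters
are illustrative.</t>

```python
import random

def open_loop_arrivals(rate_rps, duration_s, seed=0):
    """Open-loop load: Poisson arrivals at a fixed rate, independent of the
    system's responses. Returns the list of arrival timestamps."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_rps)   # exponential inter-arrival gaps
        if t >= duration_s:
            return arrivals
        arrivals.append(t)

def closed_loop_completions(concurrency, service_time_s, duration_s):
    """Closed-loop load: a fixed number of workers, each issuing a new request
    the moment the previous one completes. With a constant service time the
    completion rate is exactly concurrency / service_time, so offered load
    can never exceed what the system sustains."""
    completions = 0
    t = service_time_s
    while t <= duration_s:
        completions += concurrency
        t += service_time_s
    return completions

arrivals = open_loop_arrivals(rate_rps=10.0, duration_s=5.0)
done = closed_loop_completions(concurrency=4, service_time_s=0.5, duration_s=5.0)
```

<t>The open-loop generator keeps issuing requests even if the system stalls,
which is what exposes overload behavior; the closed-loop worker pool cannot.</t>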

        <section anchor="sustainable-load" numbered="true" toc="default">
          <name>Sustainable Load</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The maximum offered load at which the system continues to meet
            declared service objectives.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Sustainable load identifies the operating region boundary. Below
            sustainable load, the system meets latency and quality targets.
            Above sustainable load, latency grows without bound or requests
            fail.</t>
            
            <t>Sustainable load depends on the service objectives. Stricter
            latency targets yield lower sustainable load. Testers <bcp14>MUST</bcp14>
            declare the service objectives when reporting sustainable load.</t>
            
            <t>Sustainable load also depends on workload characteristics. Longer
            prompts or outputs reduce sustainable load. Testers <bcp14>MUST</bcp14>
            characterize the workload profile.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>requests per second (req/s) or concurrent requests</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Requires declared service objectives</li>
              <li>Sensitive to workload characteristics</li>
              <li>May differ for different SLO percentiles</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Offered Load (<xref target="offered-load"/>),
            Service Level Objective (<xref target="slo"/>),
            SLO Attainment Rate (<xref target="slo-attainment"/>)</dd>
          </dl>
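          <t>As a non-normative illustration, sustainable load can be located
          by searching over offered load for the highest rate at which a trial
          still meets the declared objectives. The sketch below assumes a
          hypothetical harness hook, here called run_trial, and assumes SLO
          attainment does not improve as load increases:</t>

          <sourcecode type="python"><![CDATA[
def find_sustainable_load(run_trial, lo_rps, hi_rps, tol_rps=0.5):
    # Binary-search the highest offered load (req/s) at which the
    # system still meets its declared service objectives.
    # run_trial(rate_rps) -> bool is a hypothetical hook that drives
    # the system at rate_rps and returns True iff all declared SLOs
    # (e.g. P99 TTFT, error rate) were met during the trial.
    best = None
    while hi_rps - lo_rps > tol_rps:
        mid = (lo_rps + hi_rps) / 2.0
        if run_trial(mid):
            best, lo_rps = mid, mid  # objectives met: probe higher
        else:
            hi_rps = mid             # objectives missed: back off
    return best

# Toy stand-in: pretend the system meets its SLOs up to 42 req/s.
cap = find_sustainable_load(lambda r: r <= 42.0, lo_rps=1.0, hi_rps=100.0)
]]></sourcecode>

          <t>Each trial must run long enough to reach steady state; a search
          of this kind inherits the workload and SLO declarations described
          above.</t>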
        </section>
      </section>

      <section anchor="latency-distribution" numbered="true" toc="default">
        <name>Latency Distribution Metrics</name>
        
        <section anchor="latency-percentiles" numbered="true" toc="default">
          <name>Latency Percentiles</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>Values below which a specified percentage of observations fall,
            reported as P50, P90, P95, P99, and P99.9 for latency metrics.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Mean latency obscures distribution shape. A system with low mean
            but high variance provides inconsistent user experience.
            Percentiles characterize the distribution.</t>
            
            <t>P50 (median) indicates typical experience. P99 indicates
            worst-case experience for most users. P99.9 indicates extreme
            tail behavior relevant for high-volume services.</t>
            
            <t>Percentile computation requires sufficient sample size. As a
            rule of thumb, P99 requires at least 1,000 samples and P99.9 at
            least 10,000. Testers <bcp14>MUST</bcp14> report sample size alongside
            percentiles.</t>
            
            <t>Percentiles apply to TTFT, TPOT, ITL, and end-to-end latency.
            Testers <bcp14>SHOULD</bcp14> report percentiles for multiple metrics rather
            than a single summary.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>Same as the underlying latency metric (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Sample size requirements for tail percentiles</li>
              <li>Appropriate percentile selection for use case</li>
              <li>Confidence intervals for reported percentiles</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>End-to-End Latency (<xref target="end-to-end-latency"/>),
            Time to First Token (<xref target="ttft"/>),
            Time per Output Token (<xref target="tpot"/>)</dd>
          </dl>
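          <t>As a non-normative illustration, the sketch below computes
          nearest-rank percentiles and applies the sample-size rule of thumb
          from the discussion, omitting tail percentiles that the sample
          cannot support (the specific cutoffs are illustrative
          assumptions):</t>

          <sourcecode type="python"><![CDATA[
import math

def percentile(samples, p):
    # Nearest-rank percentile: the smallest value with at least p%
    # of observations at or below it.
    if not samples:
        raise ValueError("no samples")
    s = sorted(samples)
    rank = math.ceil(p / 100.0 * len(s))
    return s[max(rank, 1) - 1]

def report_percentiles(latencies_ms):
    # Report P50/P90/P95/P99(/P99.9), dropping tail percentiles the
    # sample size cannot support (>=1000 for P99, >=10000 for P99.9).
    n = len(latencies_ms)
    wanted = [50, 90, 95, 99] + ([99.9] if n >= 10000 else [])
    if n < 1000:
        wanted = [p for p in wanted if p <= 95]
    return {"n": n, **{f"P{p}": percentile(latencies_ms, p) for p in wanted}}

stats = report_percentiles([float(i) for i in range(1, 2001)])  # 1..2000 ms
]]></sourcecode>

          <t>Reporting the sample size n alongside the percentiles, as the
          dictionary above does, lets readers judge how trustworthy the tail
          values are.</t>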
        </section>

        <section anchor="latency-distribution-full" numbered="true" toc="default">
          <name>Latency Distribution</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The complete statistical distribution of latency observations,
            represented as a histogram or cumulative distribution function.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Full distributions reveal structure that percentiles miss.
            Multimodal distributions indicate distinct operating regimes.
            Heavy tails indicate outlier sensitivity.</t>
            
            <t>Histogram bin width affects resolution. Narrow bins reveal detail
            but require more samples. Testers <bcp14>SHOULD</bcp14> use logarithmic binning
            for latency distributions spanning multiple orders of magnitude.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>Histogram counts or cumulative probability</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Bin width selection</li>
              <li>Sample size for distribution estimation</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Latency Percentiles (<xref target="latency-percentiles"/>)</dd>
          </dl>
        </section>

        <section anchor="token-delivery-jitter" numbered="true" toc="default">
          <name>Token Delivery Jitter</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The variance or standard deviation of inter-token intervals within
            a single request.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Jitter measures streaming smoothness. Low jitter indicates
            consistent token pacing. High jitter indicates irregular delivery
            that users perceive as stuttering.</t>
            
            <t>Jitter arises from batching dynamics, memory bandwidth contention,
            and garbage collection pauses. Systems with continuous batching
            show higher jitter than static batching due to variable batch
            membership.</t>
            
            <t>Jitter is computed per request and then aggregated. Testers
            <bcp14>SHOULD</bcp14> report the distribution of per-request jitter
            values.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms) as standard deviation</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Aggregation method across requests</li>
              <li>Separating server-side jitter from network jitter</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Inter-Token Latency (<xref target="itl"/>),
            Time Between Tokens (<xref target="tbt"/>)</dd>
          </dl>
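          <t>As a non-normative illustration, per-request jitter can be
          computed from token arrival timestamps and then aggregated across
          requests; the timestamps below are invented examples:</t>

          <sourcecode type="python"><![CDATA[
import statistics

def request_jitter_ms(token_timestamps_ms):
    # Per-request jitter: standard deviation of the inter-token
    # intervals within one streamed response.
    gaps = [b - a for a, b in zip(token_timestamps_ms, token_timestamps_ms[1:])]
    if len(gaps) < 2:
        return 0.0
    return statistics.stdev(gaps)

# Aggregate per-request jitter across requests, then report its
# distribution (here just the median, as an illustration).
per_request = [request_jitter_ms(ts) for ts in [
    [0, 25, 50, 75, 100],   # perfectly paced: jitter 0
    [0, 10, 60, 70, 120],   # bursty delivery: high jitter
]]
median_jitter = statistics.median(per_request)
]]></sourcecode>

          <t>Note that both example requests deliver five tokens in 100 ms,
          so mean ITL alone cannot distinguish them; only the jitter does.</t>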
        </section>

        <section anchor="max-pause-duration" numbered="true" toc="default">
          <name>Maximum Pause Duration</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The longest inter-token interval observed within a single request.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Maximum pause captures the worst interruption in streaming output.
            A single long pause degrades user experience even when average ITL
            is acceptable.</t>
            
            <t>Long pauses arise from garbage collection, KV cache operations,
            batch recomputation after preemption, and request scheduling
            delays.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Distinguishing generation pauses from network delays</li>
              <li>Threshold for "pause" vs normal variation</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Inter-Token Latency (<xref target="itl"/>),
            Token Delivery Jitter (<xref target="token-delivery-jitter"/>),
            Preemption Recovery Latency (<xref target="preemption-recovery-latency"/>)</dd>
          </dl>
        </section>
      </section>

      <section anchor="scheduling-multitenancy" numbered="true" toc="default">
        <name>Scheduling and Multi-Tenancy Metrics</name>
        
        <section anchor="hol-blocking" numbered="true" toc="default">
          <name>Head-of-Line Blocking</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The additional delay experienced by short requests when scheduled
            behind long requests in a shared queue or batch.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>First-come-first-served scheduling causes head-of-line (HOL)
            blocking. A long prefill operation delays all subsequent requests
            regardless of their size.</t>
            
            <t>Chunked prefill mitigates HOL blocking by limiting the maximum
            uninterruptible computation. Shortest-job-first scheduling reduces
            HOL blocking for short requests but requires output length
            prediction.</t>
            
            <t>HOL blocking is measured as the difference between observed
            latency and latency under an idealized scheduler with no blocking.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Requires baseline comparison to quantify</li>
              <li>Depends on workload length distribution</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Prefill Latency (<xref target="prefill-latency"/>),
            Queue Wait Time (<xref target="queue-wait-time"/>),
            Fairness Index (<xref target="fairness-index"/>)</dd>
          </dl>
        </section>

        <section anchor="queue-depth" numbered="true" toc="default">
          <name>Queue Depth</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The number of requests waiting for service at a measurement
            instant.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Queue depth indicates load relative to capacity. Growing queue
            depth signals approaching overload. Stable queue depth indicates
            balanced load.</t>
            
            <t>Queue depth has multiple measurement points: admission queue,
            prefill queue, decode batch wait queue. Testers <bcp14>MUST</bcp14> specify the
            queue measured.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>requests (count)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Multiple queue stages in serving systems</li>
              <li>Instantaneous vs time-averaged measurement</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Offered Load (<xref target="offered-load"/>),
            Queue Wait Time (<xref target="queue-wait-time"/>)</dd>
          </dl>
        </section>

        <section anchor="queue-wait-time" numbered="true" toc="default">
          <name>Queue Wait Time</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The elapsed time between request arrival and scheduling for
            processing.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Queue wait time is a component of TTFT representing time spent
            waiting rather than computing. Under low load, queue wait
            approaches zero. Under high load, queue wait dominates TTFT.</t>
            
            <t>Systems with multiple queues (admission, prefill, decode) have
            corresponding wait times. Total queue wait is the sum across
            stages.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Multiple queue stages</li>
              <li>Inclusion of admission control delay</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Time to First Token (<xref target="ttft"/>),
            Queue Depth (<xref target="queue-depth"/>),
            Admission Rate (<xref target="admission-rate"/>)</dd>
          </dl>
        </section>

        <section anchor="fairness-index" numbered="true" toc="default">
          <name>Fairness Index</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>A measure of equity in latency or throughput across concurrent
            requests or tenants.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Jain's Fairness Index quantifies allocation equality:</t>
            
            <t>J(x) = (sum(x_i))^2 / (n * sum(x_i^2))</t>
            
            <t>where x_i is the allocation to request or tenant i, and n is the
            count. J ranges from 1/n (maximally unfair) to 1 (perfectly fair).</t>
            
            <t>Fairness applies to latency (lower is better, so use reciprocal),
            throughput, or SLO attainment. Testers <bcp14>MUST</bcp14> specify the measured
            quantity.</t>
            
            <t>Multi-tenant systems require per-tenant fairness measurement.
            Single-tenant systems measure fairness across concurrent requests.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>dimensionless, range [1/n, 1]</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Choice of measured quantity (latency, throughput, SLO attainment)</li>
              <li>Tenant definition in multi-tenant systems</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Head-of-Line Blocking (<xref target="hol-blocking"/>),
            SLO Attainment Rate (<xref target="slo-attainment"/>)</dd>
          </dl>
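          <t>As a non-normative illustration, Jain's Fairness Index as defined
          above can be computed directly; the per-tenant throughput values are
          invented examples:</t>

          <sourcecode type="python"><![CDATA[
def jain_fairness(allocations):
    # Jain's Fairness Index: J(x) = (sum x_i)^2 / (n * sum x_i^2).
    # 1.0 is perfectly fair; 1/n is maximally unfair.
    n = len(allocations)
    total = sum(allocations)
    return total * total / (n * sum(x * x for x in allocations))

# For latency, lower is better, so apply the index to reciprocals
# (1/latency) rather than to the raw latencies.
equal  = [100.0, 100.0, 100.0, 100.0]    # tok/s per tenant
skewed = [400.0, 0.0001, 0.0001, 0.0001]

fair_j   = jain_fairness(equal)    # equal shares: J = 1.0
unfair_j = jain_fairness(skewed)   # one tenant dominates: J near 1/4
]]></sourcecode>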
        </section>

        <section anchor="batch-utilization" numbered="true" toc="default">
          <name>Batch Utilization</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The ratio of active tokens processed per batch to the maximum
            batch capacity.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Static batching pads sequences to uniform length. Batch
            utilization measures the fraction of compute applied to real
            tokens versus padding.</t>
            
            <t>Continuous batching achieves high utilization by avoiding padding.
            For continuous batching systems, batch utilization approaches 1.0
            and is less informative.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>ratio or percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Not meaningful for continuous batching systems</li>
              <li>Varies with workload length distribution</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Non-Padding Token Throughput (<xref target="non-padding-throughput"/>),
            Output Token Throughput (<xref target="output-token-throughput"/>)</dd>
          </dl>
        </section>

        <section anchor="admission-rate" numbered="true" toc="default">
          <name>Admission Rate</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of arriving requests accepted for processing.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Admission control rejects requests to prevent overload. Admission
            rate below 1.0 indicates load shedding.</t>
            
            <t>Rejected requests may receive an error response or be redirected.
            Testers <bcp14>MUST</bcp14> specify the rejection behavior.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>ratio or percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Rejection behavior (error vs redirect)</li>
              <li>Distinction from content-based refusals</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Offered Load (<xref target="offered-load"/>),
            Sustainable Load (<xref target="sustainable-load"/>),
            False Refusal Rate (<xref target="false-refusal-rate"/>)</dd>
          </dl>
        </section>
      </section>

      <section anchor="preemption-resource" numbered="true" toc="default">
        <name>Preemption and Resource Management Metrics</name>
        
        <section anchor="preemption-rate" numbered="true" toc="default">
          <name>Preemption Rate</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of in-flight requests evicted from processing before
            completion.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Memory pressure or priority policies cause request preemption.
            Preempted requests lose their key-value cache state and must
            recompute it upon resumption.</t>
            
            <t>High preemption rates indicate memory over-subscription or
            aggressive scheduling. Preemption degrades latency for affected
            requests.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>ratio or percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Distinction between temporary preemption and permanent eviction</li>
              <li>Treatment of requests that are preempted multiple times</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Preemption Loss (<xref target="preemption-loss"/>),
            Preemption Recovery Latency (<xref target="preemption-recovery-latency"/>),
            KV Cache Swap Rate (<xref target="kv-cache-swap-rate"/>)</dd>
          </dl>
        </section>

        <section anchor="preemption-loss" numbered="true" toc="default">
          <name>Preemption Loss</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The computational work discarded when a request is preempted,
            measured as tokens generated before preemption that must be
            recomputed.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Preemption discards key-value cache state. Upon resumption, the
            system recomputes prefill for all tokens (input plus previously
            generated output). This recomputation is wasted work.</t>
            
            <t>Preemption loss contributes to tail latency. Requests preempted
            late in generation lose more work than those preempted early.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>tokens or milliseconds of recomputation</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Measurement requires tracking recomputed tokens</li>
              <li>Multiple preemptions accumulate loss</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Preemption Rate (<xref target="preemption-rate"/>),
            Preemption Recovery Latency (<xref target="preemption-recovery-latency"/>)</dd>
          </dl>
        </section>

        <section anchor="starvation-rate" numbered="true" toc="default">
          <name>Starvation Rate</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of requests waiting longer than a specified threshold
            before receiving any service.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Starvation occurs when scheduling policies indefinitely delay
            certain requests. Priority inversion and unbounded queue growth
            cause starvation.</t>
            
            <t>The starvation threshold depends on application requirements.
            Testers <bcp14>MUST</bcp14> declare the threshold when reporting starvation rate.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>ratio or percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Threshold selection is application-dependent</li>
              <li>Distinction from queue wait time distribution tail</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Queue Wait Time (<xref target="queue-wait-time"/>),
            Fairness Index (<xref target="fairness-index"/>)</dd>
          </dl>
        </section>

        <section anchor="preemption-recovery-latency" numbered="true" toc="default">
          <name>Preemption Recovery Latency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The elapsed time from preemption to resumption of token
            generation.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Recovery latency includes: wait time for scheduling, KV cache
            reload or recomputation, and prefill of previously generated
            tokens.</t>
            
            <t>Systems with KV cache offloading to host memory recover faster
            than those requiring full recomputation. Recovery latency varies
            with the amount of generated output at preemption time.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Variation based on progress at preemption</li>
              <li>Offloading vs recomputation recovery strategies</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Preemption Rate (<xref target="preemption-rate"/>),
            Preemption Loss (<xref target="preemption-loss"/>),
            KV Cache Swap Rate (<xref target="kv-cache-swap-rate"/>)</dd>
          </dl>
        </section>

        <section anchor="kv-cache-swap-rate" numbered="true" toc="default">
          <name>KV Cache Swap Rate</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The frequency at which key-value cache blocks are migrated between
            accelerator memory and host memory.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Memory-constrained systems swap KV cache to host memory when
            accelerator memory is exhausted. Swapping enables higher
            concurrency at the cost of swap latency.</t>
            
            <t>Swap rate indicates memory pressure. High swap rates degrade
            latency due to PCIe transfer overhead.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>swaps per second or bytes per second</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Granularity of swap operations (per-request vs per-block)</li>
              <li>Distinction between swap-out and swap-in rates</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Preemption Rate (<xref target="preemption-rate"/>),
            Page Fault Latency (<xref target="page-fault-latency"/>)</dd>
          </dl>
        </section>

        <section anchor="page-fault-latency" numbered="true" toc="default">
          <name>Page Fault Latency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The latency incurred when accessing KV cache blocks not resident
            in accelerator memory.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Paged attention systems fault in KV cache blocks on demand.
            Page fault latency includes PCIe transfer time from host memory
            or storage.</t>
            
            <t>Fault latency increases ITL for affected decode steps. Prefetching
            strategies aim to hide fault latency.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Prefetching effectiveness</li>
              <li>Storage tier latency (DRAM vs SSD)</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>KV Cache Swap Rate (<xref target="kv-cache-swap-rate"/>),
            Inter-Token Latency (<xref target="itl"/>)</dd>
          </dl>
        </section>
      </section>

      <section anchor="prefix-caching" numbered="true" toc="default">
        <name>Prefix Caching Metrics</name>
        
        <section anchor="prefix-cache-hit-rate" numbered="true" toc="default">
          <name>Prefix Cache Hit Rate</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of input tokens whose key-value representations are
            retrieved from cache rather than computed.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Prefix caching stores KV cache state for common prompt prefixes.
            Subsequent requests sharing a cached prefix skip prefill for those
            tokens.</t>
            
            <t>Hit rate depends on workload locality. Workloads with shared
            system prompts achieve high hit rates. Workloads with unique
            prompts achieve low hit rates.</t>
            
            <t>Hit rate is computed as cached tokens divided by total input
            tokens across requests. Per-request hit rate varies; testers
            <bcp14>SHOULD</bcp14> report the distribution.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>ratio or percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Granularity of cache matching (exact prefix vs subsequence)</li>
              <li>Multi-tenant cache isolation</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Prefill Latency (<xref target="prefill-latency"/>),
            Time to First Token (<xref target="ttft"/>),
            Cache Eviction Rate (<xref target="cache-eviction-rate"/>)</dd>
          </dl>
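          <t>As a non-normative illustration, the aggregate and per-request
          hit rates can be computed from (cached tokens, total input tokens)
          pairs; the values below assume a hypothetical shared 512-token
          system prompt cached for three of four requests:</t>

          <sourcecode type="python"><![CDATA[
def prefix_cache_hit_rate(requests):
    # Aggregate hit rate: cached input tokens / total input tokens,
    # where requests is a list of (cached_tokens, input_tokens) pairs.
    cached = sum(c for c, _ in requests)
    total = sum(t for _, t in requests)
    return cached / total if total else 0.0

reqs = [(512, 600), (512, 700), (512, 550), (0, 650)]
agg_hit_rate = prefix_cache_hit_rate(reqs)     # aggregate over requests
per_request = [c / t for c, t in reqs]         # distribution to report
]]></sourcecode>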
        </section>

        <section anchor="prefix-cache-capacity" numbered="true" toc="default">
          <name>Prefix Cache Capacity</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The total memory allocated for storing reusable KV cache entries.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Cache capacity limits the number and length of prefixes stored.
            Larger capacity enables more prefixes or longer prefixes at the
            cost of memory available for active requests.</t>
            
            <t>Capacity is often expressed as a fraction of total accelerator
            memory or as maximum cacheable token count.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>bytes, tokens, or percentage of accelerator memory</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Trade-off with memory for active requests</li>
              <li>Dynamic vs static allocation</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Prefix Cache Hit Rate (<xref target="prefix-cache-hit-rate"/>),
            Cache Eviction Rate (<xref target="cache-eviction-rate"/>)</dd>
          </dl>
        </section>

        <section anchor="cache-eviction-rate" numbered="true" toc="default">
          <name>Cache Eviction Rate</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The frequency at which cached prefix entries are removed to
            accommodate new entries.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Eviction occurs when cache capacity is exhausted. High eviction
            rates indicate insufficient capacity for the workload's prefix
            diversity.</t>
            
            <t>Eviction policies (LRU, frequency-based) affect which prefixes
            remain cached. Testers <bcp14>SHOULD</bcp14> specify the eviction policy.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>evictions per second or evictions per request</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Eviction policy effects</li>
              <li>Distinguishing capacity eviction from staleness eviction</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Prefix Cache Hit Rate (<xref target="prefix-cache-hit-rate"/>),
            Prefix Cache Capacity (<xref target="prefix-cache-capacity"/>)</dd>
          </dl>
        </section>

        <section anchor="ttft-reduction-caching" numbered="true" toc="default">
          <name>TTFT Reduction from Caching</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The reduction in TTFT attributable to prefix cache hits, computed
            as TTFT without caching minus TTFT with caching.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>This metric quantifies caching benefit. It depends on both hit
            rate and the length of cached prefixes.</t>
            
            <t>Measurement requires comparing TTFT with caching enabled versus
            disabled, or estimating based on prefill latency per token.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Requires baseline measurement without caching</li>
              <li>Varies with cached prefix length</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Prefix Cache Hit Rate (<xref target="prefix-cache-hit-rate"/>),
            Time to First Token (<xref target="ttft"/>),
            Prefill Latency (<xref target="prefill-latency"/>)</dd>
          </dl>
        </section>
      </section>

      <section anchor="speculative-decoding" numbered="true" toc="default">
        <name>Speculative Decoding Metrics</name>
        
        <section anchor="draft-acceptance-rate" numbered="true" toc="default">
          <name>Draft Acceptance Rate</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of tokens proposed by a draft model that are accepted
            by the target model.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Speculative decoding uses a smaller draft model to propose
            multiple tokens verified in parallel by the target model. Higher
            acceptance rates yield greater speedup.</t>
            
            <t>Acceptance rate depends on draft model quality and alignment with
            the target model. Rates vary by domain; code and structured text
            achieve higher rates than creative writing.</t>
            
            <t>Acceptance rate is computed per speculation window and then
            averaged. Testers <bcp14>SHOULD</bcp14> report the distribution across
            windows.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>ratio or percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Variation across domains and prompts</li>
              <li>Dependence on draft model selection</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Speculative Speedup (<xref target="speculative-speedup"/>),
            Draft Overhead (<xref target="draft-overhead"/>)</dd>
          </dl>
        </section>

        <section anchor="speculative-speedup" numbered="true" toc="default">
          <name>Speculative Speedup</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The ratio of decoding throughput with speculative decoding enabled
            to throughput with speculative decoding disabled.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Speedup depends on acceptance rate and the relative cost of draft
            and target model inference. High acceptance with low draft cost
            yields high speedup.</t>
            
            <t>Speculative decoding increases TTFT due to draft model prefill.
            Speedup applies to the decode phase only. End-to-end speedup is
            lower, especially for short outputs.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>ratio (dimensionless)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Excludes TTFT overhead</li>
              <li>Varies with acceptance rate</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Draft Acceptance Rate (<xref target="draft-acceptance-rate"/>),
            Draft Overhead (<xref target="draft-overhead"/>),
            Output Token Throughput (<xref target="output-token-throughput"/>)</dd>
          </dl>
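          <t>As a non-normative illustration, a commonly used analytical model
          relates decode speedup to acceptance rate and draft cost. The model
          assumes each draft token is accepted independently with probability
          alpha, which is a simplification (real acceptance is correlated
          across positions), and assumes verifying a window costs one target
          step:</t>

          <sourcecode type="python"><![CDATA[
def expected_tokens_per_step(alpha, gamma):
    # Expected tokens emitted per target-model verification step when
    # each of the gamma draft tokens is accepted i.i.d. with
    # probability alpha: E = (1 - alpha**(gamma + 1)) / (1 - alpha).
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def decode_speedup(alpha, gamma, draft_cost):
    # Decode-phase speedup vs plain autoregressive decoding, where
    # draft_cost is the cost of one draft step relative to one target
    # step, and verification of the window costs one target step.
    return expected_tokens_per_step(alpha, gamma) / (1.0 + gamma * draft_cost)

speedup = decode_speedup(alpha=0.8, gamma=4, draft_cost=0.05)
]]></sourcecode>

          <t>The model makes the trade-off explicit: raising the window length
          gamma helps only while the acceptance rate stays high enough to
          amortize the added draft cost.</t>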
        </section>

        <section anchor="draft-overhead" numbered="true" toc="default">
          <name>Draft Overhead</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The additional latency or compute cost introduced by the draft
            model.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Draft model inference adds to prefill time and per-step decode
            time. This overhead must be recovered through acceptance to
            achieve net speedup.</t>
            
<t>Overhead is measured as additional latency per speculation window
            or as a fraction of total compute.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms) or percentage of total latency</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Amortization over speculation window length</li>
              <li>Memory overhead for draft model weights</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Draft Acceptance Rate (<xref target="draft-acceptance-rate"/>),
            Speculative Speedup (<xref target="speculative-speedup"/>)</dd>
          </dl>
        </section>

        <section anchor="verification-throughput" numbered="true" toc="default">
          <name>Verification Throughput</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The number of draft tokens verified per second by the target
            model.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Verification throughput measures the target model's capacity to
            check draft proposals. Higher verification throughput enables
            longer speculation windows.</t>
            
            <t>Verification processes multiple tokens in parallel, achieving
            higher throughput than autoregressive generation of the same
            tokens.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>tokens per second (tok/s)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Distinction from generation throughput</li>
              <li>Variation with speculation window length</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Draft Acceptance Rate (<xref target="draft-acceptance-rate"/>),
            Output Token Throughput (<xref target="output-token-throughput"/>)</dd>
          </dl>
        </section>
      </section>

      <section anchor="rag-metrics" numbered="true" toc="default">
        <name>Retrieval-Augmented Generation Metrics</name>
        
        <section anchor="embedding-latency" numbered="true" toc="default">
          <name>Embedding Latency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The elapsed time to convert query text into vector representations
            for retrieval.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Retrieval-Augmented Generation (RAG) systems embed queries before
            searching a vector store. Embedding latency adds to TTFT.</t>
            
<t>Embedding models are typically smaller and faster than generation
            models, so embedding latency is a minor TTFT component for most
            deployments.</t>
            
            <t>Batched embedding of multiple queries achieves higher throughput
            than sequential embedding.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Batched vs sequential embedding</li>
              <li>Embedding model selection effects</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Retrieval Latency (<xref target="retrieval-latency"/>),
            Time to First Token (<xref target="ttft"/>)</dd>
          </dl>
        </section>

        <section anchor="retrieval-latency" numbered="true" toc="default">
          <name>Retrieval Latency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The elapsed time to search a vector store and fetch relevant
            documents.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Retrieval latency includes vector similarity search, optional
            reranking, and document fetching. It is a component of TTFT for
            RAG systems.</t>
            
            <t>Retrieval latency depends on index size, search algorithm, and
            number of results. Approximate nearest neighbor search trades
            accuracy for speed.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Index size and algorithm effects</li>
              <li>Reranking inclusion</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Embedding Latency (<xref target="embedding-latency"/>),
            Time to First Token (<xref target="ttft"/>),
            Context Injection Overhead (<xref target="context-injection-overhead"/>)</dd>
          </dl>
        </section>

        <section anchor="retrieval-recall" numbered="true" toc="default">
          <name>Retrieval Recall</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of relevant documents retrieved from the corpus.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Recall measures retrieval effectiveness. Low recall causes the
            generation model to lack relevant context. This is a quality
            metric, not a performance metric, but affects overall system
            evaluation.</t>
            
<t>Measuring recall requires ground-truth relevance labels. When
            labels are unavailable, testers <bcp14>MAY</bcp14> use proxy
            metrics such as answer correctness.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>ratio or percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Requires ground-truth relevance labels</li>
              <li>Trade-off with retrieval latency</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Retrieval Latency (<xref target="retrieval-latency"/>),
            Context Utilization Rate (<xref target="context-utilization-rate"/>)</dd>
          </dl>
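<t>Given ground-truth labels, the computation is straightforward; a minimal sketch using hypothetical document identifiers:</t>

```python
def retrieval_recall(retrieved: set, relevant: set) -> float:
    """Fraction of ground-truth relevant documents present in the
    retrieved set."""
    if not relevant:
        raise ValueError("recall is undefined without relevant documents")
    return len(retrieved & relevant) / len(relevant)

# retrieval_recall({"d1", "d2", "d5"}, {"d1", "d2", "d3", "d4"}) -> 0.5
```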
        </section>

        <section anchor="context-injection-overhead" numbered="true" toc="default">
          <name>Context Injection Overhead</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The additional prefill latency caused by retrieved context tokens.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Retrieved documents increase prompt length, increasing prefill
            computation. This overhead scales with retrieved token count.</t>
            
            <t>The overhead is the difference between prefill latency with
            retrieved context and prefill latency without it.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Varies with retrieval result length</li>
              <li>Interaction with context length limits</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Prefill Latency (<xref target="prefill-latency"/>),
            Retrieval Latency (<xref target="retrieval-latency"/>)</dd>
          </dl>
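<t>The paired-measurement definition above can be sketched as a difference of medians; the latencies are illustrative:</t>

```python
from statistics import median

def context_injection_overhead(prefill_with_ctx_ms, prefill_without_ms):
    """Median prefill latency with retrieved context minus the median
    for the same queries without it (paired measurement, in ms)."""
    return median(prefill_with_ctx_ms) - median(prefill_without_ms)

# context_injection_overhead([310, 295, 320], [120, 115, 130]) -> 190
```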
        </section>

        <section anchor="context-utilization-rate" numbered="true" toc="default">
          <name>Context Utilization Rate</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of retrieved context tokens that materially influence
            the generated response.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Not all retrieved content contributes to generation. Low
            utilization indicates retrieval of irrelevant content, wasting
            context window capacity and prefill compute.</t>
            
            <t>Utilization is difficult to measure directly. Proxy measurements
            include attention weight analysis and ablation studies.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>ratio or percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Indirect measurement required</li>
              <li>Attribution challenges</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Retrieval Recall (<xref target="retrieval-recall"/>),
            Context Injection Overhead (<xref target="context-injection-overhead"/>)</dd>
          </dl>
        </section>
      </section>

      <section anchor="agentic-metrics" numbered="true" toc="default">
        <name>Agentic and Compound System Metrics</name>
        
        <section anchor="task-completion-latency" numbered="true" toc="default">
          <name>Task Completion Latency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The elapsed time for a compound system to satisfy a user intent,
            encompassing all LLM calls, retrieval steps, and tool executions.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Agentic systems perform multiple internal operations per user
            request. Task completion latency measures the full user-facing
            response time.</t>
            
            <t>Task completion latency depends on the number of internal steps,
            which varies by task complexity. Simple tasks complete in one LLM
            call. Complex tasks require multiple calls with tool use.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>seconds (s)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Variation with task complexity</li>
              <li>Definition of task completion for open-ended tasks</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Sub-Request Count (<xref target="sub-request-count"/>),
            Tool Execution Latency (<xref target="tool-execution-latency"/>)</dd>
          </dl>
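<t>For a strictly sequential agent, task completion latency is the sum of its internal span durations; a sketch over a hypothetical execution trace (concurrent steps would instead require the union of their time intervals):</t>

```python
def task_completion_latency(trace):
    """Sum span durations from a sequential execution trace of
    (step_kind, seconds) pairs covering LLM calls, retrieval steps,
    and tool executions."""
    return sum(seconds for _, seconds in trace)

trace = [("llm", 1.2), ("tool", 0.8), ("llm", 1.5), ("retrieval", 0.1)]
# task_completion_latency(trace) ~ 3.6 s
```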
        </section>

        <section anchor="sub-request-count" numbered="true" toc="default">
          <name>Sub-Request Count</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The number of internal LLM inference requests triggered by a
            single user interaction.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Agentic systems decompose user requests into multiple LLM calls
            for planning, reasoning, and action. Sub-request count indicates
            system complexity and affects total latency and cost.</t>
            
            <t>High sub-request counts indicate complex reasoning chains or
            retry loops. Testers <bcp14>SHOULD</bcp14> examine sub-request count
            distributions to identify inefficient patterns.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>count</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Includes retries and failed attempts</li>
              <li>Varies with task complexity</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Task Completion Latency (<xref target="task-completion-latency"/>),
            Loop Incidence Rate (<xref target="loop-incidence-rate"/>)</dd>
          </dl>
        </section>

        <section anchor="loop-incidence-rate" numbered="true" toc="default">
          <name>Loop Incidence Rate</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of tasks where the agent enters a repetitive control
            flow without making progress.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Agentic loops occur when the system repeats similar actions
            without advancing toward task completion. Loops indicate planning
            failures or tool errors.</t>
            
            <t>Loop detection requires defining "progress" for the task domain.
            Common heuristics include action repetition count and state
            similarity thresholds.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>ratio or percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Progress definition is task-dependent</li>
              <li>Distinguishing loops from legitimate retries</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Sub-Request Count (<xref target="sub-request-count"/>),
            Task Completion Latency (<xref target="task-completion-latency"/>)</dd>
          </dl>
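<t>One of the heuristics mentioned above, consecutive action repetition, can be sketched as follows; the repetition threshold is an arbitrary illustrative choice:</t>

```python
def is_looping(actions, max_repeats=3):
    """Heuristic loop detector: flag a task whose action log repeats
    the same action more than max_repeats times in a row."""
    run = 1
    for prev, cur in zip(actions, actions[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeats:
            return True
    return False

def loop_incidence_rate(task_action_logs, max_repeats=3):
    """Fraction of tasks flagged as looping by the heuristic."""
    flagged = sum(is_looping(a, max_repeats) for a in task_action_logs)
    return flagged / len(task_action_logs)
```

State-similarity heuristics would catch loops that alternate between actions, which this repetition counter misses.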
        </section>

        <section anchor="tool-execution-latency" numbered="true" toc="default">
          <name>Tool Execution Latency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The elapsed time for external tool calls invoked by the agent.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Agents call external tools for information retrieval, computation,
            or actions. Tool latency contributes to task completion latency.</t>
            
<t>Tool latency varies widely by tool type: local computation
            typically completes in milliseconds, while external API calls may
            take seconds.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms) or seconds (s)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>High variance across tool types</li>
              <li>External dependencies outside system control</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Task Completion Latency (<xref target="task-completion-latency"/>),
            Sub-Request Count (<xref target="sub-request-count"/>)</dd>
          </dl>
        </section>

        <section anchor="agentic-goodput" numbered="true" toc="default">
          <name>Agentic Goodput</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of tasks completed successfully while meeting
            declared task-level objectives.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Agentic goodput combines completion rate and quality. A task that
            completes but produces incorrect results does not count toward
            goodput.</t>
            
            <t>Objective definitions are task-specific. Testers <bcp14>MUST</bcp14> declare
            objectives and evaluation criteria.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>ratio or percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Task-specific objective definitions</li>
              <li>Evaluation criteria for correctness</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Task Completion Latency (<xref target="task-completion-latency"/>),
            SLO Attainment Rate (<xref target="slo-attainment"/>)</dd>
          </dl>
        </section>
      </section>

      <section anchor="quality-policy" numbered="true" toc="default">
        <name>Quality and Policy Enforcement Metrics</name>
        
        <section anchor="policy-violation-rate" numbered="true" toc="default">
          <name>Policy Violation Rate</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of responses that violate declared content or safety
            policies.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Safety systems filter outputs to prevent harmful content. Policy
            violation rate measures filter failure.</t>
            
            <t>Violation detection requires evaluation against policy criteria.
            Automated classifiers or human review provide measurements.</t>
            
<t>A low violation rate indicates effective filtering; however, a
            very low rate may reflect overly restrictive filtering that causes
            false refusals.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>ratio or percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Policy definition and scope</li>
              <li>Detection method reliability</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>False Refusal Rate (<xref target="false-refusal-rate"/>),
            Guardrail Processing Overhead (<xref target="guardrail-overhead"/>)</dd>
          </dl>
        </section>

        <section anchor="false-refusal-rate" numbered="true" toc="default">
          <name>False Refusal Rate</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of benign, policy-compliant requests that are
            incorrectly refused.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Overly sensitive safety filters refuse legitimate requests. False
            refusals degrade user experience and system utility.</t>
            
            <t>Measuring false refusals requires labeled benign requests.
            Testers <bcp14>MUST</bcp14> specify the evaluation dataset and labeling criteria.</t>
            
            <t>False refusal rate trades off against policy violation rate.
            Stricter filtering reduces violations but increases false
            refusals.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>ratio or percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Requires labeled benign test set</li>
              <li>Trade-off with policy violation rate</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Policy Violation Rate (<xref target="policy-violation-rate"/>),
            Admission Rate (<xref target="admission-rate"/>)</dd>
          </dl>
        </section>

        <section anchor="guardrail-overhead" numbered="true" toc="default">
          <name>Guardrail Processing Overhead</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The additional latency introduced by safety, policy, or content
            filtering mechanisms.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Guardrails add processing before, during, or after generation.
            Input filters add to TTFT. Output filters add to end-to-end
            latency.</t>
            
            <t>Overhead is measured by comparing latency with guardrails enabled
            versus disabled.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Baseline measurement without guardrails</li>
              <li>Multiple guardrail stages with distinct overhead</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Time to First Token (<xref target="ttft"/>),
            End-to-End Latency (<xref target="end-to-end-latency"/>),
            Policy Violation Rate (<xref target="policy-violation-rate"/>)</dd>
          </dl>
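<t>The enabled-versus-disabled comparison can be sketched as a percentile difference; the percentile choice and variable names are illustrative:</t>

```python
from statistics import quantiles

def guardrail_overhead(enabled_ms, disabled_ms, pct=95):
    """Latency overhead at a given percentile: samples measured with
    guardrails enabled minus samples measured with them disabled,
    under the same workload and load level."""
    def p(samples):
        # quantiles(n=100) returns the 99 percentile cut points
        return quantiles(samples, n=100)[pct - 1]
    return p(enabled_ms) - p(disabled_ms)
```

Comparing at a tail percentile rather than the mean can surface guardrail stages that fire only on a subset of requests.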
        </section>
      </section>

      <section anchor="slo-metrics" numbered="true" toc="default">
        <name>Service Level Objective Metrics</name>
        
        <section anchor="slo" numbered="true" toc="default">
          <name>Service Level Objective</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>A quantitative threshold for a performance metric that defines
            acceptable service quality.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>SLOs specify targets such as:</t>
            <ul>
              <li>P99 TTFT below 500 ms</li>
              <li>P95 TPOT below 50 ms</li>
              <li>Error rate below 0.1%</li>
            </ul>
            
            <t>SLOs derive from user experience requirements and business
            constraints. Different applications require different SLOs.</t>
            
            <t>SLO definitions include the metric, percentile, threshold, and
            measurement window. Testers <bcp14>MUST</bcp14> fully specify SLOs when
            reporting attainment.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>not applicable (SLO is a specification, not a measurement)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Application-specific requirements</li>
              <li>Multiple SLOs may apply simultaneously</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>SLO Attainment Rate (<xref target="slo-attainment"/>),
            Sustainable Load (<xref target="sustainable-load"/>)</dd>
          </dl>
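<t>A fully specified SLO can be represented as a small record; the field names below are illustrative, not defined by this document:</t>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SLO:
    """A fully specified SLO: metric, percentile, threshold, window."""
    metric: str        # e.g. "ttft_ms"
    percentile: float  # e.g. 99.0
    threshold: float   # upper bound at that percentile
    window_s: int      # measurement window in seconds

slos = [
    SLO("ttft_ms", 99.0, 500.0, 300),
    SLO("tpot_ms", 95.0, 50.0, 300),
]
```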
        </section>

        <section anchor="slo-attainment" numbered="true" toc="default">
          <name>SLO Attainment Rate</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of requests or time periods meeting all declared
            SLOs.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Attainment rate summarizes SLO compliance. Request-based
            attainment counts requests meeting SLOs. Time-based attainment
            counts measurement windows where aggregate metrics meet SLOs.</t>
            
            <t>Attainment below 1.0 indicates SLO violations. The acceptable
            attainment level depends on SLA commitments.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>ratio or percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Request-based vs time-based measurement</li>
              <li>Handling of multiple simultaneous SLOs</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Service Level Objective (<xref target="slo"/>),
            Sustainable Load (<xref target="sustainable-load"/>)</dd>
          </dl>
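<t>A request-based attainment computation might look like the following sketch, which applies per-request thresholds; aggregate percentile SLOs instead require evaluation per measurement window:</t>

```python
def slo_attainment(requests, slo_limits):
    """Request-based attainment: fraction of requests whose recorded
    metrics meet every declared per-request threshold."""
    def meets_all(req):
        return all(req[metric] <= limit
                   for metric, limit in slo_limits.items())
    return sum(meets_all(r) for r in requests) / len(requests)

reqs = [{"ttft_ms": 320, "tpot_ms": 41},
        {"ttft_ms": 610, "tpot_ms": 38},
        {"ttft_ms": 280, "tpot_ms": 55}]
# slo_attainment(reqs, {"ttft_ms": 500, "tpot_ms": 50}) -> 1/3
```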
        </section>

        <section anchor="goodput" numbered="true" toc="default">
          <name>Goodput</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The throughput of requests that complete successfully and meet
            SLO requirements.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Goodput equals total throughput multiplied by SLO attainment rate.
            It measures useful, compliant work rather than raw capacity.</t>
            
            <t>Systems with high throughput but low attainment achieve low
            goodput. Goodput captures the throughput-quality trade-off.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>requests per second (req/s) or tokens per second (tok/s)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Requires defined SLOs</li>
              <li>May use request or token basis</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>SLO Attainment Rate (<xref target="slo-attainment"/>),
            Output Token Throughput (<xref target="output-token-throughput"/>),
            Request Throughput (<xref target="request-throughput"/>)</dd>
          </dl>
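<t>The identity stated above can be expressed directly; a trivial sketch:</t>

```python
def goodput(throughput_rps: float, slo_attainment: float) -> float:
    """Total request throughput scaled by request-based SLO attainment
    measured over the same window."""
    return throughput_rps * slo_attainment

# goodput(100.0, 0.92) ~ 92 req/s
```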
        </section>
      </section>
    </section>

    <section anchor="measurement-considerations" numbered="true" toc="default">
      <name>Measurement Considerations</name>
      
      <section anchor="workload-spec" numbered="true" toc="default">
        <name>Workload Specification</name>
        
        <t>Benchmarks <bcp14>MUST</bcp14> specify the workload used for measurement:</t>
        
        <ul>
          <li>Input length distribution (mean, percentiles, min, max)</li>
          <li>Output length distribution or generation parameters</li>
          <li>Request arrival pattern (open-loop rate or closed-loop concurrency)</li>
          <li>Dataset source or generation method</li>
          <li>Tokenizer used for length calculations</li>
        </ul>
        
        <t>Workload characteristics affect all metrics. Results from different
        workloads are not directly comparable.</t>
      </section>

      <section anchor="warmup" numbered="true" toc="default">
        <name>Warm-up and Steady State</name>
        
        <t>Systems require warm-up before reaching steady-state performance.
        Warm-up effects include:</t>
        
        <ul>
          <li>JIT compilation of kernels</li>
          <li>Memory allocator warm-up</li>
          <li>Prefix cache population</li>
          <li>Batch size ramp-up</li>
        </ul>
        
        <t>Testers <bcp14>MUST</bcp14> exclude warm-up from measurement or report it
        separately. Testers <bcp14>SHOULD</bcp14> document the warm-up procedure and
        duration.</t>
      </section>

      <section anchor="measurement-duration" numbered="true" toc="default">
        <name>Measurement Duration</name>
        
        <t>Measurement windows <bcp14>MUST</bcp14> be long enough to capture steady-state
        behavior and sufficient samples for statistical reliability.</t>
        
        <t>For percentile measurements:</t>
        <ul>
          <li>P50 requires at least 100 samples</li>
          <li>P99 requires at least 1,000 samples</li>
          <li>P99.9 requires at least 10,000 samples</li>
        </ul>
        
        <t>Testers <bcp14>MUST</bcp14> report sample counts alongside percentiles.</t>
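<t>These minimums can be encoded in a simple reporting check; the table mirrors the list above:</t>

```python
# Minimum sample counts from the list above, keyed by percentile.
MIN_SAMPLES = {50.0: 100, 99.0: 1_000, 99.9: 10_000}

def enough_samples(percentile: float, n: int) -> bool:
    """True if n satisfies the minimum sample count recommended here
    for reporting the given percentile."""
    return n >= MIN_SAMPLES[percentile]

# enough_samples(99.0, 1500) -> True
# enough_samples(99.9, 1500) -> False
```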
      </section>

      <section anchor="clock-sync" numbered="true" toc="default">
        <name>Clock Synchronization</name>
        
        <t>Distributed systems require synchronized clocks for latency
        measurement. Clock skew introduces measurement error.</t>
        
        <t>Testers <bcp14>SHOULD</bcp14> use NTP or PTP for clock synchronization and report
        the synchronization method and estimated accuracy.</t>
      </section>

      <section anchor="system-config" numbered="true" toc="default">
        <name>System Configuration</name>
        
        <t>Benchmarks <bcp14>MUST</bcp14> report system configuration:</t>
        
        <ul>
          <li>Hardware: accelerator model, count, memory, interconnect</li>
          <li>Software: serving framework, version, model format</li>
          <li>Model: architecture, parameter count, precision</li>
          <li>Serving: batching strategy, parallelism configuration</li>
          <li>Tuning: any non-default configuration parameters</li>
        </ul>
        
        <t>Results are specific to the reported configuration.</t>
      </section>
    </section>

    <section anchor="security" numbered="true" toc="default">
      <name>Security Considerations</name>
      
      <section anchor="threat-model" numbered="true" toc="default">
        <name>Threat Model</name>
        
        <t>This section considers adversaries who submit requests designed to
        degrade service for other users or extract information about the
        system or other users' requests.</t>
        
        <t>Performance benchmarking itself does not introduce security
        vulnerabilities. However, performance characteristics may be
        exploited by adversaries.</t>
      </section>

      <section anchor="side-channels" numbered="true" toc="default">
        <name>Side-Channel Information Leakage</name>
        
        <t>Shared infrastructure creates side-channel risks.</t>
        
        <t>Timing channels: Request latency depends on queue depth, batch
        composition, and cache state influenced by other users. An adversary
        observing their own request latency may infer information about
        concurrent requests.</t>
        
        <t>Cache channels: Prefix caching creates observable timing differences
        between cache hits and misses. An adversary may probe for cached
        prefixes to learn about other users' prompts.</t>
        
        <t>Batch channels: Continuous batching causes ITL variation based on
        batch membership changes. An adversary may infer when other requests
        arrive or complete.</t>
        
        <t>Mitigation strategies include request isolation, timing noise
        injection, and partitioned caching. These mitigations affect
        performance. Testers evaluating multi-tenant systems <bcp14>SHOULD</bcp14> measure
        side-channel leakage alongside performance.</t>
      </section>

      <section anchor="resource-exhaustion" numbered="true" toc="default">
        <name>Resource Exhaustion Attacks</name>
        
        <t>Adversaries may craft requests to exhaust system resources:</t>
        
        <t>Memory exhaustion: Requests with long outputs grow KV cache until
        memory is exhausted. Systems without output length limits or memory
        management are vulnerable.</t>
        
        <t>Compute exhaustion: Long input sequences maximize prefill compute.
        Pathological inputs may trigger worst-case attention patterns.</t>
        
        <t>Queue exhaustion: Bursts of requests exceed admission capacity.
        Without rate limiting, legitimate requests are delayed or rejected.</t>
        
        <t>The Sustainable Load (<xref target="sustainable-load"/>) and
        Admission Rate (<xref target="admission-rate"/>) metrics characterize
        resilience to resource exhaustion.</t>
      </section>

      <section anchor="model-extraction" numbered="true" toc="default">
        <name>Model Extraction</name>
        
        <t>High-volume query access enables model extraction attacks where an
        adversary trains a copy of the model from input-output pairs. This
        document does not define rate-limiting terminology. Deployments
        concerned with model extraction <bcp14>SHOULD</bcp14> implement and monitor rate
        limits.</t>
      </section>

      <section anchor="benchmark-gaming" numbered="true" toc="default">
        <name>Benchmark Gaming</name>
        
        <t>Systems may be optimized for benchmark workloads in ways that do not
        generalize to production traffic. Testers <bcp14>SHOULD</bcp14> use diverse
        workloads representative of intended deployment.</t>
      </section>
    </section>
  </middle>

  <back>
    <references>
      <name>References</name>
      
      <references>
        <name>Normative References</name>
        
        <reference anchor="RFC2119" target="https://www.rfc-editor.org/info/rfc2119">
          <front>
            <title>Key words for use in RFCs to Indicate Requirement Levels</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <date month="March" year="1997"/>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="2119"/>
          <seriesInfo name="DOI" value="10.17487/RFC2119"/>
        </reference>
        
        <reference anchor="RFC8174" target="https://www.rfc-editor.org/info/rfc8174">
          <front>
            <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
            <author fullname="B. Leiba" initials="B." surname="Leiba"/>
            <date month="May" year="2017"/>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="8174"/>
          <seriesInfo name="DOI" value="10.17487/RFC8174"/>
        </reference>
      </references>

      <references>
        <name>Informative References</name>
        
        <reference anchor="RFC1242" target="https://www.rfc-editor.org/info/rfc1242">
          <front>
            <title>Benchmarking Terminology for Network Interconnection Devices</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <date month="July" year="1991"/>
          </front>
          <seriesInfo name="RFC" value="1242"/>
          <seriesInfo name="DOI" value="10.17487/RFC1242"/>
        </reference>
        
        <reference anchor="RFC2544" target="https://www.rfc-editor.org/info/rfc2544">
          <front>
            <title>Benchmarking Methodology for Network Interconnect Devices</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <author fullname="J. McQuaid" initials="J." surname="McQuaid"/>
            <date month="March" year="1999"/>
          </front>
          <seriesInfo name="RFC" value="2544"/>
          <seriesInfo name="DOI" value="10.17487/RFC2544"/>
        </reference>
        
        <reference anchor="RFC2647" target="https://www.rfc-editor.org/info/rfc2647">
          <front>
            <title>Benchmarking Terminology for Firewall Performance</title>
            <author fullname="D. Newman" initials="D." surname="Newman"/>
            <date month="August" year="1999"/>
          </front>
          <seriesInfo name="RFC" value="2647"/>
          <seriesInfo name="DOI" value="10.17487/RFC2647"/>
        </reference>
        
        <reference anchor="MLPERF">
          <front>
            <title>MLPerf Inference Benchmark</title>
            <author fullname="V. J. Reddi" initials="V. J." surname="Reddi"/>
            <date year="2020"/>
          </front>
          <seriesInfo name="DOI" value="10.1109/ISCA45697.2020.00045"/>
          <refcontent>Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA)</refcontent>
        </reference>
        
        <reference anchor="VLLM">
          <front>
            <title>Efficient Memory Management for Large Language Model Serving with PagedAttention</title>
            <author fullname="W. Kwon" initials="W." surname="Kwon"/>
            <date year="2023"/>
          </front>
          <seriesInfo name="DOI" value="10.1145/3600006.3613165"/>
          <refcontent>Proceedings of the ACM Symposium on Operating Systems Principles (SOSP)</refcontent>
        </reference>
        
        <reference anchor="ORCA">
          <front>
            <title>Orca: A Distributed Serving System for Transformer-Based Generative Models</title>
            <author fullname="G. Yu" initials="G." surname="Yu"/>
            <date year="2022"/>
          </front>
          <refcontent>Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI)</refcontent>
        </reference>
        
        <reference anchor="SARATHI">
          <front>
            <title>Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve</title>
            <author fullname="A. Agrawal" initials="A." surname="Agrawal"/>
            <date year="2024"/>
          </front>
          <refcontent>Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI)</refcontent>
        </reference>
        
        <reference anchor="ATTENTION">
          <front>
            <title>Attention Is All You Need</title>
            <author fullname="A. Vaswani" initials="A." surname="Vaswani"/>
            <date year="2017"/>
          </front>
          <refcontent>Advances in Neural Information Processing Systems (NeurIPS)</refcontent>
        </reference>
        
        <reference anchor="JAIN">
          <front>
            <title>A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer Systems</title>
            <author fullname="R. Jain" initials="R." surname="Jain"/>
            <author fullname="D. Chiu" initials="D." surname="Chiu"/>
            <author fullname="W. Hawe" initials="W." surname="Hawe"/>
            <date month="September" year="1984"/>
          </front>
          <refcontent>DEC Research Report TR-301</refcontent>
        </reference>
      </references>
    </references>

    <section anchor="appendix-supplementary" numbered="true" toc="default">
      <name>Supplementary Metrics</name>
      
      <t>This appendix defines metrics relevant for production deployment
      decisions but not essential for basic performance characterization.</t>
      
      <section anchor="energy-metrics" numbered="true" toc="default">
        <name>Energy and Sustainability Metrics</name>
        
        <section anchor="energy-per-token" numbered="true" toc="default">
          <name>Energy per Token</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The energy consumed to generate one output token.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Energy per token enables efficiency comparison across systems and
            sustainability analysis. The value depends on model size,
            hardware, batch size, and workload.</t>
            
            <t>Measurement requires power monitoring integrated with token
            counting. GPU power is accessible via vendor APIs (NVML for
            NVIDIA). Total system power requires external instrumentation.</t>
            
            <t>Energy per token differs between prefill and decode phases.
            Testers <bcp14>SHOULD</bcp14> report phase-separated energy when feasible.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>Joules per token (J/tok) or millijoules per token (mJ/tok)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Measurement boundary (GPU vs system)</li>
              <li>Phase attribution</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Tokens per Joule (<xref target="tokens-per-joule"/>),
            Output Token Throughput (<xref target="output-token-throughput"/>)</dd>
          </dl>
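          <t>As a worked sketch of this definition (the power trace, sampling
          interval, and token count below are illustrative, not measured),
          energy per token can be derived by integrating sampled power over
          the measurement interval and dividing by the tokens generated:</t>

```python
# Illustrative only: derive energy per token from a sampled GPU power trace.
power_samples_w = [310.0, 342.0, 355.0, 348.0, 330.0]  # Watts, fixed-rate samples
sample_interval_s = 1.0                                # seconds between samples
output_tokens = 425                                    # tokens generated in the interval

# Approximate energy (Joules) as the sum of power * interval (rectangle rule).
energy_j = sum(p * sample_interval_s for p in power_samples_w)

energy_per_token_j = energy_j / output_tokens       # J/tok
energy_per_token_mj = energy_per_token_j * 1000.0   # mJ/tok
print(f"{energy_per_token_j:.3f} J/tok ({energy_per_token_mj:.0f} mJ/tok)")
```

          <t>A finer sampling interval reduces the error of the rectangle-rule
          approximation; the sampling-rate concern noted under Instantaneous
          Power Draw applies here as well.</t>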
        </section>

        <section anchor="tokens-per-joule" numbered="true" toc="default">
          <name>Tokens per Joule</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The number of output tokens generated per Joule of energy.</dd>
            
            <dt>Discussion:</dt>
            <dd>Tokens per Joule is the inverse of energy per token, expressing
            energy efficiency. Higher values indicate greater efficiency.</dd>
            
            <dt>Unit of measurement:</dt>
            <dd>tokens per Joule (tok/J)</dd>
            
            <dt>Issues:</dt>
            <dd>Same as Energy per Token</dd>
            
            <dt>See also:</dt>
            <dd>Energy per Token (<xref target="energy-per-token"/>)</dd>
          </dl>
        </section>

        <section anchor="instantaneous-power" numbered="true" toc="default">
          <name>Instantaneous Power Draw</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The electrical power consumed at a given instant.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Power draw varies with load: idle systems draw substantially
            less than loaded systems, and prefill phases typically draw more
            than decode phases due to higher compute utilization.</t>
            
            <t>Testers <bcp14>MUST</bcp14> specify the measurement boundary: GPU only, GPU and
            CPU, or entire system including cooling.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>Watts (W)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Measurement boundary specification</li>
              <li>Sampling rate for time-varying power</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Peak Power Draw (<xref target="peak-power"/>),
            Idle Power Draw (<xref target="idle-power"/>)</dd>
          </dl>
        </section>

        <section anchor="peak-power" numbered="true" toc="default">
          <name>Peak Power Draw</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The maximum instantaneous power observed during a measurement
            interval.</dd>
            
            <dt>Discussion:</dt>
            <dd>Peak power determines infrastructure requirements for power
            delivery and cooling. Systems may have brief power spikes
            exceeding average consumption.</dd>
            
            <dt>Unit of measurement:</dt>
            <dd>Watts (W)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Sampling rate may miss brief spikes</li>
              <li>Relationship to thermal design power (TDP)</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Instantaneous Power Draw (<xref target="instantaneous-power"/>)</dd>
          </dl>
        </section>

        <section anchor="idle-power" numbered="true" toc="default">
          <name>Idle Power Draw</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The power consumed when the system is ready to serve but not
            processing requests.</dd>
            
            <dt>Discussion:</dt>
            <dd>Idle power represents the baseline cost of provisioned capacity.
            The difference between loaded and idle power indicates dynamic
            power range.</dd>
            
            <dt>Unit of measurement:</dt>
            <dd>Watts (W)</dd>
            
            <dt>Issues:</dt>
            <dd>Definition of idle (model loaded, empty batch)</dd>
            
            <dt>See also:</dt>
            <dd>Instantaneous Power Draw (<xref target="instantaneous-power"/>)</dd>
          </dl>
        </section>

        <section anchor="carbon-intensity" numbered="true" toc="default">
          <name>Carbon Intensity</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The greenhouse gas emissions per token or per request.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Carbon intensity equals energy consumption multiplied by the
            grid's carbon factor (gCO2e/kWh). Grid carbon intensity varies
            by location and time.</t>
            
            <t>Testers reporting carbon metrics <bcp14>MUST</bcp14> specify the grid carbon
            factor used.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>grams CO2 equivalent per token (gCO2e/tok)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Grid carbon factor variation</li>
              <li>Scope of emissions (operational only or including embodied)</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Energy per Token (<xref target="energy-per-token"/>)</dd>
          </dl>
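          <t>A minimal sketch of the calculation, assuming an illustrative
          energy-per-token value and grid carbon factor:</t>

```python
# Illustrative only: carbon intensity per token from energy per token and
# a grid carbon factor.  Both input values below are assumptions.
energy_per_token_j = 4.0          # J/tok, e.g. from an energy-per-token measurement
grid_factor_g_per_kwh = 400.0     # gCO2e/kWh for the local grid

J_PER_KWH = 3.6e6                 # 1 kWh = 3.6 megajoules

carbon_per_token_g = (energy_per_token_j / J_PER_KWH) * grid_factor_g_per_kwh
print(f"{carbon_per_token_g:.2e} gCO2e/tok")
```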
        </section>
      </section>

      <section anchor="economic-metrics" numbered="true" toc="default">
        <name>Economic and Cost Metrics</name>
        
        <section anchor="cost-per-million" numbered="true" toc="default">
          <name>Cost per Million Tokens</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The monetary cost to generate one million output tokens.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Cost includes compute (hardware amortization or rental), energy,
            and operations. Testers <bcp14>MUST</bcp14> specify included cost components.</t>
            
            <t>Cloud pricing provides a market cost reference. Self-hosted
            deployments require cost modeling.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>currency per million tokens (e.g., $/Mtok)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Cost component inclusion</li>
              <li>Pricing model assumptions</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>GPU-Hours per Request (<xref target="gpu-hours"/>),
            Throughput-Cost Ratio (<xref target="throughput-cost"/>)</dd>
          </dl>
        </section>

        <section anchor="gpu-hours" numbered="true" toc="default">
          <name>GPU-Hours per Request</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The accelerator time consumed to complete one request.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>GPU-hours measures resource consumption independent of hardware
            cost. For multi-GPU deployments, report aggregate GPU-hours.</t>
            
            <t>GPU-hours equals end-to-end latency, expressed in hours,
            multiplied by the number of GPUs allocated to the request.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>GPU-hours</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Fractional allocation in multi-tenant systems</li>
              <li>GPU count for distributed inference</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>End-to-End Latency (<xref target="end-to-end-latency"/>),
            Cost per Million Tokens (<xref target="cost-per-million"/>)</dd>
          </dl>
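          <t>The computation above can be sketched as follows (the latency and
          GPU count are illustrative):</t>

```python
# Illustrative only: GPU-hours for one request served across multiple GPUs.
end_to_end_latency_s = 12.5   # end-to-end request latency, seconds
gpu_count = 4                 # GPUs allocated to the request

gpu_hours = (end_to_end_latency_s / 3600.0) * gpu_count
print(f"{gpu_hours:.6f} GPU-hours")
```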
        </section>

        <section anchor="throughput-cost" numbered="true" toc="default">
          <name>Throughput-Cost Ratio</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The ratio of sustainable throughput to infrastructure cost.</dd>
            
            <dt>Discussion:</dt>
            <dd>Higher ratios indicate better cost efficiency. Testers <bcp14>MUST</bcp14>
            specify the throughput metric (tokens or requests) and cost
            basis (hourly rental or amortized ownership).</dd>
            
            <dt>Unit of measurement:</dt>
            <dd>tokens per second per dollar-hour (tok/s/$h)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Cost basis selection</li>
              <li>Throughput metric selection</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Sustainable Load (<xref target="sustainable-load"/>),
            Cost per Million Tokens (<xref target="cost-per-million"/>)</dd>
          </dl>
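          <t>A minimal sketch under an hourly-rental cost basis (both figures
          are illustrative):</t>

```python
# Illustrative only: throughput-cost ratio with an hourly rental cost basis.
sustained_tput_tok_s = 5100.0   # sustainable output token throughput (tok/s)
hourly_cost_usd = 8.0           # instance rental cost ($/h)

ratio_tok_s_per_dollar_hour = sustained_tput_tok_s / hourly_cost_usd
print(f"{ratio_tok_s_per_dollar_hour:.1f} tok/s/$h")
```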
        </section>
      </section>

      <section anchor="hardware-utilization" numbered="true" toc="default">
        <name>Hardware Utilization Metrics</name>
        
        <section anchor="compute-utilization" numbered="true" toc="default">
          <name>Compute Utilization</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of available compute capacity in use.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Compute utilization indicates how effectively the system uses
            accelerator compute resources. Low utilization suggests memory
            bandwidth or scheduling bottlenecks.</t>
            
            <t>For GPUs, utilization metrics include SM occupancy and tensor
            core utilization. Testers <bcp14>MUST</bcp14> specify the utilization metric.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Multiple utilization definitions (SM, tensor core, etc.)</li>
              <li>Vendor-specific measurement tools</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Memory Bandwidth Utilization (<xref target="memory-bandwidth"/>)</dd>
          </dl>
        </section>

        <section anchor="memory-bandwidth" numbered="true" toc="default">
          <name>Memory Bandwidth Utilization</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of peak memory bandwidth consumed.</dd>
            
            <dt>Discussion:</dt>
            <dd>LLM decode is memory-bandwidth-bound on current hardware. High
            bandwidth utilization indicates the system is limited by memory
            throughput rather than compute.</dd>
            
            <dt>Unit of measurement:</dt>
            <dd>percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Measurement tool availability</li>
              <li>Cache effects on apparent bandwidth</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Compute Utilization (<xref target="compute-utilization"/>)</dd>
          </dl>
        </section>

        <section anchor="kv-cache-memory" numbered="true" toc="default">
          <name>KV Cache Memory</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The memory consumed by key-value cache for active requests.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>KV cache memory grows with batch size and sequence length. Memory
            exhaustion limits concurrent request capacity.</t>
            
            <t>KV cache memory equals:
            batch_size * sequence_length * num_layers * 2 * hidden_dim * precision_bytes</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>gigabytes (GB) or percentage of accelerator memory</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Variation with batch composition</li>
              <li>Paged vs contiguous allocation</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>KV Cache Swap Rate (<xref target="kv-cache-swap-rate"/>),
            Page Fault Latency (<xref target="page-fault-latency"/>)</dd>
          </dl>
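          <t>The formula in the discussion can be evaluated directly. The model
          shape below is illustrative; the factor of 2 accounts for keys and
          values, and the formula assumes standard multi-head attention
          (grouped-query attention replaces hidden_dim with the smaller
          KV-head dimension):</t>

```python
# Illustrative only: KV cache footprint from the formula in the discussion.
batch_size = 32
sequence_length = 2048    # cached tokens per request
num_layers = 32
hidden_dim = 4096
precision_bytes = 2       # FP16/BF16 KV cache

kv_bytes = (batch_size * sequence_length * num_layers
            * 2 * hidden_dim * precision_bytes)
print(f"{kv_bytes / 2**30:.1f} GiB")
```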
        </section>

        <section anchor="memory-fragmentation" numbered="true" toc="default">
          <name>Memory Fragmentation</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of free memory unusable for new allocations due to
            non-contiguous layout.</dd>
            
            <dt>Discussion:</dt>
            <dd>Memory fragmentation reduces effective capacity. Paged attention
            systems achieve low fragmentation through fine-grained allocation.</dd>
            
            <dt>Unit of measurement:</dt>
            <dd>percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Allocator-specific measurement</li>
              <li>Definition of "unusable" depends on minimum allocation size</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>KV Cache Memory (<xref target="kv-cache-memory"/>)</dd>
          </dl>
        </section>
      </section>

      <section anchor="distributed-metrics" numbered="true" toc="default">
        <name>Distributed Serving Metrics</name>
        
        <section anchor="tensor-parallel-efficiency" numbered="true" toc="default">
          <name>Tensor Parallel Efficiency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The ratio of achieved throughput to ideal linear scaling with
            tensor parallelism degree.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Tensor parallelism partitions model layers across accelerators.
            Communication overhead reduces efficiency below ideal scaling.</t>
            
            <t>Efficiency equals throughput with N GPUs divided by (N times
            single-GPU throughput).</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Baseline single-GPU measurement</li>
              <li>Communication topology effects</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Pipeline Parallel Efficiency (<xref target="pipeline-parallel-efficiency"/>),
            Collective Communication Latency (<xref target="collective-latency"/>)</dd>
          </dl>
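          <t>A minimal sketch of the efficiency calculation (the throughput
          figures are illustrative):</t>

```python
# Illustrative only: tensor parallel scaling efficiency.
single_gpu_tput = 1500.0   # baseline throughput on one GPU (tok/s)
n_gpus = 4
measured_tput = 5100.0     # throughput at tensor parallelism degree 4 (tok/s)

efficiency = measured_tput / (n_gpus * single_gpu_tput)
print(f"{efficiency:.1%}")   # fraction of ideal linear scaling
```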
        </section>

        <section anchor="pipeline-parallel-efficiency" numbered="true" toc="default">
          <name>Pipeline Parallel Efficiency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The ratio of achieved throughput to ideal linear scaling with
            pipeline parallelism depth.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Pipeline parallelism partitions model layers into sequential
            stages. Pipeline bubbles (idle time during fill and drain) reduce
            efficiency.</t>
            
            <t>Bubble fraction equals (P-1)/(P-1+M) where P is pipeline depth
            and M is microbatch count.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Microbatch count selection</li>
              <li>Stage imbalance effects</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Tensor Parallel Efficiency (<xref target="tensor-parallel-efficiency"/>)</dd>
          </dl>
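          <t>The bubble-fraction formula shows why microbatch count matters; a
          sketch with illustrative values:</t>

```python
# Illustrative only: pipeline bubble fraction (P-1)/(P-1+M) from the discussion.
def bubble_fraction(p: int, m: int) -> float:
    """Fraction of pipeline time lost to fill/drain bubbles."""
    return (p - 1) / (p - 1 + m)

depth = 4
for microbatches in (1, 4, 16):
    # More microbatches amortize the fixed fill/drain cost.
    print(microbatches, f"{bubble_fraction(depth, microbatches):.3f}")
```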
        </section>

        <section anchor="expert-load-imbalance" numbered="true" toc="default">
          <name>Expert Parallel Load Imbalance</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The deviation of token routing from uniform distribution across
            experts in Mixture-of-Experts models.</dd>
            
            <dt>Discussion:</dt>
            <dd>Load imbalance causes some experts to become bottlenecks while
            others are underutilized. Imbalance metrics include coefficient
            of variation or max/mean ratio.</dd>
            
            <dt>Unit of measurement:</dt>
            <dd>dimensionless ratio or percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Imbalance metric selection</li>
              <li>Dynamic variation across batches</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Tensor Parallel Efficiency (<xref target="tensor-parallel-efficiency"/>)</dd>
          </dl>
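          <t>Both imbalance metrics mentioned in the discussion can be computed
          from per-expert token counts; the counts below are illustrative:</t>

```python
# Illustrative only: coefficient of variation and max/mean ratio over the
# token counts routed to each expert in one batch.
token_counts = [120, 95, 210, 75, 130, 88, 160, 102]

mean = sum(token_counts) / len(token_counts)
variance = sum((c - mean) ** 2 for c in token_counts) / len(token_counts)
coeff_of_variation = variance ** 0.5 / mean
max_over_mean = max(token_counts) / mean

print(f"CV={coeff_of_variation:.3f} max/mean={max_over_mean:.3f}")
```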
        </section>

        <section anchor="collective-latency" numbered="true" toc="default">
          <name>Collective Communication Latency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The latency of collective operations (all-reduce, all-gather)
            used in distributed inference.</dd>
            
            <dt>Discussion:</dt>
            <dd>Collective communication occurs between forward pass segments in
            tensor-parallel inference. This latency adds to per-token
            generation time.</dd>
            
            <dt>Unit of measurement:</dt>
            <dd>microseconds (us) or milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Operation type specification</li>
              <li>Message size effects</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Tensor Parallel Efficiency (<xref target="tensor-parallel-efficiency"/>),
            Inter-Token Latency (<xref target="itl"/>)</dd>
          </dl>
        </section>
      </section>

      <section anchor="quantization-metrics" numbered="true" toc="default">
        <name>Quantization and Precision Metrics</name>
        
        <section anchor="precision-mode" numbered="true" toc="default">
          <name>Precision Mode</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The numerical representation for model weights and activations.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Common precision modes include FP32, FP16, BF16, FP8, INT8, and
            INT4. Lower precision reduces memory and increases throughput at
            potential accuracy cost.</t>
            
            <t>Mixed precision uses different precisions for different tensors.
            Testers <bcp14>MUST</bcp14> specify precision for weights, activations, and KV
            cache separately if they differ.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>not applicable (categorical specification)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Mixed precision specification</li>
              <li>KV cache precision separate from weights</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Quantization Speedup (<xref target="quantization-speedup"/>),
            Quantization Accuracy Impact (<xref target="quantization-accuracy"/>)</dd>
          </dl>
        </section>

        <section anchor="quantization-speedup" numbered="true" toc="default">
          <name>Quantization Speedup</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The throughput improvement from quantization relative to a
            baseline precision.</dd>
            
            <dt>Discussion:</dt>
            <dd>Speedup equals quantized throughput divided by baseline
            throughput. Testers <bcp14>MUST</bcp14> specify the baseline precision.</dd>
            
            <dt>Unit of measurement:</dt>
            <dd>dimensionless ratio</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Baseline precision selection</li>
              <li>Workload effects on speedup</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Precision Mode (<xref target="precision-mode"/>),
            Quantization Memory Reduction (<xref target="quantization-memory"/>)</dd>
          </dl>
        </section>

        <section anchor="quantization-memory" numbered="true" toc="default">
          <name>Quantization Memory Reduction</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The reduction in model memory from quantization relative to a
            baseline precision.</dd>
            
            <dt>Discussion:</dt>
            <dd>Memory reduction enables larger batch sizes or longer sequences.
            Reduction equals 1 minus (quantized size divided by baseline
            size).</dd>
            
            <dt>Unit of measurement:</dt>
            <dd>percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Baseline precision selection</li>
              <li>Overhead from quantization metadata</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Precision Mode (<xref target="precision-mode"/>),
            KV Cache Memory (<xref target="kv-cache-memory"/>)</dd>
          </dl>
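          <t>A sketch of the reduction calculation for INT4 weights against an
          FP16 baseline, including per-group scale metadata (the parameter
          count, group size, and metadata format are illustrative
          assumptions):</t>

```python
# Illustrative only: memory reduction from INT4 quantization of FP16 weights,
# with one FP16 scale per 128-weight group as metadata overhead.
params = 7e9                       # parameter count
baseline_bytes = params * 2        # FP16 baseline: 2 bytes per weight

int4_bytes = params * 0.5          # 4 bits per weight
scale_bytes = (params / 128) * 2   # per-group FP16 scales
quantized_bytes = int4_bytes + scale_bytes

reduction = 1.0 - quantized_bytes / baseline_bytes
print(f"{reduction:.1%}")
```

          <t>The scale metadata illustrates the overhead issue listed above:
          without it the nominal reduction would be 75%.</t>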
        </section>

        <section anchor="quantization-accuracy" numbered="true" toc="default">
          <name>Quantization Accuracy Impact</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The change in model quality metrics due to quantization.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Quantization may degrade accuracy. Impact is measured on
            task-specific benchmarks or perplexity.</t>
            
            <t>Testers <bcp14>MUST</bcp14> specify the evaluation benchmark and baseline.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>percentage points or absolute score change</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Benchmark selection</li>
              <li>Task-specific variation</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Precision Mode (<xref target="precision-mode"/>)</dd>
          </dl>
        </section>
      </section>

      <section anchor="operational-metrics" numbered="true" toc="default">
        <name>Operational Lifecycle Metrics</name>
        
        <section anchor="model-load-time" numbered="true" toc="default">
          <name>Model Load Time</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The elapsed time from process start to model readiness for
            serving.</dd>
            
            <dt>Discussion:</dt>
            <dd>Load time includes weight loading from storage, memory
            allocation, and initialization. Load time affects scale-up
            responsiveness in autoscaling deployments.</dd>
            
            <dt>Unit of measurement:</dt>
            <dd>seconds (s)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Storage medium effects (local vs network)</li>
              <li>Parallelism in loading</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Cold Start Latency (<xref target="cold-start"/>),
            Scale-Up Latency (<xref target="scale-up"/>)</dd>
          </dl>
        </section>

        <section anchor="cold-start" numbered="true" toc="default">
          <name>Cold Start Latency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The latency of the first request after system initialization.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>Cold start latency exceeds steady-state latency due to JIT
            compilation, cache population, and memory allocation.</t>
            
            <t>Cold start affects user experience for scale-to-zero deployments
            and after restarts.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds (ms)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Definition of cold (first request vs first after idle)</li>
              <li>JIT compilation effects</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Model Load Time (<xref target="model-load-time"/>),
            Warm-up Duration (<xref target="warmup-duration"/>)</dd>
          </dl>
        </section>

        <section anchor="warmup-duration" numbered="true" toc="default">
          <name>Warm-up Duration</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The time or request count until latency stabilizes at steady
            state.</dd>
            
            <dt>Discussion:</dt>
            <dd>Warm-up accounts for JIT compilation, cache warming, and
            allocator stabilization. Testers <bcp14>SHOULD</bcp14> exclude warm-up from
            steady-state measurements.</dd>
            
            <dt>Unit of measurement:</dt>
            <dd>seconds (s) or request count</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Stability definition</li>
              <li>Workload effects on warm-up</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Cold Start Latency (<xref target="cold-start"/>)</dd>
          </dl>
        </section>

        <section anchor="scale-up" numbered="true" toc="default">
          <name>Scale-Up Latency</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The elapsed time to add serving capacity and begin handling
            traffic.</dd>
            
            <dt>Discussion:</dt>
            <dd>Scale-up latency affects autoscaling responsiveness. It includes
            instance provisioning, model loading, and warm-up.</dd>
            
            <dt>Unit of measurement:</dt>
            <dd>seconds (s)</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Infrastructure-dependent (containers, VMs, bare metal)</li>
              <li>Warm traffic routing timing</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Model Load Time (<xref target="model-load-time"/>),
            Cold Start Latency (<xref target="cold-start"/>)</dd>
          </dl>
        </section>
      </section>

      <section anchor="long-context-metrics" numbered="true" toc="default">
        <name>Long-Context Metrics</name>
        
        <section anchor="max-context-length" numbered="true" toc="default">
          <name>Maximum Context Length</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The maximum combined input and output token count supported by
            the system.</dd>
            
            <dt>Discussion:</dt>
            <dd>Context length is limited by model architecture, memory
            capacity, and serving configuration. Systems may accept
            contexts longer than the advertised maximum, at the cost of
            degraded performance.</dd>
            
            <dt>Unit of measurement:</dt>
            <dd>tokens</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Architectural vs memory-constrained limits</li>
              <li>Performance degradation at maximum length</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Context Window Utilization (<xref target="context-window-util"/>)</dd>
          </dl>
        </section>

        <section anchor="context-window-util" numbered="true" toc="default">
          <name>Context Window Utilization</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The fraction of maximum context length consumed by a request.</dd>
            
            <dt>Discussion:</dt>
            <dd>Utilization equals (input tokens plus output tokens) divided by
            maximum context length. High utilization stresses memory and may
            degrade performance.</dd>
            
            <dt>Unit of measurement:</dt>
            <dd>percentage</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Maximum length definition</li>
              <li>Performance effects at high utilization</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Maximum Context Length (<xref target="max-context-length"/>)</dd>
          </dl>
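
          <t>As a worked example of the formula in the Discussion above
          (illustrative only):</t>

          <sourcecode type="python"><![CDATA[
def context_window_utilization(input_tokens, output_tokens, max_context):
    """Fraction of the maximum context length consumed by a request,
    as a percentage: (input + output) / max_context * 100."""
    return 100.0 * (input_tokens + output_tokens) / max_context

# A 3000-token prompt with a 1000-token completion against an
# 8192-token maximum consumes about 48.8% of the context window.
]]></sourcecode>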
        </section>

        <section anchor="long-context-scaling" numbered="true" toc="default">
          <name>Long-Context Latency Scaling</name>
          
          <dl>
            <dt>Definition:</dt>
            <dd>The rate at which latency increases with context length.</dd>
            
            <dt>Discussion:</dt>
            <dd><t>With standard attention, prefill latency grows
            super-linearly with input length, approaching quadratic as the
            self-attention cost (which scales with the square of the
            sequence length) comes to dominate the feed-forward cost.
            Per-token decode latency grows with total sequence length due
            to growing KV cache access.</t>
            
            <t>Efficient attention mechanisms (sparse, linear) may achieve
            sub-quadratic scaling.</t></dd>
            
            <dt>Unit of measurement:</dt>
            <dd>milliseconds per token (ms/tok) or scaling exponent</dd>
            
            <dt>Issues:</dt>
            <dd><ul>
              <li>Prefill vs decode scaling differs</li>
              <li>Architecture-dependent scaling</li>
            </ul></dd>
            
            <dt>See also:</dt>
            <dd>Prefill Latency (<xref target="prefill-latency"/>),
            Decode Latency (<xref target="decode-latency"/>)</dd>
          </dl>
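
          <t>The scaling exponent named in the unit of measurement can be
          estimated empirically as the least-squares slope of log latency
          versus log context length. The sketch below shows one common way
          to compute it; the procedure is illustrative, not mandated by
          this document:</t>

          <sourcecode type="python"><![CDATA[
import math

def scaling_exponent(context_lengths, latencies):
    """Least-squares slope of log(latency) vs. log(context length).
    An exponent near 1 indicates linear scaling; near 2, quadratic."""
    xs = [math.log(n) for n in context_lengths]
    ys = [math.log(t) for t in latencies]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
]]></sourcecode>

          <t>Because prefill and decode scale differently, the exponent
          is most meaningful when fitted to each phase separately.</t>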
        </section>
      </section>
    </section>

    <section anchor="cross-reference" numbered="true" toc="default">
      <name>Cross-Reference Index</name>
      
      <t>This index groups related metrics to aid navigation.</t>
      
      <t>Latency metrics:</t>
      <ul>
        <li>End-to-End Latency (<xref target="end-to-end-latency"/>)</li>
        <li>Time to First Token (<xref target="ttft"/>)</li>
        <li>Inter-Token Latency (<xref target="itl"/>)</li>
        <li>Time Between Tokens (<xref target="tbt"/>)</li>
        <li>Time per Output Token (<xref target="tpot"/>)</li>
        <li>Normalized Latency (<xref target="normalized-latency"/>)</li>
        <li>Prefill Latency (<xref target="prefill-latency"/>)</li>
        <li>Decode Latency (<xref target="decode-latency"/>)</li>
      </ul>
      
      <t>Throughput metrics:</t>
      <ul>
        <li>Output Token Throughput (<xref target="output-token-throughput"/>)</li>
        <li>Input Token Throughput (<xref target="input-token-throughput"/>)</li>
        <li>Request Throughput (<xref target="request-throughput"/>)</li>
        <li>Non-Padding Token Throughput (<xref target="non-padding-throughput"/>)</li>
        <li>Goodput (<xref target="goodput"/>)</li>
      </ul>
      
      <t>Distribution metrics:</t>
      <ul>
        <li>Latency Percentiles (<xref target="latency-percentiles"/>)</li>
        <li>Latency Distribution (<xref target="latency-distribution-full"/>)</li>
        <li>Token Delivery Jitter (<xref target="token-delivery-jitter"/>)</li>
        <li>Maximum Pause Duration (<xref target="max-pause-duration"/>)</li>
      </ul>
      
      <t>Scheduling metrics:</t>
      <ul>
        <li>Head-of-Line Blocking (<xref target="hol-blocking"/>)</li>
        <li>Queue Depth (<xref target="queue-depth"/>)</li>
        <li>Queue Wait Time (<xref target="queue-wait-time"/>)</li>
        <li>Fairness Index (<xref target="fairness-index"/>)</li>
        <li>Batch Utilization (<xref target="batch-utilization"/>)</li>
        <li>Admission Rate (<xref target="admission-rate"/>)</li>
      </ul>
      
      <t>Resource management metrics:</t>
      <ul>
        <li>Preemption Rate (<xref target="preemption-rate"/>)</li>
        <li>Preemption Loss (<xref target="preemption-loss"/>)</li>
        <li>Starvation Rate (<xref target="starvation-rate"/>)</li>
        <li>Preemption Recovery Latency (<xref target="preemption-recovery-latency"/>)</li>
        <li>KV Cache Swap Rate (<xref target="kv-cache-swap-rate"/>)</li>
        <li>Page Fault Latency (<xref target="page-fault-latency"/>)</li>
      </ul>
      
      <t>Caching metrics:</t>
      <ul>
        <li>Prefix Cache Hit Rate (<xref target="prefix-cache-hit-rate"/>)</li>
        <li>Prefix Cache Capacity (<xref target="prefix-cache-capacity"/>)</li>
        <li>Cache Eviction Rate (<xref target="cache-eviction-rate"/>)</li>
        <li>TTFT Reduction from Caching (<xref target="ttft-reduction-caching"/>)</li>
      </ul>
      
      <t>Speculative decoding metrics:</t>
      <ul>
        <li>Draft Acceptance Rate (<xref target="draft-acceptance-rate"/>)</li>
        <li>Speculative Speedup (<xref target="speculative-speedup"/>)</li>
        <li>Draft Overhead (<xref target="draft-overhead"/>)</li>
        <li>Verification Throughput (<xref target="verification-throughput"/>)</li>
      </ul>
      
      <t>RAG metrics:</t>
      <ul>
        <li>Embedding Latency (<xref target="embedding-latency"/>)</li>
        <li>Retrieval Latency (<xref target="retrieval-latency"/>)</li>
        <li>Retrieval Recall (<xref target="retrieval-recall"/>)</li>
        <li>Context Injection Overhead (<xref target="context-injection-overhead"/>)</li>
        <li>Context Utilization Rate (<xref target="context-utilization-rate"/>)</li>
      </ul>
      
      <t>Agentic metrics:</t>
      <ul>
        <li>Task Completion Latency (<xref target="task-completion-latency"/>)</li>
        <li>Sub-Request Count (<xref target="sub-request-count"/>)</li>
        <li>Loop Incidence Rate (<xref target="loop-incidence-rate"/>)</li>
        <li>Tool Execution Latency (<xref target="tool-execution-latency"/>)</li>
        <li>Agentic Goodput (<xref target="agentic-goodput"/>)</li>
      </ul>
      
      <t>Quality metrics:</t>
      <ul>
        <li>Policy Violation Rate (<xref target="policy-violation-rate"/>)</li>
        <li>False Refusal Rate (<xref target="false-refusal-rate"/>)</li>
        <li>Guardrail Processing Overhead (<xref target="guardrail-overhead"/>)</li>
      </ul>
      
      <t>SLO metrics:</t>
      <ul>
        <li>Service Level Objective (<xref target="slo"/>)</li>
        <li>SLO Attainment Rate (<xref target="slo-attainment"/>)</li>
        <li>Goodput (<xref target="goodput"/>)</li>
      </ul>
    </section>
  </back>
</rfc>
