This document defines terminology for benchmarking the performance of Large Language Model (LLM) inference serving systems. It establishes a shared vocabulary for latency, throughput, resource utilization, and quality metrics applicable to inference engines, application gateways, and compound agentic systems. This document defines terminology only and does not prescribe benchmark methodologies or acceptance thresholds.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 5 July 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.¶
Large Language Model inference serving has emerged as a distinct category of network service with performance characteristics unlike traditional request-response systems. The autoregressive generation process produces output tokens sequentially, creating a streaming response pattern where latency has multiple meaningful definitions. The prefill and decode phases exhibit different computational profiles. Memory consumption grows with sequence length due to key-value cache requirements.¶
Large Language Model serving systems are increasingly deployed as Internet-facing services, often exposed via standardized APIs and shared infrastructure. Their performance characteristics influence availability, fairness, and side-channel risk in multi-tenant environments. Establishing consistent benchmarking terminology enables clearer communication among implementers, operators, and researchers.¶
Despite widespread deployment of LLM serving systems, no standard terminology exists for describing their performance. Different implementations, benchmarks, and academic publications use inconsistent definitions for terms such as "throughput," "latency," and "tokens per second." This inconsistency hinders meaningful comparison across systems and creates confusion for practitioners.¶
This document addresses the terminology gap by providing precise definitions for LLM serving performance metrics. The structure and approach follow [RFC2647], which established benchmarking terminology for firewall performance. Each term includes a definition, discussion of context and implementation considerations, unit of measurement, open issues, and cross-references to related terms.¶
This document defines terminology only. It does not specify benchmark methodologies, workload profiles, or acceptance criteria. Companion documents may address those topics.¶
The metrics in this document apply to transformer-based autoregressive language models. Other model architectures such as diffusion models, encoder-only models, or non-autoregressive decoders require different terminology not covered here.¶
A prerequisite for benchmarking is defining the System Under Test (SUT) boundary. The same metric measured at different boundaries yields different values. This document identifies three SUT boundary categories: the inference engine alone, the application gateway together with the inference engines behind it, and the compound system encompassing retrieval components, tools, and agentic orchestration.¶
Testers MUST declare the SUT boundary when reporting metrics. Metrics from different SUT boundaries MUST NOT be directly compared without adjustment.¶
The measurement point within the SUT also affects results. For latency metrics, testers MUST specify whether measurement occurs at the client, at the network edge, or within the serving infrastructure.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
End-to-end latency encompasses all processing stages: network transmission, queuing, prefill computation, decode iterations, output filtering, and return transmission. For streaming responses, end-to-end latency is measured until the final token is received.¶
End-to-end latency depends on output length. Longer responses require more decode iterations and produce higher latency. When comparing systems, testers SHOULD control for output length or report latency normalized by output token count.¶
Client-side measurement includes network round-trip time. Server-side measurement excludes external network latency but may still include internal network hops within a distributed serving system.¶
Time to First Token (TTFT) measures how long a user waits before any response content appears. For interactive applications, TTFT determines perceived responsiveness independent of total response length.¶
TTFT includes network transmission, authentication, admission control, queue wait time, and prefill computation. Under low load with short prompts, prefill dominates TTFT. Under high load, queue wait time may dominate. With long prompts, prefill computation scales with input token count.¶
For non-streaming responses, TTFT equals end-to-end latency because all tokens arrive together. Testers SHOULD specify whether the response mode is streaming or non-streaming.¶
Some systems emit an empty or whitespace-only first token before substantive content. Testers MUST specify whether TTFT measures time to any token or time to first non-empty token.¶
Inter-Token Latency (ITL) measures the generation interval between adjacent tokens during the decode phase. ITL reflects decode efficiency and is affected by batch size, model architecture, and memory bandwidth.¶
ITL varies across tokens within a single request due to batching dynamics. When other requests join or leave the batch, the per-request compute allocation changes. Testers SHOULD report ITL distribution statistics rather than a single value.¶
ITL is measured server-side and excludes network transmission delay. For client-observed intervals, see Time Between Tokens (Section 4.1.4).¶
When aggregating ITL across requests, token-weighted averaging counts each token equally. Request-weighted averaging counts each request equally regardless of length. These methods yield different results. Testers MUST specify the aggregation method.¶
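The following non-normative Python sketch illustrates how the two aggregation methods can diverge; the variable names are illustrative only.¶
   # Non-normative illustration of ITL aggregation methods.
   # per_request_itls: one list of inter-token latencies (seconds)
   # per request.

   def token_weighted_mean(per_request_itls):
       # Every inter-token interval counts equally.
       intervals = [itl for req in per_request_itls for itl in req]
       return sum(intervals) / len(intervals)

   def request_weighted_mean(per_request_itls):
       # Every request counts equally, regardless of output length.
       means = [sum(req) / len(req) for req in per_request_itls]
       return sum(means) / len(means)

   # A long, slow request dominates the token-weighted mean but
   # counts once in the request-weighted mean.
   example = [[0.02, 0.02, 0.02], [0.08] * 300]
   token_weighted_mean(example)    # approximately 0.079
   request_weighted_mean(example)  # 0.05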
Time Between Tokens (TBT) measures the client-observed interval between adjacent tokens. TBT equals ITL plus network transmission variability and buffering effects.¶
Network jitter, TCP buffering, and intermediary proxies cause TBT to differ from ITL. Multiple tokens may arrive in a single network packet, producing near-zero TBT followed by a longer gap.¶
TBT directly affects user-perceived streaming smoothness. High TBT variance creates a "stuttering" appearance even when average TBT is acceptable.¶
The average time to generate each output token after the first token, computed as:¶
TPOT = (End-to-End Latency - TTFT) / (Output Token Count - 1)¶
Time per Output Token (TPOT) summarizes decode-phase performance in a single value. Unlike ITL, which measures each interval, TPOT averages across all decode steps for a request.¶
TPOT excludes the first token because TTFT captures that interval separately. For single-token outputs, TPOT is undefined.¶
TPOT is request-weighted by construction: each request contributes one TPOT value regardless of output length. When aggregating across requests, report the distribution rather than only the mean.¶
TPOT relates to user-perceived generation speed. A TPOT of 50ms corresponds to 20 tokens per second, approximately 900 words per minute for English text.¶
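As a non-normative sketch, the per-request latency metrics above can be derived from client-recorded timestamps as follows; the timestamp names are illustrative assumptions.¶
   # Non-normative sketch: derive TTFT, end-to-end latency, and TPOT
   # from a client-side timestamp of request submission and the
   # arrival times of streamed tokens (all in seconds).

   def latency_metrics(request_sent, token_arrival_times):
       ttft = token_arrival_times[0] - request_sent
       e2e = token_arrival_times[-1] - request_sent
       n_out = len(token_arrival_times)
       # TPOT is undefined for single-token outputs.
       tpot = (e2e - ttft) / (n_out - 1) if n_out > 1 else None
       return ttft, e2e, tpot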
Normalized latency enables comparison across requests with different output lengths. It amortizes fixed overhead (TTFT) across all tokens.¶
For short outputs, TTFT dominates normalized latency. For long outputs, TPOT dominates. Testers SHOULD report output length distribution alongside normalized latency to enable interpretation.¶
Normalized latency obscures user-facing behavior because users experience TTFT and TPOT separately. Testers SHOULD NOT report normalized latency as the sole latency metric.¶
Prefill performs a forward pass over all input tokens to populate the key-value cache. This computation is parallelizable across the input sequence and is compute-bound on current hardware.¶
Prefill latency scales approximately linearly with input token count for uncached inputs. With prefix caching enabled, prefill processes only the uncached suffix, reducing latency for requests sharing common prefixes.¶
Prefill latency is a component of TTFT. Other TTFT components include queue wait time, network latency, and authentication overhead.¶
Chunked prefill implementations split long prefills into smaller segments interleaved with decode steps. This reduces head-of-line blocking but increases total prefill time. Testers MUST specify whether chunked prefill is enabled.¶
Decode latency equals (Output Token Count - 1) multiplied by average ITL. It measures total time in the decode phase.¶
Decode is memory-bandwidth-bound on current hardware because each step reads the full model weights and growing key-value cache while producing a single token.¶
Decode latency grows linearly with output token count under constant batching conditions. Variable batch membership during generation causes decode latency to deviate from simple linear scaling.¶
Output token throughput measures system-wide generation capacity. It increases with offered load until the system saturates, then plateaus or declines.¶
Throughput measurement requires specifying the token counting method. Subword tokenization produces different counts than word-level tokenization for the same text. Testers MUST identify the tokenizer used.¶
Some implementations pad sequences to uniform length within a batch. Testers MUST specify whether throughput counts include or exclude padding tokens.¶
Throughput measurement requires a defined time window. Short windows capture instantaneous throughput. Longer windows smooth over load variations. Testers MUST specify the measurement window duration.¶
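A non-normative sketch of output token throughput over a declared measurement window follows; the completion log format is an illustrative assumption, and attributing a request's tokens to its completion time is a simplification.¶
   # Non-normative sketch: output token throughput over a fixed
   # measurement window. "completions" is a list of
   # (completion_time_s, output_token_count) pairs.

   def output_token_throughput(completions, window_start_s, window_end_s):
       tokens = sum(count for t, count in completions
                    if window_start_s <= t < window_end_s)
       return tokens / (window_end_s - window_start_s)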
Input token throughput measures prefill capacity. Systems with efficient batched prefill achieve higher input throughput than those processing prompts sequentially.¶
Input throughput differs from output throughput because prefill and decode have different computational characteristics. A system optimized for long-context prefill may show high input throughput but lower output throughput, and vice versa.¶
Prefix caching affects input throughput measurement. With caching, input tokens divide into cache hits (not processed) and cache misses (processed). Testers MUST specify whether input throughput counts all input tokens or only cache misses.¶
Request throughput counts completed requests regardless of token counts. A system completing many short requests achieves higher request throughput than one completing fewer long requests, even at equal token throughput.¶
Request throughput is relevant for capacity planning when requests have predictable lengths or when per-request overhead dominates.¶
Failed requests require specified handling. Testers MUST specify whether request throughput includes only successful completions or also counts failures.¶
Statically batched inference pads sequences to uniform length. Padding tokens consume compute but carry no information. Non-padding throughput measures useful work.¶
Systems using variable-length batching or continuous batching may avoid padding entirely. For these systems, non-padding throughput equals total output throughput.¶
Offered load characterizes workload intensity. Two forms exist:¶
Open-loop load specifies a request arrival rate (requests per second) independent of system response. New requests arrive according to a specified distribution regardless of outstanding request count.¶
Closed-loop load specifies a fixed concurrency level. Each completed request triggers a new request, maintaining constant outstanding requests.¶
Open-loop load reveals system behavior under overload. Closed-loop load cannot exceed system capacity by construction. Testers MUST specify the load model and its parameters.¶
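The two load models can be generated as in the following non-normative sketch, where send_request is an assumed coroutine that issues one request to the SUT and returns on completion.¶
   # Non-normative sketch of open-loop and closed-loop load generation.
   import asyncio
   import random

   async def open_loop(send_request, rate_rps, duration_s):
       # Requests arrive as a Poisson process (exponential
       # inter-arrival times) regardless of outstanding requests.
       loop = asyncio.get_running_loop()
       deadline = loop.time() + duration_s
       tasks = []
       while loop.time() < deadline:
           tasks.append(asyncio.create_task(send_request()))
           await asyncio.sleep(random.expovariate(rate_rps))
       await asyncio.gather(*tasks)

   async def closed_loop(send_request, concurrency, requests_per_worker):
       # Each completion triggers the next request, holding the number
       # of outstanding requests constant at "concurrency".
       async def worker():
           for _ in range(requests_per_worker):
               await send_request()
       await asyncio.gather(*(worker() for _ in range(concurrency)))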
Sustainable load identifies the operating region boundary. Below sustainable load, the system meets latency and quality targets. Above sustainable load, latency increases unboundedly or requests fail.¶
Sustainable load depends on the service objectives. Stricter latency targets yield lower sustainable load. Testers MUST declare the service objectives when reporting sustainable load.¶
Sustainable load also depends on workload characteristics. Longer prompts or outputs reduce sustainable load. Testers MUST characterize the workload profile.¶
Mean latency obscures distribution shape. A system with low mean but high variance provides inconsistent user experience. Percentiles characterize the distribution.¶
P50 (median) indicates typical experience. P99 indicates worst-case experience for most users. P99.9 indicates extreme tail behavior relevant for high-volume services.¶
Percentile computation requires sufficient sample size. For P99 accuracy, at least 1000 samples are needed. For P99.9, at least 10000 samples. Testers MUST report sample size alongside percentiles.¶
Percentiles apply to TTFT, TPOT, ITL, and end-to-end latency. Testers SHOULD report percentiles for multiple metrics rather than a single summary.¶
Full distributions reveal structure that percentiles miss. Multimodal distributions indicate distinct operating regimes. Heavy tails indicate outlier sensitivity.¶
Histogram bin width affects resolution. Narrow bins reveal detail but require more samples. Testers SHOULD use logarithmic binning for latency distributions spanning multiple orders of magnitude.¶
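A non-normative sketch of logarithmic binning for latency distributions follows; it assumes strictly positive latency samples.¶
   # Non-normative sketch: logarithmically spaced histogram bins for a
   # latency distribution spanning multiple orders of magnitude.
   import numpy as np

   def log_histogram(latencies_s, bins_per_decade=10):
       lo, hi = min(latencies_s), max(latencies_s)
       n_bins = max(int(np.ceil(np.log10(hi / lo) * bins_per_decade)), 1)
       edges = np.logspace(np.log10(lo), np.log10(hi), n_bins + 1)
       counts, edges = np.histogram(latencies_s, bins=edges)
       return counts, edges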
Jitter measures streaming smoothness. Low jitter indicates consistent token pacing. High jitter indicates irregular delivery that users perceive as stuttering.¶
Jitter arises from batching dynamics, memory bandwidth contention, and garbage collection pauses. Systems with continuous batching show higher jitter than static batching due to variable batch membership.¶
Jitter is computed per-request, then aggregated. Report the distribution of per-request jitter values.¶
Maximum pause captures the worst interruption in streaming output. A single long pause degrades user experience even when average ITL is acceptable.¶
Long pauses arise from garbage collection, KV cache operations, batch recomputation after preemption, and request scheduling delays.¶
First-come-first-served scheduling causes head-of-line (HOL) blocking. A long prefill operation delays all subsequent requests regardless of their size.¶
Chunked prefill mitigates HOL blocking by limiting the maximum uninterruptible computation. Shortest-job-first scheduling eliminates HOL blocking but requires output length prediction.¶
HOL blocking is measured as the difference between observed latency and latency under an idealized scheduler with no blocking.¶
Queue depth indicates load relative to capacity. Growing queue depth signals approaching overload. Stable queue depth indicates balanced load.¶
Queue depth has multiple measurement points: admission queue, prefill queue, decode batch wait queue. Testers MUST specify the queue measured.¶
Queue wait time is a component of TTFT representing time spent waiting rather than computing. Under low load, queue wait approaches zero. Under high load, queue wait dominates TTFT.¶
Systems with multiple queues (admission, prefill, decode) have corresponding wait times. Total queue wait is the sum across stages.¶
Jain's Fairness Index quantifies allocation equality:¶
J(x) = (sum(x_i))^2 / (n * sum(x_i^2))¶
where x_i is the allocation to request or tenant i, and n is the count. J ranges from 1/n (maximally unfair) to 1 (perfectly fair).¶
Fairness applies to latency (lower is better, so use reciprocal), throughput, or SLO attainment. Testers MUST specify the measured quantity.¶
Multi-tenant systems require per-tenant fairness measurement. Single-tenant systems measure fairness across concurrent requests.¶
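A non-normative sketch of the index computation follows, here applied to hypothetical per-tenant throughput allocations.¶
   # Non-normative sketch of Jain's Fairness Index.

   def jains_index(allocations):
       n = len(allocations)
       total = sum(allocations)
       return (total * total) / (n * sum(x * x for x in allocations))

   jains_index([100, 100, 100, 100])  # 1.0 (perfectly fair)
   jains_index([370, 10, 10, 10])     # approximately 0.29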
Static batching pads sequences to uniform length. Batch utilization measures the fraction of compute applied to real tokens versus padding.¶
Continuous batching achieves high utilization by avoiding padding. For continuous batching systems, batch utilization approaches 1.0 and is less informative.¶
Admission control rejects requests to prevent overload. Admission rate below 1.0 indicates load shedding.¶
Rejected requests may receive an error response or be redirected. Testers MUST specify the rejection behavior.¶
Memory pressure or priority policies cause request preemption. Preempted requests lose their key-value cache state and must recompute it upon resumption.¶
High preemption rates indicate memory over-subscription or aggressive scheduling. Preemption degrades latency for affected requests.¶
Preemption discards key-value cache state. Upon resumption, the system recomputes prefill for all tokens (input plus previously generated output). This recomputation is wasted work.¶
Preemption loss contributes to tail latency. Requests preempted late in generation lose more work than those preempted early.¶
Starvation occurs when scheduling policies indefinitely delay certain requests. Priority inversion and unbounded queue growth cause starvation.¶
The starvation threshold depends on application requirements. Testers MUST declare the threshold when reporting starvation rate.¶
Recovery latency includes: wait time for scheduling, KV cache reload or recomputation, and prefill of previously generated tokens.¶
Systems with KV cache offloading to host memory recover faster than those requiring full recomputation. Recovery latency varies with the amount of generated output at preemption time.¶
Memory-constrained systems swap KV cache to host memory when accelerator memory is exhausted. Swapping enables higher concurrency at the cost of swap latency.¶
Swap rate indicates memory pressure. High swap rates degrade latency due to PCIe transfer overhead.¶
Paged attention systems fault in KV cache blocks on demand. Page fault latency includes PCIe transfer time from host memory or storage.¶
Fault latency increases ITL for affected decode steps. Prefetching strategies aim to hide fault latency.¶
Prefix caching stores KV cache state for common prompt prefixes. Subsequent requests sharing a cached prefix skip prefill for those tokens.¶
Hit rate depends on workload locality. Workloads with shared system prompts achieve high hit rates. Workloads with unique prompts achieve low hit rates.¶
Hit rate is computed as cached tokens divided by total input tokens across requests. Per-request hit rate varies; report the distribution.¶
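A non-normative sketch of the aggregate and per-request computations follows; the per-request counter names are illustrative assumptions about what the serving system exports.¶
   # Non-normative sketch: prefix cache hit rates. Each request record
   # carries the number of input tokens served from cache and the
   # total number of input tokens.

   def hit_rates(requests):
       per_request = [r["cached_tokens"] / r["input_tokens"]
                      for r in requests]
       aggregate = (sum(r["cached_tokens"] for r in requests)
                    / sum(r["input_tokens"] for r in requests))
       return aggregate, per_request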
Cache capacity limits the number and length of prefixes stored. Larger capacity enables more prefixes or longer prefixes at the cost of memory available for active requests.¶
Capacity is often expressed as a fraction of total accelerator memory or as maximum cacheable token count.¶
Eviction occurs when cache capacity is exhausted. High eviction rates indicate insufficient capacity for the workload's prefix diversity.¶
Eviction policies (LRU, frequency-based) affect which prefixes remain cached. Testers SHOULD specify the eviction policy.¶
This metric quantifies caching benefit. It depends on both hit rate and the length of cached prefixes.¶
Measurement requires comparing TTFT with caching enabled versus disabled, or estimating based on prefill latency per token.¶
Speculative decoding uses a smaller draft model to propose multiple tokens verified in parallel by the target model. Higher acceptance rates yield greater speedup.¶
Acceptance rate depends on draft model quality and alignment with the target model. Rates vary by domain; code and structured text achieve higher rates than creative writing.¶
Acceptance rate is computed per speculation window, then averaged. Report the distribution across windows.¶
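A non-normative sketch of the per-window computation follows; the (accepted, proposed) window log is an illustrative assumption.¶
   # Non-normative sketch: per-window acceptance rates for speculative
   # decoding. Each window records how many of the proposed draft
   # tokens the target model accepted.

   def acceptance_rates(windows):
       return [accepted / proposed for accepted, proposed in windows]

   rates = acceptance_rates([(3, 4), (1, 4), (4, 4)])
   mean_rate = sum(rates) / len(rates)  # approximately 0.67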
Speedup depends on acceptance rate and the relative cost of draft and target model inference. High acceptance with low draft cost yields high speedup.¶
Speculative decoding increases TTFT due to draft model prefill. Speedup applies to the decode phase only. End-to-end speedup is lower, especially for short outputs.¶
Draft model inference adds to prefill time and per-step decode time. This overhead must be recovered through acceptance to achieve net speedup.¶
Overhead is measured as additional latency per speculation window or as fraction of total compute.¶
Verification throughput measures the target model's capacity to check draft proposals. Higher verification throughput enables longer speculation windows.¶
Verification processes multiple tokens in parallel, achieving higher throughput than autoregressive generation of the same tokens.¶
Retrieval-Augmented Generation (RAG) systems embed queries before searching a vector store. Embedding latency adds to TTFT.¶
Embedding models are typically smaller and faster than generation models, so embedding latency is a minor TTFT component for most deployments.¶
Batched embedding of multiple queries achieves higher throughput than sequential embedding.¶
Retrieval latency includes vector similarity search, optional reranking, and document fetching. It is a component of TTFT for RAG systems.¶
Retrieval latency depends on index size, search algorithm, and number of results. Approximate nearest neighbor search trades accuracy for speed.¶
Recall measures retrieval effectiveness. Low recall causes the generation model to lack relevant context. This is a quality metric, not a performance metric, but affects overall system evaluation.¶
Measuring recall requires ground-truth relevance labels. For benchmarking without labels, use proxy metrics such as answer correctness.¶
Retrieved documents increase prompt length, increasing prefill computation. This overhead scales with retrieved token count.¶
The overhead is the difference between prefill latency with retrieved context and prefill latency without it.¶
Not all retrieved content contributes to generation. Low utilization indicates retrieval of irrelevant content, wasting context window capacity and prefill compute.¶
Utilization is difficult to measure directly. Proxy measurements include attention weight analysis and ablation studies.¶
Agentic systems perform multiple internal operations per user request. Task completion latency measures the full user-facing response time.¶
Task completion latency depends on the number of internal steps, which varies by task complexity. Simple tasks complete in one LLM call. Complex tasks require multiple calls with tool use.¶
Agentic systems decompose user requests into multiple LLM calls for planning, reasoning, and action. Sub-request count indicates system complexity and affects total latency and cost.¶
High sub-request counts indicate complex reasoning chains or retry loops. Testers SHOULD examine sub-request count distributions to identify inefficient patterns.¶
Agentic loops occur when the system repeats similar actions without advancing toward task completion. Loops indicate planning failures or tool errors.¶
Loop detection requires defining "progress" for the task domain. Common heuristics include action repetition count and state similarity thresholds.¶
Agents call external tools for information retrieval, computation, or actions. Tool latency contributes to task completion latency.¶
Tool latency varies by tool type: local computation typically completes in milliseconds, while external API calls can take seconds.¶
Agentic goodput combines completion rate and quality. A task that completes but produces incorrect results does not count toward goodput.¶
Objective definitions are task-specific. Testers MUST declare objectives and evaluation criteria.¶
Safety systems filter outputs to prevent harmful content. Policy violation rate measures filter failure.¶
Violation detection requires evaluation against policy criteria. Automated classifiers or human review provide measurements.¶
Low violation rate indicates effective filtering. Very low rates may indicate overly restrictive filtering causing false refusals.¶
Overly sensitive safety filters refuse legitimate requests. False refusals degrade user experience and system utility.¶
Measuring false refusals requires labeled benign requests. Testers MUST specify the evaluation dataset and labeling criteria.¶
False refusal rate trades off against policy violation rate. Stricter filtering reduces violations but increases false refusals.¶
Guardrails add processing before, during, or after generation. Input filters add to TTFT. Output filters add to end-to-end latency.¶
Overhead is measured by comparing latency with guardrails enabled versus disabled.¶
SLOs specify targets such as P99 TTFT below a stated threshold, P99 TPOT below a stated threshold, or error rate below a stated threshold over a defined measurement window.¶
SLOs derive from user experience requirements and business constraints. Different applications require different SLOs.¶
SLO definitions include the metric, percentile, threshold, and measurement window. Testers MUST fully specify SLOs when reporting attainment.¶
Attainment rate summarizes SLO compliance. Request-based attainment counts requests meeting SLOs. Time-based attainment counts measurement windows where aggregate metrics meet SLOs.¶
Attainment below 1.0 indicates SLO violations. The acceptable attainment level depends on SLA commitments.¶
Goodput equals total throughput multiplied by SLO attainment rate. It measures useful, compliant work rather than raw capacity.¶
Systems with high throughput but low attainment achieve low goodput. Goodput captures the throughput-quality trade-off.¶
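The following non-normative sketch computes request-based attainment against hypothetical per-request TTFT and TPOT thresholds and derives goodput from it; the thresholds are examples, not recommendations.¶
   # Non-normative sketch: request-based SLO attainment and goodput.

   TTFT_SLO_S = 0.5   # hypothetical example threshold
   TPOT_SLO_S = 0.05  # hypothetical example threshold

   def attainment(requests):
       # requests: records with per-request "ttft" and "tpot" (seconds).
       compliant = sum(1 for r in requests
                       if r["ttft"] <= TTFT_SLO_S and r["tpot"] <= TPOT_SLO_S)
       return compliant / len(requests)

   def goodput(output_token_throughput, attainment_rate):
       # Useful, SLO-compliant output tokens per second.
       return output_token_throughput * attainment_rate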
Benchmarks MUST specify the workload used for measurement, including the input and output length distributions, the request arrival process or load model, the degree of prefix sharing, and whether responses are streamed.¶
Workload characteristics affect all metrics. Results from different workloads are not directly comparable.¶
Systems require warm-up before reaching steady-state performance. Warm-up effects include just-in-time compilation, cache population, and memory allocation.¶
Testers MUST exclude warm-up from measurement or report it separately. Testers SHOULD document the warm-up procedure and duration.¶
Measurement windows MUST be long enough to capture steady-state behavior and sufficient samples for statistical reliability.¶
For percentile measurements, the sample size requirements given for percentile latency apply: at least 1000 samples for P99 and at least 10000 samples for P99.9.¶
Testers MUST report sample counts alongside percentiles.¶
Distributed systems require synchronized clocks for latency measurement. Clock skew introduces measurement error.¶
Testers SHOULD use NTP or PTP for clock synchronization and report the synchronization method and estimated accuracy.¶
Benchmarks MUST report the system configuration, including model identity and size, accelerator type and count, parallelism strategy, numeric precision, batching and scheduling policy, caching configuration, and software versions.¶
Results are specific to the reported configuration.¶
This section considers adversaries who submit requests designed to degrade service for other users or extract information about the system or other users' requests.¶
Performance benchmarking itself does not introduce security vulnerabilities. However, performance characteristics may be exploited by adversaries.¶
Shared infrastructure creates side-channel risks.¶
Timing channels: Request latency depends on queue depth, batch composition, and cache state influenced by other users. An adversary observing their own request latency may infer information about concurrent requests.¶
Cache channels: Prefix caching creates observable timing differences between cache hits and misses. An adversary may probe for cached prefixes to learn about other users' prompts.¶
Batch channels: Continuous batching causes ITL variation based on batch membership changes. An adversary may infer when other requests arrive or complete.¶
Mitigation strategies include request isolation, timing noise injection, and partitioned caching. These mitigations affect performance. Testers evaluating multi-tenant systems SHOULD measure side-channel leakage alongside performance.¶
Adversaries may craft requests to exhaust system resources:¶
Memory exhaustion: Requests with long outputs grow KV cache until memory is exhausted. Systems without output length limits or memory management are vulnerable.¶
Compute exhaustion: Long input sequences maximize prefill compute. Pathological inputs may trigger worst-case attention patterns.¶
Queue exhaustion: Bursts of requests exceed admission capacity. Without rate limiting, legitimate requests are delayed or rejected.¶
The metrics Sustainable Load (Section 4.3.6) and Admission Rate (Section 4.5.6) characterize resilience to resource exhaustion.¶
High-volume query access enables model extraction attacks where an adversary trains a copy of the model from input-output pairs. This document does not define rate-limiting terminology. Deployments concerned with model extraction SHOULD implement and monitor rate limits.¶
Systems may be optimized for benchmark workloads in ways that do not generalize to production traffic. Testers SHOULD use diverse workloads representative of intended deployment.¶
This appendix defines metrics relevant for production deployment decisions but not essential for basic performance characterization.¶
Energy per token enables efficiency comparison across systems and sustainability analysis. The value depends on model size, hardware, batch size, and workload.¶
Measurement requires power monitoring integrated with token counting. GPU power is accessible via vendor APIs (NVML for NVIDIA). Total system power requires external instrumentation.¶
Energy per token differs between prefill and decode phases. Testers SHOULD report phase-separated energy when feasible.¶
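A non-normative sketch of GPU-only measurement using the NVML power counter follows; run_workload is an assumed callable that drives the SUT and returns the output token count, and periodic sampling only approximates the energy integral.¶
   # Non-normative sketch: GPU energy per output token by sampling
   # NVML board power during a measurement run. Captures GPU power
   # only, not CPU, memory, or cooling.
   import threading
   import time
   import pynvml

   def energy_per_token(run_workload, sample_interval_s=0.1):
       pynvml.nvmlInit()
       handle = pynvml.nvmlDeviceGetHandleByIndex(0)
       samples_w, stop = [], threading.Event()

       def sampler():
           while not stop.is_set():
               # nvmlDeviceGetPowerUsage reports milliwatts.
               samples_w.append(
                   pynvml.nvmlDeviceGetPowerUsage(handle) / 1e3)
               time.sleep(sample_interval_s)

       thread = threading.Thread(target=sampler)
       thread.start()
       start = time.monotonic()
       output_tokens = run_workload()
       elapsed = time.monotonic() - start
       stop.set()
       thread.join()
       pynvml.nvmlShutdown()

       mean_power_w = sum(samples_w) / max(len(samples_w), 1)
       return (mean_power_w * elapsed) / output_tokens  # joules/token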
Power draw varies with load. Idle systems consume less power than systems under load. Prefill phases consume more power than decode phases due to higher compute utilization.¶
Testers MUST specify the measurement boundary: GPU only, GPU and CPU, or entire system including cooling.¶
Carbon intensity equals energy consumption multiplied by the grid's carbon factor (gCO2e/kWh). Grid carbon intensity varies by location and time.¶
Testers reporting carbon metrics MUST specify the grid carbon factor used.¶
Cost includes compute (hardware amortization or rental), energy, and operations. Testers MUST specify included cost components.¶
Cloud pricing provides a market cost reference. Self-hosted deployments require cost modeling.¶
GPU-hours measures resource consumption independent of hardware cost. For multi-GPU deployments, report aggregate GPU-hours.¶
GPU-hours for a request equals its end-to-end latency, expressed in hours, multiplied by the GPU count used for the request.¶
Compute utilization indicates how effectively the system uses accelerator compute resources. Low utilization suggests memory bandwidth or scheduling bottlenecks.¶
For GPUs, utilization metrics include SM occupancy and tensor core utilization. Testers MUST specify the utilization metric.¶
KV cache memory grows with batch size and sequence length. Memory exhaustion limits concurrent request capacity.¶
KV cache memory equals: batch_size * sequence_length * num_layers * 2 * hidden_dim * precision_bytes¶
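A non-normative worked example of the formula follows; the model dimensions are hypothetical, and the formula ignores grouped-query attention, which reduces the effective key-value width.¶
   # Non-normative worked example of the KV cache sizing formula.

   def kv_cache_bytes(batch_size, seq_len, num_layers, hidden_dim,
                      precision_bytes):
       # The factor of 2 accounts for storing both keys and values.
       return (batch_size * seq_len * num_layers * 2
               * hidden_dim * precision_bytes)

   # 32 sequences of 4096 tokens, 32 layers, hidden size 4096, FP16:
   # roughly 69 GB of key-value cache.
   kv_cache_bytes(32, 4096, 32, 4096, 2) / 1e9  # approx. 68.7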
Tensor parallelism partitions model layers across accelerators. Communication overhead reduces efficiency below ideal scaling.¶
Efficiency equals throughput with N GPUs divided by (N times single-GPU throughput).¶
Pipeline parallelism partitions model layers into sequential stages. Pipeline bubbles (idle time during fill and drain) reduce efficiency.¶
Bubble fraction equals (P-1)/(P-1+M) where P is pipeline depth and M is microbatch count.¶
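For example, with pipeline depth P = 4 and M = 16 microbatches, the bubble fraction is (4 - 1) / (4 - 1 + 16) = 3/19, or roughly 0.16 of pipeline-stage time spent idle.¶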
Common precision modes include FP32, FP16, BF16, FP8, INT8, and INT4. Lower precision reduces memory and increases throughput at potential accuracy cost.¶
Mixed precision uses different precisions for different tensors. Testers MUST specify precision for weights, activations, and KV cache separately if they differ.¶
Quantization may degrade accuracy. Impact is measured on task-specific benchmarks or perplexity.¶
Testers MUST specify the evaluation benchmark and baseline.¶
Cold start latency exceeds steady-state latency due to JIT compilation, cache population, and memory allocation.¶
Cold start affects user experience for scale-to-zero deployments and after restarts.¶
Prefill latency scales linearly with input length for standard attention. Decode latency scales with total sequence length due to growing KV cache access.¶
Efficient attention mechanisms (sparse, linear) may achieve sub-linear scaling.¶
This index groups metrics by relationship for navigation.¶
Latency metrics: End-to-End Latency, Time to First Token, Inter-Token Latency, Time Between Tokens, Time per Output Token, Normalized Latency, Prefill Latency, Decode Latency.¶
Throughput metrics: Output Token Throughput, Input Token Throughput, Request Throughput, Non-Padding Throughput, Offered Load, Sustainable Load.¶
Distribution metrics: Percentile Latency, Latency Distribution, Jitter, Maximum Pause.¶
Scheduling metrics: Head-of-Line Blocking, Queue Depth, Queue Wait Time, Fairness Index, Batch Utilization, Admission Rate, Preemption Rate, Starvation Rate.¶
Resource management metrics: Preemption Loss, Recovery Latency, KV Cache Swap Rate, Page Fault Latency.¶
Caching metrics: Prefix Cache Hit Rate, Cache Capacity, Cache Eviction Rate, TTFT Reduction from Caching.¶
Speculative decoding metrics: Acceptance Rate, Speculative Speedup, Draft Model Overhead, Verification Throughput.¶
RAG metrics: Embedding Latency, Retrieval Latency, Retrieval Recall, Prefill Overhead from Retrieved Context, Context Utilization.¶
Agentic metrics: Task Completion Latency, Sub-Request Count, Loop Rate, Tool Call Latency, Agentic Goodput.¶
Quality metrics: Policy Violation Rate, False Refusal Rate, Guardrail Overhead.¶
SLO metrics: SLO Attainment, Goodput.¶