| Internet-Draft | LLM Benchmarking Methodology | January 2026 |
| Gaikwad | Expires 24 July 2026 | [Page] |
This document defines benchmarking methodologies for Large Language Model (LLM) inference serving systems. It provides test procedures, setup parameters, measurement specifications, and reporting formats for evaluating latency, throughput, scheduling, and resource management characteristics. This document is a companion to "Benchmarking Terminology for Large Language Model Serving" and SHOULD be read together with that terminology document.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 24 July 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
This document provides benchmarking methodologies for Large Language Model inference serving systems. It defines test procedures, measurement specifications, and reporting formats that enable meaningful performance comparison.¶
A companion document, "Benchmarking Terminology for Large Language Model Serving" [LLM-TERMS], defines the metrics referenced in this methodology. The terminology document SHOULD be consulted before applying the procedures defined here.¶
LLM serving systems present unique benchmarking challenges:¶
These characteristics require methodology beyond traditional throughput and latency measurement. This document addresses these challenges by specifying:¶
This document does not specify acceptance thresholds or recommend particular systems. It provides methodology for fair comparison.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
An implementation is not compliant if it fails to satisfy one or more of the MUST requirements for a given test. An implementation that satisfies all the MUST and all the SHOULD requirements for a test is said to be "unconditionally compliant" for that test; one that satisfies all the MUST requirements but not all the SHOULD requirements is said to be "conditionally compliant."¶
This document covers benchmarking methodology for transformer-based autoregressive language models deployed as network services. The methodology applies to:¶
The following are out of scope:¶
The System Under Test (SUT) boundary MUST be declared before benchmarking. This document defines three standard configurations.¶
The Model Engine configuration measures raw inference capability.¶
+------------------+
| Load Generator |
+--------+---------+
|
Internal API (gRPC/HTTP)
|
+--------v---------+
| Model Engine |
| (SUT Boundary) |
+------------------+
Included components:¶
Excluded components:¶
This configuration is appropriate for comparing inference engines (vLLM, TensorRT-LLM, SGLang) independent of deployment stack.¶
The Application Gateway configuration measures user-observable API performance.¶
+------------------+
| Load Generator |
+--------+---------+
|
External API (HTTPS)
|
+--------v---------+
| Application GW |
| (SUT Boundary) |
| +------------+ |
| | Engine | |
| +------------+ |
+------------------+
Included components (in addition to Model Engine):¶
This configuration is appropriate for comparing API providers or evaluating production deployment performance.¶
The Compound System configuration measures end-to-end task completion for agentic or retrieval-augmented workloads.¶
+------------------+
| Task Driver |
+--------+---------+
|
+--------v---------+
| Compound System |
| (SUT Boundary) |
| +------------+ |
| | Retrieval | |
| +------------+ |
| +------------+ |
| | Tools | |
| +------------+ |
| +------------+ |
| | Gateway | |
| +------------+ |
+------------------+
Included components (in addition to Application Gateway):¶
This configuration is appropriate for evaluating RAG systems or agentic applications.¶
The load generator produces requests and measures responses. It MUST satisfy the following requirements.¶
The load generator MUST measure time with resolution of 1 millisecond or better. Microsecond resolution is RECOMMENDED for ITL measurement.¶
The load generator MUST support streaming response protocols (SSE, WebSocket, or gRPC streaming). It MUST record the arrival time of each token or chunk, not only the complete response.¶
The load generator MUST support open-loop load generation where request arrival times are determined by a specified distribution independent of response times. Poisson arrivals MUST be supported. Uniform and bursty arrival patterns are RECOMMENDED.¶
The load generator MUST support closed-loop load generation where a fixed number of concurrent requests are maintained. When a request completes, a new request is immediately submitted.¶
The load generator MUST NOT allow slow responses to delay the submission of subsequent requests in open-loop mode. Asynchronous or multi-threaded implementation is REQUIRED.¶
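As a non-normative illustration, the sketch below implements open-loop Poisson arrivals with asynchronous submission so that a slow response cannot delay later requests; the submit_request callable and the use of asyncio are illustrative assumptions, not requirements of this document.¶
import asyncio
import random
import time

async def open_loop_poisson(submit_request, rate_per_s, duration_s, seed=42):
    """Submit requests with Poisson arrivals without blocking on responses.

    submit_request: an async callable (assumption) that performs one
    request and records its own timestamps.  rate_per_s: mean arrival rate.
    """
    rng = random.Random(seed)
    tasks = []
    start = time.monotonic()
    next_arrival = start
    while next_arrival - start < duration_s:
        # Exponential inter-arrival times yield a Poisson arrival process.
        next_arrival += rng.expovariate(rate_per_s)
        await asyncio.sleep(max(0.0, next_arrival - time.monotonic()))
        # Fire-and-forget: a slow response never delays later submissions.
        tasks.append(asyncio.create_task(submit_request()))
    await asyncio.gather(*tasks, return_exceptions=True)
A closed-loop variant would instead keep a fixed pool of such tasks active and start a new request as each one completes.¶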
Workload specification is critical for reproducible benchmarking. This document defines reference workloads with fixed characteristics. Testers MAY use custom workloads but MUST fully specify them.¶
Each workload MUST specify:¶
Distribution type (fixed, uniform, normal, empirical), parameters (mean, std, min, max, or histogram), and unit (tokens using specified tokenizer).¶
Distribution type (fixed, uniform, normal, empirical), parameters (mean, std, min, max, or histogram), control method (max_tokens parameter, stop sequence, or both), and unit (tokens using specified tokenizer).¶
Domain (general, code, conversation, instruction), language (English, multilingual, code languages), and system prompt presence and typical length.¶
Fraction of requests sharing common prefix and shared prefix length distribution.¶
This document defines five standard workloads. Full specifications appear in Appendix A.¶
Purpose: Baseline comparison with controlled variability¶
This workload isolates inference performance from content effects. It is REQUIRED for Model Engine benchmarking.¶
Purpose: Test behavior under realistic length variation¶
This workload tests scheduling fairness with high length variance.¶
Purpose: Simulate interactive chat workloads¶
This workload is RECOMMENDED for Application Gateway benchmarking.¶
Purpose: Simulate coding assistant workloads¶
This workload tests prefix caching effectiveness.¶
For reproducible benchmarking:¶
Token counts depend on the tokenizer. Different tokenizers produce different counts for identical text, making cross-system comparison challenging.¶
The test report MUST specify:¶
For cross-system comparison where systems use different tokenizers:¶
The test report MUST declare which option is used. Option B with cl100k_base (GPT-4 tokenizer) as reference is RECOMMENDED for cross-system comparison.¶
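As a non-normative illustration, normalized token counting with a reference tokenizer might look like the following sketch, which assumes the tiktoken library; any implementation of the cl100k_base encoding can serve the same purpose.¶
import tiktoken

# Reference tokenizer for normalized cross-system token counts
# (assumption: cl100k_base, per the recommendation above).
_REF_ENC = tiktoken.get_encoding("cl100k_base")

def normalized_token_count(text: str) -> int:
    """Count tokens in `text` with the reference tokenizer rather than the
    SUT's native tokenizer, so throughput figures are comparable."""
    return len(_REF_ENC.encode(text))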
LLM serving systems require warm-up before reaching steady-state performance. Warm-up effects include JIT compilation, memory allocator initialization, prefix cache population, and batch size ramp-up.¶
Before measurement begins, testers MUST:¶
Testers SHOULD verify warm-up completion by:¶
When cold start performance is being measured (Model Load Time, Cold Start Latency), warm-up MUST be skipped. The test report MUST clearly indicate cold start measurement.¶
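As a non-normative aid, warm-up completion can be checked by watching a sliding window of TTFT samples until it stabilizes; the window size and 5% tolerance below are illustrative assumptions.¶
import statistics

def warmup_complete(ttft_samples_ms, window=50, tolerance=0.05):
    """Return True once the median TTFT of the two most recent windows of
    size `window` differs by less than `tolerance` (relative change),
    suggesting the SUT has reached steady state."""
    if len(ttft_samples_ms) < 2 * window:
        return False
    prev = statistics.median(ttft_samples_ms[-2 * window:-window])
    curr = statistics.median(ttft_samples_ms[-window:])
    return abs(curr - prev) / prev < tolerance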
LLM serving systems deliver tokens via streaming protocols. The choice of protocol affects timing measurement.¶
This methodology supports:¶
Streaming protocols may deliver multiple tokens per chunk due to batching or network buffering. The test report MUST specify:¶
When chunks contain multiple tokens:¶
The test report MUST declare which option is used. Option C is RECOMMENDED when available.¶
Accurate timing requires synchronized clocks between load generator and SUT, and between distributed SUT components.¶
When load generator and SUT run on the same machine, clock synchronization is inherent. This configuration is RECOMMENDED for Model Engine testing.¶
When load generator and SUT are on different machines:¶
For Application Gateway testing where network latency is significant:¶
All timestamps MUST be recorded in a format with at least millisecond precision. ISO 8601 with milliseconds (YYYY-MM-DDTHH:MM:SS.sssZ) or Unix epoch with milliseconds is RECOMMENDED.¶
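A minimal, non-normative illustration of both recommended timestamp formats, using the Python standard library:¶
from datetime import datetime, timezone
import time

now = datetime.now(timezone.utc)
iso_ms = now.isoformat(timespec="milliseconds").replace("+00:00", "Z")
epoch_ms = int(time.time() * 1000)   # Unix epoch with millisecond precision
print(iso_ms)    # e.g. 2026-01-15T12:34:56.789Z
print(epoch_ms)  # e.g. 1768480496789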
Production LLM deployments include safety systems that affect performance. Benchmarking MUST account for these systems.¶
The test report MUST disclose:¶
For Application Gateway benchmarking intended to represent production performance:¶
This section defines benchmarking tests. Each test includes: objective, setup parameters, procedure, measurements, and reporting format.¶
To determine the latency from request submission to first token receipt under varying load conditions. TTFT measures perceived responsiveness for interactive applications.¶
The following parameters MUST be defined:¶
For each request:¶
The first token is defined as the first content token received, excluding:¶
If the system emits non-content tokens before content, the test report MUST note this and specify whether TTFT measures time to any token or time to first content token.¶
For P99 estimates within 10% relative error at 95% confidence, at least 1000 samples are required; for P99.9, at least 10000 samples are required. The test report MUST state the sample count.¶
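Where sample adequacy is in doubt, a simple non-normative bootstrap check on the empirical P99 can be applied; the sketch below assumes NumPy and reports the 95% bootstrap interval so the tester can verify the relative half-width.¶
import numpy as np

def p99_bootstrap_interval(latencies_ms, n_boot=2000, seed=42):
    """Return (p99, lower, upper): the empirical P99 and its 95%
    bootstrap confidence interval over `latencies_ms`."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(latencies_ms, dtype=float)
    boot = [np.percentile(rng.choice(samples, size=samples.size, replace=True), 99)
            for _ in range(n_boot)]
    return (np.percentile(samples, 99),
            np.percentile(boot, 2.5),
            np.percentile(boot, 97.5))
If the interval half-width exceeds 10% of the point estimate, more samples are needed.¶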
The test report MUST include:¶
The results SHOULD be reported in tabular format:¶
| Metric | Value |
|---|---|
| Requests | 10000 |
| TTFT P50 | 127 ms |
| TTFT P90 | 245 ms |
| TTFT P95 | 312 ms |
| TTFT P99 | 524 ms |
| TTFT P99.9 | 891 ms |
| TTFT Mean | 156 ms |
| TTFT Min | 89 ms |
| TTFT Max | 1243 ms |
If applicable:¶
| Input Tokens | P50 (ms) | P95 (ms) | P99 (ms) |
|---|---|---|---|
| 0-256 | 95 | 198 | 312 |
| 256-512 | 142 | 287 | 445 |
| 512-1024 | 198 | 412 | 623 |
| 1024-2048 | 312 | 587 | 891 |
| 2048+ | 523 | 912 | 1243 |
Testers SHOULD include a histogram or CDF plot of the TTFT distribution.¶
To determine the maximum rate at which the SUT can generate output tokens while maintaining acceptable latency. This test measures system capacity under load.¶
The following parameters MUST be defined:¶
For open-loop:¶
For closed-loop:¶
When latency SLOs are specified, throughput is measured as the maximum rate at which all of the SLOs are met.¶
This test employs an iterative search to find maximum throughput.¶
For each load level (arrival rate or concurrency):¶
Use binary search to find maximum throughput:¶
System saturation is detected when:¶
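The sketch below outlines one non-normative realization of this search; the SLO threshold, success-rate criterion, and run_load_level() helper are illustrative assumptions.¶
def find_max_throughput(run_load_level, low_rps, high_rps,
                        ttft_p99_slo_ms=500.0, tolerance_rps=0.5):
    """Binary-search the highest request rate whose measured TTFT P99
    stays within the SLO.  run_load_level(rate) is assumed to execute one
    measurement interval and return a dict of aggregate results."""
    best = None
    while high_rps - low_rps > tolerance_rps:
        mid = (low_rps + high_rps) / 2.0
        result = run_load_level(mid)
        if (result["ttft_p99_ms"] <= ttft_p99_slo_ms
                and result["success_rate"] >= 0.999):
            best, low_rps = result, mid     # SLOs met: try a higher rate
        else:
            high_rps = mid                  # SLOs violated: back off
    return best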
| Metric | Value |
|---|---|
| Max Output Throughput | 2847 tok/s |
| Max Request Throughput | 18.2 req/s |
| Max Input Throughput | 5123 tok/s |
| Sustainable Load | 20 req/s |
| Tokens per GPU-second | 356 tok/s/GPU |
| Metric | P50 | P95 | P99 |
|---|---|---|---|
| TTFT | 312 ms | 687 ms | 1124 ms |
| TPOT | 42 ms | 78 ms | 134 ms |
| End-to-End | 6.2 s | 11.4 s | 18.7 s |
To characterize the relationship between throughput and latency across the operating range of the SUT. This test produces a throughput-latency curve that reveals system behavior more completely than individual point measurements.¶
For each load level, record:¶
Derived metrics:¶
| Offered (r/s) | Achieved (tok/s) | TTFT P50 | TTFT P99 | TPOT P50 | TPOT P99 | Success |
|---|---|---|---|---|---|---|
| 2 | 284 | 95 | 142 | 32 | 41 | 100% |
| 6 | 852 | 102 | 178 | 34 | 48 | 100% |
| 10 | 1420 | 128 | 267 | 38 | 62 | 100% |
| 14 | 1988 | 198 | 512 | 48 | 98 | 100% |
| 18 | 2534 | 378 | 1234 | 72 | 198 | 99.8% |
| 22 | 2712 | 823 | 3456 | 142 | 523 | 94.1% |
Knee point: 14 req/s (TTFT P99 exceeds 2x minimum)¶
Saturation point: 22 req/s (throughput peaks)¶
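A non-normative sketch of the knee-point rule illustrated above (the first load level at which TTFT P99 exceeds twice the minimum TTFT P99 observed on the curve); the data layout is an assumption.¶
def find_knee_point(curve):
    """`curve` is a list of (offered_rps, ttft_p99_ms) pairs in increasing
    load order.  Return the first offered rate whose TTFT P99 exceeds 2x
    the minimum TTFT P99 on the curve, or None if no knee is found."""
    floor = min(p99 for _, p99 in curve)
    for rate, p99 in curve:
        if p99 > 2.0 * floor:
            return rate
    return None
Applied to the example table above, the minimum TTFT P99 is 142 ms, so the knee is the first level exceeding 284 ms, i.e. 14 req/s.¶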
To characterize the variability of token delivery during the decode phase. ITL distribution determines streaming smoothness experienced by users.¶
For each load level:¶
The interval between request submission and first token (TTFT) MUST NOT be included in ITL calculation.¶
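A minimal, non-normative sketch of ITL extraction from per-token arrival timestamps, consistent with the exclusion above:¶
def itl_samples(token_arrival_ts):
    """Given monotonic arrival timestamps (seconds) for each output token
    of one request, return inter-token latencies in milliseconds.  The
    request-submission-to-first-token interval (TTFT) is excluded because
    only gaps between consecutive output tokens are taken."""
    return [(b - a) * 1000.0
            for a, b in zip(token_arrival_ts, token_arrival_ts[1:])]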
| Metric | Value |
|---|---|
| ITL Samples | 15234 |
| ITL P50 | 38 ms |
| ITL P90 | 52 ms |
| ITL P95 | 67 ms |
| ITL P99 | 124 ms |
| ITL P99.9 | 312 ms |
| ITL Mean | 42 ms |
| ITL Std Dev | 28 ms |
| P99/P50 Ratio | 3.26 |
To determine the maximum number of concurrent requests the SUT can maintain while meeting latency objectives. This test measures memory capacity and scheduling limits.¶
Request completion rate >= 99%, TTFT P99 <= specified threshold, and no out-of-memory errors.¶
This test employs binary search to find maximum concurrent capacity.¶
For each concurrency level:¶
Binary search:¶
| Concurrency | Completion | TTFT P99 | TPOT P99 | Errors | Status |
|---|---|---|---|---|---|
| 8 | 100% | 142 ms | 38 ms | 0 | Pass |
| 16 | 100% | 178 ms | 42 ms | 0 | Pass |
| 32 | 100% | 267 ms | 52 ms | 0 | Pass |
| 64 | 99.7% | 523 ms | 78 ms | 0 | Pass |
| 128 | 97.2% | 1234 ms | 156 ms | 3 | Fail |
Maximum concurrent requests meeting criteria: 64¶
To evaluate how equitably the SUT allocates resources across concurrent requests with different characteristics. This test reveals head-of-line blocking, starvation, and priority effects.¶
Define two or more request classes:¶
| Class | Count | TTFT P50 | TTFT P99 | TPOT P50 | TPOT P99 |
|---|---|---|---|---|---|
| Short | 4012 | 89 ms | 234 ms | 35 ms | 67 ms |
| Long | 988 | 312 ms | 1234 ms | 42 ms | 89 ms |
| Metric | Value |
|---|---|
| Jain's Fairness Index | 0.87 |
| Short Class Starvation | 0.3% |
| Long Class Starvation | 2.1% |
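For reference, Jain's fairness index over n values x_i is J = (sum x_i)^2 / (n * sum x_i^2), which equals 1.0 when all values are equal and approaches 1/n when a single value dominates. A non-normative sketch follows; the choice of per-request normalized throughput as the input is an assumption left to the tester.¶
def jains_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).  `values` might
    be, e.g., per-request normalized token throughput (the metric choice
    is up to the tester and MUST be reported)."""
    n = len(values)
    total = sum(values)
    return (total * total) / (n * sum(v * v for v in values)) if n else 0.0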
To evaluate the performance benefit of prefix caching under workloads with shared prefixes. This test quantifies TTFT reduction from cache hits.¶
| Configuration | TTFT P50 | TTFT P95 | TTFT P99 |
|---|---|---|---|
| Cache Disabled | 312 ms | 423 ms | 534 ms |
| Cache (Cold) | 134 ms | 198 ms | 267 ms |
| Cache (Warm) | 98 ms | 156 ms | 212 ms |
To characterize SUT behavior when memory resources are constrained, including preemption, swapping, and degradation patterns.¶
For each oversubscription level:¶
| Oversub Level | Complete | Preempt | Fail Rate | TTFT P99 |
|---|---|---|---|---|
| 100% (base) | 99.7% | 0% | 0.3% | 523 ms |
| 110% | 98.2% | 5.2% | 1.8% | 789 ms |
| 125% | 94.5% | 18.7% | 5.5% | 1456 ms |
| 150% | 82.3% | 42.1% | 17.7% | 3234 ms |
| Context (tokens) | TTFT Mean | TTFT P95 | ms/1K tokens |
|---|---|---|---|
| 1024 | 89 ms | 112 ms | 76 |
| 4096 | 289 ms | 367 ms | 63 |
| 16384 | 1023 ms | 1287 ms | 59 |
| 65536 | 4234 ms | 5123 ms | 62 |
| 131072 | 9123 ms | 11234 ms | 68 |
Best fit: Linear (R^2 = 0.9987), ~68 microseconds per input token¶
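A non-normative sketch of the linear fit, assuming NumPy; it reports the per-token slope in microseconds and the coefficient of determination (R^2).¶
import numpy as np

def fit_ttft_vs_context(context_tokens, ttft_ms):
    """Least-squares linear fit TTFT = a + b * tokens.  Returns the slope
    in microseconds per input token and the R^2 of the fit."""
    x = np.asarray(context_tokens, dtype=float)
    y = np.asarray(ttft_ms, dtype=float)
    slope_ms, intercept_ms = np.polyfit(x, y, 1)
    predicted = slope_ms * x + intercept_ms
    ss_res = float(np.sum((y - predicted) ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return slope_ms * 1000.0, 1.0 - ss_res / ss_tot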
| Configuration | TTFT P50 | TTFT P99 | E2E P50 | E2E P99 |
|---|---|---|---|---|
| Baseline | 98 ms | 234 ms | 4.2 s | 8.7 s |
| Input Filter | 112 ms | 267 ms | 4.3 s | 8.9 s |
| Output Filter | 101 ms | 242 ms | 4.8 s | 9.8 s |
| Full Filter | 118 ms | 289 ms | 5.0 s | 10.2 s |
| Configuration | Max Throughput | Reduction |
|---|---|---|
| Baseline | 2867 tok/s | - |
| Input Filter | 2756 tok/s | -3.9% |
| Output Filter | 2412 tok/s | -15.9% |
| Full Filter | 2289 tok/s | -20.2% |
When comparing multiple SUTs:¶
Testers MUST ensure:¶
When hardware differs:¶
For comparative claims:¶
Before publishing comparative results, verify:¶
Benchmarking methodology intersects with security in several ways.¶
Benchmark results may reveal:¶
Operators SHOULD consider whether to publish detailed capacity information publicly.¶
Systems may be optimized specifically for benchmark workloads in ways that do not generalize:¶
Testers SHOULD vary workloads and verify results with production traffic samples.¶
This methodology uses benign workloads. Adversarial inputs (jailbreak attempts, prompt injections) may have different performance characteristics due to guardrail processing.¶
Testing with adversarial workloads requires additional ethical and safety considerations not covered here.¶
Memory pressure tests (Section 5.8) intentionally push systems beyond capacity. Testers SHOULD:¶
This appendix provides complete specifications for standard workloads.¶
Purpose: Controlled baseline with minimal variance¶
Python pseudocode:¶
import random

def generate_synthetic_uniform(n_requests, seed=42):
    """Generate the synthetic uniform baseline workload: deterministic
    seed, uniformly distributed input/output lengths, greedy decoding."""
    rng = random.Random(seed)
    requests = []
    for _ in range(n_requests):
        input_len = rng.randint(128, 512)    # input length, tokens
        output_len = rng.randint(64, 256)    # output budget, tokens
        # Random token IDs in [0, 100255] (a cl100k_base-sized vocabulary).
        input_tokens = [rng.randint(0, 100255)
                        for _ in range(input_len)]
        requests.append({
            'input_tokens': input_tokens,
            'max_tokens': output_len,
            'temperature': 0.0
        })
    return requests
Purpose: Test scheduling with high length variance¶
Purpose: Realistic interactive chat patterns¶
Purpose: Test prefix caching with code context¶
Purpose: Test long-context handling¶
This appendix provides detailed guidance for timing measurements.¶
Client-side TTFT:¶
T_first is when the complete "data:" line is received and parsed, not when the first byte of the response arrives.¶
SSE delivery may batch multiple tokens per event due to server-side batching, TCP buffering, or client-side buffering.¶
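A non-normative client sketch that records T_first when the first complete "data:" line is parsed (not on the first response byte), using a monotonic clock; the endpoint, payload schema, end-of-stream sentinel, and the requests library are illustrative assumptions.¶
import time
import requests  # assumption: any HTTP client with streaming support works

def measure_ttft_sse(url, payload):
    """Return (ttft_ms, token_arrival_times) for one streaming request.
    T_first is taken when the first complete 'data:' line is parsed."""
    t_submit = time.monotonic()
    arrivals = []
    with requests.post(url, json=payload, stream=True) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data:"):
                continue
            body = line[len("data:"):].strip()
            if body == "[DONE]":   # common SSE end sentinel (assumption)
                break
            arrivals.append(time.monotonic())
            # Parse `body` as JSON here if per-chunk token counts are needed.
    ttft_ms = (arrivals[0] - t_submit) * 1000.0 if arrivals else None
    return ttft_ms, arrivals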
For sub-millisecond accuracy:¶
For quick comparisons, include at minimum:¶
=== LLM Benchmark Report (Minimum) ===

System Identification:
- Model: [model name and version]
- Hardware: [GPU type] x [count]
- Software: [inference engine and version]
- SUT Boundary: [Model Engine | Gateway | Compound]

Test Configuration:
- Workload: [workload name]
- Load Model: [open-loop rate | closed-loop concurrency]
- Request Count: [N]
- Test Duration: [seconds]

Key Results:
- TTFT P50: [value] ms
- TTFT P99: [value] ms
- TPOT P50: [value] ms
- TPOT P99: [value] ms
- Max Throughput: [value] tok/s
- Throughput at P99 TTFT < 500ms: [value] tok/s

Notes:
- [Any deviations from methodology]
- [Guardrail configuration]

=== End Report ===¶
A complete benchmark report should include the following sections:¶
This document draws on the structure and approach established by RFC 3511 for firewall benchmarking methodology. The author thanks the Benchmarking Methodology Working Group for their foundational work in network device benchmarking.¶