| Internet-Draft | LLM Benchmarking Profiles | January 2026 |
| Gaikwad | Expires 24 July 2026 | [Page] |
This document defines performance benchmarking profiles for Large Language Model (LLM) serving systems. Profiles bind the terminology defined in draft-gaikwad-llm-benchmarking-terminology and the procedures described in draft-gaikwad-llm-benchmarking-methodology to concrete architectural roles and workload patterns. Each profile clarifies the System Under Test (SUT) boundary, measurement points, and interpretation constraints required for reproducible and comparable benchmarking.¶
This document specifies profiles only. It does not define new metrics, benchmark workloads, or acceptance thresholds.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 24 July 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
LLM serving systems are rarely monolithic. Production deployments typically compose multiple infrastructural intermediaries before a request reaches a Model Engine. A request may pass through an API gateway for authentication, an AI firewall for prompt inspection, a load balancer for routing, and finally arrive at an inference engine. Each component adds latency and affects throughput.¶
Performance metrics such as Time to First Token (TTFT) or throughput are boundary dependent. A TTFT measurement taken at the client includes network latency, gateway processing, firewall inspection, queue wait time, and prefill computation. The same measurement taken at the engine boundary includes only queue wait and prefill. Without explicit boundary declaration, reported results cannot be compared.¶
This document addresses this ambiguity by defining benchmarking profiles: standardized descriptions of SUT boundaries and their associated performance interpretation rules. Section 4 defines four infrastructure profiles that specify what component is being measured. Section 5 defines workload profiles that specify how that component is tested. Section 6 then shows how to attribute latency across composed systems using delta measurement.¶
Client
|
v
+------------------+
| AI Gateway |<-- Auth, routing, caching
+--------+---------+
|
v
+------------------+
| AI Firewall |<-- Prompt inspection
+--------+---------+
|
v
+------------------+
| Model Engine |<-- Inference
+--------+---------+
|
v
+------------------+
| AI Firewall |<-- Output inspection
+--------+---------+
|
v
+------------------+
| AI Gateway |<-- Response normalization
+--------+---------+
|
v
Client
Each layer adds latency. Benchmarks must declare which layers are included.
This document uses metrics defined in [I-D.gaikwad-llm-benchmarking-terminology]. The following table maps profile-specific terms to their normative definitions.¶
| Term Used in Profiles | Terminology Draft Reference |
|---|---|
| TTFT | Time to First Token |
| ITL | Inter-Token Latency |
| TPOT | Time per Output Token |
| Queue Residence Time | Queue Wait Time |
| FRR | False Refusal Rate |
| Guardrail Overhead | Guardrail Processing Overhead |
| Task Completion Latency | Task Completion Latency |
| Goodput | Goodput |
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Profiles divide into two categories that serve orthogonal purposes. Conflating them produces misleading benchmarks.¶
Infrastructure Profiles define what is being tested. They specify the SUT boundary: where measurements start and end, what components are included, and what is excluded.¶
| Profile | SUT Boundary | Primary Question Answered |
|---|---|---|
| Model Engine | Inference runtime only | How fast can this engine generate tokens? |
| AI Gateway | API intermediary layer | What overhead does the gateway add? |
| AI Firewall | Security inspection layer | What latency and accuracy does inspection cost? |
| Compound System | End-to-end orchestration | How long does it take to complete a task? |
The choice of infrastructure profile determines which metrics are meaningful. Measuring "AI Firewall throughput" in tokens per second conflates firewall performance with downstream engine performance. The firewall does not generate tokens; it inspects them. Appropriate firewall metrics include inspection latency, detection rate, and false positive rate.¶
Workload Profiles define how the SUT is tested. They specify traffic patterns, request characteristics, and arrival models. Workload profiles are independent of infrastructure profiles.¶
| Profile | Traffic Pattern | Applicable To |
|---|---|---|
| Chatbot Workload | Multi-turn, streaming, human-paced | Engine, Gateway, Firewall, Compound |
| Compound Workflow | Multi-step, tool-using, machine-paced | Compound System primarily |
A Chatbot Workload can be applied to a Model Engine (measuring raw inference speed), an AI Gateway (measuring gateway overhead under conversational traffic), or a Compound System (measuring end-to-end chat latency including retrieval). The infrastructure profile determines the measurement boundary; the workload profile determines the traffic shape.¶
Conflating infrastructure and workload profiles produces non-comparable results. "Chatbot benchmark on Gateway A" versus "Chatbot benchmark on Engine B" compares different things. The former includes gateway overhead; the latter does not. Valid comparison requires the same infrastructure profile exercised under the same workload profile.¶
Cross-profile comparisons require explicit delta decomposition (Section 6).¶
| If you want to measure... | Use Infrastructure Profile | Apply Workload Profile |
|---|---|---|
| Raw model inference speed | Model Engine | Chatbot or synthetic |
| Gateway routing overhead | AI Gateway | Match production traffic |
| Security inspection cost | AI Firewall | Mixed benign/adversarial |
| End-to-end agent latency | Compound System | Compound Workflow |
| Full-stack production performance | Composite (see Section 7) | Match production traffic |
A Model Engine is the runtime responsible for executing LLM inference. Specifying the benchmark boundary requires an understanding of three core operations:¶
Prefill (also called prompt processing): The engine processes all input tokens in parallel to build initial hidden states. Prefill is compute-bound and benefits from parallelism. Prefill latency scales with input length but can be reduced by adding more compute.¶
Decode (also called autoregressive generation): The engine generates output tokens one at a time, each depending on all previous tokens. Decode is memory-bandwidth-bound because each token requires reading the full model weights. Decode latency per token is relatively constant regardless of batch size, but throughput increases with batching.¶
KV Cache: To avoid recomputing attention over previous tokens, the engine stores key-value pairs from prior tokens. The KV cache grows with sequence length and consumes GPU memory. Cache management (allocation, eviction, swapping to CPU) directly affects how many concurrent sequences the engine can handle.¶
These three operations determine the fundamental performance characteristics: prefill drives TTFT, decode drives ITL and token throughput, and KV cache capacity limits how many sequences can be served concurrently.¶
Included in SUT:¶
Excluded from SUT:¶
Model Engines exist in several architectural configurations that affect measurement interpretation.¶
Prefill and decode execute on the same hardware. This is the simplest configuration and the most common in single-GPU deployments.¶
MONOLITHIC ENGINE
+------------------------------------------+
| GPU / Accelerator |
| |
| Request -> [Prefill] -> [Decode] -> Out |
| | | |
| KV Cache <------+ |
| |
+------------------------------------------+
Timeline for single request:
|---- Queue ----|---- Prefill ----|---- Decode (N tokens) ----|
t6 t6a -> t7 t7 -> t8
Timestamp mapping:¶
| Symbol | Event |
|---|---|
| t6 | Request enters engine queue |
| t6a | Prefill computation begins (batch slot acquired) |
| t7 | First output token generated |
| t8 | Last output token generated |
Derived metrics:¶
Queue residence  = t6a - t6
Prefill latency  = t7  - t6a
Engine TTFT      = t7  - t6
Generation time  = t8  - t7¶
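As an informative illustration, the following Python sketch derives these metrics from the four engine-boundary timestamps; the timestamp values are hypothetical.¶
# Minimal sketch: monolithic-engine metrics from the four boundary
# timestamps (seconds; values are hypothetical).
t6, t6a, t7, t8 = 0.000, 0.040, 0.210, 1.450

queue_residence = t6a - t6
prefill_latency = t7 - t6a
engine_ttft     = t7 - t6
generation_time = t8 - t7

print(f"queue={queue_residence*1e3:.0f}ms  prefill={prefill_latency*1e3:.0f}ms  "
      f"engine_ttft={engine_ttft*1e3:.0f}ms  decode={generation_time*1e3:.0f}ms")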
Prefill and decode execute on separate hardware pools. Prefill nodes are optimized for compute throughput; decode nodes are optimized for memory bandwidth. After prefill completes, the KV cache must transfer across the network to the decode pool.¶
This architecture appears in published systems including DistServe [DISTSERVE] and Mooncake [MOONCAKE], and in open-source projects such as llm-d.¶
DISAGGREGATED SERVING
+------------------+ +------------------+
| PREFILL POOL | | DECODE POOL |
| | | |
| High compute | KV Cache | High memory BW |
| utilization | Transfer | utilization |
| | ================> | |
| +-----------+ | Network link | +-----------+ |
| | GPU 0 | | (RDMA or TCP) | | GPU 0 | |
| +-----------+ | | +-----------+ |
| | GPU 1 | | Bottleneck at | | GPU 1 | |
| +-----------+ | high context | +-----------+ |
| | ... | | lengths | | ... | |
| +-----------+ | | +-----------+ |
+------------------+ +------------------+
Timeline:
|-- Queue --|-- Prefill --|-- KV Transfer --|-- Decode --|
t6 t6a t7a t7 -> t8
The KV transfer phase (t7a to t7) does not exist in monolithic deployments. This phase can become the bottleneck for long contexts.¶
KV Transfer Constraint:¶
Transfer time depends on context length and network bandwidth:¶
KV_transfer_time = (context_length * kv_bytes_per_token) / effective_bandwidth¶
Where context_length is the number of prompt tokens, kv_bytes_per_token is determined by the model architecture and KV precision, and effective_bandwidth is the achieved (not theoretical) inter-pool transfer rate.¶
Bandwidth Saturation Threshold: The context length at which KV transfer time exceeds prefill compute time. Beyond this threshold, adding more prefill compute does not reduce TTFT.¶
Configuration:
  Model: 70B parameters
  KV cache: 80 layers, 128 heads, 128 dim, BF16
  KV bytes per token: 2 * 80 * 128 * 128 * 2 = 5.24 MB
  Inter-pool bandwidth: 400 Gbps = 50 GB/s effective

At 4K context:  KV transfer = 4096 * 5.24 MB / 50 GB/s = 430 ms
At 32K context: KV transfer = 32768 * 5.24 MB / 50 GB/s = 3.44 s

If prefill compute takes 500 ms for 32K tokens:
  Bottleneck shifts to KV transfer at ~4.8K tokens
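An informative Python sketch of this calculation follows; the model dimensions and effective bandwidth are the example values above and would be replaced by the SUT's actual configuration.¶
# Sketch of the KV-transfer estimate above. The 70B-class dimensions and
# 50 GB/s effective bandwidth are the example values; adjust for the SUT.
LAYERS, HEADS, HEAD_DIM, BYTES_PER_ELEM = 80, 128, 128, 2   # BF16
KV_BYTES_PER_TOKEN = 2 * LAYERS * HEADS * HEAD_DIM * BYTES_PER_ELEM  # K and V
EFFECTIVE_BW = 50e9          # bytes/second (~400 Gbps after overhead)

def kv_transfer_time(context_len: int) -> float:
    """Seconds to move one request's KV cache between pools."""
    return context_len * KV_BYTES_PER_TOKEN / EFFECTIVE_BW

def saturation_threshold(prefill_time_s: float) -> int:
    """Context length beyond which KV transfer exceeds prefill compute."""
    return int(prefill_time_s * EFFECTIVE_BW / KV_BYTES_PER_TOKEN)

print(f"4K context : {kv_transfer_time(4096):.2f} s")      # ~0.43 s
print(f"32K context: {kv_transfer_time(32768):.2f} s")     # ~3.44 s
print(f"threshold at 500 ms prefill: ~{saturation_threshold(0.5)} tokens")  # ~4.8K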
Testers benchmarking disaggregated architectures MUST report:¶
| Parameter | Description |
|---|---|
| Pool configuration | Number and type of prefill vs decode accelerators |
| KV transfer mechanism | RDMA, TCP, or other; theoretical bandwidth |
| KV bytes per token | Calculated from model architecture |
| Observed transfer latency | Measured, not calculated |
| Bandwidth saturation threshold | Context length where transfer becomes bottleneck |
| TTFT boundary | Whether reported TTFT includes KV transfer |
Results from disaggregated and monolithic deployments MUST NOT be directly compared without explicit architectural notation.¶
MONOLITHIC DISAGGREGATED
+----------------------+ +-----------+ +-----------+
| Single Pool | | Prefill | | Decode |
| | | Pool | | Pool |
| Prefill --> Decode | | | | |
| | | | | +-------+ | | +-------+ |
| +> KV Cache <---+ | | GPU 0 | | | | GPU 0 | |
| | | +-------+ | | +-------+ |
| Same memory space | | +-------+ | | +-------+ |
| No transfer needed | | | GPU 1 |======>| GPU 1 | |
| | | +-------+ | | +-------+ |
+----------------------+ | KV transfer | |
+-----------+ +-----------+
TTFT = Queue + Prefill TTFT = Queue + Prefill + KV_Transfer
Best for: Best for:
- Smaller models - Large models (70B+)
- Lower latency - Higher throughput
- Simpler deployment - Independent scaling
Model sharded across multiple accelerators using tensor parallelism (TP), pipeline parallelism (PP), or expert parallelism (EP for mixture-of-experts models).¶
Testers MUST report:¶
Testers MUST disclose:¶
| Configuration | Example Values | Why It Matters |
|---|---|---|
| Model precision | FP16, BF16, INT8, FP8 | Affects throughput, memory, and quality |
| Quantization method | GPTQ, AWQ, SmoothQuant | Different speed/quality tradeoffs |
| Batch strategy | Static, continuous, chunked prefill | Affects latency distribution |
| Max batch size | 64 requests | Limits concurrency |
| Max sequence length | 8192 tokens | Limits context window |
| KV cache memory | 24 GB | Limits concurrent sequences |
Speculative decoding uses a smaller draft model to propose multiple tokens, then verifies them in parallel with the target model. When draft tokens are accepted, generation is faster. When rejected, compute is wasted.¶
If speculative decoding is enabled, testers MUST report:¶
| Parameter | Description |
|---|---|
| Draft model | Identifier and parameter count |
| Speculation window (k) | Tokens proposed per verification step |
| Acceptance rate | Fraction of draft tokens accepted |
| Verification overhead | Latency when draft tokens are rejected |
Acceptance rate directly affects efficiency:¶
High acceptance (80%), k=5:
  Expected accepted per step = 4 tokens
  Verification passes per output token = 0.25

Low acceptance (30%), k=5:
  Expected accepted per step = 1.5 tokens
  Verification passes per output token = 0.67

Result: 2.7x more verification overhead at low acceptance¶
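The arithmetic above can be reproduced with the following informative sketch, which assumes the simplified model that expected accepted tokens per step equal k times the acceptance rate.¶
# Sketch of the acceptance-rate arithmetic above, assuming the simple
# model E[accepted per step] = k * acceptance_rate.
def verification_passes_per_token(k: int, acceptance_rate: float) -> float:
    expected_accepted = k * acceptance_rate
    return 1.0 / expected_accepted

high = verification_passes_per_token(k=5, acceptance_rate=0.8)   # 0.25
low  = verification_passes_per_token(k=5, acceptance_rate=0.3)   # ~0.67
print(f"high acceptance: {high:.2f} passes/token")
print(f"low acceptance : {low:.2f} passes/token ({low/high:.1f}x more overhead)")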
Results with speculative decoding MUST be labeled separately and include observed acceptance rate.¶
Chunked prefill splits long prompts into smaller pieces, processing each chunk and potentially interleaving with decode iterations from other requests. This reduces head-of-line blocking but increases total prefill time for the chunked request.¶
If chunked prefill is enabled, testers MUST report:¶
Request rate saturation differs from token saturation. A system might handle 2000 output tokens per second but only 50 requests per second if scheduling overhead dominates. Testers SHOULD measure both dimensions.¶
Mixed-length workloads increase tail latency under continuous batching. Short requests arriving behind long prefills experience head-of-line blocking. When the workload has high length variance, testers SHOULD measure fairness: the ratio of actual latency to expected latency based on request size.¶
An AI Gateway is a network-facing intermediary that virtualizes access to one or more Model Engines. Gateways handle cross-cutting concerns that do not belong in the inference engine itself.¶
Gateways perform several functions that affect latency:¶
Request Processing: TLS termination, authentication, schema validation, and protocol translation. These operations add fixed overhead per request.¶
Routing: Selection of backend engine based on load, capability, or policy. Intelligent routing (e.g., KV-cache-aware) adds decision latency but may reduce overall latency by improving cache hit rates.¶
Caching: Gateways may implement response caching. Traditional exact-match caching has limited utility for LLM traffic due to low query repetition. Semantic caching (matching similar queries) improves hit rates but introduces quality risk from approximate matches.¶
Admission Control: Rate limiting and quota enforcement. Under load, admission control adds queuing delay or rejects requests.¶
Included in SUT:¶
Excluded from SUT:¶
Gateway overhead is meaningful only relative to direct engine access. Gateway benchmarks MUST declare measurement type:¶
| Measurement Type | What It Includes |
|---|---|
| Aggregate | Gateway processing plus downstream engine latency |
| Differential | Gateway overhead only, relative to direct engine access |
To measure differential latency, run an identical workload against the engine directly and again through the gateway under otherwise identical conditions. Report both absolute values and the delta.¶
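An informative sketch of this procedure is shown below; send_and_wait_first_token stands in for whatever client the tester already uses, and only the boundary being exercised differs between the two runs.¶
# Sketch of differential (gateway-overhead) measurement. The callable
# send_and_wait_first_token is a placeholder for the tester's own client.
import statistics
import time

def measure_ttft(send_and_wait_first_token, prompts):
    """Return client-observed TTFT samples (seconds) for one boundary."""
    samples = []
    for prompt in prompts:
        start = time.perf_counter()
        send_and_wait_first_token(prompt)   # blocks until first token arrives
        samples.append(time.perf_counter() - start)
    return samples

def gateway_delta(direct_samples, via_gateway_samples):
    """Differential overhead: identical workload, engine-direct vs via gateway."""
    return statistics.median(via_gateway_samples) - statistics.median(direct_samples)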
Load balancing strategy affects tail latency. Testers MUST report:¶
| Configuration | Options | Impact |
|---|---|---|
| Algorithm | Round-robin, least-connections, weighted, adaptive | Tail latency variance |
| Health checks | Interval, timeout, failure threshold | Failover speed |
| Sticky sessions | Enabled/disabled, key type | Cache locality |
| Retry policy | Max retries, backoff strategy | Failure handling |
For intelligent routing (KV-cache-aware, cost-optimized, latency-optimized):¶
Modern gateways route to multiple backend models based on capability, cost, or latency.¶
When gateway routes to heterogeneous backends, testers MUST report:¶
Per-model metrics SHOULD be reported separately.¶
Cross-gateway comparison requires backend normalization. Comparing Gateway A (routing to GPT-4) against Gateway B (routing to Llama-70B) conflates gateway performance with model performance.¶
Semantic caching matches queries by meaning rather than exact text. A cache hit on "What is the capital of France?" might serve a response cached from "France's capital city?" This improves hit rates but risks serving inappropriate responses for queries that are similar but not equivalent.¶
Configuration Disclosure:¶
| Parameter | Example | Why It Matters |
|---|---|---|
| Similarity threshold | Cosine >= 0.92 | Lower threshold: more hits, more mismatches |
| Embedding model | text-embedding-3-small | Affects similarity quality |
| Cache capacity | 100,000 entries | Hit rate ceiling |
| Eviction policy | LRU, frequency-based | Long-term hit rate |
| Cache scope | Global, per-tenant, per-user | Security and hit rate tradeoff |
| TTL | 1 hour | Staleness vs hit rate |
Required Metrics:¶
| Metric | Definition |
|---|---|
| Hit rate | Fraction of requests served from cache |
| Hit rate distribution | P50, P95, P99 of per-session hit rates |
| Latency on hit | TTFT when cache serves response |
| Latency on miss | TTFT when engine generates |
| Cache delta | Latency_miss minus Latency_hit |
| Mismatch rate | Fraction of hits where cached response was inappropriate |
Mismatch rate requires evaluation. Testers SHOULD disclose evaluation methodology (human review, automated comparison, or LLM-as-judge).¶
Session Definition: For per-session metrics, define what constitutes a session: requests sharing a session identifier, requests from the same user within a time window, or another definition. Testers MUST disclose session definition.¶
Staleness in RAG Systems: When semantic cache operates with a RAG system, cached responses may reference documents that have since been updated.¶
| Parameter | Description |
|---|---|
| Index update frequency | How often RAG index refreshes |
| Cache TTL | Maximum age of cached entries |
| Staleness risk | Estimated fraction of stale cache hits |
Staleness risk estimate:¶
staleness_risk = (average_cache_age / index_update_interval) * corpus_change_rate¶
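As an informative example with hypothetical rates:¶
# Sketch of the staleness-risk estimate above (all inputs hypothetical).
average_cache_age_h     = 0.5    # observed mean age of served cache hits (hours)
index_update_interval_h = 6.0    # RAG index refresh period (hours)
corpus_change_rate      = 0.10   # fraction of corpus changed per refresh

staleness_risk = (average_cache_age_h / index_update_interval_h) * corpus_change_rate
print(f"estimated stale-hit fraction: {staleness_risk:.3%}")   # ~0.8%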
Benchmarking Constraints: Workload diversity determines hit rate. Testers MUST report:¶
An AI Firewall is a bidirectional security intermediary that inspects LLM inputs and outputs to detect and prevent policy violations.¶
Unlike traditional firewalls that examine packet headers or match byte patterns, AI Firewalls analyze semantic content. They must understand what a prompt is asking and what a response is saying. This requires ML models, making firewall latency fundamentally different from network firewall latency.¶
The firewall sits on the request path and adds latency to every request. The core tradeoff: more thorough inspection catches more threats but costs more time.¶
Included in SUT:¶
Excluded from SUT:¶
AI Firewalls operate bidirectionally. Each direction addresses different threats.¶
Inbound Enforcement inspects user prompts before they reach the model:¶
| Threat | Description |
|---|---|
| Direct prompt injection | User attempts to override system instructions |
| Indirect prompt injection | Malicious content in retrieved documents |
| Jailbreak attempts | Techniques to bypass model safety training |
| Context poisoning | Adversarial content to manipulate model behavior |
Outbound Enforcement inspects model outputs before they reach the user:¶
| Threat | Description |
|---|---|
| PII leakage | Model outputs personal information |
| Policy violation | Output violates content policies |
| Tool misuse | Model attempts unauthorized actions |
| Data exfiltration | Sensitive information encoded in output |
Testers MUST declare which directions are enforced. A benchmark testing inbound-only enforcement MUST NOT claim protection against outbound threats.¶
Firewalls use different inspection strategies with distinct latency characteristics.¶
INSPECTION ARCHITECTURE COMPARISON
BUFFERED (adds to TTFT, no ITL impact):
Input: ================........................
|-- collect --|-- analyze --|-- forward -->
Output: ..................========........
|- engine generates -|- buffer -|- analyze -|-->
t7 t12
|-- inspection delay --|
STREAMING (no TTFT impact, adds to ITL):
Output: ..o..o..o..o..o..o..o..o..o..o..o..o..o..o..o.
| | | | | | |
inspect | inspect | inspect | inspect
| | | | | | |
--o-----o-----o-----o-----o-----o-----o----->
Variable delays, increased jitter
Buffered Inspection: The firewall collects complete input (or output) before analysis.¶
Characteristics:¶
For outbound buffered inspection, the client receives the first token later than the engine generates it. This distinction matters:¶
Engine TTFT (t7 - t6):           200ms
Outbound inspection:              50ms
Client-observed TTFT (t12 - t0): 250ms + network¶
Streaming Inspection: The firewall analyzes content as tokens flow through.¶
Characteristics:¶
Required measurements:¶
| Metric | Definition |
|---|---|
| Per-token inspection delay | Average latency added per token |
| Maximum pause duration | Longest delay during streaming |
| Pause frequency | How often inspection causes batching |
| Jitter contribution | Standard deviation of delays |
Hybrid Inspection: Initial buffering followed by streaming. Common pattern: buffer first N tokens for context, then stream with spot-checks.¶
Configuration to disclose:¶
Accuracy Metrics:¶
| Metric | Definition |
|---|---|
| Detection Rate | Fraction of malicious inputs correctly blocked |
| False Positive Rate (FPR) | Fraction of benign inputs blocked by firewall |
| False Refusal Rate (FRR) | Fraction of policy-compliant requests refused at system boundary |
| Over-Defense Rate | FPR conditional on trigger-word presence in benign inputs |
FPR vs FRR: FPR measures firewall classifier errors on a benign test set. FRR measures all refusals observed at the system boundary, which may include firewall blocks, model-level safety refusals, and downstream policy rejections.¶
Therefore: FRR >= FPR when other refusal sources exist.¶
When reporting both, attribute refusals by source when possible.¶
Over-Defense Rate: Measures false positives on benign inputs that contain words commonly associated with attacks.¶
Over-Defense Rate = P(Block | Benign AND Contains_Trigger_Words)¶
Examples of benign inputs that may trigger over-defense include security-education questions that name attack techniques and quoted or cited text that discusses jailbreaks.¶
The test corpus for over-defense MUST contain semantically benign inputs that happen to include trigger words. Testing with trivially benign inputs does not measure real over-defense risk.¶
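An informative sketch of computing this rate from labeled benchmark records follows; the record field names are illustrative.¶
# Sketch: Over-Defense Rate from labeled results. Field names are illustrative.
def over_defense_rate(records):
    """records: iterable of dicts with 'benign', 'has_trigger_words', 'blocked'."""
    eligible = [r for r in records if r["benign"] and r["has_trigger_words"]]
    if not eligible:
        return None   # corpus did not exercise the trigger-word condition
    return sum(r["blocked"] for r in eligible) / len(eligible)

sample = [
    {"benign": True,  "has_trigger_words": True,  "blocked": True},
    {"benign": True,  "has_trigger_words": True,  "blocked": False},
    {"benign": True,  "has_trigger_words": False, "blocked": False},
    {"benign": False, "has_trigger_words": True,  "blocked": True},
]
print(over_defense_rate(sample))   # 0.5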
Latency Metrics:¶
| Metric | Definition |
|---|---|
| Passing latency | Overhead when firewall allows request |
| Blocking latency | Time to reach block decision |
| Throughput degradation | Reduction in requests per second |
Latency may vary by decision path:¶
Example:
  Allow (no flags):                8ms
  Allow (flagged, deep analysis): 45ms
  Block (pattern match):           3ms
  Block (semantic analysis):      67ms¶
Report latency distribution by decision type.¶
AI Firewall benchmarks require careful workload design.¶
Benign Workload: Normal traffic with no policy violations. Measures passing latency, FRR, and throughput impact on legitimate use. Source: Sanitized production samples or standard datasets.¶
Adversarial Workload: Known attack patterns. Measures detection rate, blocking latency, and FPR. Source: Published datasets (BIPIA [BIPIA], JailbreakBench, PromptInject) or red team generated. Do not publish working exploits.¶
Mixed Workload (recommended): Combines benign and adversarial at declared ratio.¶
| Parameter | Example |
|---|---|
| Mix ratio | 95% benign, 5% adversarial |
| Adversarial categories | 40% injection, 30% jailbreak, 30% PII |
| Arrival pattern | Uniform or bursty |
Production deployments often stack multiple inspection layers.¶
Request -> Quick Filter -> ML Classifier -> Model -> Semantic Check -> PII Scan -> Response
| | | |
regex embedding output entity
+ rules classifier analysis detection
When multiple layers exist, report:¶
Delta decomposition:¶
Total overhead = Quick(2ms) + ML(12ms) + Semantic(34ms) + PII(8ms) = 56ms

With short-circuit on input block:
  Overhead = Quick(2ms) + ML(12ms) = 14ms¶
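The following informative sketch reproduces this decomposition; the per-layer latencies are the example values above and would be replaced by measured per-layer timings.¶
# Sketch of the layered-overhead decomposition above (latencies in ms).
LAYERS_IN  = [("quick_filter", 2.0), ("ml_classifier", 12.0)]
LAYERS_OUT = [("semantic_check", 34.0), ("pii_scan", 8.0)]

def total_overhead(blocked_on_input: bool) -> float:
    """Sum per-layer overhead, short-circuiting output layers on input block."""
    overhead = sum(ms for _, ms in LAYERS_IN)
    if not blocked_on_input:
        overhead += sum(ms for _, ms in LAYERS_OUT)
    return overhead

print(total_overhead(blocked_on_input=False))   # 56.0 ms
print(total_overhead(blocked_on_input=True))    # 14.0 ms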
Blocking speed alone is meaningless. A firewall blocking all requests in 1ms is useless. Always measure impact on benign traffic alongside detection effectiveness.¶
Disclose integration with WAF, rate limiting, or DDoS protection. These add latency.¶
Different attack categories have different detection latencies. Pattern-based detection is faster than semantic analysis. Report detection latency by category.¶
A Compound System executes multiple inference, retrieval, and tool-use steps to satisfy a user intent. The system orchestrates these steps, manages state across them, and produces a final response.¶
Examples: RAG pipelines, multi-agent systems, tool-using assistants, coding agents.¶
Unlike single-inference benchmarks, compound system benchmarks measure task completion, not token generation. The primary question is "Did it accomplish the goal?" not "How fast did it generate tokens?"¶
Included in SUT:¶
Excluded from SUT:¶
Boundary Rule: The Compound System boundary includes only components deployed and controlled as part of the serving system. User-provided plugins or custom code at runtime are excluded. This prevents ambiguity when comparing systems with different extensibility models.¶
| Component | Included? | Rationale |
|---|---|---|
| Built-in retrieval | Yes | Part of serving system |
| Standard tool library | Yes | Shipped with system |
| User-uploaded plugin | No | User-supplied |
| External API (weather) | Latency measured | Outside boundary |
| Metric | Definition |
|---|---|
| Task Completion Latency | Time from user request to final response |
| Task Success Rate | Fraction of tasks completed correctly |
Task Success has two dimensions:¶
| Type | Definition | Evaluation |
|---|---|---|
| Hard Success | Structural correctness | Automated (valid JSON, no errors) |
| Soft Success | Semantic correctness | Requires evaluation |
When using automated evaluation for Task Success Rate, disclose oracle methodology.¶
LLM-as-Judge:¶
| Parameter | Report |
|---|---|
| Judge model | Identifier and version |
| Judge prompt | Full prompt or published rubric reference |
| Ground truth access | Whether judge sees reference answers |
| Sampling | Temperature, judgments per task |
Report inter-rater agreement if using multiple judges.¶
Rule-Based Evaluation:¶
| Parameter | Report |
|---|---|
| Rule specification | Formal definition |
| Coverage | Fraction of criteria that are rule-checkable |
| Edge case handling | How ambiguous cases resolve |
Human Evaluation:¶
| Parameter | Report |
|---|---|
| Evaluator count | Number of humans |
| Rubric | Criteria and scoring |
| Agreement | Inter-rater reliability (e.g., Cohen's Kappa) |
| Blinding | Whether evaluators knew system identity |
| Metric | Definition |
|---|---|
| Trace Depth | Sequential steps in execution |
| Fan-out Factor | Maximum parallel sub-requests |
| Sub-Request Count | Total LLM calls per user request |
| Loop Incidence Rate | Fraction of tasks with repetitive non-progressing actions |
| Stalled Task Rate | Fraction of tasks hitting step limit without resolution |
| State Management Overhead | Latency and memory for multi-turn context |
Stalled Task Rate:¶
Stalled Task Rate = Tasks_reaching_max_steps / Total_tasks¶
Stalled tasks differ from loops. A loop repeats similar actions. A stalled task may try diverse actions but fail to converge. Both indicate problems but different ones.¶
When Compound System includes Retrieval-Augmented Generation:¶
RAG PIPELINE LATENCY
Query --> Embed --> Search --> Rerank --> Inject --> Generate
| | | | | |
Q E S R I G
| | | | | |
----------------------------------------------------------->
0ms 15ms 60ms 180ms 185ms 385ms
| | | | |
+--15ms----+ | | |
+-----45ms-------+ | |
+-----120ms------+ |
+--5ms-+ |
+--200ms----+
TTFT = E + S + R + I + Prefill + Queue = 385ms
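An informative sketch that decomposes per-stage latency from stage-completion timestamps, using the example figures above, is shown below.¶
# Sketch: per-stage RAG latency decomposition from stage-completion
# timestamps (milliseconds; values are the example figures above).
stages = {"embed": 15, "search": 60, "rerank": 180, "inject": 185, "first_token": 385}

prev = 0
for name, t in stages.items():
    print(f"{name:<12}{t - prev:>5} ms")   # duration of each stage
    prev = t
print(f"{'TTFT':<12}{stages['first_token']:>5} ms")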
Configuration Disclosure:¶
| Component | Parameters |
|---|---|
| Embedding | Model, dimensions, batch size |
| Vector store | Type, index configuration |
| Search | Top-k, similarity metric, filters |
| Reranking | Model (if used), top-n after rerank |
| Context | Max tokens, formatting template |
RAG-Specific Metrics:¶
| Metric | Definition |
|---|---|
| Embedding Latency | Query to vector conversion |
| Retrieval Latency | Search and fetch time |
| Retrieval Recall | Fraction of relevant docs retrieved |
| Context Injection Overhead | Additional prefill from retrieved content |
Corpus Constraints:¶
| Characteristic | Impact |
|---|---|
| Corpus size | Larger means longer search |
| Document length | Longer means more context overhead |
| Semantic diversity | More diverse reduces precision |
Report corpus statistics: document count, average length, domain.¶
Vector index must be fully built before measurement.¶
For multi-agent or tool-using systems:¶
AGENTIC EXECUTION TRACE
User Request
|
v
+---------+ +---------+ +---------+
| Planner |---->| Agent A |---->| Agent B |
| (LLM) | | (LLM) | | (LLM) |
+----+----+ +----+----+ +----+----+
| | |
| +----+----+ |
| v v |
| +-------+ +-------+ |
| | Tool | | Tool | |
| | API | | DB | |
| +-------+ +-------+ |
| | | |
| +----+----+ |
| v |
| +---------+ |
+--------->| Final |<---------+
| Response|
+---------+
Trace depth: 4 (Planner -> A -> Tools -> B)
Fan-out: 2 (parallel tool calls)
Sub-requests: 3 LLM calls
Definitions:¶
| Term | Definition |
|---|---|
| Agent invocation | Single LLM call with specific role |
| Tool call | External capability invocation |
| Orchestration step | Planning/routing decision |
| Trace | Complete sequence for one user request |
Measurement Points:¶
| Metric | Start | End |
|---|---|---|
| Per-agent latency | Agent receives input | Agent produces output |
| Per-tool latency | Tool call initiated | Response received |
| Orchestration overhead | Previous step complete | Next step starts |
| Task completion | User request received | Final response delivered |
Custom user application logic and bespoke agent frameworks are out of scope. This profile covers general patterns, not specific implementations.¶
Workload profiles specify traffic patterns applied to infrastructure profiles. They do not define measurement boundaries.¶
| Characteristic | Description |
|---|---|
| Interaction | Stateful, multi-turn |
| Delivery | Streaming |
| Arrival | Closed-loop (user thinks between turns) |
| Session length | Variable, typically 3-20 turns |
| Parameter | Description | Example |
|---|---|---|
| Arrival model | Open or closed loop | Closed-loop |
| Think-time | User delay between turns | Exponential, mean=5s |
| Input length | Tokens per user message | Log-normal, median=50 |
| Output length | Tokens per response | Log-normal, median=150 |
| Context retention | History handling | Sliding window, 4K tokens |
| Session length | Turns per conversation | Geometric, mean=8 |
Chatbot Workload: Customer Support
  Arrival:       Closed-loop, 100 concurrent sessions
  Think-time:    Exponential(mean=8s)
  Input:         Log-normal(mu=4.0, sigma=0.8), range [10, 500]
  Output:        Log-normal(mu=5.0, sigma=1.0), range [20, 1000]
  Context:       Sliding window, last 4000 tokens
  Session:       Geometric(p=0.12), mean ~8 turns
  System prompt: 200 tokens, shared¶
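The following informative sketch generates session traces matching this parameterization; the helper names are illustrative, and the sliding context window and shared system prompt are not modeled.¶
# Sketch: closed-loop chatbot workload generator matching the example
# parameterization above (illustrative; context handling not modeled).
import random

def sample_session(rng: random.Random):
    """One customer-support session: think times and token lengths per turn."""
    n_turns = 1
    while rng.random() > 0.12:              # Geometric(p=0.12), mean ~8 turns
        n_turns += 1
    turns = []
    for _ in range(n_turns):
        turns.append({
            "think_time_s": rng.expovariate(1 / 8.0),                          # mean 8 s
            "input_tokens": int(min(max(rng.lognormvariate(4.0, 0.8), 10), 500)),
            "output_tokens": int(min(max(rng.lognormvariate(5.0, 1.0), 20), 1000)),
        })
    return turns

rng = random.Random(42)                      # fixed seed for reproducibility
sessions = [sample_session(rng) for _ in range(100)]   # 100 concurrent sessions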
| Characteristic | Description |
|---|---|
| Execution | Multi-step, may include parallel branches |
| Tool usage | API calls, code execution, database queries |
| Dependencies | Steps may depend on previous outputs |
| Failure modes | Steps may fail, requiring retry or alternatives |
| Parameter | Description | Example |
|---|---|---|
| Task complexity | Steps per task | Fixed=5 or distribution |
| Fan-out pattern | Parallel vs sequential | Max parallel=3 |
| Tool latency | External dependency behavior | Real, mocked, simulated |
| Failure injection | Simulated failures | 5% tool failure rate |
| Retry behavior | Failure handling | Max 2 retries, exponential backoff |
Compound workflows depend on external systems. Disclose handling:¶
| Approach | Description | When |
|---|---|---|
| Real | Actual API calls | Production-representative |
| Mocked | Fixed responses | Controlled experiments |
| Simulated | Statistical model | Reproducible benchmarks |
Report observed latency and failure rate for real dependencies. Report configured values for mocked dependencies.¶
Section 4 and Section 5 defined individual profiles. Production systems compose multiple profiles. A request may pass through Gateway, Firewall, and Engine before response generation.¶
Meaningful comparison across composed systems requires attributing latency to each component. This section defines the delta measurement model.¶
Consider a request flowing through a full stack:¶
Client
  |
  |  t0: request sent
  v
+-------------------+
|    AI Gateway     |
|  t1: arrives      |
|  t2: exits        |
+-------------------+
  |
  v
+-------------------+
|    AI Firewall    |
|  t3: arrives      |
|  t4: decision     |
|  t5: exits        |
+-------------------+
  |
  v
+-------------------+
|   Model Engine    |
|  t6:  queue entry |
|  t6a: exec start  |
|  t7:  first token |
|  t8:  last token  |
+-------------------+
  |
  v
+-------------------+
|    Output Path    |
|  t9:  fw receives |
|  t10: fw releases |
|  t11: gw releases |
+-------------------+
  |
  v
t12: client receives first token
| Timestamp | Location | Event |
|---|---|---|
| t0 | Client | Request transmission begins |
| t1 | Gateway | Request arrives |
| t2 | Gateway | Request exits toward firewall |
| t3 | Firewall | Request arrives |
| t4 | Firewall | Inbound decision reached |
| t5 | Firewall | Request exits toward engine |
| t6 | Engine | Request enters queue |
| t6a | Engine | Prefill computation begins |
| t7 | Engine | First output token generated |
| t8 | Engine | Last output token generated |
| t9 | Firewall | First token arrives for outbound inspection |
| t10 | Firewall | First token released after inspection |
| t11 | Gateway | First token exits toward client |
| t12 | Client | Client receives first token |
| Component | Formula | Measures |
|---|---|---|
| Gateway inbound | t2 - t1 | Auth, validation, routing |
| Firewall inbound (pass) | t5 - t3 | Prompt inspection |
| Firewall inbound (block) | t4 - t3 | Time to block |
| Engine queue | t6a - t6 | Wait before execution |
| Engine prefill | t7 - t6a | Prefill computation |
| Engine TTFT | t7 - t6 | Queue plus prefill |
| Firewall outbound | t10 - t9 | Output inspection |
| Gateway outbound | t11 - t10 | Response processing |
| Metric | Formula | Notes |
|---|---|---|
| Engine TTFT | t7 - t6 | At engine boundary |
| System TTFT | t12 - t0 | Client-observed |
| Output path overhead | t12 - t7 | Delay from engine emit to client receive |
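As an informative illustration, the sketch below computes the component deltas and derived metrics above from a single request's timestamp record; the timestamp values are hypothetical.¶
# Sketch: per-component deltas and derived TTFT metrics from one
# request's timestamp record (seconds; values are hypothetical).
ts = {"t0": 0.000, "t1": 0.004, "t2": 0.019, "t3": 0.020, "t5": 0.065,
      "t6": 0.066, "t6a": 0.090, "t7": 0.246, "t9": 0.247, "t10": 0.259,
      "t11": 0.262, "t12": 0.266}

deltas = {
    "gateway_inbound":   ts["t2"]  - ts["t1"],
    "firewall_inbound":  ts["t5"]  - ts["t3"],
    "engine_queue":      ts["t6a"] - ts["t6"],
    "engine_prefill":    ts["t7"]  - ts["t6a"],
    "engine_ttft":       ts["t7"]  - ts["t6"],
    "firewall_outbound": ts["t10"] - ts["t9"],
    "gateway_outbound":  ts["t11"] - ts["t10"],
    "system_ttft":       ts["t12"] - ts["t0"],
    "output_path":       ts["t12"] - ts["t7"],
}
for name, value in deltas.items():
    print(f"{name:<18}{value * 1e3:7.1f} ms")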
Delta metrics within a single component (t2 - t1, both from gateway clock) are reliable. Cross-component deltas (t6 - t5) require clock synchronization.¶
For end-to-end metrics involving client timestamps (t0, t12), clock skew introduces error.¶
Options:¶
Recommended practice: Calculate deltas within components rather than across boundaries when possible.¶
See Section 9.3 for synchronization requirements.¶
When SUT includes multiple profiles, testers MUST:¶
1. Enumerate all components in request path:¶
Client -> AI Gateway -> AI Firewall -> Model Engine -> AI Firewall -> AI Gateway -> Client¶
2. Declare measurement boundary:¶
| Type | Description |
|---|---|
| Full-stack | Client to response, all components |
| Per-component | Separate measurement at each boundary |
| Partial | Specific subset (e.g., Gateway + Engine) |
3. Provide delta decomposition:¶
Component          | TTFT Contribution | Throughput Impact
-------------------|-------------------|------------------
AI Gateway         | +15ms             | -3%
AI Firewall (in)   | +45ms             | -8%
Model Engine       | 180ms (baseline)  | baseline
AI Firewall (out)  | +12ms*            | -12%
-------------------|-------------------|------------------
Total              | 252ms             | -22%

*Outbound adds to client-observed TTFT, not engine TTFT¶
Measure components independently before measuring the composite, and check that the composite latency approximately equals the sum of component contributions, as sketched below.¶
If validation fails, interaction effects exist. Document them.¶
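An informative sketch of this additivity check follows; the tolerance and example values are hypothetical.¶
# Sketch: validate additive composition. If measured composite latency
# deviates from the sum of independently measured components by more than
# a tolerance, interaction effects likely exist and should be documented.
def check_composition(component_ms: dict, composite_ms: float, tolerance: float = 0.05):
    predicted = sum(component_ms.values())
    gap = composite_ms - predicted
    additive = abs(gap) <= tolerance * predicted
    return predicted, gap, additive

components = {"gateway": 15.0, "firewall_in": 45.0, "engine_ttft": 180.0}
predicted, gap, additive = check_composition(components, composite_ms=265.0)
print(f"predicted {predicted} ms, observed gap {gap:+.1f} ms, additive={additive}")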
Components may interact beyond simple addition:¶
| Effect | Description | Example |
|---|---|---|
| Batching interference | Gateway batching conflicts with engine | Gateway batches 8, engine max is 4 |
| Cache interaction | High gateway cache hit means engine sees hard queries | Biased difficulty |
| Backpressure | Slow component causes upstream queuing | Firewall slowdown grows gateway queue |
| Timeout cascades | Mismatched timeouts waste resources | See below |
Timeout Cascades:¶
TIMEOUT CASCADE (mismatched configurations)
Gateway timeout: 10s -------------+
Firewall timeout: 15s ---------------------+
Engine timeout: 30s -------------------------------------+
| | | |
Time: 0s 10s 15s 30s
| | | |
+-- Request ---->| | |
| | | |
| Gateway -----X timeout | |
| (returns error to client) | |
| | | |
| Firewall -----------------+ |
| (still waiting) | |
| | |
| Engine ---------------------------+
| (completes at 12s, result discarded)
| | |
+-------------------------------------+
Result: Client gets error at 10s. Engine wastes 12s of compute.
Report timeout configurations and note mismatches.¶
All profiles MUST log:¶
| Field | Description |
|---|---|
| timestamp | Request start time |
| request_id | Unique identifier |
| profile | Infrastructure profile under test |
| workload | Workload profile applied |
| latency_ms | Total request latency |
| status | Success, error, timeout |
| Field | Description |
|---|---|
| queue_time_ms | Time in queue |
| prefill_time_ms | Prefill latency |
| decode_time_ms | Generation time |
| batch_size | Concurrent requests in batch |
| token_count_in | Input tokens |
| token_count_out | Output tokens |
| Field | Description |
|---|---|
| direction | Inbound or outbound |
| decision | Allow, block, modify |
| policy_triggered | Which policy matched |
| confidence | Detection confidence |
| inspection_time_ms | Analysis time |
| Field | Description |
|---|---|
| trace_id | Identifier linking all steps |
| step_count | Total orchestration steps |
| tool_calls | List of tools invoked |
| success_type | Hard, soft, or failure |
| Field | Description |
|---|---|
| cache_status | Hit, miss, or bypass |
| route_target | Selected backend |
| token_count_in | Input tokens |
| token_count_out | Output tokens |
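As an informative example, a single composite-path log record might combine the common fields with the per-profile extensions defined above (expressed here as a Python dictionary; all values are hypothetical).¶
# Illustrative composite-path log record combining the common fields with
# engine, firewall, and gateway extensions (all values hypothetical).
record = {
    "timestamp": "2026-01-15T10:32:07.412Z",
    "request_id": "req-000123",
    "profile": "composite(gateway+firewall+engine)",
    "workload": "chatbot",
    "latency_ms": 252,
    "status": "success",
    "engine": {"queue_time_ms": 24, "prefill_time_ms": 156, "decode_time_ms": 1180,
               "batch_size": 12, "token_count_in": 512, "token_count_out": 160},
    "firewall": {"direction": "inbound", "decision": "allow",
                 "policy_triggered": None, "confidence": 0.98,
                 "inspection_time_ms": 45},
    "gateway": {"cache_status": "miss", "route_target": "engine-pool-a",
                "token_count_in": 512, "token_count_out": 160},
}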
OpenTelemetry integration SHOULD be supported. Reference GenAI semantic conventions when available.¶
For intermediary components (Gateway, Firewall), provide differential measurements:¶
Declare whether results include cold start.¶
| Profile | Cold Start Factors |
|---|---|
| Model Engine | JIT compilation, KV cache allocation, batch ramp-up |
| AI Gateway | Connection pool, cache population |
| AI Firewall | Model loading, rule compilation |
| Compound System | All above plus retrieval index loading |
If excluding cold start, report warm-up procedure and duration.¶
| Configuration | Minimum Accuracy | Method |
|---|---|---|
| Single-machine | Inherent | N/A |
| Same rack | 1ms | NTP |
| Distributed | 100us | PTP |
| Sub-millisecond analysis | 10us | PTP with hardware timestamps |
Reports MUST declare:¶
| Profile | Recommended Protocol | Notes |
|---|---|---|
| Model Engine | gRPC streaming | Lower overhead |
| AI Gateway | SSE over HTTP | Broad compatibility |
| AI Firewall | Match upstream/downstream | Minimize translation |
| Compound System | SSE or WebSocket | Client dependent |
Report chunk size distribution when measuring ITL.¶
AI Firewalls enforcing only one direction leave systems exposed.¶
Inbound-only gaps:¶
Outbound-only gaps:¶
Declare which directions are enforced. "AI Firewall protection" without direction is incomplete.¶
Security requirements for adversarial benchmarks:¶
Performance characteristics may leak information:¶
| Channel | Risk | Mitigation |
|---|---|---|
| Timing | Decision time reveals classification | Add noise |
| Cache | Hit patterns reveal similarity | Per-tenant isolation |
| Routing | Balancing reveals backend state | Randomize |
Multi-tenant benchmarks SHOULD measure side-channel exposure.¶
1. Executive Summary
   - SUT and profile(s) used
   - Key results
2. System Configuration
   - Hardware
   - Software versions
   - Profile-specific config (per Section 4)
3. Workload Specification
   - Workload profile
   - Parameters (per Section 5)
   - Dataset sources
4. Methodology
   - Measurement boundary
   - Clock synchronization
   - Warm-up procedure
   - Duration and request counts
5. Results
   - Primary metrics with percentiles
   - Secondary metrics
   - Delta decomposition (if composite)
6. Analysis
   - Observations
   - Interaction effects
   - Limitations
7. Reproduction
   - Config files
   - Scripts
   - Random seeds¶