Workgroup: Network Management
Internet-Draft: draft-liu-nmrg-ai-llm-inference-requirements-00
Published: March 2025
Intended Status: Informational
Expires: 4 September 2025
Authors: C. Liu, China Mobile
         C. Guo, China Mobile

Requirements Analysis of System and Network for Large Language Model Inference Service

Abstract

With the rise of ChatGPT, DeepSeek, and other Large Language Models (LLMs), as well as the proliferation of inference applications, inference serving oriented to large-scale users has become increasingly critical. However, because inference places extreme demands on computing power and communication, deploying LLM services at scale poses significant challenges. To address these challenges, vendors have adopted diverse inference service architectures, among which vLLM, proposed in 2023, is the most representative. This document surveys mainstream inference frameworks, summarizes their core design principles, and analyzes the requirements and challenges they impose on system and network configurations. The goal is to lay a foundation for defining a unified LLM inference architecture in the future.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 4 September 2025.


1. Introduction

Since the launch of ChatGPT in late 2022, more and more product-level LLMs have emerged, with GPT-4o, Claude 3.5 Sonnet, Gemini, Kimi, and others leading the charge. In early 2025, DeepSeek-R1 reignited the LLM frenzy, and xAI recently unveiled the powerful Grok 3. It is evident that LLMs will continue to reach new heights.

Major vendors, including OpenAI, Anthropic, DeepSeek, and Google, have deployed their LLM applications across mobile and web platforms. As the field grows, daily active users (DAUs) for these applications are expected to surge, potentially reaching hundreds of millions during peak periods. This presents significant challenges for large-scale inference services. For instance, as of this writing, DeepSeek still struggles with persistent "Service Busy" issues.

Existing large-scale inference service architectures primarily adopt two technical approaches, Prefill-Decoding (PD) Fusion and Prefill-Decoding (PD) Disaggregation, a distinction that derives from the different computational characteristics of the Prefill (compute-intensive) and Decoding (memory-intensive) phases. Efficient network management and hardware coordination are essential to maximize system throughput and minimize user-perceived latency.

This document first introduces mainstream inference frameworks, then their optimization metrics, and finally elaborates on the network and system requirements for deploying large-scale LLM inference services.

2. Service-Oriented Inference Frameworks

At present, mainstream LLM serving systems follow two main technical routes, namely PD Fusion and PD Disaggregation. Prefill processes all tokens of a user request (the prompt) in parallel; it is compute-intensive and compute-bound, with extremely high computing power requirements. Decoding generates the user-requested content based on the KV Cache and the first token produced by the Prefill phase; because it reuses the KV Cache of all tokens prior to the current token, it is memory-intensive and memory-bound, with higher memory requirements in the Decoding phase. A complete LLM inference procedure is shown in Figure 1. Depending on whether these two stages with clearly different computing requirements are decoupled, two technical routes for LLM inference serving systems emerge, namely PD Fusion and PD Disaggregation. The rest of this section describes the two architectures in detail.

 +-------------+  +-------------+ +-------------+ +-------------+
 |     LLM     |  |     LLM     | |     LLM     | |     LLM     |
 | Iteration 1 +-+| Iteration 2 ++| Iteration 3 ++| Iteration 4 ++
 +-----^-------+ |+---^------^--+|+---^-------^-+|+---^-------^-+|
       |         |    |      |   |    |       |  |    |       |  |
       |         | +--+--+   |   | +--+--+    |  | +--+--+    |  |
<Prompt:Is apple | | KV  | <Yes> | | KV  |  <It> | | KV  |  <Is> |<EOS>
        a fruit?>| |Cache|   ^   | |Cache|    ^  | |Cache|    ^  |  ^
                 | +--^--+   |   | +--^--+    |  | +--^--+    |  |  |
                 |    |      |   |    |       |  |    |       |  |  |
                 +----+------+   +----+-------+  +----+-------+  +--+

+-----Prefill----+--------------------Decoding----------------------+
Figure 1: LLM Inference Process

Prefill: Processes all tokens in user prompts (Parallelizable, compute-bound, requiring high computing power).

Decoding: Generates output tokens sequentially based on the KV Cache from Prefill (Memory-bound, requiring high GPU memory).
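
The following non-normative Python sketch illustrates the two phases described above. It is a minimal, self-contained toy in which the model, the tokenizer, and the generated tokens are placeholders invented purely for illustration (they are not part of any real framework): Prefill builds the KV Cache from the whole prompt at once, while Decoding appends one token per iteration by reusing that cache.

   # Illustrative sketch only: a toy prefill/decoding loop.
   # All names (toy_attention_kv, prefill, decode) and the canned
   # output tokens are hypothetical placeholders.

   EOS = "<EOS>"

   def toy_attention_kv(token):
       # Placeholder for the per-token key/value tensors a real model
       # would compute; here we simply echo the token.
       return {"key": token, "value": token}

   def prefill(prompt_tokens):
       # Compute-bound phase: all prompt tokens are processed together
       # and the KV Cache is built; the first output token is produced.
       kv_cache = [toy_attention_kv(t) for t in prompt_tokens]
       first_token = "Yes"          # placeholder first output token
       return kv_cache, first_token

   def decode(kv_cache, first_token, max_new_tokens=8):
       # Memory-bound phase: each iteration reuses the KV Cache of all
       # previous tokens and appends exactly one new entry.
       output = [first_token]
       canned = ["It", "Is", EOS]   # placeholder continuation
       for token in canned[:max_new_tokens]:
           kv_cache.append(toy_attention_kv(output[-1]))
           output.append(token)
           if token == EOS:
               break
       return output

   if __name__ == "__main__":
       prompt = ["Is", "apple", "a", "fruit", "?"]
       cache, first = prefill(prompt)
       print(" ".join(decode(cache, first)))   # Yes It Is <EOS>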

2.1. PD Fusion Architecture

In PD Fusion, LLM instances are deployed within a single cluster and managed by a global scheduler responsible for load balancing, KV Cache management, and resource allocation. Most frameworks adopt vLLM's [vLLM] paged KV Cache mechanism, inspired by OS virtual memory management. This approach stores the KV Cache in non-contiguous physical blocks across nodes and uses the scheduler to map logical blocks to physical memory. Additionally, prefix-sharing strategies reuse the KV Cache of prompts with identical prefixes, reducing redundant computation. Remote KV Cache replication across nodes is also required to avoid recomputing the KV Cache of identical tokens. The architecture is shown in Figure 2.

                      Request1/Prompt1
                      Request2/Prompt2
                              |
                              |
                 +------------v------------+
                 |                         |
                 |  Scheduler/Controller   |
       Request1  |                         | Request2
     +-----------+  *********************  +----------+
     |           |  *KV Cache Management*  |          |
     |           |  *  Load Balancing   *  |          |
     |           |  *     ... ...       *  |          |
     |           |  *********************  |          |
     |           +-------------------------+          |
     |                                                |
     |                                                |
     |                                                |
+----v-----+  Remote    +----------+   Remote    +----v-----+
|  Model   |KVCache copy|  Model   | KVCache copy|  Model   |
|Instance 1<----------->|Instance 2|<------------>Instance 3|
+----------+            +----------+             +----------+
Figure 2: PD Fusion Architecture
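
The paged KV Cache idea described above can be summarized with the following non-normative sketch. It is a simplified illustration of mapping a request's logical blocks onto non-contiguous physical blocks; the block size, class names, and allocation policy are assumptions for illustration, not the actual vLLM implementation.

   # Simplified sketch of a paged KV Cache block table, inspired by
   # OS-style paging. Names and block size are illustrative assumptions.

   BLOCK_SIZE = 4            # tokens per physical block (assumed)
   NUM_PHYSICAL_BLOCKS = 16

   class PagedKVCache:
       def __init__(self):
           self.free_blocks = list(range(NUM_PHYSICAL_BLOCKS))
           self.block_tables = {}   # request_id -> [physical block ids]

       def append_token(self, request_id, token_index):
           # Map the token's logical block to a physical block,
           # allocating a new (possibly non-contiguous) physical block
           # when the current one is full.
           table = self.block_tables.setdefault(request_id, [])
           logical_block = token_index // BLOCK_SIZE
           if logical_block >= len(table):
               if not self.free_blocks:
                   raise MemoryError("no free KV Cache blocks")
               table.append(self.free_blocks.pop(0))
           return table[logical_block]

       def release(self, request_id):
           # Return all physical blocks of a finished request to the pool.
           self.free_blocks.extend(self.block_tables.pop(request_id, []))

   cache = PagedKVCache()
   for i in range(6):                     # 6 tokens -> 2 physical blocks
       cache.append_token("req-1", i)
   print(cache.block_tables["req-1"])     # e.g. [0, 1]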

2.2. PD Disaggregation Architecture

In PD Disaggregation, Prefill and Decoding are decoupled into separate instances to optimize hardware utilization. After Prefill computes the full KV Cache for a prompt, the data is transferred to Decoding instances for text generation. This architecture demands efficient coordination between Prefill and Decoding instances, as well as reliable high-speed data transmission. The workflow is illustrated in Figure 3.

                      Request1/Prompt1
                      Request2/Prompt2
                              |
                              |
                 +------------v------------+
                 |                         |
                 |  Scheduler/Controller   |
       Request1  |                         | Request2
     +-----------+  *********************  +----------+
     |           |  *KV Cache Management*  |          |
     |           |  *  Load Balancing   *  |          |
     |           |  *     ... ...       *  |          |
     |           |  *********************  |          |
     |           +-------------------------+          |
     |                                                |
     |                                                |
     |                                                |
+----v-----+                                     +----v-----+
|  Model   |                                     |  Model   |
|          |       Remote KVCache copy           |          |
| Prefill  <-------------------------------------> Prefill  |
|Instance 1|                                     |Instance 2|
+----+-----+                                     +----+-----+
     |KV Cache                                KV Cache|
     |Transfer                                Transfer|
     |                                                |
+----v-----+                                     +----v-----+
|  Model   |                                     |  Model   |
|          |                                     |          |
|Decoding  |                                     |Decoding  |
|Instance 1|                                     |Instance 2|
+----------+                                     +----------+
Figure 3: PD Disaggregation Architecture
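
The hand-off between Prefill and Decoding instances can be summarized with the minimal sketch below. It models the KV Cache transfer as an in-process queue purely for illustration; in a real deployment this hop crosses the network, and the function and queue names are assumptions rather than part of any specific framework.

   # Minimal sketch of the PD Disaggregation hand-off. The in-process
   # queue stands in for the network link between instances; all names
   # are illustrative assumptions.
   import queue

   kv_transfer = queue.Queue()   # stands in for the Prefill->Decoding link

   def prefill_instance(request_id, prompt_tokens):
       # Build the full KV Cache and the first token, then ship both.
       kv_cache = [("kv", t) for t in prompt_tokens]
       first_token = "Yes"
       kv_transfer.put((request_id, kv_cache, first_token))

   def decoding_instance():
       # Receive the KV Cache and continue generation token by token.
       request_id, kv_cache, first_token = kv_transfer.get()
       output = [first_token, "it", "is", "<EOS>"]   # placeholder tokens
       return request_id, output

   prefill_instance("req-1", ["Is", "apple", "a", "fruit", "?"])
   print(decoding_instance())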

3. System Goodput and Optimization Metrics

The ultimate goals of an inference system are to maximize system goodput, which reflects the volume of user requests served, and to minimize user-perceived latency. For PD Disaggregation architectures, two key metrics are defined as follows:

TTFT (Time to First Token): The time from receiving a request until the Prefill phase generates the first token.

TBT (Time Between Tokens): The interval between consecutive token generations in the Decoding phase.

Optimization aims to minimize both TTFT and TBT under resource and Service Level Objective (SLO) constraints.
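
As a purely hypothetical illustration of how these two metrics might be measured from per-token timestamps (the timestamp values below are made up):

   # Hypothetical measurement of TTFT and TBT from token timestamps.
   # request_time and token_times are made-up example values (seconds).
   request_time = 0.00
   token_times = [0.35, 0.40, 0.46, 0.51, 0.57]   # emission time per token

   ttft = token_times[0] - request_time
   tbt = [b - a for a, b in zip(token_times, token_times[1:])]
   avg_tbt = sum(tbt) / len(tbt)

   print(f"TTFT = {ttft:.2f}s, average TBT = {avg_tbt*1000:.0f}ms")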

4. Network and System Requirements for Service-Oriented Inference Frameworks

To achieve large-scale LLM service deployment, frameworks MUST address the following challenges in both control plane and data plane.

4.1. Efficient Load Balancing

Both PD Fusion and PD Disaggregation architectures require dynamic load balancing to prevent server overload. For PD Disaggregation, schedulers MUST consider compute constraints (Prefill) and memory constraints (Decoding) when distributing requests.
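
One possible, non-normative way a scheduler could weigh these two constraints is sketched below; the instance fields and the scoring rule are assumptions made only for illustration, not a prescribed algorithm.

   # Illustrative scheduling sketch: pick the Prefill instance with the
   # most spare compute and the Decoding instance with the most free
   # KV Cache memory. Fields and policy are assumptions.

   prefill_instances = [
       {"id": "P1", "compute_util": 0.85},
       {"id": "P2", "compute_util": 0.40},
   ]
   decode_instances = [
       {"id": "D1", "free_kv_blocks": 120},
       {"id": "D2", "free_kv_blocks": 800},
   ]

   def schedule(prefills, decodes):
       p = min(prefills, key=lambda x: x["compute_util"])    # compute-bound
       d = max(decodes, key=lambda x: x["free_kv_blocks"])   # memory-bound
       return p["id"], d["id"]

   print(schedule(prefill_instances, decode_instances))   # ('P2', 'D2')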

4.2. KV Cache Management

Effective KV Cache management is critical. Most frameworks adopt vLLM's paged KV Cache mechanism, and schedulers are REQUIRED to handle memory allocation, cross-request KV Cache sharing, and KV Cache replacement policies. Future optimizations must address continued rapid user growth and ensure efficient cache synchronization across clusters or nodes.
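
As a non-normative sketch, prefix sharing and replacement could be combined as follows; the prefix key, the cache capacity, and the LRU eviction policy are assumptions made only to illustrate the kind of bookkeeping a scheduler is expected to perform.

   # Illustrative prefix-sharing cache with LRU replacement. The prefix
   # key and the eviction policy are assumptions; for simplicity the
   # entire prompt is treated as the shared prefix.
   from collections import OrderedDict

   class PrefixKVCache:
       def __init__(self, capacity=4):
           self.capacity = capacity
           self.entries = OrderedDict()   # prefix key -> cached KV blocks

       def lookup_or_insert(self, prompt_tokens):
           key = tuple(prompt_tokens)            # shared-prefix key
           if key in self.entries:
               self.entries.move_to_end(key)     # mark as recently used
               return self.entries[key], True    # reuse, no recomputation
           if len(self.entries) >= self.capacity:
               self.entries.popitem(last=False)  # evict least recently used
           blocks = [("kv", t) for t in prompt_tokens]   # "compute" KV Cache
           self.entries[key] = blocks
           return blocks, False

   cache = PrefixKVCache()
   _, hit1 = cache.lookup_or_insert(["You", "are", "a", "helpful", "assistant"])
   _, hit2 = cache.lookup_or_insert(["You", "are", "a", "helpful", "assistant"])
   print(hit1, hit2)   # False True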

4.3. KV Cache Transmission

PD Disaggregation architectures demand high-speed, reliable transmission of KV Cache data between Prefill and Decoding instances. The network MUST provide low-latency, high-bandwidth channels to ensure seamless coordination.
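
A minimal, non-normative sketch of chunked KV Cache transfer over a stream socket is shown below. Real deployments would typically rely on RDMA or other high-bandwidth transports; the serialization format, chunk size, and framing here are assumptions chosen only to keep the example self-contained.

   # Illustrative chunked transfer of a serialized KV Cache over a local
   # socket pair; real systems would use RDMA or similar transports.
   import pickle
   import socket

   CHUNK_SIZE = 64 * 1024   # 64 KiB chunks (assumed)

   sender, receiver = socket.socketpair()

   kv_cache = {"layer0": [0.1] * 2000, "layer1": [0.2] * 2000}  # toy payload
   payload = pickle.dumps(kv_cache)

   # Sender: length-prefixed framing, then the payload in fixed-size chunks.
   sender.sendall(len(payload).to_bytes(8, "big"))
   for i in range(0, len(payload), CHUNK_SIZE):
       sender.sendall(payload[i:i + CHUNK_SIZE])

   # Receiver: read the length, then drain exactly that many bytes.
   length = int.from_bytes(receiver.recv(8), "big")
   buf = b""
   while len(buf) < length:
       buf += receiver.recv(CHUNK_SIZE)
   print(len(pickle.loads(buf)["layer0"]))   # 2000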

5. Security Considerations

TBD.

6. IANA Considerations

TBD.

7. References

7.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.

7.2. Informative References

[vLLM]
Kwon, W., et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention", Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP '23), 2023.

Authors' Addresses

Chang Liu
China Mobile
Beijing
100053
China

Chuyi Guo
China Mobile
Beijing
100053
China