Internet-Draft Computing-Aware Traffic Steering (CATS) October 2025
Fu, et al. Expires 13 April 2026 [Page]
Workgroup:
CATS
Internet-Draft:
draft-fu-cats-oam-fw-04
Published:
Intended Status:
Standards Track
Expires:
Authors:
H. Fu
ZTE Corporation
B. Liu
China Mobile
Z. Li
China Mobile
Q. Xiong
ZTE Corporation

Computing-Aware Traffic Steering (CATS) Operations, Administration, and Maintenance (OAM) Framework

Abstract

This document describes the OAM framework and requirements for Computing-Aware Traffic Steering (CATS). The framework defines the CATS OAM layering model and OAM components. It also describes the requirements to enable fault and performance management of end-to-end connections from clients, through the network, to service instances.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 13 April 2026.

1. Introduction

As described in [I-D.ietf-cats-usecases-requirements], edge computing provides lower response times and higher transmission rates than cloud computing by moving computing instances to the network edge. To meet the requirements of highly distributed users, service providers deploy service instances of the same type at multiple edge sites, which requires steering traffic from clients to the most appropriate computing instance.

Computing-Aware Traffic Steering (CATS) [I-D.ietf-cats-framework] is a traffic engineering approach as per [I-D.ietf-teas-rfc3272bis] developed to address the aforementioned traffic steering problem. This approach takes into account the dynamic nature of both the computing resources and the network states to optimize the way that traffic is forwarded towards a given service instance. Various metrics can be taken into account to devise and enforce such service-specific and computing-aware traffic steering policies.

To achieve better service assurance, it is necessary not only to rapidly detect whether the QoS provided by the computing networks meets the SLA requirements of clients, but also to dynamically trigger the calculation and the adjustment of both the computing and the networking services. OAM technologies have been developed for networks, but they are deployed only to facilitate the operations and maintenance of network operators, and they cannot provide measurements of an end-to-end connection from a client to a service instance.

To this end, based on the CATS framework as per [I-D.ietf-cats-framework], this document describes the OAM framework and requirements for Computing-Aware Traffic Steering (CATS). The framework defines the CATS OAM layering model and OAM components. It also describes the requirements to enable fault and performance management of end-to-end connections from clients, through the network, to service instances. Deployment considerations are also described.

2. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

3. Terminology

This document makes use of the terms defined in [I-D.ietf-cats-framework].

4. Motivation

The main objectives of OAM are to detect anomalies before they intensify, reduce the number of traffic flows impacted by these anomalies, and ensure that network operators fulfill their QoS commitments to meet the Service Level Agreement (SLA) of clients.

As a traffic engineering method, computing-aware traffic steering (CATS) takes into account the dynamic nature of both the computing resources and the network states to optimize the way that traffic is forwarded toward a given service instance. However, existing OAM technologies cannot be used to collect metrics associated with the computing resources. Therefore, it is necessary to extend the existing OAM technologies to build an end-to-end OAM for CATS. Key objectives include:

5. CATS OAM Framework

5.1. CATS OAM Layering Model

The CATS OAM layering model is shown in Figure 1. In this architecture, both the CATS router and the underlay node are deployed with existing OAM technologies. These OAM technologies are used to detect anomalies and monitor service performance in the network domain, and can be divided into three categories: link OAM, tunnel OAM, and service OAM.

  +------+ +--+--------+    +---+----+   +--------+--+ +--------+
  |client+-+  CATS-    +----+underlay+---+  CATS-    +-+service |
  |      | |Forwarder 1|    |  node  |   |Forwarder 2| |instance|
  +------+ +-----------+    +--------+   +-----------+ +----+---+

           o------------- Service OAM -----------o---------------o

           o------------- Tunnel OAM -----------o

                      o----o      o----o       o----o    Link OAM

                Figure 1: CATS OAM Layering Model

  • In link OAM, anomaly detection and performance monitoring are conducted for a single Ethernet link. Link OAM is an optional sublayer of the data link layer, located between the Logical Link Control (LLC) and the MAC sublayers of the Open Systems Interconnection (OSI) model. Common detection tools of link OAM include IEEE 802.3ah.

  • A tunnel carries multiple services, so the tunnel OAM must ensure that the performance of a given service is not degraded when the network fails or the number of services in the tunnel increases. As a result, failure detection and performance monitoring are conducted on the LSP layer to implement service protection. Common detection tools of tunnel OAM include ITU-T Y.1711, MPLS-LM-DM, BFD, etc.

  • Service OAM is generally conducted for the L2VPN/L3VPN service layer that is provided by the network to evaluate the service quality and protect services. Common detection tools of service OAM include ITU-T Y.1731, TWAMP, STAMP, etc.

CATS simultaneously steers traffic along network paths and toward compute instances. Within the network domain the three conventional OAM mechanisms remain applicable, yet link-level OAM can at best cover the direct link between compute instances; no effective OAM exists from the ingress/egress gateways to the compute instances themselves. Moreover, the introduction of flow-affinity policies mandates that end-to-end quality assessment of service flows span both network and compute domains.

5.2. CATS OAM Components

The CATS OAM layering model should flexibly support existing OAM detection tools. It consists of the following three components: SI-OAM, TC-OAM, and AF-OAM, as shown in Figure 2.

  +------+ +--+--------+    +---+----+   +--------+--+ +--------+
  |client+-+  CATS-    +----+underlay+---+  CATS-    +-+service |
  |      | |Forwarder 1|    |  node  |   |Forwarder 2| |instance|
  +------+ +-----------+    +--------+   +-----------+ +----+---+
      ^       ^                                   ^         |
      |       |                                   |         |
      |       |                               +---+----+    |
      |       |                               | SI_OAM |<-->|
      |    +--+-----+                         +--------+    |
      |    | TC_OAM |<------------------------------------->|
      |    +--+-----+                                       |
      |       |                                             |
      |    +--+-----+                                       |
      +----+ AF_OAM |<------------------------------------->|
           +--+-----+


              Figure 2: CATS OAM Functional Components

5.2.1. SI-OAM Component

The functions of this component include, but are not limited to, detecting failures that happen between CATS-Forwarder 2 and the service instance, and measuring the associated metrics such as latency, packet loss, and bandwidth. The SI-OAM component generally does not look into the internal structure of the network between CATS-Forwarder 2 and the service instance; it only measures the end-to-end connection. These measurements are generally fed back to the C-SMA component to achieve faster failure detection and performance monitoring than the CATS control plane.
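
As a purely illustrative sketch of the measurement step described above, the following Python fragment summarizes one round of probes sent from CATS-Forwarder 2 to a service instance. The names SegmentReport and summarize_probes, and the convention that a lost probe is recorded as None, are assumptions made for this example and are not defined by this document or by any OAM protocol.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass(frozen=True)
class SegmentReport:
    """Hypothetical summary that SI-OAM could feed back to the C-SMA."""
    reachable: bool                   # False when no probe was answered
    loss_ratio: float                 # lost probes / sent probes
    mean_latency_ms: Optional[float]  # None when every probe was lost


def summarize_probes(latencies_ms: Sequence[Optional[float]]) -> SegmentReport:
    """Summarize one probing round; None marks a probe with no reply."""
    if not latencies_ms:
        raise ValueError("at least one probe is required")
    answered = [v for v in latencies_ms if v is not None]
    loss_ratio = 1.0 - len(answered) / len(latencies_ms)
    latency = mean(answered) if answered else None
    return SegmentReport(bool(answered), loss_ratio, latency)
```

The summary treats the segment as a black box, in line with the text above: only end-to-end reachability, loss, and latency are reported, not the state of intermediate hops.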

5.2.2. TC-OAM Component

The functions of this component include, but are not limited to, detecting failures that happen between CATS-Forwarder 1 and the service instance with a specific ID, and measuring the associated metrics such as delay and packet loss. The test packets are delivered through the CATS Path Selector (C-PS) to the associated service instance according to the corresponding forwarding table entry of the CATS Traffic Classifier (C-TC), to verify whether the measurements of the connection meet the Service Level Agreement (SLA) requirements. If they do not, recalculation is triggered.
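
The SLA check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the Sla and Measurement types, their field names, and the simple "greater than threshold" comparison are hypothetical and are not specified by this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sla:
    """Hypothetical SLA thresholds for one service."""
    max_delay_ms: float
    max_loss_ratio: float


@dataclass(frozen=True)
class Measurement:
    """Hypothetical TC-OAM result for one path to a service instance."""
    delay_ms: float
    loss_ratio: float


def sla_violations(m: Measurement, sla: Sla) -> List[str]:
    """Return the names of the metrics that exceed their SLA threshold."""
    violations = []
    if m.delay_ms > sla.max_delay_ms:
        violations.append("delay")
    if m.loss_ratio > sla.max_loss_ratio:
        violations.append("loss")
    return violations


def needs_recalculation(m: Measurement, sla: Sla) -> bool:
    """Recalculation is requested as soon as any metric violates the SLA."""
    return bool(sla_violations(m, sla))
```

Returning the list of violated metrics, rather than only a boolean, leaves room for a path selector to weigh delay and loss differently; that choice is an assumption of this sketch.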

5.2.3. AF-OAM Component

The functions of this component include, but are not limited to, measuring metrics such as delay, packet loss, and bandwidth of the service flows in CATS. In general, the user experience of an active connection may be affected by a number of factors: for example, the processing latency of a service instance may increase, or the network performance may degrade as the traffic arriving at the service instance grows. CATS-Forwarder 1 needs to evaluate whether the SLA requirements of service flows are met and, if they are not, to adjust paths to compensate for the deviation as far as possible so that clients have a consistent experience. For client terminals, if the experience is degraded, it is necessary to accurately locate where the problem occurs and to troubleshoot quickly. Related OAM tools can also be developed so that the entire network stack (L2-L7) can be observed for applications, instead of only traditional application-level or network-level visibility, providing a comprehensive solution that improves operators' efficiency.
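
One possible way to locate where a degradation occurs is to compare per-segment delay against a baseline. The sketch below is only an illustration: the segment names, the availability of per-segment delay (for example from hop-by-hop telemetry), and the locate_degradation helper are assumptions of this example, not mechanisms defined by this document.

```python
from typing import Dict, Optional


def locate_degradation(
    measured_ms: Dict[str, float],
    baseline_ms: Dict[str, float],
    tolerance_ms: float = 0.0,
) -> Optional[str]:
    """Return the segment whose delay exceeds its baseline the most.

    Segments without a baseline are ignored. None is returned when no
    segment exceeds its baseline by more than tolerance_ms.
    """
    worst_segment = None
    worst_excess = tolerance_ms
    for segment, delay in measured_ms.items():
        if segment not in baseline_ms:
            continue
        excess = delay - baseline_ms[segment]
        if excess > worst_excess:
            worst_segment, worst_excess = segment, excess
    return worst_segment
```

For instance, with hypothetical segments "client-ingress", "underlay", and "egress-instance", a jump in the "underlay" delay would point troubleshooting at the network domain rather than at the service instance.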

6. CATS OAM Requirements

6.1. Operation

  • Sub-second/second-granularity telemetry SHALL be collected for CPU, GPU, memory, accelerator utilization and energy consumption to produce unified compute metrics (e.g., TOPS/W, TFLOPS).

  • These metrics SHALL be fused with network telemetry to generate an integrated “compute-network” telemetry stream encompassing packet loss, latency, throughput and compute load, providing real-time decision inputs to the C-PS.
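
The fusion of compute and network telemetry described above could look like the following sketch. All type and field names are hypothetical, and reducing compute load to the maximum of CPU and memory utilization is an assumption chosen for brevity; this document does not define how compute load is derived.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ComputeSample:
    cpu_util: float   # 0.0 to 1.0
    mem_util: float   # 0.0 to 1.0


@dataclass(frozen=True)
class NetworkSample:
    latency_ms: float
    loss_ratio: float
    throughput_mbps: float


@dataclass(frozen=True)
class FusedRecord:
    """One entry of a hypothetical integrated compute-network stream."""
    instance_id: str
    compute_load: float
    latency_ms: float
    loss_ratio: float
    throughput_mbps: float


def fuse(
    compute: Dict[str, ComputeSample],
    network: Dict[str, NetworkSample],
) -> List[FusedRecord]:
    """Join both telemetry sources on the service instance identifier.

    Instances missing from either source are skipped, so that the
    consumer never receives a partially populated record.
    """
    records = []
    for instance_id in sorted(compute.keys() & network.keys()):
        c, n = compute[instance_id], network[instance_id]
        load = max(c.cpu_util, c.mem_util)  # assumed load definition
        records.append(FusedRecord(
            instance_id, load, n.latency_ms, n.loss_ratio, n.throughput_mbps))
    return records
```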

6.2. Administration

  • Compute-resource provisioning: A node SHALL present a compute-capability template (type, capacity, affinity) at boot; OAM SHALL authenticate the template and synchronize it to the network-wide routing database.

  • Service contract and billing: OAM SHALL generate a billing model from multi-dimensional factors—compute class, usage duration, network distance—and push the model to edge controllers.

  • Unified orchestration: OAM SHALL abstract compute workloads into routable Compute-SIDs and, together with network SIDs, inject them into the SRv6/BGP SR Policy orchestration plane to enable resource scheduling across domains, clouds, and edges.
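
This document does not specify how a compute-capability template is authenticated. As one illustrative possibility only, the sketch below assumes a pre-shared key and an HMAC-SHA256 tag computed over a canonical JSON encoding of the template; the helper names and the choice of JSON are assumptions of this example.

```python
import hashlib
import hmac
import json

REQUIRED_FIELDS = {"type", "capacity", "affinity"}  # fields named in the text


def canonical_bytes(template: dict) -> bytes:
    """Serialize the template deterministically (sorted keys, no spaces)."""
    return json.dumps(template, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_template(template: dict, key: bytes) -> str:
    """Compute a hex HMAC-SHA256 tag over the canonical template."""
    return hmac.new(key, canonical_bytes(template), hashlib.sha256).hexdigest()


def verify_template(template: dict, tag: str, key: bytes) -> bool:
    """Accept the template only if it is complete and the tag matches."""
    if not REQUIRED_FIELDS.issubset(template):
        return False
    expected = sign_template(template, key)
    return hmac.compare_digest(expected, tag)  # constant-time comparison
```

Only a template that passes verification would then be synchronized to the routing database; a deployment could equally rely on certificates or another mechanism.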

6.3. Maintenance

  • End-to-end quality assessment:

    1) Network segment: Employ BFD, TWAMP and IOAM to detect link/node faults; convergence latency SHALL be ≤ 50 ms.

    2) Compute segment: Utilize keep-alive plus health probes to monitor container/VM/accelerator liveness; crashes or overload SHALL be detected within seconds.

  • Fault correlation and localization: OAM SHALL correlate “compute unavailable” events with “network-path degradation” events to determine whether the root cause is resource exhaustion or packet loss, eliminating needless path shifts.

  • Intelligent self-healing:

    1) Compute-node failure SHALL trigger the CATS Path Selector to re-select a path and move traffic in real time to an alternate node in the same or a remote pool.

    2) Network-link failure SHALL invoke TI-LFA/SR-TE protection switching in less than 50 ms while simultaneously updating the compute topology to prevent black-holing.
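
The fault correlation requirement above can be illustrated with a deliberately simple time-window heuristic. The rule used here (a compute-unavailable event that coincides with a path-degradation event is attributed to packet loss, otherwise to resource exhaustion), the one-second default window, and all names are assumptions of this sketch rather than behavior defined by this document.

```python
from enum import Enum
from typing import Sequence


class RootCause(Enum):
    PACKET_LOSS = "packet-loss"
    RESOURCE_EXHAUSTION = "resource-exhaustion"


def correlate(
    compute_event_ts: float,
    degradation_ts: Sequence[float],
    window_s: float = 1.0,
) -> RootCause:
    """Classify a compute-unavailable event by nearby path degradations.

    Timestamps are in seconds. If any path-degradation event lies within
    window_s of the compute event, the network is assumed to be at fault.
    """
    for ts in degradation_ts:
        if abs(ts - compute_event_ts) <= window_s:
            return RootCause.PACKET_LOSS
    return RootCause.RESOURCE_EXHAUSTION


def should_reselect_instance(cause: RootCause) -> bool:
    """Move traffic to another instance only for compute-side faults."""
    return cause is RootCause.RESOURCE_EXHAUSTION
```

Under this heuristic, a network-side cause would be left to network protection switching instead of moving traffic to another service instance, which is one way to avoid needless path shifts.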

7. Deployment Considerations

To demonstrate the complete CATS OAM procedure, a proper OAM detection tool needs to be selected and deployed on the network and service instance hosts of the CATS OAM architecture. The selection of OAM detection tools is out of the scope of this document.

                             +-------------------------+
                  +--------------+ Intelligent controller  +-------------+
                  |              +-------------------+-----+             |
                  |                                   |                  |
                  v                                   v                  v
            +-----------+                        +-----------+       +--------+
            |  CATS-    |                        |  CATS-    |       |  Edge  |
            |Forwarder 1|                        |Forwarder 2|       |  Site  |
            |           |                        |           |Service|        |
+--------+  |+---------+|                        |+---------+|Metrics|S-ID 1  |
| client |  ||  C-PS   ||       +--------+       ||  C-SMA  |<-------|SI-ID 1 |
|        |  |+---------+|Network|        |Network|+---------+|       |        |
|+------+|  |  ^    ^   |Metrics|Underlay|Metrics|       ^   |       |S-ID 1  |
||AF-OAM|+--+  |    |   |<------+ domain |<------|       |   |-------|SI-ID 2 |
|+--+---+|  |  |    |   |       +--------+       |   +---+--+| OWAMP |        |
|   |    |  |  |    |   |                        |   |SI-OAM|<------>|S-ID 2  |
+---+----+  |  |+---+--+|           OWAMP        |   +------+|       |SI-ID 1 |
    |       |  ||TC-OAM|+------------------------+-----------+------>|        |
    |       |  |+------+|                        |           |       |S-ID 2  |
    |       | ++-------+|           IOAM         |           |       |SI-ID 2 |
    |       | | AF-OAM |+------------------------+-----------+------>|        |
    |       | +--------+|           IOAM         |           |       |        |
    +-------+-----------+------------------------+-----------+------>|        |
            +-----------+                        +-----------+       +--------+

             Figure 3: An Example Of CATS OAM Deployment

As illustrated in Figure 3, the OWAMP and IOAM tools are selected as examples to describe how the CATS OAM components work with these detection tools to fulfill the four objectives:

For different detection targets, flexible choices of detection protocols and mechanisms can be made, which will not be elaborated upon here.

8. Security Considerations

To be discussed in future versions of this document.

9. Acknowledgements

To be added upon contributions, comments and suggestions.

10. IANA Considerations

TBD.

11. References

11.1. Normative References

[I-D.ldbc-cats-framework]
Li, C., Du, Z., Boucadair, M., Contreras, L. M., and J. Drake, "A Framework for Computing-Aware Traffic Steering (CATS)", Work in Progress, Internet-Draft, draft-ldbc-cats-framework-06, <https://datatracker.ietf.org/doc/html/draft-ldbc-cats-framework-06>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC4656]
Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. Zekauskas, "A One-way Active Measurement Protocol (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, <https://www.rfc-editor.org/rfc/rfc4656>.
[RFC7276]
Mizrahi, T., Sprecher, N., Bellagamba, E., and Y. Weingarten, "An Overview of Operations, Administration, and Maintenance (OAM) Tools", RFC 7276, DOI 10.17487/RFC7276, <https://www.rfc-editor.org/rfc/rfc7276>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/rfc/rfc8174>.
[RFC8402]
Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L., Decraene, B., Litkowski, S., and R. Shakir, "Segment Routing Architecture", RFC 8402, DOI 10.17487/RFC8402, <https://www.rfc-editor.org/rfc/rfc8402>.
[RFC8754]
Filsfils, C., Ed., Dukes, D., Ed., Previdi, S., Leddy, J., Matsushima, S., and D. Voyer, "IPv6 Segment Routing Header (SRH)", RFC 8754, DOI 10.17487/RFC8754, <https://www.rfc-editor.org/rfc/rfc8754>.
[RFC9378]
Brockners, F., Ed., Bhandari, S., Ed., Bernier, D., and T. Mizrahi, Ed., "In Situ Operations, Administration, and Maintenance (IOAM) Deployment", RFC 9378, DOI 10.17487/RFC9378, <https://www.rfc-editor.org/rfc/rfc9378>.

11.2. Informative References

[I-D.ietf-cats-usecases-requirements]
Yao, K., Contreras, L. M., Shi, H., Zhang, S., and Q. An, "Computing-Aware Traffic Steering (CATS) Problem Statement, Use Cases, and Requirements", Work in Progress, Internet-Draft, draft-ietf-cats-usecases-requirements-07, <https://datatracker.ietf.org/doc/html/draft-ietf-cats-usecases-requirements-07>.
[I-D.ietf-teas-rfc3272bis]
Farrel, A., "Overview and Principles of Internet Traffic Engineering", Work in Progress, Internet-Draft, draft-ietf-teas-rfc3272bis-27, <https://datatracker.ietf.org/doc/html/draft-ietf-teas-rfc3272bis-27>.

Contributors

Daniel Huang
ZTE Corporation
Cheng Huang
ZTE Corporation
Wei Duan
ZTE Corporation

Authors' Addresses

Huakai Fu
ZTE Corporation
Bo Liu
China Mobile
Zhenqiang Li
China Mobile
Quan Xiong
ZTE Corporation