| Internet-Draft | HPC/AI scheduler job metadata | June 2026 |
| Xiong, et al. | Expires 31 December 2026 | [Page] |
This document defines a scheduler-facing metadata model for High Performance Computing (HPC) and AI workloads. The model captures common job, workload, scheduler, tenant, timing, and task metadata that can be mapped from heterogeneous workload managers and orchestration platforms and used as context for network service intent.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 31 December 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
HPC and AI workflows are commonly managed by workload managers and orchestration systems such as batch schedulers, Kubernetes-based training systems, workflow engines, and higher-level AI platforms. These systems maintain metadata about jobs, tasks, users, tenants, timing, resource requests, and workload structure.¶
Examples of such systems include HPC workload managers such as Slurm, PBS Pro/OpenPBS, IBM Spectrum LSF, and Grid Engine-style schedulers, as well as AI and machine learning orchestration platforms based on Kubernetes, Kubeflow, Ray, Volcano, Kueue, Red Hat OpenShift AI, NVIDIA Base Command Manager, and NVIDIA Run:ai. These examples are illustrative; the model is intended to be independent of any specific scheduler or orchestration platform.¶
The requirements reflected in this model are derived from the types of information commonly exposed by such workload schedulers and AI orchestration platforms, including workload identity, job structure, task or role information, timing, placement context, tenant or project context, and correlation identifiers. The intent is to carry the network-relevant subset of this information without requiring the network domain to adopt the native data model of any one scheduler.¶
The representation of this metadata is platform-specific. For example, an HPC scheduler may identify jobs using scheduler-local job identifiers and queues, while a Kubernetes-based AI platform may use namespaces, custom resources, pod sets, and workload admission objects. A common metadata model allows the network-relevant portions of these platform-specific job descriptions to be represented in a consistent form.¶
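As a minimal sketch of this idea, the following Python fragment maps two platform-specific job descriptions onto the container and leaf names defined by the model in this document. The input shapes are assumptions made for illustration: the Slurm-style record keys (job_id, name, partition, account, user_name) and the reduced Kubernetes-style object (metadata.name, metadata.namespace, metadata.uid) are not defined by this document, and the function names are hypothetical.

```python
def from_slurm(record):
    """Map a Slurm-style job record (assumed keys) to the common model."""
    return {
        "scheduler": {"scheduler-type": "slurm"},
        "submitter": {
            "user-id": record["user_name"],
            "account-id": record["account"],
        },
        "workload": {"queue": record["partition"]},
        "job": {
            "job-id": str(record["job_id"]),
            "job-name": record["name"],
        },
    }


def from_kubernetes(obj):
    """Map a Kubernetes-style object (assumed shape) to the common model."""
    meta = obj["metadata"]
    return {
        "scheduler": {"scheduler-type": "kubernetes"},
        "submitter": {"namespace": meta["namespace"]},
        "workload": {"workload-name": meta["name"]},
        "job": {"job-id": meta["uid"], "job-name": meta["name"]},
    }


# Illustrative inputs only; the values are made up.
slurm_job = from_slurm({
    "job_id": 4242,
    "name": "cfd-run",
    "partition": "gpu",
    "account": "proj-a",
    "user_name": "alice",
})
k8s_job = from_kubernetes({
    "metadata": {
        "name": "llm-train",
        "namespace": "ml-training",
        "uid": "0b9c-example",
    }
})

# Both results expose the same top-level containers to the network side.
print(sorted(slurm_job) == sorted(k8s_job))
```

The point of the sketch is only that a consumer sees the same containers regardless of the originating platform; which leaves are populated still depends on what each platform exposes.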
The broader HP-WAN context and current deployment considerations are described in [I-D.kcrh-hpwan-state-of-art] and [I-D.xhy-hpwan-framework]. This document focuses on the scheduler and job metadata needed to relate workload context to that network environment.¶
Related work on machine learning cluster scheduling, including [I-D.kompella-rtgwg-mlnwsched], illustrates that job timing, placement, and resource context can be relevant beyond the compute scheduler itself. This document provides a platform-neutral way to carry scheduler and job metadata that can be used for correlation with network service intent.¶
This document defines a YANG model for scheduler and job metadata. It does not define the requested network service itself and does not define how that service is realized in the network. The metadata defined here is intended to be used by a service intent model that expresses the desired connectivity outcome for the workload.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This document defines common terminology used by the HPC/AI scheduler job metadata model, the HPC/AI service intent model, and the HPC/AI tunnel realization model.¶
The scheduler job metadata model provides workload context that can be consumed by a network service intent system. It includes identifiers and descriptive attributes that allow a network controller, orchestrator, or broker to correlate a network service request with the originating workload manager and job.¶
The model is intended to be independent of a specific workload manager. Platform-specific identifiers are carried as metadata and do not imply that the network controller understands the internal scheduling behavior of the originating platform.¶
This model is intended to provide a stable boundary between workload scheduling systems and IETF-defined interfaces used by data center and inter-data-center network orchestration systems.¶
module: ietf-hpc-scheduler-job-metadata
  +--rw hpc-scheduler-job-metadata
     +--rw scheduler
     |  +--rw scheduler-id?        string
     |  +--rw scheduler-name?      string
     |  +--rw scheduler-type?      identityref
     |  +--rw platform-instance?   string
     +--rw submitter
     |  +--rw tenant-id?    string
     |  +--rw project-id?   string
     |  +--rw namespace?    string
     |  +--rw user-id?      string
     |  +--rw account-id?   string
     +--rw workload
     |  +--rw workload-id?      string
     |  +--rw workload-name?    string
     |  +--rw workload-type?    identityref
     |  +--rw framework?        identityref
     |  +--rw priority?         priority-type
     |  +--rw queue?            string
     |  +--rw correlation-id?   string
     +--rw job
     |  +--rw job-id?         string
     |  +--rw job-name?       string
     |  +--rw job-array-id?   string
     |  +--rw job-size?       uint32
     |  +--rw task* [task-id]
     |     +--rw task-id       string
     |     +--rw task-name?    string
     |     +--rw task-role?    identityref
     |     +--rw task-index?   uint32
     +--rw timing
        +--rw submit-time?            yang:date-and-time
        +--rw earliest-start-time?    yang:date-and-time
        +--rw requested-start-time?   yang:date-and-time
        +--rw deadline?               yang:date-and-time
        +--rw requested-duration?     uint32
        +--rw duration-unit?          identityref
Figure 2: Scheduler job metadata model structure
The naming relationship between scheduler job metadata, service intent, and tunnel realization is hierarchical:
* Scheduler job metadata, defined in this document, identifies and describes the workload.
* A service intent, as defined in draft-xkk-teas-hpc-service-intent, identifies the network service requested for that workload.
* A tunnel realization, as defined in draft-xkk-teas-hpc-tunnel-realization, identifies the network resources used to realize an admitted service intent.
   .----------------------------.
   |   Scheduler/Job Metadata   |
   |   workload-id, job-id,     |
   |   task-id, correlation-id  |
   '-------------+--------------'
                 |
                 | referenced by
                 v
   .-------------+--------------.
   |       Service Intent       |
   |  intent-id, workload-ref,  |
   |   endpoints, objectives    |
   '-------------+--------------'
                 |
                 | admitted and realized by
                 v
   .-------------+--------------.
   |     Tunnel Realization     |
   | realization-id, intent-ref,|
   |   tunnel/path references   |
   '----------------------------'
Figure 1: Relationship between scheduler job metadata, service intent, and tunnel realization
A workload or job can have zero or more service intent instances. A service intent instance can have zero or more tunnel realization instances. A tunnel realization instance is associated with one service intent instance, although the underlying network service may use one or more tunnels, paths, or technology-specific constructs.¶
The scheduler job metadata model provides context for a separate service intent request. A service intent instance can refer to the metadata instance using a workload identifier, job identifier, or correlation identifier. This separation allows multiple service intent requests to be associated with a single workload, and allows one service intent request to be updated or replaced without changing the scheduler metadata.¶
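The lookup described above can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the model: the in-memory list, the function name find_metadata, and the intent fields intent-id, ref-kind, and ref-value are hypothetical, since the service intent structure is defined outside this document. Only the metadata leaf names come from the model.

```python
# Where each kind of reference lives inside a metadata instance.
REF_PATHS = {
    "workload-id": ("workload", "workload-id"),
    "job-id": ("job", "job-id"),
    "correlation-id": ("workload", "correlation-id"),
}


def find_metadata(instances, ref_kind, ref_value):
    """Return the first metadata instance matching the reference, else None."""
    container, leaf = REF_PATHS[ref_kind]
    for instance in instances:
        if instance.get(container, {}).get(leaf) == ref_value:
            return instance
    return None


metadata_store = [
    {
        "workload": {
            "workload-id": "distributed-training-001",
            "correlation-id": "corr-ai-training-001",
        },
        "job": {"job-id": "job-2026-04-23-001"},
    }
]

# Two hypothetical service intents that refer to the same workload,
# one by workload identifier and one by correlation identifier.
intents = [
    {"intent-id": "intent-a", "ref-kind": "workload-id",
     "ref-value": "distributed-training-001"},
    {"intent-id": "intent-b", "ref-kind": "correlation-id",
     "ref-value": "corr-ai-training-001"},
]

resolved = [
    find_metadata(metadata_store, i["ref-kind"], i["ref-value"])
    for i in intents
]
print(all(r is metadata_store[0] for r in resolved))
```

Because the intents only hold references, either intent could be replaced or withdrawn without touching the stored metadata instance.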
The YANG data model is as follows:¶
module ietf-hpc-scheduler-job-metadata {
  yang-version 1.1;
  namespace
    "urn:ietf:params:xml:ns:yang:ietf-hpc-scheduler-job-metadata";
  prefix hpc-sched;

  import ietf-yang-types {
    prefix yang;
    reference
      "RFC 6991: Common YANG Data Types";
  }

  organization
    "IETF Traffic Engineering Architecture and Signaling (TEAS)
     Working Group";
  contact
    "WG Web:   <https://datatracker.ietf.org/wg/teas/>
     WG List:  <mailto:teas@ietf.org>

     Editor:   Quan Xiong
               <mailto:xiong.quan@zte.com.cn>

     Editor:   Kireeti Kompella
               <mailto:kireeti.ietf@gmail.com>

     Editor:   Daniel King
               <mailto:d.king@lancaster.ac.uk>";
  description
    "This module defines a scheduler-facing metadata model for
     High Performance Computing (HPC) and AI workloads. The model
     captures common job, workload, scheduler, tenant, timing, and
     task metadata that can be mapped from heterogeneous workload
     managers and orchestration platforms.

     Copyright (c) 2026 IETF Trust and the persons identified as
     authors of the code. All rights reserved.

     Redistribution and use in source and binary forms, with or
     without modification, is permitted pursuant to, and subject
     to the license terms contained in, the Revised BSD License
     set forth in Section 4.c of the IETF Trust's Legal Provisions
     Relating to IETF Documents
     (https://trustee.ietf.org/license-info).

     This version of this YANG module is part of RFC XXXX; see
     the RFC itself for full legal notices.";

  revision 2026-04-23 {
    description
      "Initial version of the HPC/AI scheduler job metadata model.";
    reference
      "RFC XXXX: HPC/AI Scheduler Job Metadata Model";
  }

  /*
   * Identity definitions
   */

  identity scheduler-type {
    description
      "Base identity for scheduler types.";
  }

  identity slurm {
    base scheduler-type;
    description
      "Slurm workload manager.";
  }

  identity pbs {
    base scheduler-type;
    description
      "PBS Pro/OpenPBS workload manager.";
  }

  identity lsf {
    base scheduler-type;
    description
      "IBM Spectrum LSF workload manager.";
  }

  identity kubernetes {
    base scheduler-type;
    description
      "Kubernetes-based orchestration platform.";
  }

  identity kubeflow {
    base scheduler-type;
    description
      "Kubeflow AI orchestration platform.";
  }

  identity workload-type {
    description
      "Base identity for workload types.";
  }

  identity hpc-batch {
    base workload-type;
    description
      "HPC batch workload.";
  }

  identity ai-training {
    base workload-type;
    description
      "AI training workload.";
  }

  identity ai-inference {
    base workload-type;
    description
      "AI inference workload.";
  }

  identity data-movement {
    base workload-type;
    description
      "Data movement workload.";
  }

  identity framework {
    description
      "Base identity for workload frameworks.";
  }

  identity mpi {
    base framework;
    description
      "Message Passing Interface (MPI) framework.";
  }

  identity tensorflow {
    base framework;
    description
      "TensorFlow machine learning framework.";
  }

  identity pytorch {
    base framework;
    description
      "PyTorch machine learning framework.";
  }

  identity task-role {
    description
      "Base identity for task roles.";
  }

  identity worker {
    base task-role;
    description
      "Worker role in distributed computation.";
  }

  identity parameter-server {
    base task-role;
    description
      "Parameter server role in distributed training.";
  }

  identity master {
    base task-role;
    description
      "Master/coordinator role.";
  }

  identity duration-unit {
    description
      "Base identity for duration units.";
  }

  identity seconds {
    base duration-unit;
    description
      "Duration in seconds.";
  }

  identity minutes {
    base duration-unit;
    description
      "Duration in minutes.";
  }

  identity hours {
    base duration-unit;
    description
      "Duration in hours.";
  }

  /*
   * Typedefs
   */

  typedef priority-type {
    type uint32 {
      range "0..1000";
    }
    description
      "Priority value type, with higher values indicating higher
       priority.";
  }

  /*
   * Groupings
   */

  grouping scheduler-grouping {
    description
      "Scheduler identification and metadata.";
    leaf scheduler-id {
      type string;
      description
        "Unique identifier for the scheduler instance.";
    }
    leaf scheduler-name {
      type string;
      description
        "Human-readable name of the scheduler.";
    }
    leaf scheduler-type {
      type identityref {
        base scheduler-type;
      }
      description
        "Type of scheduler or orchestration platform.";
    }
    leaf platform-instance {
      type string;
      description
        "Platform-specific instance identifier or version.";
    }
  }

  grouping submitter-grouping {
    description
      "Submitter and tenant context.";
    leaf tenant-id {
      type string;
      description
        "Tenant identifier for multi-tenant environments.";
    }
    leaf project-id {
      type string;
      description
        "Project identifier within the tenant.";
    }
    leaf namespace {
      type string;
      description
        "Namespace identifier (e.g., Kubernetes namespace).";
    }
    leaf user-id {
      type string;
      description
        "User identifier who submitted the workload.";
    }
    leaf account-id {
      type string;
      description
        "Accounting or billing account identifier.";
    }
  }

  grouping workload-grouping {
    description
      "Workload identification and metadata.";
    leaf workload-id {
      type string;
      description
        "Unique identifier for the workload.";
    }
    leaf workload-name {
      type string;
      description
        "Human-readable name of the workload.";
    }
    leaf workload-type {
      type identityref {
        base workload-type;
      }
      description
        "Type of workload.";
    }
    leaf framework {
      type identityref {
        base framework;
      }
      description
        "Computational framework used by the workload.";
    }
    leaf priority {
      type priority-type;
      description
        "Priority of the workload.";
    }
    leaf queue {
      type string;
      description
        "Queue or partition where the workload is submitted.";
    }
    leaf correlation-id {
      type string;
      description
        "Correlation identifier for cross-system tracing.";
    }
  }

  grouping task-grouping {
    description
      "Task-level metadata.";
    leaf task-id {
      type string;
      mandatory true;
      description
        "Unique identifier for the task within the job.";
    }
    leaf task-name {
      type string;
      description
        "Human-readable name of the task.";
    }
    leaf task-role {
      type identityref {
        base task-role;
      }
      description
        "Functional role of the task in the workload.";
    }
    leaf task-index {
      type uint32;
      description
        "Index or sequence number of the task.";
    }
  }

  grouping job-grouping {
    description
      "Job structure and task information.";
    leaf job-id {
      type string;
      description
        "Scheduler-specific job identifier.";
    }
    leaf job-name {
      type string;
      description
        "Human-readable job name.";
    }
    leaf job-array-id {
      type string;
      description
        "Job array identifier for array jobs.";
    }
    leaf job-size {
      type uint32;
      description
        "Total number of tasks or execution units in the job.";
    }
    list task {
      key "task-id";
      description
        "List of tasks comprising the job.";
      uses task-grouping;
    }
  }

  grouping timing-grouping {
    description
      "Timing and scheduling information.";
    leaf submit-time {
      type yang:date-and-time;
      description
        "Time when the workload was submitted to the scheduler.";
    }
    leaf earliest-start-time {
      type yang:date-and-time;
      description
        "Earliest time when the workload can start.";
    }
    leaf requested-start-time {
      type yang:date-and-time;
      description
        "Requested start time for the workload.";
    }
    leaf deadline {
      type yang:date-and-time;
      description
        "Deadline by which the workload should complete.";
    }
    leaf requested-duration {
      type uint32;
      description
        "Requested duration for the workload execution.";
    }
    leaf duration-unit {
      type identityref {
        base duration-unit;
      }
      description
        "Unit for the requested duration.";
    }
  }

  /*
   * Top-level container
   */

  container hpc-scheduler-job-metadata {
    description
      "Top-level container for HPC/AI scheduler job metadata.";
    container scheduler {
      description
        "Scheduler identification and metadata.";
      uses scheduler-grouping;
    }
    container submitter {
      description
        "Submitter and tenant context.";
      uses submitter-grouping;
    }
    container workload {
      description
        "Workload identification and metadata.";
      uses workload-grouping;
    }
    container job {
      description
        "Job structure and task information.";
      uses job-grouping;
    }
    container timing {
      description
        "Timing and scheduling information.";
      uses timing-grouping;
    }
  }
}
Scheduler and job metadata can reveal user, tenant, project, workload, timing, and operational information. Implementations need to protect the confidentiality and integrity of this information and restrict access to authorized workload managers, controllers, orchestrators, and network management systems.¶
IANA is requested to register one URI in the "IETF XML Registry" [RFC3688]. Following the format in [RFC3688], the following registration is requested:¶
URI: urn:ietf:params:xml:ns:yang:ietf-hpc-scheduler-job-metadata
Registrant Contact: The IESG.
XML: N/A; the requested URI is an XML namespace.
IANA is requested to register the following YANG module in the "YANG Module Names" registry [RFC6020].¶
name: ietf-hpc-scheduler-job-metadata
namespace: urn:ietf:params:xml:ns:yang:ietf-hpc-scheduler-job-metadata
prefix: hpc-sched
reference: RFC XXXX
The authors acknowledge the related HP-WAN framework and problem statement work that provides the broader context for this scheduler job metadata model.¶
This section provides an example of scheduler job metadata for a distributed AI training workload. The example demonstrates how platform-specific job information from a Kubernetes-based AI orchestration system is mapped to the common metadata model.¶
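Consider a scenario where a user submits a distributed training job using Kubeflow on a Kubernetes cluster. The job in this example consists of three worker tasks; parameter-server tasks, where present, would be listed in the same way with the parameter-server task role.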
{
  "ietf-hpc-scheduler-job-metadata:hpc-scheduler-job-metadata": {
    "scheduler": {
      "scheduler-id": "ai-orchestrator-1",
      "scheduler-name": "AI-Training-Orchestrator",
      "scheduler-type": "kubernetes",
      "platform-instance": "nvidia-base-command-2.0"
    },
    "submitter": {
      "tenant-id": "ai-research-lab",
      "project-id": "distributed-ml-project",
      "namespace": "ml-training",
      "user-id": "researcher-bob",
      "account-id": "project-alpha"
    },
    "workload": {
      "workload-id": "distributed-training-001",
      "workload-name": "large-scale-llm-training",
      "workload-type": "ai-training",
      "framework": "pytorch",
      "priority": 100,
      "queue": "gpu-high-priority",
      "correlation-id": "corr-ai-training-001"
    },
    "job": {
      "job-id": "job-2026-04-23-001",
      "job-name": "llm-13b-distributed",
      "job-size": 3,
      "task": [
        {
          "task-id": "worker-1",
          "task-name": "gpu-worker-west-1",
          "task-role": "worker",
          "task-index": 0
        },
        {
          "task-id": "worker-2",
          "task-name": "gpu-worker-west-2",
          "task-role": "worker",
          "task-index": 1
        },
        {
          "task-id": "worker-3",
          "task-name": "gpu-worker-east-1",
          "task-role": "worker",
          "task-index": 2
        }
      ]
    },
    "timing": {
      "submit-time": "2026-04-23T09:00:00Z",
      "earliest-start-time": "2026-04-23T09:45:00Z",
      "requested-start-time": "2026-04-23T10:00:00Z",
      "deadline": "2026-04-23T12:00:00Z",
      "requested-duration": 120,
      "duration-unit": "minutes"
    }
  }
}
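
A consumer of such an instance might run a few sanity checks before using it. The Python sketch below is one possible approach and is not part of the model. Two checks follow directly from the YANG module (the 0..1000 range of priority-type, and uniqueness of the task-id list key). The third check, that requested start time plus requested duration does not pass the deadline, is an assumed local policy and is not stated by this document. The function names are hypothetical, and the excerpt reuses values from the example above.

```python
from datetime import datetime, timedelta

# Seconds per unit for the duration-unit identities in the module.
UNIT_SECONDS = {"seconds": 1, "minutes": 60, "hours": 3600}


def parse_time(value):
    # Older Python versions do not accept a trailing "Z" in
    # datetime.fromisoformat(), so it is rewritten as a UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def check_metadata(meta):
    """Return a list of problems found; an empty list means none."""
    problems = []

    # priority-type restricts priority to the range 0..1000.
    priority = meta.get("workload", {}).get("priority")
    if priority is not None and not 0 <= priority <= 1000:
        problems.append("priority outside 0..1000")

    # task-id is the list key, so values must be unique within the job.
    task_ids = [t["task-id"] for t in meta.get("job", {}).get("task", [])]
    if len(task_ids) != len(set(task_ids)):
        problems.append("duplicate task-id")

    # Assumed policy (not defined by the model): the requested run
    # should end no later than the deadline.
    timing = meta.get("timing", {})
    needed = ("requested-start-time", "requested-duration",
              "duration-unit", "deadline")
    if all(key in timing for key in needed):
        # Accept both "minutes" and the namespace-qualified form.
        unit = timing["duration-unit"].split(":")[-1]
        seconds = timing["requested-duration"] * UNIT_SECONDS[unit]
        end = parse_time(timing["requested-start-time"]) + timedelta(
            seconds=seconds)
        if end > parse_time(timing["deadline"]):
            problems.append("requested run would end after the deadline")
    return problems


# Excerpt of the example instance (contents of the top-level container).
example = {
    "workload": {"priority": 100},
    "job": {"task": [{"task-id": "worker-1"}, {"task-id": "worker-2"},
                     {"task-id": "worker-3"}]},
    "timing": {
        "requested-start-time": "2026-04-23T10:00:00Z",
        "deadline": "2026-04-23T12:00:00Z",
        "requested-duration": 120,
        "duration-unit": "minutes",
    },
}
print(check_metadata(example))
```

For the example values the list of problems is empty: a 120-minute run requested for 10:00 ends exactly at the 12:00 deadline.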