Using the Model Context Protocol (MCP) for Intent-Based Network Troubleshooting Automation

Internet-Draft	MCP for Networks	November 2025
Zeng, et al.	Expires 6 May 2026

Abstract

The Model Context Protocol (MCP) is an open standard that lets Large Language Model (LLM) applications integrate with external data sources and tools by exposing Resources, Prompts, and Tools over JSON-RPC 2.0. This document describes a mapping of MCP roles, primitives, and the MCP security model to the network management domain, in which network devices act as MCP servers and network controllers act as MCP clients. It also extends the model to Device-to-Device (D2D) collaboration, which allows network elements to perform distributed fault correlation when the controller is unreachable or when real-time cross-device data is required. The goal is to provide an intent-based, conversational, and secure approach to automated network troubleshooting, configuration validation, and closed-loop remediation without inventing new protocols or device agents.

1. Introduction

Network operators today face two converging demands: (1) reduce Mean Time to Repair (MTTR) while managing ever-larger infrastructures, and (2) adopt intent-based interfaces that allow engineers to express high-level goals such as "verify reachability between Site-A and Site-B" instead of typing low-level CLI commands.

Simultaneously, Large Language Models (LLMs) have demonstrated utility in reasoning about semi-structured data such as device logs, configurations, and command outputs. However, safely exposing device-level actions and data to an LLM in real time remains an open problem. The Model Context Protocol (MCP), developed by Anthropic and published at https://modelcontextprotocol.io, provides a lightweight, capability-oriented RPC layer that already addresses this problem for general LLM applications.

This document specifies a deterministic mapping of MCP roles, primitives, and security workflows onto the network management plane so that:

- A network element (router, switch, firewall, etc.) becomes an "MCP server" that exposes management data and actions as MCP Resources, Prompts, and Tools.
- A controller, orchestrator, or chat-based assistant becomes an "MCP client" that consumes these primitives.
- Human operators interact with the controller using natural language; the controller translates the intent into a sequence of MCP calls, optionally consulting an LLM for reasoning.
- All interactions are subject to explicit user consent, audit, and capability-based access control, as mandated by MCP.

While the star-shaped Controller-to-Device model covers most brown-field deployments, some faults are visible only when two or more devices compare live data in real time. To address this, we extend MCP to support Device-to-Device (D2D) troubleshooting collaboration. In this mode, network elements autonomously form a transient "collaboration domain", exchange YANG/JSON-RPC calls, correlate results with an on-box LLM, and produce a signed report that the controller can retrieve once connectivity is restored. Security is maintained through mutual TLS, short-lived device certificates, and an allowlist of neighbour-callable capabilities.

The result is an intent-based, conversational, and secure automation framework that re-uses existing agents (NETCONF/RESTCONF/YANG, SNMP, CLI, gNMI, etc.) already present on devices instead of requiring new firmware.

3. Mapping of MCP Primitives to Network Management

3.1. Resources

Resources are exposed under the URI scheme mcp://<device>/<yang-module>:<path>. Reading a resource returns YANG data encoded in JSON per [RFC7951]. Examples:

mcp://router1/ietf-interfaces:interfaces/interface=eth0
mcp://router1/openconfig-bgp:bgp/neighbors/neighbor=1.1.1.1

Servers SHOULD support the "content-id" header so that clients can use entity tags (ETags) for caching.
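
As an illustration of the URI layout above, the following minimal Python sketch splits a resource URI into its device, YANG module, and path parts. The names McpResourceUri and parse_resource_uri are hypothetical and are not defined by this document or by MCP; the sketch assumes the first colon after the device name separates the module from the path.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class McpResourceUri(NamedTuple):
    device: str       # host part, e.g. "router1"
    yang_module: str  # e.g. "ietf-interfaces"
    path: str         # data node path inside the module


def parse_resource_uri(uri: str) -> McpResourceUri:
    """Split mcp://<device>/<yang-module>:<path> into its three parts."""
    parts = urlsplit(uri)
    if parts.scheme != "mcp" or not parts.netloc:
        raise ValueError(f"not an mcp:// resource URI: {uri!r}")
    # Assumption: the first ':' after the leading '/' ends the module name.
    module, sep, path = parts.path.lstrip("/").partition(":")
    if not sep or not module or not path:
        raise ValueError(f"missing <yang-module>:<path> in {uri!r}")
    return McpResourceUri(parts.netloc, module, path)


parsed = parse_resource_uri(
    "mcp://router1/openconfig-bgp:bgp/neighbors/neighbor=1.1.1.1"
)
```

Here parsed.device is "router1", parsed.yang_module is "openconfig-bgp", and parsed.path is "bgp/neighbors/neighbor=1.1.1.1".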

3.2. Tools

Tools are mapped to well-known RPC operations already exposed by devices via NETCONF/RESTCONF/YANG, gNMI, or CLI. A Tool is described by an OpenAPI 3.0 Operation Object and MUST be idempotent wherever the underlying operation permits.

Example Tool schema (ping):

{
  "name": "ping",
  "description": "Execute ICMP echo probe",
  "inputSchema": {
    "type": "object",
    "properties": {
      "destination": { "type": "string" },
      "count":       { "type": "integer", "default": 5 },
      "source":      { "type": "string" }
    },
    "required": ["destination"]
  }
}

Figure 1: Tool Schema Example
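
A client or server could check call arguments against such a schema before running the Tool. The sketch below is a hand-rolled check covering only the keywords used in Figure 1 ("type", "default", "required"); it is not a full JSON Schema validator. The helper name prepare_arguments is hypothetical, and filling in "default" values is an assumption about how an implementation might behave.

```python
PING_TOOL = {
    "name": "ping",
    "inputSchema": {
        "type": "object",
        "properties": {
            "destination": {"type": "string"},
            "count": {"type": "integer", "default": 5},
            "source": {"type": "string"},
        },
        "required": ["destination"],
    },
}

# Only the JSON types that appear in Figure 1 are handled here.
_JSON_TYPES = {"string": str, "integer": int}


def prepare_arguments(tool: dict, arguments: dict) -> dict:
    """Validate arguments against the tool's inputSchema and apply defaults."""
    schema = tool["inputSchema"]
    props = schema["properties"]
    for name in schema.get("required", []):
        if name not in arguments:
            raise ValueError(f"missing required argument: {name}")
    result = {}
    for name, value in arguments.items():
        if name not in props:
            raise ValueError(f"unknown argument: {name}")
        expected = _JSON_TYPES[props[name]["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"argument {name} must be {props[name]['type']}")
        result[name] = value
    for name, spec in props.items():
        if name not in result and "default" in spec:
            result[name] = spec["default"]
    return result


args = prepare_arguments(PING_TOOL, {"destination": "10.2.2.2"})
```

With the input shown, args contains the destination plus the default count of 5; omitting "destination" raises ValueError.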

3.3. Prompts

Prompts are reusable prompt templates stored on the device. They allow vendors or operators to encode golden troubleshooting workflows in natural language. A prompt MAY contain variable placeholders such as "{{interface}}" that the client fills in before sending the prompt to an LLM.
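
Placeholder substitution of this kind can be sketched in a few lines of Python. The function name fill_prompt, the sample template text, and the rule that a missing value is an error are all assumptions made for illustration; this document does not define the template grammar beyond the "{{name}}" form.

```python
import re

# Matches {{name}}, allowing optional spaces inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_-]*)\s*\}\}")


def fill_prompt(template: str, values: dict) -> str:
    """Replace every {{name}} placeholder with the matching value."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value for placeholder {name!r}")
        return str(values[name])

    return _PLACEHOLDER.sub(substitute, template)


# Hypothetical template text; real templates would be stored on the device.
prompt = fill_prompt(
    "Diagnose input errors on {{interface}} of {{device}}.",
    {"interface": "eth0", "device": "router1"},
)
```

Failing on a missing value, rather than leaving the placeholder in place, keeps a half-filled prompt from reaching the LLM.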

3.4. Sampling (Optional)

If the client advertises the "sampling" capability, the server MAY request LLM inference on behalf of the device. This is useful for recursive troubleshooting, where the device needs to ask clarifying questions. All sampling requests MUST be approved by the human operator through an explicit consent UI.
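
The two conditions above (capability advertised, operator consent given) can be expressed as a simple gate on the client side. The following sketch is illustrative only: handle_sampling_request, ask_operator, and run_llm are hypothetical names, and the shape of the returned dictionaries is an assumption rather than an MCP message format.

```python
from typing import Callable


def handle_sampling_request(
    client_capabilities: dict,
    request: dict,
    ask_operator: Callable[[dict], bool],
    run_llm: Callable[[dict], str],
) -> dict:
    """Serve a server-initiated sampling request only after both checks pass."""
    if "sampling" not in client_capabilities:
        return {"error": "sampling capability not advertised"}
    # Explicit human consent is required for every sampling request.
    if not ask_operator(request):
        return {"error": "operator declined sampling request"}
    return {"result": run_llm(request)}


# Stand-ins for the consent UI and the LLM back end.
reply = handle_sampling_request(
    {"sampling": {}},
    {"question": "Which interface faces Site-B?"},
    ask_operator=lambda req: True,
    run_llm=lambda req: "ge-0/0/0",
)
```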

4. Controller-to-Device Troubleshooting

4.1. Architecture

   +----------------------------------+
   |  Human Operator (Chat UI, Web)   |
   +-----------------+----------------+
                     | Intent (natural lang.)
                     v
   +----------------------------------+
   |  Controller / Orchestrator       |
   |  (MCP Client + LLM Host)         |
   +-----------------+----------------+
                     | JSON-RPC 2.0 over TLS
                     v
   +----------------------------------+
   |  Network Device                  |
   |  (MCP Server)                    |
   +----------------------------------+

Figure 2: Controller-to-Device model

4.2. Role Allocation

MCP Server: Runs on, or is proxied in front of, the network element. Exposes:

Resources: Read-only YANG datastores, syslog, tech-support
Tools: Idempotent actions, e.g., "ping", "traceroute", "clear counters", "rollback config"
Prompts: Re-usable prompt templates, e.g., "Diagnose BGP session down"

MCP Client: Runs inside the controller. Maintains a persistent JSON-RPC 2.0 connection to each server. Optionally hosts an LLM that reasons about the data returned by servers.

4.3. Capability Negotiation

On connection establishment, the client and server exchange capability objects as defined in MCP. Servers list their supported YANG modules [RFC8525], CLI command sets, and Tool schemas encoded in OpenAPI 3.0. Clients list optional features such as "sampling" (LLM recursion) or "roots" (URI scoping).

4.4. Example Use Cases

4.4.1. Intent: "Verify reachability between Site-A and Site-B"

The following steps illustrate the flow:

1. The operator enters the intent in the chat UI.
2. The controller's LLM deduces the required primitives:
   - ping (Tool) from router1 to 10.2.2.2
   - show interfaces (Resource) on router2
3. The controller issues the MCP calls.
4. The devices return the results.
5. The LLM summarizes: "Packet loss 0%; MTU mismatch detected on router2 ge-0/0/0. Recommend 'set interfaces ge-0/0/0 mtu 1500'."
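
The MCP calls in this flow could be built as JSON-RPC 2.0 requests along the following lines. The method names "tools/call" and "resources/read" are the ones used by the MCP specification; the resource URI for router2 is an assumption based on the scheme in Section 3.1, and the helper names are hypothetical.

```python
import itertools
import json

_request_ids = itertools.count(1)


def jsonrpc_request(method: str, params: dict) -> dict:
    """Build a JSON-RPC 2.0 request object with a fresh numeric id."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
        "params": params,
    }


def ping_call(destination: str, count: int = 5) -> dict:
    """Tool invocation sent to router1."""
    return jsonrpc_request(
        "tools/call",
        {"name": "ping", "arguments": {"destination": destination, "count": count}},
    )


def read_resource(uri: str) -> dict:
    """Resource read sent to router2."""
    return jsonrpc_request("resources/read", {"uri": uri})


ping_req = ping_call("10.2.2.2")
read_req = read_resource("mcp://router2/ietf-interfaces:interfaces")
wire_bytes = json.dumps(ping_req).encode("utf-8")  # what goes over TLS
```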

4.4.2. Intent: "Diagnose why BGP neighbor 1.1.1.1 is down"

1. The controller retrieves:
   - /openconfig-bgp:bgp/neighbors/neighbor=1.1.1.1/state
   - /ietf-interfaces:interfaces/interface=loopback0
2. The controller calls the Tool "tcpdump", filtered on port 179.
3. The LLM correlates: "No TCP SYN received; ACL foo on interface loopback0 denies port 179."
4. The controller offers one-click remediation: remove the ACL entry.

5. Device-to-Device Troubleshooting Collaboration

Some root causes are scattered across several nodes (e.g., unidirectional fiber, one-way ACL, single-sided BFD Down, localized SRv6 SID failure). Although the controller can collect data centrally, the north-bound link may be impaired, time-synchronisation is costly, and uploading bulk data is expensive. Allowing nearby devices to form a transient "trusted collaboration domain", exchange data, correlate it, and infer root causes locally can significantly shorten MTTR.

5.1. Architecture

      +-------------+       +-------------+       +-------------+
      |   Device A  |       |   Device B  |       |   Device C  |
      | MCP Client  |       | MCP Client  |       | MCP Client  |
      |  + Server   |<----->|  + Server   |<----->|  + Server   |
      +-------------+       +-------------+       +-------------+
            |                       |                       |
            |  MCP/JSON-RPC 2.0     |  MCP/JSON-RPC 2.0     |
            |  over mutual-TLS      |  over mutual-TLS      |
            |                       |                       |

Figure 3: Device-to-Device Collaboration Model

5.2. Role Allocation

MCP Server and MCP Client: Both roles are combined in a "Collaboration Agent" (CA) running on the device. Every CA acts as both client and server; the MCP message layer MUST support mutual TLS and capability negotiation.
The user (human or controller) only needs to inject an Intent on any one device in the domain; the CA then performs the chained calls autonomously and rolls up the final report.

5.3. Example Use Cases

5.3.1. Intent: Verify packet loss on the SRv6 path from PE-1 to PE-3 via P-2

The following steps illustrate SRv6 packet-loss localization between three routers: PE-1 (head-end), P-2 (mid-point), and PE-3 (tail-end). The same pattern can be applied to any multi-box fault domain.

Step-0: Intent Injection

An operator types the following natural-language goal in the chat window served by PE-1:

Intent: "Verify packet loss on the SRv6 path from PE-1 to PE-3 via P-2"

The PE-1 Collaboration Agent (CA) parses the Intent, extracts the SID list {PE-1, P-2, PE-3}, and starts the D2D workflow.

Step-1: Collaboration Domain Discovery

The PE-1 CA sends an LLDP/IS-IS Extended TLV that contains:

  + mcp-d2d-port = 9514
  + supported-capabilities = [ietf-srv6-ping, ietf-interface-counters]
  + cert-thumbprint = <SHA-256 of device certificate>

P-2 and PE-3 each reply with their own TLV. All three nodes are now aware of each other's MCP endpoint and capabilities.
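
The three advertised fields can be modelled as a small record, with the thumbprint computed over the device certificate. In the sketch below, the use of the DER-encoded certificate as hash input and the lowercase hex encoding are assumptions, since this document does not fix either; the on-the-wire TLV encoding is deliberately left out, and the function names are hypothetical.

```python
import hashlib
import hmac


def build_advertisement(cert_der: bytes, capabilities: list, port: int = 9514) -> dict:
    """Collect the fields carried in the discovery TLV."""
    return {
        "mcp-d2d-port": port,
        "supported-capabilities": list(capabilities),
        # Assumption: SHA-256 over the DER bytes, rendered as lowercase hex.
        "cert-thumbprint": hashlib.sha256(cert_der).hexdigest(),
    }


def thumbprint_matches(advertisement: dict, presented_cert_der: bytes) -> bool:
    """Check that the certificate seen later in TLS is the advertised one."""
    presented = hashlib.sha256(presented_cert_der).hexdigest()
    return hmac.compare_digest(advertisement["cert-thumbprint"], presented)


# Placeholder bytes standing in for a real DER-encoded certificate.
fake_cert = b"not-a-real-certificate"
adv = build_advertisement(fake_cert, ["ietf-srv6-ping", "ietf-interface-counters"])
```

Comparing the advertised thumbprint with the certificate presented during the TLS handshake lets a node detect a neighbour whose identity changed between discovery and connection.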

Step-2: Mutual TLS & Capability Negotiation

PE-1 opens TLS 1.3 connections to P-2 and PE-3 on port 9514. On each connection, both sides present their X.509 device certificates and send the MCP initialize message:

{ "jsonrpc": "2.0",
  "method": "mcp/initialize",
  "params": {
      "protocolVersion": "2025-12",
      "capabilities": [ "ietf-srv6-ping", "ietf-interface-counters" ]
  },
  "id": 1 }

The responder returns its own capability list. If the intersection of the two lists is non-empty, the session is marked authorised for the capabilities in that intersection.
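
The intersection rule can be sketched as follows. The function names are hypothetical, and raising PermissionError for an empty intersection is an assumption about how an implementation might refuse the session.

```python
def shared_capabilities(local_caps: list, peer_caps: list) -> list:
    """Return the capabilities both sides listed, in local order."""
    peer = set(peer_caps)
    return [cap for cap in local_caps if cap in peer]


def authorise_session(local_caps: list, peer_caps: list) -> list:
    """Authorise the session only for the non-empty intersection."""
    shared = shared_capabilities(local_caps, peer_caps)
    if not shared:
        raise PermissionError("no common capability; session not authorised")
    return shared


authorised = authorise_session(
    ["ietf-srv6-ping", "ietf-interface-counters"],
    ["ietf-interface-counters", "ietf-bfd-state"],  # second name is made up
)
```

In this example only "ietf-interface-counters" is authorised, because it is the only capability that appears in both lists.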

Step-3: Parallel Resource & Tool Calls

The PE-1 CA schedules three operations in parallel:

Local call (no network RPC)

  Tool: "ietf-srv6-ping"
  Params: { "Head": "PE-1", "Tail": "PE-3", "SID-List": ["P-2", "PE-3"], "Count": 100 }

RPC toward P-2

  POST https://[P-2]:9514/mcp/tool/call
  Body: { "tool": "ietf-srv6-ping",
          "arguments": { "Local": "P-2", "Tail": "PE-3", "Count": 100 } }

RPC toward PE-3

  POST https://[PE-3]:9514/mcp/resource/read
  Body: { "uri": "urn:ietf:params:xml:ns:yang:ietf-interfaces/interfaces/interface=SID-Endpoint/statistics" }
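
The parallel scheduling of these three operations can be sketched with a thread pool. The callables below are stand-ins that return canned data instead of running a ping or sending an HTTPS request; run_parallel, the result keys, and the 5-second timeout are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(operations: dict, timeout_s: float = 5.0) -> dict:
    """Run each zero-argument callable in its own worker thread.

    Returns {name: ("ok", result)} or {name: ("error", exception)}.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(operations))) as pool:
        futures = {name: pool.submit(fn) for name, fn in operations.items()}
        for name, future in futures.items():
            try:
                results[name] = ("ok", future.result(timeout=timeout_s))
            except Exception as exc:  # timeout, or an error raised by the call
                results[name] = ("error", exc)
    return results


# Hypothetical stand-ins for the local ping and the two remote RPCs.
operations = {
    "PE-1": lambda: {"loss_pct": 3.2},
    "P-2": lambda: {"loss_pct": 0.0},
    "PE-3": lambda: {"in-errors": 0},
}
results = run_parallel(operations)
```

Note that leaving the with-block still waits for any worker that has not finished, so the timeout here bounds how long a result is waited for, not how long a stuck call keeps its thread.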

Each callee returns a JSON-RPC result plus an Ed25519 signature covering the result body and a monotonic nonce. The nonce prevents replay if the controller later fetches the audit log.
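
The replay check that the monotonic nonce enables can be sketched as a per-peer high-water mark. Signature verification is omitted here because Ed25519 is not in the Python standard library; the class name NonceTracker and the choice of an integer nonce are assumptions.

```python
class NonceTracker:
    """Remember the highest nonce seen per peer and reject anything older."""

    def __init__(self):
        self._last_seen = {}

    def accept(self, peer: str, nonce: int) -> bool:
        """Return True if the nonce is strictly greater than the last one."""
        last = self._last_seen.get(peer)
        if last is not None and nonce <= last:
            return False  # replayed or out-of-order result
        self._last_seen[peer] = nonce
        return True


tracker = NonceTracker()
first = tracker.accept("P-2", 41)
replayed = tracker.accept("P-2", 41)
newer = tracker.accept("P-2", 42)
```

A real implementation would run this check only after the signature over the result body and nonce has been verified; otherwise an attacker could advance the counter with forged values.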

Step-4: Local Correlation & Inference

The PE-1 CA feeds the three data sets into its on-box LLM with the prompt:

  "Compare loss % from PE-1 vs P-2; if PE-1>0 and P-2=0, root cause is
   in {PE-1→P-2 link, upstream ACL}; output a single sentence."

Model output:

  "Loss 3.2 % observed only on PE-1→P-2 direction; P-2→PE-1 0 %;
   suggest checking PE-1 egress ACL 2001."

The deterministic part of the reply (ACL 2001) is extracted as a structured fault hypothesis and stored in the candidate list.
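
Both the comparison rule from the prompt and the extraction of the ACL number can be written deterministically, which gives the CA a way to cross-check the model's sentence. The sketch below is one possible approach; the function names and the shape of the hypothesis record are assumptions, and the regular expression only covers the "ACL <number>" wording seen in the example output.

```python
import re
from typing import Optional


def loss_rule(pe1_loss_pct: float, p2_loss_pct: float) -> Optional[list]:
    """Rule from the prompt: loss seen from PE-1 but not from P-2."""
    if pe1_loss_pct > 0 and p2_loss_pct == 0:
        return ["PE-1->P-2 link", "upstream ACL"]
    return None


_ACL = re.compile(r"\bACL\s+(\d+)\b")


def extract_acl(model_output: str) -> Optional[dict]:
    """Pull the first 'ACL <number>' mention out of the model's sentence."""
    match = _ACL.search(model_output)
    if match is None:
        return None
    return {"type": "acl", "id": int(match.group(1))}


candidates = loss_rule(3.2, 0.0)
hypothesis = extract_acl("suggest checking PE-1 egress ACL 2001.")
```

If loss_rule returns None while the model still names an ACL, the two disagree and the hypothesis can be flagged for operator review instead of being stored as-is.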

Step-5: Roll-up Report & User Consent

The PE-1 CA assembles the signed results, inference, and raw packets into an MCP resource:

  URI:  urn:ietf:params:xml:ns:yang:ietf-mcp/report/74e8
  Body: { "creator": "PE-1",
          "created": "2025-10-30T14:23:42Z",
          "hypothesis": "PE-1 egress ACL 2001",
          "evidence": [ <base64 encoded PCAP>, ... ],
          "signatures": { "PE-1": <sig1>, "P-2": <sig2>, "PE-3": <sig3> } }

If the operator is still online, the CA presents the hypothesis and asks for explicit approval before any mitigating action (e.g., editing the ACL) is executed. If the controller is reachable, the report is pushed through the conventional Controller-to-Device (star) channel; otherwise it remains on-box until the next controller sync.
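
Assembling the report body and choosing between push and local storage can be sketched as follows. The field names mirror the report shown in this step; the function names, the push and store callbacks, and the string signatures are hypothetical placeholders.

```python
import base64
from datetime import datetime, timezone


def build_report(creator, hypothesis, pcaps, signatures, now=None) -> dict:
    """Assemble the report body; `now` must be timezone-aware if given."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "creator": creator,
        "created": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "hypothesis": hypothesis,
        "evidence": [base64.b64encode(p).decode("ascii") for p in pcaps],
        "signatures": dict(signatures),
    }


def deliver(report, controller_reachable, push, store) -> str:
    """Push to the controller when reachable, otherwise keep on-box."""
    if controller_reachable:
        push(report)
        return "pushed"
    store(report)
    return "stored"


on_box = []
report = build_report(
    "PE-1",
    "PE-1 egress ACL 2001",
    [b"\x00\x01"],  # placeholder bytes, not a real capture
    {"PE-1": "sig1", "P-2": "sig2", "PE-3": "sig3"},
    now=datetime(2025, 10, 30, 14, 23, 42, tzinfo=timezone.utc),
)
outcome = deliver(report, False, push=lambda r: None, store=on_box.append)
```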

Failure Handling

If any D2D call times out or returns an authentication error, the PE-1 CA marks that node "untrusted" and falls back to controller-based polling if available. All intermediate temporary states (e.g., cleared counters) are rolled back immediately to preserve atomicity.
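
The failure path can be sketched as a wrapper around each D2D call. Here TimeoutError and PermissionError stand in for a timeout and an authentication error, the rollback actions are plain callables undone in reverse order, and all names are hypothetical; how a device actually restores state such as counters is outside the scope of the sketch.

```python
def call_with_fallback(node, d2d_call, untrusted, rollbacks, controller_poll=None):
    """Try the D2D call; on failure mark the node, roll back, then fall back."""
    try:
        return d2d_call(node)
    except (TimeoutError, PermissionError):
        untrusted.add(node)
        # Undo temporary state in reverse order of creation.
        while rollbacks:
            rollbacks.pop()()
        if controller_poll is None:
            raise  # no controller path available; surface the failure
        return controller_poll(node)


def unreachable_peer(node):
    """Stand-in for a D2D call that never answers."""
    raise TimeoutError(f"{node} did not answer")


undo_log = []
untrusted_nodes = set()
result = call_with_fallback(
    "P-2",
    unreachable_peer,
    untrusted_nodes,
    [lambda: undo_log.append("first"), lambda: undo_log.append("second")],
    controller_poll=lambda node: {"node": node, "via": "controller"},
)
```

After this call, "P-2" is in untrusted_nodes, the rollbacks have run with the most recent one first, and the result comes from the controller path.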
