k8s-gpu-mcp-server
Verified Safe
by ArangoGutierrez
Overview
Provides just-in-time, real-time introspection of NVIDIA GPU hardware in Kubernetes clusters, intended for AI-assisted SRE troubleshooting.
Installation
npx k8s-gpu-mcp-server@latest
Environment Variables
- K8S_GPU_MCP_NAMESPACE
- K8S_GPU_MCP_SERVICE
- K8S_GPU_MCP_CONTEXT
- K8S_GPU_MCP_SERVICE_PORT
- K8S_GPU_MCP_LOCAL_PORT
- KUBECONFIG
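As a rough illustration of how the variables above might be handed to the server, the sketch below builds a client configuration entry that launches the documented `npx` command with a chosen subset of them. The `mcpServers` layout follows a convention used by several MCP clients and is an assumption here, as are the server name `k8s-gpu` and every example value; only the variable names and the `npx` command come from this page.

```python
import json

# Names copied from the Environment Variables list; nothing else is assumed about them.
KNOWN_VARS = (
    "K8S_GPU_MCP_NAMESPACE",
    "K8S_GPU_MCP_SERVICE",
    "K8S_GPU_MCP_CONTEXT",
    "K8S_GPU_MCP_SERVICE_PORT",
    "K8S_GPU_MCP_LOCAL_PORT",
    "KUBECONFIG",
)


def build_client_config(values, server_name="k8s-gpu"):
    """Build an `mcpServers`-style config dict (assumed client convention).

    `values` maps variable names to settings; names outside KNOWN_VARS are
    rejected so that typos surface early. `server_name` is a hypothetical label.
    """
    unknown = sorted(set(values) - set(KNOWN_VARS))
    if unknown:
        raise ValueError(f"unrecognized variables: {unknown}")
    # Environment values must be strings, so ports given as ints are converted.
    env = {name: str(values[name]) for name in KNOWN_VARS if name in values}
    return {
        "mcpServers": {
            server_name: {
                "command": "npx",
                "args": ["k8s-gpu-mcp-server@latest"],
                "env": env,
            }
        }
    }


if __name__ == "__main__":
    # Placeholder values for illustration only; substitute your own.
    example = {
        "K8S_GPU_MCP_NAMESPACE": "gpu-diagnostics",
        "K8S_GPU_MCP_LOCAL_PORT": 8080,
    }
    print(json.dumps(build_client_config(example), indent=2))
```

Variables left out of `values` are simply omitted from the `env` block, so the server falls back to whatever defaults it defines.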
Security Notes
The project prioritizes security: it ships detailed RBAC configurations, hardened security contexts (e.g., `readOnlyRootFilesystem: true`, `allowPrivilegeEscalation: false`), and optional NetworkPolicies, and the agent defaults to `read-only` mode. The agent container must run with `runAsUser: 0` (root) for NVML access; the project explicitly justifies this, but it still carries elevated privilege. Reading `/dev/kmsg` for XID error analysis requires `CAP_SYSLOG` and potentially `privileged: true`. The gateway supports `kubectl exec` routing as a fallback, which can be less secure than direct HTTP if not properly constrained; the recommended `HTTP` routing mode mitigates this risk. The project provides comprehensive documentation of its security model and verification steps.
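The settings named in these notes can be checked mechanically. The sketch below is a hypothetical helper, not part of the project: it inspects a container `securityContext` expressed as a Python dict and reports deviations from the values quoted above. The example context is an assumption assembled from those values; Kubernetes spells the `CAP_SYSLOG` capability as `SYSLOG` in `capabilities.add`.

```python
# Assumed example assembled from the settings quoted in the Security Notes;
# the project's real manifests may differ.
AGENT_SECURITY_CONTEXT = {
    "runAsUser": 0,                      # root, stated as required for NVML access
    "readOnlyRootFilesystem": True,
    "allowPrivilegeEscalation": False,
    "capabilities": {"add": ["SYSLOG"]}, # Kubernetes name for CAP_SYSLOG
}


def check_security_context(ctx):
    """Return a list of deviations from the settings named in the notes."""
    findings = []
    if ctx.get("readOnlyRootFilesystem") is not True:
        findings.append("readOnlyRootFilesystem should be true")
    if ctx.get("allowPrivilegeEscalation") is not False:
        findings.append("allowPrivilegeEscalation should be false")
    if ctx.get("runAsUser") != 0:
        findings.append("runAsUser should be 0 for NVML access")
    if ctx.get("privileged"):
        # The notes say this is only potentially needed, so flag it for review.
        findings.append("privileged is set; confirm /dev/kmsg access needs it")
    return findings


if __name__ == "__main__":
    for finding in check_security_context(AGENT_SECURITY_CONTEXT):
        print(finding)
```

An empty result means the context matches the quoted settings. A `privileged: true` entry is reported rather than rejected, since the notes describe it as only potentially required.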
Similar Servers
kubernetes-mcp-server
Facilitates AI agent interaction with Kubernetes and OpenShift clusters by exposing management and observability tools via the Model Context Protocol.
mcp-kubernetes
Enables AI assistants to interact with and debug Kubernetes clusters by translating natural language requests into Kubernetes operations.
asya
A microservices platform for orchestrating asynchronous, event-driven AI/ML workflows via an MCP JSON-RPC gateway.
SRE-agent
An autonomous multi-agent system designed for Kubernetes incident detection, diagnosis, and mitigation using LLMs and modular workflows to reduce Mean Time to Resolution (MTTR).