GPU Kubernetes Reference Architecture
How we structure GPU node pools, scheduling, and observability for production ML workloads on EKS and AKS.
This playbook captures the GPU Kubernetes architecture we deploy for production ML workloads — the structure of node pools, the scheduling discipline, and the observability stack that makes it operable.
Node-pool topology
Three pools, each with a specific role:
| Pool | Instance type | Purpose | Scale |
|---|---|---|---|
| Inference (online) | g5.2xlarge / Standard_NC6s_v3 | Real-time serving with strict p99 SLOs | KEDA queue-depth, scale-to-zero |
| Inference (batch) | g5.12xlarge / Standard_NC24s_v3 | Bulk scoring, embeddings | Karpenter spot, scale-to-zero |
| Training | p4d.24xlarge / Standard_ND96asr_v4 | Multi-GPU training, fine-tuning | Manual or Kubeflow-driven |
Why three pools, not one: noisy-neighbor isolation. A long-running training job on the same node as latency-sensitive inference will tank p99. Separate pools, separate taints.
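To make the "KEDA queue-depth, scale-to-zero" entry for the online pool concrete, here is a minimal sketch of a KEDA `ScaledObject` driven by a Prometheus query. The Deployment name, Prometheus address, metric name `inference_queue_depth`, and all thresholds are assumptions for illustration, not values from our deployments.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: model-server-scaler          # hypothetical name
spec:
  scaleTargetRef:
    name: model-server               # assumed Deployment name
  minReplicaCount: 0                 # scale-to-zero when the queue is empty
  maxReplicaCount: 8                 # assumed ceiling
  cooldownPeriod: 300                # seconds to wait before scaling back to zero
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # assumed address
      query: sum(inference_queue_depth{service="model-server"})  # hypothetical metric
      threshold: "10"                # target queue depth per replica
```

KEDA only scales the pods. Removing the emptied GPU nodes is still the job of whatever node autoscaler manages the pool, and the first request after a scale-to-zero pays the cold-start cost.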
Scheduling discipline
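spec:
  nodeSelector:
    node.kubernetes.io/gpu-pool: inference-online
  tolerations:
  # tolerate the online-pool taint so the pod can land on those nodes
  - key: gpu-online
    operator: Exists
    effect: NoSchedule
  containers:
  - name: model-server
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1  # must equal the limit; extended resources cannot be overcommitted
        cpu: 2000m
        memory: 8Gi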
Three rules we always enforce:
- Every GPU pod has both requests and limits; otherwise Karpenter can’t make sane consolidation decisions
- Toleration matches taint exactly, so no pod accidentally lands on a training node
- CPU / memory requests sized to the actual GPU instance; otherwise nodes are oversubscribed, CPU contention throttles pods, and memory pressure evicts them
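For the toleration rule, it helps to see the node side of the pairing. The fragment below is a sketch of the fields an online-inference node would carry to match the pod spec above. In practice these are set through the node group or provisioner configuration rather than by editing Node objects; the node name is a placeholder.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: example-gpu-node             # placeholder
  labels:
    node.kubernetes.io/gpu-pool: inference-online   # matched by the pod's nodeSelector
spec:
  taints:
  - key: gpu-online                  # same key the pod tolerates
    effect: NoSchedule               # pods without the toleration are not scheduled here
```

The taint keeps other workloads off the node and the `nodeSelector` keeps the pod off other nodes; a pod needs both to stay in its pool.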
GPU time slicing for shared inference
For models that don’t saturate a full GPU (most small inference workloads):
# nvidia-device-plugin config
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4 # one physical GPU exposed as 4 slices
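This lets up to 4 inference pods share a single GPU. It cuts cost without measurable latency impact for sub-100ms inference. Time slicing provides no memory or fault isolation between pods, so the models sharing a GPU must fit in its memory together.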
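Time slicing also changes the CPU and memory sizing. A g5.2xlarge has 8 vCPU and 32 GiB of memory, so four pods with the 2000m / 8Gi requests shown earlier ask for the node's entire raw capacity, which is more than its allocatable capacity once kubelet and system reservations are subtracted. The fourth pod would stay pending. A sketch of requests sized for four pods per node follows; the exact values are assumptions and should be checked against the node's allocatable resources.

```yaml
resources:
  limits:
    nvidia.com/gpu: 1    # one time-sliced replica, not a whole physical GPU
  requests:
    nvidia.com/gpu: 1
    cpu: 1500m           # 4 pods x 1500m = 6 vCPU, leaves headroom on 8 vCPU
    memory: 6Gi          # 4 pods x 6Gi = 24Gi, leaves headroom on 32 GiB
```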
Observability stack
Non-negotiable layers:
- DCGM Exporter → Prometheus: per-GPU SM utilization, memory, power, temperature
- kube-state-metrics: pod / node / pool labels for cost attribution
- Datadog GPU Fleet (or Grafana with the NVIDIA mixin): fleet-wide dashboards
- Tracing (Jaeger / OpenTelemetry): inference latency broken down by stage
- Log shipping (Loki / Splunk): model errors correlated with input features
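As an illustration of what the DCGM Exporter layer enables, here is a sketch of a Prometheus rule file with two alerts. It assumes the exporter's default metrics `DCGM_FI_DEV_GPU_TEMP` and `DCGM_FI_DEV_GPU_UTIL` are being scraped; the alert names, thresholds, and durations are illustrative assumptions.

```yaml
groups:
- name: gpu-fleet                    # hypothetical group name
  rules:
  - alert: GpuHighTemperature
    expr: DCGM_FI_DEV_GPU_TEMP > 85  # degrees Celsius, assumed threshold
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} is running hot"
  - alert: GpuIdle
    # average utilization under 5% for 30 minutes suggests a wasted GPU
    expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 5
    for: 30m
    labels:
      severity: info
    annotations:
      summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} has been idle"
```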
Cost attribution
Required labels on every GPU pod:
metadata:
  labels:
    cost-center: "${CC_CODE}"
    product: "${PRODUCT}"
    environment: "${ENV}"
    workload-type: "inference" # one of: inference, training, batch
Enforce via OPA Gatekeeper — pods missing required labels are denied at admission. Anything else, and your cost dashboard becomes a fiction within a quarter.
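A minimal sketch of such a constraint follows. It assumes the `K8sRequiredLabels` ConstraintTemplate from the Gatekeeper policy library is already installed, and it scopes enforcement to namespaces carrying a hypothetical `gpu-workloads: "true"` label, since a constraint matches by kind and namespace rather than by whether a pod requests a GPU.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: gpu-pods-cost-labels         # hypothetical name
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
    namespaceSelector:
      matchLabels:
        gpu-workloads: "true"        # hypothetical namespace label
  parameters:
    message: "GPU pods must carry cost-center, product, environment, and workload-type labels"
    labels:
    - key: cost-center
    - key: product
    - key: environment
    - key: workload-type
      allowedRegex: "^(inference|training|batch)$"   # restrict to the three known values
```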