GPU Kubernetes Reference Architecture

How we structure GPU node pools, scheduling, and observability for production ML workloads on EKS and AKS.

This playbook captures the GPU Kubernetes architecture we deploy for production ML workloads — the structure of node pools, the scheduling discipline, and the observability stack that makes it operable.

Node-pool topology

Three pools, each with a specific role:

Pool                Instance type                        Purpose                                  Scale
Inference (online)  g5.2xlarge / Standard_NC6s_v3        Real-time serving with strict p99 SLOs   KEDA queue-depth, scale-to-zero
Inference (batch)   g5.12xlarge / Standard_NC24s_v3      Bulk scoring, embeddings                 Karpenter spot, scale-to-zero
Training            p4d.24xlarge / Standard_ND96asr_v4   Multi-GPU training, fine-tuning          Manual or Kubeflow-driven
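
As a rough illustration of the "KEDA queue-depth, scale-to-zero" entry for the online pool, a ScaledObject could look like the sketch below. The object and Deployment names, the Prometheus address, the inference_queue_depth metric, the threshold, and the replica ceiling are all assumptions made for illustration, not values from our deployments.

```yaml
# Sketch only: names, address, metric, and numbers are hypothetical.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: model-server-online
spec:
  scaleTargetRef:
    name: model-server            # Deployment to scale (assumed name)
  minReplicaCount: 0              # scale-to-zero when the queue is empty
  maxReplicaCount: 8              # illustrative ceiling
  cooldownPeriod: 300             # seconds of inactivity before dropping to zero
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(inference_queue_depth{service="model-server"})
        threshold: "10"           # target queue depth per replica
```

With minReplicaCount set to 0, the first request after an idle period waits for a node and a model load, so the cooldown is worth tuning against the p99 target.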

Why three pools, not one: noisy-neighbor isolation. A long-running training job on the same node as latency-sensitive inference will tank p99. Separate pools, separate taints.
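
On the node side, "separate pools, separate taints" amounts to each pool's nodes carrying a pool label and a NoSchedule taint. A minimal sketch of the relevant Node fields for the online pool, assuming the same label and taint key that the pod spec in the next section selects and tolerates:

```yaml
# Fragment of a Node object in the online-inference pool (sketch).
metadata:
  labels:
    node.kubernetes.io/gpu-pool: inference-online   # matched by the pod's nodeSelector
spec:
  taints:
    - key: gpu-online            # tolerated only by online-inference pods
      effect: NoSchedule         # pods without the toleration are not scheduled here
```

The batch and training pools would follow the same pattern with their own label value and taint key.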

Scheduling discipline

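spec:
  # Pin the pod to the online-inference pool.
  nodeSelector:
    node.kubernetes.io/gpu-pool: inference-online
  # Tolerate that pool's taint; pods without this toleration stay off its nodes.
  tolerations:
    - key: gpu-online
      operator: Exists
      effect: NoSchedule
  containers:
    - name: model-server
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
          # Sized so the pod fits alongside system pods on the GPU instance.
          cpu: 2000m
          memory: 8Gi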

Three rules we always enforce:

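  1. Every GPU pod sets both requests and limits (for nvidia.com/gpu the two must be equal); Karpenter provisions and consolidates on requests, so without them it can’t make sane consolidation decisions
  2. Toleration matches the pool's taint exactly, and the nodeSelector pins the pod to that pool, so no pod accidentally lands on a training node
  3. CPU / memory requests are sized to the actual GPU instance; otherwise nodes end up oversubscribed, with pods throttled on CPU and evicted under memory pressure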

GPU time slicing for shared inference

For models that don’t saturate a full GPU (most small inference workloads):

# nvidia-device-plugin config
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # one physical GPU exposed as 4 slices

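This advertises each physical GPU as four schedulable nvidia.com/gpu resources, so up to 4 inference pods can share it. Time slicing gives no memory or fault isolation between those pods, so their combined footprint has to fit in GPU memory. It cuts cost without measurable latency impact for sub-100 ms inference.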
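
Sharing the GPU four ways only pays off if four pods also fit on the node's CPU and memory. As a worked example, assume a g5.2xlarge (8 vCPU, 32 GiB, one GPU): four pods at 2000m / 8Gi would request the node's entire raw capacity, which is more than its allocatable capacity once kubelet reservations and DaemonSets are subtracted, so only three would schedule. The figures below are illustrative, not measured values:

```yaml
# Sketch: per-pod resources sized so 4 time-sliced pods fit one g5.2xlarge.
# 4 x 1500m = 6000m CPU and 4 x 6Gi = 24Gi, leaving headroom for system pods.
resources:
  limits:
    nvidia.com/gpu: 1     # one of the 4 time-sliced replicas
  requests:
    nvidia.com/gpu: 1
    cpu: 1500m
    memory: 6Gi
```

The right numbers depend on the actual allocatable capacity of the node, which kubectl describe node reports.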

Observability stack

Non-negotiable layers:

  • DCGM Exporter → Prometheus: per-GPU SM utilization, memory, power, temperature
  • kube-state-metrics: pod / node / pool labels for cost attribution
  • Datadog GPU Fleet (or Grafana with the NVIDIA mixin): fleet-wide dashboards
  • Tracing (Jaeger / OpenTelemetry): inference latency broken down by stage
  • Log shipping (Loki / Splunk): model errors correlated with input features
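
To show what the DCGM Exporter → Prometheus layer enables, here is a minimal sketch of a Prometheus rule file built on two standard DCGM Exporter metrics. The group name, alert names, thresholds, and durations are assumptions chosen for illustration, and the gpu and Hostname labels are assumed to be present as in the exporter's default output.

```yaml
# Sketch only: thresholds and names are illustrative.
groups:
  - name: gpu-fleet
    rules:
      # A GPU that stays near-idle for an hour is a cost signal.
      - alert: GpuSustainedIdle
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 5
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} below 5% utilization"
      # Sustained high temperature usually precedes throttling.
      - alert: GpuOverTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} above 85 C"
```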

Cost attribution

Required labels on every GPU pod:

metadata:
  labels:
    cost-center: "${CC_CODE}"
    product: "${PRODUCT}"
    environment: "${ENV}"
    workload-type: "inference"   # exactly one of: inference, training, batch

Enforce via OPA Gatekeeper — pods missing required labels are denied at admission. Anything else, and your cost dashboard becomes a fiction within a quarter.
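
One way to express this admission rule is a constraint like the sketch below. It assumes the K8sRequiredLabels ConstraintTemplate from the community gatekeeper-library is already installed; the constraint name is hypothetical. As written it applies to every Pod in scope, so restricting it to GPU pods would need a namespace match or a custom template.

```yaml
# Sketch only: relies on the gatekeeper-library K8sRequiredLabels template.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: gpu-pod-cost-labels        # hypothetical name
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    message: "GPU pods must carry cost attribution labels"
    labels:
      - key: cost-center
      - key: product
      - key: environment
      - key: workload-type
        allowedRegex: "^(inference|training|batch)$"   # reject free-form values
```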

See also