GPU Kubernetes Reference Architecture

How we structure GPU node pools, scheduling, and observability for production ML workloads on EKS and AKS.

This playbook captures the GPU Kubernetes architecture we deploy for production ML workloads — the structure of node pools, the scheduling discipline, and the observability stack that makes it operable.

Node-pool topology

Three pools, each with a specific role:

Pool	Instance type	Purpose	Scale
Inference (online)	g5.2xlarge / Standard_NC6s_v3	Real-time serving with strict p99 SLOs	KEDA queue-depth, scale-to-zero
Inference (batch)	g5.12xlarge / Standard_NC24s_v3	Bulk scoring, embeddings	Karpenter spot, scale-to-zero
Training	p4d.24xlarge / Standard_ND96asr_v4	Multi-GPU training, fine-tuning	Manual or Kubeflow-driven
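
As a rough illustration of the "KEDA queue-depth, scale-to-zero" entry for the online pool, a ScaledObject could look like the sketch below. The Deployment name, Prometheus address, metric name, and thresholds are all assumptions made for the example, not values from our deployments.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: model-server-scaler          # hypothetical name
spec:
  scaleTargetRef:
    name: model-server               # assumed Deployment name
  minReplicaCount: 0                 # scale to zero when the queue is empty
  maxReplicaCount: 8                 # assumed ceiling
  cooldownPeriod: 300                # seconds to wait before scaling back to zero
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumed address
        # inference_queue_depth is a hypothetical application metric
        query: sum(inference_queue_depth{service="model-server"})
        threshold: "10"              # target queue depth per replica (assumption)
```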

Why three pools, not one: noisy-neighbor isolation. A long-running training job on the same node as latency-sensitive inference will tank p99. Separate pools, separate taints.
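
A minimal sketch of what "separate pools, separate taints" looks like on the node side, shown as Node object fragments. In practice the labels and taints are set through the node-pool configuration (EKS node group, AKS node pool, or Karpenter) rather than by editing Node objects. The node names, the training label value, and the gpu-training taint key are assumptions for illustration.

```yaml
# Node in the online-inference pool
apiVersion: v1
kind: Node
metadata:
  name: gpu-online-node-1                    # hypothetical node name
  labels:
    node.kubernetes.io/gpu-pool: inference-online
spec:
  taints:
    - key: gpu-online
      effect: NoSchedule                     # repels pods without a matching toleration
---
# Node in the training pool
apiVersion: v1
kind: Node
metadata:
  name: gpu-training-node-1                  # hypothetical node name
  labels:
    node.kubernetes.io/gpu-pool: training    # assumed label value
spec:
  taints:
    - key: gpu-training                      # assumed taint key
      effect: NoSchedule
```

An inference pod that tolerates only gpu-online is repelled by the gpu-training taint, so it cannot be scheduled next to a training job.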

Scheduling discipline

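spec:
  # Pin the pod to the online-inference pool.
  nodeSelector:
    node.kubernetes.io/gpu-pool: inference-online
  # Tolerate only this pool's taint; other pools' taints still repel the pod.
  tolerations:
    - key: gpu-online
      operator: Exists
      effect: NoSchedule
  containers:
    - name: model-server
      resources:
        # For extended resources such as nvidia.com/gpu, requests must equal limits.
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
          cpu: 2000m
          memory: 8Gi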

Three rules we always enforce:

Every GPU pod has both requests and limits — otherwise Karpenter can’t make sane consolidation decisions
Toleration matches taint exactly — no pod accidentally lands on a training node
CPU / memory requests sized to the actual GPU instance — otherwise nodes are oversubscribed on CPU and pods evict each other

GPU time slicing for shared inference

For models that don’t saturate a full GPU (most small inference workloads):

# nvidia-device-plugin config
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # one physical GPU exposed as 4 slices

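This lets up to 4 inference pods share a single GPU, each still requesting nvidia.com/gpu: 1. For sub-100ms inference it cuts cost without measurable latency impact. Time slicing provides no memory isolation between the sharing pods, so the combined GPU memory of the co-located models has to fit on the card.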
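
One way to deliver that config, sketched here under the assumption that the device plugin is managed by the NVIDIA GPU Operator: wrap it in a ConfigMap that the operator's ClusterPolicy then references as the device-plugin config. The ConfigMap name, namespace, and the data key `any` are illustrative choices.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config        # hypothetical name
  namespace: gpu-operator          # assumes the operator runs in this namespace
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        # Reject pods asking for more than one slice; a request for 2 replicas
        # does not give a pod twice the compute.
        failRequestsGreaterThanOne: true
        resources:
          - name: nvidia.com/gpu
            replicas: 4            # one physical GPU exposed as 4 slices
```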

Observability stack

Non-negotiable layers:

DCGM Exporter → Prometheus — per-GPU SM utilization, memory, power, temp
kube-state-metrics — pod / node / pool labels for cost attribution
Datadog GPU Fleet (or Grafana with the Nvidia mixin) — fleet-wide dashboards
Tracing (Jaeger / OpenTelemetry) — inference latency broken down by stage
Log shipping (Loki / Splunk) — model errors correlated with input features
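
To make the first layer concrete, here is a minimal sketch of Prometheus alerting rules built on DCGM Exporter metrics. The rule names, thresholds, and durations are assumptions and should be tuned per fleet. The metric names and the gpu / Hostname labels are those DCGM Exporter emits by default.

```yaml
groups:
  - name: gpu-fleet                      # hypothetical rule group
    rules:
      - alert: GpuMostlyIdle
        # Average utilization under 5% over 30 minutes (threshold is an assumption)
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 5
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} has been nearly idle for 30m"
      - alert: GpuRunningHot
        # Temperature in degrees Celsius; 85 is an assumed threshold
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} is above 85C"
```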

Cost attribution

Required labels on every GPU pod:

metadata:
  labels:
    cost-center: "${CC_CODE}"
    product: "${PRODUCT}"
    environment: "${ENV}"
    workload-type: "inference"   # one of: inference, training, batch

Enforce via OPA Gatekeeper — pods missing required labels are denied at admission. Anything else, and your cost dashboard becomes a fiction within a quarter.
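
A minimal sketch of that admission check, adapted from the required-labels pattern in the Gatekeeper documentation. As written it applies to every Pod in the matched namespaces rather than only to GPU pods. The constraint name and the namespace list are assumptions.

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        # Deny the object if any required label key is absent.
        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("missing required labels: %v", [missing])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: pods-must-have-cost-labels     # hypothetical name
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["ml-serving", "ml-training"]   # assumed namespaces
  parameters:
    labels: ["cost-center", "product", "environment", "workload-type"]
```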
