Where 70% of EKS spend hides: a 5-step GPU FinOps audit
How to find the unallocated GPU and compute spend that cost dashboards can't see — and what to do about it.
In our most recent engagement, 70% of an EKS cluster’s monthly spend was not allocated to any product team, and none of it showed up in the cost dashboard. For organizations running GPU workloads on Kubernetes, that is the rule, not the exception.
This post walks through the audit framework we use to find that hidden spend in two weeks.
The five steps
1. Stand up GPU observability before anything else
You can’t optimize what you can’t see. The minimum viable stack:
- DCGM Exporter → Prometheus for per-pod GPU utilization, memory pressure, SM occupancy
- Datadog GPU Fleet (or Grafana with the NVIDIA mixin) for fleet-wide trends
- kube-state-metrics to correlate utilization with pod / namespace / cost-center labels
If your inference pods don’t have cost-center labels, stop here and fix that first. Everything downstream depends on it.
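A quick way to size the labeling gap is to scan a pod listing for missing labels. The following is a minimal sketch, assuming the listing has the shape of `kubectl get pods -A -o json` output; the label key `cost-center`, the helper name `pods_missing_labels`, and the sample pods are illustrative assumptions, not part of any particular tool.

```python
import json

# Hypothetical label key; substitute whatever your labeling convention requires.
REQUIRED_LABELS = ("cost-center",)


def pods_missing_labels(pod_list, required=REQUIRED_LABELS):
    """Return (namespace, name, missing_keys) for pods lacking any required label."""
    missing = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        labels = meta.get("labels") or {}
        # Treat an empty label value the same as an absent label.
        absent = [key for key in required if not labels.get(key)]
        if absent:
            missing.append((meta.get("namespace", "default"), meta.get("name", "?"), absent))
    return missing


# Inline sample in the shape of `kubectl get pods -A -o json`; names are made up.
sample = json.loads("""
{"items": [
  {"metadata": {"namespace": "inference", "name": "llm-0", "labels": {"cost-center": "ml-platform"}}},
  {"metadata": {"namespace": "inference", "name": "llm-1", "labels": {"app": "llm"}}},
  {"metadata": {"namespace": "batch", "name": "train-7"}}
]}
""")

for namespace, name, absent in pods_missing_labels(sample):
    print(f"{namespace}/{name} is missing: {', '.join(absent)}")
```

In practice you would replace `sample` with the parsed output of the real listing and group the results by namespace to see which teams have not adopted the convention.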
2. Find the unallocated bucket
In your cost tool (Harness CCM, Kubecost, AWS Cost Explorer with allocation tags), filter by “untagged” or “unallocated.” Common culprits:
- Stale node groups at 0.001% utilization — left over from migrations, no taints, no occupants
- CPU pools running at 0.1% — provisioned for headroom, never right-sized
- GPU node churn — Karpenter / KEDA misconfiguration causing nodes to come up, sit idle, get scaled down, repeat
- Pods missing labels — service teams that never adopted the labeling convention
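The arithmetic behind the unallocated bucket is simple enough to reproduce outside the cost tool, which helps when you want to sanity-check what the dashboard reports. This is a rough sketch, assuming you can export line items as records with a `cost` and an optional `tags` mapping; the field names, the `cost-center` tag key, and the dollar amounts are illustrative assumptions, not data from the engagement.

```python
from collections import defaultdict


def allocation_summary(line_items, tag_key="cost-center"):
    """Split spend into per-owner totals and an unallocated total.

    Returns (spend_by_owner, unallocated_spend, unallocated_share).
    """
    by_owner = defaultdict(float)
    unallocated = 0.0
    for item in line_items:
        owner = (item.get("tags") or {}).get(tag_key)
        if owner:
            by_owner[owner] += item["cost"]
        else:
            # Missing or empty tag: nobody owns this spend.
            unallocated += item["cost"]
    total = unallocated + sum(by_owner.values())
    share = unallocated / total if total else 0.0
    return dict(by_owner), unallocated, share


# Illustrative numbers only.
items = [
    {"resource": "gpu-pool-a", "cost": 3000.0, "tags": {"cost-center": "search"}},
    {"resource": "gpu-pool-old", "cost": 5000.0, "tags": {}},
    {"resource": "cpu-headroom", "cost": 2000.0},
]

owners, unallocated, share = allocation_summary(items)
print(f"unallocated: ${unallocated:,.0f} ({share:.0%} of total)")
```

Sorting the untagged items by cost usually surfaces the stale node groups and idle pools first.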
3. Trace the GPU node churn
This is usually the biggest hidden cost. Look for:
- Cyclic up/down patterns in the GPU node count graph: typically a misconfigured KEDA cooldownPeriod or a Karpenter consolidation policy fighting an autoscaler
- Nodes that come up but never schedule a pod: affinity / toleration mismatch
- Nodes that schedule a pod, run for 5 minutes, then evict: usually a pod with no resource requests ending up on a Karpenter spot node that gets reclaimed
Each of these burns billed GPU-hours with no useful work to show for them.
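The cyclic pattern is easy to spot by eye on a graph, but a small check helps when you are scanning many node pools. Below is a minimal sketch, assuming you have exported the GPU node count as a list of evenly spaced samples; the function names and the `max_reversals` threshold are assumptions chosen for illustration and would need tuning against your own scaling behavior.

```python
def count_reversals(node_counts):
    """Count how many times the series switches between scaling up and scaling down."""
    reversals = 0
    last_direction = 0  # +1 = scaling up, -1 = scaling down, 0 = no movement seen yet
    for previous, current in zip(node_counts, node_counts[1:]):
        if current == previous:
            continue  # flat samples do not change the direction
        direction = 1 if current > previous else -1
        if last_direction and direction != last_direction:
            reversals += 1
        last_direction = direction
    return reversals


def looks_like_churn(node_counts, max_reversals=4):
    """Flag a window whose node count flips direction more often than max_reversals."""
    return count_reversals(node_counts) > max_reversals


# Made-up series: one normal ramp, and one pool that keeps flapping.
steady = [2, 2, 3, 4, 4, 4, 3, 3]
flapping = [0, 2, 0, 2, 0, 3, 0, 2, 0, 2]

print("steady:  ", count_reversals(steady), looks_like_churn(steady))
print("flapping:", count_reversals(flapping), looks_like_churn(flapping))
```

A flagged window only tells you where to look; the scaler and provisioner configuration still has to be inspected to find which of the causes above applies.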
4. Consolidate workloads
Don’t optimize each namespace in isolation. Instead:
- Merge namespaces onto shared node pools with appropriate taints and resource quotas
- Right-size inference pools by p95 utilization, not peak
- Enable GPU time slicing for workloads that don’t saturate a full GPU: multiple inference containers can share one GPU, but time slicing provides no memory or fault isolation between them, so pair only workloads that tolerate sharing
- Use scale-to-zero between batches via KEDA ScaledObjects on queue depth
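To make the difference between sizing by p95 and sizing by peak concrete, here is a small worked sketch. It assumes demand is expressed in GPU-equivalents of busy time per sample; the nearest-rank percentile, the 80% `target_utilization`, and the sample values are assumptions for illustration, not figures from the engagement.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def gpus_needed(demand_samples, pct=95, target_utilization=0.8):
    """GPUs required to serve the pct-th percentile of demand at the target utilization."""
    return math.ceil(percentile(demand_samples, pct) / target_utilization)


# Hypothetical demand samples with a single short spike.
demand = [2.0] * 10 + [3.0] * 9 + [9.0]

print("sized to peak:", gpus_needed(demand, pct=100))
print("sized to p95: ", gpus_needed(demand, pct=95))
```

The gap between the two numbers is the capacity held for the spike. Whether the autoscaler can absorb that spike instead depends on node startup time and your latency SLO, so treat the p95 figure as a starting point rather than a final answer.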
5. Lock the wins in
The hard part is keeping the savings. We use:
- Resource quotas per namespace so a misconfigured deployment can’t spawn 100 GPU nodes
- Cost-center label enforcement via OPA Gatekeeper — pods without required labels are denied
- Weekly Grafana review of cost-per-1k-inference, cost-per-training-run
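The unit-cost metric in the weekly review is plain arithmetic, and it is worth pinning down the definition so every team computes it the same way. A minimal sketch follows, assuming the cost side is GPU-hours multiplied by a blended hourly rate; the function names, the rate, and the volumes are hypothetical placeholders.

```python
def cost_per_1k_inferences(gpu_hours, hourly_rate, inference_count):
    """Spend divided by inference volume, expressed per 1,000 requests."""
    if inference_count <= 0:
        raise ValueError("inference_count must be positive")
    return gpu_hours * hourly_rate / inference_count * 1000


def week_over_week_change(previous, current):
    """Fractional change between two weekly values (0.10 means +10%)."""
    return (current - previous) / previous


# Hypothetical weekly figures; the hourly rate is a placeholder, not a quoted price.
last_week = cost_per_1k_inferences(gpu_hours=1000, hourly_rate=4.0, inference_count=8_000_000)
this_week = cost_per_1k_inferences(gpu_hours=1200, hourly_rate=4.0, inference_count=8_000_000)

change = week_over_week_change(last_week, this_week)
print(f"last week: ${last_week:.2f} per 1k")
print(f"this week: ${this_week:.2f} per 1k ({change:+.0%})")
```

A rising unit cost on flat traffic, as in this made-up example, is the signal that something regressed and the earlier steps are worth re-running for that namespace.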
What this looks like in practice
For one Fortune 100 healthcare AI client, this framework recovered ~$170–180K/month in wasted GPU + compute spend. Allocated spend visibility went from ~30% to ~95%. Inference SLOs held stable through the migration.
The full case study is here.
If you’re running GPU workloads on EKS or AKS and your cost line is growing faster than your traffic, book a FinOps audit. Two weeks, fixed scope, line-item findings.