
Where 70% of EKS spend hides: a 5-step GPU FinOps audit

How to find the unallocated GPU and compute spend that cost dashboards can't see — and what to do about it.

#finops #gpu #eks #kubernetes

In our most recent engagement, 70% of an EKS cluster’s monthly spend was not allocated to any product team, and none of it showed up against a team in the cost dashboard. That’s not unusual; for organizations running GPU workloads on Kubernetes, it’s the rule rather than the exception.

This post walks through the audit framework we use to find that hidden spend in two weeks.

The five steps

1. Stand up GPU observability before anything else

You can’t optimize what you can’t see. The minimum viable stack:

  • DCGM Exporter → Prometheus for per-pod GPU utilization, memory usage, and SM occupancy
  • Datadog GPU Fleet (or Grafana with the NVIDIA mixin) for fleet-wide trends
  • kube-state-metrics to correlate utilization with pod / namespace / cost-center labels
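kube-state-metrics v2 and later only export the pod labels you allowlist, so the cost-center label needs a flag such as `--metric-labels-allowlist=pods=[cost-center]` before it appears on `kube_pod_labels` (sanitized to `label_cost_center`). The correlation itself is a join on namespace and pod; in PromQL it is usually written along the lines of `avg by (label_cost_center) (DCGM_FI_DEV_GPU_UTIL * on(namespace, pod) group_left(label_cost_center) kube_pod_labels)`, though the pod label on DCGM series may be named differently depending on your scrape config. The sketch below does the same join in memory on hypothetical samples so the rollup logic is easy to check; unlabeled pods fall into an `unallocated` bucket.

```python
from collections import defaultdict


def gpu_util_by_cost_center(util_samples, pod_cost_centers):
    """Average GPU utilization (%) per cost center.

    util_samples: iterable of (namespace, pod, util_percent), e.g. from DCGM_FI_DEV_GPU_UTIL.
    pod_cost_centers: {(namespace, pod): cost_center}, e.g. from kube_pod_labels.
    Pods without a cost center are grouped under "unallocated".
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for namespace, pod, util in util_samples:
        center = pod_cost_centers.get((namespace, pod)) or "unallocated"
        totals[center] += util
        counts[center] += 1
    return {center: totals[center] / counts[center] for center in totals}


# Hypothetical samples: two labeled inference pods and one unlabeled batch pod.
util = [
    ("inference", "ranker-0", 80.0),
    ("inference", "ranker-1", 40.0),
    ("batch", "embed-job-7", 5.0),
]
labels = {
    ("inference", "ranker-0"): "search",
    ("inference", "ranker-1"): "search",
}
print(gpu_util_by_cost_center(util, labels))  # {'search': 60.0, 'unallocated': 5.0}
```

Averaging per pod sample is a simplification; weighting by GPU count or time window may matter on mixed fleets.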

If your inference pods don’t have cost-center labels, stop here and fix that first. Everything downstream depends on it.
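A quick way to size the labeling gap is to scan the pod list. This sketch assumes the label key is `cost-center` (substitute your own convention) and takes the parsed JSON output of `kubectl get pods -A -o json`; the sample data is made up.

```python
REQUIRED_LABELS = ("cost-center",)  # assumed label key; use your own convention


def pods_missing_labels(pod_list, required=REQUIRED_LABELS):
    """Return (namespace, name, missing_keys) for each pod lacking a required label.

    pod_list is the parsed JSON from `kubectl get pods -A -o json`.
    A label that is present but empty counts as missing.
    """
    findings = []
    for item in pod_list.get("items", []):
        meta = item.get("metadata", {})
        labels = meta.get("labels") or {}
        missing = [key for key in required if not labels.get(key)]
        if missing:
            findings.append((meta.get("namespace"), meta.get("name"), missing))
    return findings


# Hypothetical sample; in practice pass json.loads(...) of the kubectl output.
sample = {
    "items": [
        {"metadata": {"namespace": "inference", "name": "ranker-0",
                      "labels": {"cost-center": "search"}}},
        {"metadata": {"namespace": "batch", "name": "embed-job-7",
                      "labels": {"app": "embed"}}},
    ]
}
for namespace, name, missing in pods_missing_labels(sample):
    print(f"{namespace}/{name}: missing {', '.join(missing)}")
# batch/embed-job-7: missing cost-center
```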

2. Find the unallocated bucket

In your cost tool (Harness CCM, Kubecost, or AWS Cost Explorer with cost allocation tags), filter for “untagged” or “unallocated” spend. Common culprits:

  • Stale node groups at 0.001% utilization — left over from migrations, no taints, no occupants
  • CPU pools running at 0.1% — provisioned for headroom, never right-sized
  • GPU node churn — Karpenter / KEDA misconfiguration causing nodes to come up, sit idle, get scaled down, repeat
  • Pods missing labels — service teams that never adopted the labeling convention
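Once you have the cost export, the unallocated share and the idle node groups are simple aggregations. A minimal sketch over hypothetical line items (a `cost` plus an optional `cost_center` tag) and per-node-group average utilization expressed as a fraction; the field names, node-group names, and 1% idle threshold are assumptions, not the export format of any particular cost tool.

```python
from collections import defaultdict


def allocation_breakdown(line_items):
    """Sum cost per cost center; items without a cost_center tag go to 'unallocated'."""
    totals = defaultdict(float)
    for item in line_items:
        totals[item.get("cost_center") or "unallocated"] += item["cost"]
    return dict(totals)


def unallocated_share(line_items):
    """Fraction of total spend that carries no cost-center tag."""
    totals = allocation_breakdown(line_items)
    grand_total = sum(totals.values())
    return totals.get("unallocated", 0.0) / grand_total if grand_total else 0.0


def idle_node_groups(avg_utilization, threshold=0.01):
    """Node groups whose average utilization (a fraction, 0.01 == 1%) is below threshold."""
    return sorted(name for name, util in avg_utilization.items() if util < threshold)


# Hypothetical monthly line items and node-group utilization.
items = [
    {"cost": 30.0, "cost_center": "search"},
    {"cost": 50.0, "cost_center": None},  # tag key present, no value
    {"cost": 20.0},                       # no tag at all
]
groups = {"legacy-gpu": 0.00001, "cpu-headroom": 0.001, "inference-gpu": 0.62}

print(f"{unallocated_share(items):.0%}")  # 70%
print(idle_node_groups(groups))           # ['cpu-headroom', 'legacy-gpu']
```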

3. Trace the GPU node churn

This is usually the biggest hidden cost. Look for:

  • Cyclic up/down patterns in the GPU node count graph — typically a misconfigured KEDA cooldownPeriod or a Karpenter consolidation policy fighting an autoscaler
  • Nodes that come up but never get a pod scheduled on them, usually because of an affinity / toleration mismatch
  • Nodes that run a pod for about 5 minutes and then evict it, usually because a pod with no resource requests landed on a Karpenter spot node that was then reclaimed
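All three patterns can be flagged mechanically. The sketch below counts direction reversals in a sampled GPU node-count series (for example, exported from your node-count graph) and buckets node records into never-used and short-lived. The record fields and the 10-minute cutoff are hypothetical; map them to whatever your node events or metrics actually provide.

```python
def count_reversals(node_counts):
    """Count up->down and down->up direction changes in a GPU node-count series."""
    reversals = 0
    last_direction = 0
    for prev, cur in zip(node_counts, node_counts[1:]):
        direction = (cur > prev) - (cur < prev)  # +1 up, -1 down, 0 flat
        if direction == 0:
            continue  # flat stretches neither start nor break a cycle
        if last_direction and direction != last_direction:
            reversals += 1
        last_direction = direction
    return reversals


def classify_nodes(nodes, short_lived_minutes=10):
    """Split node records into never-used and short-lived name lists.

    Each record is a dict with hypothetical fields: name, pods_scheduled, lifetime_minutes.
    """
    never_used = [n["name"] for n in nodes if n["pods_scheduled"] == 0]
    short_lived = [
        n["name"] for n in nodes
        if n["pods_scheduled"] > 0 and n["lifetime_minutes"] < short_lived_minutes
    ]
    return never_used, short_lived


# Hypothetical data: GPU node count sampled every 5 minutes over half an hour.
counts = [2, 4, 2, 4, 2, 4, 2]
nodes = [
    {"name": "gpu-a", "pods_scheduled": 0, "lifetime_minutes": 25},
    {"name": "gpu-b", "pods_scheduled": 1, "lifetime_minutes": 5},
    {"name": "gpu-c", "pods_scheduled": 3, "lifetime_minutes": 600},
]
print(count_reversals(counts))  # 5
print(classify_nodes(nodes))    # (['gpu-a'], ['gpu-b'])
```

A high reversal count over a short window is the signature of autoscalers fighting each other; steady growth or decline scores zero.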

Each of these burns GPU-hours you pay for, often at on-demand rates, while doing no useful work.
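To put a number on churn, multiply the idle node-hours per day by what you pay per node-hour. In this sketch the hourly rate is a placeholder parameter, not a real AWS price, and a month is approximated as 30 days.

```python
def monthly_churn_cost(cycles_per_day, nodes_per_cycle, idle_minutes_per_cycle,
                       node_hourly_rate, days_per_month=30):
    """Projected monthly cost of GPU nodes that scale up, sit idle, and scale down.

    node_hourly_rate is whatever you pay per node-hour (on-demand, spot, or
    discounted); the value used below is a placeholder, not a real price.
    """
    idle_node_hours_per_day = cycles_per_day * nodes_per_cycle * idle_minutes_per_cycle / 60
    return idle_node_hours_per_day * node_hourly_rate * days_per_month


# Hypothetical: 24 churn cycles a day, 2 nodes each, 15 idle minutes per node.
cost = monthly_churn_cost(24, 2, 15, node_hourly_rate=10.0)
print(f"${cost:,.0f} per month")  # $3,600 per month
```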

4. Consolidate workloads

Don’t optimize each namespace in isolation. Instead:

  • Consolidate workloads from many namespaces onto shared node pools, using taints to keep unintended pods off the GPU nodes and resource quotas to cap each namespace
  • Right-size inference pools by p95 utilization, not peak
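  • Enable GPU time slicing for workloads that don’t saturate a full GPU, so multiple inference containers can share one GPU. Time slicing provides no memory or fault isolation and the containers still contend for compute, so reserve it for workloads that each use a small fraction of the GPU; where isolation matters and the GPU supports it, MIG partitions the GPU instead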
  • Use scale-to-zero between batches via KEDA ScaledObjects on queue depth
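Two of these steps reduce to a small calculation and a manifest. The sketch below sizes a pool to the nearest-rank p95 of sampled concurrent GPU demand (real sizing usually adds some headroom on top) and builds a KEDA `ScaledObject` with `minReplicaCount: 0`. It assumes the batch queue is Amazon SQS, read through KEDA’s `aws-sqs-queue` scaler; the Deployment name, queue URL, and target queue length are hypothetical, and authentication is omitted.

```python
import json
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is 0-100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def pool_size(gpu_demand_samples, gpus_per_node, pct=95):
    """Nodes needed to cover the pct-th percentile of concurrent GPU demand."""
    return math.ceil(percentile(gpu_demand_samples, pct) / gpus_per_node)


def scale_to_zero_scaledobject(deployment, queue_url, region,
                               target_queue_length=5, max_replicas=4):
    """KEDA ScaledObject scaling a batch Deployment on SQS queue depth, down to zero.

    SQS is an assumed queue; authentication (TriggerAuthentication or pod
    identity) is omitted for brevity.
    """
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-scaler"},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": 0,   # no pods between batches, so the autoscaler can drop GPU nodes
            "maxReplicaCount": max_replicas,
            "cooldownPeriod": 300,  # seconds after the last active trigger before scaling to 0
            "triggers": [{
                "type": "aws-sqs-queue",
                "metadata": {       # KEDA trigger metadata values are strings
                    "queueURL": queue_url,
                    "queueLength": str(target_queue_length),
                    "awsRegion": region,
                },
            }],
        },
    }


# Hypothetical demand: 19 samples at 10 concurrent GPUs, one spike to 30.
demand = [10] * 19 + [30]
print(pool_size(demand, gpus_per_node=8))  # 2 (sizing to the peak would give 4)

manifest = scale_to_zero_scaledobject(
    "embed-batch", "https://sqs.us-east-1.amazonaws.com/123456789012/embed-jobs", "us-east-1")
print(json.dumps(manifest, indent=2))  # kubectl apply accepts JSON as well as YAML
```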

5. Lock the wins in

The hard part is keeping the savings. We use:

  • Resource quotas per namespace so a misconfigured deployment can’t spawn 100 GPU nodes
  • Cost-center label enforcement via OPA Gatekeeper — pods without required labels are denied
  • A weekly Grafana review of cost per 1k inferences and cost per training run
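The first two guardrails are plain Kubernetes objects, and the review metric is a ratio. A minimal sketch that builds them as Python dicts and prints them as a `v1` `List`, which `kubectl apply` accepts. The namespace, the 16-GPU cap, and the `cost-center` key are illustrative, and the Gatekeeper constraint assumes the `K8sRequiredLabels` template from the Gatekeeper library is installed. Note that a GPU quota caps GPU requests per namespace rather than node count; capping requests is what indirectly stops a runaway deployment from pulling in a fleet of nodes.

```python
import json


def gpu_quota(namespace, max_gpus):
    """ResourceQuota capping total GPU requests in a namespace.

    Extended resources such as nvidia.com/gpu are limited via the requests. prefix.
    """
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        "spec": {"hard": {"requests.nvidia.com/gpu": str(max_gpus)}},
    }


def require_label_constraint(label_key):
    """Gatekeeper constraint denying Pods without label_key.

    Assumes the K8sRequiredLabels ConstraintTemplate from the Gatekeeper
    library is installed; check its parameter schema for your version.
    """
    return {
        "apiVersion": "constraints.gatekeeper.sh/v1beta1",
        "kind": "K8sRequiredLabels",
        "metadata": {"name": f"pods-require-{label_key}"},
        "spec": {
            "match": {"kinds": [{"apiGroups": [""], "kinds": ["Pod"]}]},
            "parameters": {"labels": [{"key": label_key}]},
        },
    }


def cost_per_1k(total_cost, inference_count):
    """Cost per 1,000 inferences for the weekly review; None when there was no traffic."""
    if inference_count == 0:
        return None
    return total_cost * 1000 / inference_count


bundle = {"apiVersion": "v1", "kind": "List",
          "items": [gpu_quota("inference", 16), require_label_constraint("cost-center")]}
print(json.dumps(bundle, indent=2))
# Hypothetical week: $4,200 of GPU spend across 12 million inferences.
print(f"${cost_per_1k(4200.0, 12_000_000):.2f} per 1k inferences")  # $0.35 per 1k inferences
```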

What this looks like in practice

For one Fortune 100 healthcare AI client, this framework recovered ~$170–180K/month in wasted GPU + compute spend. Allocated spend visibility went from ~30% to ~95%. Inference SLOs held stable through the migration.

The full case study is here.


If you’re running GPU workloads on EKS or AKS and your cost line is growing faster than your traffic, book a FinOps audit. Two weeks, fixed scope, line-item findings.