May 1, 2026

Where 70% of EKS spend hides: a 5-step GPU FinOps audit

How to find the unallocated GPU and compute spend that cost dashboards can't see — and what to do about it.

#finops #gpu #eks #kubernetes

In our most recent engagement, 70% of an EKS cluster’s monthly spend was not allocated to any product team. None of it showed up in the cost dashboard. That’s not unusual: it’s the rule, not the exception, for organizations running GPU workloads on Kubernetes.

This post walks through the audit framework we use to find that hidden spend in two weeks.

The five steps

1. Stand up GPU observability before anything else

You can’t optimize what you can’t see. The minimum viable stack:

DCGM Exporter → Prometheus for per-pod GPU utilization, memory pressure, SM occupancy
Datadog GPU Fleet (or Grafana with the NVIDIA mixin) for fleet-wide trends
kube-state-metrics to correlate utilization with pod / namespace / cost-center labels

If your inference pods don’t have cost-center labels, stop here and fix that first. Everything downstream depends on it.
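As a rough sketch of that correlation step, the Python below joins per-pod GPU utilization with pod labels and averages it per cost center. The sample data, the function name utilization_by_cost_center, and the "cost-center" label key are illustrative assumptions. In a real cluster the inputs would come from Prometheus queries over the DCGM Exporter metric DCGM_FI_DEV_GPU_UTIL and the kube-state-metrics metric kube_pod_labels.

```python
from collections import defaultdict
from statistics import mean


def utilization_by_cost_center(gpu_samples, pod_labels, label_key="cost-center"):
    """Average GPU utilization per cost center.

    gpu_samples: iterable of (namespace, pod, utilization_percent) tuples.
    pod_labels:  dict mapping (namespace, pod) to that pod's label dict.
    Pods without the label are grouped under "UNLABELED" so the gap is visible.
    """
    buckets = defaultdict(list)
    for namespace, pod, utilization in gpu_samples:
        labels = pod_labels.get((namespace, pod), {})
        buckets[labels.get(label_key, "UNLABELED")].append(utilization)
    return {center: mean(values) for center, values in buckets.items()}


if __name__ == "__main__":
    # Hypothetical data standing in for Prometheus query results.
    samples = [
        ("ml", "ranker-0", 80.0),
        ("ml", "ranker-1", 20.0),
        ("batch", "embedder-0", 10.0),
    ]
    labels = {
        ("ml", "ranker-0"): {"cost-center": "search"},
        ("ml", "ranker-1"): {"cost-center": "search"},
        # embedder-0 has no cost-center label, so it lands in UNLABELED.
    }
    print(utilization_by_cost_center(samples, labels))
```

The size of the UNLABELED bucket is a quick signal of how much labeling work remains before the later steps can be trusted.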

2. Find the unallocated bucket

In your cost tool (Harness CCM, Kubecost, AWS Cost Explorer with allocation tags), filter by “untagged” or “unallocated.” Common culprits:

Stale node groups at 0.001% utilization — left over from migrations, no taints, no occupants
CPU pools running at 0.1% — provisioned for headroom, never right-sized
GPU node churn — Karpenter / KEDA misconfiguration causing nodes to come up, sit idle, get scaled down, repeat
Pods missing labels — service teams that never adopted the labeling convention
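To make the size of the unallocated bucket concrete, here is a minimal sketch that computes it from exported cost line items. The record shape (a dict with "resource", "cost", and "tags" keys) and the "cost-center" tag name are assumptions for illustration; each cost tool has its own export format.

```python
def unallocated_share(line_items, required_tag="cost-center"):
    """Return (unallocated_cost, total_cost, fraction_unallocated).

    A line item counts as unallocated when the required tag is missing or empty.
    """
    total = sum(item["cost"] for item in line_items)
    unallocated = sum(
        item["cost"]
        for item in line_items
        if not item.get("tags", {}).get(required_tag)
    )
    fraction = unallocated / total if total else 0.0
    return unallocated, total, fraction


def top_unallocated(line_items, required_tag="cost-center", limit=5):
    """List the most expensive unallocated resources, largest first."""
    untagged = [
        item for item in line_items
        if not item.get("tags", {}).get(required_tag)
    ]
    return sorted(untagged, key=lambda item: item["cost"], reverse=True)[:limit]


if __name__ == "__main__":
    # Hypothetical line items; real ones would come from a cost tool export.
    items = [
        {"resource": "gpu-nodegroup-old", "cost": 500.0, "tags": {}},
        {"resource": "cpu-pool-headroom", "cost": 200.0, "tags": {"cost-center": ""}},
        {"resource": "inference-pool", "cost": 300.0, "tags": {"cost-center": "search"}},
    ]
    print(unallocated_share(items))
    print([item["resource"] for item in top_unallocated(items)])
```

Sorting the unallocated items by cost keeps the audit focused on the few resources that account for most of the gap.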

3. Trace the GPU node churn

This is usually the biggest hidden cost. Look for:

Cyclic up/down patterns in the GPU node count graph — typically a misconfigured KEDA cooldownPeriod or a Karpenter consolidation policy fighting an autoscaler
Nodes that come up but never schedule a pod — affinity / toleration mismatch
Nodes that receive a pod, run it for 5 minutes, then evict it: usually a pod with no resource requests that lands on a Karpenter spot node that gets reclaimed

Each of these eats billed GPU-hours, at on-demand rates unless the node is spot, while doing little or no useful work.
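The patterns above can be approximated from node history. The sketch below is one hedged way to do it: count_scale_cycles counts up-then-down cycles in a node count series, and churn_report flags nodes that never ran a pod or lived only briefly. The NodeLifetime record, both function names, and the 10-minute threshold are assumptions for illustration, not part of Karpenter or KEDA.

```python
from dataclasses import dataclass


@dataclass
class NodeLifetime:
    name: str
    gpus: int
    minutes_alive: float
    pods_scheduled: int


def count_scale_cycles(node_counts):
    """Count scale-up-then-scale-down cycles in a series of node counts."""
    cycles = 0
    rising = False
    for previous, current in zip(node_counts, node_counts[1:]):
        if current > previous:
            rising = True
        elif current < previous and rising:
            cycles += 1
            rising = False
    return cycles


def churn_report(nodes, short_lived_minutes=10.0):
    """Flag nodes that never ran a pod, or ran one only briefly."""
    never_used = [n.name for n in nodes if n.pods_scheduled == 0]
    short_lived = [
        n.name for n in nodes
        if n.pods_scheduled > 0 and n.minutes_alive <= short_lived_minutes
    ]
    # GPU-hours on nodes that never hosted a pod are pure idle cost.
    idle_gpu_hours = sum(
        n.gpus * n.minutes_alive / 60 for n in nodes if n.pods_scheduled == 0
    )
    return {
        "never_used": never_used,
        "short_lived": short_lived,
        "idle_gpu_hours": idle_gpu_hours,
    }


if __name__ == "__main__":
    # Hypothetical history; real data would come from node events or metrics.
    print(count_scale_cycles([2, 4, 4, 2, 5, 2, 2, 3]))
    print(churn_report([
        NodeLifetime("gpu-a", gpus=8, minutes_alive=30.0, pods_scheduled=0),
        NodeLifetime("gpu-b", gpus=1, minutes_alive=5.0, pods_scheduled=1),
        NodeLifetime("gpu-c", gpus=4, minutes_alive=600.0, pods_scheduled=3),
    ]))
```

A high cycle count over a short window points at the autoscaler settings, while a long never_used list points at affinity or toleration mismatches.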

4. Consolidate workloads

Don’t optimize each namespace in isolation. Instead:

Merge namespaces onto shared node pools with appropriate taints and resource quotas
Right-size inference pools by p95 utilization, not peak
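Enable GPU time slicing for workloads that don’t saturate a full GPU: multiple inference containers can share one GPU, but time slicing provides no memory or fault isolation between them, so co-locate only workloads that tolerate sharing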
Use scale-to-zero between batches via KEDA ScaledObjects on queue depth
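To show why sizing by p95 rather than peak matters, the sketch below compares the two on a series of concurrent GPU demand samples. The nearest-rank percentile method, the function names, and the 1.25 headroom factor are assumptions chosen for illustration; pick the percentile method and headroom that match your SLOs.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, for integer pct in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def gpus_to_provision(demand_samples, pct=95, headroom=1.25):
    """GPUs to provision: the chosen percentile of demand times a headroom factor."""
    return math.ceil(percentile(demand_samples, pct) * headroom)


if __name__ == "__main__":
    # Hypothetical samples of GPUs in use at once; one short spike to 16.
    demand = [2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 7, 8, 16]
    print("peak:", max(demand))
    print("p95:", percentile(demand, 95))
    print("provision:", gpus_to_provision(demand))
```

In this made-up series a single spike doubles the peak, so sizing to peak would provision 16 GPUs where p95 plus headroom asks for 10. The spike still has to be absorbed somewhere, for example by autoscaling or queueing.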

5. Lock the wins in

The hard part is keeping the savings. We use:

Resource quotas per namespace (for example, a cap on requests.nvidia.com/gpu) so a misconfigured deployment can’t spawn 100 GPU nodes
Cost-center label enforcement via OPA Gatekeeper — pods without required labels are denied
Weekly Grafana review of cost-per-1k-inference, cost-per-training-run
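Gatekeeper policies themselves are written in Rego, so the Python below is only a model of the label rule, the sort of check that could also run in CI before a manifest reaches the cluster. It also shows the arithmetic behind the cost-per-1k-inference metric. The function names, the "cost-center" label key, and the manifest shape are illustrative assumptions.

```python
def missing_required_labels(pod_manifest, required=("cost-center",)):
    """Return the required label keys that are absent or empty on a pod manifest."""
    labels = (pod_manifest.get("metadata") or {}).get("labels") or {}
    return [key for key in required if not labels.get(key)]


def admit(pod_manifest, required=("cost-center",)):
    """Model an admission decision: (allowed, message)."""
    missing = missing_required_labels(pod_manifest, required)
    if missing:
        return False, "missing required labels: " + ", ".join(missing)
    return True, "ok"


def cost_per_1k_inferences(total_cost, inference_count):
    """Cost per 1,000 inferences over the same period as total_cost."""
    if inference_count <= 0:
        raise ValueError("inference_count must be positive")
    return total_cost * 1000 / inference_count


if __name__ == "__main__":
    # Hypothetical manifests and weekly totals.
    labeled = {"metadata": {"labels": {"app": "ranker", "cost-center": "search"}}}
    unlabeled = {"metadata": {"labels": {"app": "ranker"}}}
    print(admit(labeled))
    print(admit(unlabeled))
    print(cost_per_1k_inferences(1500.0, 3_000_000))
```

Tracking the per-1k figure week over week makes regressions visible even when total spend moves with traffic.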

What this looks like in practice

For one Fortune 100 healthcare AI client, this framework recovered ~$170–180K/month in wasted GPU + compute spend. Allocated spend visibility went from ~30% to ~95%. Inference SLOs held stable through the migration.

The full case study is here.

If you’re running GPU workloads on EKS or AKS and your cost line is growing faster than your traffic, book a FinOps audit. Two weeks, fixed scope, line-item findings.