2026

Recovering $170K/month in wasted GPU spend

A healthcare AI client running real-time RAG on EKS was burning ~$170–180K/month on idle GPUs and over-provisioned compute. We traced the 70% of EKS spend that was unallocated to its root causes and remediated them.

EKS, Karpenter, KEDA, Nvidia DCGM, Datadog, Harness CCM

Context

A Fortune 100 healthcare AI program ran a real-time Retrieval-Augmented Generation pipeline for clinical decision support on AWS EKS. Inference workloads were GPU-heavy (CUDA / PyTorch) and bursty — long quiet periods punctuated by traffic spikes during clinic hours.

Problem

Cloud bills had ballooned to over $1M/month, with 70% of EKS spend not allocated to any product team and therefore invisible in Harness CCM cost dashboards. Engineering had no visibility into per-model GPU utilization, and the autoscaling layer was thrashing, repeatedly adding and removing GPU nodes.
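
To make the allocation gap concrete, here is a minimal sketch of how an allocated-spend share could be computed from per-pod cost records. The `PodCost` record, its fields, and the `cost-center` label key are hypothetical names chosen for illustration; this is not the schema Harness CCM uses, and the numbers are made up rather than taken from the client's data.

```python
from dataclasses import dataclass, field


@dataclass
class PodCost:
    """One pod's monthly cost plus its Kubernetes labels (hypothetical schema)."""
    name: str
    monthly_cost: float
    labels: dict = field(default_factory=dict)


def allocated_share(pods, label_key="cost-center"):
    """Return the fraction of total cost carried by pods that set label_key."""
    total = sum(p.monthly_cost for p in pods)
    if total == 0:
        return 0.0
    allocated = sum(p.monthly_cost for p in pods if p.labels.get(label_key))
    return allocated / total


# Illustrative numbers only, not the client's data.
pods = [
    PodCost("rag-retriever", 300.0, {"cost-center": "clinical-ai"}),
    PodCost("embedding-worker", 500.0),              # no labels: unallocated
    PodCost("legacy-batch", 200.0, {"team": "x"}),   # wrong key: still unallocated
]
share = allocated_share(pods)
print(f"allocated: {share:.0%}, unallocated: {1 - share:.0%}")
```

Any pod whose cost cannot be joined to an owner through the chosen label falls into the unallocated bucket, which is why a missing or misnamed label hides spend from per-team dashboards.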

Approach

A two-week diagnostic followed by a structured remediation:

Stand up GPU observability — DCGM Exporter to Prometheus, Datadog GPU Fleet integration, Splunk dashboards for per-pod utilization, memory pressure, and idle-node signals.
Root-cause the unallocated spend — traced to: KEDA/Karpenter misconfiguration causing cyclic GPU node churn, CPU pools at 0.1% utilization, stale node groups at 0.001% utilization, and pods missing namespace/cost-center labels.
Consolidate workloads — merged 200+ namespaces onto shared node pools with appropriate taints/tolerations and resource quotas; right-sized inference pools by p95 utilization.
GPU time slicing — enabled multiple inference containers to share a single GPU without contention, deferring expensive scale-outs.
Scale-to-zero between batches — KEDA ScaledObjects on queue-depth signals; idle GPU pools drop to zero between jobs.
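
Two of the steps above, right-sizing by p95 utilization and scaling to zero on queue depth, can be sketched in a few lines. This is a simplified model under stated assumptions, not the project's actual tooling and not KEDA's API: the function names, the 70% target utilization, and the sample values are all hypothetical. Utilization is taken as whole percentages, which is how the DCGM Exporter reports `DCGM_FI_DEV_GPU_UTIL`.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def right_sized_gpus(current_gpus, util_pct_samples, target_util_pct=70):
    """GPUs needed so the p95 load lands at target_util_pct (assumed target)."""
    busy = current_gpus * p95(util_pct_samples)  # GPU-percent of work at p95
    needed = (busy + target_util_pct - 1) // target_util_pct  # integer ceil
    return max(1, needed)


def desired_replicas(queue_depth, per_replica_target, max_replicas):
    """Simplified queue-depth scaling: zero when idle, else ceil(depth / target)."""
    if queue_depth <= 0:
        return 0  # scale to zero between batches
    needed = (queue_depth + per_replica_target - 1) // per_replica_target
    return min(max_replicas, needed)


# Hypothetical pool: 8 GPUs, mostly idle, with occasional spikes.
samples = [5] * 18 + [35, 90]
print(right_sized_gpus(8, samples))                 # pool size at p95 load
print([desired_replicas(q, 10, 8) for q in (0, 25, 500)])
```

Sizing on p95 rather than the peak deliberately ignores the rarest spikes; those are the cases where queue-driven scale-out, capped at a maximum replica count, is expected to absorb the load.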

Outcome

~$170–180K/month recovered in wasted GPU + compute spend
20%+ reduction in baseline Kubernetes infrastructure cost
Allocated spend visibility went from ~30% to ~95% across product teams
p99 inference latency held stable through migration (no SLO regression)

Stack

EKS, Karpenter, KEDA, Nvidia Device Plugin + DCGM Exporter, Datadog (APM, GPU Fleet, Cost), Splunk, Harness CCM, Terraform.