Recovering $170K/month in wasted GPU spend
A healthcare AI client running real-time RAG on EKS was burning ~$170–180K/month on idle GPUs and over-provisioned compute. We traced the 70% of EKS spend that was unallocated to its root causes and remediated them.
Context
A Fortune 100 healthcare AI program ran a real-time Retrieval-Augmented Generation pipeline for clinical decision support on AWS EKS. Inference workloads were GPU-heavy (CUDA / PyTorch) and bursty: long quiet periods punctuated by traffic spikes during clinic hours.
Problem
Cloud bills had ballooned to over $1M/month, with 70% of EKS spend unallocated to any product team and therefore invisible in Harness CCM cost dashboards. Engineering had no visibility into per-model GPU utilization, and the autoscalers were thrashing, repeatedly provisioning and tearing down GPU nodes.
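The unallocated share can be measured directly from cost records. The following is a minimal Python sketch, not the client's tooling: the record shape (a `cost` value plus a `labels` dict) and the label keys `namespace` and `cost-center` are assumptions for illustration, and a real export from a cost tool would need its own parsing.

```python
from collections import defaultdict

# Assumed ownership label keys; a record missing either one counts as unallocated.
REQUIRED_LABELS = ("namespace", "cost-center")


def allocation_report(records):
    """Split spend into allocated (per cost center) and unallocated.

    records: iterable of dicts shaped like
        {"cost": 120.0, "labels": {"namespace": "...", "cost-center": "..."}}
    (hypothetical shape, chosen for this example).
    Returns (spend_by_cost_center, unallocated_spend, unallocated_share).
    """
    by_owner = defaultdict(float)
    unallocated = 0.0
    total = 0.0
    for rec in records:
        cost = rec["cost"]
        total += cost
        labels = rec.get("labels", {})
        # Allocated only if every required label is present and non-empty.
        if all(labels.get(key) for key in REQUIRED_LABELS):
            by_owner[labels["cost-center"]] += cost
        else:
            unallocated += cost
    share = unallocated / total if total else 0.0
    return dict(by_owner), unallocated, share


# Illustrative numbers only, not client data.
sample = [
    {"cost": 300, "labels": {"namespace": "rag", "cost-center": "cc-1"}},
    {"cost": 500, "labels": {"namespace": "default"}},  # no cost-center label
    {"cost": 200, "labels": {}},                        # no labels at all
]
_, _, unallocated_share = allocation_report(sample)
print(f"unallocated share: {unallocated_share:.0%}")
```

Tracking this one ratio over time gives a simple way to confirm that labeling fixes are actually moving spend into owned buckets.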
Approach
A two-week diagnostic followed by a structured remediation:
- Stand up GPU observability: DCGM Exporter feeding Prometheus, the Datadog GPU Fleet integration, and Splunk dashboards for per-pod utilization, memory pressure, and idle-node signals.
- Root-cause the unallocated spend: traced to a KEDA/Karpenter misconfiguration causing cyclic GPU node churn, CPU pools at 0.1% utilization, stale node groups at 0.001% utilization, and pods missing namespace/cost-center labels.
- Consolidate workloads: moved workloads from 200+ namespaces onto shared node pools with appropriate taints/tolerations and resource quotas; right-sized inference pools by p95 utilization.
- GPU time slicing: enabled multiple inference containers to share a single GPU without noticeable contention, deferring expensive scale-outs.
- Scale-to-zero between batches: KEDA ScaledObjects scale on queue-depth signals, so idle GPU pools drop to zero between jobs.
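The right-sizing and idle-pool steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method used on the engagement: the `PoolSample` structure, the 70% target utilization, the 1% idle threshold, and the proportional sizing rule are all hypothetical. The utilization samples are assumed to be per-interval averages of a GPU utilization gauge such as DCGM Exporter's `DCGM_FI_DEV_GPU_UTIL`.

```python
import math
from dataclasses import dataclass


@dataclass
class PoolSample:
    """Utilization history for one GPU node pool (hypothetical structure)."""
    name: str
    gpus: int               # GPUs currently provisioned in the pool
    util_pct: list[float]   # per-interval mean GPU utilization, 0-100


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile, using integer math to avoid float rounding."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


def recommend_gpus(pool: PoolSample, target_util: float = 70.0,
                   idle_below: float = 1.0) -> int:
    """Suggest a GPU count so that p95 load lands near target_util.

    Pools whose p95 utilization is below idle_below are treated as
    scale-to-zero candidates. Both thresholds are assumed values.
    """
    p95 = percentile(pool.util_pct, 95)
    if p95 < idle_below:
        return 0
    # Proportional rule: keep the same total work, run at the target utilization.
    return max(1, math.ceil(pool.gpus * p95 / target_util))


# Illustrative numbers only, not client data.
busy = PoolSample("inference-a", gpus=8,
                  util_pct=[20, 25, 30, 35, 40, 30, 25, 20, 35, 30])
stale = PoolSample("stale-group", gpus=4, util_pct=[0.0, 0.001, 0.0, 0.0])
for pool in (busy, stale):
    print(pool.name, pool.gpus, "->", recommend_gpus(pool))
```

Sizing on p95 rather than the mean leaves headroom for clinic-hour spikes while still exposing pools that never see meaningful load.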
Outcome
- ~$170–180K/month recovered in wasted GPU + compute spend
- 20%+ reduction in baseline Kubernetes infrastructure cost
- Allocated spend visibility went from ~30% to ~95% across product teams
- p99 inference latency held stable through migration (no SLO regression)
Stack
EKS, Karpenter, KEDA, NVIDIA Device Plugin + DCGM Exporter, Datadog (APM, GPU Fleet, Cost), Splunk, Harness CCM, Terraform.