GPU FinOps Audit: scope, deliverables, and the spreadsheet

Exactly what we look at in a two-week GPU FinOps audit, what we deliver, and the categories we use to attribute spend.

This is the audit template we use for GPU FinOps engagements. It runs two weeks, with a fixed scope and fixed deliverables.

Week 1 — Discovery and instrumentation

Read-only access required

  • Cloud cost data (AWS Cost Explorer, Azure Cost Management, or Harness CCM / Kubecost export)
  • Kubernetes API read access (cluster-wide view is enough)
  • Prometheus / Grafana read access
  • Read access to GitOps repo (Argo / Flux) for deployment manifests

What we instrument

If not already in place:

  • DCGM Exporter → Prometheus for per-pod GPU utilization
  • kube-state-metrics for label correlation
  • A shared Grafana dashboard for the audit period

We bring this stack up via Helm in under a day. If you don't adopt it, we remove it cleanly at the end of the engagement.
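As a rough illustration of what the instrumentation makes possible, the sketch below builds a PromQL query for average per-pod GPU utilization over the audit window and parses a Prometheus instant-query response. It assumes dcgm-exporter exposes `DCGM_FI_DEV_GPU_UTIL` with `namespace` and `pod` labels (depending on the scrape configuration these may appear as `exported_namespace` / `exported_pod` instead); the function names and the 14-day default are ours, chosen for illustration.

```python
from urllib.parse import urlencode

# Assumption: dcgm-exporter runs with Kubernetes pod mapping enabled, so each
# series carries `namespace` and `pod` labels. Adjust if your scrape config
# renames them.
GPU_UTIL_METRIC = "DCGM_FI_DEV_GPU_UTIL"


def build_query(window: str = "14d") -> str:
    """PromQL: mean GPU utilization per pod over the given window."""
    return f"avg by (namespace, pod) (avg_over_time({GPU_UTIL_METRIC}[{window}]))"


def query_url(base_url: str, window: str = "14d") -> str:
    """URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    params = urlencode({"query": build_query(window)})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_instant_vector(payload: dict) -> dict:
    """Map (namespace, pod) -> utilization percent from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error')}")
    result = {}
    for series in payload["data"]["result"]:
        labels = series["metric"]
        key = (labels.get("namespace", ""), labels.get("pod", ""))
        # Instant vectors carry [timestamp, "value-as-string"].
        result[key] = float(series["value"][1])
    return result
```

The URL can be fetched with any HTTP client that has read access to Prometheus; the parsing step is deliberately separate so it can be checked against a saved response.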

Week 2 — Analysis and remediation plan

Spend categorization

Every dollar of spend lands in one of these buckets:

Category | Definition
Productive | Workload running, utilization above threshold, attributable to a product / cost center
Allocated idle | Workload running, utilization below threshold (often headroom or over-provisioning)
Unallocated idle | Resources running, no workload (stale node groups, churn cycles)
Unattributed | Resources missing labels; owner unknown
Tax | Cluster overhead: control plane, observability, system pods
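The bucketing above can be sketched as a small classifier. This is a minimal illustration, not the audit's actual tooling: the `SpendItem` fields are hypothetical, the utilization threshold is left as a required parameter because it varies per engagement, and the order in which the rules are checked (overhead first, then missing workload, then missing labels) is an assumption for cases where a resource could match more than one bucket.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Bucket(Enum):
    PRODUCTIVE = "Productive"
    ALLOCATED_IDLE = "Allocated idle"
    UNALLOCATED_IDLE = "Unallocated idle"
    UNATTRIBUTED = "Unattributed"
    TAX = "Tax"


@dataclass
class SpendItem:  # hypothetical record shape, one row per billed resource
    cost_usd: float
    has_workload: bool
    utilization_pct: float
    cost_center: Optional[str]
    is_overhead: bool = False


def classify(item: SpendItem, threshold_pct: float) -> Bucket:
    """Assign one bucket per item. Rule precedence is an assumption."""
    if item.is_overhead:
        return Bucket.TAX
    if not item.has_workload:
        return Bucket.UNALLOCATED_IDLE
    if not item.cost_center:
        return Bucket.UNATTRIBUTED
    # Strictly above the threshold counts as productive; equal counts as idle.
    if item.utilization_pct > threshold_pct:
        return Bucket.PRODUCTIVE
    return Bucket.ALLOCATED_IDLE


def spend_by_bucket(items, threshold_pct: float) -> dict:
    """Sum cost per bucket so every dollar lands in exactly one category."""
    totals = {bucket: 0.0 for bucket in Bucket}
    for item in items:
        totals[classify(item, threshold_pct)] += item.cost_usd
    return totals
```

Because each item gets exactly one bucket, the bucket totals always sum to total spend, which is a useful sanity check against the cloud bill.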

What we look for

  • Stale node groups — utilization < 1% for the audit window
  • GPU node churn cycles — count of GPU node creations / deletions per day
  • CPU pools at < 5% utilization — over-provisioning
  • Pods missing cost-center labels — unattributable
  • Inference pools sized for peak, not p95 — over-provisioned by definition
  • GPU workloads that don’t saturate a full GPU — candidates for time slicing
  • Long-running pods with low utilization — candidates for scale-to-zero
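Three of the checks above are simple enough to sketch directly. The input shapes here are hypothetical (utilization samples per node group, a list of node lifecycle events, and pod records with a `labels` dict), and `cost-center` is an assumed label key. "Below 1% for the audit window" is read strictly, as every sample being under the threshold; an average-based reading would be a reasonable alternative.

```python
from collections import Counter
from datetime import datetime


def stale_node_groups(util_by_group: dict, threshold_pct: float = 1.0) -> list:
    """Node groups whose utilization stayed below the threshold for every sample.

    Groups with no samples are skipped rather than flagged.
    """
    return sorted(
        group
        for group, samples in util_by_group.items()
        if samples and max(samples) < threshold_pct
    )


def churn_per_day(events) -> dict:
    """Count GPU node creations and deletions per calendar day.

    `events` is an iterable of (iso_timestamp, kind) pairs; other kinds are ignored.
    """
    counts = Counter()
    for timestamp, kind in events:
        if kind in ("create", "delete"):
            day = datetime.fromisoformat(timestamp).date().isoformat()
            counts[day] += 1
    return dict(counts)


def pods_missing_label(pods, label: str = "cost-center") -> list:
    """Names of pods with no (or an empty) value for the attribution label."""
    return [p["name"] for p in pods if not p.get("labels", {}).get(label)]
```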

Deliverables

At end of week 2, you receive:

  1. A line-item findings spreadsheet — every issue with cost impact, severity, and remediation effort
  2. A prioritized remediation plan — quick wins (≤1 week), medium (≤1 month), longer projects
  3. A label and policy proposal — what to enforce via OPA / Kyverno to keep the savings
  4. A Grafana dashboard pack for ongoing monitoring
  5. One executive-summary slide for the cost story
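One way to picture how the findings spreadsheet feeds the remediation plan: each line item carries a cost impact and an effort estimate, and the effort estimate decides the tier. The sketch below is an assumption about mechanics, not a description of our spreadsheet; the `Finding` fields are hypothetical, and "1 week" and "1 month" are taken as 7 and 30 calendar days.

```python
from dataclasses import dataclass


@dataclass
class Finding:  # hypothetical shape of one spreadsheet row
    issue: str
    monthly_cost_usd: float
    severity: str
    effort_days: int


TIER_ORDER = {"quick win": 0, "medium": 1, "longer project": 2}


def tier(effort_days: int) -> str:
    """Map remediation effort to a plan tier (7 / 30 day cutoffs are assumptions)."""
    if effort_days <= 7:
        return "quick win"
    if effort_days <= 30:
        return "medium"
    return "longer project"


def prioritize(findings) -> list:
    """Order by tier first, then by largest monthly cost impact within a tier."""
    return sorted(
        findings,
        key=lambda f: (TIER_ORDER[tier(f.effort_days)], -f.monthly_cost_usd),
    )
```

Sorting by cost within each tier keeps the largest savings at the top of every section of the plan.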

Optional follow-on

  • 4–6 week remediation engagement to implement the prioritized plan
  • Monthly cost review with the platform team

Common findings (across engagements)

The Pareto we see most often:

  • 70% of unallocated spend is two or three large issues, not a hundred small ones
  • The biggest hidden cost is GPU node churn, not over-provisioning
  • The biggest preventable cost is missing labels — without attribution, no team owns the waste
  • Quick wins recover 50–60% of waste in the first month; the rest requires structural change (workload migration, app re-architecture)

What this is not

  • A cost-cutting hatchet job. We won’t recommend changes that risk SLOs.
  • A replacement for ongoing FinOps practice. The audit is the diagnosis. The team owns the cure.

Interested? Email us with a short summary of your environment and we’ll come back with a fixed quote.