
GPU FinOps Audit: scope, deliverables, and the spreadsheet

Exactly what we look at in a two-week GPU FinOps audit, what we deliver, and the categories we use to attribute spend.

This is the audit template we use for GPU FinOps engagements. It runs for two weeks, with a fixed scope and fixed deliverables.

Week 1 — Discovery and instrumentation

Access we need from you:

Cloud cost data (AWS Cost Explorer, Azure Cost Management, or Harness CCM / Kubecost export)
Kubernetes API read access (cluster-wide view is enough)
Prometheus / Grafana read access
Read access to GitOps repo (Argo / Flux) for deployment manifests

If not already in place:

We bring this stack up via Helm in under a day, and we remove it cleanly at the end of the engagement if you do not adopt it.

Every dollar of spend lands in one of these buckets:

Category	Definition
Productive	Workload running, utilization above threshold, attributable to a product / cost center
Allocated idle	Workload running, utilization below threshold (often headroom or over-provisioning)
Unallocated idle	Resources running, no workload (stale node groups, churn cycles)
Unattributed	Resources missing labels — unknown owner
Tax	Cluster overhead — control plane, observability, system pods
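As a rough illustration of how the categories above could be applied to per-resource cost records, here is a minimal Python sketch. The record fields, the 0.4 utilization threshold, and the order in which the rules are checked are all assumptions made for the example; the table does not fix any of them, and a real audit would tune them per environment.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    PRODUCTIVE = "Productive"
    ALLOCATED_IDLE = "Allocated idle"
    UNALLOCATED_IDLE = "Unallocated idle"
    UNATTRIBUTED = "Unattributed"
    TAX = "Tax"


@dataclass
class SpendRecord:
    """One cost line item. All field names are hypothetical."""
    cost_usd: float
    has_workload: bool       # is anything scheduled on the resource?
    utilization: float       # 0.0 to 1.0, averaged over the billing window
    has_owner_labels: bool   # product / cost-center labels present?
    is_system: bool          # control plane, observability, system pods


# Assumed value; the threshold is not specified in the category table.
UTILIZATION_THRESHOLD = 0.4


def classify(record: SpendRecord, threshold: float = UTILIZATION_THRESHOLD) -> Bucket:
    """Assign a record to exactly one bucket (rule order is an assumption)."""
    if record.is_system:
        return Bucket.TAX
    if not record.has_workload:
        return Bucket.UNALLOCATED_IDLE
    if not record.has_owner_labels:
        return Bucket.UNATTRIBUTED
    if record.utilization < threshold:
        return Bucket.ALLOCATED_IDLE
    return Bucket.PRODUCTIVE


def summarize(records: list[SpendRecord]) -> dict[Bucket, float]:
    """Total cost per bucket, so every dollar lands in one category."""
    totals = {bucket: 0.0 for bucket in Bucket}
    for record in records:
        totals[classify(record)] += record.cost_usd
    return totals
```

Checking the rules in a fixed order guarantees that each record lands in exactly one bucket, so the bucket totals always add up to total spend.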

At the end of week 2, you receive:

A line-item findings spreadsheet — every issue with cost impact, severity, and remediation effort
A prioritized remediation plan — quick wins (≤1 week), medium (≤1 month), longer projects
A label and policy proposal — what to enforce via OPA / Kyverno to keep the savings
A Grafana dashboard pack for ongoing monitoring
One executive-summary slide for the cost story
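To give a feel for what the label proposal checks, here is a small Python sketch that reports which required labels a Kubernetes manifest is missing. It is not an OPA or Kyverno policy, only the same rule written as a plain function. The label keys `team` and `cost-center` are hypothetical placeholders, and manifests are assumed to be already parsed into dictionaries.

```python
# Hypothetical required label keys; the real set comes out of the audit.
REQUIRED_LABELS = ("team", "cost-center")


def missing_labels(manifest: dict, required=REQUIRED_LABELS) -> list[str]:
    """Return the required label keys that are absent or empty in metadata.labels."""
    labels = (manifest.get("metadata") or {}).get("labels") or {}
    return [key for key in required if not labels.get(key)]


def unlabeled(manifests: list[dict], required=REQUIRED_LABELS) -> dict[str, list[str]]:
    """Map each non-compliant resource name to the labels it is missing."""
    report = {}
    for manifest in manifests:
        gaps = missing_labels(manifest, required)
        if gaps:
            name = (manifest.get("metadata") or {}).get("name", "<unnamed>")
            report[name] = gaps
    return report


if __name__ == "__main__":
    example = {
        "kind": "Deployment",
        "metadata": {"name": "trainer", "labels": {"team": "ml-platform"}},
    }
    print(unlabeled([example]))  # {'trainer': ['cost-center']}
```

Running a check like this against the manifests in the GitOps repo gives a first list of unattributed workloads before any admission policy is enforced.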

The Pareto we see most often:

70% of unallocated spend is two or three large issues, not a hundred small ones
The biggest hidden cost is GPU node churn, not over-provisioning
The biggest preventable cost is missing labels — without attribution, no team owns the waste
Quick wins recover 50–60% of waste in the first month; the rest requires structural change (workload migration, app re-architecture)
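The first observation above can be checked against any findings list with a few lines of Python: sort the issues by cost impact and count how many are needed to reach a given share of the total. The function name and the sample figures below are illustrative only and are not taken from a real engagement.

```python
def issues_to_reach_share(costs: list[float], share: float = 0.7) -> int:
    """Count how many of the largest issues are needed to cover `share` of total cost."""
    ranked = sorted(costs, reverse=True)
    total = sum(ranked)
    if total <= 0:
        return 0
    running = 0.0
    for count, cost in enumerate(ranked, start=1):
        running += cost
        if running >= share * total:
            return count
    return len(ranked)


if __name__ == "__main__":
    # Made-up monthly cost impact per finding, in USD.
    sample = [50_000, 30_000, 5_000, 5_000, 4_000, 3_000, 2_000, 1_000]
    print(issues_to_reach_share(sample))  # 2
```

If the count comes back as two or three on a list of dozens of findings, the remediation plan should start with those items.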

What the audit is not:

A cost-cutting hatchet job. We won’t recommend changes that risk SLOs.
A replacement for ongoing FinOps practice. The audit is the diagnosis. The team owns the cure.

Interested? Email us with a short summary of your environment and we’ll come back with a fixed quote.