
Champion / challenger model promotion that doesn't break inference SLOs

A safe-by-default pipeline for promoting models in production: alias-based rollouts, evaluation gates, and canary traffic splits.

#mlops #mlflow #kserve #drift

When a drift alert fires at 3 AM, you want the system to handle it. Not page someone.

This is the model promotion pipeline we run in production for real-time inference. It’s opinionated about three things: how models are named, how they’re evaluated, and how traffic moves to a new version.

The shape of the pipeline

drift detected (Evidently)
   ↓
retrain (Kubeflow)
   ↓
evaluate against @champion
   ↓ pass
promote @challenger (MLflow alias)
   ↓
build & sign image (model baked in)
   ↓
KServe revision created (Knative)
   ↓
canary 5% → 25% → 50% → 100%

Each arrow is automated. Each gate is opinionated.

Aliases, not version numbers

In MLflow, prefer aliases (@champion, @challenger, @previous) to version numbers in deployment configs.

Why: the alias is what changes when you promote. The version number is an implementation detail. If your KServe InferenceService references models:/risk-classifier@champion, promotion is a single MLflow API call — no YAML changes, no PR.

Convention:

  • @champion — currently serving 100% of traffic
  • @challenger — newly retrained, undergoing canary rollout
  • @previous — last good model, available for instant rollback
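
As a rough illustration of the convention above, here is a minimal sketch of the alias rotation in Python. It assumes `client` behaves like `mlflow.MlflowClient` (using its `get_model_version_by_alias` and `set_registered_model_alias` methods); the function name `promote_challenger` is hypothetical. Flipping @champion is still one call; the sketch spends a second call to keep @previous pointing at the outgoing model.

```python
def promote_challenger(client, model_name: str) -> dict:
    """Make @challenger the new @champion and keep the old champion as @previous.

    `client` is assumed to expose the MlflowClient alias methods:
    get_model_version_by_alias(name, alias) and
    set_registered_model_alias(name, alias, version).
    """
    champion = client.get_model_version_by_alias(model_name, "champion")
    challenger = client.get_model_version_by_alias(model_name, "challenger")

    # Record the rollback target before moving @champion.
    client.set_registered_model_alias(model_name, "previous", champion.version)
    # This is the actual promotion: one alias now points at a new version.
    client.set_registered_model_alias(model_name, "champion", challenger.version)

    return {"champion": challenger.version, "previous": champion.version}


# Hypothetical usage against a real registry:
#   from mlflow import MlflowClient
#   promote_challenger(MlflowClient(), "risk-classifier")
```

Rolling back would be the same idea in reverse: point @champion at whatever version @previous currently names.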

The evaluation gate is the most important part

Most pipelines fail here. Common bugs:

  • No baseline — evaluating a new model in isolation. You don’t know if it’s better.
  • Wrong eval data — using training distribution instead of recent production traffic
  • Single-metric optimization — accuracy goes up, p99 latency doubles, nobody notices

What we evaluate against @champion:

Metric | Threshold
Primary quality metric (e.g. AUROC) | New ≥ Old × 1.0
Calibration drift (PSI) | New < 0.2
p99 inference latency | New ≤ Old × 1.1
Memory footprint | New ≤ Old × 1.2

If any threshold fails, promotion is blocked. The pipeline exits with a clear failure reason and a link to the eval report.
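
A gate like this can be sketched as a pure function that compares two metric records and returns every failed check. The thresholds are the ones listed above; the `EvalMetrics` type, its field names, and the units (milliseconds, megabytes) are assumptions made for illustration, not our actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalMetrics:
    auroc: float           # primary quality metric
    psi: float             # calibration drift, as PSI
    p99_latency_ms: float  # assumed unit: milliseconds
    memory_mb: float       # assumed unit: megabytes


def gate_failures(new: EvalMetrics, old: EvalMetrics) -> list[str]:
    """Return a human-readable reason for every threshold the new model misses."""
    failures = []
    if new.auroc < old.auroc * 1.0:
        failures.append(f"AUROC {new.auroc:.4f} is below champion {old.auroc:.4f}")
    if new.psi >= 0.2:
        failures.append(f"PSI {new.psi:.3f} is not below 0.2")
    if new.p99_latency_ms > old.p99_latency_ms * 1.1:
        failures.append(
            f"p99 latency {new.p99_latency_ms:.1f} ms exceeds 1.1x champion "
            f"({old.p99_latency_ms:.1f} ms)"
        )
    if new.memory_mb > old.memory_mb * 1.2:
        failures.append(
            f"memory {new.memory_mb:.0f} MB exceeds 1.2x champion "
            f"({old.memory_mb:.0f} MB)"
        )
    return failures
```

An empty list means the promotion may proceed. Collecting all reasons, rather than stopping at the first failed check, keeps the failure message useful when several metrics regress at once.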

Canary stages, not all-or-nothing

Once promoted to @challenger and deployed as a new KServe revision, traffic ramps:

  • 5% for 30 minutes — verify no errors, latency stable
  • 25% for 2 hours — accumulate enough traffic for real metrics
  • 50% for 4 hours — confirm distribution shift didn’t break downstream consumers
  • 100%: @challenger becomes @champion; old @champion becomes @previous

At each stage, automated guardrails check error rate, p99 latency, and a sanity-grader on outputs. If any guardrail fires, traffic instantly reverts to 100% on the previous revision — Knative makes this a one-API-call operation.
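
The ramp-and-revert loop can be sketched independently of the serving stack. In the sketch below, `set_traffic` and `guardrails_ok` are hypothetical callables you would supply: the first might patch the `canaryTrafficPercent` field on the KServe InferenceService, the second would query your metrics backend. The stage durations come from the list above; the one-minute polling interval is an assumption.

```python
import time

# (canary traffic percent, soak time in seconds), matching the stages above
STAGES = ((5, 30 * 60), (25, 2 * 60 * 60), (50, 4 * 60 * 60))


def run_canary(set_traffic, guardrails_ok, stages=STAGES, poll_seconds=60,
               sleep=time.sleep, clock=time.monotonic) -> bool:
    """Ramp traffic through the stages; revert to 0% the moment a guardrail fires.

    Returns True if the canary reached 100%, False if it was rolled back.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    for percent, soak_seconds in stages:
        set_traffic(percent)
        deadline = clock() + soak_seconds
        while clock() < deadline:
            if not guardrails_ok():
                set_traffic(0)  # send all traffic back to the previous revision
                return False
            sleep(poll_seconds)
    set_traffic(100)
    return True
```

A caller would run the alias rotation only when this returns True, so a rolled-back canary leaves @champion untouched.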

What you get

  • Mean time from drift detected to fix in production: hours, not days
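  • Zero-downtime model rollouts: a model that fails the evaluation gate never reaches users, and a failed canary touches at most its current share of traffic before reverting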
  • Instant rollback — revert is one API call against @previous
  • Full traceability — every prediction tied to a specific model version, training data slice, and evaluation report

What this requires

  • Drift detection that’s actually trustworthy. We use Evidently with PSI / KS / KL on inputs and outputs, evaluated on a sliding window of recent traffic.
  • A retraining pipeline that’s reproducible. Kubeflow Pipelines, MLflow Tracking for runs, deterministic dataset slicing.
  • Canary infrastructure. KServe + Knative on Istio gives you traffic splitting at the service-mesh layer, with the rollback API at hand.
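
For readers who haven't met PSI, the statistic named in the drift-detection bullet and in the evaluation gate's 0.2 threshold, here is a minimal pure-Python sketch. It is not Evidently's implementation: the equal-width binning over the reference range, the bin count, and the `eps` floor for empty bins are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a current sample.

    PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are
    the fractions of each sample that fall into bin i.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range land in the first or last bin.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical samples give a PSI of zero, and the value grows as the current sample moves away from the reference.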

The full reference architecture is in our MLOps case study.