Mar 18, 2026

Champion / challenger model promotion that doesn't break inference SLOs

A safe-by-default pipeline for promoting models in production: alias-based rollouts, evaluation gates, and canary traffic splits.

#mlops #mlflow #kserve #drift

When a drift alert fires at 3 AM, you want the system to handle it. Not page someone.

This is the model promotion pipeline we run in production for real-time inference. It’s opinionated about three things: how models are named, how they’re evaluated, and how traffic moves to a new version.

The shape of the pipeline

drift detected (Evidently)
   ↓
retrain (Kubeflow)
   ↓
evaluate against @champion
   ↓ pass
promote @challenger (MLflow alias)
   ↓
build & sign image (model baked in)
   ↓
KServe revision created (Knative)
   ↓
canary 5% → 25% → 50% → 100%

Each arrow is automated. Each gate is opinionated.

Aliases, not version numbers

In MLflow, prefer aliases (@champion, @challenger, @previous) to version numbers in deployment configs.

Why: the alias is what changes when you promote. The version number is an implementation detail. If your KServe InferenceService references models:/risk-classifier@champion, promotion is a single MLflow API call — no YAML changes, no PR.

Convention:

@champion — currently serving 100% of traffic
@challenger — newly retrained, undergoing canary rollout
@previous — last good model, available for instant rollback
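The rotation implied by this convention can be sketched as below. This is a minimal illustration, not the pipeline's actual code: `promote_challenger` and `alias_version` are hypothetical helper names, and `client` is assumed to be an `mlflow.MlflowClient` (the three alias methods called on it are the ones that client exposes for registry aliases). Pointing @champion at a new version is itself a single call; the sketch adds the bookkeeping for @previous and @challenger around it.

```python
from typing import Any, Dict, Optional

MODEL_NAME = "risk-classifier"  # registered model name from the example URI above


def alias_version(client: Any, name: str, alias: str) -> Optional[str]:
    """Return the version an alias points at, or None if the alias is unset."""
    try:
        return str(client.get_model_version_by_alias(name, alias).version)
    except Exception:
        # Simplification: the client raises when the alias does not exist.
        # A production version would catch the specific exception type only.
        return None


def promote_challenger(client: Any, name: str = MODEL_NAME) -> Dict[str, Optional[str]]:
    """Rotate aliases: @champion -> @previous, @challenger -> @champion."""
    challenger = alias_version(client, name, "challenger")
    if challenger is None:
        raise RuntimeError(f"{name} has no @challenger to promote")

    champion = alias_version(client, name, "champion")
    if champion is not None:
        # Keep the outgoing model reachable for rollback.
        client.set_registered_model_alias(name, "previous", champion)

    client.set_registered_model_alias(name, "champion", challenger)
    client.delete_registered_model_alias(name, "challenger")
    return {"champion": challenger, "previous": champion}
```

In use, this would be called once the canary reaches 100%, for example `promote_challenger(MlflowClient())`, so deployment configs that reference the alias never need to change.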

The evaluation gate is the most important part

Most pipelines fail here. Common bugs:

No baseline: evaluating a new model in isolation, so you don’t know whether it’s better than what is already serving.
Wrong eval data: using the training distribution instead of recent production traffic.
Single-metric optimization: accuracy goes up, p99 latency doubles, nobody notices.

What we evaluate against @champion (New is the candidate, Old is the current @champion):

Metric	Threshold
Primary quality metric (e.g. AUROC)	New ≥ Old × 1.0
Calibration drift (PSI)	New < 0.2
p99 inference latency	New ≤ Old × 1.1
Memory footprint	New ≤ Old × 1.2

If any threshold fails, promotion is blocked. The pipeline exits with a clear failure reason and a link to the eval report.
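A gate like this can be sketched in a few lines. The thresholds below are the ones from the table; the `EvalResult` type, its field names, and `gate_failures` are illustrative assumptions, not the pipeline's real interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EvalResult:
    """Metrics for one model on the shared evaluation set (field names are illustrative)."""
    auroc: float           # primary quality metric
    psi: float             # calibration drift
    p99_latency_ms: float  # p99 inference latency
    memory_mb: float       # memory footprint


def gate_failures(new: EvalResult, old: EvalResult) -> List[str]:
    """Return one readable reason per failed threshold; an empty list means pass."""
    failures: List[str] = []
    if new.auroc < old.auroc * 1.0:
        failures.append(f"AUROC regressed: {new.auroc:.4f} < {old.auroc:.4f}")
    if not new.psi < 0.2:
        failures.append(f"calibration drift too high: PSI {new.psi:.3f} >= 0.2")
    if new.p99_latency_ms > old.p99_latency_ms * 1.1:
        failures.append(
            f"p99 latency {new.p99_latency_ms:.1f} ms exceeds 1.1x champion "
            f"({old.p99_latency_ms:.1f} ms)"
        )
    if new.memory_mb > old.memory_mb * 1.2:
        failures.append(
            f"memory {new.memory_mb:.0f} MB exceeds 1.2x champion ({old.memory_mb:.0f} MB)"
        )
    return failures
```

Returning every failed reason, rather than stopping at the first, is a deliberate choice in this sketch: the failure message can then list all regressions at once.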

Canary stages, not all-or-nothing

Once promoted to @challenger and deployed as a new KServe revision, traffic ramps:

5% for 30 minutes — verify no errors, latency stable
25% for 2 hours — accumulate enough traffic for real metrics
50% for 4 hours — confirm distribution shift didn’t break downstream consumers
100% — @challenger becomes @champion; old @champion becomes @previous

At each stage, automated guardrails check error rate, p99 latency, and a sanity-grader on outputs. If any guardrail fires, traffic instantly reverts to 100% on the previous revision — Knative makes this a one-API-call operation.
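The ramp and its guardrail check can be sketched as a small loop. The stage percentages and hold times come from the schedule above; `run_canary`, the injected callables, and the one-minute polling interval are assumptions made for illustration. With KServe, `set_traffic` would typically patch `spec.predictor.canaryTrafficPercent` on the InferenceService, but that wiring is left out here.

```python
import time
from typing import Callable, Sequence, Tuple

# (traffic percent, hold time in seconds); the final 100% stage has no hold.
STAGES: Tuple[Tuple[int, int], ...] = (
    (5, 30 * 60),
    (25, 2 * 3600),
    (50, 4 * 3600),
    (100, 0),
)


def run_canary(
    set_traffic: Callable[[int], None],
    guardrails_ok: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
    stages: Sequence[Tuple[int, int]] = STAGES,
    poll_seconds: int = 60,  # assumed polling interval, not from the post
) -> bool:
    """Ramp traffic through the stages; revert to 0% on the first guardrail failure."""
    for percent, hold in stages:
        set_traffic(percent)
        waited = 0
        while waited < hold:
            step = min(poll_seconds, hold - waited)
            sleep(step)
            waited += step
            if not guardrails_ok():
                # All traffic goes back to the previous revision.
                set_traffic(0)
                return False
    return True
```

A `True` return is the signal to rotate the aliases; a `False` return leaves @champion untouched.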

What you get

Mean time from drift detected to fix in production: hours, not days
Zero-downtime model rollouts: a candidate that fails the evaluation gate never serves traffic, and a failed canary is limited to its current traffic share
Instant rollback — revert is one API call against @previous
Full traceability — every prediction tied to a specific model version, training data slice, and evaluation report

What this requires

Drift detection that’s actually trustworthy. We use Evidently with PSI / KS / KL on inputs and outputs, evaluated on a sliding window of recent traffic.
A retraining pipeline that’s reproducible. Kubeflow Pipelines, MLflow Tracking for runs, deterministic dataset slicing.
Canary infrastructure. KServe + Knative on Istio gives you traffic splitting at the service-mesh layer, with the rollback API at hand.
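For readers unfamiliar with PSI, the statistic named in the drift-detection requirement can be sketched by hand. This is not Evidently's implementation; it is a plain equal-width-bin version written for illustration, and the bin count and the `eps` floor for empty bins are assumptions.

```python
import math
from typing import List, Sequence


def psi(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population Stability Index of `current` against `reference` (equal-width bins)."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when reference is constant

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the reference range into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

Here `reference` would be a baseline window of a feature or of model scores, and `current` the sliding window of recent traffic. Identical windows give a PSI of zero, and the value grows as the two distributions move apart.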

The full reference architecture is in our MLOps case study.