Champion / challenger model promotion that doesn't break inference SLOs
A safe-by-default pipeline for promoting models in production: alias-based rollouts, evaluation gates, and canary traffic splits.
When a drift alert fires at 3 AM, you want the system to handle it. Not page someone.
This is the model promotion pipeline we run in production for real-time inference. It’s opinionated about three things: how models are named, how they’re evaluated, and how traffic moves to a new version.
The shape of the pipeline
drift detected (Evidently)
↓
retrain (Kubeflow)
↓
evaluate against @champion
↓ pass
promote @challenger (MLflow alias)
↓
build & sign image (model baked in)
↓
KServe revision created (Knative)
↓
canary 5% → 25% → 50% → 100%
Each arrow is automated. Each gate is opinionated.
Aliases, not version numbers
In MLflow, prefer aliases (@champion, @challenger, @previous) to version numbers in deployment configs.
Why: the alias is what changes when you promote. The version number is an implementation detail. If your KServe InferenceService references models:/risk-classifier@champion, promotion is a single MLflow API call — no YAML changes, no PR.
Convention:
- @champion: currently serving 100% of traffic
- @challenger: newly retrained, undergoing canary rollout
- @previous: last good model, available for instant rollback
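A minimal sketch of the alias rotation this convention implies. It assumes `client` exposes the registry alias methods found on `mlflow.MlflowClient` in MLflow 2.3 and later (`get_model_version_by_alias`, `set_registered_model_alias`, `delete_registered_model_alias`). The function names `promote_challenger` and `rollback` are hypothetical, and clearing the challenger alias after promotion is an assumption rather than something the pipeline description specifies.

```python
def promote_challenger(client, model_name):
    """Rotate aliases after a successful canary: challenger -> champion -> previous.

    `client` is assumed to behave like mlflow.MlflowClient (MLflow 2.3+).
    Alias names are passed without the leading "@".
    """
    champion = client.get_model_version_by_alias(model_name, "champion")
    challenger = client.get_model_version_by_alias(model_name, "challenger")

    # Keep the outgoing champion reachable for rollback before moving the pointer.
    client.set_registered_model_alias(model_name, "previous", champion.version)
    client.set_registered_model_alias(model_name, "champion", challenger.version)

    # Assumption: free the challenger slot for the next retrain.
    client.delete_registered_model_alias(model_name, "challenger")
    return champion.version, challenger.version


def rollback(client, model_name):
    """Point the champion alias back at the last known-good version."""
    previous = client.get_model_version_by_alias(model_name, "previous")
    client.set_registered_model_alias(model_name, "champion", previous.version)
    return previous.version
```

In a real deployment you would pass `mlflow.MlflowClient()` as `client`. Taking the client as a parameter also lets the rotation be unit-tested against an in-memory fake.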
The evaluation gate is the most important part
Most pipelines fail here. Common bugs:
- No baseline — evaluating a new model in isolation. You don’t know if it’s better.
- Wrong eval data — using training distribution instead of recent production traffic
- Single-metric optimization — accuracy goes up, p99 latency doubles, nobody notices
What we evaluate against @champion (in the table, New is the retrained candidate and Old is the current @champion):
| Metric | Threshold |
|---|---|
| Primary quality metric (e.g. AUROC) | New ≥ Old × 1.0 |
| Calibration drift (PSI) | New < 0.2 |
| p99 inference latency | New ≤ Old × 1.1 |
| Memory footprint | New ≤ Old × 1.2 |
If any threshold fails, promotion is blocked. The pipeline exits with a clear failure reason and a link to the eval report.
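The gate can be sketched as a pure function over two metric records. The thresholds are taken from the table above. The `EvalMetrics` fields, their units, and the `evaluate_gate` name are assumptions made for illustration, and the link to the eval report is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalMetrics:
    auroc: float           # primary quality metric
    psi: float             # PSI as reported by the eval run
    p99_latency_ms: float  # p99 inference latency
    memory_mb: float       # memory footprint


def evaluate_gate(champion, challenger):
    """Return a list of failure reasons; an empty list means the gate passes."""
    failures = []
    if challenger.auroc < champion.auroc * 1.0:
        failures.append(
            f"AUROC regressed: {challenger.auroc:.4f} < {champion.auroc:.4f}")
    if challenger.psi >= 0.2:
        failures.append(f"PSI too high: {challenger.psi:.3f} >= 0.2")
    if challenger.p99_latency_ms > champion.p99_latency_ms * 1.1:
        failures.append(
            f"p99 latency regressed: {challenger.p99_latency_ms:.1f} ms "
            f"> 1.1 x {champion.p99_latency_ms:.1f} ms")
    if challenger.memory_mb > champion.memory_mb * 1.2:
        failures.append(
            f"memory footprint regressed: {challenger.memory_mb:.0f} MB "
            f"> 1.2 x {champion.memory_mb:.0f} MB")
    return failures
```

Returning every failed check, rather than stopping at the first, gives the failure message enough detail to act on without opening the full report.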
Canary stages, not all-or-nothing
Once promoted to @challenger and deployed as a new KServe revision, traffic ramps:
- 5% for 30 minutes: verify no errors, latency stable
- 25% for 2 hours: accumulate enough traffic for real metrics
- 50% for 4 hours: confirm the new model's output distribution hasn't broken downstream consumers
- 100%: @challenger becomes @champion; old @champion becomes @previous
At each stage, automated guardrails check error rate, p99 latency, and a sanity-grader on outputs. If any guardrail fires, traffic instantly reverts to 100% on the previous revision — Knative makes this a one-API-call operation.
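A rough sketch of the ramp-and-revert loop described above. The stage percentages and hold times come from the list. Everything else is an assumption: `set_traffic` and `guardrails_ok` are hypothetical callables standing in for whatever updates the traffic split and evaluates the guardrails, and the 60-second check interval is illustrative.

```python
import time

# (canary traffic percent, hold time in seconds), from the stages above
STAGES = [(5, 30 * 60), (25, 2 * 60 * 60), (50, 4 * 60 * 60)]


def run_canary(set_traffic, guardrails_ok, stages=STAGES,
               check_interval_s=60, sleep=time.sleep):
    """Ramp canary traffic stage by stage; revert on the first guardrail failure.

    set_traffic(percent) -- hypothetical: routes `percent` of traffic to the canary
    guardrails_ok()      -- hypothetical: True while error rate, p99 latency,
                            and the output sanity check are all healthy
    Returns True if the canary reached 100%, False if it was reverted.
    """
    for percent, hold_s in stages:
        set_traffic(percent)
        elapsed = 0
        while elapsed < hold_s:
            sleep(check_interval_s)
            elapsed += check_interval_s
            if not guardrails_ok():
                set_traffic(0)  # all traffic back to the previous revision
                return False
    set_traffic(100)
    return True
```

Injecting `sleep` keeps the loop testable without waiting hours. A `True` return is the signal to rotate the aliases so that @challenger becomes @champion.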
What you get
- Mean time from drift detected to fix in production: hours, not days
- Zero-downtime model rollouts: a failed promotion is caught at the gate or at a canary stage and reverted before it reaches most users
- Instant rollback: revert is one API call against @previous
- Full traceability: every prediction tied to a specific model version, training data slice, and evaluation report
What this requires
- Drift detection that’s actually trustworthy. We use Evidently with PSI / KS / KL on inputs and outputs, evaluated on a sliding window of recent traffic.
- A retraining pipeline that’s reproducible. Kubeflow Pipelines, MLflow Tracking for runs, deterministic dataset slicing.
- Canary infrastructure. KServe + Knative on Istio gives you traffic splitting at the service-mesh layer, with the rollback API at hand.
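For readers unfamiliar with PSI, which appears in both the drift detection and the evaluation gate, here is a standard-library sketch of the usual formula over equal-width bins. It is not Evidently's implementation. The bin count, the `eps` floor for empty bins, and the function name are assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a recent sample.

    Bin edges are equal-width over the range of `expected`; values of `actual`
    outside that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the log ratio stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Here `expected` would be a reference window, such as training data or traffic scored by @champion, and `actual` the sliding window of recent traffic. A result at or above 0.2 would trip the threshold used in the gate.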
The full reference architecture is in our MLOps case study.