
MLOps stack: zero to canary in 6 weeks

An opinionated, six-week plan to take a model from notebook to canary-rollout production on Kubernetes. It is designed for teams that already have a model worth shipping but no production discipline around it.

Week 1 — Container and registry hygiene

Wrap the model in a deterministic Docker image: pinned base image, dependencies installed from a lockfile generated from pyproject.toml, no pip install at runtime
Push to a registry with immutable tags (for example, ECR or ACR) and reference images by content-addressable digest
Sign images with Cosign, and enforce signature verification through a required admission policy
Add a CI pipeline with Trivy, Grype, and Semgrep scans as required checks
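
One cheap CI check that supports the digest rule above is to reject any image reference that is not pinned to a sha256 digest. The sketch below is a minimal illustration, not part of any scanner named above; the function names and the example registry host are hypothetical.

```python
from __future__ import annotations

import re

# A digest-pinned reference ends in "@sha256:" followed by 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference is pinned to a sha256 digest."""
    return bool(_DIGEST_RE.search(image_ref))


def unpinned_images(image_refs: list[str]) -> list[str]:
    """Return the references that rely on a mutable tag instead of a digest."""
    return [ref for ref in image_refs if not is_digest_pinned(ref)]
```

A CI step could collect every image reference from the deployment manifests and fail the build when `unpinned_images` returns anything.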

Week 2 — Inference service on Kubernetes

Deploy a minimal KServe InferenceService with a digest-pinned image
Set up an Istio ingress gateway: mTLS for in-mesh traffic, TLS for external traffic
Add liveness / readiness probes with timeouts that match real model warmup
Define resource requests and limits based on a load test, not a guess
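
The Week 2 items can be sketched as a small builder that emits a KServe InferenceService manifest as a Python dict. This is an illustration under assumptions: the `/healthz` path, port 8080, the probe timings, and the choice to set limits equal to requests are placeholders, and the resource values should come from your own load test. Which probe fields your KServe and Knative versions accept is worth verifying against your cluster.

```python
from __future__ import annotations


def build_inference_service(
    name: str, image: str, cpu: str, memory: str, warmup_seconds: int
) -> dict:
    """Build a KServe v1beta1 InferenceService manifest for a custom container."""
    if "@sha256:" not in image:
        raise ValueError("image must be pinned by digest, not by tag")

    # Assumption: limits equal requests; tune both from load-test results.
    resources = {
        "requests": {"cpu": cpu, "memory": memory},
        "limits": {"cpu": cpu, "memory": memory},
    }
    container = {
        "name": "kserve-container",
        "image": image,
        "resources": resources,
        "readinessProbe": {
            # Hypothetical health endpoint and port; use whatever the server exposes.
            "httpGet": {"path": "/healthz", "port": 8080},
            # Give the model its measured warmup time before the first check.
            "initialDelaySeconds": warmup_seconds,
            "periodSeconds": 10,
            "timeoutSeconds": 5,
        },
    }
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {"predictor": {"containers": [container]}},
    }
```

Serializing the returned dict with `json.dumps` yields a document that `kubectl apply -f` accepts, since JSON is valid YAML.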

Week 3 — Observability before scale

Before adding any traffic, instrument:

Prometheus + Grafana — inference latency p50 / p95 / p99, throughput, error rate
Jaeger / OpenTelemetry — distributed traces with span breakdown by stage
Loki / Splunk — structured logging with request IDs

If you don’t have a Grafana dashboard you’d be willing to put on a wall during launch, you’re not ready for traffic.
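
For the structured-logging item, a minimal standard-library sketch is shown here: each log line is a JSON object carrying a request ID, so logs can be joined with traces and metrics. The field names (`request_id`, `latency_ms`) and the logger name are assumptions, not a schema required by Loki or Splunk.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed through `extra=` become attributes on the record.
            "request_id": getattr(record, "request_id", None),
            "latency_ms": getattr(record, "latency_ms", None),
        }
        return json.dumps(payload)


def make_logger(stream=sys.stderr) -> logging.Logger:
    """Create an 'inference' logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("inference")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A request handler would then call something like `make_logger().info("prediction served", extra={"request_id": rid, "latency_ms": elapsed})`, where `rid` is the ID propagated from the incoming request.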

Week 4 — Model registry and promotion

Stand up MLflow for experiment tracking and registry
Adopt alias-based promotion (@champion, @challenger) — never deploy by version number
Build a promotion pipeline that bakes the promoted artifact into the inference image
Add evaluation gates against @champion — block promotion on quality, latency, or memory regressions
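
The evaluation gate can be as simple as a comparison of the challenger's metrics against the champion's. The sketch below assumes a single "higher is better" quality score and uses illustrative tolerances (no quality drop, at most 10% worse p99 latency or peak memory); the class, field names, and thresholds are hypothetical and should be set per model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    quality: float          # higher is better, e.g. AUC on a fixed holdout set
    p99_latency_ms: float   # lower is better
    peak_memory_mb: float   # lower is better


def gate_failures(
    champion: EvalResult,
    challenger: EvalResult,
    max_quality_drop: float = 0.0,
    max_latency_ratio: float = 1.10,
    max_memory_ratio: float = 1.10,
) -> list[str]:
    """Return the names of the checks the challenger fails; empty means promote."""
    failures = []
    if challenger.quality < champion.quality - max_quality_drop:
        failures.append("quality")
    if challenger.p99_latency_ms > champion.p99_latency_ms * max_latency_ratio:
        failures.append("latency")
    if challenger.peak_memory_mb > champion.peak_memory_mb * max_memory_ratio:
        failures.append("memory")
    return failures
```

The promotion pipeline would move the @champion alias only when `gate_failures` returns an empty list. With MLflow, moving an alias is a call to `MlflowClient.set_registered_model_alias`.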

Week 5 — Drift detection and automated retraining

Wire Evidently for input distribution monitoring (PSI / KS / KL)
Add output score drift monitoring with a sliding production window
Connect drift alerts to a Kubeflow retraining pipeline
Promotion still requires the eval gate from Week 4
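
To make the PSI check concrete, here is a standalone sketch of the Population Stability Index for one numeric feature. It does not use Evidently's API; it only illustrates the statistic. The equal-width binning over the reference range, the bin count, and the epsilon floor for empty bins are assumptions.

```python
from __future__ import annotations

import math


def psi(reference: list[float], current: list[float], bins: int = 10,
        eps: float = 1e-6) -> float:
    """Population Stability Index of `current` against `reference`."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for a constant feature

    def fractions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range fall into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref = fractions(reference)
    cur = fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift, but alert thresholds are best calibrated against your own historical windows.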

Week 6 — Canary rollout discipline

Configure Knative traffic splitting for staged rollouts (5% → 25% → 50% → 100%)
Define automated guardrails at each stage: error rate, p99, output sanity
Wire instant rollback on guardrail breach — @previous is one API call away
Practice the rollback in staging before you need it in production
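
The staged rollout and rollback logic above can be sketched as a small control loop. This is a simulation of the decision logic only: `set_traffic` and `observe` are hypothetical callbacks standing in for whatever updates the traffic split and queries metrics in your cluster, and the guardrail thresholds are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

STAGES = (5, 25, 50, 100)  # percent of traffic sent to the new revision


@dataclass(frozen=True)
class StageMetrics:
    error_rate: float
    p99_ms: float
    outputs_sane: bool


@dataclass(frozen=True)
class Guardrails:
    max_error_rate: float = 0.01   # placeholder threshold
    max_p99_ms: float = 500.0      # placeholder threshold

    def passed(self, m: StageMetrics) -> bool:
        return (
            m.error_rate <= self.max_error_rate
            and m.p99_ms <= self.max_p99_ms
            and m.outputs_sane
        )


def run_canary(
    set_traffic: Callable[[int], None],
    observe: Callable[[int], StageMetrics],
    guardrails: Guardrails,
) -> bool:
    """Advance through STAGES; roll back to 0% on the first guardrail breach."""
    for percent in STAGES:
        set_traffic(percent)
        if not guardrails.passed(observe(percent)):
            set_traffic(0)  # rollback: all traffic returns to the previous revision
            return False
    return True
```

Running this loop in staging with a deliberately failing `observe` is one way to rehearse the rollback path before it is needed in production.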

What “done” looks like

A drift alert fires
Retraining runs unattended
Eval gate passes
New revision rolls out at 5%
Guardrails pass at each stage
Traffic shifts to 100% within a few hours
Nobody got paged

If any one of those doesn’t work, you’re not done.

What this is not

A way to ship sloppy science. Quality of the underlying model is yours to own.
A substitute for human judgment on high-impact decisions. Auto-promotion is for routine retraining; novel model architectures should still go through review.