MLOps stack: zero to canary in 6 weeks
An opinionated, six-week plan to take a model from notebook to canary-rollout production on Kubernetes. Designed for teams that already have a model worth shipping but no production discipline around it.
Week 1 — Container and registry hygiene
- Wrap the model in a deterministic Docker image: pinned base image, `pyproject.toml` lock, no `pip install` at runtime
- Push to an immutable registry with content-addressable digests (ECR / ACR)
- Sign images with Cosign — make signature verification a required admission policy
- Add a CI pipeline with Trivy, Grype, and Semgrep scans as required checks
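One cheap CI check that backs up the "content-addressable digests" rule is to reject any image reference that is not pinned by a sha256 digest. The sketch below is a minimal illustration, not a replacement for Cosign verification or an admission policy; the function names and the example registry host are hypothetical.

```python
import re

# An OCI digest reference ends in "@sha256:" followed by 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference ends in an immutable sha256 digest."""
    return bool(_DIGEST_RE.search(image_ref))


def require_digest_pinned(image_ref: str) -> str:
    """Raise ValueError for tag-only references such as 'repo/model:latest'."""
    if not is_digest_pinned(image_ref):
        raise ValueError(f"image is not pinned by digest: {image_ref!r}")
    return image_ref


if __name__ == "__main__":
    # Hypothetical references, for illustration only.
    pinned = "registry.example.com/fraud-model@sha256:" + "0" * 64
    print(is_digest_pinned(pinned))
    print(is_digest_pinned("registry.example.com/fraud-model:latest"))
```

Running a check like this over every manifest in the repository catches a mutable tag before it reaches the cluster.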
Week 2 — Inference service on Kubernetes
- Deploy a minimal KServe `InferenceService` with a digest-pinned image
- Set up Istio IngressGateway with mTLS internal, TLS external
- Add liveness / readiness probes with timeouts that match real model warmup
- Define resource requests and limits based on a load test, not a guess
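To make "minimal" concrete, here is a sketch that builds a KServe `InferenceService` manifest as a plain Python dict, using a custom predictor container with a digest-pinned image and explicit resources. The service name, image reference, and resource figures are placeholders; the real numbers should come from your load test. Probes and Istio configuration are left out to keep the example short.

```python
import json


def build_inference_service(name: str, image: str, cpu: str, memory: str) -> dict:
    """Build a minimal KServe v1beta1 InferenceService with a custom container.

    Requests and limits are set to the same values here for simplicity; that is
    an assumption of this sketch, not a recommendation from the plan.
    """
    if "@sha256:" not in image:
        raise ValueError("image must be pinned by digest, not by tag")
    resources = {
        "requests": {"cpu": cpu, "memory": memory},
        "limits": {"cpu": cpu, "memory": memory},
    }
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "containers": [
                    {
                        "name": "kserve-container",
                        "image": image,
                        "resources": resources,
                    }
                ]
            }
        },
    }


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    manifest = build_inference_service(
        name="fraud-model",
        image="registry.example.com/fraud-model@sha256:" + "0" * 64,
        cpu="2",
        memory="4Gi",
    )
    print(json.dumps(manifest, indent=2))
```

The JSON output can be applied with `kubectl apply -f -`, since Kubernetes accepts JSON manifests as well as YAML.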
Week 3 — Observability before scale
Before adding any traffic, instrument:
- Prometheus + Grafana — inference latency p50 / p95 / p99, throughput, error rate
- Jaeger / OpenTelemetry — distributed traces with span breakdown by stage
- Loki / Splunk — structured logging with request IDs
If you don’t have a Grafana dashboard you’d be willing to put on a wall during launch, you’re not ready for traffic.
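For the structured-logging item, a minimal sketch with only the standard library: each log line is a JSON object carrying a request ID, so Loki or Splunk can join it with traces. The field names (`request_id`, `latency_ms`) and the logger name are assumptions; use whatever your log pipeline already indexes.

```python
import json
import logging
import sys
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "request_id": getattr(record, "request_id", None),
            "latency_ms": getattr(record, "latency_ms", None),
        }
        return json.dumps(payload)


def make_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("inference")
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_logger(sys.stdout)
    request_id = str(uuid.uuid4())
    log.info("prediction served", extra={"request_id": request_id, "latency_ms": 12.5})
```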
Week 4 — Model registry and promotion
- Stand up MLflow for experiment tracking and registry
- Adopt alias-based promotion (`@champion`, `@challenger`); never deploy by version number
- Build a promotion pipeline that bakes the promoted artifact into the inference image (see why)
- Add evaluation gates against `@champion`; block promotion on quality, latency, or memory regressions
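The evaluation gate can be as simple as a pure function that compares the challenger's metrics against the champion's and returns the list of regressions. This is a sketch under stated assumptions: the metric names, the "higher quality is better" convention, and the 10% latency and memory tolerances are all illustrative, and fetching the two models by alias from the registry is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EvalResult:
    quality: float          # higher is better (e.g. AUC on a fixed holdout set)
    p99_latency_ms: float   # lower is better
    peak_memory_mb: float   # lower is better


def gate_failures(
    champion: EvalResult,
    challenger: EvalResult,
    max_quality_drop: float = 0.0,
    max_latency_increase: float = 0.10,
    max_memory_increase: float = 0.10,
) -> List[str]:
    """Return the names of the checks the challenger fails; empty means promote."""
    failures = []
    if challenger.quality < champion.quality - max_quality_drop:
        failures.append("quality")
    if challenger.p99_latency_ms > champion.p99_latency_ms * (1 + max_latency_increase):
        failures.append("latency")
    if challenger.peak_memory_mb > champion.peak_memory_mb * (1 + max_memory_increase):
        failures.append("memory")
    return failures


if __name__ == "__main__":
    champion = EvalResult(quality=0.90, p99_latency_ms=100.0, peak_memory_mb=1000.0)
    challenger = EvalResult(quality=0.85, p99_latency_ms=150.0, peak_memory_mb=1000.0)
    print(gate_failures(champion, challenger))
```

Returning the failed check names rather than a bare boolean lets the pipeline log say why a promotion was blocked.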
Week 5 — Drift detection and automated retraining
- Wire Evidently for input distribution monitoring (PSI / KS / KL)
- Add output score drift monitoring with a sliding production window
- Connect drift alerts to a Kubeflow retraining pipeline
- Promotion still requires the eval gate from Week 4
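To show what the PSI check measures, here is a dependency-free sketch of the Population Stability Index for one numeric feature. It is not Evidently's implementation; the bin edges, the epsilon floor for empty bins, and the sample values are assumptions for illustration. A PSI of about 0.2 is a commonly quoted rule-of-thumb alert level, but the threshold is yours to tune.

```python
import bisect
import math
from typing import List, Sequence


def bin_proportions(values: Sequence[float], edges: Sequence[float], eps: float) -> List[float]:
    """Share of values per bin; `edges` must be sorted. Empty bins are floored at eps."""
    if not values:
        raise ValueError("cannot compute proportions of an empty sample")
    counts = [0] * (len(edges) + 1)
    for v in values:
        # bisect_right returns how many edges are <= v, which is the bin index.
        counts[bisect.bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(
    reference: Sequence[float],
    production: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """Population Stability Index: sum over bins of (p - r) * ln(p / r)."""
    ref = bin_proportions(reference, edges, eps)
    prod = bin_proportions(production, edges, eps)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


if __name__ == "__main__":
    edges = [0.25, 0.5, 0.75]
    training = [0.1, 0.3, 0.6, 0.9]
    shifted = [0.8, 0.85, 0.9, 0.95]
    print(psi(training, training, edges))  # identical samples give 0.0
    print(psi(training, shifted, edges) > 0.2)
```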
Week 6 — Canary rollout discipline
- Configure Knative traffic splitting for staged rollouts (5% → 25% → 50% → 100%)
- Define automated guardrails at each stage: error rate, p99, output sanity
- Wire instant rollback on guardrail breach: `@previous` is one API call away
- Practice the rollback in staging before you need it in production
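The staged rollout with guardrails can be sketched as a small control loop. Both callbacks are hypothetical stand-ins: `set_traffic` would call whatever shifts traffic in your cluster (for example, patching the canary traffic percentage on the serving resource), and `guardrails_ok` would query your metrics for error rate, p99, and output sanity. Soak time between stages is omitted.

```python
from typing import Callable, List, Sequence, Tuple

STAGES = (5, 25, 50, 100)  # percent of traffic sent to the new revision


def run_canary(
    set_traffic: Callable[[int], None],
    guardrails_ok: Callable[[int], bool],
    stages: Sequence[int] = STAGES,
) -> Tuple[bool, List[int]]:
    """Advance through the stages; on any guardrail breach, roll back to 0%.

    Returns (succeeded, list of traffic percentages applied, in order).
    """
    applied: List[int] = []
    for percent in stages:
        set_traffic(percent)
        applied.append(percent)
        if not guardrails_ok(percent):
            # Rollback: send all traffic back to the previous revision.
            set_traffic(0)
            applied.append(0)
            return False, applied
    return True, applied


if __name__ == "__main__":
    def fake_set_traffic(percent: int) -> None:
        print(f"canary traffic -> {percent}%")

    # Simulate a guardrail breach at the 25% stage.
    ok, history = run_canary(fake_set_traffic, lambda percent: percent < 25)
    print(ok, history)
```

Because the loop takes plain callables, the same function can be exercised in staging with fakes, which is one way to practice the rollback before it matters.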
What “done” looks like
- A drift alert fires
- Retraining runs unattended
- Eval gate passes
- New revision rolls out at 5%
- Guardrails pass at each stage
- 100% traffic shifts within a few hours
- Nobody got paged
If any one of those doesn’t work, you’re not done.
What this is not
- A way to ship sloppy science. Quality of the underlying model is yours to own.
- A substitute for human judgment on high-impact decisions. Auto-promotion is for routine retraining; novel model architectures should still go through review.