Production RAG: KServe + Knative + Istio + champion/challenger MLflow
End-to-end MLOps stack for real-time RAG inference at a Fortune 100 healthcare AI program — full lifecycle from experiment tracking to canary rollout on drift.
Context
A real-time clinical RAG pipeline needed production-grade ML lifecycle management — experiment tracking, automated promotion, drift monitoring, and zero-touch rollouts. Existing infrastructure had no model registry discipline; deployments were manual and risky.
Problem
- Models were promoted by hand, often without grading against a baseline
- No drift monitoring — silent quality regressions went undetected for weeks
- Inference pods pulled artifacts from MLflow at startup, creating a runtime dependency on a non-production-grade service
- No safe rollout — every release was effectively all-or-nothing
Approach
Built a complete serverless ML serving stack on Kubernetes:
- Service mesh: Istio 1.29 with VirtualService traffic splitting, DestinationRule subsets, IngressGateway, and mTLS between services
- Serverless serving: Knative 1.21 for scale-to-zero, revision management, and traffic-percentage canary rollouts
- Model serving: KServe 0.16 InferenceService in both Serverless and RawDeployment modes depending on workload latency profile
- Model lifecycle: MLflow Model Registry with @champion/@challenger alias promotion. Promoted model.pkl artifacts are baked into versioned Docker images, eliminating the runtime MLflow dependency in production pods.
- Drift monitoring: Evidently with PSI, KL divergence, and the Kolmogorov-Smirnov (KS) test on input distributions and output scores. Drift alerts feed an automated Kubeflow retraining pipeline.
- Auto-promotion: Each retrained model is evaluated against @champion; if it wins, it is promoted via MLflow alias and deployed as a new KServe revision with a 5% canary traffic split.
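The drift check and promotion gate described in the list can be sketched in a few lines. This is a minimal pure-Python illustration and not the project's code: it computes PSI from scratch instead of calling Evidently, and the 0.2 alert threshold, the function names (psi, traffic_split), and the evaluation scores are all assumptions made for the example. Only the 5% canary share comes from the text.

```python
import math
from typing import Dict, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` relative to `reference`."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def traffic_split(champion_score: float, challenger_score: float,
                  canary_percent: int = 5) -> Dict[str, int]:
    """Give the challenger a canary share only if it beats the champion."""
    if challenger_score > champion_score:
        return {"champion": 100 - canary_percent, "challenger": canary_percent}
    return {"champion": 100, "challenger": 0}


# Hypothetical data: a uniform reference sample and an upward-shifted current one.
reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]

PSI_ALERT = 0.2  # assumed threshold (a common rule of thumb, not from the source)
drifted = psi(reference, shifted) > PSI_ALERT

# Hypothetical evaluation scores for the current champion and the retrained model.
split = traffic_split(champion_score=0.81, challenger_score=0.84)
```

In the stack described here, the resulting percentages would correspond to the traffic split on the new KServe revision, with the champion keeping the remainder until the canary proves itself.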
Outcome
- Mean time from “drift detected” to “remediated model in canary” dropped from days to hours
- Zero production incidents from model rollouts after launch
- Inference SLOs (p50 < 80ms, p99 < 400ms) held through every promotion
- Full traceability: every prediction can be tied to a specific model version, training data slice, and evaluation run
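As an illustration of what checking the latency SLOs during a promotion could look like, the sketch below tests a window of request latencies against the p50 < 80ms and p99 < 400ms targets. It is a simplified stand-in, assuming raw latency samples are available and using nearest-rank percentiles; a production setup would more likely query Prometheus histograms. The function names and sample values are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(latencies_ms: Sequence[float],
              p50_limit_ms: float = 80.0,
              p99_limit_ms: float = 400.0) -> bool:
    """True when both latency targets hold for the observed window."""
    return (percentile(latencies_ms, 50) < p50_limit_ms
            and percentile(latencies_ms, 99) < p99_limit_ms)


# Hypothetical canary windows: one healthy, one with a slow tail.
healthy = [50.0] * 98 + [300.0, 350.0]
slow_tail = [50.0] * 98 + [450.0, 500.0]

healthy_ok = meets_slo(healthy)
slow_tail_ok = meets_slo(slow_tail)
```

A check of this kind could gate each increase in canary traffic, so that a challenger with a degraded tail latency never receives the full load.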
Observability
Prometheus + Grafana for inference latency (p50/p95/p99), throughput, and queue depth. Jaeger for distributed traces. Kiali for service-mesh visualization. Langfuse for LLM-level observability — prompt traces, token usage, cost per request.
Stack
EKS / AKS, Istio, Knative, KServe, MLflow, Kubeflow, Evidently, Feast, Prometheus, Grafana, Jaeger, Kiali, Langfuse, Argo CD, Flux.