2026

Production RAG: KServe + Knative + Istio + champion/challenger MLflow

End-to-end MLOps stack for real-time RAG inference at a Fortune 100 healthcare AI program — full lifecycle from experiment tracking to canary rollout on drift.

EKS, AKS, KServe, Knative, Istio, MLflow, Kubeflow, Evidently

Context

A real-time clinical RAG pipeline needed production-grade ML lifecycle management — experiment tracking, automated promotion, drift monitoring, and zero-touch rollouts. Existing infrastructure had no model registry discipline; deployments were manual and risky.

Problem

Models were promoted by hand, often without grading against a baseline
No drift monitoring — silent quality regressions went undetected for weeks
Inference pods pulled artifacts from MLflow at startup, creating a runtime dependency on a non-production-grade service
No safe rollout — every release was effectively all-or-nothing

Approach

Built a complete serverless ML serving stack on Kubernetes:

Service mesh: Istio 1.29 with VirtualService traffic splitting, DestinationRule subsets, IngressGateway, and mTLS between services
Serverless serving: Knative 1.21 for scale-to-zero, revision management, and traffic-percentage canary rollouts
Model serving: KServe 0.16 InferenceService in Serverless or RawDeployment mode, chosen per workload latency profile
Model lifecycle: MLflow Model Registry with @champion / @challenger alias promotion. Promoted model.pkl artifacts are baked into versioned Docker images — eliminating runtime MLflow dependency in production pods.
Drift monitoring: Evidently with PSI, KL divergence, and the Kolmogorov-Smirnov (KS) test on input distributions and output scores. Drift alerts trigger an automated Kubeflow retraining pipeline.
Auto-promotion: The retrained model is evaluated against @champion; if it scores better, it is promoted via MLflow alias and deployed as a new KServe revision with a 5% canary traffic split.
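
The drift check and the alias promotion described above can be sketched in a few lines of Python. This is a minimal illustration, not the production code: in the real stack Evidently computes the drift statistics, and the `psi` function here is a hand-rolled stand-in for the PSI formula. The 0.2 alert threshold, the metric name `answer_f1`, the model name `rag-reranker`, and the rule "higher metric on the source run wins" are all assumptions made for the example. `RegistryClient` lists the three `MlflowClient` methods the sketch relies on, so a real `mlflow.tracking.MlflowClient` instance could be passed in.

```python
import math
from typing import Protocol, Sequence

PSI_ALERT = 0.2  # assumption: a common rule-of-thumb threshold, not from the project


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def fractions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            # Clamp so live values outside the reference range land in an edge bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps log() defined when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


class RegistryClient(Protocol):
    """The subset of mlflow.tracking.MlflowClient used by promote_if_better."""

    def get_model_version_by_alias(self, name: str, alias: str): ...
    def get_run(self, run_id: str): ...
    def set_registered_model_alias(self, name: str, alias: str, version: str) -> None: ...


def promote_if_better(client: RegistryClient, model_name: str, metric: str,
                      min_gain: float = 0.0) -> bool:
    """Move the 'champion' alias to the challenger version if it scores higher.

    Assumes the evaluation metric was logged on the run that produced each
    model version and that higher is better (both are illustrative choices).
    """
    champion = client.get_model_version_by_alias(model_name, "champion")
    challenger = client.get_model_version_by_alias(model_name, "challenger")
    champ_score = client.get_run(champion.run_id).data.metrics[metric]
    chall_score = client.get_run(challenger.run_id).data.metrics[metric]
    if chall_score - champ_score <= min_gain:
        return False  # champion keeps its alias; nothing is deployed
    client.set_registered_model_alias(model_name, "champion", challenger.version)
    return True
```

In this sketch a feature whose `psi(reference, live)` exceeds `PSI_ALERT` would trigger retraining, and a `True` result from `promote_if_better(client, "rag-reranker", "answer_f1")` would be the signal to build the versioned image and roll it out behind the 5% canary split. Injecting the client keeps the decision logic testable without a running MLflow server.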

Outcome

Mean time from “drift detected” to “remediated model in canary” dropped from days to hours
Zero production incidents from model rollouts after launch
Inference SLOs (p50 < 80ms, p99 < 400ms) held through every promotion
Full traceability: every prediction can be tied to a specific model version, training data slice, and evaluation run
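
As an illustration of the latency SLOs above, a canary gate could compare observed percentiles against the two budgets before shifting more traffic. The sketch below is an assumption about how such a check might look; the source does not describe the actual gate, and in this stack the percentiles would more likely come from Prometheus histograms than from raw samples. The function names are hypothetical.

```python
import statistics
from typing import Dict, Sequence

# Budgets taken from the SLOs stated above, in milliseconds.
P50_BUDGET_MS = 80.0
P99_BUDGET_MS = 400.0


def latency_percentiles(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return p50 and p99 of a latency sample (needs at least two points)."""
    # quantiles(n=100) yields 99 cut points; index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


def meets_slo(samples_ms: Sequence[float]) -> bool:
    """True only when both percentiles are strictly under their budgets."""
    p = latency_percentiles(samples_ms)
    return p["p50"] < P50_BUDGET_MS and p["p99"] < P99_BUDGET_MS
```

A canary whose recent request latencies fail `meets_slo` would be held at its current traffic share or rolled back rather than promoted further.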

Observability

Prometheus + Grafana for inference latency (p50/p95/p99), throughput, and queue depth. Jaeger for distributed traces. Kiali for service-mesh visualization. Langfuse for LLM-level observability — prompt traces, token usage, cost per request.

Stack

EKS / AKS, Istio, Knative, KServe, MLflow, Kubeflow, Evidently, Feast, Prometheus, Grafana, Jaeger, Kiali, Langfuse, Argo CD, Flux.