Why we bake model.pkl into Docker images instead of pulling from MLflow at runtime
MLflow is great for experiment tracking and registry. It's not great as a runtime dependency for production inference pods.
A pattern we see repeatedly when teams adopt MLflow:
# In the inference container, at startup
import mlflow.pyfunc
name = "risk-classifier"  # registered model name
model = mlflow.pyfunc.load_model(f"models:/{name}@champion")
It looks clean. It uses the registry. The model name and alias are managed via the UI. Teams ship it.
Then production happens.
What goes wrong
MLflow becomes a runtime dependency for every inference pod. Every cold start hits the MLflow tracking server. Every horizontal scale-up hits the artifact store. If MLflow runs on a t3.medium because “it’s just the registry,” that undersized instance is now on your inference critical path.
Failure modes we’ve seen in production:
- An MLflow service restart causes a brief wave of 503s from every inference pod that scales up during the outage
- Artifact store throttling under sudden scale (Black Friday, clinic open, market open)
- Network policy changes accidentally block egress from the inference namespace to MLflow — pods come up healthy, fail at first request
- A bad migration on the MLflow backing database makes every new pod fail to start
You also pay an SRE tax:
- Either MLflow needs to be on-call-grade reliable, with PostgreSQL HA, S3 / blob redundancy, and observability
- Or you accept that inference pods can fail to start when MLflow is down — which means MLflow downtime is inference downtime
The pattern we use instead
Bake the promoted model artifact into a versioned Docker image at promotion time.
FROM kserve/kserve-runtime:0.16
# Bake the model in at build time — copied from MLflow during the promotion job
COPY model.pkl /mnt/models/model.pkl
ENV MODEL_NAME="risk-classifier"
ENV MODEL_VERSION="2026-04-12-a1b2c3d"
The promotion pipeline:
- Trigger fires on @challenger → @champion alias change in MLflow
- Pipeline downloads model.pkl from MLflow (one-time, in CI)
- Builds and signs a Docker image with the artifact baked in
- Pushes to the immutable registry with a digest-pinned tag
- KServe InferenceService is updated to point at the new digest
- Knative rolls a new revision; canary traffic split begins
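The CI side of these steps can be sketched roughly as follows. This is a minimal sketch, not our exact pipeline: the helper names (`image_tag`, `render_dockerfile`, `build_commands`), the example repository path, and the choice to derive the tag suffix from the artifact's SHA-256 are all assumptions made for illustration. It assumes `model.pkl` has already been downloaded from MLflow into the build context, and it leaves out signing because that depends on your tooling.

```python
import hashlib
from datetime import date
from pathlib import Path


def image_tag(artifact: Path, built_on: date) -> str:
    """Return a tag such as '2026-04-12-a1b2c3d': build date plus 7 hex chars of the artifact's SHA-256."""
    digest = hashlib.sha256(artifact.read_bytes()).hexdigest()
    return f"{built_on.isoformat()}-{digest[:7]}"


def render_dockerfile(base_image: str, model_name: str, model_version: str) -> str:
    """Render the Dockerfile shown above with the name and version filled in."""
    return "\n".join([
        f"FROM {base_image}",
        "COPY model.pkl /mnt/models/model.pkl",
        f'ENV MODEL_NAME="{model_name}"',
        f'ENV MODEL_VERSION="{model_version}"',
        "",
    ])


def build_commands(repo: str, tag: str, context_dir: str) -> list[list[str]]:
    """Return docker build and push invocations as argv lists (hypothetical repo path)."""
    ref = f"{repo}:{tag}"
    return [
        ["docker", "build", "-t", ref, context_dir],
        ["docker", "push", ref],
    ]
```

Each argv list can be passed to `subprocess.run(cmd, check=True)`. The value to deploy is the digest the registry reports after the push, not the tag.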
MLflow is no longer in the request path. Inference pods need only:
- The container image (already pulled to the node)
- The model artifact (already inside the image)
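On the pod side, startup reduces to a local file read. A minimal sketch, assuming the artifact is a pickle at `/mnt/models/model.pkl` as in the Dockerfile above; the `MODEL_DIR` override and the `load_baked_model` helper are hypothetical names for illustration, not part of KServe or MLflow.

```python
import os
import pickle
from pathlib import Path
from typing import Any, Optional


def load_baked_model(model_dir: Optional[str] = None) -> Any:
    """Load model.pkl from the image filesystem; no network calls involved."""
    # MODEL_DIR is a hypothetical override; the default matches the Dockerfile's COPY target.
    base = Path(model_dir or os.environ.get("MODEL_DIR", "/mnt/models"))
    path = base / "model.pkl"
    if not path.is_file():
        # Fail at startup, not at first request, if the image was built without the artifact.
        raise FileNotFoundError(f"baked model not found at {path}")
    with path.open("rb") as f:
        # Only unpickle artifacts you built yourself: pickle can execute arbitrary code.
        return pickle.load(f)
```

Calling this once at process start means a missing artifact surfaces as a failed readiness check rather than a failed request.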
Other things this unlocks
- Faster cold starts — no S3 round-trip on pod start
- Air-gapped deployments — model travels with the image
- Trivial rollback — revert to the previous image digest. KServe handles the traffic shift
- Reproducible deploys — image digest is the model version
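Because the digest is the model version, promotion and rollback are the same operation with a different digest. A rough sketch, assuming the InferenceService uses a custom predictor container under `spec.predictor.containers` and KServe's `canaryTrafficPercent` field; the `pin_image` helper, the repository path, and the digests are made up for illustration.

```python
import copy
import re
from typing import Optional

_DIGEST = re.compile(r"^sha256:[0-9a-f]{64}$")


def pin_image(manifest: dict, repo: str, digest: str,
              canary_percent: Optional[int] = None) -> dict:
    """Return a copy of the manifest with the predictor image pinned to repo@digest."""
    if not _DIGEST.match(digest):
        # Refuse mutable references such as tags; only content digests are allowed.
        raise ValueError(f"not a sha256 digest: {digest!r}")
    updated = copy.deepcopy(manifest)
    predictor = updated["spec"]["predictor"]
    predictor["containers"][0]["image"] = f"{repo}@{digest}"
    if canary_percent is None:
        # No canary split requested: remove the field from the spec.
        predictor.pop("canaryTrafficPercent", None)
    else:
        predictor["canaryTrafficPercent"] = canary_percent
    return updated
```

A promotion would call `pin_image(current, repo, new_digest, canary_percent=10)`, and a rollback would call it again with the previous digest. Applying the resulting manifest to the cluster is left to whatever deployment tooling you already use.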
When not to do this
- Very large models that would balloon the image into the multi-GB range, driving up both storage costs and pull times. Use a sidecar or init container to pull from a local cache instead, but treat the artifact source as production-grade.
- Models with frequent fine-tuning, where image rebuild cost outweighs the runtime decoupling benefit. If you are in that situation, you should probably rethink the deployment cadence anyway.
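For the large-model case, one way to treat the artifact source as production-grade is to verify what the init container fetched before the server starts. A minimal sketch, assuming the cache is a path mounted into the init container and the expected SHA-256 was recorded at promotion time; `sha256_of`, `stage_model`, and their parameters are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so multi-GB artifacts are never held in memory at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def stage_model(cache_path: Path, dest_dir: Path, expected_sha256: str) -> Path:
    """Copy the artifact from the local cache into dest_dir, refusing a checksum mismatch."""
    actual = sha256_of(cache_path)
    if actual != expected_sha256:
        raise ValueError(f"checksum mismatch for {cache_path}: got {actual}")
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / cache_path.name
    shutil.copyfile(cache_path, dest)
    return dest
```

Exiting non-zero on a mismatch keeps the pod from starting with the wrong model, which preserves the "digest is the version" guarantee even though the artifact no longer lives in the image.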
Bottom line
MLflow is excellent at what it’s designed for — experiment tracking and registry. Don’t extend it into your runtime.
Bake the artifact at promotion. Let your inference pods serve.
For a full reference architecture using KServe, Knative, and Istio with this pattern, see our MLOps stack case study.