Apr 12, 2026

Why we bake model.pkl into Docker images instead of pulling from MLflow at runtime

MLflow is great for experiment tracking and registry. It's not great as a runtime dependency for production inference pods.

#mlops #mlflow #kserve #kubernetes

A pattern we see repeatedly when teams adopt MLflow:

# In the inference container, at startup (name is the registered model name)
import mlflow.pyfunc
model = mlflow.pyfunc.load_model(f"models:/{name}@champion")

It looks clean. It uses the registry. The model name and alias are managed via the UI. Teams ship it.

Then production happens.

What goes wrong

MLflow becomes a runtime dependency for every inference pod. Every cold start hits the MLflow tracking server. Every horizontal scale-up hits the artifact store. If MLflow runs on a t3.medium because “it’s just the registry,” that t3.medium is now on your inference critical path.

Failure modes we’ve seen in production:

An MLflow service restart causes a brief wave of 503s from every inference pod that scales up during the outage
Artifact store throttling under sudden scale (Black Friday, clinic open, market open)
Network policy changes accidentally block egress from the inference namespace to MLflow — pods come up healthy, fail at first request
A bad migration on the MLflow backing database makes every new pod fail to start

You also pay an SRE tax:

Either MLflow has to be on-call-grade reliable, with PostgreSQL HA, S3 / blob redundancy, and observability
Or you accept that inference pods can fail to start when MLflow is down, which means MLflow downtime is inference downtime

The pattern we use instead

Bake the promoted model artifact into a versioned Docker image at promotion time.

FROM kserve/kserve-runtime:0.16

# Bake the model in at build time — copied from MLflow during the promotion job
COPY model.pkl /mnt/models/model.pkl

ENV MODEL_NAME="risk-classifier"
ENV MODEL_VERSION="2026-04-12-a1b2c3d"
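With the artifact inside the image, the serving code can load it from local disk without any MLflow client. A minimal sketch, assuming the model is a plain pickle at the path used in the COPY line above; the function name load_baked_model and the info dict are illustrative, not part of KServe:

```python
import os
import pickle
from pathlib import Path

# Matches the COPY destination in the Dockerfile above.
DEFAULT_MODEL_PATH = Path("/mnt/models/model.pkl")


def load_baked_model(path: Path = DEFAULT_MODEL_PATH):
    """Load the model from the image filesystem; no network call is made."""
    with path.open("rb") as fh:
        # Only unpickle artifacts you built yourself: pickle can execute code.
        model = pickle.load(fh)
    # Hypothetical metadata, handy for logging which build is serving.
    info = {
        "name": os.environ.get("MODEL_NAME", "unknown"),
        "version": os.environ.get("MODEL_VERSION", "unknown"),
    }
    return model, info
```

Calling load_baked_model() once at process start keeps the startup path limited to a local file read.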

The promotion pipeline:

Trigger fires on @challenger → @champion alias change in MLflow
Pipeline downloads model.pkl from MLflow (one-time, in CI)
Builds and signs a Docker image with the artifact baked in
Pushes the image to the immutable registry and records its digest
KServe InferenceService is updated to point at the new digest
Knative rolls a new revision; canary traffic split begins
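The download, build, and push steps above can be sketched as a small CI script. Treat this as an illustration under assumptions rather than our exact pipeline: the registry path is hypothetical, the Docker CLI is assumed to be available on the CI runner, the MLflow version is assumed to accept alias URIs in mlflow.artifacts.download_artifacts, and signing and the KServe update are left out.

```python
import subprocess
from pathlib import Path

IMAGE_REPO = "registry.example.com/ml/risk-classifier"  # hypothetical registry path


def download_champion(name: str, build_dir: Path) -> Path:
    """One-time pull from MLflow, executed in CI rather than in the pod."""
    import mlflow.artifacts  # imported lazily: only the CI job needs MLflow

    local = mlflow.artifacts.download_artifacts(
        artifact_uri=f"models:/{name}@champion", dst_path=str(build_dir)
    )
    # Local path of the downloaded model files; model.pkl lives under it.
    return Path(local)


def image_tag(date: str, git_sha: str) -> str:
    """Tag format mirrors MODEL_VERSION in the Dockerfile."""
    return f"{date}-{git_sha[:7]}"


def build_and_push_commands(tag: str, context: str = ".") -> list[list[str]]:
    """Return the Docker CLI invocations for the build, push, and digest lookup."""
    ref = f"{IMAGE_REPO}:{tag}"
    return [
        ["docker", "build", "-t", ref, context],
        ["docker", "push", ref],
        # Prints repo@sha256:...; deploy that digest, not the tag.
        ["docker", "inspect", "--format", "{{index .RepoDigests 0}}", ref],
    ]


def run_all(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        runner(argv, check=True)
```

Keeping the commands as plain argument lists makes the pipeline easy to dry-run: pass a logging function as runner to see what would execute.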

MLflow is no longer in the request path. Inference pods need only:

The container image (pulled once per node, then served from the node's image cache)
The model artifact (already inside the image)

Other things this unlocks

Faster cold starts — no S3 round-trip on pod start
Air-gapped deployments — model travels with the image
Trivial rollback — revert to the previous image digest. KServe handles the traffic shift
Reproducible deploys — image digest is the model version
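Because the digest is the model version, a rollback is just a patch that points the predictor back at the previous digest. A sketch of the patch body, assuming a custom-container predictor in a KServe v1beta1 InferenceService; the helper names, the container name, and the digests in the usage are placeholders:

```python
def predictor_patch(image_repo: str, digest: str, canary_percent=None) -> dict:
    """Build a merge-patch body that pins the predictor to an image digest."""
    if not digest.startswith("sha256:"):
        raise ValueError("expected a sha256 digest, not a tag")
    predictor = {
        "containers": [
            {"name": "kserve-container", "image": f"{image_repo}@{digest}"}
        ]
    }
    if canary_percent is not None:
        # Share of traffic sent to the new revision during a canary rollout.
        predictor["canaryTrafficPercent"] = canary_percent
    return {"spec": {"predictor": predictor}}


def rollback_patch(image_repo: str, history: list) -> dict:
    """history is ordered oldest to newest; the last entry is currently live."""
    if len(history) < 2:
        raise ValueError("no previous digest to roll back to")
    return predictor_patch(image_repo, history[-2])
```

The resulting dict can be serialized to JSON and applied with a merge patch, for example through kubectl patch with --type merge.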

When not to do this

Very large models that would balloon the image into the multi-GB range, driving up both storage costs and pull times. Use a sidecar or init container to pull from a local cache instead, but treat the artifact source as production-grade.
Models with frequent fine-tuning, where the image rebuild cost outweighs the runtime decoupling benefit. In that case the deployment cadence probably needs rethinking anyway.
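For the large-model case, an init container script might copy the artifact from the local cache into the shared model volume. The sketch below also verifies a SHA-256 recorded at promotion time; that check is our assumption about what "production-grade" could mean here, and the function names and paths are hypothetical:

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_model(cache_path: Path, dest_path: Path, expected_sha256: str) -> Path:
    """Copy the cached artifact into place only if its checksum matches."""
    actual = sha256_of(cache_path)
    if actual != expected_sha256:
        # Fail the init container so the pod never serves an unexpected artifact.
        raise RuntimeError(f"checksum mismatch for {cache_path}: {actual}")
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(cache_path, dest_path)
    return dest_path
```

Pinning the expected checksum in the pod spec keeps the "image digest is the model version" property even when the weights live outside the image.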

Bottom line

MLflow is excellent at what it’s designed for — experiment tracking and registry. Don’t extend it into your runtime.

Bake the artifact at promotion. Let your inference pods serve.

For a full reference architecture using KServe, Knative, and Istio with this pattern, see our MLOps stack case study.