
Why we bake model.pkl into Docker images instead of pulling from MLflow at runtime

MLflow is great for experiment tracking and registry. It's not great as a runtime dependency for production inference pods.

#mlops #mlflow #kserve #kubernetes

A pattern we see repeatedly when teams adopt MLflow:

# In the inference container, at startup
import mlflow

model = mlflow.pyfunc.load_model(f"models:/{name}@champion")

It looks clean. It uses the registry. The model name and alias are managed via the UI. Teams ship it.

Then production happens.

What goes wrong

MLflow becomes a runtime dependency for every inference pod. Every cold start hits the MLflow tracking server. Every horizontal scale-up hits the artifact store. If MLflow is on a t3.medium because “it’s just the registry,” you’ve made it part of your inference critical path.

Failure modes we’ve seen in production:

  • An MLflow service restart triggers a brief wave of 503s, because pods scaling up during the outage cannot load the model
  • Artifact store throttling under sudden scale (Black Friday, clinic open, market open)
  • Network policy changes accidentally block egress from the inference namespace to MLflow: pods that load the model lazily come up healthy, then fail at the first request
  • A bad migration on the MLflow backing database makes every new pod fail to start

You also pay an SRE tax:

  • MLflow needs to be on-call-grade reliable, with PostgreSQL HA, S3 / blob redundancy, observability
  • Or you accept that inference pods can fail to start when MLflow is down — which means MLflow downtime is inference downtime

The pattern we use instead

Bake the promoted model artifact into a versioned Docker image at promotion time.

FROM kserve/kserve-runtime:0.16

# Bake the model in at build time — copied from MLflow during the promotion job
COPY model.pkl /mnt/models/model.pkl

ENV MODEL_NAME="risk-classifier"
ENV MODEL_VERSION="2026-04-12-a1b2c3d"
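
With the artifact in the image, loading becomes a local file read. A minimal sketch, assuming the artifact is a plain pickle at the path used in the Dockerfile above; `load_baked_model` and the `MODEL_PATH` override are hypothetical names for illustration, not part of KServe or MLflow:

```python
import os
import pickle
from pathlib import Path

# Default matches the COPY destination in the Dockerfile above.
# MODEL_PATH is a hypothetical override, handy for local testing.
DEFAULT_MODEL_PATH = "/mnt/models/model.pkl"


def load_baked_model(path=None):
    """Load the model from the image's own filesystem; no network calls."""
    model_path = Path(path or os.environ.get("MODEL_PATH", DEFAULT_MODEL_PATH))
    if not model_path.is_file():
        # Fail at startup rather than at the first request if the image is wrong.
        raise FileNotFoundError(f"baked model not found at {model_path}")
    with model_path.open("rb") as f:
        # Only unpickle artifacts that came from your own build pipeline.
        model = pickle.load(f)
    return {
        "model": model,
        "name": os.environ.get("MODEL_NAME", "unknown"),
        "version": os.environ.get("MODEL_VERSION", "unknown"),
    }
```

Raising when the file is missing keeps a badly built image from ever passing its startup check.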

The promotion pipeline:

  1. Trigger fires on @challenger → @champion alias change in MLflow
  2. Pipeline downloads model.pkl from MLflow (one-time, in CI)
  3. Builds and signs a Docker image with the artifact baked in
  4. Pushes to an immutable registry and records the resulting image digest
  5. KServe InferenceService is updated to point at the new digest
  6. Knative rolls a new revision; canary traffic split begins
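
Steps 2 through 4 can be sketched as a small CI script. This is an illustration under assumptions, not our exact pipeline: the registry host, the function names, and the injected `download` callable are hypothetical, and signing and the KServe update are left out. In CI, `download` would wrap whatever MLflow client call fetches the artifact.

```python
import subprocess
from pathlib import Path


def image_tag(registry, model_name, model_version):
    """Build the image reference for a promoted model."""
    return f"{registry}/{model_name}:{model_version}"


def promotion_commands(tag, build_context):
    """Return the docker commands for steps 3 and 4 (build, then push)."""
    return [
        ["docker", "build", "-t", tag, str(build_context)],
        ["docker", "push", tag],
    ]


def promote(registry, model_name, model_version, build_context,
            download, run=subprocess.run):
    """Download the artifact once, then build and push the image."""
    context = Path(build_context)
    # Step 2: `download` is injected (hypothetical). In CI it could wrap an
    # MLflow call such as mlflow.artifacts.download_artifacts.
    download(context / "model.pkl")
    tag = image_tag(registry, model_name, model_version)
    for cmd in promotion_commands(tag, context):
        # check=True aborts the promotion on the first failing command.
        run(cmd, check=True)
    return tag
```

Injecting `download` and `run` keeps the script testable without a registry or a Docker daemon, and confines the only MLflow call to the CI job.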

MLflow is no longer in the request path. Inference pods need only:

  • The container image (pulled once per node, then served from the node's cache)
  • The model artifact (already inside the image)

Other things this unlocks

  • Faster cold starts — no S3 round-trip on pod start
  • Air-gapped deployments — model travels with the image
  • Trivial rollback — revert to the previous image digest. KServe handles the traffic shift
  • Reproducible deploys — image digest is the model version
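
Because the digest identifies the model, a rollback is a one-field change to the InferenceService. A hedged sketch, assuming a custom predictor container at `spec.predictor.containers[0]`; the helper name `with_image` is hypothetical, and applying the resulting manifest to the cluster is out of scope here:

```python
import copy


def with_image(inference_service, image_ref, canary_percent=None):
    """Return a copy of an InferenceService manifest pointing at image_ref."""
    if "@sha256:" not in image_ref:
        # Mutable tags would break the "digest is the model version" property.
        raise ValueError("pin the image by digest, not by a mutable tag")
    updated = copy.deepcopy(inference_service)  # leave the input untouched
    predictor = updated["spec"]["predictor"]
    predictor["containers"][0]["image"] = image_ref
    if canary_percent is None:
        # No canary requested: drop the field so the new revision takes all traffic.
        predictor.pop("canaryTrafficPercent", None)
    else:
        predictor["canaryTrafficPercent"] = canary_percent
    return updated
```

The same function covers both directions: pass the new digest with a small `canary_percent` to promote, or the previous digest with no canary to roll back.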

When not to do this

  • Very large models that would balloon the image into the multi-GB range, pushing up both storage costs and pull times. Use a sidecar or init container to pull from a local cache instead, but treat the artifact source as production-grade.
  • Models that are fine-tuned so frequently that the image rebuild cost outweighs the benefit of runtime decoupling. In that case, the deployment cadence probably needs rethinking anyway.
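
For the large-model case, the init container's job can be as small as copying from the cache and verifying a checksum before the server starts. A minimal sketch, assuming a node-local cache path and an expected SHA-256 supplied by the promotion job; `stage_model` and `sha256_of` are hypothetical helpers:

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large models do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_model(cache_path, dest_path, expected_sha256):
    """Copy the model from the cache and verify it before serving."""
    dest = Path(dest_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".partial")
    shutil.copyfile(cache_path, tmp)
    actual = sha256_of(tmp)
    if actual != expected_sha256:
        tmp.unlink()  # never leave a corrupt artifact where the server looks
        raise ValueError(
            f"checksum mismatch: expected {expected_sha256}, got {actual}"
        )
    tmp.replace(dest)  # rename within the same directory
    return dest
```

Pinning the checksum at promotion time preserves some of the reproducibility that baking provides, even though the artifact no longer travels inside the image.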

Bottom line

MLflow is excellent at what it’s designed for — experiment tracking and registry. Don’t extend it into your runtime.

Bake the artifact at promotion. Let your inference pods serve.


For a full reference architecture using KServe, Knative, and Istio with this pattern, see our MLOps stack case study.