Blog
Notes from production AI infrastructure.
MLOps patterns, GPU FinOps, DevSecOps, SRE — written from on-call experience, not theory.
- Where 70% of EKS spend hides: a 5-step GPU FinOps audit
  How to find the unallocated GPU and compute spend that cost dashboards can't see, and what to do about it.
  #finops #gpu #eks #kubernetes
- Why we bake model.pkl into Docker images instead of pulling from MLflow at runtime
  MLflow is great for experiment tracking and registry. It's not great as a runtime dependency for production inference pods.
  #mlops #mlflow #kserve #kubernetes
- Champion / challenger model promotion that doesn't break inference SLOs
  A safe-by-default pipeline for promoting models in production: alias-based rollouts, evaluation gates, and canary traffic splits.
  #mlops #mlflow #kserve #drift