2024

Multi-site failover for a 50,000-server estate

Tier-1 financial services firm: automated multi-site / multi-zone failover for business continuity across a 50,000+ server estate. Zero-downtime release patterns on VMware Tanzu and Kubernetes.

VMware Tanzu, Kubernetes, Kafka, Aerospike, Ansible, SaltStack

Context

A Tier-1 financial services firm needed multi-site, multi-zone failover automation across a 50,000+ server estate spanning low-latency trading clusters, batch settlement, and regulatory reporting workloads. Manual failover drills were expensive, error-prone, and ran quarterly at best.

Problem

  • Failover runbooks were Word documents that had drifted from reality
  • DR drills required weekend windows and full team mobilization
  • Real-time trading clusters (Kafka, Aerospike, DB2, MQ) had no automated cutover discipline
  • Vulnerability and patch backlogs blocked compliance reporting

Approach

  • Automated multi-site / multi-zone failovers for business continuity across the full estate
  • Zero-downtime release patterns on VMware Tanzu (PCF) and Kubernetes — Rolling, Canary, and Blue/Green with automated rollback
  • Real-time monitoring via Splunk and Grafana for Kafka, Aerospike, DB2, and MQ in low-latency trading clusters
  • Enterprise vulnerability management using Qualys (scanning), SCCM (Windows deployment), and ServiceNow (change / incident); maintained SLA compliance and drove KRI reporting via Splunk dashboards
  • Privileged machine identity lifecycle (PKI / SSL / TLS / SSH) using CyberArk PAM, Venafi, HashiCorp Vault, and CredHub across containerized and VM environments
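
To make the "Canary with automated rollback" pattern above concrete, here is a minimal Python sketch of the control loop such a release might follow. It is an illustration under stated assumptions, not the firm's actual tooling: the names `run_canary`, `set_weight`, and `error_rate`, the traffic steps, and the 1% error threshold are all hypothetical, and a real deployment would delegate these hooks to the platform's router and monitoring stack.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CanaryResult:
    promoted: bool
    final_weight: int  # percent of traffic on the new version when the run ends
    reason: str


def run_canary(
    steps: Sequence[int],
    set_weight: Callable[[int], None],
    error_rate: Callable[[], float],
    max_error_rate: float = 0.01,  # hypothetical threshold, chosen for illustration
) -> CanaryResult:
    """Shift traffic to the new version step by step; roll back on a bad reading."""
    if not steps:
        raise ValueError("at least one traffic step is required")
    for weight in steps:
        set_weight(weight)       # hypothetical hook: route `weight`% to the new version
        observed = error_rate()  # hypothetical hook: read the current error rate
        if observed > max_error_rate:
            set_weight(0)        # automated rollback: all traffic back to the old version
            return CanaryResult(False, 0, f"error rate {observed:.3f} at {weight}%")
    return CanaryResult(True, steps[-1], "all steps healthy")


if __name__ == "__main__":
    applied = []                           # records every weight the loop applies
    readings = iter([0.001, 0.002, 0.05])  # simulated error rates; the third is unhealthy
    result = run_canary([5, 25, 50, 100], applied.append, lambda: next(readings))
    print(result, applied)
```

In this simulated run the third reading exceeds the threshold, so the loop stops at the 50% step and resets the weight to 0 without ever reaching 100%. Blue/Green follows the same shape with a single step from 0 to 100, while Rolling replaces instances in batches instead of shifting a traffic weight.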

Outcome

  • DR drills shifted from quarterly weekend events to weekly automated runs
  • Mean-time-to-failover for trading clusters reduced from hours to minutes
  • 100% SLA compliance on vulnerability remediation across the estate
  • Zero standing privileged access on the server fleet

Stack

VMware Tanzu (PCF), Kubernetes, Kafka, Aerospike, MQ, DB2, Ansible, SaltStack, CyberArk PAM, Venafi, HashiCorp Vault, CredHub, Qualys, SCCM, ServiceNow, Splunk, Grafana.