During routine system maintenance, customer traffic was inadvertently sent to a cluster of systems that were still rebooting. For a brief four-minute period, some customers experienced transaction failures while those systems were brought fully online.
To maximize overall system availability, some core Spreedly components are split into separate clusters. An Application Load Balancer (ALB) normally directs customer traffic to all clusters, or to just one cluster (the “Active Cluster”) during system maintenance. Typical system maintenance happens programmatically, communicating with the ALB to:
In this instance, we inadvertently configured the first step manually (instead of using the programmatic “infrastructure as code” method). This caused the other programmatic ALB operations to be out of sync, resulting in the cluster being marked as “Active” (step 4) before all the system maintenance had completed (step 3).
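The ordering problem described above can be illustrated with a minimal sketch of a guard that refuses to mark a cluster as “Active” until its maintenance has finished. This is not Spreedly's tooling: the `Cluster` class, its method names, and the `MaintenanceIncomplete` exception are all hypothetical, and the sketch models only the two steps named above (maintenance completing, then the cluster being marked active), with no real ALB calls.

```python
class MaintenanceIncomplete(RuntimeError):
    """Raised when a cluster would receive traffic before maintenance ends."""


class Cluster:
    """Hypothetical model of one cluster behind the load balancer."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.receiving_traffic = True
        self.maintenance_complete = True

    def begin_maintenance(self) -> None:
        # Assumption: the cluster stops receiving traffic before work starts.
        self.receiving_traffic = False
        self.maintenance_complete = False

    def finish_maintenance(self) -> None:
        # Corresponds to maintenance (including reboots) having completed.
        self.maintenance_complete = True

    def mark_active(self) -> None:
        # Guard: never route traffic to a cluster that is still being maintained.
        if not self.maintenance_complete:
            raise MaintenanceIncomplete(
                f"{self.name}: maintenance still in progress"
            )
        self.receiving_traffic = True
```

With a check like this, an out-of-order call to `mark_active()` fails loudly instead of sending traffic to systems that are still rebooting, regardless of whether an earlier step was run manually or programmatically.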
Some customer transactions were affected between 21:01 and 21:05 UTC. Impacted transactions were those that received a “500 Internal Server Error” response.