During maintenance to address a hardware failure, backend traffic was inadvertently sent to a node within a database cluster that was still rebooting. For a brief 4-minute period, transactions failed for some customers while the node was brought fully online.
To maximize overall system availability, some core Spreedly components are split up into separate clusters, with multiple nodes that replicate data between one another. As part of addressing a hardware failure, a new node was brought online and started to receive traffic before it had fully joined the cluster.
When the node received a request that required stored information, it responded as if the information did not exist, because it was not yet aware of the other cluster members that held the data needed to handle the request. As a result, some customer transactions were affected between 17:46 and 17:49 UTC. Impacted transactions were those that received a “401 Unauthorized” response.
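To illustrate the failure mode described above, the following is a minimal sketch of a readiness gate, in which a node that has not yet joined its cluster refuses lookups instead of answering "not found". It is not Spreedly's actual implementation; `Node`, `NodeState`, `NotReady`, and `route` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class NodeState(Enum):
    BOOTING = "booting"  # process is up but has not joined the cluster
    JOINED = "joined"    # node knows its peers and holds replicated data


class NotReady(Exception):
    """Raised when no node (or this node) is ready to serve a lookup."""


@dataclass
class Node:
    # Hypothetical stand-in for a database cluster member.
    name: str
    state: NodeState = NodeState.BOOTING
    data: dict = field(default_factory=dict)

    def is_ready(self) -> bool:
        return self.state is NodeState.JOINED

    def lookup(self, key: str) -> Optional[str]:
        # Refuse to answer rather than report "missing" for data
        # that may simply not have replicated to this node yet.
        if not self.is_ready():
            raise NotReady(f"{self.name} has not joined the cluster")
        return self.data.get(key)


def route(nodes: list, key: str) -> Optional[str]:
    """Send the lookup to the first node that reports itself ready."""
    for node in nodes:
        if node.is_ready():
            return node.lookup(key)
    raise NotReady("no ready nodes available")
```

With a gate like this, a request routed across a booting node and an established node is served by the established one, and the new node only begins answering once its state changes to `JOINED`.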