Increased level of API authentication (401) errors
Incident Report for Spreedly
Postmortem

During maintenance to address a hardware failure, backend traffic was inadvertently sent to a node within a database cluster that was still in the process of rebooting. For a brief four-minute period, transactions failed for some customers while the node was brought fully online.

What Happened

To maximize overall system availability, some core Spreedly components are split into separate clusters, each with multiple nodes that replicate data between one another. As part of addressing a hardware failure, a new node was brought online and started to receive traffic before it had fully joined the cluster.

When the node received a request that required stored information, the system responded as if the information did not exist, because the node was not yet aware of the other cluster members that held the data needed to handle the request. As a result, some customer transactions were affected between 17:46 and 17:49 UTC. Impacted transactions were those that received a “401 Unauthorized” response.
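
The failure mode described above can be illustrated with a small sketch. This is not Spreedly's implementation, which the report does not describe; the `Node` class, the `joined_cluster` flag, and the `authenticate` and `routable` helpers are all hypothetical names chosen for illustration. The sketch assumes a lookup that falls back to peer nodes, and shows how a node that has not yet joined the cluster would treat a valid credential as missing, and how a readiness check could keep such a node out of rotation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical cluster node; not modeled on any real Spreedly component."""
    name: str
    joined_cluster: bool = False  # assumed readiness flag
    local_data: Dict[str, str] = field(default_factory=dict)

    def lookup(self, key: str, peers: List["Node"]) -> Optional[str]:
        if key in self.local_data:
            return self.local_data[key]
        # A node that has not joined the cluster does not know its peers,
        # so it cannot consult them and reports the key as missing.
        if not self.joined_cluster:
            return None
        for peer in peers:
            if key in peer.local_data:
                return peer.local_data[key]
        return None


def authenticate(node: Node, key: str, peers: List[Node]) -> int:
    """Return an HTTP-style status: 200 if the credential is found, else 401."""
    return 200 if node.lookup(key, peers) is not None else 401


def routable(nodes: List[Node]) -> List[Node]:
    """Readiness gate: only nodes that have joined the cluster receive traffic."""
    return [n for n in nodes if n.joined_cluster]
```

In this sketch, a request for a credential held by an established node fails with 401 when it is routed to a new node whose `joined_cluster` flag is still false, and succeeds once the flag is set. Filtering with `routable` before sending traffic would avoid the first case.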

Actions

  • Improve the methods (processes and procedures) for adding a node to a cluster while the cluster is active.
Posted Feb 02, 2021 - 10:53 EST

Resolved
From approximately 17:46-17:49 UTC, a subset of transactions resulted in unexpected 401 errors (failure to authenticate).
This was caused by a hardware failure in our cloud environment.

We are in the process of investigating the incident and will provide further details at a later time.
Posted Jan 27, 2021 - 12:46 EST