On Wednesday, October 10th, for roughly 20 minutes starting at 18:59 UTC and again for 6 minutes starting at 20:44 UTC, a planned hardware change to one of our power distribution units (PDUs) triggered a series of events that caused core.spreedly.com to return 500 and 504 errors for a percentage of traffic. This impacted attempts to authorize, store, and purchase against credit cards.
On Wednesday, October 10th, we were upgrading one of our power distribution units (PDUs) and, while replacing the existing unit, discovered that one of our servers did not have a working redundant power connection. Almost immediately, we detected 500 and 504 errors being returned for a percentage of our traffic.
When this server lost power, it was hosting several services, including the primary node for our cache cluster. The health check that should trigger automatic promotion of a secondary node failed to fire, so incoming requests that depended on this service began to fail. Additionally, a significant fraction of our identity servers were impacted and removed from their load balancer pool, potentially increasing latency as the remaining servers absorbed the load. Once we identified the culprit (the failed server was still listed as the primary for this critical service), we manually promoted a new primary node and our services resumed normal operation.
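The failover mechanism described above can be sketched roughly as follows. This is an illustrative Python sketch, not Spreedly's actual code: the class names, the consecutive-failure threshold, and the promotion logic are all assumptions about how a typical health-check-driven failover works.

```python
class CacheNode:
    """Illustrative stand-in for a cache cluster member."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.role = "secondary"

class FailoverMonitor:
    """Promote a healthy secondary after N consecutive failed health checks.

    In this incident, the equivalent of check() never promoted anyone,
    leaving a dead node listed as primary until a human intervened.
    """
    def __init__(self, primary, secondaries, failure_threshold=3):
        primary.role = "primary"
        self.primary = primary
        self.secondaries = list(secondaries)
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def check(self):
        # One health-check cycle: reset the counter on success,
        # promote a secondary once the threshold is crossed.
        if self.primary.healthy:
            self.consecutive_failures = 0
            return self.primary
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.promote()
        return self.primary

    def promote(self):
        # Manual promotion does the same thing an operator did here:
        # pick a healthy secondary and swap roles with the failed primary.
        for node in self.secondaries:
            if node.healthy:
                self.primary.role = "secondary"
                node.role = "primary"
                self.secondaries.remove(node)
                self.secondaries.append(self.primary)
                self.primary = node
                self.consecutive_failures = 0
                return
        raise RuntimeError("no healthy secondary available")
```

For example, if `cache-1` is primary and goes unhealthy, three failed checks in a row would hand the primary role to `cache-2`; the manual fix during the incident amounted to calling the equivalent of `promote()` by hand.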
Later that evening, when the failed server was powered back on, it rejoined the storage cluster before fully recovering from the unplanned power cutoff. This caused several more 500 errors before we isolated the server from the service pools.
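The safeguard implied here is a readiness gate: a recovering node should not be re-admitted to a serving pool until it has fully caught up. The sketch below is a minimal illustration under assumed checks (replication lag and a post-crash data verification flag); the function names and state keys are hypothetical, not from our infrastructure.

```python
def recovery_checks_pass(node_state):
    """A node may rejoin only once replication has caught up and its
    data has been verified after the unclean shutdown. (Illustrative
    criteria; real systems may gate on more signals.)"""
    lag = node_state.get("replication_lag_seconds", float("inf"))
    verified = node_state.get("data_verified", False)
    return lag == 0 and verified

def rejoin_pool(pool, node_name, node_state):
    """Re-admit the node to the serving pool only if recovery is complete;
    otherwise keep it isolated, which is what we eventually did manually."""
    if not recovery_checks_pass(node_state):
        return False  # stay out of the pool; keep recovering
    pool.add(node_name)
    return True
```

With a gate like this, a node that powers back on mid-recovery is refused by `rejoin_pool` instead of serving errors until someone isolates it by hand.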
In total, we detected just under 1,000 errors during this incident. Transactions either failed outright or flowed through to their gateways intact; no transaction partially executed and then returned a 500/504 code.
As a result of this incident, we've identified a number of improvements and have started working on them.
Spreedly helps our customers perform millions of transactions per month. Our priority is to ensure you can continue to transact with your customers, so we strike a balance between changing as little as possible that affects your current business and scaling our systems and infrastructure to support your best day next week. This incident uncovered several weaknesses we can address to support you better today and tomorrow, and we intend to do so. We apologize for the disruption this incident caused.