5xx Responses
Incident Report for Spreedly
Postmortem

On Wednesday, October 10th, for roughly 20 minutes starting at 18:59 UTC and roughly 6 minutes starting at 20:44 UTC, a planned hardware change to one of our power distribution units (PDUs) started a series of events that led to core.spreedly.com returning 500 and 504 errors for a percentage of traffic. This impacted attempts to authorize, store, and purchase against credit cards.

What Happened

On Wednesday, October 10th, we were performing an upgrade of one of our PDUs and, in the act of replacing the existing unit, discovered that one of our servers did not have a working redundant power connection. Almost immediately, we detected 500 and 504 errors being returned for a percentage of our traffic.

When this server lost power, it was hosting several services, one of which was serving as the primary node for our cache cluster. The health check that triggers automatic promotion of a secondary node failed to kick in, so incoming requests failed when they attempted to access this service. Additionally, a significant fraction of our identity servers were impacted and removed from their load balancer pool, potentially increasing latency as the remainder took up the load. Once we identified the culprit (the failed server was still listed as the primary for this critical service), we manually promoted a new primary node and our services resumed normal operation.
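The promotion step that had to be performed manually here can be sketched as follows. This is a minimal illustration under stated assumptions, not Spreedly's actual tooling: the `CacheNode` structure and `promote_if_primary_down` routine are hypothetical names for whatever the real cluster manager does.

```python
# Minimal sketch of automatic primary failover for a cache cluster.
# CacheNode and promote_if_primary_down are illustrative names only.

from dataclasses import dataclass

@dataclass
class CacheNode:
    name: str
    healthy: bool
    role: str  # "primary" or "secondary"

def promote_if_primary_down(cluster):
    """Return the current primary; if it is unhealthy, demote it and
    promote a healthy secondary in its place."""
    primary = next(n for n in cluster if n.role == "primary")
    if primary.healthy:
        return primary
    replacement = next(
        (n for n in cluster if n.role == "secondary" and n.healthy), None)
    if replacement is None:
        raise RuntimeError("no healthy secondary available to promote")
    primary.role = "secondary"    # demote the failed node
    replacement.role = "primary"  # promote the healthy secondary
    return replacement
```

Run on a schedule (or triggered by a health check), a routine like this removes the need for an operator to notice the stale primary and promote a replacement by hand.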

Later that evening, when the failed server was powered back on, it rejoined the storage cluster before it had fully recovered from the unplanned power cutoff. This caused several more 500 errors before we isolated the server from the service pools.
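The fix for this class of problem is a readiness gate: a recovering node must pass its recovery checks before being re-admitted to a pool. A minimal sketch, assuming hypothetical check names (`recovery_complete`, `replication_lag_s` are illustrative, not real fields from our infrastructure):

```python
# Sketch of a readiness gate for a recovering node. A node may rejoin
# a service pool only once its recovery checks pass; until then it
# stays isolated. All field and function names are illustrative.

def ready_to_rejoin(node_state):
    """True only if recovery finished and replication has caught up."""
    return (node_state.get("recovery_complete", False)
            and node_state.get("replication_lag_s", float("inf")) < 5.0)

def rejoin_pool(pool, node_name, node_state):
    """Admit the node to the pool only when it is ready; otherwise
    leave it isolated and report failure."""
    if not ready_to_rejoin(node_state):
        return False  # keep the node out until checks pass
    pool.append(node_name)
    return True
```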

Impact

Overall, we detected just under 1,000 errors returned during this incident. During this time, transactions either failed entirely or flowed through to their gateways intact; no transaction partially executed and then returned a 500 or 504.

Next Steps

As a result of this incident, we’ve identified and started working on a number of improvements:

  • Improve health checks for the cache service to detect node failures and fail over automatically
  • Enable identity servers to start and run in a degraded mode when the cache service is unavailable
  • Improve procedures for maintenance on powered equipment in the data center
  • Improve status monitoring of power supply redundancy for servers in the data center
  • Make additional improvements to host, PSU, and cluster alerting
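The degraded mode for identity servers amounts to treating the cache as an optimization rather than a dependency: try the cache, and on any cache failure fall back to the authoritative store instead of erroring. A minimal sketch, assuming hypothetical lookup functions (the real identity service's interfaces are not shown here):

```python
# Sketch of "run degraded when the cache is unavailable": try the
# cache first, fall back to the authoritative store on a cache error
# or miss. cache_get and db_get stand in for real lookup functions.

def lookup(key, cache_get, db_get):
    """Return the value for key, tolerating a dead cache."""
    try:
        value = cache_get(key)
        if value is not None:
            return value
    except ConnectionError:
        pass  # cache is down: take the slower path instead of failing
    return db_get(key)
```

With this shape, a cache outage costs latency rather than availability.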

Conclusion

Spreedly helps our customers perform millions of transactions per month. Our priority is to ensure you can continue to transact with your customers, so we balance changing as little as possible that impacts your current business against scaling our systems and infrastructure to support your best day next week. This incident uncovered several weaknesses, and we intend to address them so we can support you better today and tomorrow. We apologize for the disruption this incident caused.

Posted about 1 month ago. Nov 02, 2018 - 14:12 EDT

Resolved
At 18:59 UTC our master cache server failed without a secondary automatically taking over, resulting in a percentage of traffic receiving 500 and 504 errors for 14 minutes (18:59 to 19:13 UTC). We detected the failure and manually promoted a secondary server to take its place, and all traffic returned to normal at 19:13 UTC.

Between 20:41 and 20:44 UTC, we experienced a small number of connection errors (208) as the server above came back online, before it was removed from the cluster for further investigation.

We’re continuing to monitor traffic but have seen no follow-up errors. We’re investigating the failover to determine why it didn’t kick in automatically and will provide more information as part of our post-mortem.
Posted 2 months ago. Oct 10, 2018 - 14:59 EDT