Investigating Spike in 503 Errors
Incident Report for Spreedly
Postmortem

What Happened

Over the course of the day on June 1st, Spreedly had five instances where, for 5-15 seconds, we returned a 503 response code for ~25% of requests. The final time this happened, around 22:41 UTC, our external monitoring service saw one of the 503s, triggering a notification and an investigation.

Due to the extremely short nature of the incident (measured in seconds), it was completely resolved before we had a chance to investigate. Our investigation revealed that a gateway that typically experiences very bursty traffic slowed down significantly. The slowdown occurred concurrently with a normal burst of transactions, causing requests to pile up and tying up our response handlers. This, in turn, caused our load balancers to return 503s before any request handling was done. As soon as the request queue was worked down, the service returned to normal each time without any intervention required.
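As a rough illustration of this failure mode (a simplified sketch, not Spreedly's actual architecture; `HandlerPool` and its capacity are hypothetical), a fixed pool of response handlers saturates when each handler is stuck waiting on a slow upstream gateway, and the front end has no choice but to shed new requests with a 503:

```python
import threading

class HandlerPool:
    """Fixed pool of response handlers; shed load with a 503 when saturated.

    Models, in simplified form, how slow upstream gateway calls can tie up
    every handler until new requests are rejected before any work is done.
    """
    def __init__(self, capacity):
        self.slots = threading.Semaphore(capacity)

    def handle(self, work):
        # Non-blocking acquire: if every handler is busy, reject
        # immediately with a 503 instead of queueing without bound.
        if not self.slots.acquire(blocking=False):
            return 503
        try:
            return work()
        finally:
            self.slots.release()
```

The key trade-off is that failing fast with a 503 keeps the rejection window short: once the backlog drains, capacity frees up and service recovers on its own, which matches the self-healing behavior described above.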

Next Steps

We have previously put in protections to guard against any one gateway timing out a significant fraction of requests and tying up a disproportionate number of resources. This incident highlighted that our continued growth means we need to take the protections a step further and cover the case of a heavily used gateway slowing down severely rather than timing out completely.
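One common way to cover the slow-but-not-timed-out case is to track each gateway's recent latency and trip a guard when the average degrades. The sketch below is a hypothetical illustration (the class name, window size, and threshold are assumptions, not Spreedly's implementation):

```python
from collections import deque

class GatewayLatencyGuard:
    """Trip when a gateway's recent responses are slow, not just timed out.

    window:  number of recent calls to average over
    max_avg: average latency in seconds above which the guard trips
    """
    def __init__(self, window=20, max_avg=2.0):
        self.samples = deque(maxlen=window)
        self.max_avg = max_avg

    def record(self, elapsed):
        self.samples.append(elapsed)

    def tripped(self):
        # Require a full window before judging, so a few slow calls
        # during light traffic do not shed load prematurely.
        if len(self.samples) < self.samples.maxlen:
            return False
        return sum(self.samples) / len(self.samples) > self.max_avg
```

A tripped guard could then shunt that gateway's requests to a separate, bounded pool so a single slow partner cannot monopolize shared response handlers.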

We’re also actively working to give the service more headroom in terms of resources. While we can already handle big transaction bursts with capacity to spare, being reasonably over-provisioned has helped make these types of gateway slowdowns a non-event in the past and will give us more flexibility going forward.

Finally, the most disconcerting aspect of this incident is that it was only when our outside polling (every 30 seconds) overlapped with one of these short incidents that we got notified. We are going to add explicit monitoring for 503 status codes internally (in addition to the external polling we have today) since, in normal operations, we shouldn’t ever be returning a 503.
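Because even a single 503 is abnormal here, internal monitoring can be as simple as counting served status codes in-process and paging on any nonzero 503 count. A minimal sketch, assuming a hypothetical `StatusCodeMonitor` (not an actual Spreedly component):

```python
from collections import Counter

class StatusCodeMonitor:
    """Count response status codes in-process and flag any 503s."""
    def __init__(self):
        self.counts = Counter()

    def observe(self, status):
        # Called once per response served.
        self.counts[status] += 1

    def should_alert(self):
        # In normal operation we should never serve a 503, so even one
        # occurrence warrants a notification, regardless of error rate.
        return self.counts[503] > 0
```

Unlike external polling, which only catches a brief incident if a probe happens to land inside the window, an in-process counter sees every response and cannot miss a spike measured in seconds.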

Conclusion

Our apologies if any of your API calls were affected: we strive to have Spreedly available to process transactions 100% of the time, and in this case we failed at that goal. The good news is that the challenges are related to ever-increasing transaction volumes and gateway partners, and we have solid plans to continue to scale the service out to meet the increased demand.

Posted Jun 06, 2017 - 10:27 EDT

Resolved
Traffic on the affected gateway has normalized and we are no longer seeing an abnormal amount of errors. We will now focus on prevention of similar incidents in the future. Additional details and action items will be available in a postmortem.
Posted Jun 01, 2017 - 19:41 EDT
Identified
We've isolated this issue to one gateway. An increase in traffic resulted in an increased response time from the gateway. This consumed additional Spreedly resources, ultimately leading to the 503 response. We are continuing to monitor while working towards a solution.
Posted Jun 01, 2017 - 19:17 EDT
Investigating
We've noticed an increase in responses with an error code 503. We are investigating now and will provide additional information as it becomes available. These transactions are safe to retry.
Posted Jun 01, 2017 - 18:57 EDT