Over the course of the day on June 1st, Spreedly had five instances where, for 5-15 seconds, we returned a 503 response code for ~25% of requests. The final time this happened, around 22:41 UTC, our external monitoring service caught one of the 503s, triggering a notification and an investigation.
Because each episode lasted only seconds, it was over before we had a chance to respond. Our subsequent investigation revealed that a gateway that typically sees very bursty traffic had slowed down significantly. The slowdown coincided with a normal burst of transactions, which caused requests to pile up and tied up our response handlers. That, in turn, caused our load balancers to return 503s before any request handling was done. Each time, as soon as the request queue was worked down, the service returned to normal without any intervention.
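The failure mode above can be illustrated with a toy model (this is not our actual stack, just a sketch of the queueing behavior): a balancer with a fixed pool of handlers and a bounded backlog serves requests normally until a slow upstream keeps every handler busy, at which point the backlog fills and new requests are rejected immediately, which is when clients see a 503.

```ruby
# Toy model of the incident's failure mode. A slow gateway keeps handlers
# busy, the backlog fills during a normal burst, and everything after
# that is rejected (i.e., answered with a 503) until the queue drains.
class ToyBalancer
  def initialize(handlers:, backlog:)
    @free_handlers = handlers # handlers not yet stuck on the slow gateway
    @backlog = backlog        # how many requests the balancer will queue
    @queued = 0
  end

  # Returns :handled while a handler is free, :queued while backlog space
  # remains, and :rejected (a 503 to the client) once both are exhausted.
  def accept
    if @free_handlers > 0
      @free_handlers -= 1
      :handled
    elsif @queued < @backlog
      @queued += 1
      :queued
    else
      :rejected
    end
  end
end

# Two handlers and one backlog slot: the fourth request in the burst
# gets a 503 even though nothing is "down".
lb = ToyBalancer.new(handlers: 2, backlog: 1)
results = Array.new(4) { lb.accept }
```

The point of the sketch is that no component has failed outright: the 503s are purely a consequence of a slow dependency occupying finite handler capacity during a burst.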
We had previously put protections in place to guard against any single gateway timing out a significant fraction of requests and tying up a disproportionate share of resources. This incident highlighted that, as we continue to grow, we need to take those protections a step further and cover the case of a heavily used gateway slowing down dramatically rather than timing out completely.
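One common way to cover the slow-but-not-timing-out case is a per-gateway bulkhead: cap how many requests may be in flight to each gateway at once, and fail fast when the cap is reached. The sketch below uses hypothetical names and is not Spreedly's actual implementation, just a minimal illustration of the idea.

```ruby
# Minimal per-gateway bulkhead sketch (hypothetical, illustrative only):
# each gateway gets a fixed concurrency budget, so one slow gateway
# can never tie up every response handler in the system.
class GatewayBulkhead
  def initialize(max_in_flight)
    @max_in_flight = max_in_flight
    @in_flight = 0
    @lock = Mutex.new
  end

  # Runs the block and returns its result, or returns :rejected without
  # running it if the gateway is already saturated -- failing fast
  # instead of queueing behind a slow upstream.
  def call
    acquired = @lock.synchronize do
      if @in_flight < @max_in_flight
        @in_flight += 1
        true
      else
        false
      end
    end
    return :rejected unless acquired
    begin
      yield
    ensure
      @lock.synchronize { @in_flight -= 1 }
    end
  end
end
```

A rejected call can then be surfaced to the caller as a clear gateway-busy error rather than consuming a handler, which keeps one slow integration from degrading the whole API.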
We’re also actively working to give the service more headroom in terms of resources: while we can already handle big transaction bursts with capacity to spare, being reasonably over-provisioned has helped make these types of gateway slowdowns a non-event in the past and will give us more flexibility going forward.
Finally, the most disconcerting aspect of this incident is that we were only notified because our external polling (every 30 seconds) happened to overlap with one of these short windows. We are going to add explicit internal monitoring for 503 status codes (in addition to the external polling we have today), since in normal operation we should never return a 503.
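Internal 503 detection can be as simple as a middleware that watches every response on its way out. Here is a minimal Rack-style sketch (hypothetical names; the alerting hook is a placeholder for whatever notification system is in use) that fires a callback whenever a 503 leaves the app, so detection no longer depends on an external poll landing inside a brief window.

```ruby
# Sketch of internal 503 detection (illustrative, not production code):
# a Rack-style middleware that invokes an alert callback whenever the
# app returns a 503, instead of waiting for external polling to catch one.
class ServiceUnavailableMonitor
  def initialize(app, on_503:)
    @app = app       # downstream Rack app
    @on_503 = on_503 # callable invoked with the env on every 503
  end

  def call(env)
    status, headers, body = @app.call(env)
    @on_503.call(env) if status == 503
    [status, headers, body]
  end
end

# Usage with a stub app that always returns 503:
alerts = []
app = ->(env) { [503, {}, ["unavailable"]] }
monitor = ServiceUnavailableMonitor.new(
  app,
  on_503: ->(env) { alerts << env["PATH_INFO"] }
)
status, = monitor.call("PATH_INFO" => "/v1/payments")
```

In practice the callback would increment a metric or page on-call rather than append to an array, but the shape is the same: every 503 is observed at the source, not sampled from outside.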
Our apologies if any of your API calls were affected: we strive to have Spreedly available to process transactions 100% of the time, and in this case we fell short of that goal. The good news is that these challenges stem from ever-increasing transaction volume and a growing number of gateway partners, and we have solid plans to continue scaling the service out to meet that demand.