For a period of 40 minutes, starting at 19:02 UTC on May 17th, Spreedly proactively returned an error when customers attempted to interact with one of our gateway partners. This was one of our built-in system protections (described below) kicking in to keep a significant slowdown at one high-volume gateway from affecting or blocking traffic to all of our gateways. All of the blocked transactions are safe to retry, since they were never processed or passed on to the gateway.
On the day of the incident one of our customers had a large on-sale event. This is a recurring, normal pattern that we’re well acquainted with, and the volumes were well within the range of similar events we have handled. However, for the forty minutes of the incident the partner gateway began responding significantly more slowly, forcing Spreedly’s systems to spend more and more of their capacity waiting on responses from that gateway.
Knowing from past experience that this is a risk for high-volume gateways, we previously put in place a system protection we call the Circuit Breaker (inspired by Release It!). One of the things this system tracks is how much of our available capacity a specific gateway is consuming in a given period of time. The Circuit Breaker will “open” and immediately return an error if it sees one gateway saturating too many system resources, and that is what happened in this case.
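To make the mechanism concrete, here is a minimal sketch of a saturation-based circuit breaker in Python. The class name, thresholds, and cooldown policy are illustrative assumptions, not Spreedly’s actual implementation; the point is the shape of the logic: track how many of the shared worker slots one gateway holds, and fail fast once its share crosses a threshold.

```python
import time

class SaturationCircuitBreaker:
    """Fast-fails requests to a gateway that is consuming too much of the
    shared worker capacity. Illustrative sketch only: the threshold and
    cooldown values here are assumptions, not production settings."""

    def __init__(self, total_capacity, saturation_threshold=0.5, cooldown_s=30):
        self.total_capacity = total_capacity            # total concurrent slots in the system
        self.saturation_threshold = saturation_threshold  # max share one gateway may hold
        self.cooldown_s = cooldown_s                    # how long the breaker stays open
        self.in_flight = 0                              # slots currently held by this gateway
        self.opened_at = None                           # when the breaker last tripped

    def allow(self):
        """Return True if a new request to this gateway may proceed."""
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return False                            # breaker open: fail fast
            self.opened_at = None                       # cooldown elapsed: allow a retry
        if self.in_flight / self.total_capacity >= self.saturation_threshold:
            self.opened_at = time.monotonic()           # gateway saturating capacity: trip
            return False
        return True

    def acquire(self):
        self.in_flight += 1                             # call when a request is dispatched

    def release(self):
        self.in_flight -= 1                             # call when a response (or timeout) arrives
```

In a sketch like this, a slow gateway naturally drives `in_flight` up, because slots are released only when responses come back; once its share of capacity crosses the threshold, new requests to it are rejected immediately instead of queuing behind the slowdown.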
Overall, the good news is that the system worked exactly as intended! It prevented a fault at one high-volume gateway from causing any negative effects on processing at any of our other gateways. Two challenges with operating the Circuit Breaker elevated this from “normal course of business” to an incident on the Spreedly side, though.
First, this was the first time the Circuit Breaker had tripped due to system saturation, and we had to do some research to confirm that the transactions were failing because of a Circuit Breaker trip. The team now responsible for the Circuit Breaker didn’t know an operational runbook for it already existed, so it took longer than it should have to pin down exactly what was going on.
Second, while investigating during the incident, we realized that the Circuit Breaker’s system saturation threshold had not been updated when we recently added system capacity. It may therefore have tripped more eagerly than necessary, and it needs to be retuned to our current system parameters.
One of the most important things we realized as we worked through this was that we are not doing a good job of surfacing the fact that Spreedly is blocking transactions due to a Circuit Breaker (or other system protection) event in a consistent, easily machine-readable way. This makes it hard for customers to treat these errors as a back-pressure signal and react in real time, and it also makes later retries of transactions harder than they should be. We will be queuing up work to remediate this.
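As one illustration of what “consistent, easily machine-readable” could mean here, the sketch below builds a hypothetical error payload for a breaker-open rejection. Every field name and value is an assumption for illustration, not Spreedly’s actual API: the idea is that a client can key off a stable error code, see that the transaction is safe to retry, and use a pacing hint as a back-pressure signal.

```python
import json

# Hypothetical shape for a "breaker open" error response; the field names
# and values are illustrative assumptions, not Spreedly's actual API.
breaker_error = {
    "succeeded": False,
    "retryable": True,                       # transaction was never sent to the gateway
    "error_category": "system_protection",   # distinguishes this from gateway declines
    "error_code": "circuit_breaker_open",    # stable code clients can match on
    "retry_after_seconds": 30,               # back-pressure hint for client pacing
}

print(json.dumps(breaker_error, indent=2))
```

With a shape like this, a client doesn’t have to parse human-readable messages to decide whether to queue the transaction for retry or to slow its own request rate.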
The other big takeaway for the team was that we need to proactively refresh everyone on this particular function of the Circuit Breaker and its associated runbooks, review all of its parameters in light of our current system configuration, and ensure new team members learn how to operate the system. That work is being queued up for the near future.
We’re actually really happy with the isolation that the Circuit Breaker provided in this case - it did its job and prevented a significant slowdown at one partner gateway from becoming a system-wide incident. Now that we’ve seen it in operation, we know how to tweak it - and our responses to it - to hopefully make the next time it triggers a non-incident.