Intermittent transaction failures

Incident Report for Spreedly

Postmortem

On November 1st (UTC), Spreedly experienced an unusual traffic pattern from a large customer that caused resource exhaustion in the cryptographic sub-system. This resulted in two brief periods of API and transactional failures.

What Happened

On Wednesday, November 1st at 02:17 UTC, a large customer kicked off a high-volume job that overloaded a single component in the Spreedly system, resulting in two ten-minute periods of service degradation over the next few hours (02:17 - 02:28 UTC and 03:59 - 04:10 UTC).

During the initial period of failures, the incident had resolved itself by the time an engineer began investigating. The investigation continued and identified the customer account from which the traffic originated, but we decided not to contact the customer or institute any protections at that time.

Approximately one and a half hours later, the same traffic pattern resulted in similar downtime. Engineers were paged and began investigating. They quickly determined that this was a continuation of the previous incident and worked both to contact the customer and to institute a rate limit against their traffic (which was deemed to be non-transactional in nature).

Although the traffic pattern subsided in about 10 minutes, as it had in the previous period, immediate efforts to prevent similar incidents continued. A rate limit fix was identified and staged for deployment, but because of the high-risk nature of such changes we decided not to deploy it. After making contact with the customer in question and confirming that the traffic pattern wouldn’t recur, we marked the issue as resolved.

Impact

Overall, we had two 10-minute periods of elevated error rates for the Core API. Any request that received a 500 response from Spreedly during these periods was not submitted to the gateway and can be safely retried.
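For integrators who want to automate such retries, the following is a minimal sketch in Python, not official Spreedly client code. It assumes a hypothetical `send` callable that submits the request and returns the HTTP status code; the attempt count and backoff delays are illustrative values, not recommendations from this report. The guarantee that a 500 response was never submitted to the gateway applies to the periods described above, so confirm it holds before reusing this pattern elsewhere.

```python
import time
from typing import Callable, Tuple


def retry_on_500(
    send: Callable[[], int],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Call send() until it returns a status other than 500 or attempts run out.

    `send` is a hypothetical callable supplied by the caller; it performs the
    request and returns the HTTP status code.
    Returns (last_status, attempts_made).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    status = 500
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status != 500:
            # Success or a non-retryable error: hand it back to the caller.
            return status, attempt
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return status, max_attempts
```

Backing off between attempts matters here: retrying immediately and in bulk would add load to a system that is already struggling.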

Next Steps

Denial of service attacks, whether intentional or not, are always a concern for an online service like Spreedly. Capacity can always be increased (Spreedly is continuously increasing capacity across its systems), but that alone will never prevent these types of incidents. This incident highlighted that we need to respond better by identifying, isolating, and limiting aggravating traffic patterns.

We will be evaluating the following as a result of this incident:

Investing in our response playbooks to better outline targeted rate limit options
Adding additional automated pages and visibility dashboards to alert us sooner to these types of error cases
Increasing capacity in the specific areas that failed during this incident
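To illustrate what a targeted rate limit can look like, here is a minimal sketch of a per-account token bucket in Python. It is a generic technique, not a description of Spreedly's actual or planned implementation; the class names, the `account_id` key, and the rate and capacity values are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    tokens: float   # tokens currently available
    updated: float  # clock reading at the last refill


class AccountRateLimiter:
    """Per-account token bucket (illustrative sketch, hypothetical names)."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, account_id: str) -> bool:
        """Return True if the account may make a request now."""
        now = self._clock()
        bucket = self._buckets.get(account_id)
        if bucket is None:
            # New accounts start with a full bucket.
            bucket = TokenBucket(tokens=self.capacity, updated=now)
            self._buckets[account_id] = bucket

        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - bucket.updated
        bucket.tokens = min(self.capacity, bucket.tokens + elapsed * self.rate)
        bucket.updated = now

        if bucket.tokens >= 1:
            bucket.tokens -= 1
            return True
        return False
```

Because each account has its own bucket, a single account's burst is throttled without affecting other customers' traffic.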

Conclusion

We strive to create and operate a robust service for our customers, and incidents like these show we can do better. We apologize for this disruption and know we have to continue improving our service.

Posted Nov 07, 2018 - 15:41 EST

Resolved

The aggravating traffic pattern has abated for over 24 hours now and we are marking the incident as resolved. We will now focus our efforts on longer-term remediation and prevention tasks. A public postmortem will be posted by the start of next week.

Posted Nov 01, 2018 - 09:50 EDT

Update

We are investigating a modification that will temporarily rate limit the unusual traffic patterns. All systems are operating normally, but we will wait to formally resolve the incident until we have that protection in place or have confidence that it will not recur.

Posted Oct 31, 2018 - 09:49 EDT

Update

We are continuing to monitor the situation and have a change prepped should the issue recur.

Posted Oct 31, 2018 - 01:59 EDT

Monitoring

Systems are operating normally, though we are still trying to isolate the problematic behavior.

Posted Oct 31, 2018 - 01:10 EDT

Investigating

We are seeing recurring periods of system instability related to the previous incident and are working to isolate the problematic load.

Posted Oct 31, 2018 - 00:43 EDT

This incident affected: Core Transactional API.