On November 1st (UTC), Spreedly experienced an unusual traffic pattern from a large customer that caused resource exhaustion in the cryptographic subsystem. This resulted in two brief periods of API and transactional failures.
On Wednesday, November 1st at 02:17 UTC, a large customer kicked off a high-volume job that overloaded a single component in the Spreedly system, resulting in two ten-minute periods of service degradation over the next few hours (02:17 - 02:28 UTC and 03:59 - 04:10 UTC).
By the time an engineer began investigating the initial period of failures, the incident had resolved itself. The investigation continued and identified the customer account from which the traffic originated, but the decision was made not to contact the customer or institute any protections at that time.
Approximately one and a half hours later, the same traffic pattern resulted in similar downtime. Engineers were paged and began investigating. They quickly determined that this was a continuation of the previous incident, and efforts were made both to contact the customer and to institute a rate limit against their traffic (which was deemed to be non-transactional in nature).
Although the traffic pattern subsided in about 10 minutes, as in the previous period, immediate efforts to prevent future such incidents continued. A rate-limit fix was identified and staged for deployment, but due to the high-risk nature of such changes it was decided not to deploy it. After making contact with the customer in question, and confirming that the traffic pattern wouldn't recur, we marked the issue as resolved.
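The post doesn't describe how Spreedly's rate limit was implemented; for readers unfamiliar with the technique, one common approach to limiting a single account's traffic is a token bucket, sketched below (the class, rate, and capacity values are hypothetical, not Spreedly's actual implementation):

```python
import time

class TokenBucket:
    """Illustrative per-account token-bucket rate limiter.

    Tokens refill continuously at `rate` per second, up to `capacity`,
    which allows short bursts while capping sustained throughput.
    """

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Usage: allow at most 10 requests/second, with bursts up to 20,
# for a single (hypothetical) account's non-transactional traffic.
limiter = TokenBucket(rate=10, capacity=20)
if not limiter.allow():
    pass  # reject the request, e.g. with a 429 Too Many Requests
```

Scoping the limiter to one account's non-transactional traffic, as described above, shields the shared cryptographic subsystem without blocking payment transactions.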
Overall, we had two 10-minute periods of elevated error rates for the Core API. During this time, any request that received a 500 response from Spreedly was not submitted to the gateway and can be safely retried.
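Because 500-failed requests never reached the gateway, retrying them cannot create duplicate transactions. A minimal retry loop with exponential backoff might look like the sketch below; the `send` callable standing in for the API call is hypothetical:

```python
import time

def retry_on_500(send, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry a request on 500 responses with exponential backoff.

    `send` is a hypothetical callable that performs one API call and
    returns the HTTP status code. Safe here because requests that
    failed with a 500 were never submitted to the gateway.
    """
    for attempt in range(max_attempts):
        status = send()
        if status != 500:
            return status
        if attempt < max_attempts - 1:
            # Back off 1s, 2s, 4s, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    return status
```

Note that this pattern is only safe for failures known not to have reached the downstream gateway; retrying ambiguous failures (e.g. timeouts) risks duplicate charges unless the API supports idempotency keys.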
Denial of service attacks, whether intentional or not, are always a concern for an online service like Spreedly. Capacity can always be increased (Spreedly is continuously increasing capacity across its systems), but that alone will never prevent these types of incidents. This incident highlighted that we need to improve our response: identifying, isolating, and limiting aggravating traffic patterns.
As a result of this incident, we will be evaluating improvements to how we identify, isolate, and rate-limit aggravating traffic patterns.
We strive to create and operate a robust service for our customers, and incidents like these show we can do better. We apologize for this disruption and know we have to continue improving our service.