Core API 500 Errors and Dashboard Experiencing Delays
Incident Report for Spreedly
Postmortem

Spreedly experienced a surge in traffic that overwhelmed our system’s ability to process Dashboard updates and caused errors in processing some transactions.

What Happened

On May 13th, 2020 at 6:31 pm EDT, Spreedly began receiving alerts indicating that our core systems were under unusually high load. We initially believed these requests were test requests from a customer. The high load strained ingestion into the data processing pipeline that powers our Dashboard, and messages to the Dashboard system began to fail. These failures continued until we fully recovered the Dashboard system at 7:00 pm EDT.

Recovery of the Dashboard system did not end the incident, however, as the high load continued. While we were still investigating, the Dashboard processing pipeline failed again at 7:35 pm EDT. The sustained increase in traffic then began to stress our primary data store and raise latency across the system. As a result, our systems began to serve 500 errors for some transaction requests during this time.

We identified that the increased traffic originated from a single location and that its profile did not match the customer test traffic we had seen previously. While the traffic was designed to appear legitimate, we believe it was an attempt to use Spreedly’s systems to verify stolen credit cards. Once we identified this, we blocked the traffic and Spreedly’s systems stabilized. At 8:18 pm EDT, the system was restored to normal operation.

All the data that failed to be processed by our data processing pipeline for the Dashboard was replayed on May 14th during normal business hours and is now present in the Dashboard.

Next Steps

We have already put in place additional monitoring and a rate limiting system to detect and prevent the behavior observed during this incident. We designed these controls so that they should not impact normal traffic while protecting the platform from the kind of traffic increase observed in this case. These protections are in addition to the measures we already employ, which include DDoS protections and a WAF.
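The postmortem does not describe how the new rate limiting is implemented. As an illustration only, one common approach to limiting bursts from a single source, such as the single-location card-testing traffic seen here, is a per-client token bucket. The sketch below is hypothetical; the class and parameter names are not Spreedly's.

```python
import time
from collections import defaultdict


class TokenBucket:
    """Per-client token bucket (illustrative sketch, not Spreedly's design).

    Each client may burst up to `capacity` requests; tokens refill
    continuously at `rate` tokens per second.
    """

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        # client key (e.g. source IP) -> remaining tokens
        self.tokens = defaultdict(lambda: capacity)
        # client key -> timestamp of last check
        self.last = defaultdict(time.monotonic)

    def allow(self, client: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[client]
        self.last[client] = now
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens[client] = min(
            self.capacity, self.tokens[client] + elapsed * self.rate
        )
        if self.tokens[client] >= 1:
            self.tokens[client] -= 1
            return True
        return False


limiter = TokenBucket(rate=1.0, capacity=3)
# A burst from one source: the first `capacity` requests pass,
# the rest are rejected until tokens refill.
results = [limiter.allow("203.0.113.5") for _ in range(5)]
```

A limiter like this rejects abusive bursts while leaving well-behaved clients, whose request rate stays below the refill rate, unaffected, which matches the stated goal of not impacting normal behavior.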

We will continue to investigate improvements to our monitoring and rate limiting systems to guard against related scenarios, along with any other changes we should make to protect against the behavior observed during this incident.

We know that the stability and performance of Spreedly matters greatly to all our customers. We continue to invest in our platform to make it as stable and performant as possible and to prevent issues like this in the future. We are sorry for the disruption this incident caused.

Posted May 22, 2020 - 11:33 EDT

Resolved
Spreedly Core API and Spreedly Insights Dashboard are fully operational.
Posted May 13, 2020 - 21:23 EDT
Update
Spreedly Core API is operating as expected. Some transactions may be missing from the Dashboard until the data is fully restored tomorrow morning. We are continuing to monitor for any further issues.
Posted May 13, 2020 - 21:00 EDT
Monitoring
A fix has been implemented addressing the 500 errors for the Core API and the delays on Dashboard. We are monitoring the results.
Posted May 13, 2020 - 20:31 EDT
Update
We have identified an issue causing intermittent 500 errors on Spreedly's Core API.
Posted May 13, 2020 - 19:55 EDT
Identified
The issue impacting the Spreedly Insights Dashboard has been identified and a fix is being implemented.
Posted May 13, 2020 - 19:18 EDT
Investigating
We are currently experiencing degraded performance on the Spreedly Insights Dashboard. Transaction activity may not be reporting in real time.

We will provide updates as they become available.
Posted May 13, 2020 - 18:50 EDT
This incident affected: Core Transactional API and Supporting Services (Dashboard).