Spreedly experienced a surge in traffic that overwhelmed our system’s ability to process Dashboard updates and also caused some errors in processing transactions.
On May 13th, 2020, at 6:31 pm EDT, Spreedly began receiving alerts from our core systems indicating that they were under unusually high load. Initially, we believed the additional traffic consisted of test requests from a customer. The high load strained ingestion into the data processing pipeline that powers our Dashboard, and messages to the Dashboard system began to fail. These failures continued until we fully recovered the Dashboard system at 7:00 pm EDT.
Recovering the Dashboard system did not end the incident, however, because the high load continued. While we were still investigating the load, the Dashboard processing pipeline failed again at 7:35 pm EDT. The traffic continued to increase and began to stress our primary data store, which increased latency across the system. As a result, transactions were affected: our systems served 500 errors for some requests during this time.
We identified that the increased traffic was originating from a single location and that its profile did not fit the customer test requests we have previously seen. Although the traffic was designed to appear legitimate, we believe it was an attempt to use Spreedly’s systems to verify stolen credit cards. Once we identified this, we blocked the traffic, and Spreedly’s systems stabilized. At 8:18 pm EDT, the system was restored to normal operation.
All the data that failed to be processed by our data processing pipeline for the Dashboard was replayed on May 14th during normal business hours and is now present in the Dashboard.
We have already put in place additional monitoring and a rate limiting system to detect and prevent the behavior observed during this incident. We designed the rate limiting system so that it should not affect normal usage but will protect the platform from the kind of traffic increase observed in this case. These protections are in addition to the measures we already employ, which include DDoS protections and a WAF.
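As an illustration only, and not a description of the rate limiter Spreedly deployed, the general idea of limiting traffic per source without affecting normal usage can be sketched with a token bucket. Everything here is an assumption made for the example: the choice of the token bucket technique, the names `TokenBucket` and `SourceRateLimiter`, and the capacity and refill values.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Hypothetical token bucket: allows short bursts, caps the sustained rate."""

    def __init__(self, capacity: float, refill_rate: float) -> None:
        self.capacity = capacity        # maximum burst size, in requests
        self.refill_rate = refill_rate  # tokens added back per second
        self.tokens = capacity          # start with a full bucket
        self.last_seen: Optional[float] = None

    def allow(self, now: float) -> bool:
        # Refill in proportion to the time elapsed since the last request.
        if self.last_seen is not None:
            elapsed = max(0.0, now - self.last_seen)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_seen = now
        # Spend one token per request; reject when the bucket is empty.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class SourceRateLimiter:
    """Hypothetical limiter that keeps one bucket per traffic source."""

    def __init__(self, capacity: float, refill_rate: float) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, source: str, now: Optional[float] = None) -> bool:
        # A monotonic clock avoids surprises from wall-clock adjustments.
        if now is None:
            now = time.monotonic()
        bucket = self._buckets.get(source)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_rate)
            self._buckets[source] = bucket
        return bucket.allow(now)


# Illustrative values only: bursts of 3 requests, refilled at 1 request per second.
limiter = SourceRateLimiter(capacity=3, refill_rate=1.0)
burst = [limiter.allow("source-a", now=0.0) for _ in range(5)]
```

In this sketch, a source that sends five requests at the same instant has the first three accepted and the rest rejected, while a different source is unaffected because each source has its own bucket. A production system would also need eviction of idle buckets and shared state across servers, which this example omits.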
We will continue to investigate further improvements to our monitoring and rate limiting systems to determine whether there is more we should be doing to guard against related scenarios. We are also investigating whether there are other changes we should make to the system to protect against the observed behavior.
We know that the stability and performance of Spreedly matter greatly to all our customers. We continue to invest in our platform to make it as stable and performant as possible and to prevent issues like this in the future. We are sorry for the disruption this incident caused.