Spreedly API Errors
Incident Report for Spreedly
Postmortem

During the release of a new version of the Spreedly API, the automated code deployment process failed to deploy all resources needed by the application, resulting in an outage of the Spreedly API. A majority of customers were unable to process requests for approximately 30 minutes.

What Happened

On April 7th, 2021, Spreedly released a new version of the application through an automated code deployment process. Soon after, starting at 15:45 UTC, Spreedly’s internal monitoring systems detected an elevated number of errors. Spreedly engineers redeployed application instances, which resolved the issue. Most impacted requests received a “502 Gateway Unreachable” response, with a smaller number of customer requests receiving a “500 Internal Server Error” response.

Engineers continued to monitor and discovered a secondary issue that was a by-product of the automated deployment process failure. A large volume of monitoring events overwhelmed a downstream service, resulting in degraded performance for a smaller subset of customer requests between 16:28 and 16:36 UTC. Engineers took additional action to recycle the application. Impacted customers received a “500 Internal Server Error” response.
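As an illustration only, one common way to keep a burst of monitoring events from overwhelming a downstream service is to buffer them in a bounded queue and drop the overflow. The sketch below is a generic example and does not describe Spreedly's actual systems or fix; the class name `BoundedEventBuffer` and its parameters are hypothetical.

```python
import queue


class BoundedEventBuffer:
    """Hypothetical buffer that sheds monitoring events once it is full."""

    def __init__(self, max_events=1000):
        # A fixed-size queue caps how many events can pile up.
        self._queue = queue.Queue(maxsize=max_events)
        self.dropped = 0

    def emit(self, event):
        """Queue an event without blocking; return False if it was dropped."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            # Shedding the event keeps the request path from waiting on monitoring.
            self.dropped += 1
            return False

    def drain(self, limit):
        """Remove and return up to `limit` events for the downstream consumer."""
        batch = []
        while len(batch) < limit:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch
```

The design choice here is to lose some monitoring data under load instead of letting event volume slow customer requests; the `dropped` counter keeps that loss visible.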

At approximately 17:10 UTC, Spreedly engineers released an update to the deployment process that addressed the failure in the internal automated deployment process. Following the update, internal systems showed signs of a return to normal activity.

Next Steps

  • Add a gated canary deployment to prevent failures from propagating into production. 
  • Update our automated code deployment health checks to provide more robust self-checking of deployment health.
  • Separate application monitoring events from production services to limit blowback during periods of high volume.
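As a rough illustration of the first item, a canary gate can be sketched as a check that promotes a release only when a small canary deployment has handled enough traffic with a low error rate. This is a minimal sketch under assumed values, not Spreedly's implementation; the names `CanaryStats`, `canary_passes`, and `gated_rollout`, and the thresholds, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CanaryStats:
    requests: int  # total requests served by the canary
    errors: int    # how many of those returned a 5xx response


def canary_passes(stats, max_error_rate=0.01, min_requests=100):
    """Return True only if the canary saw enough traffic and stayed under the error threshold."""
    if stats.requests < min_requests:
        # Too little traffic to judge; keep the gate closed.
        return False
    return stats.errors / stats.requests <= max_error_rate


def gated_rollout(stats):
    """Decide whether to promote the release to all instances or roll it back."""
    return "promote" if canary_passes(stats) else "rollback"
```

For example, `gated_rollout(CanaryStats(requests=1000, errors=50))` returns `"rollback"`, because a 5% error rate exceeds the assumed 1% threshold.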
Posted Apr 14, 2021 - 17:54 EDT

Resolved
After deploying the fix, all systems appear to be stabilized and functioning. The incident is being considered resolved.

We are still investigating to understand the specific causes of the incident and any residual impact. A post-incident review will be published.

We apologize for any inconvenience and disruption to service.
Posted Apr 07, 2021 - 13:38 EDT
Monitoring
We have deployed an additional fix for the latest errors.

We will continue to monitor closely.
Posted Apr 07, 2021 - 12:50 EDT
Identified
We have identified an additional subset of API errors.

We are currently working on implementing a fix.

Updates will be provided as they become available.
Posted Apr 07, 2021 - 12:37 EDT
Update
Our team has deployed a fix for this issue. All systems appear to be stabilized and functional.

We are continuing to monitor the situation actively.
Posted Apr 07, 2021 - 12:24 EDT
Monitoring
A fix has been implemented addressing the API errors.

We are actively monitoring the results.
Posted Apr 07, 2021 - 12:17 EDT
Investigating
We have identified an issue causing intermittent 5xx errors on Spreedly's Core API.

This is impacting a number of transactions and requests to Spreedly's API.

Updates will be provided as they become available.
Posted Apr 07, 2021 - 12:02 EDT
This incident affected: Core Transactional API.