Spreedly Elevated Rate of 500 Errors
Incident Report for Spreedly
Postmortem

Changes to a security function on the cryptographic systems overwhelmed the resources on those servers, taking a subset of the cryptographic systems offline. This caused a brief period of transaction failures while the system adjusted to the reduced capacity. 

What Happened

We made changes to a security function that runs on the systems that handle our cryptographic processes. After these changes, a routine update process ran, but this time it consumed more resources than expected. The update appears to have changed how many resources the security function needed, which was not an expected side effect. The security function then consumed all available resources, taking a number of the cryptographic systems offline. Any in-flight requests that needed the cryptographic systems failed while the system adjusted to the reduced capacity.

As we brought the affected systems back online to fully restore capacity, some transactions failed for an additional brief period because one of the systems rejoined the rotation before it was ready to serve traffic.

Intermittent transaction failures occurred during two periods:

  • 1:00 to 1:05 UTC
  • 1:34 to 1:35 UTC

The majority of the failures occurred in the first window, with a much smaller number occurring in the second. Outside of these times, transactions processed as expected.

Next Steps

  • We have already increased the resources available to the cryptographic processes to ensure this issue does not repeat itself.
  • We are reevaluating our process for bringing the systems back online to determine how to prevent a system from rejoining the rotation before it is ready to serve traffic.
  • We are also reevaluating the security function to understand why this change triggered the need for additional resources, and we are looking at how we can catch an issue like this before it affects our production systems.
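
As an illustration only, and not a description of Spreedly's actual tooling, the kind of readiness gate the second item points toward could be sketched as below. The sketch assumes a node should be admitted to the rotation only after several consecutive passing health checks; the names `RotationPool`, `required_successes`, `add_candidate`, and `probe`, and the node name in the usage note, are all hypothetical.

```python
from typing import Callable, Dict, Set


class RotationPool:
    """Hypothetical pool that admits a node only after consecutive passing health checks."""

    def __init__(self, health_check: Callable[[str], bool], required_successes: int = 3) -> None:
        if required_successes < 1:
            raise ValueError("required_successes must be at least 1")
        self._health_check = health_check      # assumed: returns True when the node can serve traffic
        self._required = required_successes
        self._streaks: Dict[str, int] = {}     # consecutive passing checks per known node
        self._in_rotation: Set[str] = set()

    def add_candidate(self, node: str) -> None:
        """Register a node that is coming back online; it does not serve traffic yet."""
        self._in_rotation.discard(node)
        self._streaks[node] = 0

    def probe(self) -> None:
        """Run one health-check round over every known node."""
        for node in list(self._streaks):
            if self._health_check(node):
                self._streaks[node] += 1
                if self._streaks[node] >= self._required:
                    self._in_rotation.add(node)
            else:
                # Any failure resets the streak and pulls the node out of rotation.
                self._streaks[node] = 0
                self._in_rotation.discard(node)

    def in_rotation(self) -> Set[str]:
        """Return a copy of the set of nodes currently eligible for traffic."""
        return set(self._in_rotation)
```

With `required_successes=2`, a node registered through `add_candidate("crypto-1")` would stay out of `in_rotation()` until two consecutive `probe()` rounds pass, and a single failed check would remove it again. Requiring a streak rather than one passing check is a design choice intended to avoid admitting a node that is only momentarily responsive.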

We know that the stability and performance of Spreedly matter greatly to all our customers. We continue to invest in our platform to make it as stable and performant as possible and to prevent issues like this in the future. We are sorry for the disruption this incident caused.

Posted Aug 21, 2020 - 15:16 EDT

Resolved
After deploying the fix, all systems appear to be stable and functioning. We consider the incident resolved.

We are still investigating to understand the specific causes of the incident and will publish a post-mortem.

We apologize for any inconvenience and disruption to service.
Posted Aug 14, 2020 - 22:01 EDT
Monitoring
A fix has been implemented addressing the elevated 500 errors.

We are actively monitoring the results.
Posted Aug 14, 2020 - 21:47 EDT
Identified
We have identified an issue causing an elevated rate of 500 errors on Spreedly's Core API.

This is impacting all transactions and requests to Spreedly's API.

Updates will be provided as they become available.
Posted Aug 14, 2020 - 21:27 EDT
This incident affected: Core Transactional API.