Changes to a security function on the cryptographic systems overwhelmed the resources on those servers, taking a subset of the cryptographic systems offline. This caused a brief period of transaction failures while the system adjusted to the reduced capacity.
We made changes to a security function that runs on the systems that handle our cryptographic processes. After these changes, a routine update process ran, but this time it consumed more resources than expected. The update appears to have changed how many resources the security function needed, which was not an expected side effect. The security function then consumed all the available resources, which took a number of the cryptographic systems offline. As a result, in-flight requests that needed the cryptographic systems failed while the system adjusted to the reduced capacity.
As we brought the affected systems back online to fully restore capacity, some transactions failed for an additional brief period because one of the systems rejoined the rotation before it was ready to serve traffic.
Intermittent transaction failures occurred during two periods:
1:00 to 1:05 UTC
1:34 to 1:35 UTC
The majority of the failures occurred in the first window, with a much smaller number occurring during the second. Outside of these times, transactions processed as expected.
We know that the stability and performance of Spreedly matter greatly to all our customers. We continue to invest in our platform to make it as stable and performant as possible and to prevent issues like this in the future. We are sorry for the disruption this incident caused.