A planned operating system upgrade on July 12th resulted in slightly higher ongoing CPU usage than the prior OS. That increase gradually drained our CPU credit balance, and by July 18th we had begun to exhaust those credits, causing a portion of transactions to fail.
By way of background, on Wednesday, July 12th, we performed a planned operating system upgrade on the servers running the cryptography service. This followed our usual, well-practiced procedure of gracefully moving traffic from one cluster of servers to the other (i.e., “blue” to “green”), patching the offline cluster, and then repeating the process in reverse. Testing indicated that the upgrade was successful, and the system continued processing approximately 100 million internal service API requests daily without issue or incident.
On July 18th at 18:26 UTC, internal monitoring alerted on an increased number of failures generated by our cryptography service. Teams were immediately assembled to work the issue. Logs indicated resource contention (“limiting connections by zone ‘default’”) as individual service nodes exceeded their configured limit of simultaneous connections.
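As a simplified illustration of that failure mode (a sketch only, with an assumed connection limit and a stand-in request handler, not our production configuration): when a node enforces a fixed cap on simultaneous connections, requests that arrive while every slot is occupied are rejected rather than queued, and those rejections surface as failed API calls.

```python
# Simplified sketch only: a per-node cap on simultaneous connections causes
# requests to be rejected once the cap is reached. The limit, timings, and
# handler below are illustrative assumptions, not our production values.
import asyncio

MAX_CONNECTIONS = 4                      # assumed per-node connection limit
slots = asyncio.Semaphore(MAX_CONNECTIONS)

async def handle_request(i: int) -> str:
    # If every connection slot is already in use, fail fast instead of
    # queueing -- the behavior behind the connection-limit errors we saw.
    if slots.locked():
        return f"request {i}: rejected (connection limit reached)"
    async with slots:
        await asyncio.sleep(0.1)         # stand-in for a cryptographic operation
        return f"request {i}: ok"

async def main() -> None:
    # Fire 10 concurrent requests at a node that only allows 4 at a time.
    results = await asyncio.gather(*(handle_request(i) for i in range(10)))
    print("\n".join(results))

asyncio.run(main())
```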
We began altering the configuration of these limits, quieting one cluster at a time: draining its traffic, applying the change, bringing the cluster back online, and then repeating the steps on the other cluster.
Unfortunately, once the limits were reconfigured, we found a different resource contention, this time at the operating system level. We altered that configuration as well and rolled the change through both “blue” and “green” clusters, but the errors merely moved to another type of resource constraint.
We reverted potentially relevant changes made earlier in the day, but this had no positive effect on the system.
Finally, we reverted the operating system upgrade from the July 12th activity, which ultimately resolved the issue. By July 19th at 00:29 UTC, the error rate on all cryptography service nodes had returned to zero, and we kept the incident in a monitoring state for approximately 12 hours before declaring it resolved.
A subsequent root cause analysis determined that we had overrun the CPU credits allocated to the burstable AWS EC2 instance type on which the cryptography service was running. We believe the OS upgrade incrementally consumed enough additional CPU capacity to eventually exhaust those credits, since credits are spent whenever CPU usage exceeds a relatively low baseline of roughly 20%.
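To make the arithmetic concrete, the following is a rough, hedged model of how burstable-instance CPU credits drain; the vCPU count, baseline, and starting balance are illustrative assumptions rather than our actual instance parameters. Credits accrue at the baseline rate and are spent in proportion to actual usage, so even a small sustained increase can empty the balance over several days.

```python
# Rough, hedged model of burstable-instance CPU credits. The numbers below
# are illustrative assumptions, not our actual instance parameters.
# One CPU credit equals one vCPU running at 100% for one minute; credits
# accrue at the baseline rate and are spent in proportion to actual usage.

VCPUS = 2                 # assumed vCPU count
BASELINE_PCT = 20.0       # assumed baseline utilization; credits drain above this
START_BALANCE = 576.0     # assumed credit balance at the time of the upgrade

def hours_until_exhaustion(avg_cpu_pct: float) -> float:
    """Hours until the credit balance hits zero at a steady average utilization."""
    earned_per_hour = BASELINE_PCT / 100 * VCPUS * 60   # credits earned per hour
    spent_per_hour = avg_cpu_pct / 100 * VCPUS * 60     # credits spent per hour
    net_drain = spent_per_hour - earned_per_hour
    return float("inf") if net_drain <= 0 else START_BALANCE / net_drain

# Even a few percentage points above baseline drains the balance in days.
for cpu in (18, 22, 25, 30):
    hours = hours_until_exhaustion(cpu)
    label = "never (at or below baseline)" if hours == float("inf") else f"{hours:.1f} hours"
    print(f"{cpu}% average CPU -> credits exhausted in {label}")
```

At an assumed 22% average utilization against a 20% baseline, for example, this model drains the balance in roughly ten days, which is consistent with the week that elapsed between the upgrade and the incident.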
We are re-provisioning this service onto an AWS EC2 instance type that is not subject to CPU credit limits, which should prevent this problem from happening again.
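In addition to the instance change, a periodic check on the CPUCreditBalance CloudWatch metric would have surfaced the downward trend days before credits ran out. Below is a hedged sketch using boto3; the instance ID is a placeholder, and the lookback window and reporting period are arbitrary choices.

```python
# Hedged sketch, assuming boto3 with default credentials/region configured.
# The instance ID is a placeholder; the window and period are arbitrary.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",      # standard metric on burstable instances
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=now - timedelta(days=7),
    EndTime=now,
    Period=3600,                        # one datapoint per hour
    Statistics=["Average"],
)

# A steadily declining balance is the early warning we missed here.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), round(point["Average"], 1))
```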
We want to apologize to all customers who were impacted by failed transactions during this incident. We understand how much you rely on our systems being fully operational in order to run your business successfully. We also appreciate your patience while this issue was investigated and resolved. Thank you again for the trust you place in us every day.