Core API Errors

Incident Report for Spreedly

Postmortem

Summary

Routine maintenance activities triggered a previously unidentified bug that resulted in a loss of API availability.

What Happened

Spreedly leverages a “Leader/Follower” architectural pattern in its containerization models, where a self-elected leader orchestrates work for other follower services. During routine systems maintenance work, a previously unidentified bug resulted in state communication loss across an orchestration cluster. This meant that over time the “leader/follower” relationships atrophied and could not be reestablished. Redeploying the cluster resulted in workers that were unable to pick up jobs from their leader, including the jobs that ran a critical component of our API infrastructure.

Our systems detected the issue automatically and alerted our systems engineers, who quickly restored the appropriate state and relationships for all jobs. As jobs became active, the critical components restarted and full API functionality resumed. No data loss resulted from this activity.
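
As a rough illustration of the failure mode described above, the sketch below models a leader that only assigns jobs to followers it has heard from recently. It is not Spreedly's implementation: the Leader class, the heartbeat mechanism, the timeout value, and the worker and job names are all hypothetical and chosen only to show how lost state communication can leave jobs with no worker to run them.

```python
from dataclasses import dataclass, field


@dataclass
class Leader:
    """Hypothetical leader that tracks followers by their last heartbeat."""

    heartbeat_timeout: float = 10.0  # assumed value, in seconds
    last_seen: dict = field(default_factory=dict)  # follower id -> last heartbeat time

    def record_heartbeat(self, follower_id: str, now: float) -> None:
        # State communication: a follower reports that it is alive.
        self.last_seen[follower_id] = now

    def live_followers(self, now: float) -> list:
        # A follower counts as live only if its heartbeat is recent enough.
        return sorted(
            fid for fid, seen in self.last_seen.items()
            if now - seen <= self.heartbeat_timeout
        )

    def assign(self, job: str, now: float):
        """Return the follower chosen for the job, or None if none is reachable."""
        live = self.live_followers(now)
        return live[0] if live else None


leader = Leader(heartbeat_timeout=10.0)
leader.record_heartbeat("worker-a", now=0.0)
leader.record_heartbeat("worker-b", now=0.0)

# While heartbeats are fresh, the job is picked up.
healthy = leader.assign("api-component", now=5.0)

# No heartbeats arrive after t=0, so by t=30 every relationship has lapsed
# and the job cannot be placed on any worker.
stalled = leader.assign("api-component", now=30.0)

# Restoring state for one follower lets the job be assigned again.
leader.record_heartbeat("worker-b", now=31.0)
recovered = leader.assign("api-component", now=32.0)
```

In this toy model the job stays unassigned until a follower's state is restored, which mirrors the recovery step in the report: once the relationships were reestablished, jobs became active again.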

Conclusion

At Spreedly we understand the critical role we play in our customers' online experience, and we deeply regret the interruption this incident caused. We have instituted immediate changes and have begun work on short- and long-term efforts to ensure this type of issue does not recur.

Posted Dec 16, 2022 - 18:00 EST

Resolved

After a period of monitoring, all systems are functioning normally. This incident is considered resolved.

We apologize for this disruption to service.

Posted Dec 13, 2022 - 16:58 EST

Monitoring

From 20:14 to 20:28 UTC the Core API experienced errors affecting customers. We have identified the cause of the errors and have implemented a fix.

We will continue to monitor to ensure the issue is resolved.

Posted Dec 13, 2022 - 15:45 EST

Investigating

Spreedly is aware of errors affecting customers using the Core API and is currently investigating.

Posted Dec 13, 2022 - 15:27 EST

This incident affected: Core Transactional API.