Routine maintenance activities triggered a previously unidentified bug that resulted in a loss of API availability.
Spreedly leverages a “leader/follower” architectural pattern in its containerization models, in which a self-elected leader orchestrates work for follower services. During routine systems maintenance work, a previously unidentified bug caused a loss of state communication across an orchestration cluster. Over time, the leader/follower relationships atrophied and could not be reestablished. Redeploying the cluster resulted in workers that were unable to pick up jobs from their leader, including the jobs that ran a critical component of our API infrastructure.
Our systems detected the issue automatically and alerted our systems engineers, who quickly restored the appropriate state/relationship for all jobs. As jobs became active, the critical componentry restarted and full API functionality resumed. No data loss resulted from this activity.
At Spreedly we understand the critical role we play in our customers' online experience, and we deeply regret the interruption this incident caused. We have instituted immediate changes, and have begun work on short- and long-term efforts, to ensure this type of issue does not recur.