On September 15th, 2017 at 16:31 UTC, during routine systems maintenance, adding multiple storage nodes back into rotation at the same time caused our primary transactional API systems to fail their restart sequence, resulting in 11 minutes of 100% downtime for our customers. Once the situation was diagnosed, the transactional API systems were restarted and service was immediately restored. We recognize that any amount of downtime is lost business for our customers, and we apologize for this incident. Please read on to learn the details and what we plan to do to make things better.
Spreedly's primary data store is a distributed database that operates with multiple nodes. It is common procedure for us to take a single node out of rotation, perform maintenance, and add the node back into rotation; this happens several times a quarter without incident. Bringing a node back into service involves bringing up the host machine, then the service itself, and finally telling the API systems that the new node is available for use. When a storage node becomes available, it broadcasts its location to the API services, which restart to pull in the new value.
In this case, due to an error in our automated playbooks, three of our storage nodes had been through a maintenance reboot sequence without actually broadcasting to the API systems that they were available for use. Upon recognizing this, the three storage nodes were added back into rotation at the same time (or very nearly so; all within one minute). This propagated three separate restart commands to each of the API application servers. These restart commands, which are intended to run only sequentially, were instead run concurrently, which caused the final restart command to fail and the service to remain down.
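To see why interleaving is hazardous in general, consider a simplified model of a restart as "stop, then start." This is an illustrative sketch, not Spreedly's actual tooling: it shows how, when the steps of three concurrent restarts interleave, the last "stop" can land after the last successful "start," leaving the service down.

```python
# Hypothetical model: a restart is stop() followed by start(). The
# names and interleaving here are illustrative only.

service_running = True
log = []

def stop():
    global service_running
    service_running = False
    log.append("stop")

def start():
    global service_running
    service_running = True
    log.append("start")

# Three sequential restarts: each start follows its own stop, so the
# service always ends up running.
for _ in range(3):
    stop(); start()
assert service_running

# One possible interleaving of three concurrent restarts
# (r1: stop,start / r2: stop,start / r3: stop,start):
stop()   # r1 stop
stop()   # r2 stop
start()  # r1 start
start()  # r2 start
stop()   # r3 stop lands last; if r3's start then errors out on the
         # unexpected state and never runs, the service stays down.
print(service_running)  # → False
```

The point of the sketch is that a sequence of operations that is safe when serialized can end in a bad terminal state under concurrency, which matches the failure mode described above.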
The incident response team became aware of the problem at 16:31 UTC, when our automated monitoring systems reported that the API was down. Given the timing, the rapid-fire addition of the three storage nodes back into service was identified as a likely culprit, and its effects were confirmed in short order, at which point the primary API services were started and service restored. Service was completely restored at 16:42 UTC, for a total of 11 minutes of downtime.
Spreedly has already begun modifying our automated playbooks to be more resilient to errors so that subsequent steps aren't inadvertently skipped. We are also adding more explicit visibility into storage node state so it is clear to our operators when a node is merely running versus running and actually available for use by our application servers. Additionally, we are looking into making the application restart process more robust, either by preventing concurrent runs or by not erroring when called in close succession.
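One common way to prevent concurrent runs is to serialize them behind an exclusive OS-level lock, so a second invocation waits (or bails out cleanly) instead of interleaving with the first. The sketch below is a minimal illustration of that general technique; the lock path and function names are hypothetical, not Spreedly's actual tooling.

```python
# Minimal sketch: serialize restart runs with an exclusive file lock
# (POSIX flock). LOCK_PATH and restart_serialized are illustrative
# names, not part of any real deployment tooling.
import fcntl

LOCK_PATH = "/tmp/api-restart.lock"  # hypothetical lock location

def restart_serialized(do_restart):
    """Run do_restart() while holding an exclusive lock.

    A second caller blocks at flock() until the first finishes, so the
    stop/start steps of two restarts can never interleave.
    """
    with open(LOCK_PATH, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks if already held
        try:
            do_restart()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Using `fcntl.LOCK_EX | fcntl.LOCK_NB` instead would make a second, near-simultaneous invocation fail fast (or be treated as a no-op) rather than queue up, which corresponds to the "not error if called in close succession" option above.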
Our goal at Spreedly is and will always be maximum availability and reliability. We sincerely apologize for the disruption this incident caused.