Between 18:20 UTC and 20:30 UTC on Friday, February 15th, less than one percent of calls to the Spreedly API resulted in a 404 (record not found) error. These were Spreedly errors, and any transactions that resulted in this error can be safely retried.
As part of scheduled maintenance work, we deployed a configuration change to our storage subsystem that inadvertently modified an unrelated aspect of the storage service. This left the storage nodes running in inconsistent operating modes, which caused a small percentage of calls to indicate a missing record when, in fact, the record was present. Because this change affected only a small portion of our total storage capacity, and due to the distributed nature of our storage architecture, only a small subset of total calls was affected.
As part of a long-term migration path intended to increase our redundancy and scalability, Spreedly is currently operating across two distinct infrastructure environments – a self-managed colocated datacenter and a public cloud environment. Maintaining two separate environments stresses our systems and our processes, and increases the likelihood of disruptions such as this. We can do a better job managing this risk, though it will never be brought down to zero. We feel the long-term benefits that our customers will experience from this migration will far outweigh the increased risk of disruption during the migration, but we also acknowledge there are steps we can take to improve system stability during this time.
Going forward, we will be evaluating improvements to our processes and tooling to ensure parity between environments, and between version control and the deployed configuration.
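To illustrate what a parity check between version control and deployed configuration might look like, here is a minimal sketch. It is not Spreedly's actual tooling: the function names, node names, and configuration contents are hypothetical, and it assumes each node's deployed configuration has already been collected as text.

```python
import hashlib


def fingerprint(config_text: str) -> str:
    """Return a SHA-256 digest of a configuration file's contents."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def drifted_nodes(expected_config: str, deployed_configs: dict[str, str]) -> list[str]:
    """Return the sorted names of nodes whose deployed configuration
    differs from the version-controlled one.

    expected_config:  contents of the config as committed (hypothetical input).
    deployed_configs: mapping of node name -> config contents read from that node.
    """
    expected = fingerprint(expected_config)
    return sorted(
        node
        for node, text in deployed_configs.items()
        if fingerprint(text) != expected
    )


if __name__ == "__main__":
    # Hypothetical example: one of three storage nodes runs a different mode.
    committed = "mode=replicated\n"
    deployed = {
        "storage-1": "mode=replicated\n",
        "storage-2": "mode=standalone\n",
        "storage-3": "mode=replicated\n",
    }
    print(drifted_nodes(committed, deployed))
```

A check like this could run after every deployment and fail loudly when the returned list is non-empty, so that nodes in inconsistent operating modes are caught before they serve traffic.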
We apologize for any disruption this incident may have caused. You rely on Spreedly to balance change and improvements against the high-uptime needs of a service like ours, and we can and should do better.