Record not found errors

Incident Report for Spreedly

Postmortem

Between 18:20 UTC and 20:30 UTC on Friday, February 15th, less than one percent of calls to the Spreedly API resulted in a 404 (record not found) error. These errors originated within Spreedly, and any transaction that resulted in this error can be safely retried.
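As an illustration of what a safe retry might look like on the client side, here is a minimal sketch. It is not part of any Spreedly library: `call_with_retry`, `send`, and the default attempt count and delays are all hypothetical, and `send` stands in for whatever function your integration uses to issue the request and return a status code and body.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `send`, retrying while it returns a 404, up to `max_attempts` calls.

    `send` is a hypothetical stand-in for the client's own request function;
    it must return a (status_code, body) pair.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        # Only a 404 is retried; any other status is returned unchanged,
        # and the last attempt's result is returned whatever it is.
        if status != 404 or attempt == max_attempts:
            return status, body
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable: the final attempt always returns")
```

Note that this sketch cannot tell a spurious 404 from a genuine one, so a request for a record that truly does not exist is also retried. The attempt limit keeps that cost bounded.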

What Happened

As part of scheduled maintenance work, we deployed a configuration change to our storage subsystem that inadvertently modified an unrelated aspect of the storage service. This left the storage nodes running in inconsistent operating modes, which caused a small percentage of calls to report a missing record when, in fact, the record was present. Because the change affected only a small portion of our total storage capacity, and because of the distributed nature of our storage architecture, only a small subset of calls was affected.

As part of a long-term migration intended to increase our redundancy and scalability, Spreedly is currently operating across two distinct infrastructure environments: a self-managed colocated datacenter and a public cloud environment. Maintaining two separate environments stresses our systems and our processes, and increases the likelihood of disruptions such as this one. We can do a better job of managing this risk, though it can never be reduced to zero. We believe the long-term benefits to our customers will far outweigh the increased risk of disruption during the migration, but we also acknowledge that there are steps we can take to improve system stability during this time.

Next Steps

Going forward, we will be evaluating improvements to our processes and tooling to ensure parity between environments, and between version control and the deployed configuration.
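To make the idea of parity between version control and the deployed configuration concrete, the following is a minimal sketch of a drift check. It does not describe Spreedly's actual tooling, which this report does not detail: the function name `config_drift`, the node names, and the `operating_mode` and `replicas` settings are all hypothetical.

```python
from typing import Any, Dict, Tuple


class _Missing:
    """Sentinel for a setting that is absent on one side of the comparison."""

    def __repr__(self) -> str:
        return "<missing>"


MISSING = _Missing()


def config_drift(
    expected: Dict[str, Any],
    deployed_by_node: Dict[str, Dict[str, Any]],
) -> Dict[str, Dict[str, Tuple[Any, Any]]]:
    """Return, per node, each setting whose deployed value differs from `expected`.

    `expected` is the configuration as recorded in version control;
    `deployed_by_node` maps a node name to the configuration actually running.
    Each difference is reported as an (expected_value, deployed_value) pair.
    """
    drift: Dict[str, Dict[str, Tuple[Any, Any]]] = {}
    for node, deployed in deployed_by_node.items():
        diffs = {}
        # Check the union of keys so that extra and missing settings both show up.
        for key in sorted(set(expected) | set(deployed)):
            want = expected.get(key, MISSING)
            have = deployed.get(key, MISSING)
            if want != have:
                diffs[key] = (want, have)
        if diffs:
            drift[node] = diffs
    return drift


if __name__ == "__main__":
    # Hypothetical data: one node has drifted from the recorded configuration.
    expected = {"operating_mode": "strict", "replicas": 3}
    deployed = {
        "storage-1": {"operating_mode": "strict", "replicas": 3},
        "storage-2": {"operating_mode": "relaxed", "replicas": 3},
    }
    for node, diffs in config_drift(expected, deployed).items():
        print(node, diffs)
```

A check of this kind could run after each deploy, or on a schedule, and raise an alert whenever the result is non-empty, so that nodes running in inconsistent modes are caught before they affect API calls.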

Conclusion

We apologize for any disruption this incident may have caused. You rely on Spreedly to balance change and improvement against the high-uptime needs of a service like ours, and we can and should do better.

Posted Feb 21, 2019 - 13:41 EST

Resolved

The error rate has stayed within normal operating bounds through the weekend and we are marking this incident as resolved. Any transactions that failed during this period with a "record not found" error can be safely retried.

A full post-mortem will be posted this week.

Posted Feb 18, 2019 - 09:56 EST

Update

We will continue to monitor the situation through the weekend and will post a final update Monday.

Posted Feb 16, 2019 - 17:43 EST

Monitoring

The fix has been applied and the intermittent errors have subsided. We are monitoring to ensure there are no deleterious side effects.

Posted Feb 15, 2019 - 15:42 EST

Identified

We have identified a potential fix and are applying it to production systems.

Posted Feb 15, 2019 - 15:33 EST

Investigating

We are seeing intermittent errors when fetching records from our data storage systems. This is causing some transactions to fail. We've begun investigating the cause. In the meantime, it is safe to retry these transactions on failure.

Posted Feb 15, 2019 - 14:45 EST

This incident affected: Core Transactional API.