Server Maintenance Issues Causing 500 Errors

Incident Report for Spreedly

Postmortem

December 11th, 2024 — Internal System Errors

Server maintenance to increase the capacity of an internal service, performed during Spreedly’s regular change window, caused two brief periods of service interruption to APIs that access decrypted information.

What Happened

At approximately 19:15 UTC, Spreedly Engineering initiated a capacity expansion of an internal service used for data decryption. A first wave of failed API requests occurred from 19:28 to 19:30 UTC during the maintenance window. A second wave of failed API requests occurred from 19:39 to 19:41 UTC while request traffic was being rebalanced to the capacity-expanded internal service. Service was restored at 19:41 UTC, and the system was fully operational.

API calls that required decrypted data were impacted during both outage windows.

Next Steps

Spreedly Engineering is improving internal observability, implementing automated monitors, and investigating the use of automation for scaling capacity in the future to prevent this issue from recurring.

Posted Dec 16, 2024 - 14:26 EST

Resolved

While performing maintenance on a backend service to increase capacity, we saw intermittent errors between 2:28 PM and 2:30 PM EST and again between 2:39 PM and 2:41 PM EST. Our internal monitors detected the issue, and we resolved it quickly.

We have since seen the system return to normal, and we do not expect any further customer impact. This incident is considered resolved.

We apologize for any inconvenience this may have caused.

Posted Dec 11, 2024 - 14:30 EST