Intermittent 500 Errors
Incident Report for Spreedly
Postmortem

March 27th, 2024 - ID Server Misconfiguration

A server configuration issue left the ID service unable to handle all incoming requests. This led to intermittent availability of Spreedly services, some transaction impacts, and 500-type errors being served to customers.

What Happened

On 3/27 at 2:30 PM UTC, Spreedly engineers began scheduled maintenance of the Spreedly ID service, which handles authentication and authorization for other Spreedly services. A redeploy of the ID service following the maintenance revealed a latent server misconfiguration, which led to application issues on specific servers in the ID cluster.

At 5:41 PM UTC, the Spreedly Core service began intermittently responding with errors, as the ID service was no longer able to handle all incoming authorization requests.

At 5:47 PM UTC, Spreedly engineers observed the increase in 500s from ID and began investigating. The misconfiguration was discovered and corrected at 6:22 PM UTC, at which point the ID service began to recover. Errors from Spreedly Core services ceased entirely at 6:26 PM UTC.
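
The intervals implied by the timeline above can be worked out from the stated timestamps. The sketch below is illustrative only: the event names are hypothetical labels chosen for this example, and the only data used are the UTC times quoted in this report.

```python
from datetime import datetime

# Timestamps exactly as stated in this report (all UTC, March 27th, 2024).
# The dictionary keys are hypothetical labels, not Spreedly terminology.
FMT = "%Y-%m-%d %I:%M %p"
events = {
    "errors_begin": "2024-03-27 5:41 PM",
    "investigation_begins": "2024-03-27 5:47 PM",
    "fix_applied": "2024-03-27 6:22 PM",
    "errors_cease": "2024-03-27 6:26 PM",
}
times = {name: datetime.strptime(ts, FMT) for name, ts in events.items()}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named events."""
    return int((times[end] - times[start]).total_seconds() // 60)


time_to_detect = minutes_between("errors_begin", "investigation_begins")
time_to_fix = minutes_between("investigation_begins", "fix_applied")
time_to_recover = minutes_between("fix_applied", "errors_cease")
total_impact = minutes_between("errors_begin", "errors_cease")

print(f"detect: {time_to_detect} min, fix: {time_to_fix} min, "
      f"recover: {time_to_recover} min, total: {total_impact} min")
```

By these figures, errors were noticed 6 minutes after they began, the fix took a further 35 minutes, and Core errors lasted 45 minutes in total.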

Next Steps

Spreedly engineers have corrected the configurations that contributed to the application issues and are working to determine additional improvements that would prevent recurrence.

Spreedly engineers are taking steps to lower the alerting thresholds for the ID service to provide earlier notification of similar issues.
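
To illustrate why a lower alerting threshold gives earlier notification, here is a minimal sketch of a sliding-window error-rate alert. Everything in it is an assumption made for illustration: the `ErrorRateAlert` class, the 100-request window, the 10% and 2% thresholds, and the synthetic traffic are hypothetical and do not describe Spreedly's actual monitoring.

```python
from collections import deque


class ErrorRateAlert:
    """Hypothetical alert: fires when the share of 5xx responses among the
    most recent `window` requests reaches `threshold`."""

    def __init__(self, threshold: float, window: int = 100):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # True marks a 5xx response

    def record(self, status_code: int) -> bool:
        """Record one response; return True if the alert is firing."""
        self.recent.append(status_code >= 500)
        return sum(self.recent) / len(self.recent) >= self.threshold


def first_alert_index(threshold: float, statuses: list[int]):
    """1-based index of the request at which the alert first fires, or None."""
    alert = ErrorRateAlert(threshold)
    for i, status in enumerate(statuses, start=1):
        if alert.record(status):
            return i
    return None


# Synthetic traffic (an assumption): 100 healthy requests, after which
# every fifth request fails with a 500.
statuses = [200] * 100 + [500 if i % 5 == 0 else 200 for i in range(200)]

high = first_alert_index(0.10, statuses)  # stricter trigger, fires later
low = first_alert_index(0.02, statuses)   # lowered threshold, fires sooner
print(f"10% threshold fires at request {high}; 2% threshold fires at request {low}")
```

In this toy traffic the lower threshold fires dozens of requests earlier. The trade-off is that a lower threshold is also more sensitive to brief, harmless error spikes.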

Conclusion

We apologize for the disruption of service. Our goal is to maintain a reliable service for our customers, so avoiding these kinds of events is one of our top priorities. We will continue working to improve our services.

Posted Apr 05, 2024 - 13:39 EDT

Resolved
After a period of monitoring, this incident is considered resolved.

We appreciate your patience and apologize for any inconvenience. We will publish a postmortem related to this incident.
Posted Mar 27, 2024 - 14:07 EDT
Monitoring
We identified an issue that resulted in intermittent transaction failures. A fix was implemented at 1:22 EST, and we are monitoring the results.
We apologize for any disruption to service.
Posted Mar 27, 2024 - 13:48 EDT
Investigating
We are currently investigating an issue with access to the Dashboard application and related intermittent 500 errors.

We apologize for any disruption to service.
Posted Mar 27, 2024 - 13:25 EDT
This incident affected: Core Transactional API and Supporting Services (Spreedly Dashboard).