A server configuration issue left the ID service unable to handle all incoming requests, which led to intermittent availability of Spreedly services, some transaction impacts, and HTTP 500 errors being served to customers.
On 3/27 at 2:30 PM UTC, Spreedly engineers began scheduled maintenance of the Spreedly ID service, which handles authentication and authorization for other Spreedly services. A redeploy of the ID service following the maintenance revealed a latent server misconfiguration, which caused application issues on specific servers in the ID cluster.
At 5:41 PM UTC, the Spreedly Core service began intermittently responding with errors, as the ID service was no longer able to handle all incoming authorization requests.
At 5:47 PM UTC, Spreedly engineers observed the increase in 500 errors from ID and began investigating. The misconfiguration was discovered and corrected at 6:22 PM UTC, at which point the ID service began to recover. Errors from Spreedly Core services ceased entirely at 6:26 PM UTC.
Spreedly engineers have corrected the configurations that contributed to the application issues and are working to identify additional improvements that would prevent a recurrence.
Spreedly engineers are also lowering the alerting thresholds for the ID service so that similar issues trigger notifications earlier.
We apologize for the disruption of the service. Our goal is to maintain a reliable service for our customers, so avoiding these kinds of events is one of our top priorities. We will continue working to improve our services.