At 2021-10-29 1:00AM UTC, requests to the Spreedly Core API intermittently returned 500 error responses for approximately 5 minutes. An automated recovery of Spreedly’s internal systems was triggered, and all systems resumed normal operations at approximately 1:05AM UTC.
At 2021-10-29 1:02AM UTC, internal monitoring detected an elevated number of error responses from the Spreedly Core API. Engineers were paged and began investigating. The issue arose because a dependent internal system became partially unavailable beginning at 1:00AM UTC. An automated antivirus scan that runs on this dependent system constrained resources on a subset of hosts. As a result, the automated health check process deemed those hosts “unhealthy” and removed them from service. New hosts were automatically brought into service, and normal operations resumed at approximately 1:05AM UTC.
Approximately 4,500 requests received a 500 error response during this time.
Spreedly engineers have made changes to mitigate the effects of the automated antivirus scan such that it should no longer cause the system to become unresponsive.