Spreedly API Errors
Incident Report for Spreedly
Postmortem

Intermittent network failures within Spreedly’s cloud infrastructure led to communication failures between some internal services. Once a core component of the Spreedly infrastructure was unreachable for a period of time, all remaining traffic was refused, resulting in customers receiving a “502” error response to their request. Customers were unable to process transactions for approximately 27 minutes.   

What Happened

To maximize overall system availability and security, some core Spreedly components operate with ephemeral memory storage. The contents of such storage are only available via network connection (not disk). During an intermittent network failure within AWS’s infrastructure, the storage was unavailable and the core system was subsequently unable to process requests. A majority of customers’ transactions were affected between 13:00 UTC and 13:27 UTC. Impacted transactions were those that received a “502 Gateway Unreachable” response.

Approximately an hour later, intermittent network failures within AWS recurred, but this occurrence was limited to a degradation of service (as opposed to a full outage) lasting less than eight minutes. Impacted customers were those that received a “500 Internal Server Error” response between 14:40 UTC and 14:47 UTC.

Actions

  • Improve the automatic failover process and service restart capabilities to build additional resiliency into the application environment.
  • Improve our criteria for considering an incident fully resolved.
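
As a purely illustrative sketch of the kind of automatic failover mentioned in the first action item, the Python below tries a list of backends in order and returns the first successful result. It is not Spreedly's implementation: the names call_with_failover, AllBackendsUnavailable, primary and replica are hypothetical, and the existence of a second backend to fail over to is an assumption made only for this example.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllBackendsUnavailable(Exception):
    """Raised when every candidate backend failed on every attempt."""


def call_with_failover(
    backends: Sequence[Callable[[], T]],
    attempts_per_backend: int = 2,
) -> T:
    """Try each backend in order and return the first successful result.

    Only ConnectionError is treated as a reason to retry or fail over;
    any other exception propagates to the caller unchanged.
    """
    errors: List[Tuple[int, int, ConnectionError]] = []
    for index, backend in enumerate(backends):
        for attempt in range(1, attempts_per_backend + 1):
            try:
                return backend()
            except ConnectionError as exc:
                # Record the failure, then retry or move to the next backend.
                errors.append((index, attempt, exc))
    raise AllBackendsUnavailable(errors)


# Hypothetical stand-ins for a network-backed store and its replica.
def primary() -> str:
    raise ConnectionError("primary store unreachable")


def replica() -> str:
    return "value-from-replica"


result = call_with_failover([primary, replica])
```

In this sketch a request fails only after every backend has been tried, so a single unreachable store does not by itself cause all traffic to be refused.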
Posted Feb 10, 2021 - 16:51 EST

Resolved
All systems appear to be stable and functioning. The incident is being considered resolved.

No additional errors observed since ~14:45 UTC.

We are still investigating to understand the specific cause(s) of the incident.

We apologize for any inconvenience and disruption to service.
Posted Feb 08, 2021 - 13:16 EST
Update
We have not seen any additional errors since ~14:45 UTC. Our preliminary investigation has identified a likely cause.

Details will be provided as they become available.
Posted Feb 08, 2021 - 10:27 EST
Monitoring
We have identified an instance of intermittent 500 errors on Spreedly's Core API.

We are actively monitoring.

Updates will be provided as they become available.
Posted Feb 08, 2021 - 09:47 EST
This incident affected: Core Transactional API.