Elevated rate of API Errors

Incident Report for Spreedly

Postmortem

AWS experienced an issue within the region where the Spreedly application is deployed, which caused a disruption in outbound network traffic.

What Happened

Spreedly’s internal systems detected communication issues with a number of gateways, and internal response teams immediately began investigating. As working theories were tested, it became clear that the issues were not limited to communication with gateways but also affected communication between Spreedly systems within AWS.

As this correlation was discovered, other AWS users were also reporting network communication issues. At that point, AWS posted an update on its status page stating that issues had been detected and that it was working to resolve them.

Prior to confirmation of resolution on the AWS status page, our internal systems showed signs of returning to normal. This occurred twice in the span of two hours.

Next Steps

Spreedly has identified the following actions to alert our customers more proactively:

Bolster monitoring and alerting mechanisms for external network communications.
Provide more real-time notification via Status Page when the Spreedly team discovers evidence of service degradation.

Posted Jan 19, 2021 - 11:21 EST

Resolved

AWS has reported no further errors since ~21:45 UTC, which closely aligns with our last reported error at ~21:56 UTC.

Spreedly is considering the issue resolved. We will continue to monitor throughout the evening and provide further updates if and when they become available.

We apologize for any disruption in service.

Posted Dec 18, 2020 - 19:21 EST

Update

We have correlated the errors with an AWS incident impacting certain regions.

The incident caused connectivity issues between our internal systems and with external endpoints, such as gateways and receivers.

We have not seen additional errors since ~21:56 UTC.

We will continue to monitor AWS status and our systems for further issues.

Posted Dec 18, 2020 - 18:05 EST

Monitoring

The API errors have subsided.

The errors occurred between approximately 21:07 and 21:56 UTC.

We are actively monitoring for further errors.

Posted Dec 18, 2020 - 17:09 EST

Investigating

We are seeing another increase in errors.

This is impacting many transactions and requests to Spreedly's API.

Updates will be provided as they become available.

Posted Dec 18, 2020 - 16:20 EST

Monitoring

The API errors have subsided since approximately 20:10 UTC.

We are actively monitoring for further errors.

Posted Dec 18, 2020 - 15:26 EST

Investigating

We are currently investigating an elevated number of errors on Spreedly's API.

This is impacting many transactions and requests to Spreedly's API.

Updates will be provided as they become available.

Posted Dec 18, 2020 - 14:57 EST

This incident affected: Core Transactional API.