Elevated Rate of 500 Errors

Incident Report for Spreedly

Postmortem

During routine system maintenance, customer traffic was inadvertently sent to a cluster of systems that were still in the process of rebooting. For about four minutes, some customers' transactions failed while those systems were brought fully online.

What Happened

To maximize overall system availability, some core Spreedly components are split up into separate clusters. An Application Load Balancer (ALB) normally directs customer traffic to all clusters, and to just one cluster (the “Active Cluster”) during system maintenance. Typical system maintenance is performed programmatically and coordinates with the ALB in four steps:

1. Remove the soon-to-be “Inactive Cluster” from receiving new customer transactions
2. Wait for all existing customer transactions to be processed
3. Wait for system maintenance to be completed
4. Mark the cluster as “Active” again and start sending it a balanced share of customer transactions
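
The ordering of these steps can be illustrated with a minimal sketch. It is not Spreedly's actual tooling: `FakeLoadBalancer`, `Cluster`, `run_maintenance`, `drain`, and `maintain` are hypothetical names, and the load balancer is an in-memory stand-in rather than a real ALB API. The point it shows is that reactivation is gated on the earlier steps having actually finished.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Cluster:
    name: str
    in_flight: int = 0               # transactions still being processed
    maintenance_done: bool = False


@dataclass
class FakeLoadBalancer:
    """Hypothetical in-memory stand-in for the ALB (not a real AWS API)."""
    active: Set[str] = field(default_factory=set)

    def deregister(self, cluster: Cluster) -> None:
        self.active.discard(cluster.name)

    def register(self, cluster: Cluster) -> None:
        self.active.add(cluster.name)


def run_maintenance(
    lb: FakeLoadBalancer,
    cluster: Cluster,
    drain: Callable[[Cluster], None],
    maintain: Callable[[Cluster], None],
) -> List[str]:
    """Run the four steps in order and refuse to reactivate early."""
    log: List[str] = []

    lb.deregister(cluster)           # step 1: stop new traffic
    log.append("deregistered")

    drain(cluster)                   # step 2: let existing work finish
    if cluster.in_flight != 0:
        raise RuntimeError("transactions still in flight")
    log.append("drained")

    maintain(cluster)                # step 3: perform the maintenance
    if not cluster.maintenance_done:
        raise RuntimeError("maintenance not finished; cluster stays inactive")
    log.append("maintained")

    lb.register(cluster)             # step 4: resume traffic
    log.append("reactivated")
    return log


def drain(cluster: Cluster) -> None:
    cluster.in_flight = 0            # pretend all in-flight work finished


def maintain(cluster: Cluster) -> None:
    cluster.maintenance_done = True  # pretend the reboot completed


lb = FakeLoadBalancer(active={"cluster-a", "cluster-b"})
cluster_b = Cluster("cluster-b", in_flight=3)
steps = run_maintenance(lb, cluster_b, drain, maintain)
```

In this sketch, a `maintain` callback that returns without setting `maintenance_done` raises an error before step 4, so the cluster stays out of rotation.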

In this instance, we inadvertently performed the first step manually (instead of using the programmatic “infrastructure as code” method). This left the remaining programmatic ALB operations out of sync, so the cluster was marked as “Active” (step 4) before system maintenance had completed (step 3).
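
As an illustration only, one way this kind of mismatch could be detected is to compare the state the automation has recorded against the state the load balancer actually reports, and to block reactivation when they differ. The report does not describe how Spreedly's tooling tracks state, so `detect_drift` and `safe_to_reactivate` below are hypothetical helpers, and the cluster names are made up.

```python
from typing import Iterable, List


def detect_drift(expected_active: Iterable[str], actual_active: Iterable[str]) -> List[str]:
    """Return the clusters whose real state differs from the recorded state."""
    return sorted(set(expected_active) ^ set(actual_active))


def safe_to_reactivate(
    expected_active: Iterable[str],
    actual_active: Iterable[str],
    maintenance_done: bool,
) -> bool:
    """Allow step 4 only if maintenance finished and no drift was found."""
    return maintenance_done and not detect_drift(expected_active, actual_active)


# The automation never ran step 1, so it still records both clusters as active,
# while a manual change has already removed cluster-b from the load balancer.
recorded = {"cluster-a", "cluster-b"}
reported = {"cluster-a"}

drift = detect_drift(recorded, reported)
ok = safe_to_reactivate(recorded, reported, maintenance_done=False)
```

Here `drift` names the manually changed cluster and `ok` is false, so a pipeline using such a check would stop rather than proceed to step 4.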

Some customer transactions were affected between 21:01 and 21:05 UTC. Impacted transactions were those that received a “500 Internal Server Error” response.

Next Steps

We have reinforced existing processes for system maintenance.
The “runbook” for performing system maintenance has been enhanced with a cautionary note regarding this incident, presenting clearer directions to the specific programmatic steps necessary to complete maintenance without service degradation or interruption.

Posted Oct 14, 2020 - 09:16 EDT

Resolved

At approximately 9:01 PM UTC, we identified an issue causing an elevated rate of 500 errors on Spreedly's Core API.

This impacted some transactions and requests to Spreedly's API.

A fix was quickly deployed and the system was operational by ~9:05 PM UTC.

We apologize for any inconvenience and disruption to service.

Posted Oct 02, 2020 - 17:00 EDT