Core API errors
Incident Report for Spreedly

Incident Summary

On Oct 5th, 2017, at approximately 12:43 UTC, Spreedly experienced intermittent API request failures, which resulted in a 500 response code being returned for some requests. These failures began when an event-tracking component failed to authenticate against an external service that ingests our system events. Once the situation was diagnosed, the degraded component was disabled to prevent additional authentication attempts, and service was fully restored. In total, ~2% of all requests within the 13-minute window of degraded performance failed. Please read on to understand the details of this incident and what we are doing to prevent similar incidents in the future.

What Happened

On July 31st, 2017, the external service ingesting our system events announced a rotation of the SSL certificates used to authenticate against their service. During this rotation period there would be two valid certificates for authentication: the current certificate and a new certificate that would eventually replace it. We made note of the upcoming rotation and planned accordingly for the applications connecting at that time, but failed to update the certificate on the hosts where the event-tracking component was ultimately deployed. The event-tracking component operated for some time with the expiring certificate, and on Oct 5th the external service completed the rotation and decommissioned the old certificate, rendering the certificate in use invalid.

The event-tracking component used in our system is built to run in a background thread alongside the main application thread. This allows system events to be delivered in parallel with request processing and was meant to provide additional resiliency against unexpected errors or connectivity issues. However, in this particular incident, it did not protect our customers' requests as we had designed and intended.

All requests to the Spreedly API generate new events, which are buffered and eventually delivered in batches. This allows multiple requests to fully complete before a batch of events is delivered. Although the buffering and delivery run in background threads, the spawning of these threads was incorrectly configured to abort all threads, including the main execution thread, in the event of an unhandled exception. Many requests were able to fully complete, but any request that was still processing when a background thread attempted to deliver a batch of events was forced to exit once the SSL certificate verification exception was raised.
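The buffering behavior described above can be illustrated with a minimal Python sketch. It is not our production code: the `EventBuffer` class, the `batch_size` of 3, and the in-memory `delivered` list standing in for the external service are all assumptions made for illustration, and the background threads are left out to keep the example small.

```python
from typing import Callable, List


class EventBuffer:
    """Hypothetical buffer: collects events and hands them to `deliver` in batches."""

    def __init__(self, deliver: Callable[[List[dict]], None], batch_size: int = 3):
        self.deliver = deliver
        self.batch_size = batch_size
        self.pending: List[dict] = []

    def add(self, event: dict) -> None:
        # Each API request adds an event; delivery only happens once a
        # full batch has accumulated.
        self.pending.append(event)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        self.deliver(batch)


# Stand-in for the external ingestion service: record each delivered batch.
delivered: List[List[dict]] = []
event_buffer = EventBuffer(delivered.append, batch_size=3)

# Seven requests produce seven events: two full batches are delivered
# and one event is still waiting in the buffer.
for request_id in range(7):
    event_buffer.add({"request_id": request_id})
```

Because delivery is deferred until a batch fills, most requests finish before any delivery attempt is made. Only requests still in flight at the moment of a delivery attempt were exposed to the failure.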

Next Steps

There were a few missteps that led to this particular incident, and we'll be addressing them in parallel to ensure this does not become a recurring issue for us. First, we'll be reviewing our process for managing authentication credentials for our external services, ensuring we have clear visibility into upcoming rotations. Second, we'll be adjusting our event-tracking component to be fully independent, so that a failure in the background thread cannot affect the main execution thread. Finally, we will continue to refine our development practices to account for the variety of failure scenarios that exist in a complex system such as ours.
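As a rough illustration of the second item, the following Python sketch shows one way to keep delivery failures inside the background thread. Everything here is hypothetical: `IsolatedEventWorker`, `failing_deliver`, and `handle_request` are names invented for the example, and the raised `ssl.SSLError` merely simulates a certificate verification failure. Note that Python does not stop the main thread when a background thread raises, so the explicit `try`/`except` here mainly shows the containment pattern and keeps the worker alive for later batches.

```python
import queue
import ssl
import threading
from typing import Callable, List


class IsolatedEventWorker:
    """Hypothetical worker: delivers batches in a background thread and
    contains every delivery error inside that thread."""

    def __init__(self, deliver: Callable[[List[dict]], None], batch_size: int = 2):
        self.deliver = deliver
        self.batch_size = batch_size
        self.events: queue.Queue = queue.Queue()
        self.failed_batches = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def track(self, event: dict) -> None:
        # Called from the request path: only enqueues, never raises delivery errors.
        self.events.put(event)

    def close(self) -> None:
        # A None sentinel tells the worker to flush what is left and stop.
        self.events.put(None)
        self._thread.join()

    def _run(self) -> None:
        batch: List[dict] = []
        while True:
            event = self.events.get()
            if event is None:
                break
            batch.append(event)
            if len(batch) >= self.batch_size:
                self._deliver_safely(batch)
                batch = []
        if batch:
            self._deliver_safely(batch)

    def _deliver_safely(self, batch: List[dict]) -> None:
        try:
            self.deliver(batch)
        except Exception:
            # Count the failure and carry on; request processing is unaffected.
            self.failed_batches += 1


def failing_deliver(batch: List[dict]) -> None:
    # Simulates the external service rejecting an invalid certificate.
    raise ssl.SSLError("certificate verify failed")


def handle_request(worker: IsolatedEventWorker, request_id: int) -> int:
    worker.track({"request_id": request_id})
    return 200  # the request completes regardless of delivery problems


worker = IsolatedEventWorker(failing_deliver, batch_size=2)
statuses = [handle_request(worker, i) for i in range(5)]
worker.close()
```

In this sketch every delivery attempt fails, yet all five simulated requests still return 200, and the failures are visible only through the worker's `failed_batches` counter. A real component would also log or alert on those failures rather than silently counting them.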

Conclusion

We designed our event-tracking component with the intent that customer requests would not fail should the component itself experience connectivity issues. However, our design had a flaw and did not protect your requests as expected, and for that we apologize.

Posted Oct 11, 2017 - 10:58 EDT

Resolved
We have resolved the issue and all systems are fully functional. We will be publishing a post-mortem for this incident once we understand the sequence of events that caused this outage (currently scheduled for Thursday, October 12th).
Posted Oct 05, 2017 - 09:54 EDT
Monitoring
We've implemented the fix, and everything appears to be back to normal. We're monitoring to ensure there are no lingering issues.

Initial investigation indicates ~2% of API calls were affected over a 15-minute period. As always, we'll post a full post-mortem within 2-3 business days.
Posted Oct 05, 2017 - 09:07 EDT
Identified
We've identified what we think is the cause of the errors, and are working on a fix right now.
Posted Oct 05, 2017 - 09:01 EDT
Investigating
We are currently investigating an increase in SSL connection errors.
Posted Oct 05, 2017 - 08:55 EDT