Incident Summary

At approximately 19:21 UTC, an incorrectly entered internal credential caused all API requests to fail, resulting in an API outage for all customers until 19:28 UTC, when the change was reverted. We want to acknowledge that this was the second non-trivial API outage in as many weeks, which is not an acceptable level of availability for a service such as ours. Please read on to understand this particular incident, as well as our overall thoughts on these incidents occurring in such close succession.

What Happened

The Spreedly transactional API is made up of several components that work together to perform discrete functions such as authentication, external endpoint calls, and automatic card updates. These components communicate over secure channels, and each pair uses its own authentication credentials, which are rotated regularly as part of our general security processes. During this period’s credential rotation, the credentials shared between the main API component and our internal authentication service were entered incorrectly, causing all API calls to fail immediately.

Our internal authentication system was created when we had few systems to coordinate, and rotating its credentials is thus a manual process. As the scope and number of systems that comprise our environment have grown, the number of manual steps required to rotate credentials has increased to the point where the process has become time consuming, brittle, and prone to manual entry error. To date we have managed this process via explicit documentation, but that is proving to be an insufficient way to manage such a vital and sensitive aspect of our systems.

Next Steps

We will be doing several things in parallel to ensure this doesn’t become a recurring issue. First, we will be reviewing our credential rotation documentation to better structure it into discrete steps which can be more easily followed. Second, we will be looking into automating all, or most, of the rotation process. Finally, we will be considering new approaches to intra-system component authentication in an effort to create a more scalable process that doesn’t increase linearly in complexity with each new component.
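
As a rough illustration of the second item, the sketch below shows one way an automated rotation could generate a credential, apply it to both sides, verify that the two sides agree, and roll back if they do not. It is a minimal sketch only: the Component class and the rotate and rotate_generated functions are hypothetical names invented for this example and do not describe our actual tooling. A production process would also need to verify the new credential before cutting traffic over to it, for example by accepting the old and new credentials side by side for a short overlap period.

```python
import hmac
import secrets
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical stand-in for one side of a component-to-component link."""
    name: str
    credential: str = ""


def credentials_match(client: Component, server: Component) -> bool:
    # Stands in for a real authenticated test call between the two components.
    return hmac.compare_digest(client.credential.encode(), server.credential.encode())


def rotate(client: Component, server: Component,
           client_value: str, server_value: str) -> bool:
    """Apply new values to both sides; restore the old ones if they disagree."""
    previous = (client.credential, server.credential)
    client.credential, server.credential = client_value, server_value
    if credentials_match(client, server):
        return True
    # Verification failed (e.g. a typo on one side): roll back both sides.
    client.credential, server.credential = previous
    return False


def rotate_generated(client: Component, server: Component) -> bool:
    """Generate one value and hand it to both sides, removing manual entry."""
    new_value = secrets.token_urlsafe(32)
    return rotate(client, server, new_value, new_value)


if __name__ == "__main__":
    api = Component("main-api", "old-secret")
    auth = Component("internal-auth", "old-secret")
    # A mistyped value on one side is caught and rolled back.
    print(rotate(api, auth, "new-secret", "new-secert"))
    # A generated value is identical on both sides by construction.
    print(rotate_generated(api, auth))
```

The point of the sketch is the ordering: the check and the rollback are part of the rotation itself rather than a separate manual step in a document.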

Conclusion

We want to take a moment to talk about the two recent incidents which, while not appearing to be similar in nature on the surface, share many commonalities.

When maintaining a system of any complexity, there is an ongoing tension between stability and change. Any time a system change is deployed (which can be anything from simple app deployments to the introduction of complex new machines/services), there’s an increased risk of disruption. Conversely, systems that undergo no changes at all become stale and outdated amidst a changing environment, susceptible to new security exploits and accruing material amounts of technical debt that limits functionality and features. Balancing these competing concerns (stability vs. progress) is one of the primary challenges for any engineering organization.

At Spreedly, because of the nature of our business, these two concerns co-exist in particularly antagonistic ways. Our service directly facilitates the revenue generation for our customers’ businesses which demands the highest uptime possible (hence, stability). But we also store extremely sensitive and valuable credit card information which demands constant vigilance and protection from always-changing security threats (hence, change).

None of this should come across as an excuse. Rather, it is intended to give some insight into the competing concerns we are managing for you on a daily basis. We are always striving to provide extremely high levels of service on both these axes – that is our expectation of ourselves and should be your expectation of us as well. We have failed to find the right balance recently and apologize for these incidents. We can, and will, do better.

Posted Sep 28, 2017 - 16:12 EDT

Resolved

An issue during a regularly scheduled credential rotation caused an increase in errors returned by Spreedly from 18:23-18:29 UTC. Activity has since stabilized and we are monitoring responses. We will publish a post-mortem for this incident once we have more information on the source of the issue.

Posted Sep 25, 2017 - 15:59 EDT