On Thursday August 16th, during a seven-minute period from 14:28 GMT to 14:35 GMT, a planned modification to our decryption sub-system resulted in all card-based operations failing. This included gateway and receiver transactions (purchase, auth, deliver, etc.), card fingerprinting, and an in-progress Account Updater batch run.
Spreedly’s systems are in a constant state of growth and evolution. Every day we may be deploying new application code to production, adding new capacity, or adjusting some aspect of our security posture. Maintaining this dynamism is critical for us to keep pace with the demands of our customers and the multitude of other external factors we have to be cognizant of.
On Thursday August 16th, we deployed a modification to the system intended to provide an additional layer of flexibility in how our decryption and fingerprinting sub-system is accessed from other parts of the system. The change was deployed incrementally, as we do with all changes, so that we could confirm its efficacy before activating it for use in production transactions. All tests and confirmation steps showed that the change did not adversely affect our transaction processing capacity, and so it was promoted into active service.
Once promoted, we began seeing auxiliary indications of failures, but our primary monitoring mechanisms were reporting as up and functional. As additional failure indications surfaced, roughly seven minutes after the change was deployed, we made the decision to revert and all systems returned to a normal state. We immediately began to investigate the impact and, about 20 minutes into the investigation, opened the public status incident in the “monitoring” state.
Upon further investigation we discovered that the built-in failure detection mechanism of the affected components was misconfigured, resulting in health checks that were not indicative of actual system availability. The misconfigured health check was obscuring the fact that the new changes resulted in an SSL certificate verification error for all requests to the cryptographic sub-system. When these changes were promoted to handle production traffic, the SSL errors prevented the decryption of all cards and, by extension, caused all transactions and other sensitive data operations to fail during this time period.
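To illustrate the kind of health check that would have surfaced this failure, here is a minimal sketch in Python. It is not Spreedly's implementation: the names `HealthResult`, `deep_health_check`, and `tls_probe` are hypothetical, and the sketch assumes the dependency is reached over TLS. The idea is that the check reports healthy only when it completes the same verified connection that production requests would make, so a certificate verification error marks the component as down.

```python
import socket
import ssl
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthResult:
    healthy: bool
    detail: str


def tls_probe(host: str, port: int, timeout: float = 2.0) -> None:
    """Open a TLS connection with certificate and hostname verification.

    Raises ssl.SSLCertVerificationError if the certificate cannot be verified.
    """
    context = ssl.create_default_context()  # verification is enabled by default
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host):
            pass  # a completed handshake is all this probe checks


def deep_health_check(probe: Callable[[], None]) -> HealthResult:
    """Report healthy only if the probe succeeds end to end."""
    try:
        probe()
    except Exception as exc:  # any failure, including SSL verification errors
        return HealthResult(False, f"{type(exc).__name__}: {exc}")
    return HealthResult(True, "ok")
```

A caller might wire this up as `deep_health_check(lambda: tls_probe("crypto.internal.example", 443))`, where the host name is a placeholder. Passing the probe in as a callable keeps the check easy to exercise in tests without a network.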
During the seven minutes of the incident, all transactions requiring the transmission of card data (which includes purchases, authorizations, delivers, etc.) failed before being sent to the target gateway or endpoint. These transactions are safe to retry.
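For customers who want to automate such retries, the following is a rough sketch of a retry loop with exponential backoff, written in Python. Everything here is an assumption for illustration: `TransientFailure` is a hypothetical exception that your own client code would raise for failures known not to have reached the gateway, and `retry_with_backoff` is not part of any Spreedly library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientFailure(Exception):
    """Hypothetical marker for a failure that is known to be safe to retry."""


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or attempts are exhausted.

    The delay doubles after each failed attempt: base_delay, 2 * base_delay, ...
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except TransientFailure:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last failure
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Only failures raised as `TransientFailure` are retried; any other exception propagates immediately, which keeps the loop from repeating a request whose outcome is unknown.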
The account updater service, which runs in batches, was also disrupted by the incident. The small number of customers whose batches did not complete have been notified directly. Any cards not sent during this run will be sent during the next run, currently scheduled for September 1st.
Finally, any credit cards added during the incident were stored successfully, but don’t have a fingerprint. We will be re-fingerprinting these cards and will update this incident when the task has been completed.
As a result of this incident we have identified the following items for evaluation and/or investment:
There is often a direct tension between change and stability. It is our job at Spreedly to provide a stable service that can be trusted to collect revenue on behalf of your business while also scaling to accommodate your growth. This incident exposed several things we can be doing to increase our stability while not sacrificing our ability to scale with our customers, and we intend to do just that. We do apologize for the disruption this period of failure caused.