Transaction Errors
Incident Report for Spreedly
Postmortem

During a seven-minute period on Thursday, August 16th, from 14:28 GMT to 14:35 GMT, a planned modification to our decryption sub-system caused all card-based operations to fail. This included gateway and receiver transactions (purchase, auth, deliver, etc.), card fingerprinting, and an in-progress Account Updater batch run.

What Happened

Spreedly’s systems are in a constant state of growth and evolution. Every day we may be deploying new application code to production, adding new capacity, or adjusting some aspect of our security posture. Maintaining this dynamism is critical to keeping pace with the demands of our customers and the multitude of other external factors we have to be cognizant of.

On Thursday, August 16th, we deployed a modification intended to provide an additional layer of flexibility in how our decryption and fingerprinting sub-system is accessed from other parts of the system. As with all changes, it was deployed incrementally so that we could confirm its efficacy before activating it for production transactions. All tests and confirmation steps indicated that the change did not adversely affect our transaction processing capacity, so it was promoted into active service.

Once the change was promoted, we began seeing auxiliary indications of failures, even though our primary monitoring mechanisms reported the system as up and functional. As additional failure indications surfaced, roughly seven minutes after the change was deployed, we made the decision to revert it, and all systems returned to their normal state. We immediately began to investigate the impact and, about 20 minutes into the investigation, opened the public status incident in the “monitoring” state.

Upon further investigation, we discovered that the built-in failure detection mechanism of the affected components was misconfigured, resulting in health checks that were not indicative of actual system availability. The misconfigured health check obscured the fact that the new change caused an SSL certificate verification error on every request to the cryptographic sub-system. When the change was promoted to handle production traffic, these SSL errors prevented the decryption of all cards and, by extension, caused all transactions and other sensitive data operations to fail during this period.
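
For illustration, the kind of gap described above can be closed with a health check that exercises the same verified TLS path that production traffic uses, rather than only confirming that a process is running. The sketch below is a hypothetical Python example; the hostname, port, and CA bundle path are assumptions for illustration and are not Spreedly's actual configuration.

    import socket
    import ssl

    # Hypothetical values for illustration only; not Spreedly's real configuration.
    CRYPTO_HOST = "decryption.internal.example"
    CRYPTO_PORT = 443
    CA_BUNDLE = "/etc/ssl/internal-ca.pem"

    def deep_health_check(timeout: float = 2.0) -> bool:
        """Report healthy only if a fully verified TLS handshake succeeds."""
        context = ssl.create_default_context(cafile=CA_BUNDLE)
        context.check_hostname = True
        context.verify_mode = ssl.CERT_REQUIRED
        try:
            with socket.create_connection((CRYPTO_HOST, CRYPTO_PORT), timeout=timeout) as sock:
                with context.wrap_socket(sock, server_hostname=CRYPTO_HOST):
                    return True
        except (ssl.SSLError, OSError):
            # A certificate verification failure, like the one in this incident,
            # lands here, so the component reports unhealthy instead of passing.
            return False

    if __name__ == "__main__":
        print("healthy" if deep_health_check() else "unhealthy")

Because the check performs a full, certificate-verified handshake against the dependency, a verification error of the kind described above would have surfaced as an unhealthy status before the change was promoted.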

Impact

During the seven minutes of the incident, all transactions requiring the transmission of card data (including purchases, authorizations, delivers, etc.) failed before being sent to the target gateway or endpoint. These transactions are safe to retry.
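
As a rough illustration of what "safe to retry" can look like in client code, the hedged sketch below retries a failed submission a bounded number of times with exponential backoff. The submit_transaction callable is a placeholder for whatever call your integration already makes; it is not part of Spreedly's API.

    import time

    def submit_with_retry(submit_transaction, max_attempts: int = 3, base_delay: float = 1.0):
        """Call submit_transaction(), retrying with exponential backoff on failure."""
        for attempt in range(1, max_attempts + 1):
            try:
                return submit_transaction()
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the final attempt
                # Wait 1s, 2s, 4s, ... before the next attempt.
                time.sleep(base_delay * 2 ** (attempt - 1))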

The account updater service, which runs in batches, was also disrupted by the incident. The small number of customers whose batches did not complete have been notified directly. Any cards not sent during this run will be sent during the next run, currently scheduled for September 1st.

Finally, any credit cards added during the incident were stored successfully but do not have a fingerprint. We will re-fingerprint these cards and will update this incident when that task is complete.

Next Steps

As a result of this incident we have identified the following items for evaluation and/or investment:

  • Less reliance on low signal-to-noise alerting mechanisms and the introduction of more deterministic error detection tools (a rough sketch follows this list)
  • Better testing procedures to verify the efficacy of a change before it is promoted to handle production transactions
  • Changes in our SSL certificate management to simplify and centralize SSL termination functionality
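
As an illustration of the first item above, a deterministic check can compare the observed failure rate over a fixed window of recent transactions against a hard limit, rather than relying on noisy alert thresholds. The window size and limit below are assumptions chosen for the sketch, not production values.

    from collections import deque

    class FailureRateMonitor:
        """Deterministic check: flag unhealthy when the failure rate over the
        last window_size transactions exceeds max_failure_rate."""

        def __init__(self, window_size: int = 200, max_failure_rate: float = 0.05):
            self.window = deque(maxlen=window_size)
            self.max_failure_rate = max_failure_rate

        def record(self, succeeded: bool) -> None:
            self.window.append(succeeded)

        def is_unhealthy(self) -> bool:
            # Require a full window so the verdict is based on a fixed sample size.
            if len(self.window) < self.window.maxlen:
                return False
            failures = sum(1 for ok in self.window if not ok)
            return failures / len(self.window) > self.max_failure_rate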

Conclusions

There is often a direct tension between change and stability. It is our job at Spreedly to provide a stable service that can be trusted to collect revenue on behalf of your business while also scaling to accommodate your growth. This incident exposed several things we can do to increase our stability without sacrificing our ability to scale with our customers, and we intend to do just that. We apologize for the disruption this period of failure caused.

Posted Aug 20, 2018 - 15:40 EDT

Resolved
We've confirmed that the issue was only present from 14:28 GMT to 14:35 GMT and are marking it as resolved. We are still investigating to understand the specific causes of the incident and will provide a post-mortem within two to three business days.
Posted Aug 16, 2018 - 11:32 EDT
Update
The change that caused this issue was in place from 14:28 GMT to 14:35 GMT. The majority of transactional requests failed during this time.
Posted Aug 16, 2018 - 11:17 EDT
Monitoring
We experienced an error during a regular systems upgrade that resulted in some errors in transaction processing. The change has been reverted and all systems are processing transactions successfully. We are currently gathering the scope of the impact and will provide details as we have them.
Posted Aug 16, 2018 - 11:08 EDT
This incident affected: Core Transactional API.