For approximately 79 minutes, between 06:04 UTC and 07:22 UTC on Wednesday March 7th, all attempted transactions to Payeezy/FirstData GGE4 failed with an SSL connection error.
If you have a record of these transaction failures, you can retry them without risk of double charging, because the requests were never received by FirstData.
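As a rough illustration of how such a retry list might be assembled, the sketch below filters a set of recorded failures down to those that fall inside the outage window and failed at the connection level. The record shape (`id`, `attempted_at`, `error_kind`) and the `ssl_connection_error` label are hypothetical and will differ in your own logs. The year is not stated in this post, so `YEAR` is a placeholder you would need to set yourself.

```python
from datetime import datetime, timezone

# Assumption: this post does not state the year; replace with the incident year.
YEAR = 2018


def retry_candidates(failures, window_start, window_end):
    """Select failures inside the window that never reached the gateway."""
    return [
        f for f in failures
        if window_start <= f["attempted_at"] <= window_end
        and f["error_kind"] == "ssl_connection_error"  # hypothetical label
    ]


# Outage window as reported above, in UTC.
start = datetime(YEAR, 3, 7, 6, 4, tzinfo=timezone.utc)
end = datetime(YEAR, 3, 7, 7, 22, tzinfo=timezone.utc)

# Hypothetical records standing in for your own failure log.
failures = [
    {"id": "a", "attempted_at": datetime(YEAR, 3, 7, 6, 30, tzinfo=timezone.utc),
     "error_kind": "ssl_connection_error"},
    {"id": "b", "attempted_at": datetime(YEAR, 3, 7, 6, 45, tzinfo=timezone.utc),
     "error_kind": "card_declined"},
    {"id": "c", "attempted_at": datetime(YEAR, 3, 7, 8, 0, tzinfo=timezone.utc),
     "error_kind": "ssl_connection_error"},
]

to_retry = [f["id"] for f in retry_candidates(failures, start, end)]
```

Only connection-level failures are selected here: a decline or other gateway response means the request did reach the gateway, so it falls outside the guarantee described above.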
At 06:00 UTC (01:00 EST) on Wednesday March 7th, First Data rotated the SSL certificates used for, among other services, their Global Gateway E4 (GGE4) API. This was an expected change properly communicated to Spreedly, and we had already made the necessary adjustments to seamlessly transition to the new certificate.
However, due to the remediation from a prior incident, Spreedly was using an alternative API host for FirstData GGE4, and that host did not transition seamlessly to the new certificate. For roughly one hour and 19 minutes, we experienced a complete outage for all FirstData GGE4 transactions because the transitory certificate was not trusted by our production servers.
At 07:22 UTC we were able to update our FirstData GGE4 integration to use the default API host, which had been properly transitioned to the new SSL certificate. All transactions attempted from this point onward were successfully sent to FirstData.
This incident is a reminder of the unintended consequences of even well-intended actions. We originally transitioned to an alternative API host to work around a DNS resolution issue we were experiencing due to the interaction between FirstData DNS TTLs and our data center’s DNS caching settings. However, by doing so, we departed from our standard integration settings and found ourselves using a host that exhibited non-standard behavior.
After conducting an internal post-mortem, we identified two areas of improvement. The first is increasing our internal education around incident response handling. There were too many gaps in fielding the initial alert, triaging, and transferring responsibility to other teams. Each gap was successfully handled by our fallback processes and tooling, but several 5-10 minute delays can really add up. Every minute spent in the machinations of our response means another minute your business isn’t collecting revenue, and we take that failure seriously.
Second, after re-evaluating the original DNS issue remediation, we have decided that the risks of staying on a secondary host outweigh the risks of encountering the original DNS issue with FirstData (FirstData has also upgraded their DNS service since). We are currently integrated with the primary FirstData GGE4 host and plan to remain there. We have also updated our internal process for verifying gateway SSL certificate rotations to include a host confirmation step.
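To make the idea of a host confirmation step concrete, here is a minimal sketch, not a description of our actual tooling. It assumes the check has two parts: confirming that every host the integration really connects to was included in the rotation check, and confirming that each of those hosts presents a certificate the local trust store accepts. The function names and the `example` hostnames are hypothetical.

```python
import socket
import ssl


def certificate_is_trusted(host, port=443, timeout=5.0):
    """Return True if a TLS handshake with `host` passes default verification."""
    # The default context checks the certificate chain against the system
    # trust store and verifies that the certificate matches the hostname.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        # ssl.SSLError, timeouts, and DNS failures are all OSError subclasses.
        return False


def hosts_missing_confirmation(hosts_in_use, hosts_checked):
    """Hosts the integration connects to that the rotation check did not cover."""
    return sorted(set(hosts_in_use) - set(hosts_checked))


# Hypothetical hostnames: the integration uses an alternative host,
# but only the default host was verified ahead of the rotation.
in_use = ["api-alt.gateway.example"]
checked = ["api.gateway.example"]
unconfirmed = hosts_missing_confirmation(in_use, checked)
```

In this sketch, any entry in `unconfirmed` would be passed to `certificate_is_trusted` from a production-like environment before the rotation is considered verified, since trust stores can differ between machines.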
This incident has highlighted the implicit costs incurred when our integration diverges from the standard integration settings, costs borne by all parties involved with the integration. We are always evaluating the risks and tradeoffs associated with our technical decisions, and will take the lessons of this incident to heart as we move forward.
Thank you for your patience while we addressed the issue, and our apologies for the disruption it caused to your business.