First Data (GGE4) SSL connection errors
Incident Report for Spreedly
Postmortem

For a period of 79 minutes, between 06:04 UTC and 07:22 UTC on Wednesday March 7th, all attempted transactions to Payeezy/FirstData GGE4 failed with an SSL connection error.

If you have a record of these failed transactions, you can retry them without risk of double charging, because the requests never reached FirstData.
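As a rough illustration of how the affected attempts might be picked out of your own logs, here is a minimal Python sketch. The record fields (`gateway`, `succeeded`, `error`, `attempted_at`) and the values `"gge4"` and `"ssl_connection"` are hypothetical placeholders for whatever your system stores. They are not Spreedly API fields. The window comes from the times stated above, with the year taken from this report's posting date.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Mapping

# Outage window as stated in this report (UTC).
OUTAGE_START = datetime(2018, 3, 7, 6, 4, tzinfo=timezone.utc)
OUTAGE_END = datetime(2018, 3, 7, 7, 22, tzinfo=timezone.utc)


def retry_candidates(records: Iterable[Mapping]) -> List[Mapping]:
    """Return failed GGE4 attempts inside the outage window.

    Assumes each record is a mapping with hypothetical keys:
      gateway      -- "gge4" for Payeezy/FirstData GGE4 (assumed label)
      succeeded    -- bool
      error        -- "ssl_connection" for the failure described here
      attempted_at -- timezone-aware datetime (naive values raise TypeError)
    """
    return [
        r for r in records
        if r["gateway"] == "gge4"
        and not r["succeeded"]
        and r["error"] == "ssl_connection"
        and OUTAGE_START <= r["attempted_at"] <= OUTAGE_END
    ]
```

Filtering on the error type as well as the time window keeps the retry list limited to requests that failed before reaching FirstData, which is the condition that makes a retry safe.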

What Happened

At 06:00 UTC (01:00 EST) on Wednesday March 7th, First Data rotated the SSL certificates used for, among other services, their Global Gateway E4 (GGE4) API. This was an expected change properly communicated to Spreedly, and we had already made the necessary adjustments to seamlessly transition to the new certificate.

However, as part of the remediation for a prior incident, Spreedly was using an alternative API host for FirstData GGE4, and that host did not transition to the new certificate seamlessly. For one hour and 19 minutes, all FirstData GGE4 transactions failed because the transitory certificate was not trusted by our production servers.

At 07:22 UTC we updated our FirstData GGE4 integration to use the default API host, which had been properly transitioned to the new SSL certificate. All transactions attempted from that point on were successfully sent to FirstData.

This incident is a reminder of the unintended consequences of even well-intended actions. We originally moved to an alternative API host to work around a DNS resolution issue caused by the interaction between FirstData DNS TTLs and our data center’s DNS caching settings. However, by doing so, we departed from our standard integration settings and found ourselves using a host that behaved in a non-standard way.

Next Steps

After conducting an internal post-mortem, we have identified two areas for improvement. The first is increasing our internal education around incident response handling. There were too many gaps in fielding the initial alert, triaging, and transferring responsibility to other teams. Each gap was successfully handled by our fallback processes and tooling, but several 5-10 minute delays add up quickly. Every minute spent in the mechanics of our response is another minute your business isn’t collecting revenue, and we take that failure seriously.

Second, after re-evaluating the original DNS issue remediation, we have decided that the risks of staying on a secondary host outweigh the risks of encountering the original DNS issue with FirstData (FirstData has also upgraded their DNS service since then). We are currently integrated with the primary FirstData GGE4 host and plan to remain there. We have also updated our internal process for verifying gateway SSL certificate rotations to include a host confirmation step.
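To illustrate what a host confirmation step could look like, here is a minimal Python sketch. It is an assumption about the general shape of such a check, not a description of Spreedly's internal tooling. The function names and the gateway-to-host mapping are hypothetical. The idea is to run the TLS check against the host each gateway is actually configured to use in production, not against the gateway's documented default host.

```python
import socket
import ssl
from typing import Callable, Dict, List, Optional


def tls_handshake_ok(host: str, port: int = 443,
                     cafile: Optional[str] = None,
                     timeout: float = 5.0) -> bool:
    """Return True if `host` completes a verified TLS handshake.

    create_default_context() enables certificate verification and hostname
    checking; it trusts the system CA store, or `cafile` when one is given.
    """
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        # ssl.SSLError (including verification failures), DNS errors, and
        # timeouts are all OSError subclasses.
        return False


def unconfirmed_gateways(configured_hosts: Dict[str, str],
                         probe: Callable[[str], bool] = tls_handshake_ok
                         ) -> List[str]:
    """Return the names of gateways whose configured host fails the probe.

    `configured_hosts` maps a gateway name to the host production really
    uses (hypothetical structure for this sketch).
    """
    return sorted(name for name, host in configured_hosts.items()
                  if not probe(host))
```

Passing the probe as a parameter lets the same check run against a staging trust store or be exercised offline with a stub. A non-empty result ahead of a scheduled rotation would indicate a configured host that needs attention.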

Conclusion

This incident has highlighted the implicit costs incurred when our integration diverges from the standard integration settings, costs borne by all parties involved with the integration. We are always evaluating the risks and tradeoffs associated with our technical decisions, and we will take the lessons of this incident to heart as we move forward.

Thank you for your patience while we addressed the issue, and our apologies for the disruption it caused to your business.

Posted Mar 09, 2018 - 15:28 EST

Resolved
We are seeing all First Data transactions successfully connecting. This issue is resolved.

We will post a full post-mortem by the end of the week.
Posted Mar 07, 2018 - 02:37 EST
Monitoring
The URL change has been deployed and all First Data transactions are succeeding. We are monitoring the issue and will resolve it if we see no further errors.
Posted Mar 07, 2018 - 02:26 EST
Identified
Spreedly was using an alternative API URL for FirstData, which does not appear to have been updated to the pre-communicated SSL certificate. We are modifying the API URL to the default host to resolve the issue.
Posted Mar 07, 2018 - 02:21 EST
Investigating
All requests to First Data appear to be failing due to an SSL connection error. The First Data certificate was scheduled to be updated today, which we had already accounted for on the Spreedly side. We are investigating why there are failures.
Posted Mar 07, 2018 - 02:04 EST