Spreedly API Errors

Incident Report for Spreedly

Postmortem

December 30th, 2022 - Spreedly API Errors

On December 30th, 2022, at approximately 18:00 UTC, our secondary index database was heavily throttled by our Database Service Provider. This resulted in intermittent 500 errors across multiple endpoints and limited access to secondary services such as Dashboard and ID.

What Happened

Spreedly presently maintains a “secondary index” database, apart from the main transaction processing database, that facilitates reporting (dashboard), data analytics, and some transaction flows that make use of list or show endpoints. This is done to ensure that the primary money-moving transactions (i.e., Vaulting, Payment Gateway, and Receiver Transactions) are always the foremost concern. Spreedly engages a Database Service Provider for this secondary index database.

In February 2022, our DB Service Provider assured us that we would not run afoul of any DB sizing limitations or constraints (“plan size is not enforced for our largest plans”, one of which Spreedly subscribes to). Nonetheless, our access was disabled (ALTER ROLE "[REDACTED]" CONNECTION LIMIT 0) at 18:52 UTC on December 30, 2022.

As we now understand, this was an automated control implemented by our DB Service Provider. As our database naturally grew and shrank during the course of operations, it crossed above and below this threshold (~12.7TB), and access was restricted and then restored, as it was at 18:57 UTC.
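The on/off behavior described above can be illustrated with a minimal sketch. It assumes the provider's control is a simple comparison against the ~12.7TB threshold with no hysteresis; the function names and the sample sizes are hypothetical and are not taken from the provider's actual implementation or from measured data.

```python
# Approximate threshold quoted in the report (TB).
THRESHOLD_TB = 12.7


def connection_allowed(size_tb: float, threshold_tb: float = THRESHOLD_TB) -> bool:
    """Hypothetical model of the automated control: access is cut
    (connection limit 0) whenever the database size exceeds the threshold."""
    return size_tb <= threshold_tb


def access_transitions(samples):
    """Return (label, state) pairs for every change in access state.

    `samples` is an iterable of (label, size_tb) pairs. Access is assumed
    to be allowed before the first sample.
    """
    transitions = []
    previously_allowed = True
    for label, size_tb in samples:
        allowed = connection_allowed(size_tb)
        if allowed != previously_allowed:
            transitions.append((label, "restored" if allowed else "restricted"))
            previously_allowed = allowed
    return transitions


# Hypothetical size samples hovering around the threshold (not real data).
SAMPLES = [("t0", 12.65), ("t1", 12.72), ("t2", 12.69), ("t3", 12.71)]
transitions = access_transitions(SAMPLES)
print(transitions)
```

With a database whose size hovers near the threshold, even small fluctuations flip access off and on, which is consistent with the short restrict-and-restore cycles described in this report.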

During the intervals when the connection limit disabled access, customers were unable to use list or show commands, and dashboard access may have been impaired. Vaulting, Payment Gateway, and Receiver Transactions (that did not rely on these endpoints as part of their transaction flow) were otherwise not impacted.

We ceased further writes to the database while we worked with the DB Service Provider to understand the issue and reclaim (over the course of the incident) 800GB of database storage, reducing our DB size to ~11.9TB. This resizing was performed through the optimization of our DB usage (primarily indexes) without the loss or deletion of any data.
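As a rough check of the figures above, the following sketch works through the arithmetic. It assumes decimal units (800GB = 0.8TB) and that the database was sitting approximately at the ~12.7TB threshold before space was reclaimed; both are assumptions for illustration, and the variable names are hypothetical.

```python
# Figures from the report, expressed in GB (decimal units assumed).
THRESHOLD_GB = 12_700   # ~12.7TB automated cut-off
RECLAIMED_GB = 800      # storage reclaimed during the incident


def size_after_reclaim(start_gb: int, reclaimed_gb: int) -> int:
    """Database size after reclaiming space, in GB."""
    return start_gb - reclaimed_gb


# Assumption: the database was roughly at the threshold when work began.
new_size_gb = size_after_reclaim(THRESHOLD_GB, RECLAIMED_GB)
headroom_pct = 100 * (THRESHOLD_GB - new_size_gb) / THRESHOLD_GB

print(f"size after reclaim: {new_size_gb / 1000:.1f}TB")
print(f"headroom below threshold: {headroom_pct:.1f}%")
```

Under these assumptions the result matches the ~11.9TB stated above and leaves only a single-digit percentage of headroom below the cut-off.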

The resize process itself caused the overall DB size to fluctuate above and below the automated cut-off limit, and it took some time working with our DB Service Provider before they could hard-code a connection limit for us that allowed both normal application function and our maintenance to proceed. As a result, we experienced DB connectivity issues (and the corresponding customer impact on list and show commands) during the following additional times:

START (times UTC)	STOP (times UTC)	DURATION
2022-12-30 19:17	2022-12-30 19:19	2 minutes
2022-12-30 19:23	2022-12-30 19:25	2 minutes
2022-12-30 19:28	2022-12-31 01:12	5 hours, 43 minutes
2022-12-31 04:40	2022-12-31 05:35	55 minutes
2022-12-31 12:19	2022-12-31 13:41	1 hour, 22 minutes

Note: The table above was edited to correct a factual error on January 26th, 2023.
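The windows in the table above can be totalled with a short sketch like the one below. It uses only the START and STOP values as listed (rounded to the minute) and excludes the initial 18:52 to 18:57 UTC interruption described earlier; the names `WINDOWS_UTC` and `window_durations` are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# START/STOP pairs copied from the table (all times UTC).
WINDOWS_UTC = [
    ("2022-12-30 19:17", "2022-12-30 19:19"),
    ("2022-12-30 19:23", "2022-12-30 19:25"),
    ("2022-12-30 19:28", "2022-12-31 01:12"),
    ("2022-12-31 04:40", "2022-12-31 05:35"),
    ("2022-12-31 12:19", "2022-12-31 13:41"),
]


def window_durations(windows):
    """Return a timedelta for each (start, stop) pair."""
    return [
        datetime.strptime(stop, FMT) - datetime.strptime(start, FMT)
        for start, stop in windows
    ]


durations = window_durations(WINDOWS_UTC)
total = sum(durations, timedelta())

for (start, stop), duration in zip(WINDOWS_UTC, durations):
    print(f"{start} -> {stop}: {duration}")
print(f"total: {total}")
```

Because the listed timestamps are rounded to the minute, a computed duration can differ from the DURATION column by about a minute (the third window computes to 5 hours, 44 minutes), presumably because the underlying timestamps include seconds.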

After our DB resize efforts had completed, we worked to bring the backlog of data into the database so that dashboard and data analytics once again represented current transactional data.

Next Steps

We are taking the following actions to ensure that this does not occur again:

Further optimizing DB storage and usage to reclaim additional space below any automated cutoff.
Working with our DB Service Provider to remove any future limit enforcement.
Working on an emergency plan to change DB Service Providers, if needed.
Migrating our DB (already underway) to a new, self-managed DB platform. We expect this effort to complete this calendar quarter (Q1 2023).

Posted Jan 08, 2023 - 15:44 EST

Resolved

After a period of monitoring, we have not seen any recurrence of this issue. We have confirmed that the lagging data has been resolved and all systems are stabilized and functioning normally with current data.
The incident is being considered resolved.

We are still investigating to understand the specific causes of the incident and any residual impact. A post-incident review will be published.

We apologize for the impact this caused to affected customers.

Posted Jan 01, 2023 - 20:29 EST

Update

After a period of monitoring we have confirmed that our database environment remains stabilized and there are no errors impacting customer transactions.

We are working on addressing the lagging data issue with Dashboard and secondary services. We estimate that transaction data will be caught up and current in about 3 hours.

We will continue to update with any changes.

Posted Jan 01, 2023 - 14:07 EST

Monitoring

We experienced a database capacity limit that resulted in intermittent 500s and lagging data in secondary systems. Once identified, we immediately began working with our service provider on a resolution. We’ve been implementing that fix and have stabilized the database such that customers should no longer be experiencing 500s.

However, Dashboard and secondary systems are still lagging as we continue to reduce indexes and ensure the system is fully stabilized. We’re prioritizing system stability to support many year-end customer processes and will address the lagging data issue once we’re confident there are no risks to transaction processing.

Until then, we’re continuing to treat this as a SEV-1 incident with the highest priority as we monitor the continued resolution.

Posted Dec 31, 2022 - 18:14 EST

Update

We continue to make progress on restoring the system to normal operations in a safe and efficient manner.

As we continue to implement the required changes to reach a resolution, users may continue to experience intermittent 500 errors across multiple endpoints, and access to Dashboard and ID may be limited. Some functionality is starting to work again, but this is not yet a full fix.

We will continue to provide regular updates as we work towards a full resolution. We apologize for the impacts this incident has caused.

Posted Dec 31, 2022 - 16:06 EST

Update

We are continuing to make progress on restoring the system to normal operations in a safe and efficient manner.

As we continue to implement the resolution, users may continue to see intermittent 500 errors across multiple endpoints and access to Dashboard and ID may be limited.

We will continue to provide regular updates as we work toward full resolution.

Posted Dec 31, 2022 - 13:19 EST

Update

Posted Dec 31, 2022 - 09:53 EST

Update

We are making progress on getting our system up and running in a safe and efficient manner.

Access to Dashboard and ID may still be limited.

We will provide another update within the next two hours.

Posted Dec 31, 2022 - 07:51 EST

Update

We are making progress on getting our system up and running in a safe and efficient manner.

Access to Dashboard and ID may still be limited.

We will provide another update within the next two hours.

Posted Dec 31, 2022 - 05:40 EST

Update

We are making progress on getting our system up and running in a safe and efficient manner.

Access to Dashboard and ID may still be limited.

We will provide another update within the next two hours.

Posted Dec 31, 2022 - 03:18 EST

Update

We continue working on restoring service in a safe and efficient manner.

Access to Dashboard and ID may still be limited.

We will provide another update within the next two hours.

Posted Dec 31, 2022 - 00:17 EST

Identified

We have identified the cause for the service disruption.

We are currently working to restore service in a safe and efficient manner.

We will provide another update within the next two hours.

Posted Dec 30, 2022 - 22:27 EST

Update

We are continuing to investigate this issue with the assistance of our service provider. Some information in Dashboard and ID may continue to be unavailable.

Updates will be provided within 2 hours.

Posted Dec 30, 2022 - 20:18 EST

Update

Our engineering teams are continuing to work toward a resolution for this issue, pursuing an investigation related to underlying internal connections between some of our components. We have escalated this to our service provider for additional assistance.

Viewing some information in Dashboard and ID may not be possible due to internal dependencies related to this incident.

Posted Dec 30, 2022 - 17:17 EST

Update

We are continuing to investigate this issue.

We have also identified that Spreedly Dashboard may be inaccessible to some users.

Posted Dec 30, 2022 - 15:57 EST

Investigating

We are investigating an issue causing intermittent HTTP 500 errors to GET endpoints on Spreedly's Core API.

Updates will be provided as they become available.

Posted Dec 30, 2022 - 14:51 EST

This incident affected: Core Transactional API and Core Secondary API.