On December 30th, 2022, at approximately 18:00 UTC, our secondary index database was heavily throttled by our Database Service Provider. This resulted in intermittent 500 errors across multiple endpoints and limited access to secondary services such as Dashboard and ID.
Spreedly presently maintains a “secondary index” database, separate from the main transaction processing database, that facilitates reporting (dashboard), data analytics, and some transaction flows that make use of show endpoints. This ensures that the primary money-moving transactions (i.e., Vaulting, Payment Gateway, and Receiver Transactions) are always the foremost concern. Spreedly engages a Database Service Provider for this secondary index database.
After being assured by our DB Service Provider in February 2022 that we would not run afoul of any DB sizing limitations or constraints (“plan size is not enforced for our largest plans”, to which Spreedly subscribes), we nonetheless had our access disabled (ALTER ROLE "[REDACTED]" CONNECTION LIMIT 0) at 18:52 UTC on December 30, 2022.
As we understand it now, this was an automated control implemented by our DB Service Provider: as our database naturally grew and shrank during the course of its operations, we would cross above and below this threshold (~12.7TB), and access would be restricted and then restored, as it was again at 18:57 UTC.
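For illustration, the kind of provider-side control at play here can be expressed in PostgreSQL (the dialect implied by the ALTER ROLE statement above); the role name below is a placeholder, since the actual role was redacted:

```sql
-- A role's connection limit can be set to zero to block all new connections;
-- existing sessions remain until they disconnect. This mirrors the automated
-- control described above (the role name is a placeholder).
ALTER ROLE "app_role" CONNECTION LIMIT 0;   -- access restricted
ALTER ROLE "app_role" CONNECTION LIMIT -1;  -- access restored (-1 = no limit)

-- The current limit for a role is visible in the pg_roles catalog:
SELECT rolname, rolconnlimit FROM pg_roles WHERE rolname = 'app_role';
```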
During the intervals when the connection limit disabled our access, customers were unable to use show endpoints, and dashboard access may have been impaired. Vaulting, Payment Gateway, and Receiver Transactions (that did not rely on these endpoints as part of their transaction flow) were otherwise not impacted.
We ceased further writes to the database while we worked with the DB Service Provider to understand the issue and reclaim (over the course of the incident) 800GB of database storage, reducing our DB size to ~11.9TB. This resizing was performed through the optimization of our DB usage (primarily indexes) without the loss or deletion of any data.
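As a rough sketch of how that kind of index-driven reclamation can be approached in PostgreSQL (index names below are hypothetical, not our actual schema):

```sql
-- Find the largest indexes, which are the best candidates for optimization:
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 10;

-- Rebuilding a bloated index reclaims its dead space without touching row data;
-- CONCURRENTLY (PostgreSQL 12+) avoids blocking normal reads and writes:
REINDEX INDEX CONCURRENTLY transactions_token_idx;

-- Indexes that are never scanned can be dropped outright, again with no data loss:
DROP INDEX CONCURRENTLY unused_reporting_idx;
```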
The resize process itself caused the overall DB size to fluctuate above and below the automated cut-off limit, and it took time working with our DB Service Provider before they could hard-code a connection limit for us that allowed both normal application function and our maintenance to proceed. This meant that we experienced DB connectivity issues (and the corresponding customer impacts to show endpoints) during the following additional times:
| START (UTC) | STOP (UTC) | DURATION |
| --- | --- | --- |
| 2022-12-30 19:17 | 2022-12-30 19:19 | 2 minutes |
| 2022-12-30 19:23 | 2022-12-30 19:25 | 2 minutes |
| 2022-12-30 19:28 | 2022-12-31 01:12 | 5 hours, 43 minutes |
| 2022-12-31 04:40 | 2022-12-31 05:35 | 55 minutes |
| 2022-12-31 12:19 | 2022-12-31 13:41 | 1 hour, 22 minutes |
Note: The table above was edited to correct a factual error on January 26th, 2023.
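As a point of reference for the size fluctuations described above, the overall database size can be compared against such a cut-off with a simple query (a sketch, assuming PostgreSQL; the ~12.7TB threshold is the provider's, not something exposed by the database itself):

```sql
-- Report the current size of the connected database in human-readable form:
SELECT pg_size_pretty(pg_database_size(current_database())) AS current_db_size;
```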
After our DB resize efforts were complete, we worked to backfill the backlog of data into the database so that dashboard and data analytics once again reflected current transactional data.
We are taking the following actions to ensure that this does not occur again: