Secondary API 50X Errors
Incident Report for Spreedly
Postmortem

Summary

Around 19:30 UTC on Tuesday, October 24th 2017, and for a period of about two hours, Spreedly returned 500 response codes for requests to list created gateways and requests to show or list transactions. This was due to maintenance work on a secondary data store that resulted in an individual query running for much longer than expected. At no point were revenue-affecting endpoints down or inaccessible.

Next Steps

We’ll be evaluating and adjusting how we measure the activity and performance of queries to our secondary data store, so that we have a clear understanding of current load before performing maintenance. Additionally, we’ll be updating our documentation to distinguish our revenue-affecting API endpoints from secondary concerns such as reporting, to provide more clarity on the specific impact of degraded API performance.

Conclusion

At Spreedly, our priority is to ensure you are able to collect money from your customers and execute the financial transactions that power your business. We prioritize the uptime of revenue-affecting API calls over other secondary concerns like reporting and visualization, though we also acknowledge that many businesses do rely on these abilities for some of their business processes. While we don’t consider the endpoints in this incident to be revenue-affecting, we realize the incident had the potential to disrupt transaction flow, depending on the specific integration points with Spreedly, and we apologize for the inconvenience posed by their degradation.

Posted about 1 year ago. Oct 27, 2017 - 13:51 EDT

Resolved
The degraded component has been successfully rebuilt and we have rolled back the temporary fix from yesterday's incident. All "referencing_transaction" API elements are now being properly populated and the API is fully functional. It's important to note that this incident was isolated to the secondary API and the ability to transact was never disrupted.

We apologize for the inconvenience and will be publishing a post-mortem later this week.
Posted about 1 year ago. Oct 25, 2017 - 09:30 EDT
Update
We have identified the cause of the issue and have deployed a temporary fix which will have the side effect of returning empty "referencing_transaction" elements in some secondary transaction API responses, such as "/v1/transactions/key.json" (docs at https://docs.spreedly.com/reference/api/v1/#show45).

We expect the rebuilding of the degraded component to take some time, potentially overnight. We will monitor its progress and reinstate the "referencing_transaction" field once we can confirm that normal operation of the degraded component has resumed.
Posted about 1 year ago. Oct 24, 2017 - 17:43 EDT
Update
We are testing the temporary fix and will deploy it after confirming it can be deployed without side effects.
Posted about 1 year ago. Oct 24, 2017 - 16:46 EDT
Monitoring
We've identified the issue causing our secondary API to return 50x results and are instituting a temporary fix while we rebuild the necessary components to completely resolve the issue.
Posted about 1 year ago. Oct 24, 2017 - 16:46 EDT
Update
We are still investigating. However, we've seen some customers making a GET API call (e.g. gateways.xml) prior to each request. This is an anti-pattern and should be avoided. If you have a similar setup, as a workaround, please remove that GET call from your request flow and your transactions should process as normal. We'll continue to update.
Posted about 1 year ago. Oct 24, 2017 - 16:27 EDT
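
For integrations that do need the gateway list, one way to avoid the per-request GET described in the update above is to cache the list and refresh it only occasionally. The following is a minimal sketch, not Spreedly client code: the class name `CachedGatewayList`, the injected `fetch` callable (standing in for whatever performs the GET to the gateways listing), and the 300-second refresh interval are all assumptions made for illustration.

```python
import time
from typing import Callable, List, Optional


class CachedGatewayList:
    """Hypothetical helper: caches a gateway listing between transactions."""

    def __init__(
        self,
        fetch: Callable[[], List[str]],  # assumed: performs the actual GET
        ttl_seconds: float = 300.0,      # assumed refresh interval
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[List[str]] = None
        self._fetched_at = 0.0

    def get(self) -> List[str]:
        now = self._clock()
        stale = self._value is None or now - self._fetched_at >= self._ttl
        if stale:
            try:
                self._value = self._fetch()
                self._fetched_at = now
            except Exception:
                # The secondary API call failed. If an earlier copy exists,
                # keep using it (a refresh is attempted again on the next
                # call); with no earlier copy there is nothing to fall back on.
                if self._value is None:
                    raise
        return self._value
```

With this arrangement the transaction path reads from the cached copy, so a failing gateway listing only matters when no copy has ever been fetched.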
Update
We've determined that no transactions are affected; however, secondary API calls, including Spreedly Insights, are affected. We are continuing to investigate and will provide updates as we have them.
Posted about 1 year ago. Oct 24, 2017 - 16:07 EDT
Investigating
We're currently experiencing 50x errors on a small percentage of requests. We're investigating and will update when we know more.
Posted about 1 year ago. Oct 24, 2017 - 15:59 EDT