Between 7:30pm and 9:50pm UTC on January 29th, 2016, Spreedly experienced a period of instability ranging from slightly elevated error rates to a major outage. Any time we cannot service a payment request for our customers is a time we have failed, and we apologize for that. Please read on to understand the details of this incident and what we are doing to prevent similar incidents in the future.
During the week of January 25, 2016, Spreedly began upgrading its primary distributed datastore as part of its annual PCI-DSS recertification. This storage system is designed to operate in a heterogeneous environment, where one node in the cluster may be on a different version than another, and rolling modifications to these storage nodes are a common practice. By Friday, January 29th, 2016, all but one of the storage nodes had been upgraded to the latest version and all metrics were showing a healthy cluster.
However, at around 7:30pm UTC on Friday, our systems started registering a small number of 5xx responses from the Spreedly API. Upon investigation, we observed that our internal authentication service, which is invoked on every API call, was timing out for some requests. Response time metrics from our storage nodes were still showing as healthy, though, leading us to pursue other potential causes, such as the connectivity from the load balancers to the authentication service, and the authentication service itself.
At around 8:00pm UTC we noticed that one of the storage nodes was receiving the vast majority of the requests. Clients to the storage nodes are designed in such a way that they prefer healthy nodes when other nodes are experiencing errors, so this indicated that several of the storage nodes were, in fact, not healthy. We also noticed that the healthy node was the one that had not yet been upgraded.
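To illustrate that behavior, here is a minimal Python sketch (with hypothetical names; this is not our actual client code) of how an error-aware node pool naturally funnels traffic toward whichever nodes have not errored recently:

```python
import time

class NodePool:
    """Toy sketch of an error-aware node pool (hypothetical, for illustration).

    Nodes that have errored recently are skipped in favor of nodes that still
    look healthy, which is how traffic can end up concentrated on one node.
    """

    def __init__(self, nodes, error_window=10.0):
        self.nodes = list(nodes)
        self.error_window = error_window  # seconds to avoid a node after an error
        self.last_error = {}              # node -> time of its most recent error
        self._cursor = 0

    def mark_error(self, node):
        self.last_error[node] = time.monotonic()

    def choose(self):
        now = time.monotonic()
        healthy = [n for n in self.nodes
                   if n not in self.last_error
                   or now - self.last_error[n] > self.error_window]
        candidates = healthy or self.nodes  # fall back if everything has errored
        node = candidates[self._cursor % len(candidates)]
        self._cursor += 1
        return node
```

With the upgraded nodes timing out, a pool like this steers nearly all traffic to the one node that keeps answering, which matches the skew we observed.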
We spent the next ~30 minutes performing diagnostics and cluster modifications in an attempt to understand why our services were avoiding the upgraded nodes when the metrics from those nodes showed healthy response times. At one point we manually queried one of the upgraded nodes and observed that its response time was actually quite high (over one minute, where normal responses are closer to 1ms). At this point we decided to pull all upgraded storage nodes out of rotation so that the only active node was the healthy one that had not yet been upgraded. This is a sub-optimal configuration to run for any amount of time, but it did stabilize the system, and the majority of API requests started responding successfully.
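The kind of spot check we ran looks roughly like the following (a sketch; `read_from_node` stands in for whatever direct-to-node read your storage client provides):

```python
import time

def time_direct_read(node, read_from_node, key="sentinel-key"):
    """Time a single read issued directly against one node, bypassing the pool."""
    start = time.monotonic()
    read_from_node(node, key)  # hypothetical helper for a direct-to-node read
    return time.monotonic() - start

# e.g. print(time_direct_read("upgraded-node-1.internal", read_from_node))
# A healthy node answers in roughly a millisecond; the upgraded nodes
# took over a minute.
```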
For the next hour we sought to understand more about why the upgraded nodes were compromised and continued testing several hypotheses. As part of this process we observed that when connecting to the storage nodes over HTTP, our default connection protocol, response times were unusually high. However, when using protocol buffers, another protocol supported by the storage system, responses were nearly instantaneous. After observing this behavior, our hypothesis was that HTTP connections to the upgraded storage nodes were faulty in some way, and that protocol buffer connections were a potential solution.
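A stripped-down version of that comparison (Python, with hypothetical client handles; the real test depends on the storage client in use) looks like this:

```python
import time

def average_latency(get, bucket, key, repeat=5):
    """Average wall-clock latency of the same read issued several times."""
    start = time.monotonic()
    for _ in range(repeat):
        get(bucket, key)
    return (time.monotonic() - start) / repeat

# http_client and pb_client are hypothetical handles to the same storage node,
# one speaking HTTP and one speaking protocol buffers.
# print("http:", average_latency(http_client.get, "auth", "token-123"))
# print("pbc: ", average_latency(pb_client.get, "auth", "token-123"))
```

In our case the HTTP path was the outlier; the identical read over protocol buffers came back in the expected millisecond range.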
We then started incrementally rolling out protocol buffer support to the various services that connect to the storage nodes, starting with a single instance of the authentication service. That instance, still connecting only to the non-upgraded storage node, was performant with protocol buffers. We then continued to roll out protocol buffer support to the remaining services, again with all services connecting only to the one healthy, non-upgraded node.
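One way to stage that kind of rollout, sketched below with hypothetical configuration names, is to make the protocol a per-instance setting so a single instance can be flipped first and watched before the rest follow:

```python
import os
from dataclasses import dataclass

@dataclass
class StorageConfig:
    host: str
    port: int
    protocol: str  # "http" or "pbc"

def storage_config_from_env():
    """Hypothetical per-instance switch for the storage connection protocol."""
    protocol = os.environ.get("STORAGE_PROTOCOL", "http")
    # Illustrative port numbers only; the real values depend on the datastore.
    port = 8087 if protocol == "pbc" else 8098
    host = os.environ.get("STORAGE_HOST", "storage.internal")
    return StorageConfig(host=host, port=port, protocol=protocol)
```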
At about 9:50pm UTC, once the system had been stable for some time using only protocol buffer based connections, we decided to introduce the upgraded storage nodes back into the cluster. Upon introduction, these nodes started servicing requests and all services continued to report a healthy status.
At this point, with all services healthy and our storage cluster operating with the majority of its nodes in a fully redundant configuration, we marked the issue as resolved.
One of the more frustrating aspects of this incident was that the response times reported by the storage nodes were very different from what our application services were actually experiencing. If we had known earlier that our service layer was timing out due to storage response times, we could have isolated the issue sooner. Instead, our storage service was reporting normal operation, which obscured the actual issue.
In the future we need more visibility, and more accurate visibility, into the operation of our individual components and the system as a whole. We will be taking a critical look across all our operational metrics this week in an effort to remove any blind spots that may exist in other aspects of the system.
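One concrete improvement, sketched below with a hypothetical decorator, is to record storage latency at the call site in each service so that client-observed timings can always be compared against the node-reported metrics:

```python
import functools
import logging
import time

log = logging.getLogger("storage.latency")

def timed_storage_call(fn):
    """Record the latency each *client* observes for a storage call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.monotonic() - start
            # In practice this would feed a metrics/alerting system, not just a log.
            log.info("storage call %s took %.4fs", fn.__name__, elapsed)
    return wrapper

@timed_storage_call
def authenticate_token(token):
    ...  # hypothetical storage-backed lookup
```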
We are also left with the unanswered question of why the latest version of our storage system started exhibiting poor behavior with HTTP connections. We have an open issue with the vendor to investigate this further.
Any time localized system instability manifests as an outage, it is a sign that we can take steps to improve the robustness of our system as well as our ability to recognize these pre-conditions. We take that responsibility seriously and will be taking the measures listed above to provide a better service to our customers. To our customers, we sincerely apologize for this incident.