Between 7:30pm and 9:50pm UTC on January 29th, 2016, Spreedly experienced a period of instability ranging from slightly elevated error rates to a major outage. Any time we cannot service a payment request for our customers is a time we have failed, and we apologize for that. Please read on to understand the details of this incident and what we are doing to prevent similar incidents in the future.
During the week of January 25, 2016, Spreedly began upgrading its primary distributed datastore as part of its annual PCI-DSS recertification. This storage system is designed to operate in a heterogeneous environment, where one node in the cluster may be on a different version than another, and rolling modifications to these storage nodes are a common practice. By Friday, January 29th, 2016, all but one of the storage nodes had been upgraded to the latest version and all metrics were showing a healthy cluster.
However, at around 7:30pm UTC on Friday, our systems started registering a small number of 5xx responses from the Spreedly API. Upon investigation, we observed that our internal authentication service, which is invoked on every API call, was timing out for some requests. Response time metrics from our storage nodes were still showing as healthy, though, leading us to pursue other potential causes, such as the connectivity from the load balancers to the authentication service, and the authentication service itself.
At around 8:00pm UTC we noticed that one of the storage nodes was receiving the vast majority of the requests. Clients to the storage nodes are designed in such a way that they prefer healthy nodes when other nodes are experiencing errors, so this indicated that several of the storage nodes were, in fact, not healthy. We also noticed that the healthy node was the one that had not yet been upgraded.
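To illustrate that behavior, here is a minimal Python sketch (with hypothetical names; this is not our actual client code) of how an error-aware node pool naturally funnels traffic toward whichever nodes have not errored recently:

```python
import time

class NodePool:
    """Toy sketch of an error-aware node pool (hypothetical, for illustration).

    Nodes that have errored recently are skipped in favor of nodes that still
    look healthy, which is how traffic can end up concentrated on one node.
    """

    def __init__(self, nodes, error_window=10.0):
        self.nodes = list(nodes)
        self.error_window = error_window  # seconds to avoid a node after an error
        self.last_error = {}              # node -> time of its most recent error
        self._cursor = 0

    def mark_error(self, node):
        self.last_error[node] = time.monotonic()

    def choose(self):
        now = time.monotonic()
        healthy = [n for n in self.nodes
                   if n not in self.last_error
                   or now - self.last_error[n] > self.error_window]
        candidates = healthy or self.nodes  # fall back if everything has errored
        node = candidates[self._cursor % len(candidates)]
        self._cursor += 1
        return node
```

With the upgraded nodes timing out, a pool like this steers nearly all traffic to the one node that keeps answering, which matches the skew we observed.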
We spent the next ~30 minutes performing diagnostics and cluster modifications in an attempt to understand why our services were avoiding the upgraded nodes when the metrics from those nodes showed healthy response times. At one point we manually queried one of the upgraded nodes and observed that its response time was actually quite high (over one minute, where normal responses are closer to 1ms). At this point we decided to pull all upgraded storage nodes out of rotation so that the only active node was the healthy one that had not yet been upgraded. This is a sub-optimal configuration to run for any amount of time, but it did stabilize the system, and the majority of API requests started responding successfully.
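The kind of spot check we ran looks roughly like the following (a sketch; `read_from_node` stands in for whatever direct-to-node read your storage client provides):

```python
import time

def time_direct_read(node, read_from_node, key="sentinel-key"):
    """Time a single read issued directly against one node, bypassing the pool."""
    start = time.monotonic()
    read_from_node(node, key)  # hypothetical helper for a direct-to-node read
    return time.monotonic() - start

# e.g. print(time_direct_read("upgraded-node-1.internal", read_from_node))
# A healthy node answers in roughly a millisecond; the upgraded nodes
# took over a minute.
```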
For the next hour we sought to understand more about why the upgraded nodes were compromised and continued testing several hypotheses. As part of this process we observed that when connecting to the storage nodes over HTTP, our default connection protocol, response times were unusually high. However, when using protocol buffers, another protocol supported by the storage system, responses were nearly instantaneous. After observing this behavior, our hypothesis was that HTTP connections to the upgraded storage nodes were faulty in some way, and that protocol buffer connections were a potential solution.
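A stripped-down version of that comparison (Python, with hypothetical client handles; the real test depends on the storage client in use) looks like this:

```python
import time

def average_latency(get, bucket, key, repeat=5):
    """Average wall-clock latency of the same read issued several times."""
    start = time.monotonic()
    for _ in range(repeat):
        get(bucket, key)
    return (time.monotonic() - start) / repeat

# http_client and pb_client are hypothetical handles to the same storage node,
# one speaking HTTP and one speaking protocol buffers.
# print("http:", average_latency(http_client.get, "auth", "token-123"))
# print("pbc: ", average_latency(pb_client.get, "auth", "token-123"))
```

In our case the HTTP path was the outlier; the identical read over protocol buffers came back in the expected millisecond range.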
We then started incrementally rolling out protocol buffer support to the various services that connect to the storage nodes, starting with a single instance of the authentication service. That instance, still connecting only to the non-upgraded storage node, was performant with protocol buffers. We then continued to roll out protocol buffer support to the remaining services, again with all services connecting only to the one healthy, non-upgraded node.
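One way to stage that kind of rollout, sketched below with hypothetical configuration names, is to make the protocol a per-instance setting so a single instance can be flipped first and watched before the rest follow:

```python
import os
from dataclasses import dataclass

@dataclass
class StorageConfig:
    host: str
    port: int
    protocol: str  # "http" or "pbc"

def storage_config_from_env():
    """Hypothetical per-instance switch for the storage connection protocol."""
    protocol = os.environ.get("STORAGE_PROTOCOL", "http")
    # Illustrative port numbers only; the real values depend on the datastore.
    port = 8087 if protocol == "pbc" else 8098
    host = os.environ.get("STORAGE_HOST", "storage.internal")
    return StorageConfig(host=host, port=port, protocol=protocol)
```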
At about 9:50pm UTC, once the system had been stable for some time using only protocol buffer based connections, we decided to introduce the upgraded storage nodes back into the cluster. Upon introduction, these nodes started servicing requests and all services continued to report a healthy status.
At this point, with all services healthy and our storage cluster operating with the majority of its nodes in a fully redundant configuration, we marked the issue as resolved.
One of the more frustrating aspects of this incident was that the response times reported by the storage nodes were very different from what our application services were actually experiencing. If we had known earlier that our service layer was timing out due to storage response times, we could have isolated the issue sooner. Instead, our storage service was reporting normal operation, which obscured the actual issue.
In the future we need more visibility, and more accurate visibility, into the operation of our individual components and the system as a whole. We will be taking a critical look across all our operational metrics this week in an effort to remove any blind spots that may exist in other aspects of the system.
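One concrete improvement, sketched below with a hypothetical decorator, is to record storage latency at the call site in each service so that client-observed timings can always be compared against the node-reported metrics:

```python
import functools
import logging
import time

log = logging.getLogger("storage.latency")

def timed_storage_call(fn):
    """Record the latency each *client* observes for a storage call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed = time.monotonic() - start
            # In practice this would feed a metrics/alerting system, not just a log.
            log.info("storage call %s took %.4fs", fn.__name__, elapsed)
    return wrapper

@timed_storage_call
def authenticate_token(token):
    ...  # hypothetical storage-backed lookup
```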
We are also left with the unanswered question of why the latest version of our storage system started exhibiting poor behavior with HTTP connections. We have an open issue with the vendor to investigate this further.
Any time localized system instability manifests as an outage, it is a sign that we can take steps to improve the robustness of our system as well as our ability to recognize these pre-conditions. We take that responsibility seriously and will be taking the measures listed above to provide a better service to our customers. To our customers, we sincerely apologize for this incident.