A planned operating system upgrade on July 12th resulted in slightly higher ongoing CPU usage than the prior OS. That increase gradually drained our CPU credit balance, and by July 18th we had begun to exhaust those credits, causing a portion of transactions to fail.
By way of background, on Wednesday, July 12th, we performed a planned operating system upgrade on the servers running the cryptography service. This followed our usual, well-practiced procedure of gracefully moving traffic from one cluster of servers to the other (i.e., “blue” to “green”), patching the offline cluster, and then repeating the process in reverse. Testing indicated that the upgrade was successful, and the system continued processing approximately 100 million internal service API requests daily without issue or incident.
On July 18th at 18:26 UTC, internal monitoring alerted on an increased number of failures generated by our cryptography service. Teams were immediately assembled to work the issue. Logs indicated resource contention (“limiting connections by zone ‘default’”) as individual service nodes exceeded their configured limit of simultaneous connections.
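As a simplified illustration of that failure mode (a sketch only, with an assumed connection limit and a stand-in request handler, not our production configuration): when a node enforces a fixed cap on simultaneous connections, requests that arrive while every slot is occupied are rejected rather than queued, and those rejections surface as failed API calls.

```python
# Simplified sketch only: a per-node cap on simultaneous connections causes
# requests to be rejected once the cap is reached. The limit, timings, and
# handler below are illustrative assumptions, not our production values.
import asyncio

MAX_CONNECTIONS = 4                      # assumed per-node connection limit
slots = asyncio.Semaphore(MAX_CONNECTIONS)

async def handle_request(i: int) -> str:
    # If every connection slot is already in use, fail fast instead of
    # queueing -- the behavior behind the connection-limit errors we saw.
    if slots.locked():
        return f"request {i}: rejected (connection limit reached)"
    async with slots:
        await asyncio.sleep(0.1)         # stand-in for a cryptographic operation
        return f"request {i}: ok"

async def main() -> None:
    # Fire 10 concurrent requests at a node that only allows 4 at a time.
    results = await asyncio.gather(*(handle_request(i) for i in range(10)))
    print("\n".join(results))

asyncio.run(main())
```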
We began altering the configuration of these limits, quieting one cluster at a time: draining its traffic, applying the change, bringing the cluster back online, and then repeating the steps on the other cluster.
Unfortunately, once the limits were reconfigured, we found a different resource contention, this time at the operating system level. We altered that configuration as well and rolled the change through both “blue” and “green” clusters, but the errors merely moved to another type of resource constraint.
We reverted potentially relevant changes made earlier in the day, but this had no positive effect on the system.
Finally, we reverted the operating system upgrade from the July 12th activity, which ultimately resolved the issue. By July 19th at 00:29 UTC, the error rate on all cryptography service nodes had returned to zero, and we kept the incident in a monitoring state for approximately 12 hours before declaring it resolved.
A subsequent root cause analysis determined that we had overrun the CPU credits allocated to the burstable AWS EC2 instance type on which the cryptography service was running. We believe the OS upgrade incrementally consumed enough additional CPU capacity to eventually exhaust those credits, since credits are spent whenever CPU usage exceeds a relatively low baseline of roughly 20%.
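To make the arithmetic concrete, the following is a rough, hedged model of how burstable-instance CPU credits drain; the vCPU count, baseline, and starting balance are illustrative assumptions rather than our actual instance parameters. Credits accrue at the baseline rate and are spent in proportion to actual usage, so even a small sustained increase can empty the balance over several days.

```python
# Rough, hedged model of burstable-instance CPU credits. The numbers below
# are illustrative assumptions, not our actual instance parameters.
# One CPU credit equals one vCPU running at 100% for one minute; credits
# accrue at the baseline rate and are spent in proportion to actual usage.

VCPUS = 2                 # assumed vCPU count
BASELINE_PCT = 20.0       # assumed baseline utilization; credits drain above this
START_BALANCE = 576.0     # assumed credit balance at the time of the upgrade

def hours_until_exhaustion(avg_cpu_pct: float) -> float:
    """Hours until the credit balance hits zero at a steady average utilization."""
    earned_per_hour = BASELINE_PCT / 100 * VCPUS * 60   # credits earned per hour
    spent_per_hour = avg_cpu_pct / 100 * VCPUS * 60     # credits spent per hour
    net_drain = spent_per_hour - earned_per_hour
    return float("inf") if net_drain <= 0 else START_BALANCE / net_drain

# Even a few percentage points above baseline drains the balance in days.
for cpu in (18, 22, 25, 30):
    hours = hours_until_exhaustion(cpu)
    label = "never (at or below baseline)" if hours == float("inf") else f"{hours:.1f} hours"
    print(f"{cpu}% average CPU -> credits exhausted in {label}")
```

At an assumed 22% average utilization against a 20% baseline, for example, this model drains the balance in roughly ten days, which is consistent with the week that elapsed between the upgrade and the incident.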
We are re-provisioning this service onto an AWS EC2 instance type that is not subject to CPU credit limits, which should prevent this problem from happening again.
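In addition to the instance change, a periodic check on the CPUCreditBalance CloudWatch metric would have surfaced the downward trend days before credits ran out. Below is a hedged sketch using boto3; the instance ID is a placeholder, and the lookback window and reporting period are arbitrary choices.

```python
# Hedged sketch, assuming boto3 with default credentials/region configured.
# The instance ID is a placeholder; the window and period are arbitrary.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",      # standard metric on burstable instances
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=now - timedelta(days=7),
    EndTime=now,
    Period=3600,                        # one datapoint per hour
    Statistics=["Average"],
)

# A steadily declining balance is the early warning we missed here.
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), round(point["Average"], 1))
```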
We want to apologize to all customers who were impacted by failed transactions during this incident. We understand how much you rely on our systems being fully operational in order to run your business successfully. We also appreciate your patience while this issue was investigated and resolved. Thank you again for the trust you place in us every day.