503 Service Unavailable
Incident Report for Spreedly
Postmortem

A mistake was made last night configuring firewalls in our new rack. While these devices are not yet a part of the service, they are connected to the switches, so the changes prevented packets from reaching places they were intended to go. This was an oversight, and we apologize for causing a disruption.

What Happened

We are working to expand our infrastructure, which means we have introduced a lot of new servers. The plans we have for networking these servers are detailed, as is the documentation of the current state of existing systems. Until the implementation is complete, our current firewall devices must stand in for the role that the new firewalls will ultimately fill. A single detail was overlooked: the IP addresses that the new firewalls will own are currently owned by the active firewalls. When those addresses were assigned to the new firewalls, it confused the switches. Connections to new servers would time out because the reply packets ended up on the new firewalls instead of making their way back through the firewalls in active service.

This would not have affected our service, except for one thing: we are using one of our new servers as an additional log aggregator. The application servers buffer logs locally, but after a good while of not being able to reach the aggregator, the buffer filled up, and rsyslog began to block the application processes (Unicorn). Once they were all waiting for rsyslog, the load balancers began replying immediately with "503 Service Unavailable".

Only a few transactions failed, because a large share of our traffic is GET requests, and those were the requests that tied up most of the application processes.
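
To illustrate the failure mode described above, here is a minimal Python sketch of a full local log buffer stalling a pool of workers. It is not our rsyslog or Unicorn configuration. The names LocalLogBuffer and serve, the buffer capacity, and the worker count are all hypothetical, and a short timeout stands in for a worker that would really wait indefinitely.

```python
import queue


class LocalLogBuffer:
    """Hypothetical stand-in for a local log buffer that nothing is draining."""

    def __init__(self, capacity, block_when_full):
        self._q = queue.Queue(maxsize=capacity)
        self.block_when_full = block_when_full
        self.dropped = 0

    def write(self, line, timeout=0.01):
        """Return True if the caller may continue, False if it would be stuck."""
        try:
            if self.block_when_full:
                # The timeout only keeps the sketch finite; a real worker
                # would keep waiting here until the buffer drained.
                self._q.put(line, timeout=timeout)
            else:
                self._q.put_nowait(line)
            return True
        except queue.Full:
            if self.block_when_full:
                return False
            self.dropped += 1  # give up on the line instead of the request
            return True


def serve(requests, workers, buffer):
    """Return (served, unavailable) while the aggregator is unreachable."""
    free_workers = workers
    served = 0
    unavailable = 0
    for _ in range(requests):
        if free_workers == 0:
            unavailable += 1  # no worker left to take the request: 503
            continue
        if buffer.write("request handled"):
            served += 1
        else:
            free_workers -= 1  # this worker is now stuck on logging
    return served, unavailable


if __name__ == "__main__":
    print(serve(12, 3, LocalLogBuffer(5, block_when_full=True)))
    print(serve(12, 3, LocalLogBuffer(5, block_when_full=False)))
```

With blocking writes, requests succeed until the buffer is full, then each worker hangs on its next log line, and once none are free every further request is turned away. With non-blocking writes, every request is served, at the cost of dropped log lines.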

How We're Going to Improve

The issue here was an unbounded queue. We'll address that by leveraging rsyslog's advanced queueing options without neglecting one very important concern: certain activities must always be logged in the system. Also, we need to know when rsyslog is unable to work off its queue, so we are going to find a way to be alerted as soon as that is the case.
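
The two goals above (never block the application, never lose must-log events) plus the alerting requirement can be sketched in a few lines of Python. This is an illustration of the idea only, not rsyslog's implementation or our final configuration. The class PriorityAwareLogQueue, its thresholds, and the in-memory spill list (a stand-in for disk-backed overflow) are assumptions made for the example.

```python
from collections import deque


class PriorityAwareLogQueue:
    """Hypothetical queue: never blocks, sheds routine lines, keeps must-log lines."""

    def __init__(self, capacity, discard_mark, alert_mark):
        self.capacity = capacity          # in-memory limit
        self.discard_mark = discard_mark  # depth at which routine lines are shed
        self.alert_mark = alert_mark      # depth at which someone should be paged
        self.memory = deque()
        self.spill = []                   # stand-in for disk-backed overflow
        self.discarded = 0

    def enqueue(self, line, must_log=False):
        depth = len(self.memory)
        if must_log:
            # Must-log lines are never dropped: they overflow to the spill area.
            if depth < self.capacity:
                self.memory.append(line)
            else:
                self.spill.append(line)
            return
        if depth >= self.discard_mark:
            self.discarded += 1           # shed routine lines under pressure
            return
        self.memory.append(line)

    def should_alert(self):
        """True once the queue is backing up or anything has spilled."""
        return len(self.memory) >= self.alert_mark or bool(self.spill)


if __name__ == "__main__":
    q = PriorityAwareLogQueue(capacity=4, discard_mark=3, alert_mark=2)
    for i in range(5):
        q.enqueue(f"routine {i}")
    for i in range(3):
        q.enqueue(f"audit {i}", must_log=True)
    print(len(q.memory), len(q.spill), q.discarded, q.should_alert())
```

The design choice worth noting is that the caller never waits: routine lines are dropped once the queue backs up, must-log lines are diverted rather than dropped, and the alert condition trips well before the queue is full.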

If you have any questions about this incident, don't hesitate to drop us a line at support@spreedly.com.

Posted Nov 12, 2014 - 11:11 EST

Resolved
At 2014-11-11T23:10, we were alerted that a few transactions had failed. Changes had been made to new firewalls which led to a network partition. The firewalls were immediately removed from the network. Investigation reveals that between 2014-11-11T23:08:26 and 2014-11-11T23:13:31, "503 Service Unavailable" responses were delivered to our customers. We are terribly sorry for the disruption and will provide a postmortem soon.
Posted Nov 12, 2014 - 00:00 EST