At 18:59 UTC our master cache server failed without a secondary automatically taking over, causing a portion of traffic to receive 504 Gateway Timeout errors for 14 minutes (18:59 to 19:13 UTC). We detected the failure, manually promoted a secondary server to take its place, and all traffic returned to normal at 19:13 UTC.
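For readers curious what a manual promotion of this kind involves, here is a minimal sketch assuming a Redis-style replicated cache tier; the report does not name the cache technology, and the hostname and helper below are hypothetical, not our actual tooling.

```python
# Hypothetical sketch: manually promoting a Redis secondary after a master
# failure. Assumes a Redis-replicated cache; the report does not confirm this.
import redis

SECONDARY_HOST = "cache-secondary.internal"  # hypothetical hostname


def promote_secondary(host: str, port: int = 6379) -> None:
    """Detach a replica from its failed master so it can serve as master."""
    client = redis.Redis(host=host, port=port)
    # REPLICAOF NO ONE turns a replica into a standalone master.
    client.execute_command("REPLICAOF", "NO", "ONE")
    # Confirm the role changed before repointing application traffic.
    role = client.execute_command("ROLE")
    assert role[0] == b"master", f"promotion failed, role is {role[0]!r}"


if __name__ == "__main__":
    promote_secondary(SECONDARY_HOST)
```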
Between 20:41 and 20:44 UTC, we experienced a small number of connection errors (208) as the failed server came back online, before it was removed from the cluster for further investigation.
We’re continuing to monitor traffic and have seen no further errors. We’re investigating why the failover didn’t trigger automatically and will provide more information as part of our post-mortem.
Oct 10, 14:59 EDT