We had an outage this morning that lasted about an hour. We’ve established the cause, fixed the problem, and all sites are now back up. Apologies to all those affected. More detail follows.
Our monitoring systems alerted us to an initial problem at around 9:50AM. One of the web servers was suffering intermittent outages, because a memory leak in one of our users’ web applications was causing heavy swapping.
Our failover procedure involves taking the affected server out of rotation on the load balancer, redistributing its workload across the other servers, and rebooting it. We then saw the same user using a lot of memory on a second server, so we were able to confirm that the issue was repeatable. We disabled the user’s web app and rebooted this second server.
At this point the larger issue kicked in: the rebooted servers were non-functional when they came back up, which left the remaining servers struggling to keep up with the load and caused outages for more customers. By this point two of us were working on the issue, and it took us a while to identify the root cause. It turned out to be a change in our logging configuration that was causing nginx to hang on startup. Specifically, it only affected users with custom SSL configuration. This was particularly baffling because our deploy procedure includes a manual check on a sample of custom SSL users, and we confirmed they were functional when we did that deploy two days ago. Our working theory is that nginx will happily reload with a broken logging config, but will not restart with one:
On deploy:
- Start nginx
- Add custom SSL webapp configs
- Reload nginx
-> not a problem, despite broken SSL webapp logging
On reboot:
- Custom SSL configs with broken logging are already present on disk
- Nginx refuses to start
We’ll be confirming this theory in development environments over the next few days.
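The difference between the two sequences above suggests that a deploy check should include a cold start, with every config already on disk, and not only a reload. The following is a minimal Python sketch of that idea, not our actual deploy tooling. The helper names `starts_cleanly` and `cold_start_check`, the 30-second timeout, and the assumption that nginx runs in its default daemon mode (so the start command returns once the master process has forked) are all illustrative.

```python
import subprocess


def starts_cleanly(command, timeout_seconds=30.0):
    """Return True if `command` exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A timeout means the process hung; OSError covers a missing binary.
        return False
    return result.returncode == 0


def cold_start_check(config_path, nginx_binary="nginx", timeout_seconds=30.0):
    """Test the config, then attempt a cold start with all configs on disk.

    Hypothetical helper: intended for a development or staging machine,
    because the second step really does start nginx.
    """
    # `nginx -t -c <file>` tests the configuration file without serving traffic.
    if not starts_cleanly([nginx_binary, "-t", "-c", config_path], timeout_seconds):
        return False
    # In daemon mode the command returns after forking, so a startup hang
    # shows up here as a timeout and not as a non-zero exit code.
    return starts_cleanly([nginx_binary, "-c", config_path], timeout_seconds)
```

Whether `nginx -t` alone would have flagged the broken logging config is something we have not verified, which is why the sketch also performs a real start under a timeout.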
In the meantime, we’ve fixed the offending configuration file template, and confirmed that both regular users’ and custom-SSL users’ sites are back up. We’re also adding some safeguards to prevent any other users’ web apps from using up too much memory.
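As an illustration of what a memory safeguard can look like, and not necessarily the mechanism we will end up deploying, here is a minimal Python sketch that caps the address space of a child process using the standard `resource` module. It assumes a Unix-like system, and `run_with_memory_cap` is a hypothetical helper name.

```python
import resource
import subprocess


def run_with_memory_cap(command, max_bytes):
    """Run `command` with its address space capped at `max_bytes`.

    Returns the child's exit code. Unix only (uses RLIMIT_AS).
    """

    def apply_cap():
        # Runs in the child just before exec, so only the child is limited.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    completed = subprocess.run(command, preexec_fn=apply_cap)
    return completed.returncode
```

With a cap like this, a leaking web app fails its own allocations once it reaches the limit, so it cannot push the whole server into swap.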
Once again, we apologise to all those affected.