Outage report: 5 September 2015

From 20:30 to 23:50 UTC on 5 September, there were a number of problems on PythonAnywhere. Our own site, and those of our customers, were generally up and running, but were experiencing intermittent failures and frequent slowdowns. We’re still investigating the underlying cause of this issue; this blog post is an interim report.

What happened?

At 20:30 UTC our monitoring systems detected brief outages on a number of customer sites. We were paged, and we were logged in and investigating within a few minutes. The initial indication was that the problem was not serious – that it was simply a slowdown on a couple of specific servers that could be fixed by rebooting them. So we rebooted them.

By 21:15 UTC it was clear that the problem was worse than we thought. We posted on Twitter (which we should have done earlier) to make sure people knew we were working on the problem.

By 21:45 UTC we’d determined that the problem was in our file storage system. We believed that the problem was a networking issue, perhaps caused by underlying AWS hardware failures. We’ve had a couple of hardware failures in that area before, but our system is redundant enough that we could keep running without any visible problems. This time it looked different. We immediately got in touch with AWS support – who, by the way, are excellent – very helpful and they know their stuff.

We spent a lot of time analysing the problem, in conjunction with AWS, and by 23:10 we (and they) had come to the conclusion that the problem wasn’t a hardware one. Rather, it looked like an earlier massive spike in read activity across the filesystem had put some system processes into a strange state. User file storage is shared over NFS from the file storage system to the servers where people’s websites and consoles actually run, and the storage itself on the NFS servers uses DRBD for replication. As far as we could determine, a proportion (at least half) of the NFS processes were hanging, and the DRBD processes were running inexplicably slowly. As a result, access to file storage was frequently slow, and occasionally failed, for all servers using the file storage.
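As a rough illustration of what "hanging" can look like on a Linux host, and not a description of the diagnostics actually used during this outage, the sketch below lists processes in uninterruptible sleep (state "D" in /proc/<pid>/stat), which is where processes blocked on I/O typically sit. The function names and the process-name prefixes are assumptions made up for the example.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so we split on the first '(' and the last ')'.
    """
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state


def stuck_processes(stat_lines, prefixes):
    """Return (pid, comm) for processes in state 'D' whose name matches a prefix."""
    prefixes = tuple(prefixes)
    stuck = []
    for line in stat_lines:
        pid, comm, state = parse_stat_line(line)
        if state == "D" and comm.startswith(prefixes):
            stuck.append((pid, comm))
    return stuck


def read_proc_stat_lines(proc_root="/proc"):
    """Read the stat line of every running process (Linux only)."""
    lines = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                lines.append(handle.read())
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
    return lines
```

On a Linux machine, `stuck_processes(read_proc_stat_lines(), ("nfsd", "drbd"))` would give a point-in-time list. A single snapshot proves little, because a process can pass through state "D" briefly during normal I/O; it is processes that stay there across repeated samples that are suspicious.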

We decided that the best thing to do was simply to reboot the NFS server machines. Historically this has been a big deal, because it would require a rolling reboot of all web and console servers afterwards – hence our reluctance to do it earlier on in the process without being absolutely sure that it was a useful thing to do.

At 23:20 we shut down the file storage infrastructure, and kicked off a full snapshot of all file storage before rebooting (just in case the reboot corrupted anything – keeping everyone’s data safe was obviously our #1 priority at this stage). By 23:30 we were happy with the snapshots, and AWS had confirmed to us that they looked fine. So we rebooted the NFS servers.

By 23:40 the NFS servers were back. We were delighted to see that some changes we put in relatively recently (designed to make it easier for us to update PythonAnywhere with minimal downtime) meant that all of the web and console servers were able to reconnect without rebooting them. Initially we were a little worried, because file access appeared to still be slow – but we rapidly came to the conclusion that this was a result of everyone’s websites coming back up in one go, causing much higher than average system load. After a few minutes everything calmed down.

We continued to monitor the system until, at around 00:40, we were comfortable that it was all back and running properly. We then posted on Twitter that everything was back, and emailed the many people who’d contacted us to report problems.

What next?

We’re still trying to work out how that load spike caused the NFS and DRBD processes to get into a messed-up state. We’re also considering the hypotheses that the spike might have been caused by problems with those processes (causality in the other direction) or that the spike and the processes getting messed up might both have been symptoms of a different, as-yet-unknown cause.

Some people have reported that they needed to reload their web apps after we thought the problem was fixed, because they were still running slowly. We’re trying to work out how that might be the case – it doesn’t quite fit in with our model of how the system works and how the failure affected it. However, just to be sure, we’ve done a rolling reboot of all web servers over the course of today, which has in effect reloaded all web apps. So everything should now be running OK, even allowing for this issue.
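For readers unfamiliar with the term, a rolling reboot restarts servers a few at a time so that most of the fleet keeps serving traffic throughout. The sketch below is a generic illustration, not the tooling actually used here: `rolling_reboot`, the `reboot` and `is_healthy` callbacks, and all the parameters are hypothetical names invented for the example.

```python
import time


def rolling_reboot(servers, reboot, is_healthy,
                   batch_size=1, poll_interval=1.0, max_polls=60):
    """Reboot servers in small batches, waiting for each batch to recover.

    `reboot(server)` and `is_healthy(server)` are caller-supplied callbacks
    (hypothetical here). The rollout stops at the first server that fails to
    come back, so a bad reboot cannot take down the whole fleet.
    """
    done = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            reboot(server)
        for server in batch:
            for _ in range(max_polls):
                if is_healthy(server):
                    break
                time.sleep(poll_interval)
            else:
                # The polling loop finished without the server turning healthy.
                raise RuntimeError(f"{server} did not come back; stopping the rollout")
        done.extend(batch)
    return done
```

A smaller `batch_size` keeps more capacity online at the cost of a longer rollout, which is the usual trade-off with this approach.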

In terms of our response to the outage, we’ve determined that:

  • We should have posted on Twitter about the outage a little earlier, probably at around 20:45 when it became clear that this wasn’t just a brief glitch on one or two servers.
  • We were right to be cautious about rebooting the NFS servers this time, because historically it would have been a big deal. But now that we know that our infrastructure can handle it better than it used to, we should consider it as an option earlier on in the process if this happens again.

Hopefully we’ll soon be able to provide some more details on what the problem was and how we’re going to stop it from happening again.
