Outage Report for 15 July 2014

After a lengthy outage last night, we want to let you know about the events that led up to it and how we can improve our outage responses to reduce or eliminate downtime when things go wrong.

As usual with these things, there is no single cause to be blamed. It was a confluence of a number of things happening together:

  1. An AWS instance that was running a critical piece of infrastructure had a hardware failure
  2. We performed a non-standard deploy earlier in the day
  3. We took too long to realize the severity of the hardware failure

It is a fact of life that our machines run on physical hardware, however much the cloud in general, and AWS in particular, try to insulate us from that fact. Hardware fails, and we need to deal with it when it does. In fact, we believe that a large part of the long-term value of PythonAnywhere is that we deal with it so you don’t have to.

Since our early days, we have been finding and eliminating single points of failure to increase the robustness of our service, but there are still a few left and we have plans to eliminate them, too. One of the remaining ones is the file server and that’s the machine that suffered the hardware failure last night.

The purpose of the file server is to make your private file storage available to all of the web, console, and scheduled task servers. It does this by owning a set of Elastic Block Store (EBS) volumes, arranged in a RAID array, and sharing them out over NFS. This means that we can easily upgrade the file server hardware by simply moving the data storage volumes over from the old hardware to the new.

Under normal operation, we have a backup server that has a live, constantly updated copy of the data from the file server, so in the event of either a file server outage, or a hardware problem with the file server’s attached storage, we can switch across to using that instead. However, yesterday, we upgraded the storage space on the backup server and switched to SSDs. This meant that, instead of starting off with a hot backup, we had a period where the backup server disks were empty and were syncing with the file server. So we had no fallback when the file server died. Just to be completely clear – the data storage volumes themselves were unaffected. But the file server that was connected to them, and which serves them out via NFS, crashed hard.

With all of that in mind, we decided to try to resurrect the file server. On the advice of AWS support, we tried to stop the machine and restart it so it would spin up on new hardware. The stop took an enormous amount of time, and the restart took a long time, too. After trying several times and poring over boot logs, we determined that the boot disk of the instance had been corrupted by the hardware failure. At that point, the only path we could see to a working cluster was to create an entirely new one that could take over and use the storage disks from the dead file server. So we kicked off the build (which takes about 20 minutes) and waited. After re-attaching the disks, we checked that they were intact and switched over to the new cluster.

Lessons and Responses

Long term

  • A file server hardware failure is a big deal for us: it takes everything down. Even under normal circumstances, switching across to the backup is a manual process and takes several minutes. And, as we saw, rare circumstances can make it significantly worse. We need to remove the file server as a single point of failure.

Short term

  • A new cluster may be necessary to prevent extended downtime like the one we had last night. So our first response to a failure of the file server must be to start spinning up a new cluster, so that it’s available if we need it. This should mean that our worst downtime would be about 40 minutes from when we start working on the problem to having everything back up.
  • We need to ensure that all deploys (even ones where we’re changing the storage on either the file or backup servers) start with the backup server pre-seeded with data so the initial sync can be completed quickly.
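
As a rough illustration of the second short-term point, here is a minimal sketch of a pre-deploy check that refuses to go ahead while the backup server is still catching up. It is not our actual deploy tooling: the `BackupStatus` class, the `deploy_allowed` function, and the 99% threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class BackupStatus:
    """Hypothetical snapshot of how much data has reached the backup server."""

    bytes_on_file_server: int
    bytes_synced_to_backup: int

    @property
    def synced_fraction(self) -> float:
        if self.bytes_on_file_server == 0:
            # Nothing to sync, so the backup is trivially complete.
            return 1.0
        return min(1.0, self.bytes_synced_to_backup / self.bytes_on_file_server)


def deploy_allowed(status: BackupStatus, min_synced_fraction: float = 0.99) -> bool:
    """Return True only if the backup is seeded well enough to act as a fallback.

    The default threshold is an assumption made for this sketch.
    """
    return status.synced_fraction >= min_synced_fraction


if __name__ == "__main__":
    # An empty, freshly provisioned backup (the situation we were in) blocks the deploy.
    empty = BackupStatus(bytes_on_file_server=10**12, bytes_synced_to_backup=0)
    # A pre-seeded backup lets the deploy proceed.
    seeded = BackupStatus(bytes_on_file_server=10**12, bytes_synced_to_backup=10**12)
    print(deploy_allowed(empty))
    print(deploy_allowed(seeded))
```

A check along these lines would run at the very start of a deploy, so that a backup server with empty disks stops the process before the file server is left without a fallback.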

We have learned important lessons from this outage and we’ll be using them to improve our service. We would like to extend a big thank you and a grovelling apology to all our loyal customers who were extremely patient with us.
