After a lengthy outage last night, we want to let you know about the events that led up to it and how we can improve our outage responses to reduce or eliminate downtime when things go wrong.
As usual with these things, there is no single cause to blame. It was a confluence of several things happening together:
It is a fact of life that our machines run on physical hardware, as much as the cloud, in general, and AWS, in particular, try to insulate us from that fact. Hardware fails and we need to deal with it when it does. In fact, we believe that a large part of the long-term value of PythonAnywhere is that we deal with it so you don't have to.
Since our early days, we have been finding and eliminating single points of failure to increase the robustness of our service, but there are still a few left and we have plans to eliminate them, too. One of the remaining ones is the file server and that's the machine that suffered the hardware failure last night.
The purpose of the file server is to make your private file storage available to all of the web, console, and scheduled task servers. It does this by owning a set of Elastic Block Store (EBS) volumes, arranged in a RAID array, and sharing them out over NFS. Because the volumes exist independently of the machine they are attached to, we can easily upgrade the file server hardware and simply move the data storage volumes over from the old hardware to the new.
Under normal operation, we have a backup server that has a live, constantly updated copy of the data from the file server, so in the event of either a file server outage, or a hardware problem with the file server's attached storage, we can switch across to using that instead. However, yesterday, we upgraded the storage space on the backup server and switched to SSDs. This meant that, instead of starting off with a hot backup, we had a period where the backup server disks were empty and were syncing with the file server. So we had no fallback when the file server died. Just to be completely clear -- the data storage volumes themselves were unaffected. But the file server that was connected to them, and which serves them out via NFS, crashed hard.
With all of that in mind, we decided to try to resurrect the file server. On the advice of AWS support, we tried to stop the machine and restart it so it would spin up on new hardware. The stop ended up taking an enormous amount of time, and then restarting it took a long time, too. After trying several times and poring over boot logs, we determined that the boot disk of the instance had been corrupted by the hardware failure. At that point, the only path we could see to a working cluster was to create an entirely new one which could take over and use the storage disks from the dead file server. So we kicked off the build (which takes about 20 minutes) and waited. After re-attaching the disks, we checked that they were intact and switched over to the new cluster.
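For readers curious what "re-attaching the disks" can look like on AWS, here is a minimal sketch. It is not our actual tooling. The function name `move_volumes`, the volume IDs, the instance ID, and the device names are all hypothetical. It assumes `ec2` is a boto3 EC2 client (or anything with the same `detach_volume`, `attach_volume`, and `get_waiter` methods), and that the old instance is already dead, which is why the detach is forced.

```python
def move_volumes(ec2, attachments, new_instance_id):
    """Move EBS volumes from a dead instance to a new one.

    ec2             -- a boto3 EC2 client (assumed), e.g. boto3.client("ec2")
    attachments     -- dict mapping volume ID -> device name on the new instance
    new_instance_id -- ID of the replacement file server (hypothetical)
    """
    volume_ids = list(attachments)

    # The old instance is assumed unreachable, so it cannot unmount cleanly;
    # Force=True asks AWS to detach anyway.
    for volume_id in volume_ids:
        ec2.detach_volume(VolumeId=volume_id, Force=True)

    # Block until every volume reports the "available" state.
    ec2.get_waiter("volume_available").wait(VolumeIds=volume_ids)

    # Attach each volume to the new instance under its chosen device name.
    for volume_id, device in attachments.items():
        ec2.attach_volume(
            VolumeId=volume_id,
            InstanceId=new_instance_id,
            Device=device,
        )

    # Block until every volume reports the "in-use" state.
    ec2.get_waiter("volume_in_use").wait(VolumeIds=volume_ids)
    return volume_ids
```

A call might look like `move_volumes(boto3.client("ec2"), {"vol-0123": "/dev/sdf"}, "i-0abc")`, with made-up IDs. This only covers the AWS side. Reassembling the RAID array on the new machine, checking the filesystems, and re-exporting them over NFS are separate steps that the sketch leaves out.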
We have learned important lessons from this outage and we'll be using them to improve our service. We would like to extend a big thank you and a grovelling apology to all our loyal customers who were extremely patient with us.