Outage report 7 July 2020


tl;dr

We had an unplanned outage the day before yesterday; it was our first big one since July 2017. It was caused by an extremely unlikely storage system failure, but despite that it should not have led to such a lengthy downtime, and should not have affected so many people. We have some plans on what our next steps should be, and will be implementing at least some of them over the coming months.

The details

At 16:06 UTC on 7 July 2020, a storage volume failure on one of our storage servers caused a number of outages, starting with our own site and also with our users’ programs (including websites) that were dependent on that volume, and later spreading to other hosted sites. Because all of the data that we store on behalf of our users is backed up and mirrored, no data was lost or at risk, but the outage was significantly longer than we would like.

The effects were:

  • Our own site was unavailable or generating an excessive number of errors from 16:06 to 18:53 UTC.
  • For accounts that were stored on the affected file storage:
    • Websites were unavailable in general from 16:06 to 19:46 UTC, with some taking until 21:24 UTC to be fully available.
    • Scheduled tasks did not run between 16:06 and 18:53-19:46 UTC (the precise end time depended on the account).
    • Always-on tasks did not run between 16:06 and 18:53-19:46 UTC (the precise end time depended on the account).
  • For accounts that were not using the affected file storage:
    • Scheduled tasks and always-on tasks were unaffected apart from a short window of about five minutes at some point between 18:53 and 19:46 UTC (during the rolling reboot described below).
    • Websites were also unaffected apart from that window if they were already up and running when the problems started at 16:06 UTC. However, they could not be started up or reloaded between 16:06 and 18:53, and would have had a brief outage sometime between 18:53 and 19:46 UTC.
    • However, of course, the problems on our own site would also have affected the owners of these accounts if they needed to log in to run or change things.

This post gives a more detailed timeline of what happened, and describes what steps we are taking to prevent a recurrence.

Background

All of our users’ files on PythonAnywhere are stored on file servers, which have multiple attached EBS storage volumes. Each volume is located on a network-attached storage device using various separate disks split over a number of datacenters – essentially you can see it as a section of a high-availability RAID array. The volumes are aggregated together into larger volumes using LVM on the file server, and are mirrored remotely in real time from the file server to separate backup servers using DRBD. The backup servers are backed up daily so that we have snapshots of data storage for disaster recovery.
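
To make the LVM layer a little more concrete, here is a hedged sketch of how several EBS block devices can be aggregated into one large logical volume. The device names and volume-group name are made up for illustration, and our real setup (including the DRBD mirroring and NFS export on top) involves rather more than this.

    import subprocess

    # Hypothetical EBS block devices attached to a file server.
    DEVICES = ["/dev/xvdf", "/dev/xvdg", "/dev/xvdh"]

    def run(*cmd):
        subprocess.run(cmd, check=True)

    # Mark each EBS device as an LVM physical volume...
    run("pvcreate", *DEVICES)
    # ...group them into a single volume group...
    run("vgcreate", "vg_userfiles", *DEVICES)
    # ...and carve the whole group into one big logical volume, which is what
    # gets formatted, mounted, and then mirrored to the backup server with DRBD.
    run("lvcreate", "-l", "100%FREE", "-n", "lv_userfiles", "vg_userfiles")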

This means that we have three levels of redundancy:

  • The EBS volumes are heavily redundant and give a first line of protection against simple hardware issues like disk failure. AWS state that they have “an annual failure rate (AFR) of between 0.1% - 0.2%”, which makes them “20 times more reliable than typical commodity disk drives”.
  • The real-time synchronization with the backup servers gives us a second level of redundancy; in the unlikely event of an EBS failure, we have the backup volumes to fall back on.
  • The daily snapshots are kept on S3, which offers even higher reliability (“designed to provide 99.999999999% durability of objects over a given year”).

This issue was caused by a failure at the first level of that redundancy hierarchy; an EBS volume had multiple component failures, essentially meaning that a complete redundant array of disks stopped working – something with a 0.1% - 0.2% chance of happening in any given year.

Each file server has multiple volumes attached, so it’s not quite the one-in-a-thousand-years kind of issue you might think it would be; we estimate that an outage like that might happen once in every 20 years. The downside is that this means it’s not something we’ve seen before – in our live environment, our staging environment, or in development. While we did have plans in place to handle this kind of issue, those plans turned out not to be as good as they should have been.
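
As a rough back-of-the-envelope check on that estimate (the total volume count below is an assumption for illustration, not our real number):

    # Back-of-the-envelope only; the volume count is assumed for illustration.
    afr_low, afr_high = 0.001, 0.002   # AWS's quoted 0.1%-0.2% annual failure rate
    volumes = 30                       # assumed total EBS volumes across our file servers

    failures_per_year = (volumes * afr_low, volumes * afr_high)            # (0.03, 0.06)
    years_between_failures = (1 / failures_per_year[1], 1 / failures_per_year[0])
    print(years_between_failures)      # roughly one such failure every 17-33 years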

Timeline

Here’s what happened; all times are in UTC.

16:06

Our monitoring system notified our support team of a number of issues spread across multiple machines in our cluster. We were quickly able to identify the underlying cause as being an issue with the machine livefile1.

livefile1 is, as you might expect, the oldest of our file servers. The symptom that we were seeing was that attempts to access files on that server blocked, and then timed out.

The effects of this were:

  • Any user code that was hosted on the affected file server lost filesystem access, so was unlikely to work; this meant that all websites owned by accounts that are hosted on livefile1 were down, and the users of those accounts would not be able to run code in consoles, notebooks, scheduled tasks, or always-on tasks.

  • Our own site started running slowly and returning a higher rate of errors. Just like the sites that we host, we have a number of worker processes handling requests coming in to PythonAnywhere. Sometimes (for example, if you click on the “Files” page), they need to access file servers. Because access to livefile1 was slow and timing out, if enough people performed actions that touched it, all of the workers could end up blocked on it, which in turn meant that requests for data that was not directly affected were queued up until the blocked workers timed out. (There is a toy sketch of this effect just after this list.)

  • For websites with code that was stored on other file servers, there would be delays in starting them up if they were not already running, or in reloading them if they were. This is due to a similar problem to our own site – there are a set of processes on each web server that start or reload websites, so it was possible for them to all be blocking on access to livefile1 leaving none available to start sites that would normally be unaffected.

  • However, websites belonging to accounts that were stored on other file servers and were already running were unaffected at this time, and likewise those accounts’ scheduled tasks and always-on tasks were running without problems.
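
As a toy illustration of the worker-pool starvation mentioned in the second bullet above (this is not our actual web stack; the pool size and timings are made up):

    import time
    from concurrent.futures import ThreadPoolExecutor

    def handle_request(needs_slow_fileserver):
        if needs_slow_fileserver:
            time.sleep(30)              # stands in for an NFS read that hangs and times out
            return "slow response"
        return "fast response"          # would normally come back almost instantly

    pool = ThreadPoolExecutor(max_workers=4)    # a small, fixed pool of workers

    # Four requests that touch the broken file server occupy every worker...
    blocked = [pool.submit(handle_request, True) for _ in range(4)]
    # ...so this unrelated request, which needs nothing from the broken server,
    # still waits around 30 seconds for a worker to free up.
    unaffected = pool.submit(handle_request, False)
    print(unaffected.result())

The processes that start and reload websites on each web server (the third bullet) suffer from the same shape of problem.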

16:13

We tweeted that we were having some system issues and were investigating.

16:16

Initial indications were that the problem was on the file server itself rather than its attached storage; there were kernel-level errors in syslog that looked very much like machine hardware problems. Computers fail much more often than the storage volumes, and our normal procedure for server hardware failure is to stop the machine and then start it up again. This makes the machine vacate the physical hardware that it is currently on, and then come up on new hardware. This normally happens quickly, and so we kicked that process off.
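
When all goes well, that stop-and-start move is roughly the following boto3 calls (the region and instance ID are placeholders, and our real tooling wraps this in more checks):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")      # placeholder region
    instance_id = "i-0123456789abcdef0"                      # placeholder ID for the file server

    # A full stop (not a reboot) releases the underlying physical host, so the
    # instance comes back up on different hardware when it is started again.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])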

However, when we tried to start the machine again, it failed to start – in the sense that AWS did not appear to be provisioning new hardware for it at all.

16:34

We got AWS technical support involved to help determine why this was. They confirmed that the issue appeared to be hardware-related, and with their help we were able to get the machine provisioned onto new hardware and booting up by 16:47.

16:52

We were able to log into the machine, but there were clearly still problems. The kernel errors that we had seen before the re-provision were still present, and the attached storage was inaccessible.

17:04

By this point we had become sure that the system was getting blocked on storage IO, but the problem did not seem to be the machine itself, as the original indications had suggested. However, we had checked all of the attached storage volumes, and all of them were showing up as working OK.

17:18

AWS support told us that there appeared to be a problem with one specific volume. While this volume was showing up as working fine in the monitoring provided to us, from their administrative interface they could see problems. We attempted various ways to revive it.

17:25

AWS support escalated the issue with the volume to a higher-level tech support team. We decided that we should in parallel start working on recovering the file server from its backup volumes, as it appeared that it was uncertain whether the volume could be recovered, and that it would take some time to do even if it was possible. (They finally reported that it was unrecoverable at 19:19, so it was good that we did not wait.)

We had some internal discussions about the best way to recover the files from the backups.

There were two possibilities:

  • We could promote the backup server so that it became a new file server. This would have the advantage of being reasonably quick; we would need to reboot the servers that used it so that they spoke to the backup server instead of the primary file server, but we could do that fairly easily. However, it would mean that although everything would be back up and running, there would be no backup server for these files until the complete contents of this server were synchronised over to a new set of volumes on the backup server. In our recent experience, this can take a very long time – in some cases weeks – and during that process, we would not be able to recover from a further volume failure.

  • We could take a snapshot of the existing backup volumes, create new volumes from them, and then attach them to the existing file server in place of the ones it had, and attempt a resync. This would be slower, as taking the snapshots would take time, and there was a risk that it might not work. However, it would mean that the file and backup servers would start off with the same data on their filesystems, so there would not be any need to sync stuff across – and anyway, if it did fail, we could always use the first option instead.

The first option has historically been our plan to deal with this kind of outage. However, when that plan was originally made, file-to-backup syncs were faster (we believe because we had less data to store) and so the window during which a further volume failure would cause problems was much smaller. We decided to go with the second option.
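
In boto3 terms, the second option boils down to something like the sketch below. The volume IDs, availability zone, device names and instance ID are all placeholders, and the real process also has to deal with the LVM and DRBD layers, which are not shown.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    backup_volume_ids = ["vol-0backup1", "vol-0backup2"]     # hypothetical backup-server volumes

    new_volume_ids = []
    for vol_id in backup_volume_ids:
        # Snapshot each of the backup server's volumes...
        snap = ec2.create_snapshot(VolumeId=vol_id, Description="livefile1 recovery")
        ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])
        # ...and create a fresh volume from that snapshot.
        vol = ec2.create_volume(
            SnapshotId=snap["SnapshotId"],
            AvailabilityZone="us-east-1a",   # must match the file server's AZ
            VolumeType="gp2",
        )
        new_volume_ids.append(vol["VolumeId"])

    # Attach the new volumes to the file server in place of the failed set.
    for device, vol_id in zip(["/dev/sdf", "/dev/sdg"], new_volume_ids):
        ec2.get_waiter("volume_available").wait(VolumeIds=[vol_id])
        ec2.attach_volume(InstanceId="i-0123456789abcdef0", VolumeId=vol_id, Device=device)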

17:52

Determining the right way to generate the backup snapshots took us longer than it ideally would have done. While we were doing this, we tweeted again to let people know what was going on.

18:17

By this point we had a set of volumes based on the contents of the backup server.

18:23

At this time, livefile1 had the new volumes attached, and we were logged in and checking everything out. The server needed to do a full filesystem check before the storage was available, but within ten minutes we were able to see that the volumes were mounted and all files were there. Synchronization with the backup server started, and within a minute or so everything was synced up, so we knew that whatever happened, we were back in a situation where a further volume failure would not cause data loss. We performed some final checks to make sure that we were happy that the storage was working well and that all files were present.

18:45

Investigation of the various servers that accessed files from livefile1 showed that they were still unable to access it; it appeared that the NFS mounts had timed out while the server was down. A reboot normally fixes this, so we rebooted the specific web server where our own site runs as a test case.

18:53

Our own server came back with all files available from all file servers. After thorough testing, we tweeted to say that our site was back, and that we were commencing a rolling reboot of all servers so that those accounts that were stored on livefile1 would be able to run their websites, scheduled tasks and always-on tasks again. It was over this period that accounts stored on other file servers saw their brief outages, as the servers they were running on were rebooted.

19:46

The rolling reboot had completed. By this stage, all sites that were on file servers that had not been affected by the outage were up and running normally again, and could be reloaded as normal. Likewise all scheduled and always-on tasks were back up and running after their brief outages during the reboots.

However, we’d also found that the new file server was running slowly. We determined that this was a result of the new volumes needing to “warm up”. As we understand it, when you create a volume from a snapshot on AWS, it is initially empty, and the data that it should hold is copied over from the snapshot on an as-needed basis as it is read. This means that the volumes start off quite slow, but as things are read or written to, they speed up to normal SSD speed.

However, this slowness was significantly better than the complete blocking of reads and writes that had been apparent at the start of this outage; it meant that the problems with backlogs of requests causing errors on our site had gone, and likewise with the issues starting or reloading sites that were on other file servers.

So at this stage, the only affected users were those on livefile1, and the problem that they were having was that file access was slow – not that it wasn’t working at all. We tweeted to that effect.

20:38

A number of people had reported that their sites were still down; we checked, and the affected sites were ones that had their data stored on livefile1. What was happening was that the sites were being started up, but due to the slow filesystem access they were taking some time to become available. In the period between the first access to the site (which was what triggered the startup) and the site finally starting up completely, anyone accessing it would see an error message.

We monitored and tried various tricks to get the volumes warmed up, in parallel with attempts to get the busier sites up and running faster.

21:24

All sites were up and running again. File access was still somewhat slow for files that had not been accessed before, but in general everything was back to normal. We tweeted that the incident was over.

23:50

The new volumes were fully warmed up, and file access was completely back to normal.

Next steps

The first thing to note here is that storage volume failure of this kind is incredibly rare, as we noted above. We do have some ideas about how to reduce the risk even further – for example, by using fewer volumes per storage server. However, the naive approach of trying to add RAID redundancy on top of the redundancy already provided by EBS does not seem like a great plan; RAID on top of RAID is unlikely to be helpful. (And anyway, where would we stop? We could have RAID on top of RAID on top of RAID on top of…)

We think that instead it’s more productive to accept that from time to time, storage will fail, however redundant it is. What we need to do is handle the problem more gracefully and faster.

Limiting the scope of an outage

livefile1 only contained a portion of our users’ data; 80% of users are stored on other file servers. But at some point during the incident this outage affected pretty much everyone who uses our site, and that is a problem. We need to work on ways to improve things so that if one file server fails, the effects are limited to people whose data is stored on that server.

Speeding up recovery

There were four points where our processes were too slow:

  • Identifying the underlying cause of the problem. Clearly it’s not good that the issue started at 16:06, but we did not know that the underlying cause was a failure in a storage volume until 17:18, more than an hour later. However, it’s a tricky one to work out how to address – AWS tech support, with their better admin tools, took about 45 minutes to determine this. We need to think hard to see if we can find a way to do this, however; while we don’t have the AWS team’s admin tools, we do have root access to the machine, so in theory there should be a way to find out if a particular volume is hanging. (A rough sketch of one possible check appears after this list.)

  • Rebuilding a file server’s volumes from its backup server. In retrospect, we were right to do this rather than to promote the backup server to being the primary server. However, because we were putting scripts together (a combination of our normal backup scripts and the tools we use to recover backups when people accidentally delete files) in real time, it took almost an hour. This is something we should have ready to go in a fully tested script, so that we just need to run something like python3 replace_file_server_storage_with_backups.py livefile1

  • The rolling reboot of the cluster. If we can limit the scope of the outage, we might be able to make this step unnecessary. Thought needed there.

  • Warming up the new volumes. This one’s pretty hard to sort out; it is possible to warm up a volume before you start using it (essentially dd if=/dev/something of=/dev/null bs=1024M) but for one of our file servers we estimate that would take about 8 hours, which is worse than useless. One possibility would be to shut down the backup server, take the snapshots, but then put the old backup volumes on the file server, and start up the backup server with the unwarmed volumes that were created from the snapshots. That should mean that the file storage would be warmed up as soon as it came into use, and would just slow the initial sync between the servers down somewhat. We would need to test and profile that carefully before doing it, though. If the initial sync completed in two hours, it would be fine, but if it was more than a day, it would have the risk of leaving us open to problems from another volume failure.
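
A rough Python equivalent of that dd-style warm-up read from the last bullet, for a single volume (the device path is a placeholder and this would need to run as root; it is the full read of every block that makes it so slow):

    CHUNK = 64 * 1024 * 1024                  # read 64 MiB at a time

    def warm_volume(device="/dev/xvdf"):      # placeholder device path
        read_bytes = 0
        with open(device, "rb") as dev:
            # Reading every block forces EBS to pull the data down from the
            # snapshot, so that later reads run at normal SSD speed.
            while True:
                chunk = dev.read(CHUNK)
                if not chunk:
                    break
                read_bytes += len(chunk)
        return read_bytes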
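
And going back to the first bullet, one possible shape for an “is any volume hanging?” check is sketched below. The device names are placeholders, and a production version would need more care (direct, uncached reads, several sample offsets, and integration with our monitoring).

    import concurrent.futures

    DEVICES = ["/dev/xvdf", "/dev/xvdg", "/dev/xvdh"]    # hypothetical attached volumes

    def probe(device, nbytes=4096):
        # Read a small amount straight from the block device; a healthy volume
        # answers in milliseconds, a hung one blocks indefinitely.
        with open(device, "rb") as dev:
            return len(dev.read(nbytes))

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(DEVICES))
    futures = {pool.submit(probe, dev): dev for dev in DEVICES}
    for future, dev in futures.items():
        try:
            future.result(timeout=10)
            print(dev, "OK")
        except concurrent.futures.TimeoutError:
            print(dev, "read did not complete within 10s - possibly hung")
    pool.shutdown(wait=False)     # don't let a hung probe thread block the check itself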

Conclusion

The outage on 7 July was our first big unplanned outage since July 2017. It was caused by an extremely unlikely storage system failure, but despite that it should not have led to such a lengthy downtime, and should not have affected so many people. We have some plans on what our next steps should be, and will be implementing at least some of them over the coming months.

Onward and upward!
