Outage Report for 19 November 2013

[UPDATE: as of 22 November, backups are working again.]

Backups have always been a source of trouble for us here at PythonAnywhere. We have tried a number of ways to back up your files and all of them have characteristics that make them less than suitable:

EBS snapshots - these generate a nice, consistent point-in-time snapshot of everyone’s files, but they slow disk access down too much and for too long. In our experiments, a snapshot could take down every user website on the disk being backed up for half an hour and could cause slow disk access for up to 6 hours.
Rsync - this is nice and easy, but it also competes with users for disk access and, because it takes a long time to run, can’t be used to provide continually updated backups.

With that in mind, we set about finding a new backup solution that would provide continual backups that we could then take point-in-time snapshots of. As an extra bonus, we’d like it to also provide on-line hot fail-over (and a pony!).

We found our solution in DRBD. Essentially, it keeps 2 disks on different machines synchronised across the network. Our users could use a set of primary disks, and they’d be constantly synchronised with a set of secondary disks. We could then use the secondary disks to take snapshots with no effect on the performance of the primary disks that our users relied on. If one of the primary disks failed, we could immediately switch to using its secondary disk without anyone even noticing the switch. As an added bonus, DRBD would enable zero-downtime upgrades to PythonAnywhere, and that is a goal that we’re very keen to achieve.
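To make the idea concrete, here is a toy sketch of the primary/secondary arrangement described above. It is only a conceptual model: the class and method names are invented for illustration, the "disks" are in-memory dictionaries, and none of it reflects DRBD's real interface or how it replicates blocks over the network.

```python
class MirroredDisk:
    """Toy model of a primary disk whose writes are mirrored to a secondary.

    Hypothetical illustration only; this is not DRBD's API.
    """

    def __init__(self):
        self.primary = {}    # block number -> data, served to users
        self.secondary = {}  # mirrored copy on "another machine"

    def write(self, block, data):
        # Every write lands on both copies before it counts as complete.
        self.primary[block] = data
        self.secondary[block] = data

    def read(self, block):
        # Users only ever read from the primary.
        return self.primary[block]

    def snapshot(self):
        # Snapshots are copied from the secondary, so the primary is not touched.
        return dict(self.secondary)

    def fail_over(self):
        # Promote the secondary after a primary failure; a fresh, empty
        # secondary would then need to be resynchronised.
        self.primary = self.secondary
        self.secondary = {}
```

In this model, a snapshot taken from the secondary stays fixed while later writes continue, and after `fail_over()` reads are served from what used to be the secondary.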

That was the theory. In practice, we needed a multi-step process to implement DRBD in our infrastructure without jeopardising our users’ data. The upgrade on 19 November was the second step of the process and, on the surface, it should have been a simple step that was easy to do. Here’s how it went wrong (all times in UTC).

Lightning on the horizon

When we wrote the configuration for the live servers, we did not add enough disk space for the backups. A copy-and-paste error meant that we only had half the drives we intended to have.
Our tests all passed because our test environments need much less space.
We spun up the new cluster and DRBD happily started syncing.
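A pre-flight check along the following lines could catch this kind of mismatch before any syncing starts. This is a minimal sketch, not something we ran at the time: the function name and the idea of passing drive sizes as plain lists of gigabytes are assumptions made for illustration, and the sizes in the usage note are made up.

```python
def check_backup_capacity(primary_gb, secondary_gb):
    """Compare the configured secondary drives against the primaries.

    Both arguments are lists of drive sizes in GB (hypothetical inputs).
    Returns a list of problems; an empty list means the layout looks sane.
    """
    problems = []
    # A copy-and-paste error that drops drives shows up as a count mismatch.
    if len(secondary_gb) != len(primary_gb):
        problems.append(
            f"expected {len(primary_gb)} secondary drives, "
            f"found {len(secondary_gb)}"
        )
    # The secondaries must be able to hold everything on the primaries.
    if sum(secondary_gb) < sum(primary_gb):
        problems.append(
            f"secondaries hold {sum(secondary_gb)} GB, "
            f"primaries need {sum(primary_gb)} GB"
        )
    return problems
```

For example, `check_backup_capacity([100, 100, 100, 100], [100, 100])` reports both a drive-count mismatch and a capacity shortfall. The important part is running such a check against the live configuration rather than only against a test environment, where the smaller space requirements hid the problem.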

Clouds gather

(12:41) After we had the new cluster up and running, we took the site down so we could switch across.
(until 13:30) The switch was not entirely smooth: several machines failed to mount important storage volumes (not related to the DRBD backup volumes) and we had to connect them manually. We’re investigating the cause of the failed mounts (our current suspect is a new NFS client that behaves differently to what our upgrade scripts expected).
(13:33) We ran all our tests to ensure that the new cluster was ready; everything passed, so we switched to it.

Storm breaks

(13:40) DRBD noticed that the secondaries didn’t have enough disk space and stopped working. At this point, we were fully switched to the new cluster, but it wasn’t working at all.
(13:40) We started trying to diagnose the error.
(14:17) We tracked down the issue to the backup disk space and decided that the best thing to do was to remove the faulty server, stop the syncing process and run without DRBD for the moment.

Hiding in the cellar

(14:18) We started recovering the disks to a state where they could be used.
(14:44) With the disk recovery complete we could begin re-integrating the disks into the new cluster.

The Sun comes out

(14:58) We finally went live with a working system.

We are currently running without automated backups, but we’re going to be putting together an emergency backup process that can tide us over until we have DRBD working correctly.

We deeply regret the inconvenience to all of our users and we’d like to prevent this sort of thing in future. It will be some time before we’re finished implementing zero-downtime upgrades to PythonAnywhere, so we’d like to hear from you how you think we should manage our upgrades:

How do you think we should schedule upgrades?
What is the best way to notify you of planned downtime?
How can we make the downtime impact you less?

Send any suggestions or thoughts to support@pythonanywhere.com. We’ll take everyone’s ideas into consideration and come up with a plan.