Outage report: lengthy upgrade this morning

This morning we upgraded PythonAnywhere, and the upgrade process took much longer than expected. Here’s what happened.

What we planned to happen

The main visible feature in this new version was Python 3.4. But a secondary reason for doing it was to move from one Amazon “Availability Zone” to another. PythonAnywhere runs on Amazon Web Services, and availability zones are essentially different Amazon data centers.

We needed to move from the zone us-east-1a to us-east-1c. This is because the 1a zone does not support Amazon’s latest generation of servers, called m3. m3 instances are faster than the m1 instances we currently have, and also have local SSD storage. We expect that moving from m1 to m3 instances will make PythonAnywhere – our site, and sites that are hosted with us – significantly faster for everyone.

Moving from one availability zone to another is difficult for us, because we use EBS – essentially network-attached storage volumes – to hold our customers’ data. EBS volumes exist inside a specific availability zone, and machines in one zone can’t connect to volumes in another. So, we worked out some clever tricks to move the data across beforehand.

We use a tool called DRBD to keep data safe. DRBD basically allows us to associate a backup server with each of our file servers. Every time you write to your private file storage in PythonAnywhere, it’s written to a file server, which stores it on an EBS volume and also sends the data over to its associated backup server, which writes it to another EBS volume. This means that if the EBS volume on the file server goes wrong (possible, but unlikely) we always have a backup to switch over to.

So, in our previous release of PythonAnywhere last month, we moved all of the backup servers to us-east-1c. Over the course of a few hours after that move, all of our customers’ data was copied over to 1c, and it was then kept in sync on an ongoing basis over the following weeks.
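To make the replication idea concrete, here is a toy model in Python. It is only a sketch: real DRBD replicates at the block-device level rather than per file, and the class names, method names and file paths below are made up for illustration. It shows an initial full copy when a backup volume in another zone is attached, followed by every later write being mirrored to it.

```python
class Volume:
    """Toy stand-in for an EBS volume pinned to one availability zone."""

    def __init__(self, zone):
        self.zone = zone
        self.files = {}


class FileServer:
    """Toy file server that mirrors every write to a backup volume."""

    def __init__(self, volume):
        self.volume = volume
        self.backup = None

    def attach_backup(self, backup_volume):
        # Initial full copy; after this, the backup receives every write.
        backup_volume.files = dict(self.volume.files)
        self.backup = backup_volume

    def write(self, path, data):
        self.volume.files[path] = data
        if self.backup is not None:
            self.backup.files[path] = data


# Hypothetical data: one file written before the backup moved, one after.
primary = Volume("us-east-1a")
server = FileServer(primary)
server.write("/home/alice/app.py", b"print('hi')")

backup = Volume("us-east-1c")
server.attach_backup(backup)
server.write("/home/alice/notes.txt", b"todo")

in_sync = backup.files == primary.files
```

In this model the volume in us-east-1c ends up holding the same contents as the one in us-east-1a, which is the property the migration plan relied on.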

What actually happened

When we pushed our update today, we essentially started a new PythonAnywhere cluster in 1c, but the EBS volumes that were previously attached to the old cluster’s backup servers were attached to the new cluster’s file servers (after making sure that all pending updates had synced across). Because all updates had been replicated from the old file servers in 1a to these disks in 1c, this meant that we’d transparently migrated everything from 1a to 1c with minimal downtime.

That was the theory. And as far as it went, it worked flawlessly.

But we’d forgotten one thing. Not all customer data is on file servers that are backed up this way. The Dropbox storage and the storage for web app logs are stored in a different way. So while we’d migrated everyone’s normal files – the stuff in /home/USERNAME, /tmp, and so on – we’d not migrated the logs or the Dropbox shares. These were stuck in 1a, and we needed to move them to 1c so that they could be attached to the new servers there.

This was a problem. Without the migration trick using DRBD, the best way to move an EBS volume from one availability zone to another is to take a “snapshot” of it (which creates a copy that isn’t associated with any specific zone) and then create a fresh volume from the snapshot in the appropriate zone. This is not a quick process. And the problem only became apparent to us when we were committed enough to the move to us-east-1c that undoing the migration would have been risky and slow.
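For readers who want to see the shape of that snapshot route, here is a minimal sketch. It is not the tooling we used; it assumes an EC2 client object with the same method names as boto3's (create_snapshot, get_waiter and create_volume), and the function name move_volume and the description string are made up for illustration.

```python
def move_volume(ec2, volume_id, target_zone):
    """Copy an EBS volume into another availability zone via a snapshot.

    `ec2` is assumed to behave like a boto3 EC2 client.
    """
    # Step 1: snapshot the source volume. Snapshots are not tied to a zone.
    snapshot = ec2.create_snapshot(
        VolumeId=volume_id,
        Description="zone migration",  # hypothetical label
    )
    snapshot_id = snapshot["SnapshotId"]

    # Step 2: block until the snapshot completes. This is the slow part.
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # Step 3: create a fresh volume from the snapshot in the target zone.
    volume = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=target_zone,
    )
    return volume["VolumeId"]
```

With boto3 installed, something like move_volume(boto3.client("ec2"), "vol-...", "us-east-1c") would be the call. Note that the default waiter gives up after a limited number of polls, so a snapshot that runs for hours would need a more patient waiter configuration.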

So, 25 minutes into our upgrade, we started the snapshot process. We hoped it would be quick.

Ten minutes after the snapshots of the Dropbox and log storage data had started, we noticed something worrying. The log storage snapshot was running at a reasonable speed. But the Dropbox storage snapshot hadn’t even reached 1% yet. This is when we started talking to Amazon’s tech support team. Unfortunately, after much discussion with them, it was determined that there was essentially nothing that could be done to speed up either of the snapshots.
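As a rough illustration of why that number was worrying, here is a back-of-the-envelope projection. It assumes snapshot progress is roughly linear, which is only an assumption (EBS does not promise that), and the function name is made up for this example.

```python
def projected_total_minutes(elapsed_minutes, fraction_done):
    """Estimate total snapshot time, assuming progress is linear."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_minutes / fraction_done


# Under 1% after 10 minutes means the real figure is at least this large.
lower_bound_minutes = projected_total_minutes(10, 0.01)
lower_bound_hours = lower_bound_minutes / 60
```

Since the snapshot had not yet reached 1%, a projection at exactly 1% is a lower bound: on the order of 1,000 minutes, or more than 16 hours.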

After discussion internally we came to the conclusion that while the system couldn’t run safely without the log storage having been migrated, we could run it without the Dropbox storage. We’ve deprecated Dropbox support recently due to problems with our connection to Dropbox themselves, so we don’t think anyone’s relying on it. So, we waited until the log storage snapshot completed (which took about 90 minutes), created a new EBS volume in us-east-1c, and brought the system back up.

Where we are now

  • PythonAnywhere is up and running in the us-east-1c availability zone, and we’ll be able to start testing higher-speed m3 servers next week.
  • All of our customers’ normal file data is in us-east-1c (with the old copies in 1a kept as a third-level backup for the time being).
  • All of the log data is also stored in both 1c and 1a, with the 1c copy live.
  • The Dropbox storage is still in us-east-1a and is still snapshotting (15% done as of this post). Once the snapshot is complete, we’ll create a new volume and attach it to the current instance, and start Dropbox synchronisation.

Our apologies for the outage.
