Outage report for today - an AWS interface mishap

We deployed a new version of PythonAnywhere today with some cool new stuff – more on that later. But there was a nasty outage, and it might be worth explaining just in case anyone else is at risk of getting bitten by the same problem.

Amazon AWS provide a feature called “Elastic IPs”. An EIP is an IP address that you can associate with any machine you’re running on AWS. We use these for all of the public-facing IP addresses on our live cluster.

When we deploy a new PythonAnywhere cluster, all of its public-facing IPs are initially random. Once it’s fully up and running and we’ve checked it’s all OK, we run a script that disassociates the EIPs from the old cluster and associates them with the new cluster. Then, when we’re sure all’s well, we shut down the old cluster. (This is a slight simplification, but should suffice.)
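As a rough illustration only (this is not the actual go-live script), a move like that could be sketched with boto3-style EC2 calls. The function name `move_eips` and the mapping from allocation IDs to new instance IDs are hypothetical; `ec2` is assumed to be a client such as the one returned by `boto3.client("ec2")`, and the addresses are assumed to be VPC-style EIPs identified by allocation ID.

```python
def move_eips(ec2, new_instance_for_allocation):
    """Re-point each EIP at its machine in the new cluster.

    `new_instance_for_allocation` is a hypothetical mapping of
    EIP allocation ID -> instance ID in the new cluster.
    Returns the allocation IDs that were moved, in order.
    """
    allocation_ids = list(new_instance_for_allocation)
    # Ask AWS for the current state of each EIP rather than trusting
    # anything remembered from earlier.
    addresses = ec2.describe_addresses(AllocationIds=allocation_ids)["Addresses"]

    moved = []
    for address in addresses:
        allocation_id = address["AllocationId"]
        # Detach from the old cluster's machine, if currently attached.
        if "AssociationId" in address:
            ec2.disassociate_address(AssociationId=address["AssociationId"])
        # Attach to the corresponding machine in the new cluster.
        ec2.associate_address(
            AllocationId=allocation_id,
            InstanceId=new_instance_for_allocation[allocation_id],
        )
        moved.append(allocation_id)
    return moved
```

The important point for what follows is that after a script like this has run, anything that still remembers the old associations (a browser tab, for example) is out of date.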

Now, when you shut down a machine in the AWS web-based console, one option you are given is to “release” any EIPs that were associated with it. This is because they charge you a small amount of money for unused EIPs, to stop people hoarding them. This time, when we deployed the new version of PythonAnywhere, we decided to use this option.

Unfortunately, we hadn’t refreshed the web browser we were doing this in after the go-live script had switched the EIPs from the old cluster to the new one. So the interface in that particular browser thought that the live service’s EIPs were still associated with the old cluster… and it released them all. So our live cluster suddenly had no public-facing IPs.

The result was that when we shut down the old cluster, shortly after the new PythonAnywhere cluster went live, the site went down. Even worse, once you’ve released an EIP, you can’t get it back. This meant that all of our DNS settings for www.pythonanywhere.com – and, more importantly, all of our customers’ sites – were pointing to an IP address that we no longer controlled.

This could have been a disaster – sure, DNS settings can be changed, and most of our customers use CNAMEs so we could get everything pointing at the new IPs by changing our own DNS settings. But DNS settings can also take a long time to propagate, in particular if your TTL (time-to-live) settings are high, because a high TTL means that DNS servers all over the Internet might be caching the old values.
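To put a rough number on that, here is a minimal sketch of the worst case. It assumes resolvers honour the TTL exactly (some don't), and the function name and the example figures are made up for illustration, not our real settings.

```python
from datetime import datetime, timedelta


def worst_case_stale_until(change_time, ttl_seconds):
    """Latest time a TTL-respecting resolver could still serve the old record.

    A resolver that cached the old value just before the change may keep
    returning it for a full TTL afterwards.
    """
    return change_time + timedelta(seconds=ttl_seconds)


# Hypothetical example: a record changed at noon with a one-day TTL.
changed_at = datetime(2013, 1, 1, 12, 0)
stale_until = worst_case_stale_until(changed_at, 24 * 60 * 60)
```

With a one-day TTL, some visitors could be sent to the old address until noon the following day, which is why a short TTL matters if you ever need to change IPs in a hurry.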

We got in touch with Amazon. Thankfully, their engineers were very responsive and were able to reclaim our old EIPs and reassociate them with our account within 45 minutes, and we were able to get the site back up shortly thereafter. We were lucky – when you release an EIP, it can get reallocated to any other Amazon user who asks for one.

Lesson learned – always refresh the AWS console in your browser before doing anything.

[UPDATE] The AWS team have told us that they’ve fixed this problem – the “release EIPs” option when you shut down a server will only release the EIP if it currently belongs to the server that you’re shutting down. The bug was a pain, but we’ve been really pleased with the AWS team’s responsiveness in helping us work around it and in fixing it.
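If you script your own shutdowns, the same check is easy to apply yourself. The sketch below is our guess at the general idea, not AWS's actual fix: it looks up the current state of each EIP and only releases the ones that AWS reports as attached to the machine being shut down. The function name is hypothetical, `ec2` is assumed to be a boto3-style EC2 client, and the EIPs are assumed to be VPC-style addresses identified by allocation ID.

```python
def release_eips_if_still_attached(ec2, instance_id, allocation_ids):
    """Release only the EIPs currently attached to `instance_id`.

    Returns (released, skipped) lists of allocation IDs.
    """
    # Fetch fresh state instead of relying on a possibly stale view.
    response = ec2.describe_addresses(AllocationIds=list(allocation_ids))

    released, skipped = [], []
    for address in response["Addresses"]:
        allocation_id = address["AllocationId"]
        # If the EIP has moved elsewhere (or is unattached), leave it alone.
        if address.get("InstanceId") != instance_id:
            skipped.append(allocation_id)
            continue
        # An attached EIP has to be disassociated before it can be released.
        ec2.disassociate_address(AssociationId=address["AssociationId"])
        ec2.release_address(AllocationId=allocation_id)
        released.append(allocation_id)
    return released, skipped
```

In our case, every EIP would have ended up in the skipped list, because by then they all belonged to machines in the new cluster.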
