Outage Report for 15 July 2014

After a lengthy outage last night, we want to let you know about the events that led up to it and how we can improve our outage responses to reduce or eliminate downtime when things go wrong.

As usual with these things, there is no single cause to be blamed. It was a confluence of a number of things happening together:

  1. An AWS instance that was running a critical piece of infrastructure had a hardware failure
  2. We performed a non-standard deploy earlier in the day
  3. We took too long to realize the severity of the hardware failure

It is a fact of life that our machines run on physical hardware. As much as the cloud in general, and AWS in particular, try to insulate us from that fact, hardware fails and we need to deal with it when it does. In fact, we believe that a large part of the long-term value of PythonAnywhere is that we deal with it so you don't have to.

Since our early days, we have been finding and eliminating single points of failure to increase the robustness of our service, but there are still a few left and we have plans to eliminate them, too. One of the remaining ones is the file server and that's the machine that suffered the hardware failure last night.

The purpose of the file server is to make your private file storage available to all of the web, console, and scheduled task servers. It does this by owning a set of Elastic Block Storage devices, arranged in a RAID cluster, and sharing them out over NFS. This means that we can easily upgrade the file server hardware, and simply move the data storage volumes over from the old hardware to the new.

Under normal operation, we have a backup server that has a live, constantly updated copy of the data from the file server, so in the event of either a file server outage, or a hardware problem with the file server's attached storage, we can switch across to using that instead. However, yesterday, we upgraded the storage space on the backup server and switched to SSDs. This meant that, instead of starting off with a hot backup, we had a period where the backup server disks were empty and were syncing with the file server. So we had no fallback when the file server died. Just to be completely clear -- the data storage volumes themselves were unaffected. But the file server that was connected to them, and which serves them out via NFS, crashed hard.

With all of that in mind, we decided to try to resurrect the file server. On the advice of AWS support, we tried to stop the machine and restart it so it would spin up on new hardware. The stop ended up taking an enormous amount of time and then restarting it took a long time, too. After trying several times and poring over boot logs, we determined that the boot disk of the instance had been corrupted by the hardware failure. Now the only path we could see to a working cluster was to create an entirely new one which could take over and use the storage disks from the dead file server. So we kicked off the build (which takes about 20 minutes) and waited. After re-attaching the disks, we checked that they were intact and switched over to the new cluster.

Lessons and Responses

Long term

  • A file server hardware failure is a big deal for us: it takes everything down. Even under normal circumstances, switching across to the backup is a manual process and takes several minutes. And, as we saw, rare circumstances can make it significantly worse. We need to remove it as a single point of failure.

Short term

  • A new cluster may be necessary to prevent extended downtime like we had last night. So our first response to failure of the file server must be to start spinning up a new cluster so it's available if we need it. This should mean that our worst downtime could be about 40 mins from when we start working on it to having everything back up.
  • We need to ensure that all deploys (even ones where we're changing the storage on either the file or backup servers) start with the backup server pre-seeded with data so the initial sync can be completed quickly.

We have learned important lessons from this outage and we'll be using them to improve our service. We would like to extend a big thank you and a grovelling apology to all our loyal customers who were extremely patient with us.


Outage report: lengthy upgrade this morning

This morning we upgraded PythonAnywhere, and the upgrade process took much longer than expected. Here's what happened.

What we planned to happen

The main visible feature in this new version was Python 3.4. But a secondary reason for doing it was to move from one Amazon "Availability Zone" to another. PythonAnywhere runs on Amazon Web Services, and availability zones are essentially different Amazon data centers.

We needed to move from the zone us-east-1a to us-east-1c. This is because the 1a zone does not support Amazon's latest generation of servers, called m3. m3 instances are faster than the m1 instances we currently have, and also have local SSD storage. We expect that moving from m1 to m3 instances will make PythonAnywhere -- our site, and sites that are hosted with us -- significantly faster for everyone.

Moving from one availability zone to another is difficult for us, because we use EBS -- essentially network-attached storage volumes -- to hold our customers' data. EBS volumes exist inside a specific availability zone, and machines in one zone can't connect to volumes in another. So, we worked out some clever tricks to move the data across beforehand.

We use a tool called DRBD to keep data safe. DRBD basically allows us to associate a backup server with each of our file servers. Every time you write to your private file storage in PythonAnywhere, it's written to a file server, which stores it on an EBS volume but then also sends the data over to its associated backup server, which writes it to another EBS volume. This means that if the EBS volume on the file server goes wrong (possible, but unlikely) we always have a backup to switch over to.
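As a rough illustration of that write path, here is a toy model in Python. It is not how DRBD itself is implemented (DRBD works at the block-device level), and the class and method names are invented for this sketch.

```python
class Volume:
    """Stands in for an EBS volume: a mapping from block number to bytes."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_no, data):
        self.blocks[block_no] = data


class ReplicatedFileServer:
    """Toy file server that mirrors every write to a backup volume."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup

    def write(self, block_no, data):
        # Store the data on the file server's own volume first...
        self.primary.write(block_no, data)
        # ...then send the same write over to the backup server's volume.
        self.backup.write(block_no, data)

    def in_sync(self):
        # The backup is only a usable fallback if it holds the same blocks.
        return self.primary.blocks == self.backup.blocks


server = ReplicatedFileServer(Volume("file-server"), Volume("backup-server"))
server.write(0, b"hello")
server.write(1, b"world")
print(server.in_sync())
```

The point of the model is the last method: a backup whose contents match the primary can take over at any moment, which is what made the migration trick described next possible.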

So, in our previous release of PythonAnywhere last month, we moved all of the backup servers to us-east-1c. Over the course of a few hours after that move, all of our customers' data was copied over to 1c, and it was then kept in sync on an ongoing basis over the following weeks.

What actually happened

When we pushed our update today, we essentially started a new PythonAnywhere cluster in 1c, but the EBS volumes that were previously attached to the old cluster's backup servers were attached to the new cluster's file servers (after making sure that all pending updates had synced across). Because all updates had been replicated from the old file servers in 1a to these disks in 1c, this meant that we'd transparently migrated everything from 1a to 1c with minimal downtime.

That was the theory. And as far as it went, it worked flawlessly.

But we'd forgotten one thing. Not all customer data is on file servers that are backed up this way. The Dropbox storage, and storage for web app logs, is stored in a different way. So while we'd migrated everyone's normal files -- the stuff in /home/USERNAME, /tmp, and so on -- we'd not migrated the logs or the Dropbox shares. These were stuck in 1a, and we needed to move them to 1c so that they could be attached to the new servers there.

This was a problem. Without the migration trick using DRBD, the best way to move an EBS volume from one availability zone to another is to take a "snapshot" of it (which creates a copy that isn't associated with any specific zone) and then create a fresh volume from the snapshot in the appropriate zone. This is not a quick process. And the problem only became apparent to us when we were committed enough to the move to us-east-1c that undoing the migration would have been risky and slow.
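For readers who want to see the shape of that snapshot route, here is a sketch written against boto3's EC2 client. boto3 postdates this post and is not what we used on the day; the function name is hypothetical, and `ec2` is simply assumed to behave like a boto3 EC2 client.

```python
def copy_volume_to_zone(ec2, volume_id, target_zone):
    """Recreate an EBS volume in another availability zone via a snapshot.

    `ec2` is expected to behave like a boto3 EC2 client.
    """
    # 1. Snapshot the source volume. The snapshot is not tied to the
    #    availability zone that the volume lives in.
    snapshot = ec2.create_snapshot(
        VolumeId=volume_id,
        Description="zone migration of " + volume_id,
    )
    snapshot_id = snapshot["SnapshotId"]

    # 2. Block until the snapshot has completed. This is the slow part.
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # 3. Create a fresh volume from the snapshot in the target zone.
    volume = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=target_zone,
    )
    return volume["VolumeId"]
```

With real credentials this could be called with `boto3.client("ec2", region_name="us-east-1")`, a source volume ID and `"us-east-1c"`. boto3 waiters give up after a fixed number of polls, so a snapshot as slow as ours would need a longer waiter configuration or a manual polling loop.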

So, 25 minutes into our upgrade, we started the snapshot process. We hoped it would be quick.

After 10 minutes of the snapshots of the Dropbox and the log storage data running, we noticed something worrying. The log storage snapshot was running at a reasonable speed. But the Dropbox storage snapshot hadn't even reached 1% yet. This is when we started talking to Amazon's tech support team. Unfortunately, after much discussion with them, it was determined that there was essentially nothing that could be done to speed up either of the snapshots.

After discussion internally we came to the conclusion that while the system couldn't run safely without the log storage having been migrated, we could run it without the Dropbox storage. We've deprecated Dropbox support recently due to problems with our connection to Dropbox themselves, so we don't think anyone's relying on it. So, we waited until the log storage snapshot completed (which took about 90 minutes), created a new EBS volume in us-east-1c, and brought the system back up.

Where we are now

  • PythonAnywhere is up and running in the us-east-1c availability zone, and we'll be able to start testing higher-speed m3 servers next week.
  • All of our customers' normal file data is in us-east-1c (with the old copies in 1a kept as a third-level backup for the time being).
  • All of the log data is also stored in both 1c and 1a, with the 1c copy live.
  • The Dropbox storage is still in us-east-1a and is still snapshotting (15% done as of this post). Once the snapshot is complete, we'll create a new volume and attach it to the current instance, and start Dropbox synchronisation.

Our apologies for the outage.


New release: Python 3.4, and more!

We released a new version of PythonAnywhere this morning. There were some nasty problems with the go-live (more about that later) but here's what we added:

  • Python 3.4 support, both for web applications and consoles.
  • sftp and rsync
  • Move from Amazon's us-east-1a availability zone to us-east-1c -- this will allow us to switch to newer, faster instances next week!
  • And various minor bugfixes.

Thanks to gregdelozier, Malcolm, robert, aaronzimmerman, Cartroo, barnsey, andmalc, corvax, giorgostzampanakis, dominochinese, stablum, algoqueue for the suggestions.


Minor release - bugfixes and performance tweaks

A minor release today, which included:

  • A fix for the cairo/matplotlib regression
  • Tweaks to log file permissions, to prevent an issue where they would become non-readable by the user
  • Moving from several smaller servers to fewer larger ones, for web and console servers. Overall visible performance impact should be minor, but positive.

Happy coding everyone!


PythonAnywhere News Round-up

It's been about 6 months since we last delivered a state-of-the-PythonAnywhere address and, looking at everything that's happened since the last one, it's long-overdue.

Following the extremely ... mixed ... reaction to our upworthy/buzzfeed spoof report, we decided to gauge the reaction if we went in totally the opposite direction. So let's get straight into PythonAnywhere's very first newsletter of 2014 — the "style is for wimps" edition!

Heartbleed

One of the biggest tech news items of the year so far is the discovery of the Heartbleed bug. Even though we (like almost every other Linux-based host) were vulnerable to it, we patched our servers just hours after it was announced to the public (for some reason we weren't included in the early disclosure group with Google and Facebook). Our users were safe from Heartbleed and they didn't have to do anything!

Securing PythonAnywhere from the Heartbleed bug

Customer stories

The reaction to our customer stories was so great last time that we've got 2 for you this time! You can read about

Credit card payments

Yes, we heard your cries and you can now pay us using your credit card without involving PayPal in any part of the transaction.

PythonAnywhere now accepts credit cards

Create your own pricing plan

Dismayed by the huge gulf between our $12 web developer accounts and our $99 startup accounts? We've got you covered — you can now design a pricing plan that exactly fits your unique requirements.

Custom plans

PythonAnywhere supports Python 3 web apps

That is all — I did warn you that this would be a no-frills newsletter.

In other news


Git push deployments on PythonAnywhere

Some of our frenemies in the PaaS world, who shall remain nameless, offer a "git push" model for deployment. People are fond of it, and sometimes ask us whether they could do that on PythonAnywhere too.

The answer is: you totally can! Because PythonAnywhere is, at heart, just a web-based UI wrapped around a fully-featured Linux server environment, you can do lots and lots of things.

Here are the ingredients:

  • You'll need a paid account so that SSH access is enabled.
  • You set up a bare repo on PythonAnywhere, and set it as a remote to your local code repo.
  • And then you use a git hook to automatically deploy your code and reload the site on push.

Here are the steps in detail:

(This guide assumes you already have a repo containing a web app which you want to deploy).

Creating a bare repo

We create a directory, and make it into a "bare" git repo, ie, one that can be git push'd to. Let's say you've been working in a local repository called mysite; you'd do this:

# From a Bash console on PythonAnywhere:
mkdir -p ~/bare-repos/mysite.git
cd !$
git init --bare
ls # should show HEAD  branches  config...

(Technically there's no reason why the repo on PythonAnywhere needs to be your other repository's name with ".git" on the end, but we'll do that to keep things simple. You can call it something else if you want to, but it should end with ".git" -- that's the convention for bare repos.)

Setting up a post-receive hook on the server

Next, navigate to your bare repository (in the file browser, or using a console-based editor like vim if you prefer), and create a new file at ~/bare-repos/mysite.git/hooks/post-receive. The name matters for this file!

#!/bin/bash
mkdir -p /var/www/sites/mysite
GIT_WORK_TREE=/var/www/sites/mysite git checkout -f

A bare repo doesn't have a working tree; it just stores your repository in the magical git database format, with all the hashes and other voodoo. The git checkout -f along with the GIT_WORK_TREE variable will tell git to check out an actual working tree with real code, to the specified location. You can put this wherever you like...

More info in the git docs and in this how-to guide

Now we just need to make the hook executable:

# In a Bash console on PythonAnywhere:
chmod +x ~/bare-repos/mysite.git/hooks/post-receive

Adding PythonAnywhere as a remote in your local repo

Back on your own PC, in the git repo for your web app, adding the remote is a single command, and then we can do our first push to it to make sure everything works:

# In your local code repo. Substitute in your own username + path
git remote add pythonanywhere myusername@ssh.pythonanywhere.com:/home/myusername/bare-repos/mysite.git

git push -u pythonanywhere master
# output should end with "Branch master set up to track remote branch master from pythonanywhere."

  • Reminder: you need a paid account for this to work, I'm afraid, because the push goes over SSH.

You may get asked for your password at this point. If so, I strongly recommend setting up private key authentication instead, with a passphrase-encrypted private key on your machine, and adding your public key to ~/.ssh/authorized_keys on the server. There's a good guide here.

Checking it worked

# From a Bash console on PythonAnywhere:
ls /var/www/sites/mysite # should show your code!

Setting up your PythonAnywhere web app

If you want to use your own domain for this step, I'll assume you already have it set up on your registrar with a CNAME pointing at yourusername.pythonanywhere.com. More info.

We've now got our code on the server, so let's set it up as a PythonAnywhere web app. This should be a one-off operation.

  • Go to the Web tab
  • Add a new web app
  • Enter its domain name
  • Choose "Manual Configuration", and then choose your Python version
  • Hit next and see the "All done" message

Next, edit your WSGI file, and make it point at the wsgi app in your code. I was using a Django app, so mine looked like this:

# /var/www/www_mydomain_com_wsgi.py
import os
import sys

path = '/var/www/sites/mysite'
if path not in sys.path:
    sys.path.append(path)

os.environ['DJANGO_SETTINGS_MODULE'] = 'superlists.settings'

import django.core.handlers.wsgi
application = django.core.handlers.wsgi.WSGIHandler()

Go back to the web app tab, hit "Reload", and make sure your site works. It took me a couple of goes to get it right!

NB: There are some subtleties here, like getting static files and your database config working, which I'll gloss over for now. See the example at the end for some inspiration though -- you can definitely get almost anything to work!

Adding to our post-receive hook to automatically reload the site

This is the last step -- we use a sekrit, undocumented feature of PythonAnywhere, which is that our web workers actually watch the WSGI file for changes, so if you touch it, your web app will reload automatically.

So, back in ~/bare-repos/mysite.git/hooks/post-receive:

#!/bin/bash
GIT_WORK_TREE=/var/www/sites/mysite git checkout -f 
touch /var/www/www_mydomain_com_wsgi.py

Substitute in the path to your own WSGI file.
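Hooks don't have to be Bash: git will run any executable file named post-receive. As a sketch, the same two steps could be written in Python like this. The paths are the example ones used above, and the function names are made up for illustration.

```python
#!/usr/bin/env python3
import os
import subprocess
from pathlib import Path

WORK_TREE = "/var/www/sites/mysite"              # where the code gets checked out
WSGI_FILE = "/var/www/www_mydomain_com_wsgi.py"  # substitute your own WSGI file


def checkout(work_tree):
    """Check out the pushed code into work_tree, like the Bash version."""
    os.makedirs(work_tree, exist_ok=True)
    env = dict(os.environ, GIT_WORK_TREE=work_tree)
    subprocess.check_call(["git", "checkout", "-f"], env=env)


def reload_web_app(wsgi_file):
    """Update the WSGI file's modification time so the web app reloads."""
    Path(wsgi_file).touch()


def main():
    checkout(WORK_TREE)
    reload_web_app(WSGI_FILE)


# Git exports GIT_DIR to the hooks it runs, so this guard keeps the script
# from doing anything if it is executed outside of a push.
if __name__ == "__main__" and "GIT_DIR" in os.environ:
    main()
```

As with the Bash version, the file needs to be made executable with chmod +x before git will run it.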

Testing it works

Back on your own PC, make a trivial but visible change to your app, and then:

git commit -am"test change to see if push to pa works"
git push pythonanywhere

It should just take a few seconds (although sometimes it takes as long as a minute) for the web worker to notice the touch to your web app, and then reload your site in your web browser... You should see your changes!
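We won't go into how our web workers do this internally, but the general idea of noticing a touch can be sketched as a check on the file's modification time. Everything here (the class name and the polling approach) is an assumption for illustration, not PythonAnywhere's implementation.

```python
import os
import tempfile


class TouchWatcher:
    """Reports whether a file's modification time has moved since last check."""

    def __init__(self, path):
        self.path = path
        self.last_mtime = os.stat(path).st_mtime_ns

    def changed(self):
        current = os.stat(self.path).st_mtime_ns
        if current != self.last_mtime:
            self.last_mtime = current
            return True
        return False


# Demo with a temporary file standing in for the WSGI file.
with tempfile.NamedTemporaryFile(suffix="_wsgi.py", delete=False) as f:
    wsgi_path = f.name

watcher = TouchWatcher(wsgi_path)
first_check = watcher.changed()  # nothing has touched the file yet

# Equivalent of `touch`, but with an explicit timestamp one second ahead so
# the demo does not depend on filesystem timestamp resolution.
stat = os.stat(wsgi_path)
os.utime(wsgi_path, ns=(stat.st_atime_ns, stat.st_mtime_ns + 1_000_000_000))
second_check = watcher.changed()  # a worker would reload the app here

os.remove(wsgi_path)
print(first_check, second_check)
```

A process that only calls something like changed() every so often will notice a touch with some delay, which is one way a reload can end up taking longer than you'd expect.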

More fun in the post-receive hook

You can do pretty much anything you like in the post-receive, so for example:

#!/bin/bash

set -e # exit if anything fails

# run unit tests in a temp folder
mkdir -p /tmp/testcheckout
GIT_WORK_TREE=/tmp/testcheckout git checkout -f
python3 /tmp/testcheckout/manage.py test lists
rm -rf /tmp/testcheckout

# checkout new code
mkdir -p /var/www/sites/mysite
GIT_WORK_TREE=/var/www/sites/mysite git checkout -f

# update static files and database
python3 /var/www/sites/mysite/manage.py collectstatic --noinput
python3 /var/www/sites/mysite/manage.py migrate --noinput

# bounce web app
touch /var/www/test3_ottg_eu_wsgi.py

And so on. When you get it working, why not let us know! If this turns out to be popular, we may look to automate some of the steps as features...

Image credit: Human Pictogram 2.0 at pictogram2.com


New release: a smorgasbord of changes

We've just released a new version of PythonAnywhere. It has a lot of small changes: we've installed a bunch of new packages, and added the option to put basic HTTP authentication in front of your web app (for example, for sites that are under development).

The big changes in this release are all under the hood. We've completely reworked the way we construct the sandboxes that your code runs in. This means that in the future it will be much easier for us to install new packages when people want them, and -- perhaps more importantly -- we'll soon be able to support different sandbox images for different people. This means that we'll soon be able (for example) to provide Django 1.6 for new users without breaking the web apps of the people who use 1.3.


Securing PythonAnywhere from the Heartbleed bug

The short version

The Heartbleed bug impacted PythonAnywhere (along with pretty much every Linux-based web service out there). We don't believe there's any risk that customer data has been leaked as a result of this problem, with the single exception of private keys for HTTPS certificates for custom domains -- that is, for websites hosted with us that don't end with .pythonanywhere.com. We don't have any reason to believe that those private keys were leaked either -- they're just the only data that we think could possibly have been leaked by it.

[UPDATE: Robert Graham at Errata Security points out that Heartbleed could also potentially have been used to harvest session cookies, usernames and passwords from users of affected sites. He's right, though it would be hard to do, and unlikely that someone would have targeted us for that. But just to be sure, we recommend you change your PythonAnywhere password, log out, then log back in again, and get users of your website to do likewise. Just to be clear on this: we don't think this has been used against us, and have no indication that it has. But it's better to be safe than sorry.]

The details

As you may have read, a bug in OpenSSL was announced last night that could potentially have been used to extract data from webservers, for example the private keys that go with websites' SSL certificates. It exploits the SSL heartbeat extension, and has been nicknamed "Heartbleed". There's more information in this TechCrunch article.

All servers running recent versions of Linux were affected -- a very large percentage of the Internet -- and PythonAnywhere's were among them. All of our servers have been patched since early this morning, so the attack is now not possible against us. The only risk is that data might have been leaked before then.

We do not believe at this time that there's any risk that any data apart from SSL certificates' private keys could have been leaked. So for most PythonAnywhere users, everything should be fine. (Our own key for our own certificate for www.pythonanywhere.com might have been leaked, but we've changed it and are working on revoking the old certificate.)

For those customers who host websites on custom domains with PythonAnywhere (that is, domains that don't end with .pythonanywhere.com), there is a possibility that hackers who knew about this bug before this morning could have used it to extract their private keys. We have notified all such customers by email with details on what to do next; if you do have a custom domain with your own certificate and haven't received an email from us, drop us a line and we'll let you know what to do next.

If you have any questions, just let us know.

What we did

Due to some heroic work on the part of the Ubuntu team, patched versions of the affected libraries were ready by the time we started working on this. So patching all of our servers was just a few commands on each server that did HTTPS:

apt-get update
apt-get install openssl libssl-dev libssl1.0.0

And then a service tornado restart or service nginx restart, depending on what the HTTPS service was on the server.

We used @titanous's Heartbleeder command-line tool and Filippo Valsorda's Heartbleed test page both before and after the fix to make sure we really had fixed the problem.
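For a quick first look before reaching for tools like those, Python can report which OpenSSL it is linked against. The sketch below flags the upstream releases affected by Heartbleed, 1.0.1 through 1.0.1f. One big caveat: distributions such as Ubuntu backport fixes without changing the upstream version number, so on a patched Ubuntu server this check will still say "affected". Treat it as a hint, not a test. The function name is made up for this example.

```python
import re
import ssl


def upstream_version_affected(version_string):
    """True if the string names an upstream OpenSSL release from 1.0.1 to 1.0.1f."""
    match = re.search(r"OpenSSL (\d+\.\d+\.\d+)([a-z]*)", version_string)
    if not match:
        return False
    number, letter = match.groups()
    # 1.0.1g was the first upstream release with the fix.
    return number == "1.0.1" and letter in ("", "a", "b", "c", "d", "e", "f")


print(ssl.OPENSSL_VERSION, upstream_version_affected(ssl.OPENSSL_VERSION))
```

Because of the backporting caveat, an actual probe of the running service, like the tools mentioned above, is the only reliable confirmation.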

We're confident that the patches we've applied are enough to fix the bug, at least as it's currently understood.


New release: Custom plans

[updated 09:23 GMT to add bit about reload web app button]

We just released a new version of PythonAnywhere, featuring the usual host of impossible-for-you-to-verify-but-we'll-still-claim-them stability improvements and bugfixes, but there are also some highly visible features, which we hope you'll like.

Pick 'n mix pricing plans

People have been telling us there was too big a leap between our $12 Web Developer plan and the $99 Startup plan, so we're pleased to present you with a new option for Custom plans.

You can find the custom plans on the Accounts & Pricing page

Obviously we're extremely interested in what you think about this. How do you feel about the prices? How do they compare against other options you're evaluating, and, more interestingly, how do you feel they're aligned with the value you derive from them?

Other bits 'n pieces

We've also squeezed in a couple of small but (hopefully) helpful features:

File changed on disk warning

Ever do that thing where you've got a file open in the editor, and then you do a git checkout in a console that changes the file, only to forget about it, and overwrite your changes when you hit save in the editor? Or even accidentally open the same file in two different tabs and lose an hour's work?

Our editor will now warn you if it thinks the file has any changes that have happened on disk while you were editing it:

Can you guess how it works?

Start Console here option

The file browser now has the option to start a Bash console in the current folder

Reload Web App from any file in app directory

People were also telling us it's tedious to have to skip back to the Web tab to reload your web app when you're working on the code for your app. So, we've added a button to the editor toolbar to reload the web app.

It's only visible for files that are inside the folder that's associated with your web app, and we only know which folder that is if you went through one of our "start new app" wizards... If you used the "custom" button, unfortunately, we don't know where your app is. But, zap us a note with "Send Feedback" and we can manually register a folder for you...

Your suggestion here

Hope you find those useful. If you have suggestions for other little UI tweaks, we'd love to hear them. Or indeed, any other features you'd like to see. Keep in touch!


Interactive shells on Python.org

We're really proud to announce that we're providing a "Launch interactive shell" feature for the newly-redesigned Python.org website. We hope that the ease of just clicking on something on the site to try it out will help bring even more people over to The World's Best Programming Language!

Light technical details after the pretty picture...


PythonAnywhere is a platform as a service for Python developers. One of our neat features is that you can work on your code from anywhere, from a computer or a tablet, by starting up a Python or Bash command line inside your web browser. So it was easy for us to roll out a simplified version of the system that just provided a simple Python 3 console (using IPython) for anyone who wanted one, and make that available to Python.org.

So, how does it work?

Getting started

When your browser loads the main Python.org front page, a small bit of JavaScript runs. This pings a URL on the PythonAnywhere servers to check that our service is up and running -- this allows us to rate-limit things if the number of visitors to the page gets beyond our abilities to handle new consoles. So far we've not had to do that, but it's good for our peace of mind to have a way of pulling the plug in an emergency.
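The check behind that URL can be very simple. As a sketch (the function name, the numbers and the JSON shape are all invented here, and this is not our real endpoint), the server side only needs to decide whether it has room for more consoles:

```python
import json


def console_status(active_consoles, max_consoles):
    """Build the JSON body a status URL could return to the front-page script."""
    accepting = active_consoles < max_consoles
    return json.dumps({"accepting_new_consoles": accepting})


print(console_status(active_consoles=120, max_consoles=500))
print(console_status(active_consoles=500, max_consoles=500))
```

Lowering the limit to zero would then act as the emergency off switch.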

Assuming that the response says that all systems are go, the JavaScript displays a button above the code sample on the front page that says "Launch interactive shell". That's all for now. Starting a new console takes a certain amount of machine resources, so we don't start one for every hit on the Python.org front page.

onclick="..."

When you click the button, a bit more JavaScript is run. This injects an iframe into the document, sized to cover the code sample. The src of the iframe is a URL on PythonAnywhere's servers. Without going into too much detail, hitting this URL creates a new "unregistered" user on our server cluster (one with very limited capabilities) and returns a load of HTML and JavaScript that displays a vt100 terminal emulator, and connects back using SockJS (which normally uses WebSockets) to one of our cluster of console servers.

The console server

Our console servers run Tornado. Specifically, they run a Tornado application that listens on port 443 for incoming TLS SockJS connections. When one comes in, it and the JavaScript on the front end do a short authentication exchange to make sure it's from a real user (not super-important for the publicly-accessible consoles on Python.org, but much more important for our normal site). After the auth, the console server constructs a sandbox for the user. This involves setting up a filesystem, and enabling the various limitations for the user. The Tornado process then forks off a new Python process, running as that user, chrooted into the filesystem. (For a more in-depth look at the Tornado server, check out this EuroPython talk by Giles Thomas.)

And off and running

Once the chrooted Python process is up and running, the Tornado server just works to forward keystrokes from the browser down to the process, and results back from the process to the browser. The JavaScript vt100 in the browser handles all of the formatting.
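Stripped of the sandboxing, the WebSocket transport and the terminal emulation, the forwarding job looks roughly like the sketch below. It is not our Tornado code: it uses plain pipes and a trivial child process that echoes its input back in upper case, and the class name is invented.

```python
import subprocess
import sys

# Child process standing in for the user's Python console: it echoes each
# line it receives back in upper case.
CHILD_CODE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.upper(), end='', flush=True)\n"
)


class ConsoleBridge:
    """Forwards input to a child process and hands its output back."""

    def __init__(self):
        self.process = subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )

    def send(self, data):
        # "Keystrokes" arriving from the browser go down to the process.
        self.process.stdin.write(data)
        self.process.stdin.flush()

    def receive_line(self):
        # Results from the process travel back up towards the browser.
        return self.process.stdout.readline()

    def close(self):
        self.process.stdin.close()
        self.process.stdout.close()
        self.process.wait()


bridge = ConsoleBridge()
bridge.send(b"print('hello')\n")
reply = bridge.receive_line()
bridge.close()
print(reply)
```

In the real system both directions run asynchronously and byte by byte rather than line by line, which is where an event loop like Tornado's earns its keep.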

That's pretty much it for the overview! If there's anything you'd like more details on, leave a comment below and we'll respond there -- or we'll update the blog post if enough people are interested in the same bits.



PythonAnywhere is a Python development and hosting environment that displays in your web browser and runs on our servers. They're already set up with everything you need. It's easy to use, fast, and powerful. There's even a useful free plan.

You can sign up here.