XFS to ext4 for user storage - why we made the switch

Last Tuesday, we changed the filesystem we use to store our users' files from XFS to ext4. This required a much longer maintenance outage than normal -- two hours instead of our usual 20-30 minutes.

This post explains why we made the change, and how we did it.

tl;dr for PythonAnywhere users:

We discovered that the quota system we were using with XFS didn't survive hard fileserver reboots in our configuration. After much experimentation, we determined that ext4 handles our particular use case better. So we moved over to using ext4, which was hard, but worthwhile for many reasons.

tl;dr for sysadmins:

XFS project quotas on our DRBD-replicated volumes didn't survive a hard reboot: on remount, the quotacheck either ran (seemingly) forever or crashed, leaving the volumes mountable only with quotas disabled -- probably related to an XFS bug fixed in kernel 3.17 (Ubuntu Trusty ships 3.13). Since we no longer needed project quotas, we migrated each volume to ext4 with ordinary per-user quotas, using around 30 parallel rsyncs to copy 1.6TB per volume in roughly 70 minutes of downtime.

A bit of architecture

In order to understand what we changed and why, you'll need a bit of background about how we store our users' files. This is relatively complex, in part because we need to give our users a consistent view of their data regardless of which server their code is running on -- for example so they see the same files from their consoles as they do from their web apps, and so all of the worker processes that make up their web apps can see all of their files -- and in part because we need to keep everything properly backed up to allow for hardware failures and human error.

The PythonAnywhere cluster is made up of a number of different server types. The most important for this post are execution servers, file servers, and backup servers.

Execution servers are the servers where users' code runs. There are three kinds: web servers, console servers, and (scheduled) task servers. From the perspective of file storage, they're all the same -- they run our users' code in containers, with each user's files mounted into the containers. They access the users' files from file servers.

File servers are just what you'd expect. All of a given user's files are on the same file server. They're high-capacity servers with large RAID0 SSD arrays (connected using Amazon's EBS). They run NFS to provide the files to the execution servers, and also run a couple of simple services that allow us to manage quotas and the like.

Backup servers are simpler versions of file servers. Each file server has its own backup server with an identical amount of storage. Data written to a file server is asynchronously replicated to its associated backup server using drbd (a block-device-level replication system).

Here's a diagram of what we were doing prior to the recent update:

Simplified architecture diagram

This architecture has a number of benefits:

  • If a file server or one of its disks fails, we have an almost-up-to-date (normally within milliseconds) copy on its associated backup server.
  • At the cost of a short window when disks aren't being kept in sync by drbd, we can do point-in-time snapshots of all of the data without adding load to the file server (see the sketch after this list). We just log on to the backup server, use drbd to disconnect it from the file server, then snapshot the disks. Once that's done, we reconnect it. Prior to using a separate backup server for this, our daily backups visibly impacted filesystem performance, which was unacceptable. They were also "smeared" -- that is, because files were being written to while they were being backed up, the files that were backed up first would be from a point in time earlier than the ones that were backed up later.
  • If we want to grow the disk capacity on a file server, we can add a new set of disks to it and to its backup server, RAID0 them together for speed, and then add that to the LVM volumes on each side.
  • It's even possible to move all of PythonAnywhere from one Amazon datacenter to another, though the procedure for that is complicated enough to be worthy of a separate blog post of its own...
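
Very roughly, the snapshot dance on a backup server looks something like the following -- the DRBD resource name and volume id are placeholders, and the snapshot command is illustrative rather than our actual tooling:

# pause replication so the backup server's disks are quiescent
drbdadm disconnect r0
# snapshot each underlying EBS volume (placeholder volume id)
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "nightly backup"
# resume replication; DRBD catches the backup disks up with the file server
drbdadm connect r0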

XFS

As you can see from the diagram, the filesystem we used to use to store user data was XFS. XFS is a tried-and-tested journaling filesystem, created by Silicon Graphics in 1993, and is perfect for high-capacity storage. We actually started using it because of a historical accident. In an early prototype of PythonAnywhere, all users actually mapped to the same Unix user. When we introduced disk quotas (yes, it was early enough that we didn't even have disk quotas) this was a problem. At that time, we couldn't see any easy way to change the situation with Unix users (that changed later), so we needed some kind of quota system that allowed us to enforce quotas on a per-directory basis, so that (e.g.) /home/someuser had a quota of 512MB and /home/otheruser had a quota of 1GB. But most filesystems that provide quotas only support them on a per-user basis.

XFS, however, has a concept of "project quotas". A project is a set of directories, and each project can have its own independent quota. This was perfect for us, so of the tried-and-tested filesystems, XFS was a great choice.
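
For anyone who hasn't used them, setting up a project quota looks roughly like this -- a sketch rather than our exact tooling, and it assumes the filesystem is mounted with the prjquota option:

# map a project id to a directory, and a project name to that id
echo "42:/home/someuser" >> /etc/projects
echo "someuser:42" >> /etc/projid
# initialise the project, then give it a 512MB hard limit
xfs_quota -x -c 'project -s someuser' /home
xfs_quota -x -c 'limit -p bhard=512m someuser' /home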

Later on, of course, we worked out how to map each user to a separate Unix user -- so the project quota concept was less useful. But XFS is solid, reliable, and just as fast as, if not faster than, other filesystems, so there was no reason to change.

How things went wrong

A few weeks back, we had an unexpected outage on a core database instance that supports PythonAnywhere. This caused a number of servers to crash (coincidentally due to the code we use to map PythonAnywhere users to Unix users), and we instituted a rolling reboot. This has happened a couple of times before, and has only required execution server reboots. But this time we needed to reboot the file servers as well.

Our normal process for rebooting an execution server is to run sync to synchronise the filesystem (being old Unix hands, we run it three times "just to be safe", despite the fact that this hasn't been necessary since sometime in the early '90s) and then to do a rapid reboot by echoing "b" to /proc/sysrq-trigger.
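
In concrete terms, that reboot is roughly the following (run as root; a sketch rather than our exact script):

sync; sync; sync                 # flush filesystem buffers (once is enough, but old habits die hard)
echo b > /proc/sysrq-trigger     # immediate reboot, with no clean shutdown or unmount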

File servers, however, require a gentler reboot procedure: they hold critical data, and they are written to heavily enough that things can change between the last sync and the reboot, so a normal, slower reboot command is much safer.

This time, however, we made a mistake -- we used the execution-server-style hard reboot on the file servers.

There were no obvious ill effects; when everything came back, all filesystems were up and running as normal. No data was lost, and the site was back up and running. So we wiped the sweat from our respective brows, and carried on as normal.

Quotas

We first noticed that something was going wrong an hour or so later. Some of our users started reporting that instead of seeing their own disk usage and quotas on the "Files" tab in the PythonAnywhere web interface, they were seeing things like "1.1TB used of 1.6TB quota". Basically, they were seeing the disk usage across the storage volumes they were linked to instead of the quota details specific to their accounts.

This had happened in the past; the process of setting up a new project quota on XFS can take some time, especially when a volume has a lot of them (ours had tens of thousands), and it was done by a service running on the volume's file server that listened to a beanstalk queue and processed updates one at a time. So sometimes, when there was a backlog, people would not see the correct quota information for a while.

But this time, when we investigated, we discovered tons of errors in the "quota queue listener" service's logs.

It appeared that while XFS had managed to store files correctly across the hard reboots, the project quotas had gone wrong. Essentially, all users now had unquota'd disk space. This was obviously a big problem. We immediately set up some alerts so that we could spot anyone going over quota.

We also disabled quota reporting on the PythonAnywhere "Files" interface, so that people wouldn't be confused -- and, indeed, so that people wouldn't guess what was up and try to take advantage by using tons of storage, causing problems for other users. We did not make any announcement about what was going on, as the risks were too high. (Indeed, this blog post is the announcement of what happened :-)

So, how to fix it?

Getting the backups back up

In order to get quotas working again, we'd need to run an XFS quota check on the affected filesystems. We'd done this in the past, and we'd found it to be extremely slow. This was odd, because XFS gurus had advised us that it should be pretty quick -- a few minutes at most. But the last time we'd run one it had taken 20 minutes, and that had been with significantly smaller storage volumes. If it scaled linearly, we'd be looking at at least a couple of hours' downtime. And if it was non-linear, it could be even longer.

We needed to get some kind of idea of how long it would take with our current data size. So, we picked a recent backup of 1.6TB worth of RAID0 disks, created fresh volumes for them, attached them to a fresh server, mounted it all, and kicked off the quota check.

24 hours later, it still hadn't completed. Additionally, in the machine's syslog there were a bunch of errors and warnings about blocked processes. The kind of errors and warnings that made us suspect that the process was never going to complete.

This was obviously not a good sign. The backup we were working from pre-dated the erroneous file server reboots. But the process by which we'd originally created it -- remember, we logged on to a backup server, used drbd to disconnect from its file server, did the backup snapshots, then reconnected drbd -- was actually quite similar to what would have happened during the server's hard reboot. Essentially, we had a filesystem where XFS might have been half-way through doing something when it was interrupted by the backup.

This shouldn't have mattered. XFS is a journaling filesystem, which means that it can be (although it generally shouldn't be) interrupted when it's half-way through something, and can pick up the pieces afterwards. This applies both to file storage and to quotas. But perhaps, we wondered, project quotas are different? Or maybe something else was going wrong?

We got in touch with the XFS mailing list, but unfortunately we were unable to explain the problem with the correct level of detail for people to be able to help us. The important thing we came away with was that what we were doing was not all that unusual, and it should all be working. The quotacheck should be completing in a few minutes.

And now for something completely different

At this point, we had multiple parallel streams of investigations ongoing. While one group worked on getting the quotacheck to pass, another was seeing whether another filesystem would work better for us. This team had come to the conclusion that ext4 -- a more widely-used filesystem than XFS -- might be worth a look. XFS is an immensely powerful tool, and (according to Wikipedia) is used by NASA for 300+ terabyte volumes. But, we thought, perhaps the problem is that we're just not expert enough to use it properly. After all, organisations of NASA's size have filesystem experts who can spend lots of time keeping that scale of system up and running. We're a small team, with smaller requirements, and need a simpler filesystem that "just works". On this theory, we thought that perhaps due to our lack of knowledge, we'd been misusing XFS in some subtle way, and that was the cause of our woes. ext4, being the standard filesystem for most current Linux distros, seemed to be more idiot-proof. And, perhaps importantly, now that we no longer needed XFS's project quotas (because PythonAnywhere users were now separate Unix users), it could also support enough quota management for our needs.

So we created a server with 1.6TB of ext4 storage, and kicked off an rsync to copy the data from another copy of the 1.6TB XFS backup the quotacheck team were using over to it, so that we could run some tests. We left that rsync running overnight.

When we came in the next morning, we saw something scary. The rsync had failed halfway through with IO errors. The backup we were working from was broken. Most of the files were OK, but some of them simply could not be read.

This was definitely something we didn't want to see. With further investigation, we discovered that our backups were generally usable, but in each one, some files were corrupted. Clearly our past backup tests (because, of course, we do test our backups regularly :-) had not been sufficient.

And clearly the combination of our XFS setup and drbd wasn't working the way we thought it did.

We immediately went back to the live system and changed the backup procedure. We started rolling "eternal rsync" processes -- we attached extra (ext4) storage to each file server, matching the existing capacity, and ran looped scripts that used rsync (at the lowest-priority ionice level) to make sure that all user data was backed up there.
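
Each "eternal rsync" was conceptually just a loop along these lines (a simplified sketch -- the paths are illustrative):

#!/bin/bash
# keep a low-priority mirror of the XFS user storage on the attached ext4 volume
while true; do
    ionice -c3 nice -n19 rsync -raXAS --delete /mnt/user_storage/ /mnt/ext4_backup/
    sleep 300    # short pause before the next pass
done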

We made sure that we weren't adversely affecting filesystem performance by checking out an enormous git repo into one of our own PythonAnywhere home directories, and running git status (which reads a bunch of files) regularly, and timing it.

Once the first eternal rsyncs had completed, we were 100% confident that we really did have everyone's data safe. We then changed the backup process to be:

  • Interrupt the rsync.
  • Make sure the ext4 disks were not being accessed.
  • Back up the ext4 disks.
  • Kick off the rsync again.

This meant that we could be sure that the backups were recoverable, as they came from a filesystem that was not being written to while they were taken. This time we tested them with an rsync from disk to disk, just to be sure that every file was OK.

We then copied the data from one of the new-style backups, that had come from an ext4 filesystem, over to a new XFS filesystem. We attached the XFS filesystem to a test server, set up the quotas, set some processes to reading from and writing to it, then did a hard reboot on the server. When it came back, it mounted the XFS filesystem, but quotas were disabled. Running a quotacheck on the filesystem crashed.

Further experiments showed that this was a general problem with pretty much any project-quota'ed XFS filesystem we could create; in our tests, a hard reboot caused a quotacheck when the filesystem was remounted, and this would frequently take a very long time, or even crash -- leaving the disk only mountable with no quotas.

We tried running a similar experiment using ext4; when the server came back after a hard reboot, it took a couple of minutes checking quotas and a few harmless-seeming warnings appeared in syslog. But the volumes mounted OK, and quotas were active.

Over to ext4

By this time we'd persuaded ourselves that moving to ext4 was the way forward for dimwits like us. So the question was, how to do it?

The first step was obviously to change our quota-management and system configuration code so that it used ext4's quota commands instead of XFS's (there's a rough sketch of the ext4 side a little further down). One benefit of doing this was that we were able to remove a bunch of database dependencies from the file server code. This meant that:

  • A future database outage like the one that triggered all of this work wouldn't cause file server outages, so we'd be less likely to make the mistake of hard-rebooting one of them.
  • Our file server-database dependency was one of the main blockers that had been stopping us from moving to a model where we can deploy new versions of PythonAnywhere without downtime. (We're currently actively working on eliminating the remaining blockers.)

It's worth saying that the database dependency wasn't due to XFS; we were just able to eliminate it at this point because we were changing all of that code anyway.
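
For reference, the ext4 side is just the standard Linux quota tooling; at the filesystem level the setup looks roughly like this (a sketch assuming the volume is mounted with the usrquota option, not our actual commands):

# initialise the quota files and turn quotas on
quotacheck -cum /mnt/user_storage
quotaon /mnt/user_storage
# give someuser a ~512MB hard block limit (block limits are in 1KB units)
setquota -u someuser 0 524288 0 0 /mnt/user_storage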

Once we'd made the changes and run it through our continuous integration environment a few times to work out the kinks, we needed to deploy it. This was trickier.

What we needed to do was:

  • Start a new PythonAnywhere cluster, with no file storage attached to the file servers.
  • Shut down all filesystem access on the old PythonAnywhere to make sure that the files were stable.
  • Copy all of the data from all XFS filesystems to matching ext4 filesystems
  • Move the ext4 filesystems over to the new cluster.
  • Activate the new cluster.

Parallelise rsync for great good

The "copy" phase was the problem. The initial run of our eternal rsync processes made it clear that copying 1.6TB (our standard volume size) from a 1.6TB XFS volume to an equivalent ext4 one took 26 hours. A 26 hour outage would be completely unacceptable.

However, the fact that we were already running eternal rsync processes opened up some other options. The first sync took 26 hours, but each additional one took 6 hours -- that is, it took 26 hours to copy all of the data, then after that it took 6 hours to check for any changes between the XFS volume and the ext4 one it was copying them to that had happened while the original copy was running, and to copy those changes across. And then it took 6 hours to do that again.

We could use our eternal rsync target ext4 disks as the new disks for the new cluster, and just sync across the changes.

But that would still leave us with a 6+ hour outage -- 6 hours for the copy, and then extra time for moving disks around and so on. Better, but still not good enough.

Now, the eternal rsync processes were running at a very high nice and ionice level so as not to disrupt filesystem access on the live system. So we tested how long it would take to run the rsync with the opposite, resource-gobbling niceness settings. To our surprise, it didn't change things much; an rsync of 6 hours' worth of changes from an XFS volume to an ext4 one took about five and a half hours.

We obviously needed to think outside the box. We looked at what was happening while we ran one of these rsyncs, in top and iotop, and noticed that we were nowhere near maxing out our CPU or our disk IO... which made us think, what happens if we do things in parallel?

At this point, it might be worth sharing some (slightly simplified) code:

rsync-all.sh

#!/bin/bash
# Parameter $1 is the number of rsyncs to run in parallel
cd /mnt/old_xfs_volume/
# run rsync-one.sh once per top-level directory (one per user), up to $1 at a time
ls -d * | xargs -n 1 -P "$1" ~/rsync-one.sh

rsync-one.sh

#!/bin/bash
# Parameter $1 is a single top-level directory (one user's storage)
mkdir -p /mnt/new_ext4_volume/"$1"
rsync -raXAS --delete /mnt/old_xfs_volume/"$1" /mnt/new_ext4_volume/
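
The whole copy was then kicked off by invoking the wrapper with the desired level of parallelism, along these lines:

./rsync-all.sh 30    # run the copy with 30 rsyncs in parallel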

For some reason our notes don't capture, on our first test we went a bit crazy and used 720 parallel rsyncs, for a total of about 2,000 processes.

It was way better. The copy completed in about 90 minutes. So we experimented. After many, many tests, we found that the sweet spot was about 30 parallel rsyncs, which took on average about an hour and ten minutes.

Going.... LIVE

We believed that the copy would take about 70 minutes. Given that this deployment was going to require significantly more manual running of scripts and so on than a normal one, we figured that we'd need 50 minutes for the other tasks, so we were up from our normal 20-30 minutes of downtime for a release to two hours. Which was high, but just about acceptable.

The slowest time of day across all of the sites we host is between 4am and 8am UTC, so we decided to go live at 5am, giving us 3 hours just in case things went wrong. On 17 March, we had an all-hands-on deck go-live with the new code. And while there were a couple of scary moments, everything went pretty smoothly -- in particular, the big copy took 75 minutes, almost exactly what we'd expected.

So as of 17 March, we've been running on ext4.

Post-deploy tests

Since we went live, we've run two main tests.

First, and most importantly, we've tested our backups much more thoroughly than before. We've gone back to the old backup technique -- on the backup server, shut down the drbd connection, snapshot the disks, and restart drbd -- but now we're using ext4 as the filesystem. And we've confirmed that our new backups can be re-mounted, they have working quotas, and we can rsync all of their data over to fresh disks without errors. So that's reassuring.

Secondly, we've taken the old XFS volumes and tried to recover the quotas. It doesn't work. The data is all there, and can be rsynced to fresh volumes without IO errors (which means that at no time was anyone's data at risk). But the project quotas are irrecoverable.

We've also (before we went live with ext4, but after we'd committed to it) discovered that there was a bug in XFS -- fixed in Linux kernels since 3.17, but we're on Ubuntu Trusty, which uses 3.13. It is probably related to the problem we saw, but it certainly doesn't explain it totally -- it explains why a quotacheck ran when we re-mounted the volumes, but not why it never completed, or why we were never able to re-mount the volumes with quotas enabled.

Either way, we're on ext4 now. Naturally, we're 100% sure it won't have any problems whatsoever and everything will be just fine from now on ;-)


Today's maintenance upgrade: Fileserver migration complete, other updates

Morning all!

XFS -> ext4

So the reason for our extra-long maintenance window this morning was primarily a migration from XFS to ext4 as our filesystem for user storage. We'll write more about the whys and wherefores of this later, but the short version is that the main reason for using XFS -- project quotas -- was no longer needed, and a bug in the version of XFS supported by Ubuntu LTS left us vulnerable to long periods of downtime after unplanned reboots, while XFS did some unnecessary quotachecks. The switch to ext4 removes that risk, and has simplified some of our code too -- bonus!

In other news, we've managed to squeeze in a few more user-visible improvements :)

Features bump for paid plans

We've decided to tweak the pricing and accounts pages so that all plans are customisable. As a bonus side-effect, we've slightly improved all the existing paid plans, so our beloved customers are going to get some free stuff:

  • All Hacker plans now allow you to replace your .pythonanywhere.com domain with a custom one
  • We've bumped the disk space for Hacker plans from 512MB to 2Gigs
  • And we've bumped the Web Developer CPU quota from 3000 to 4000 seconds

Package installs

bottlenose, python-amazon-simple-product-api, py-bcrypt, Flask-Bcrypt, flask-restful, markdown (for Python 3), wheezy.template, pydub, and simpy (for Python 3) are now part of our standard batteries included.

Pip wheels available

We've re-written our server build scripts to use wheels, and to build them for each package we install. We've made them available (at /usr/share/pip-wheels), and we've added them to the PythonAnywhere default pip config. So if you're installing things into a virtualenv and we already happen to have a wheel for the package you want, pip will find it and the install will complete much faster.
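
As a purely illustrative example (the package name is arbitrary, and with the default config you shouldn't even need the flag):

# inside a virtualenv; pip checks the local wheel directory before building from source
pip install --find-links=/usr/share/pip-wheels requests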

Python 3 is now the default for save + run

The "Save and Run" button at the top of the editor, much beloved of teachers and beginners (and highly relevant for our education beta) now defaults to Python 3. It's 2015, this is the future after all. We didn't want to break things for existing users, so they will still have 2 as the default, but we can change that for you if you want. Just drop us a line to support@pythonanywhere.com

Security and performance improvements

Other than that, we've added a few minor security and performance tweaks.

Onwards and upwards!


A Baby's First Steps (Part 1)

Hi guys,

I'm Conrad -- a new member of the PythonAnywhere team. As a rather junior, beginner programmer, I would like to share with you the story of how I set up my work environment -- my rationale for choosing and customizing my text editor, my shell, my window manager etc., and what I learned along the way.

When I started out on this project, I had two goals in mind: to be as lazy as possible, and to be as scalable/consistent as possible (i.e. to be able to take what I learnt and set up the same thing quickly and easily on a new PythonAnywhere project, on my Mac laptop, on an Ubuntu server, on a friend's Windows machine... etc. with zero extra changes/tweaks).

When it comes to being lazy, some may say that I have quite an extreme stance, including using as much automation as possible, and tweaking my key bindings and other work conditions to be as ergonomic as possible:

  1. For any sort of typing, I want to do it with the least number of key strokes.
    • eg: Instead of cd, I have an alias g that does exactly the same thing.
    • If you are a Python programmer, you probably use colons a lot more than semi-colons. So I switched : with ; so I won't have to press shift as often.
  2. If anything can be done automatically, I don't want to have to call it/click it/run it myself.

    • eg: I have a shortcut to edit my bashrc/vimrc and then source/reload it automatically after I save.
    • For you PythonAnywhere webapp users reading this, I also have a bash alias to reboot my webapp from the console:

      touch /var/www/<your-web-domain>_wsgi.py

      Do you recognize this file path? It's your wsgi.py file that you have likely been editing when you customize your webapp!

  3. If I do need to type, I don't want to move my fingers too much from their resting positions (ergonomics!!).

    • eg: My vim and tmux leader keys are Space and Ctrl+Space, because your thumb is the strongest/most underworked finger, and is naturally resting on the Space key already.
    • Interestingly, there are a lot of limitations to using Ctrl + x or Alt + x type commands as shortcuts. This is because some Ctrl + x and many, many Alt + x keys get captured by your operating system, or your browser etc., to do whatever shortcut they have set. To be able to have consistent key mappings across platforms, you probably want to avoid depending on too many Ctrl/Alt key shortcuts.
    • Having said that, there are definitely a couple local mappings I do enjoy, even though I run the risk of not having these mappings available when working on different operating systems, or working on someone else's computer:
      • I switch Escape with Caps Lock, since I use the Escape key so much more often.
      • I also switch the Left Alt key with the Left Ctrl key, so that I can use my thumb instead of extending my pinky when I need to hit ctrl, and so that both alt and ctrl are accessible from my thumb. Currently I have Left Alt and Left Ctrl switched (instead of Right Alt and Right Ctrl), because I want to be able to use my mouse and copy/paste with a single hand. However, when purely typing, ergonomically it makes more sense to hold down shift/ctrl/alt with one hand and press the other key with the other hand. So maybe, I'm thinking, the ideal solution is actually to swap Right Alt and Right Ctrl, and use the mouse when needed with your left hand! They do recommend switching mouse hands for people with RSI...
  4. There are other aspects to ergonomics as well: why have your editor take up just half the screen and then have to squint because the font size needs to be smaller to fit into your editor?

    Here is a quick tip for PythonAnywhere users: adding "/frame" to the end of your console URL will take you to a page with just the console frame. That extra bit of screen real estate basically means that you can have larger fonts if you like, or you can browse through two files in parallel more easily within a vim split pane (as below), or you can actually see a couple of lines of commands on your mobile phone after your keyboard pops up and blocks half the screen, etc.

    If you do that, here is what you get when working with PythonAnywhere in full screen:

    PythonAnywhere Console iFrame on Full Screen

    WOW LOOK AT THIS PYTHONANYWHERE CONSOLE IN MY WORKSPACE! Now look below... Now back up top, now back down, now back up. Sadly, the bottom PythonAnywhere console will never be in my workspace, but if it had used the /frame hack, it could at least look like it’s my console.

    PythonAnywhere Console Webpage taking up Half the Monitor

    Not even comparable.

In any case, a lot of this may of course just be attributable to personal preference. Having rambled on for so long and gotten severely off topic, I am going to end this blog post for now. But keep your eyes peeled for a follow up post soon on what my work setup actually looks like, featuring tmux, a bootstrapping bash script to setup my configs, and many more other goodies and adventures along the way!


New release -- better virtualenv handling for webapps, tmux, mutt, and our education beta

Morning all!

A pleasingly smooth deploy this morning, allowing us to bring you some new features we hope you'll like:

  • The web tab now has the option to specify a virtualenv, which will then be used by the uWSGI workers that run your web app. This avoids the ugly exec activate_this hack we had to recommend, and should avoid issues with shadowing. More info here and here.

  • Thanks to Conrad (yay new guy!), we've added tmux and mutt as available binaries in consoles, for all you terminal wizzzards out there.

  • And we're doing a soft launch of our (very lean, very early) education beta. We've started to build out some more features to make PythonAnywhere a great place to teach + learn Python, so do get in touch if you're an educator and want to get involved.

That's about that! Keep hassling us for new features, and we'll keep trying to deliver them as soon as we can...


#VATMESS - or, how a taxation change took 4 developers a week to handle

A lot of people are talking about the problems that are being caused by a recent change to taxation in the EU; this TechCrunch article gives a nice overview of the issues. But we thought it would be fun to tell you about our own experience -- for general interest, and as an example of what one UK startup had to do to implement these changes. Short version: it hasn't been fun.

If you know all about the EU VAT changes and just want to know what we did at PythonAnywhere, click here to skip the intro. Otherwise, read on...

The background

"We can fight over what the taxation levels should be, but the tax system should be very, very simple and not distortionary." - Adam Davidson

The tax change is, in its most basic form, pretty simple. But some background will probably help. The following is simplified, but hopefully reasonably clear.

VAT is Value Added Tax, a tax that is charged on pretty much all purchases of anything inside the EU (basic needs like food are normally exempt). It's not dissimilar to sales or consumption tax, but the rate is quite high: in most EU countries it's something like 20%. When you buy (say) a computer, VAT is added on to the price, so a PC that the manufacturer wants EUR1,000 for might cost EUR1,200 including VAT. Prices for consumers are normally quoted with VAT included, when the seller is targeting local customers. (Companies with large numbers of international customers, like PythonAnywhere, tend to quote prices without VAT and show a note to the effect that EU customers have to pay VAT too.)

When you pay for the item, the seller takes the VAT, and they have to pay all the VAT they have collected to their local tax authority periodically. There are various controls in place to make sure this happens, that people don't pay more or less VAT than they've collected, and in general it all works out pretty simply. The net effect is that stuff is more expensive in Europe because of tax, but that's a political choice on the part of European voters (or at least their representatives).

When companies buy stuff from each other (rather than sales from companies to consumers) they pay VAT on those purchases if they're buying from a company in their own country, but they can claim that VAT back from their local tax authorities (or offset it against VAT they've collected), so it's essentially VAT-free. And when they buy from companies in other EU countries or internationally, it's VAT-free. (The actual accounting is a little more complicated, but let's not get into that.)

What changed

"Taxation without representation is tyranny." - James Otis

Historically, for "digital services" -- a category that includes hosting services like PythonAnywhere, but also downloaded music, ebooks, and that kind of thing -- the rule was that the rate of VAT that was charged was the rate that prevailed in the country where the company doing the selling was based. This made a lot of sense. Companies would just need to register as VAT-collecting businesses with their local authorities, charge a single VAT rate for EU customers (apart from sales to other VAT-registered businesses in other EU countries), and pay the VAT they collected to their local tax authority. It wasn't trivially simple, but it was doable.

But, at least from the tax authorities' side, there was a problem. Different EU countries have different VAT rates. Luxembourg, for example, charges 15%, while Hungary is 27%. This, of course, meant that Hungarian companies were at a competitive disadvantage to Luxembourgeoise companies.

There's a reasonable point to be made that governments who are unhappy that their local companies are being disadvantaged by their high tax rates might want to consider whether those high tax rates are such a good idea, but (a) we're talking about governments here, so that was obviously a non-starter, and (b) a number of large companies had a strong incentive to officially base themselves in Luxembourg, even if the bulk of their business -- both their customers and their operations -- was in higher-VAT jurisdictions.

So, a decision was made that instead of basing the VAT rate for intra-EU transactions for digital services on the VAT rate for the seller, it should be based on the VAT rate for the buyer. The VAT would then be sent to the customer's country's tax authority -- though to keep things simple, each country's tax authority would set up a process where the company's local tax authority would collect all of the money from the company along with a file saying how much was meant to go to each other country, and they'd handle the distribution. (This latter thing is called, in a wonderful piece of bureaucratese, a "Mini One-Stop Shop" or MOSS, which is why "VATMOSS" has been turned into a term for the whole change.)

As these kinds of things go, it wasn't too crazy a decision. But the knock-on effects, both those inherent in the idea of companies having to charge different tax rates for different people, and those caused by the particular way the details of the laws have been worked out, have been huge. What follows is what we had to change for PythonAnywhere. Let's start with the basics.

Different country, different VAT rate

"The wisdom of man never yet contrived a system of taxation that would operate with perfect equality." - Andrew Jackson

Previously we had some logic that worked out if a customer was "vattable". A user in the UK, where we're based, is vattable, as is a non-business anywhere else in the EU. EU-based businesses and anyone from outside the EU were non-vattable. If a user was vattable, we charged them 20%, a number that we'd quite sensibly put into a constant called VAT_RATE in our accounting system.

What needed to change? Obviously, we have a country for each paying customer, from their credit card/PayPal details, so a first approximation of the system was simple. We created a new database table, keyed on the country, with VAT rates for each. Now, all the code that previously used the VAT_RATE constant could do a lookup into that table instead.

So now we're billing people the right amount. We also need to store the VAT rate on every invoice we generate so that we can produce the report to send to the MOSS, but there's nothing too tricky about that.

Simple, right? Not quite. Let's put aside that there's no solid source for VAT rates across the EU (the UK tax authorities recommend that people look at a PDF on an EU website that's updated irregularly and has a table of VAT rates using non-ISO country identifiers, with the countries' names in English but sorted by their name in their own language, so Austria is written "Austria" but sorted under "O" for "Österreich").

No, the first problem is in dealing with evidence.

Where are you from?

"Extraordinary claims require extraordinary evidence." - Carl Sagan

How do you know which country someone is from? You'd think that for a paying customer, it would be pretty simple. Like we said a moment ago, they've provided a credit card number and an address, or a PayPal billing address, so you have a country for them. But for dealing with the tax authorities, that's just not enough. Perhaps they feared that half the population of the EU would be flocking to Luxembourgeoise banks for credit cards based there to save a few euros on their downloads.

What the official UK tax authority guidelines say regarding determining someone's location is this (and it's worth quoting in all its bureaucratic glory):

1.5 Record keeping

If the presumptions referred above don’t apply, you’ll be expected to obtain and to keep in your records 2 pieces of non-contradictory evidence from the following list to support your taxing decisions. Examples include:

  • the billing address of the customer
  • the Internet Protocol (IP) address of the device used by the customer
  • location of the bank
  • the country code of SIM card used by the customer
  • the location of the customer’s fixed land line through which the service is supplied to him
  • other commercially relevant information (for example, product coding information which electronically links the sale to a particular jurisdiction)

Once you have 2 pieces of non-contradictory evidence that is all you need and you don’t need to collect any further supporting evidence. This is the case even if, for example, you obtain a third piece of evidence which happens to contradict the other 2 pieces of information. You must keep VAT MOSS records for a period of 10 years from 31 December of the year during which the transaction was carried out.

For an online service paid with credit cards, like PythonAnywhere, the only "pieces of evidence" we can reasonably collect are your billing address and your IP address.

So, our nice simple checkout process had to grow another wart. When someone signs up, we have to look at their IP address. We then compare this with a GeoIP database to get a country. When we have both that and their billing address, we have to check them (there's a rough sketch of the logic after the list below):

  • If neither the billing address nor the IP address is in the EU, we're fine -- continue as normal.
  • If either the billing address or the IP address is in the EU, then:
    • If they match, we're OK.
    • If they don't match, we cannot set up the subscription. There is quite literally nothing we can do, because we don't know enough about the customer to work out how much VAT to charge them.
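
In code terms, the check boils down to something like this -- a minimal sketch, assuming we've already resolved the IP address to a country code via the GeoIP database, and with an abbreviated, purely illustrative set of EU country codes:

EU_COUNTRIES = {"AT", "BE", "DE", "ES", "FR", "GB", "HU", "IT", "LU", "PT"}  # abbreviated list

def can_determine_vat_country(billing_country, ip_country):
    if billing_country not in EU_COUNTRIES and ip_country not in EU_COUNTRIES:
        return True   # neither piece of evidence is in the EU, so no EU VAT to charge
    # otherwise we need two non-contradictory pieces of evidence
    return billing_country == ip_country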

We've set things up so that when someone is blocked due to a location mismatch like that, we show the user an apologetic page, and the PythonAnywhere staff get an email so that we can talk to the customer and try to work something out. But this is a crazy situation:

  • If you're a British developer with a UK credit card, currently on a business trip to Germany, then you can't sign up for PythonAnywhere. This does little good for the free movement of goods and services across the EU.
  • If you're an American visiting the UK and want to sign up for our service, you're equally out of luck. This doesn't help the EU's trade balance much.
  • If you're an Italian developer who's moved to Portugal, you can't sign up for PythonAnywhere until you have a new Portuguese credit card. This makes movement of people within the EU harder.

As far as we can tell there's no other way to implement the rules in a manner consistent with the guidelines. This sucks.

And it's not all. Now we need to deal with change...

Ch-ch-changes

"Time changes everything except something within us which is always surprised by change." - Thomas Hardy

VAT rates change (normally upwards). When you're only dealing with one country's VAT rate, this is pretty rare; maybe once every few years, so as long as you've put it in a constant somewhere and you're willing to update your system when it changes, you're fine. But if you're dealing with 28 countries, it becomes something you need to plan for happening pretty frequently, and it has to be stored in data somewhere.

So, we added an extra valid_from field to our table of VAT rates. Now, when we look up the VAT rate, we plug today's date into the query to make sure that we've got the right VAT rate for this country today. (If you're wondering why we didn't just set things up with a simple country-rate mapping that would be updated at the precise point when the VAT rate changed, read on.)
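
The lookup itself is simple -- something like this sketch, assuming a Django-style model with country, valid_from, and rate fields (the names here are illustrative, not our actual code):

def vat_rate_for(country_code, on_date):
    # most recent rate whose valid_from is on or before the date we care about
    return (
        VatRate.objects
        .filter(country=country_code, valid_from__lte=on_date)
        .order_by("-valid_from")
        .first()
        .rate
    )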

No big deal, right? Well, perhaps not, but it's all extra work. And of course we now need to check a PDF on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying "Beware of the Leopard" to see when it changes. We're hoping a solid API for this will crop up somewhere -- though there are obviously regulatory issues there, because we're responsible for making sure we use the right rate, and we can't push that responsibility off to someone else. The API would need to be provided by someone we were 100% sure would keep it updated at least as well as we would ourselves -- so, for example, a volunteer-led project would be unlikely to be enough.

So what went wrong next? Let's talk about subscription payments.

Reconciling ourselves to PayPal

"Nothing changes like changes, because nothing changes but the changes." - Gary Busey

Like many subscription services, at PythonAnywhere we use external companies to manage our billing. The awesome Stripe handle our credit card payments, and we also support PayPal. This means that we don't need to store credit card details in our own databases (a regulatory nightmare) or do any of the horrible integration work with legacy credit card processing systems.

So, when you sign up for a paid PythonAnywhere account, we set up a subscription on either Stripe or PayPal. The subscription is for a specific amount to be billed each month; on Stripe it's a "gross amount" (that is, including tax), while PayPal splits it out into a separate "net amount" and a "tax amount". But the common thing between them is that they don't do any tax calculations themselves. For international companies having to deal with billing all over the world, this makes sense, especially given that historically, say, a UK company might have been billing through their Luxembourg subsidiary to get the lower VAT rate, so there are no safe assumptions that the billing companies could make on their customers' behalf.

Now, the per-country date-based VAT rate lookup into the database table that we'd done earlier meant that when a customer signed up, we'd set up a subscription on PayPal/Stripe with the right VAT amount for the time when the subscription was created. But if and when the VAT rate changed in the future, it would be wrong. Our code would make sure that we were sending out billing reminders for the right amount, but it wouldn't fix the subscriptions on PayPal or Stripe.

What all that means is that when a VAT rate is going to change, we need to go through all of our customers in the affected country, identify if they were using PayPal or Stripe, then tell the relevant payment processor that the amount we're charging them needs to change.

This is tricky enough as it is. What makes it even worse is an oddity with PayPal billing. You cannot update the billing amount for a customer in the 72 hours before a payment is due to be made. So, for example, let's imagine that the VAT rate for Malta is going to change from 18% to 20% on 1 May. At least three days before, you need to go through and update the billing amounts on subscriptions for all Maltese customers whose next billing date is after 1 May. And then sometime after you have to go through all of the other Maltese customers and update them too.

Keeping all of this straight is a nightmare, especially when you factor in things like the fact that (again due to PayPal) we actually charge people one day after their billing date, and users can be "delinquent" -- that is, their billing date was several days ago but due to (for example) a credit card problem we've not been able to charge them yet (but we expect to be able to do so soon). And so on.

The solution we came up with was to write a reconciliation script. Instead of having to remember "OK, so Malta's going from 18% to 20% on 1 May so we need to run script A four days before, then update the VAT rate table at midnight on 1 May, then run script B four days after", and then repeat that across all 28 countries forever, we wanted something that would run regularly and would just make everything work, by checking when people are next going to be billed and what the VAT rate is on that date, then checking what PayPal and Stripe plan to bill them, and adjusting them if appropriate.

A code sample is worth a thousand words, so here's the algorithm we use for that. It's run once for every EU user (whose subscription details are in the subscription parameter). get_next_charge_date_and_gross_amount and update_vat_amount are functions passed in as dependencies, so that we can use the same algorithm for both PayPal and Stripe.

from datetime import datetime

def reconcile_subscription(
    subscription,
    get_next_charge_date_and_gross_amount,
    update_vat_amount
):
    next_invoice_date = subscription.user.get_profile().next_invoice_date()
    next_charge_date, next_charge_gross_amount = get_next_charge_date_and_gross_amount(subscription)

    if next_invoice_date < datetime.now():
        # Cancelled subscription
        return

    if next_charge_date < next_invoice_date:
        # We're between invoicing and billing -- don't try to do anything
        return

    expected_next_invoice_vat_rate = subscription.user.get_profile().billing_vat_rate_as_of(next_invoice_date)
    expected_next_charge_vat_amount = subscription.user.get_profile().billing_net_amount * expected_next_invoice_vat_rate

    if next_charge_gross_amount == expected_next_charge_vat_amount + subscription.user.get_profile().billing_net_amount:
        # User has correct billing set up on payment processor
        return

    # Needs an update
    update_vat_amount(subscription, expected_next_charge_vat_amount)

We're pretty sure this works. It's passed every test we've thrown at it. And for the next few days we'll be running it in our live environment in "nerfed" mode, where it will just print out what it's going to do rather than actually updating anything on PayPal or Stripe. Then we'll run it manually for the first few weeks, before finally scheduling it as a cron job to run once a day. And then we'll hopefully be in a situation where when we hear about a VAT rate change we just need to update our database with a new row with the appropriate country, valid from date, and rate, and it will All Just Work.

(An aside: this probably all comes off as a bit of a whine against PayPal. And it is. But they do have positive aspects too. Lots of customers prefer to have their PythonAnywhere accounts billed via PayPal for the extra level of control it gives them -- about 50% of new subscriptions we get use it. And the chargeback model for fraudulent use under PayPal is much better -- even Stripe can't isolate you from the crazy-high chargeback fees that banks impose on companies when a cardholder claims that a charge was fraudulent.)

In conclusion

"You can't have a rigid view that all new taxes are evil." - Bill Gates

The changes to the EU VAT legislation came from the not-completely-unreasonable attempt by various governments to stop companies from setting up businesses in low-VAT countries for the sole purpose of offering lower prices to their customers, despite having their operations in higher-VAT countries.

But the administrative load placed on small companies (including but not limited to tech startups) is large, it makes billing systems complex and fragile, and it imposes restrictions on sales that are likely to reduce trade. We've seen various government-sourced estimates of the cost of these regulations on businesses floating around, and they all seem incredibly low.

At PythonAnywhere we have a group of talented coders, and it still took over a week of development time that we could have spent working on stuff our customers wanted. Other startups will be in the same position; it's an irritation, but not the end of the world.

For small businesses without deep tech talent, we dread to think what will happen.


New PythonAnywhere update: Mobile, UI, packages, reliability, and the dreaded EU VAT change

We released a bunch of updates to PythonAnywhere today :-) Short version: we've made some improvements to the iPad and Android experience, applied fixes to our in-browser console, added a bunch of new pre-installed packages, done a big database upgrade that should make unplanned outages rarer and shorter, and made changes required by EU VAT legislation (EU customers will soon be charged their local VAT rate instead of UK VAT).

Here are the details:

iPad and Android

  • The syntax-highlighting in-browser editor that we use, Ace, now supports the iPad and Android devices, so we've upgraded it and changed the mobile version of our site to use it instead of the rather ugly textarea we used to use.
  • We've also re-introduced the "Save and Run" button on the iPad.
  • Combined with the console upgrade we did earlier on this month, our mobile support should now be pretty solid on iPads, iPhones, and new Android (Lollipop) devices. Let us know if you encounter any problems!

User interface

  • Some fixes to our in-browser consoles: fixed problems with zooming in (the bottom line could be cut off if your browser zoom wasn't set to 100%) and with the control key being stuck down if you switched tabs while it was pressed.
  • A tiny change, but one that (we hope) might nudge people in a good direction: we now list Python 3 before Python 2 in our list of links for starting consoles and for starting web apps :-)

New packages

We've added loads of new packages to our "batteries included" list:

  • A quantitative finance library, Quandl (Python 2.7, 3.3 and 3.4)
  • A backtester for financial algorithms, zipline (Python 2.7 only)
  • A Power Spectral Densities estimation package, spectrum (2.7 only)
  • The very cool remix module from Echo Nest: make amazing things from music! (2.7 only)
  • More musical stuff: pyspotify. (2.7 only)
  • Support for the netCDF data format (2.7, 3.3 and 3.4)
  • Image tools: imagemagick and aggdraw (2.7, 3.3 and 3.4)
  • Charting: pychart (2.7 only)
  • For Django devs: django-bootstrap-form (2.7, 3.3 and 3.4)
  • For Flask devs: flask-admin (2.7, 3.3 and 3.4)
  • For web devs who prefer offbeat frameworks: falcon and wheezy.web (2.7, 3.3 and 3.4)
  • A little ORM: peewee (2.7, 3.3 and 3.4)
  • For biologists: simplehmmer (2.7 only)
  • For statistics: Augustus. (2.7 only)
  • For thermodynamicists (?): CoolProp (2.7, 3.3 and 3.4)
  • Read barcodes from images or video: zbar (2.7 only)
  • Sending texts: we've upgraded the Python 2.7 twilio package so that it works from free accounts, and also added it for Python 3.3 and 3.4.
  • Locating people by IP address: pygeoip (2.7, 3.3 and 3.4)
  • We previously had the rpy2 package installed but there was a bug that stopped you from importing rpy2.robjects. That's fixed.

Additionally, for people who like alternative shells to the ubiquitous bash, we've added fish.

Reliability improvement

We've upgraded one of our underlying infrastructural databases to SSD storage. We've had a couple of outages recently caused by problems with this database, which were made much worse by the fact that it took a long time to start up after a failover. Moving it to SSD moved it to new hardware (which we think will make it less likely to fail) and will also mean that if it does fail, it should recover much faster.

EU VAT changes

For customers outside the EU, this won't change anything. But for non-business customers inside the EU, starting 1 January 2015, we'll be charging you VAT at the rate for your country, instead of using the UK VAT rate of 20%. This is the result of some (we think rather badly-thought-through) new EU legislation. We'll write an extended post about this sometime soon. [Update: here it is.]


PythonAnywhere now supports Postgres

Finally!

tl;dr: upgrade to a Custom account and you can now add Postgres

Say no to multi-tenancy (aka The Whale vs the Elephant vs the Dolphin)

Postgres has been the top requested feature for as long as we've supported webapps (3 years? who's counting!). We did have a brief beta based on the idea of a single postgres server, and multi-tenanted low-privileged accounts for each user, but it turned out Postgres really doesn't work too well that way (unlike MySQL).

Our new solution uses Linux containers to provide an isolated server for each user, so everyone can have full superuser access. And, yes, it uses the ubiquitous Docker under the hood.

How to get it

  • You'll need to upgrade to a Custom account, and enable Postgres, as well as choosing how much storage you need for your database.

  • Then, head on over to the Databases tab, and click the big button that says "Start a Postgres server".

  • Once it's ready, take a note of its hostname and port (see the connection example after this list).

  • Set your superuser password. We'll save it to ~/.pgpass for convenience.

  • And now hit "Start a Postgres console" and take a look around!
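
Once the server's up, connecting from a console looks something like this -- substitute the hostname, port, and superuser name shown on your Databases tab (the values below are placeholders):

# with the password in ~/.pgpass, psql won't prompt for it
psql --host=<your-postgres-hostname> --port=<port> --username=<your-superuser> postgres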

How it works under the hood

We run several different machines to host postgres containers for our users. When you hit "start my server", we scan through to find a machine with spare capacity, and ask it to build up a container for you.

It's a docker container based on our postgres image, but each individual user's is customised slightly. For example, we create a superuser account called "pythonanywhere_helper" with a unique, random password to enable us to perform some admin functions. (once you have your own superuser account you could theoretically delete this guy, but we'd rather you didn't...)

We also set up a special permanent storage area for your database, which gives us the ability to migrate you to a different server if we need to.

You can read a bit more about our TDD process for developing this part of the codebase here.

What it costs

Aha, the dreaded question. We've set the price for the basic service at $15/month, which includes 1GB of storage; subsequent gigs are 20 cents/gig, which we think is pretty fair... Feel free to have a moan if you think it isn't!

What to do next

That's up to you! We've tried to supply you with everything you could need, including the bleeding-edge 9.4 version of Postgres, PostGIS, PL-Python and PL-Python3.... Let us know how you get on!


New release -- new console with 256 colours, some fixes to task logging, and the P-thing.

Exciting new deploy today!

A new console

Obviously, the most important thing we did was to switch out our JavaScript console for a new one that supports 256 colours! And slightly saner copy + paste. And it works on Android, or at least it does on Lollipop. Giles recommends the Hacker's Keyboard. Still doesn't work on my BlackBerry though.

For the curious, it's based on hterm which is a part of Chromium...

Some new packages

Of secondary importance, we added a few new packages, including TA-lib, pytesseract, and a thing called ruffus.

Improved logging of scheduled tasks

Scheduled tasks now log directly to files in /var/log, rather than storing their output in our database. That means they'll get log-rotated like everything else in there, and if you call flush on your sys.stdout, you may even be able to see live updates while tasks are still running. I think.
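
So if you want to watch a long-running task's progress in its log file, something like this should do it (a trivial sketch; the sleep just stands in for real work):

import sys
import time

for i in range(10):
    time.sleep(1)                  # stand-in for the task's real work
    print("processed item", i)
    sys.stdout.flush()             # push the line out to the /var/log file straight away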

New database type supported

Oh, and we also released a new database type, it's called Postgres, I'm told it's quite popular. Skip on over to the accounts page and get yourself a Custom account if you want to check it out.

Happy coding everyone!


Outage report: 1st November 2014

We had an outage this morning that lasted about an hour. We've established the cause, fixed the problem, and all sites are now back up. Apologies to all those affected. More detail follows.

We were alerted to an initial problem via our monitoring systems around 9:50AM. The symptom was that one of the web servers was seeing intermittent outages, due to a memory leak in one of our users' web applications causing a lot of swapping.

Our failover procedure involves taking the affected server out of rotation on the load balancer, redistributing its workload across the other servers, and rebooting it. We saw the same user using a lot of memory on a second server, so we were able to confirm that it was a repeatable issue. We disabled his web app and rebooted this second server.

At this point the larger issue kicked in, which was that the rebooted servers seemed to be non-functional when they came back. This left the remaining servers struggling to keep up with the load, and caused outages for more customers. By this point two of us were working on the issue, and it took us a while to identify the root cause. It turned out to be due to a change in our logging configuration which was causing nginx to hang on startup. Specifically, it only affected users with custom SSL configuration. The reason that this was particularly baffling to us is that our deploy procedure involves a manual check on a sample of custom SSL users, and we confirmed they were functional when we did that deploy two days ago. Our working theory is that nginx will reload happily with a broken logging config, but won't start happily:

On deploy:

  1. Start nginx
  2. Add custom SSL webapp configs
  3. Reload nginx

--> not a problem, despite broken SSL webapp logging

On reboot:

  1. Custom SSL configs with broken logging are already present on disk
  2. Nginx refuses to start

We'll be confirming this theory in development environments over the next few days.

In the meantime, we've fixed the offending configuration file template, and confirmed that both regular users' and custom-SSL users' sites are back up. We're also adding some safeguards to prevent any other users' web apps from using up too much memory.

Once again, we apologise to all those affected.


Maintenance release: trusty + process listings

Hi All,

A maintenance release today, so nothing too exciting. Still, a couple of things you may care about:

  • We've updated to Ubuntu Trusty. Although we weren't vulnerable to Shellshock, it's nice to have the updated Bash, and to be on an LTS release.

  • We've added an oft-requested feature: the ability to view all your running console processes. You'll find it at the bottom of the consoles page. The UI probably needs a bit of work -- you need to hit refresh to update the list -- but it's a solution for when you think you have some detached processes chewing up your CPU quota! Let us know what you think.

Other than that, we've updated our client-side for our Postgres beta to 9.4, and added some of the PostGIS utilities. (Email us if you want to check out the beta). We also fixed an issue where a redirect loop would break the "reload" button on web apps, and we've added weasyprint and python-svn to the batteries included.


