XFS project IDs - why we switched from Debian to Ubuntu for our file storage


Back in March, we discovered a problem on PythonAnywhere. Some of the people who were signing up reported that the site was telling them that they’d used all of their 500MB disk quota, even though they had almost no files. When we logged in to our file servers and checked manually using system tools like df, we saw the same thing. Our system wasn’t misreporting what the operating system said; the operating system itself was at fault. But curiously, when we used du to see how much space their files were taking up, it gave the correct (much smaller) numbers. This blog post explains what we discovered, and how we fixed it.

The Background

On PythonAnywhere, your files are stored on XFS-based filesystems, on Amazon EBS volumes. XFS allows us to have a large number of “projects” on a system, each of which can have an independently-managed quota. We use one project for each user; a project can contain a number of different directories – for example, your project has your home directory and your /tmp – and the quota applies across all of a project’s directories.

So, let’s say a user called fred signs up. We generate a line in a file called projid specifying the name of the project (for which we use the username) and a numerical ID for it, for which we just use the ID of the user in our own database. It looks like this:

fred:1234

Next, we put one line for each of fred’s top-level quota’ed directories in a file called projects, which says which directories belong in that project. They look something like this:

1234:/an-internal-path/fred/home/fred
1234:/an-internal-path/fred/tmp
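To make the two file formats concrete, here is a minimal Python sketch that produces lines like the ones above. The function names are hypothetical and this is an illustration rather than the actual PythonAnywhere provisioning code; the paths are the placeholder paths from the example.

```python
def projid_line(username: str, project_id: int) -> str:
    """Build one projid entry: project name, a colon, then the numeric ID."""
    return f"{username}:{project_id}"


def projects_lines(project_id: int, directories: list[str]) -> list[str]:
    """Build one projects entry per quota'ed directory: numeric ID, a colon, then the path."""
    return [f"{project_id}:{directory}" for directory in directories]


if __name__ == "__main__":
    # fred's database ID doubles as his XFS project ID.
    print(projid_line("fred", 1234))
    for line in projects_lines(
        1234,
        ["/an-internal-path/fred/home/fred", "/an-internal-path/fred/tmp"],
    ):
        print(line)
```

Note that the project name only appears in projid; the projects file refers to the project purely by its numeric ID, which is why the size of that number turns out to matter.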

We then run a couple of xfs_quota commands to tell XFS to read this stuff from the files and apply it to the disk. One of the commands specifies the size of the fred project’s quota. Once that’s done, the quota is set up. And we can always re-run the command later to adjust the quota size (for example, if he upgrades and gets more disk space).
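The exact commands aren't shown here, but as a rough sketch, the standard xfs_quota expert-mode subcommands for this job are `project -s` (set up the project from the files) and `limit -p bhard=...` (set the hard block limit). The Python below only builds the argument lists; the helper name, the mount point and the `500m` limit are assumptions for illustration.

```python
def quota_commands(project_name: str, hard_limit: str, mount_point: str) -> list[list[str]]:
    """Return argument lists for two xfs_quota invocations (nothing is executed here).

    -x enables expert mode, which the administrative subcommands require;
    -c supplies the subcommand to run against the given mount point.
    """
    return [
        # Apply the project ID to the directories listed for this project.
        ["xfs_quota", "-x", "-c", f"project -s {project_name}", mount_point],
        # Set (or later adjust) the hard block limit for the project.
        ["xfs_quota", "-x", "-c", f"limit -p bhard={hard_limit} {project_name}", mount_point],
    ]


if __name__ == "__main__":
    # "/mnt/storage" is a hypothetical mount point, not a real PythonAnywhere path.
    for command in quota_commands("fred", "500m", "/mnt/storage"):
        print(" ".join(command))
```

Each list could be passed to `subprocess.run(command, check=True)` as root on a filesystem mounted with project quotas enabled; re-running only the second command with a new limit corresponds to adjusting the quota after an upgrade.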

Hey, hey 16 bits

Now, df and XFS’s own tools were telling us that certain users had filled up their disk quota. But tools like du were telling us that they’d only used a few KB of disk space. What was going on? It took a little while for us to notice the pattern, but when we did it was obvious. The first person to have the problem had project ID 65,542. When we looked at the project IDs of the others who were seeing the problem, they all had project IDs above 65,536, apart from a small scattering with IDs of less than 100.

The problem was that the version of Debian that we were using had an older XFS implementation, which only supported 16-bit project IDs. Once a project was created with an ID greater than 65,535, it was in effect merged with the project whose ID was the same modulo 65,536. The merging triggered all kinds of weird behaviour: the higher project would always wind up looking like it was full, and the lower one (the one modulo 65,536) would sometimes look like it was full.
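The arithmetic is easy to check. The sketch below assumes the truncation is a plain low-16-bits mask, which matches the "modulo 65,536" behaviour described above; the function names are hypothetical and this is not XFS code.

```python
from collections import defaultdict

ID_BITS = 16  # width of the project ID field in the older implementation


def effective_project_id(project_id: int, bits: int = ID_BITS) -> int:
    """Keep only the low `bits` bits, i.e. project_id modulo 2**bits."""
    return project_id & ((1 << bits) - 1)


def find_collisions(project_ids: list[int], bits: int = ID_BITS) -> dict[int, list[int]]:
    """Group IDs that become indistinguishable once truncated to `bits` bits."""
    groups: dict[int, list[int]] = defaultdict(list)
    for project_id in project_ids:
        groups[effective_project_id(project_id, bits)].append(project_id)
    # Only groups with more than one member are actual collisions.
    return {low: sorted(ids) for low, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    print(effective_project_id(65542))        # 6
    print(find_collisions([6, 1234, 65542]))  # {6: [6, 65542]}
```

So the first affected user, with ID 65,542, ended up sharing a project with whoever had ID 6, which is why the only other people seeing the problem had such low IDs.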

So, our first thought was that we could just upgrade Debian. But at that time, the latest version was still squeeze (6.0). The changes we wanted were in wheezy, but that was still a few months away. We were able to find a package with 32-bit XFS management tools for squeeze, but not the kernel modules.

So, Ubuntu

We’d actually been planning to move our entire infrastructure over to Ubuntu in the medium term anyway (and we’ve since done that). And we discovered that the then-current Ubuntu had the right kernel modules for the version of XFS with 32-bit IDs. So we figured it would make sense to upgrade just the file servers (which have much less stuff running on them than our other server types) to Ubuntu, firstly to fix this bug and secondly to see how hard the full migration was likely to be.

So that’s what we did; it turned out to be pretty easy, and after a slightly scary deploy at 3am (where we had to change the existing storage volumes from 16-bit to 32-bit), everything was working perfectly again.

The moral of the story? Whenever you encounter an OS-level thing behaving strangely, check any associated integers and see if they’ve just passed a particular power of 2.
