The main change for this release is that we now report hits and errors to your web apps on the web app page. If you're a paying user, you get pretty charts over a range of time periods. If you're not, you'll get a text report.
We've greatly improved the errors that are reported when you reload a web app.
As much as is possible, we have tried to bring the packages that we install for Python 3 to parity with Python 2. That means that the number of packages that come preinstalled for Python 3 has increased dramatically.
All of your databases have been upgraded to MySQL 5.5.
We've also applied a number of small bug fixes, user interface improvements and stability fixes.
Kamil is the creator of TickCounter, a simple online time counter. It allows you to create countdown and countup timers as well as measure time with stopwatches and egg timers.
TickCounter is hosted by PythonAnywhere, a Python-focused PaaS and browser-based programming environment.
|Pageviews|4.5MM / month|
|Sessions|2.9MM / month|
|Unique visitors|1.3MM / month|
Here's our exclusive interview with Kamil:
I have a Bachelor's degree in Computer Science and I am currently pursuing my Master's. I've always been interested in computers, but I started coding for real when I was in high school. For the last two years I've been mostly using Python.
The current version of TickCounter is one of the first projects I've built with Django. I've learned a lot since then, so obviously now I can see a huge room for improvement and I plan to rewrite it in the near future.
Ah, and the most important thing. During this summer I am a happy intern at PythonAnywhere :)
I wanted to make some money from a website... Due to my laziness, I wanted to build something that is fully automated and can work entirely on its own. Of course, as I later found out, there is no such thing -- there is always some bug to fix or a new feature to implement. However, at this time I decided that a web-based timer was what I was looking for. The user picks a date, the application does the rest. And I can just sit back and watch.
Shortly after I started reading the Django tutorial, I was entirely sure that's the way to go. Automation, simplicity, batteries included. And remarkable documentation. Everything I could ever want from a framework and more. Before Python and Django, I was using PHP and CodeIgniter. I've never looked back.
One warning though. It's tempting and easy to outsource many things to external packages. However, a lot of these awesome Django reusable apps are often being abandoned after a short period of time, leaving you with a problem when you want to move on. Be careful!
Before PythonAnywhere I was using a VPS, but I found it really time-consuming. I had to learn how to manage it, how to make it secure and efficient, I had to monitor its performance, update packages, configurations and so on. Don't get me wrong, it's really fun stuff, but I wanted to focus mostly on creating Django apps instead of writing Ansible playbooks.
After some research, in which I compared all major Python hosting providers, I decided that PythonAnywhere offers the best compromise between price, resources and my peace of mind. And after a year of using it I still think so :)
For a long time TickCounter wasn't popular at all -- a small user base and slow, steady growth. However, after I made timers embeddable on other websites as widgets, everything changed. Only a week after implementing this feature, the traffic charts went crazy. As it turned out, two really big websites decided to use TickCounter widgets to show their users how much time was left before some event.
From there, it went really quick. More and more people were embedding TickCounter's widgets on their blogs, bringing new users and making them do the same. Nowadays, there are thousands of these and at least once a month there is some kind of "Slashdot effect".
TickCounter often reaches Google Analytics' limit of 5MM impressions per month. Currently, the most popular sources of traffic are Google and widgets. A lot of people also come from Facebook and Reddit.
Numbers for the last 30 days:
Note that most of this traffic is widget impressions!
Well, since TickCounter is a really simple app, I don't think it was a huge challenge. What's more, I was naively counting on a huge success from the very beginning, so I wrote TickCounter to be as lightweight as it could be. The major optimization I did was getting rid of most database queries. Now, the vast majority of hits don't even touch the database. Thanks to this, I'm able to handle this traffic with a $12/month PythonAnywhere account :)
My parents and a few of my friends know. So does my boss - that's probably one of the reasons why I got this internship :)
Hm, I consider myself to be the one who still needs to learn a lot and get some advice. But I can think of two things. First, if possible, consider creating a widget of your product. For TickCounter it was a milestone. Maybe it can bring you some users too?
Another thing. If you are working on the project on your own, you don't have to do everything from scratch. Try to outsource some work to external services. That's exactly what I did when I switched from a VPS to PythonAnywhere. I could focus on developing my project, instead of configuring the server.
Sure, I have a really long list of new features and improvements for TickCounter and I am going to create a brand new version of it soon.
Other projects, hm. Sure! Just need to find some free time..
A big release today...
The official host name you should use for connecting to your account's MySQL database instance has changed from
*yourusername*.mysql.pythonanywhere-services.com. This bypasses a part of our infrastructure that has started showing problems in recent weeks, and it should be much more efficient and reliable than the old way. The old hostname will continue to work, but we strongly advise you to move over to the new one.
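As a minimal sketch of what this means for your connection settings, the helper below builds the new-style hostname from a username. The function names and the settings layout are illustrative assumptions (the dictionary follows the shape of Django's `DATABASES` setting); only the hostname pattern comes from the text above.

```python
def mysql_hostname(username: str) -> str:
    """Return the new-style MySQL hostname for the given account name."""
    return f"{username}.mysql.pythonanywhere-services.com"


def database_settings(username: str, database: str, password: str) -> dict:
    """Build a Django-style DATABASES dict (layout assumed for illustration)."""
    return {
        "default": {
            "ENGINE": "django.db.backends.mysql",
            "NAME": database,
            "USER": username,
            "PASSWORD": password,
            # The only part taken from the announcement: the new hostname.
            "HOST": mysql_hostname(username),
        }
    }
```

If you keep the hostname in one place like this, moving over from the old name is a one-line change.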
When you set up a web app on your own domain, you have to configure some DNS settings to tell the world that your domain is hosted on PythonAnywhere. We used to tell you to set up a CNAME pointing the domain to
*yourusername*.pythonanywhere.com. This had two problems:
*theirusername*.pythonanywhere.com. It seemed strange to them that they pointed their domain to that name, even though it was running a different web app.
In the new version of PythonAnywhere, we provide you with a different, anonymous CNAME for each of your custom web apps. Hopefully this is less confusing!
Again, the old system still works, and unlike the MySQL change, it's just as efficient as the new way. We will deprecate the old way at some point in the future, but not soon -- and we'll give plenty of warning before we do. The system will warn you if you're using the old CNAME pattern, though.
Just a couple of small changes here -- we've made it a bit prettier and added a "Save as" button.
The biggest change in this update (at least from our perspective) is to how we manage the persistence of your files and data when we upgrade PythonAnywhere. Everything is (of course) stored on persistent storage volumes. Previously, we would move these volumes over from the old PythonAnywhere server cluster to the new one as part of our deployment procedure. This was a large part (in terms of time) of the downtime we had when we updated the system; we'd be sitting there watching as a script stopped old servers, disconnected drives, reattached them, and so on, all while the site was down for maintenance.
In our new infrastructure, the file servers themselves, not just their storage volumes, are persistent. We've moved as much code as possible out of them (internally we say that we've made them as stupid as possible) so that they will very rarely need upgrading. This means that when we release new versions of PythonAnywhere, we can spin up the new cluster already connected to the persistent file storage -- so, upgrading should simply be a case of updating our internal databases and moving our external IP addresses over to the new machines.
We expect that this will mean that in most cases going forward, upgrading PythonAnywhere will take us around five minutes instead of 30. The first time around (that is, for the next update) we'll budget 30 minutes anyway, because it will be the first time we've done things with the new setup. But hopefully we'll come in considerably under budget on that :-)
The new file persistence stuff came at a small cost, however.
It's been over a year since we announced that we were reluctantly going to have to deprecate Dropbox sharing with PythonAnywhere, and we've now finally switched it off completely. It was a feature that worked really well in the earliest days of our platform, but unfortunately the implementation just couldn't scale, and we haven't been able to work out a decent alternative.
So, all of the Dropbox shares that were previously visible in the Dropbox directory inside your home directory are now gone. A month ago, we notified everyone who had stuff shared with us via email, so hopefully this won't have come as a surprise to anyone.
Well, that's it for this time. Let us know what you think!
Well, from our point of view, one of the most important things in this release was probably the Akismet integration which we hope will prevent forum spam, and thus save us a tedious admin process of deleting spammy posts. So you'll have to go somewhere else for your updates on which are the food supplements for muscle growth.
But we do also care about users of the site other than ourselves, so we're equally pleased to announce a nice little feature on the Consoles page, "Custom" launchers. You can use them to create your own console types, little shortcuts for running your favourite scripts. Congrats to our newest employee Conrad for coming up with this idea, and pushing tirelessly to get it out.
We've added the ability for you to nominate a "teacher", as part of our education beta, so you can now use PythonAnywhere's education features for ad-hoc programming workshops or training sessions, without needing to tell us in advance who your students will be. Just get attendees to sign up, fill in the teacher field, and you're good to go.
Other than that, we've made some minor tweaks to the process dashboard -- you can now see how much CPU each process has used, so you can identify those CPU-quota-nomming rogue tasks, and we've added a little warning on the web tab to detect mismatches between your virtualenv Python version and the web application's.
Hope you find this all useful! As ever, we're always interested in feedback and suggestions. Mesozoic mammals, yo.
Today, I want to change gears a little bit and talk about our TDD development process, which has always been very dear to our hearts. Just as a disclaimer, this is going to be a very general overview and if you are an experienced TDD developer, you may not get a lot out of it. However, continuing along the mentorship/educational theme, I hope this will be useful for the people out there who are just starting out, and are trying to figure out what is a good style/process for developing.
So here is what I did for a small webapp that I created (after I used Ansible to automatically configure my new PythonAnywhere account, of course).
Now the plan is to just work through the story, iteratively adding new unittests and writing code to get it to pass, saving your progress into your version control system of choice, and work your way down the functional test until that whole story passes!
While you code, there will probably be a bunch of other use cases/edge cases/different branching routes that pop into your mind that should be tested as additional stories or unittests. I tend to put in a bunch of placeholders with just the name (description of the scenario) and flesh them out one by one afterwards.
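The placeholder habit described above can be sketched with the standard `unittest` module. The class name and the three scenario names are hypothetical, invented purely for illustration; each test is a named stub that reports itself as skipped until it is fleshed out.

```python
import unittest


class SignupStoryPlaceholders(unittest.TestCase):
    """Each test name describes a scenario; the bodies get written later."""

    def test_signup_rejects_duplicate_username(self):
        self.skipTest("placeholder: scenario not written yet")

    def test_signup_rejects_empty_password(self):
        self.skipTest("placeholder: scenario not written yet")

    def test_signup_sends_confirmation_email(self):
        self.skipTest("placeholder: scenario not written yet")


def run_placeholders() -> unittest.TestResult:
    """Run the stubs and return the result so the skips can be counted."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SignupStoryPlaceholders)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Skipping rather than passing keeps the unfinished scenarios visible in every test run, so they are harder to forget.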
Here's a long overdue post about my work environment.
Let's start from the very beginning. In the same way that you should treat servers like cattle, I try to make it so that my personal work environment is also easily replaceable/recreateable on the fly. So let's say that your laptop gets hit by a bus. Oh no! Whatever will you do?
Well- if everything you did was on the cloud (eg: on PythonAnywhere), it wouldn't matter, because you wouldn't have lost anything. However, let's say you did do some development locally, or you are a freelancer and you are setting up a new PythonAnywhere account to develop in for a new customer project. How do you automate this setup? (Continuing on the theme of me being lazy)
The process that I'm using right now to set up a new work environment is this:
git clone my public repo from Bitbucket
No manual setup needed to get all up and running.
One cool thing that I have been doing is to move all the things that this script does into Ansible playbooks. So now when I make changes to my work environment configuration, and I want to make sure the various machines I work on have the correct configs, it's much easier! Instead of having to log in to various machines and pull from the remote repo, I can just use Ansible to send out all my changes! Yay!
Yes, you could set up a git hook, but that's so lame (we can talk about that in the next post. pfft...). Just imagine the irony of using Ansible to deploy your configs to a single machine (localhost)- how can you not do it? Also by the way, for all you PythonAnywhere users out there with ssh access, you can actually deploy on your local machine AND all of your PythonAnywhere accounts at the same time. WOOOO- MUCH SCALE!
And for those of you who read my older blog post- yes, I have come a long way from using a bash script that bootstraps itself to set up my configs...
Now, I did promise to talk about tmux and all sorts of stuff, but it seems I am approaching the acceptable limit of how long a blog post should be again. So next time, I hope to continue to show you what my TDD work flow is like, as well as THE EPIC USES I have for PythonAnywhere consoles, which are rumored to include killing multiple cows simultaneously as well as conducting super secret covert missions to send outgoing mail when the conference wifi is blocking outgoing SMTP traffic.
Welcome to PythonAnywhere's latest newsletter. Latest not only because it's the most recent, but also because it's the most overdue!
Hi there fellow nerds! Another edition of our incredibly-rare-and-infrequent-newsletter (currently coming out about once a year).
(No prizes for guessing when the first draft of this newsletter was composed.)
It's time for your Christmas present from PythonAnywhere. Can you guess what it is? That's right, we now support a new open source relational* database system, which is superior to the other one for a series of arcane technical reasons. Yay! We knew you'd like it.
And we've deployed the latest, shiniest version, 9.4, which includes that crazy JSONB stuff. You can check out our announcement about it here.
We had to wait until now to tell you because, um, we, er, well, anyway -- Postgres! Come on, Stonebraker got a Nobel prize for it, or something! Go check it out!
[*] And non-relational. Postgres like to claim they do pretty well at the whole NoSQL thing, what with being able to work with JSON data both faster and more reliably than MongoDB.
Towards the end of last year we introduced the idea of a "custom" plan, where you can choose exactly the amount of CPU-seconds, web applications & workers, and gigs of storage you want on your account, and so only pay for what you need. We've now made it so that all plans are customisable from the get-go. So if you ever felt like adding Postgres to your plan (yep, you have to pay for it 'mafraid), or just giving yourself a couple of gigs of extra space, or even (whisper it) slightly reducing the features on your account to save money, then you can now do that with total impunity.
We've had lots of students and teachers on PythonAnywhere over the years, happy to save themselves from the installation hassles of getting Python onto a disparate bunch of hardware, and happy in a place where the batteries are included and the experience is consistent. And since the world seems to be agreeing that everyone should know the basics of coding, and that Python is the language to do it in, we thought we'd see what we can do to help. So, if you or your friends are teaching Python, you can read more about it here and get in touch with us here. We'd love to get you on board and see if we can help :-)
(No prizes for guessing when we had a second, abortive, stab at writing this newsletter. Honestly, it's as if we enjoy writing code more than we enjoy writing marketing emails.)
Forget about Christmas presents, how about some Easter eggs? Yeah, that's totally still topical. Is your world-conquering startup in stealth mode? Or just shy? Either way, you'll like the option to put HTTP auth on your site, and hide it away from prying eyes. Want to know what's running in your consoles? Check out the process listing on the consoles page, to help you track down what job is sucking up all your precious CPU quota. Speaking of which, we've boosted the features of base accounts so y'all get a few more seconds and gigs from all of us. Check the new pricing page for details. We've also been serving the live consoles on python.org, we've improved virtualenv support for web apps, and lots more.
We've published some fun stuff here on the blog recently, here are some of our favourites:
And that's it! Until next time (next scheduled newsletter: mid-summer 2015. Expected arrival time: may need 5-digits for the year on your calendar).
With today's deploy, we added some console sharing and file system features to make helping other people easier, whether it's in a group setting, or a more in-depth one-on-one session.
It's been great to see more and more initiatives to teach Python over the last few years, but it's never as easy as it should be. Personally, we have definitely experienced the struggle of our friends who are new to programming, from general command-line stuff and installing modules, to understanding different error messages and tracebacks etc. We want to see if there's more we can do to help make the mentoring process as easy and painless as possible.
tl;dr: go mentor someone today and tell us how to make the sharing features even better for you!
We also continued on our epic quest towards zero downtime by stupidifying the file server. Previously, our file server shared the same code base as our web/console servers, and used the Django code to do tasks such as updating user storage quotas (eg: after an account upgrade). It was a good idea at the time because it meant that we could handle a quota update request, grab what a user's storage quota changed to, and then apply the new quota- all very easily from within Django.
However, this violates the concept of keeping things modular and meant that we had an extra dependency to manage properly. Whenever we updated the source code and wanted to push it out to production, this meant that we needed to do it to the file server as well. This then meant that we needed to make sure nobody was writing to the file server during that time and that all the changes were flushed to disk. This is one reason why we needed downtime when deploying (to ensure data consistency etc).
Anyway, now all the file server has on it is a minimal Flask microservice independent of our main Django code, so the hope is that we can cut out this particular source of downtime. Yay!
Last Tuesday, we changed the filesystem we use to store our users' files over from XFS to ext4fs. This required a much longer maintenance outage than normal -- 2 hours instead of our normal 20-30 minutes.
This post explains why we made the change, and how we did it.
tl;dr for PythonAnywhere users:
We discovered that the quota system we were using with XFS didn't survive hard fileserver reboots in our configuration. After much experimentation, we determined that ext4 handles our particular use case better. So we moved over to using ext4, which was hard, but worthwhile for many reasons.
tl;dr for sysadmins:
In order to understand what we changed and why, you'll need a bit of background about how we store our users' files. This is relatively complex, in part because we need to give our users a consistent view of their data regardless of which server their code is running on -- for example so they see the same files from their consoles as they do from their web apps, and so all of the worker processes that make up their web apps can see all of their files -- and in part because we need to keep everything properly backed up to allow for hardware failures and human error.
The PythonAnywhere cluster is made up of a number of different server types. The most important for this post are execution servers, file servers, and backup servers.
Execution servers are the servers where users' code runs. There are three kinds: web servers, console servers, and (scheduled) task servers. From the perspective of file storage, they're all the same -- they run our users' code in containers, with each user's files mounted into the containers. They access the users' files from file servers.
File servers are just what you'd expect. All of a given user's files are on the same file server. They're high-capacity servers with large RAID0 SSD arrays (connected using Amazon's EBS). They run NFS to provide the files to the execution servers, and also run a couple of simple services that allow us to manage quotas and the like.
Backup servers are simpler versions of file servers. Each file server has its own backup server, and they have identical amounts of storage. Data that is written to a file server is asynchronously synchronised over to its associated backup server using a service called drbd.
Here's a diagram of what we were doing prior to the recent update:
This architecture has a number of benefits:
As you can see from the diagram, the filesystem we used to use to store user data was XFS. XFS is a tried-and-tested journaling filesystem, created by Silicon Graphics in 1993, and is perfect for high-capacity storage. We actually started using it because of a historical accident. In an early prototype of PythonAnywhere, all users actually mapped to the same Unix user. When we introduced disk quotas (yes, it was early enough that we didn't even have disk quotas) this was a problem. At that time, we couldn't see any easy way to change the situation with Unix users (that changed later) so we needed some kind of quota system that allowed us to enforce quotas on a per-directory basis, so that (eg.) /home/someuser had a quota of 512MB and /home/otheruser had a quota of 1GB. But most filesystems that provide quotas only support it on a per-user basis.
XFS, however, has a concept of "project quotas". A project is a set of directories, and each project can have its own independent quota. This was perfect for us, so of the tried-and-tested filesystems, XFS was a great choice.
Later on, of course, we worked out how to map each user to a separate Unix user -- so the project quota concept was less useful. But XFS is solid, reliable, and just as fast as, if not faster than, other filesystems, so there was no reason to change.
A few weeks back, we had an unexpected outage on a core database instance that supports PythonAnywhere. This caused a number of servers to crash (coincidentally due to the code we use to map PythonAnywhere users to Unix users), and we instituted a rolling reboot. This has happened a couple of times before, and has only required execution server reboots. But this time we needed to reboot the file servers as well.
Our normal process for rebooting an execution server is to run
sync to synchronise
the filesystem (being old Unix hands we run it three times "just to be safe", despite
the fact that hasn't been necessary since sometime in the early '90s) and then to do
a rapid reboot by
echoing "b" to
File servers, however, require a more gentle reboot procedure, because they have
critical data stored on them, and are writing so much to disk that stuff can
change between the last
sync and the reboot, so a normal slow
reboot command is
This time, however, we made a mistake -- we used the execution-server-style hard reboot on the file servers.
There were no obvious ill effects; when everything came back, all filesystems were up and running as normal. No data was lost, and the site was back up and running. So we wiped the sweat from our respective brows, and carried on as normal.
We first noticed that something was going wrong an hour or so later. Some of our users started reporting that instead of seeing their own disk usage and quotas on the "Files" tab in the PythonAnywhere web interface, they were seeing things like "1.1TB used of 1.6TB quota". Basically, they were seeing the disk usage across the storage volumes they were linked to instead of the quota details specific to their accounts.
This had happened in the past; the process of setting up a new project quota on XFS can take some time, especially when a volume has a lot of them (our volumes had tens of thousands) and it was done by a service running on the volume's file server that listened to a beanstalk queue and processed updates one at a time. So sometimes when there was a backlog, people would not see the correct quota information for some time.
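The one-at-a-time processing described above can be sketched as follows. This is only an illustration: it uses Python's standard `queue.Queue` as a stand-in for the beanstalk queue, and the names `process_quota_updates` and `apply_quota` are hypothetical rather than taken from the real listener service.

```python
import queue
from typing import Callable


def process_quota_updates(
    pending: queue.Queue,
    apply_quota: Callable[[str, int], None],
) -> int:
    """Drain the queue in arrival order, applying one quota update at a time.

    Returns the number of updates processed. A real listener would block
    waiting for new work instead of returning when the queue is empty.
    """
    processed = 0
    while True:
        try:
            username, quota_mb = pending.get_nowait()
        except queue.Empty:
            return processed
        # Each update must finish before the next one starts, so a slow
        # update delays everything queued behind it.
        apply_quota(username, quota_mb)
        processed += 1
```

Because updates are strictly sequential, the last user in a long backlog waits for every update ahead of them, which matches the delayed quota display described above.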
But this time, when we investigated, we discovered tons of errors in the "quota queue listener" service's logs.
It appeared that while XFS had managed to store files correctly across the hard reboots, the project quotas had gone wrong. Essentially, all users now had unquota'd disk space. This was obviously a big problem. We immediately set up some alerts so that we could spot anyone going over quota.
We also disabled quota reporting on the PythonAnywhere "Files" interface, so that people wouldn't be confused. Or, indeed, to make sure that people didn't guess what was up and try to take advantage by using tons of storage, and cause problems for other users... we did not make any announcement about what was going on, as the risks were too high. (Indeed, this blog post is the announcement of what happened :-)
So, how to fix it?
In order to get quotas working again, we'd need to run an XFS quota check on the affected filesystems. We'd done this in the past, and we'd found it to be extremely slow. This is odd, because XFS gurus had advised us that it should be pretty quick -- a few minutes at most. But the last time we'd run one it had taken 20 minutes, and that had been with significantly smaller storage volumes. If it scaled linearly, we'd be looking at at least a couple of hours' downtime. And if it was non-linear, it could be even longer.
We needed to get some kind of idea of how long it would take with our current data size. So, we picked a recent backup of 1.6TB worth of RAID0 disks, created fresh volumes for them, attached them to a fresh server, mounted it all, and kicked off the quota check.
24 hours later, it still hadn't completed. Additionally, in the machine's
there were a bunch of errors and warnings about blocked processes. The kind of errors
and warnings that made us suspect that the process was never going to complete.
This was obviously not a good sign. The backup we were working from pre-dated the erroneous file server reboots. But the process by which we'd originally created it -- remember, we logged on to a backup server, used drbd to disconnect from its file server, did the backup snapshots, then reconnected drbd -- was actually quite similar to what would have happened during the server's hard reboot. Essentially, we had a filesystem where XFS might have been half-way through doing something when it was interrupted by the backup.
This shouldn't have mattered. XFS is a journaling filesystem, which means that it can be (although it generally shouldn't be) interrupted when it's half-way through something, and can pick up the pieces afterwards. This applies both to file storage and to quotas. But perhaps, we wondered, project quotas are different? Or maybe something else was going wrong?
We got in touch with the XFS mailing list, but unfortunately we were unable to explain the problem with the correct level of detail for people to be able to help us. The important thing we came away with was that what we were doing was not all that unusual, and it should all be working. The quotacheck should be completing in a few minutes.
At this point, we had multiple parallel streams of investigations ongoing. While one group worked on getting the quotacheck to pass, another was seeing whether another filesystem would work better for us. This team had come to the conclusion that ext4 -- a more widely-used filesystem than XFS -- might be worth a look. XFS is an immensely powerful tool, and (according to Wikipedia) is used by NASA for 300+ terabyte volumes. But, we thought, perhaps the problem is that we're just not expert enough to use it properly. After all, organisations of NASA's size have filesystem experts who can spend lots of time keeping that scale of system up and running. We're a small team, with smaller requirements, and need a simpler filesystem that "just works". On this theory, we thought that perhaps due to our lack of knowledge, we'd been misusing XFS in some subtle way, and that was the cause of our woes. ext4, being the standard filesystem for most current Linux distros, seemed to be more idiot-proof. And, perhaps importantly, now that we no longer needed XFS's project quotas (because PythonAnywhere users were now separate Unix users), it could also support enough quota management for our needs.
So we created a server with 1.6TB of ext4 storage, and kicked off an rsync to copy the data from another copy of the 1.6TB XFS backup the quotacheck team were using over to it, so that we could run some tests. We left that rsync running overnight.
When we came in the next morning, we saw something scary. The rsync had failed halfway through with IO errors. The backup we were working from was broken. Most of the files were OK, but some of them simply could not be read.
This was definitely something we didn't want to see. With further investigation, we discovered that our backups were generally usable, but in each one, some files were corrupted. Clearly our past backup tests (because, of course, we do test our backups regularly :-) had not been sufficient.
And clearly the combination of our XFS setup and drbd wasn't working the way we thought it did.
We immediately went back to the live system and changed the backup procedure.
We started rolling "eternal rsync" processes -- we attached extra (ext4) storage to each file server, matching the existing capacity, and ran looped scripts that used rsync (at the lowest-priority ionice level) to make sure that all user data was backed up there.
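A looped script of this kind might look roughly like the sketch below. It is not the script we actually ran: the function names, the pause between passes and the choice of rsync flags (`-a --delete`) are assumptions for illustration. `ionice -c 3` selects the "idle" I/O class, which is the lowest priority ionice offers.

```python
import subprocess
import time


def build_rsync_command(source: str, destination: str) -> list[str]:
    """Return an rsync command wrapped in ionice's idle I/O class."""
    return [
        "ionice", "-c", "3",        # class 3 = idle: yield to all other disk I/O
        "rsync", "-a", "--delete",  # archive mode; drop files deleted at the source
        source, destination,
    ]


def eternal_rsync(source, destination, pause_seconds=60, max_passes=None,
                  run=subprocess.run):
    """Run rsync passes in a loop; max_passes=None means loop forever."""
    passes = 0
    while max_passes is None or passes < max_passes:
        # check=False: a non-zero exit (for example when files vanish
        # mid-copy on a live filesystem) should not stop the loop.
        run(build_rsync_command(source, destination), check=False)
        passes += 1
        time.sleep(pause_seconds)
    return passes
```

The `run` parameter exists only so the loop can be exercised without touching a real disk; in normal use you would leave it at its default and pass the (hypothetical) mount points of the source and backup volumes.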
We made sure that we weren't adversely affecting filesystem performance by checking out an enormous git repo into one of our own PythonAnywhere home directories, and running git status (which reads a bunch of files) regularly, and timing it.
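A timing check along these lines can be sketched in a few lines of Python. The helper name `time_command` and the repository path in the usage note are hypothetical; the sketch simply measures the wall-clock time of one command.

```python
import subprocess
import time


def time_command(command, cwd=None) -> float:
    """Run a command, discard its output, and return elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(
        command,
        cwd=cwd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,  # we only care how long it took, not whether it succeeded
    )
    return time.perf_counter() - start
```

For example, `time_command(["git", "status"], cwd="/home/someuser/big-repo")` called at regular intervals gives a rough series of timings; a sustained increase would suggest that the backup load is hurting filesystem performance.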
Once the first eternal rsyncs had completed, we were 100% confident that we really did have everyone's data safe. We then changed the backup process to be:
This meant that we could be sure that the backups were recoverable, as they came from a filesystem that was not being written to while they happened. This time we tested them with an rsync from disk to disk, just to be sure that every file was OK.
We then copied the data from one of the new-style backups, that had come from an ext4 filesystem, over to a new XFS filesystem. We attached the XFS filesystem to a test server, set up the quotas, set some processes to reading from and writing to it, then did a hard reboot on the server. When it came back, it mounted the XFS filesystem, but quotas were disabled. Running a quotacheck on the filesystem crashed.
Further experiments showed that this was a general problem with pretty much any project-quota'ed XFS filesystem we could create; in our tests, a hard reboot caused a quotacheck when the filesystem was remounted, and this would frequently take a very long time, or even crash -- leaving the disk only mountable with no quotas.
We tried running a similar experiment using ext4; when the server came back after a hard reboot, it took a couple of minutes checking quotas and logged a few harmless-seeming warnings in syslog. But the volumes mounted OK, and quotas were active.
By this time we'd persuaded ourselves that moving to ext4 was the way forward for dimwits like us. So the question was, how to do it?
The first step was obviously to change our quota-management and system configuration code so that it used ext4's commands instead of XFS's. One benefit of doing this was that we were able to remove a bunch of database dependencies from the file server code. This meant that:
It's worth saying that the database dependency wasn't due to XFS; we were just able to eliminate it at this point because we were changing all of that code anyway.
Once we'd made the changes and run it through our continuous integration environment a few times to work out the kinks, we needed to deploy it. This was trickier.
What we needed to do was:
The "copy" phase was the problem. The initial run of our eternal rsync processes made it clear that copying 1.6TB (our standard volume size) from a 1.6TB XFS volume to an equivalent ext4 one took 26 hours. A 26 hour outage would be completely unacceptable.
However, the fact that we were already running eternal rsync processes opened up some other options. The first sync took 26 hours, but each additional one took 6 hours -- that is, it took 26 hours to copy all of the data, then after that it took 6 hours to find any changes on the XFS volume that had happened while the original copy was running, and to copy them across to the ext4 one. And then it took 6 hours to do that again.
We could use our eternal rsync target ext4 disks as the new disks for the new cluster, and just sync across the changes.
But that would still leave us with a 6+ hour outage -- 6 hours for the copy, and then extra time for moving disks around and so on. Better, but still not good enough.
Now, the eternal rsync processes were running at a very high (that is, low-priority) ionice level so as not to disrupt filesystem access on the live system. So we tested how long it would take to run the rsync with the opposite, resource-gobbling niceness settings. To our surprise, it didn't change things much; an rsync of 6 hours' worth of changes from an XFS volume to an ext4 one took about five and a half hours.
We obviously needed to think outside the box. We looked at what was happening while we ran one of these rsyncs, in iotop, and noticed that we were nowhere near maxing out our CPU or our disk IO... which made us think, what happens if we do things in parallel?
At this point, it might be worth sharing some (slightly simplified) code:
    #!/bin/bash
    # Parameter $1 is the number of rsyncs to run in parallel
    cd /mnt/old_xfs_volume/
    ls -d * | xargs -n 1 -P $1 ~/rsync-one.sh

    #!/bin/bash
    # rsync-one.sh: parameter $1 is the name of one top-level directory to copy
    mkdir -p /mnt/new_ext4_volume/"$1"
    rsync -raXAS --delete /mnt/old_xfs_volume/"$1" /mnt/new_ext4_volume/
For some reason our notes don't capture, on our first test we went a bit crazy and used
rsyncs, for a total of about 2,000 processes.
It was way better. The copy completed in about 90 minutes. So we experimented. After many, many tests, we found that the sweet spot was about 30 parallel rsyncs, which took on average about an hour and ten minutes.
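For anyone wanting to run the same kind of experiment, a minimal timing harness might look like the sketch below. The name `benchmark_parallelism` and the script path in the example are hypothetical, and it assumes the target volume is put back into the same starting state before each run, since otherwise later runs have less to copy and look unfairly fast.

```shell
#!/bin/bash
# Sketch of a harness for comparing parallelism levels. The function name
# and the example script path are hypothetical.

# benchmark_parallelism COMMAND N [N...]
# Runs COMMAND once per N, passing N as its only argument, and prints how
# many whole seconds each run took.
benchmark_parallelism() {
    local command="$1"; shift
    local n start
    for n in "$@"; do
        start=$SECONDS            # bash's built-in seconds counter
        "$command" "$n"
        echo "$n parallel: $((SECONDS - start))s"
    done
}

# Example (hypothetical name for the parallel rsync wrapper script):
#   benchmark_parallelism ~/parallel-rsync.sh 10 20 30 40 50
```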
We believed that the copy would take about 70 minutes. Given that this deployment was going to require significantly more manual running of scripts and so on than a normal one, we figured that we'd need 50 minutes for the other tasks, so we were up from our normal 20-30 minutes of downtime for a release to two hours. Which was high, but just about acceptable.
The slowest time of day across all of the sites we host is between 4am and 8am UTC, so we decided to go live at 5am, giving us 3 hours just in case things went wrong. On 17 March, we had an all-hands-on-deck go-live with the new code. And while there were a couple of scary moments, everything went pretty smoothly -- in particular, the big copy took 75 minutes, almost exactly what we'd expected.
So as of 17 March, we've been running on ext4.
Since we went live, we've run two main tests.
First, and most importantly, we've tested our backups much more thoroughly than before. We've gone back to the old backup technique -- on the backup server, shut down the drbd connection, snapshot the disks, and restart drbd -- but now we're using ext4 as the filesystem. And we've confirmed that our new backups can be re-mounted, they have working quotas, and we can rsync all of their data over to fresh disks without errors. So that's reassuring.
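The "rsync everything to fresh disks" part of that check boils down to looking at rsync's exit status, which is non-zero if any file could not be read or written (for example 23, a partial transfer due to errors). A minimal sketch, with a hypothetical function name and paths:

```shell
#!/bin/bash
# Sketch of a read-everything backup check; the function name and the
# example paths are illustrative assumptions.

# verify_backup SNAPSHOT_DIR SCRATCH_DIR
# Copies the whole snapshot to a scratch disk and fails if rsync reports
# any error along the way.
verify_backup() {
    local src="$1" dest="$2"
    if rsync -a "$src"/ "$dest"/; then
        echo "OK: every file under $src could be read"
    else
        echo "FAILED: rsync exited with status $?" >&2
        return 1
    fi
}

# Example (hypothetical mount points):
#   verify_backup /mnt/backup_snapshot /mnt/fresh_disk
```

This only proves that the files are readable; mounting the snapshot and checking that quotas are active are separate steps.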
Secondly, we've taken the old XFS volumes and tried to recover the quotas. It doesn't work. The data is all there, and can be rsynced to fresh volumes without IO errors (which means that at no time was anyone's data at risk). But the project quotas are irrecoverable.
We've also (before we went live with ext4, but after we'd committed to it) discovered that there was a bug in XFS -- fixed in Linux kernels since 3.17, but we're on Ubuntu Trusty, which uses 3.13. It is probably related to the problem we're seeing, but certainly doesn't explain it totally -- it explains why a quotacheck ran when we re-mounted the volumes, but doesn't explain why it never completed, or why we were never able to re-mount the volumes with quotas enabled.
Either way, we're on ext4 now. Naturally, we're 100% sure it won't have any problems whatsoever and everything will be just fine from now on ;-)
So the reason for our extra-long maintenance window this morning was primarily a migration from XFS to ext4 as our filesystem for user storage. We'll write more about the whys and wherefores of this later, but the short version is that the main reason for using XFS, project quotas, was no longer needed, and a bug in the version of XFS supported by Ubuntu LTS left us vulnerable to long periods of downtime after unplanned reboots, while XFS did some unnecessary quotachecks. The switch to ext4 removes that risk, and has simplified some of our code too, bonus!
In other news, we've managed to squeeze in a few more user-visible improvements :)
We've decided to tweak the pricing and accounts pages so that all plans are customisable. As a bonus side-effect, we've slightly improved all the existing paid plans, so our beloved customers are going to get some free stuff:
bottlenose, python-amazon-simple-product-api, py-bcrypt, Flask-Bcrypt, flask-restful, markdown (for Python 3), wheezy.template, pydub, and simpy (for Python 3) are now part of our standard batteries included.
We've re-written our server build scripts to use wheels, and to build them for each package we install. We've made them available (at /usr/share/pip-wheels), and we've added them to the PythonAnywhere default pip config. So, if you're installing things into a virtualenv, if it so happens we already have a wheel for the package you want, pip will find it and the install will complete much faster.
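You don't need to do anything to benefit from this, but for the curious, the effect is roughly what you'd get by hand from pip's `--find-links` option. The wrapper function below is purely illustrative (its name is made up, and the exact contents of our default pip config aren't shown here):

```shell
#!/bin/bash
# Sketch of the by-hand equivalent; the function name is hypothetical.

# Install packages, letting pip look in the pre-built wheel directory
# in addition to its usual package index.
pip_install_with_wheels() {
    pip install --find-links=/usr/share/pip-wheels "$@"
}

# Example, inside an activated virtualenv:
#   pip_install_with_wheels flask-restful
```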
The "Save and Run" button at the top of the editor, much beloved of teachers and beginners (and highly relevant for our education beta) now defaults to Python 3. It's 2015, this is the future after all. We didn't want to break things for existing users, so they will still have 2 as the default, but we can change that for you if you want. Just drop us a line to firstname.lastname@example.org
Other than that, we've added a few minor security and performance tweaks.
Onwards and upwards!