We recently fixed a problem in our website hosting code that was causing weird errors under very specific and rare circumstances. The problem had been there for several years, and -- while we knew that odd stuff happened every now and then -- we'd never been able to reproduce it reliably enough to debug it. But a lucky coincidence of circumstances, when two people tripped over the bug in quick succession, clarified the issues and let us work out the solution. It's an interesting tale, so we thought we'd share it.
Every now and then -- say, once every six months -- someone would report that they had a Flask website which would inexplicably start returning CSRF errors for every request. The apps in question generally had CSRF protection disabled. The error messages were very generic -- there was no indication of where in the stack they came from. And the problem would go away if the web app was reloaded.
There was no obvious pattern that we could see, apart from the fact that it was always Flask apps. No bugs had been reported against Flask's CSRF protection plugins.
We couldn't for the life of us work out what was causing the problem. Our normal request-processing stack (two layers of nginx for loadbalancing and basic HTTP(S), then uWSGI) doesn't do any CSRF-related stuff, and it sends all headers and requests through unchanged. But no-one else was seeing this problem on other hosts, so it seemed likely that there was something odd going on with our service specifically.
Two things happened: we recently upgraded the version of Django that we use to host the PythonAnywhere website itself to 1.9 (from an embarassingly old version that I won't mention). And then, two people reported the weird CSRF problems in quick succession. One was using Flask, but the other was using Django. So maybe the problem wasn't Flask-specific.
Even more interestingly, they were both getting exactly the same error message -- and it was slightly different to the one that had been coming back before. Furthermore it was one that, on Googling, we could only find in newer versions of the Django codebase. This sounded suspiciously like it was coming from somewhere in our infrastructure -- it had changed when we upgraded, and it was a Django message even for the Flask app. But why would our code -- the Django code that makes up the PythonAnywhere website -- be called when a request was being made to one of our customers' sites?
PythonAnywhere's web hosting platform is made up of a large number of web servers. Each web server is capable of serving any of the tens of thousands of websites we host. But, of course, the code for only a subset of those sites is running on any given server at any given time. We load-balance across the web servers, sending the traffic for each hosted site to a specific server. If we add another web server when things are busy, it's a simple job to tell the loadbalancer to spread the traffic out differently, so that the new server takes over handling some of the sites.
If one of our webservers gets a request for a website that isn't already running on it, the request gets redirected internally to a view on a PythonAnywhere web app, which checks to make sure that the site is one that we're actually meant to be hosting (rather than some random domain that someone happens to have pointed at one of our IP addresses). If it is, it starts the site up and redirects the incoming request back round again internally to the original URL. Once that's happened, the requester gets the response from the freshly-started website.
This means that when we add a new server, which will be running no websites initially, it will start them up one by one as it receives the first request for each from the loadbalancer.
Once we'd realised that the error that was coming back for our customers' sites was from Django, and had changed when we upgraded Django, it was clear that this "wake up" view in our own code was the only possible culprit. It's the only bit of our code that would ever be run in response to a request for a customer site.
We took a look at the view -- and, it turned out, it wasn't marked as CSRF exempt. Why does that matter?
We use CSRF heavily across our site (to mitigate the risk of a malicious person putting up a website that posted a "delete my web app" request to PythonAnywhere when someone visited it), and CSRF protection is the default -- each view must explicitly opt-out.
Now, "wake up" requests to this view would come in without any CSRF tokens or other related data. CSRF is site-specific, and the wake-up requests were not really requests for our site at all -- they were requests meant for another site that had been temporarily passed over to us so that we could start up the code to handle them.
So it was definitely a bug that the view wasn't marked CSRF exempt.
Importantly, though, CSRF protection is only relevant to POST requests. GET requests don't need it, as they are (so long as you're doing things properly) not state-changing operations.
So we'd worked out part of the problem -- if a website wasn't running on a given server, and a request for that site was sent to the server, then only a GET request would start the site correctly. A POST request would get a spurious CSRF error, and the site would not be started up.
But our customers who had reported the problem were saying that when their websites got into the CSRF-error state, they'd stay in that state regardless of how much traffic they got. And also, what was the Flask connection? While we now had one person seeing the problem with Django, every other report had been against Flask.
Finally, one of our customers who was having the problem told us something that made it all clear. He explained that his site was collecting data from other computers elsewhere on the Internet, which were only ever POSTing to it. He had discovered that when it got into the state when it was always responding with CSRF errors, he didn't need to reload it to fix it. He just needed to log in to the admin page.
Suddenly it all became clear. The lack of the
@csrf_exempt decorator on our "wake up the website" view meant that it would always respond to a POST request with a CSRF error. But if it received a GET request (for example, a hit from a browser on the admin login page, or a manual reload from the "Web" tab on PythonAnywhere), it would work properly and start the website. Once that had happened, then all further POST requests would work.
This taught us something interesting about the kind of sites people build. There are two main kinds of sites on PythonAnywhere (and we think on the Internet at large) -- human-readable and computer-readable.
Human-readable sites get a lot of GET requests as people hit them and browse around looking at pictures of cats. They get the occasional POST as someone comments on a cat picture, or sends a message to the cat or its owner, but most hits are GETs.
Computer-readable sites are basically web APIs. They also get quite a few GETs. But sometimes, they're designed to get mostly POSTs. The data-collection site that our customer was building, for example. Or sites that exist entirely to receive webhook messages.
Most of the sites on PythonAnywhere are human-readable, but there are a fair number of computer-readable ones. Of the computer-readable ones, most get a decent number of GET requests. A tiny proportion are APIs that get predominantly POSTs. So while every site on PythonAnywhere would have the problem where it couldn't be woken up by a POST request, only a very small subset of them would be seriously affected -- the ones that got almost nothing but POSTs.
The problem was only affecting that tiny proportion of sites. But the fact that their sites were only handling POSTs wasn't something that was obviously relevant information to the customers who were reporting the problem, so they didn't mention it. And because we didn't know what the cause was, we never knew that it was something we should ask about.
And the Flask connection? Well, if you're building a website that handles lots of POST requests, and you don't need to worry about user management, administration, and user interface design, which framework are you most likely to pick? Flask is almost certainly the most popular framework for people building simple web APIs. So it wasn't quite a red herring that the problem seemed originally to be Flask-specific -- there was a very strong correlation there -- but that correlation really did mislead us into thinking it was some weird interaction with Flask specifically.
Once we knew what the problem was, it was trivial to fix -- just adding a
@csrf_exempt on the correct view -- and we had it patched within an hour, even including testing.
But it was a long and confusing year puzzling over the issue! Hopefully the explanation gives some flavour of the sleuthing process.