Outage report; or, the perils of using 502s to trigger web application initialisation

Last weekend, we had two outages, both in the early hours of the morning (UK time). We found the cause hard to diagnose, because our outage alert system had failed and we only discovered each outage when someone happened to check their email. On Monday we fixed the outage alert system.

Yesterday we had another. This time every phone in the office started beeping loudly and we logged in right away, so we know what made it happen. While we’re not 100% sure that the cause of this problem was the same as the one on the weekend, the symptoms were similar enough that we’re pretty sure that it was.

The symptom in all of these cases was that the main PythonAnywhere website was down, giving “502 Bad Gateway” errors. Over the weekend, all of the customer websites we checked were OK, but when it happened yesterday, we noticed that one specific customer website was extremely busy, and getting the same 502 error. Further investigation showed that it had been linked from the front page of Reddit and was getting about 100 hits/second. So that would explain why it was so busy – and the site’s author hadn’t expected so much traffic so soon, so it was not super-optimised, which would explain why it couldn’t handle the load.

But the question was, how was it affecting the main PythonAnywhere site? We have a lot of stuff in place to stop web applications from affecting each other, which is why the other user websites on the same server were up and running.

Explaining what happened requires a little background, both about how nginx works and about how we use it and uwsgi to serve large numbers of Python web applications.

nginx 50x errors

Our web servers run nginx as a front-end server. It delegates web requests to uwsgi instances that actually run the Python web applications. These requests are channeled between nginx and uwsgi via Unix domain sockets; each web application has a socket that is accessible both from outside our users’ sandboxes (where the nginx server runs) and inside (where the uwsgi apps that serve each user’s website live).

When a request comes in but the associated uwsgi server has a problem, there are a number of possible errors. Firstly, the uwsgi app might be up and running and accept the incoming request, but then crash or take an unreasonably long time to process it. This leads to a 504 “Gateway timeout” error. This is what we normally expect to happen when a user’s web app is under too much load to handle incoming requests. Another error happens when the user’s web app is just not up and running, so when nginx tries to put the request onto the socket for uwsgi, it can’t. This causes a 502 “Bad gateway” error.

What we hadn’t realised, or at least hadn’t fully appreciated, is that there’s a second kind of situation where you can get a 502 error. It’s possible for a uwsgi server to be up and running, but so overloaded that from nginx’s perspective it looks like it’s down. This causes a 502 error too.

In this last case, you get an nginx error like this:

2013/01/31 15:15:42 [error] 4919#0: *1947447 connect() to unix:/var/sockets/www.example.com/socket failed (11: Resource temporarily unavailable) while connecting to upstream, client: XXX.XXX.XXX.XXX, server: www.example.com, request: "GET /something HTTP/1.1", upstream: "uwsgi://unix:/var/sockets/www.example.com/socket:", host: "www.example.com", referrer: "https://www.example.com/"
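The “(11: Resource temporarily unavailable)” in that log line is errno 11, EAGAIN. As a rough illustration (this is not our production setup), the sketch below tries to provoke the same condition with plain Python sockets: a Unix domain socket server that listens but never accepts stands in for an overloaded uwsgi, and non-blocking clients stand in for nginx. The function name, the backlog of 1 and the temporary socket path are all invented for the example, and the behaviour is Linux-specific.

```python
import errno
import os
import socket
import tempfile


def connects_until_queue_full(backlog=1, max_attempts=64):
    """Connect repeatedly to a listener that never calls accept().

    Returns (successful_connects, errno_of_first_failure); the errno is
    None if every attempt succeeded.
    """
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "socket")  # hypothetical path, for the demo only
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    clients = []
    try:
        server.bind(path)
        server.listen(backlog)  # the "overloaded app": it never accepts
        for _ in range(max_attempts):
            client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            clients.append(client)
            client.setblocking(False)  # nginx also connects without blocking
            try:
                client.connect(path)
            except OSError as exc:
                # The newest client failed; the earlier ones all connected.
                return len(clients) - 1, exc.errno
        return len(clients), None
    finally:
        for client in clients:
            client.close()
        server.close()
        if os.path.exists(path):
            os.unlink(path)
        os.rmdir(tmpdir)


if __name__ == "__main__":
    ok, err = connects_until_queue_full()
    name = errno.errorcode.get(err, "unknown") if err is not None else "none"
    print(f"{ok} connects succeeded, then errno={err} ({name})")
```

On a Linux machine we would expect the first failure to carry EAGAIN after a handful of successful connects, which is the situation nginx reports in the log line above. Other platforms may report a different error, or none at all.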

Now, if you google for that error you’ll see lots of hints suggesting you increase the nginx proxy_connect_timeout setting, or the Linux net.core.netdev_max_backlog or net.core.somaxconn values. These can possibly help; essentially if the error is being caused by a very short-term spike in traffic, those changes will help smooth it out by letting more stuff queue up on the connection between the uwsgi server and nginx. But if the uwsgi server is just getting so much traffic that there’s no way it can handle it, all changing the settings will do is make the queue take longer to fill up; once it does, the same error comes back.
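To put rough numbers on that, here is a back-of-the-envelope sketch. The 100 hits/second figure is the one from the incident; the rate at which the app can handle requests and the queue sizes are pure assumptions, chosen only to show the shape of the problem.

```python
def seconds_until_queue_full(queue_size, arrivals_per_sec, handled_per_sec):
    """Time for an initially empty queue to fill, or None if it never does."""
    surplus = arrivals_per_sec - handled_per_sec
    if surplus <= 0:
        return None  # the app keeps up, so the queue never fills
    return queue_size / surplus


if __name__ == "__main__":
    # 100 hits/second is from the incident; 20 handled/second is a guess.
    for size in (128, 1024, 4096):
        print(size, seconds_until_queue_full(size, 100, 20))
```

Under those assumptions, even a queue 32 times larger fills in under a minute, after which the errors start again.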

How we manage user web applications

Right, so by now it’s probably pretty clear that the outage was caused by heavy traffic overloading a uwsgi server somewhere. But how did that take out our site?

Well, the problem is in how we start web applications. Each of our web servers can potentially be responsible for hundreds of web apps. We do not start uwsgi handlers for all of them when the server comes up, as this would bring the machine to its knees. So what we do is start them on demand; when a request comes in for a web application that isn’t running, we detect that it’s not running and use an “internal redirect” to push the request over to a part of the main PythonAnywhere web application that starts it up, then does another redirect back so that the requester sees the right page. Here’s the nginx config for that:

server {

    listen      80;
    server_name  ~(?<domain>.+)$;
    location / {
        # various bits of config here, elided for clarity

        # if you get a 502 error, go to "fallback".
        error_page 502 = @fallback;
    }
    location @fallback {
        # In the fallback triggered by a 502, restart the user's web app.
        # FOR THE LOVE OF GOD, DON'T DO THIS!
        proxy_pass https://www.pythonanywhere.com/run_initialize_web_application_code/$scheme/$domain/$uri?$query_string;
    }
}

Sharp-eyed readers have probably worked out the problem by now. We were relying on the 502 “Bad gateway” errors to identify unstarted web applications; we’d assumed that unresponsive ones would always give us a 504 “Gateway timeout”. This was a mistake. The customer website that was getting all of the traffic was causing 502s, and this meant that our setup was constantly telling the main PythonAnywhere app to restart the associated uwsgi server. This was happening on the same server that handles our own site, dozens of times a second, which of course meant that our site went down.

Lessons learned

The main lesson we’ve learned here is that we cannot use 502 errors to decide whether or not a user’s uwsgi server is up and running. We’ll blog about the fix for that later.

Another thing we may well look at is moving the “start a user’s web application” code out of the main PythonAnywhere app. This would mean that even if our fix doesn’t work in every case, it will not take down our site, and the many people who use PythonAnywhere for running and developing non-web Python code in the cloud will be blissfully unaware of the problem.

So, any thoughts from our readers here? Is there anything we’re missing in this analysis?

[UPDATE, 1 Feb 2013 @ 17:36]

We’ve now tested and pushed what we think is a solid fix.

Some background first. The uwsgi processes on a given server are all managed for us by a uwsgi master application. If a “vassal” file is present for a web application (in /etc/uwsgi/vassals) then the application will be started by the uwsgi master, and if the web application crashes then the uwsgi master can be trusted to see this and restart it.

So that means that we only need to start a user’s web app if it gives a 502 error and there is no vassal file for it. If it gives a 502 and there is a vassal file, then the web app can be assumed to be running but having problems.

Which in turn means that all we needed to do was add a check for the vassal file to our nginx configuration, and only do the internal redirect to start the app if the vassal file isn’t there. If the vassal file is there, nginx can just return a 502 error as normal. The config looks like this:

server {

    listen      80;
    server_name  ~(?<domain>.+)$;
    location / {
        # various bits of config here, elided for clarity

        # if you get a 502 error, go to "fallback".
        error_page 502 = @fallback;
    }
    location @fallback {
        # In the fallback triggered by a 502, restart the user's web app
        # if and only if there is no vassal file for it.
        if (-f /etc/uwsgi/vassals/$domain.ini) {
            return 502;
        }
        proxy_pass https://www.pythonanywhere.com/run_initialize_web_application_code/$scheme/$domain/$uri?$query_string;
    }
}
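For readers who find the rule easier to follow outside nginx syntax, here is the same decision as a small Python sketch. It is only an illustration of the logic, not code we run: the function name and the returned strings are invented, and the only details taken from our setup are the vassal directory and the `<domain>.ini` file naming used in the config above.

```python
from pathlib import Path

VASSAL_DIR = Path("/etc/uwsgi/vassals")  # directory used in the config above


def action_for_502(domain, vassal_dir=VASSAL_DIR):
    """Decide what to do when nginx gets a 502 for `domain`.

    Hypothetical helper: the real check lives in the nginx config.
    """
    if (Path(vassal_dir) / f"{domain}.ini").is_file():
        # The uwsgi master already manages this app, so it is running
        # (or being restarted) and is just struggling: pass the 502 on.
        return "return_502"
    # No vassal file: the app has never been started on this server.
    return "start_app"
```

The point of the design is that the existence of the vassal file, rather than the 502 itself, is what tells us whether an app has been started.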

Again, any thoughts much appreciated!