After our system update yesterday, there was a period when some people’s scheduled tasks ran slowly. This post explains what caused the issue and what we did to fix it.
Different code that you run on PythonAnywhere runs on different specific servers; this is because different workloads have different CPU/memory requirements, and the most effective way to set things up from our side is to group like with like – so, for example, websites run on web servers, consoles on console servers, and scheduled tasks on task servers.
At around 10am UTC, three hours after the system update, we noticed that the load average on one of the task servers was extremely high – around 100. Broadly, this meant that around 100 processes were either in the run queue – that is, wanting some processor time to do calculations, and waiting for it – or (since Linux’s load average also counts processes in uninterruptible sleep) stuck waiting on something like disk IO. Either way, it was clearly an indication of a problem; the load average should normally be pretty close to zero – or in the worst case, approximately equal to the number of cores on the machine.
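For the curious, the load average is easy to inspect: Linux exposes it in `/proc/loadavg`. Here’s a minimal Python sketch of reading it – the sample line is made up to resemble what we saw, not real data from our servers:

```python
def parse_loadavg(text):
    """Parse the first three fields of a /proc/loadavg-style line.

    Returns the 1-, 5- and 15-minute load averages as floats.
    """
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)

# A hypothetical line in the format /proc/loadavg uses; a healthy
# server would show values near zero, not near 100.
sample = "101.23 98.70 95.12 3/1024 4321"
print(parse_loadavg(sample))  # (101.23, 98.7, 95.12)

# On a live Linux box you would read the real file instead:
# with open("/proc/loadavg") as f:
#     print(parse_loadavg(f.read()))
```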
We logged in to the affected machine, and indeed there were a lot of processes running. But the CPU itself didn’t seem particularly busy. Investigating, we discovered that the connection from that task server to one of the file servers was slow. It looked like the processes were piling up not because they were fighting over CPU time, but because they were blocked on IO.
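One way to spot IO-blocked processes is the state field in each process’s `/proc/<pid>/stat`: a state of `D` means uninterruptible sleep, which almost always means waiting on IO. A quick illustrative sketch – the sample stat line here is hypothetical:

```python
def proc_state(stat_line):
    """Extract the process state from a /proc/<pid>/stat-style line.

    The state is the first field after the closing paren of the
    command name; 'D' means uninterruptible sleep (usually IO wait).
    We split on the *last* ')' because the command name itself may
    contain parentheses.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]

# A hypothetical stat line for a process stuck waiting on IO:
sample = "4242 (python3) D 1 4242 4242 0 -1 4194304"
print(proc_state(sample))  # 'D'
```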
A file server is exactly what it sounds like – just a server where people’s files are stored. Each account has all of its files on one file server; the file server makes the files available to the servers where code is actually run (as well as managing things like disk mirroring, backups, and so on). So a slow connection from a task server to a specific file server meant slow processes for the users on that task server whose files were on that file server.
While we were investigating this, early indications of similar problems started cropping up on other task servers. In each case it was a different file server – for example, task server 4 was having problems talking to file server 3, and task server 6 to file server 2.
Initially this looked like it might be due to connectivity problems between the servers in question. That’s something that is managed for us by Amazon AWS, so we got in touch with them. After much discussion, network issues were eliminated as a cause, and we had to go back to first principles.
The main question was, what had changed on these servers as a result of the system update? The changes we had made didn’t really have much to do with the scheduled task system. However, one change had been made that was common to all servers where our users’ code is executed: an improvement to our tarpit system.
Each user on PythonAnywhere gets a quota of CPU-seconds that they can use over each 24-hour period. This is the number of seconds for which they can use 100% of a CPU (rather than actual ‘wall clock’ time), and different users have different amounts, free users (of course) having the least. When a user runs out of their quota for the day, they are “put into the tarpit”. This means that instead of getting guaranteed CPU time, they get a slice of whatever is left over after other (non-tarpitted) people have had their share.
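To make the CPU-seconds idea concrete, here’s a toy sketch of the arithmetic – the numbers are made up and the real accounting is of course more involved:

```python
def cpu_seconds_used(cpu_fraction, wall_seconds):
    """CPU-seconds measure time at 100% of one CPU, not wall-clock
    time: a process using 50% of a CPU for 60s consumes 30."""
    return cpu_fraction * wall_seconds

def in_tarpit(used_today, daily_quota):
    """A user is tarpitted once their CPU-seconds for the day run out."""
    return used_today >= daily_quota

# Hypothetical numbers -- real quotas vary by account type.
print(cpu_seconds_used(0.5, 60))            # 30.0
print(in_tarpit(used_today=95.0, daily_quota=100.0))   # False
print(in_tarpit(used_today=120.0, daily_quota=100.0))  # True
```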
We had recently found certain cases where people who were in the tarpit could use up quite a lot of CPU. This wasn’t generally a case of people trying to “cheat” – more that in certain cases, errors that they made in their code could lead to them wasting CPU to no benefit to themselves. We had made a change that fixed that, using the CFS (“completely fair scheduler”) quota cgroup feature that is part of the Linux kernel.
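For those unfamiliar with CFS quotas: with the cgroup v1 CPU controller, a group of processes can be capped at a fraction of a CPU by setting `cpu.cfs_quota_us` relative to `cpu.cfs_period_us` – for example, a quota of 25,000µs in a 100,000µs period caps the group at 25% of one CPU. A rough sketch of the mapping; the cgroup path and group name below are illustrative, not our actual setup:

```python
def cfs_quota_us(cpu_fraction, period_us=100_000):
    """Convert a desired CPU share into a CFS quota.

    The kernel lets a cgroup run for at most cpu.cfs_quota_us
    microseconds in each cpu.cfs_period_us window (default 100ms).
    """
    return int(cpu_fraction * period_us)

print(cfs_quota_us(0.25))  # 25000

# Applying it to a cgroup (illustrative; assumes cgroup v1 mounted
# at /sys/fs/cgroup and a hypothetical group named "tarpit"):
# with open("/sys/fs/cgroup/cpu/tarpit/cpu.cfs_quota_us", "w") as f:
#     f.write(str(cfs_quota_us(0.25)))
```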
While there was no obvious way we could see that this could impact file system access, especially at a system-wide level (as opposed to just for the people who were in the tarpit), it was the only change we could see from the previous system. And on the impacted task servers, there were users who (a) were in the tarpit, (b) had their files on the file servers with which those specific task servers were having slow connectivity, and (c) had lots of processes running.
So, experimentally, we disabled the new tarpit behaviour on one of the misbehaving servers, and watched. The load average on that server immediately started dropping – and a test of the connection to the file server showed latency returning to normal. We had our culprit!
The next step, of course, was to disable the new tarpit behaviour across all task servers; both the other ones that were showing the problem, and the unaffected ones. That was done by about 5pm, and all has been well since then. Load averages across the system are normal.
We’re still trying to work out exactly what interaction between the CFS-based tarpit and the file server system caused the problem. One – somewhat vague – hypothesis is that when someone went into the tarpit while trying to do lots of file IO, some kind of internal buffer between the file servers and the user’s code filled up: the user’s code, being starved of CPU, drained the buffer slowly, so it stayed full – and if that buffer was shared between everyone on the machine rather than being per-user, it would have slowed everyone down. This doesn’t sound quite right, though, and more digging is definitely needed.
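As a toy illustration of how a shared, fixed-size buffer can penalise everyone – and to be clear, this is a model of the hypothesis above, not of our actual file server plumbing:

```python
import queue

# A shared buffer of fixed capacity, standing in for whatever
# per-machine buffering might sit between the file server and
# the user processes (purely illustrative).
shared_buffer = queue.Queue(maxsize=4)

# A tarpitted, CPU-starved consumer drains nothing, so filling
# the buffer on its behalf soon hits the capacity limit...
for block_id in range(4):
    shared_buffer.put(block_id, block=False)

# ...and the next write would stall (or here, with block=False,
# fail) -- holding up *every* user of the shared buffer, not
# just the slow one.
try:
    shared_buffer.put(4, block=False)
except queue.Full:
    print("buffer full: further IO stalls behind the slow consumer")
```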
We’ll keep investigating; in the meantime, the new tarpit system will remain disabled on task servers.