We had several outages over the last few days. The problem appears to be fixed now, but investigations into the underlying cause are still underway. This post is a summary of what happened, and what we know so far. Once we've got a better understanding of the issue, we'll post more.
It's worth saying at the outset that while the problems related to the way we manage our users' files, those files themselves were always safe. While availability problems are clearly a big issue, we regard data integrity as more important.
On Thursday 20 July, at 05:00 UTC, we released a new system update for PythonAnywhere. This was largely an infrastructural update. In particular, we updated our file servers from Ubuntu 14.04 to 16.04, as part of a general upgrade of all servers in our cluster.
File servers are, of course, the servers that manage the files in your PythonAnywhere disk storage. Each server handles the data for a specific set of users, and serves the files up to the other servers in the cluster that need them -- the "execution" servers where your websites, your scheduled tasks, and your consoles run. The files themselves are stored on network-attached storage (and mirrored in realtime to redundant disks on a separate set of backup servers); the file servers simply act as NFS servers and manage a few simple things like disk quotas.
While the system update took a little longer than we'd planned, once everything was up and running, the system looked stable and all monitoring looked good.
At 12:07 UTC our monitoring system showed a very brief issue. From some of our web servers, it appeared that access to one of our file servers, file-2, had very briefly slowed right down -- it was taking more than 30 seconds to list the specific directory that is monitored. The problem cleared up after about 60 seconds. Other file servers were completely unaffected. We did some investigations, but couldn't find anything, so we chalked it up as a glitch and kept an eye out for further problems.
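To make the kind of check described above concrete, here is a minimal sketch in Python of a probe that times a directory listing and gives up after a threshold. It is not the actual PythonAnywhere monitoring code: the function name `time_directory_listing`, the `lister` parameter, and the 30-second default are assumptions made for illustration.

```python
import os
import threading
import time


def time_directory_listing(path, timeout_seconds=30.0, lister=os.listdir):
    """Return the seconds taken to list `path`, or None if it timed out.

    `lister` is a hypothetical hook so the probe can be exercised without
    a real slow filesystem; by default it is os.listdir.
    """
    result = {}

    def worker():
        started = time.monotonic()
        try:
            lister(path)
            result["elapsed"] = time.monotonic() - started
        except OSError as exc:
            result["error"] = exc

    # A listing on a hung NFS mount cannot be cancelled from Python, so it
    # runs in a daemon thread that is simply abandoned on timeout.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)

    if thread.is_alive():
        return None  # still blocked after the timeout
    if "error" in result:
        raise result["error"]
    return result["elapsed"]
```

A caller would treat a `None` result as "the file server is too slow to respond" and raise an alert, while a float is simply the observed latency.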
At 14:12 UTC it happened again, and then over the course of the afternoon, the "glitches" became more frequent and started lasting longer. We discovered that the symptom from the file server's side was that all of the NFS daemons -- the processes that together make up an NFS server -- would become busy at once; system load would rise from about 1.5 to 64 or so. They were all waiting uninterruptibly on what we think was disk I/O (status "D" in
The problem only affected file-2 -- other file servers were all fine. Given that every file server had been upgraded to an identical system image, our initial suspicion was that there might be some kind of hardware problem. At 17:20 UTC we got in touch with AWS to discuss whether this was likely.
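The file-server symptom mentioned earlier, NFS daemons stuck in uninterruptible sleep (state "D"), can be observed on Linux by reading `/proc/<pid>/stat`. The following is a minimal sketch rather than the tooling used during the incident; the function names and the assumption that the daemons appear under the command name `nfsd` are illustrative.

```python
import os


def parse_stat(stat_text):
    """Extract (command name, state letter) from /proc/<pid>/stat contents."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first '(' and the last ')'.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return comm, state


def count_blocked(name="nfsd", proc_root="/proc"):
    """Count processes called `name` that are in state "D"."""
    blocked = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the stat file was unreadable
        if comm == name and state == "D":
            blocked += 1
    return blocked
```

Sampling `count_blocked()` every few seconds alongside the load average would show whether every daemon blocks at once, as described above, or only some of them.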
By 19:10 our discussions with AWS had revealed nothing of interest. The "glitches" had become a noticeable problem for users, and we decided that while there was no obvious sign of hardware problems, it would be good to at least eliminate that as a possible cause. We took a snapshot of all disks containing user data (for safety), then migrated the server to new hardware. This caused a 20-minute outage for users on that file server (who were already seeing a serious slowdown anyway), and a 5-minute outage for everyone else, the latter because we had to reboot the execution servers.
After this move, at 19:57 UTC, everything seemed OK. Our monitoring was clear, and the users we were in touch with confirmed that everything was looking good.
At 14:31 UTC on 21 July, we saw another glitch on our monitoring. Again, the problem cleared up quickly, but we started looking again into what could possibly be the cause. There were further glitches at 15:17 and 16:51, but then the problem seemed to clear up.
Unfortunately, at 22:44 it flared up again. Again, the issues started happening more frequently, and lasting longer each time, until they became very noticeable for our users at around 00:30 UTC. At 00:55 UTC we decided to move the server to different hardware again -- there's no official way to force a move to new hardware on AWS; stopping and starting an instance usually does it, but there's a small chance you'd end up on the same physical host again, so a second attempt seemed worthwhile. If nothing else, it would hopefully at least clear things up for another 12 hours or so and buy us time to work out what was really going wrong.
This time, things didn't go according to plan. The file server failed to come up on the new hardware, and trying to move again did not help. We decided that we were going to need to provision a completely fresh file server, and move the disks across. While we have processes in place for replacing file servers as part of a normal system update, and for moving them to new hardware without changing (for example) their IP address, replacing one in the middle of an outage is not something we'd done before. Luckily, it went as well as could be expected. At 01:23 UTC we'd worked out what to do and started the new file server. By 01:50 we'd started the migration, and by 02:20 UTC everything was moved over. There were a few remaining glitches, but these were cleared up by 02:45 UTC.
We did not expect the fix we'd put in to be a permanent solution -- though we did have a faint hope that perhaps the problem had been caused by some configuration issue on file-2, which might have been remediated by our having provisioned a new server rather than simply moving the old one. This was never a particularly strong hope, however, and when the problems started again at 12:16 UTC we weren't entirely surprised.
We had generated two new hypotheses about the possible cause of these issues by now:
The problem with both of these hypotheses was that only one of our file servers was affected. All file servers had the same number of workers, and all had been upgraded to 16.04.
Still, it was worth a try, we thought. We decided to try changing the number of daemon processes first, as we believed it would cause minimal downtime; however, we started up a new file server on 14.04 so that it would be ready just in case.
At 14:41 UTC we reduced the number of workers to eight. We were happy to see that this was picked up across the cluster without any need to reboot anything, so there was no downtime.
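As an illustration only (this post does not say which tooling made the change), the Linux kernel NFS server exposes its thread count through the file `/proc/fs/nfsd/threads`: reading it returns the current number of threads, and writing a number to it as root resizes the pool on a running server. A minimal sketch, with hypothetical function names and a `path` parameter added so it can be tried against an ordinary file:

```python
NFSD_THREADS_PATH = "/proc/fs/nfsd/threads"


def read_nfsd_thread_count(path=NFSD_THREADS_PATH):
    """Return the number of NFS server threads reported at `path`."""
    with open(path) as handle:
        return int(handle.read().split()[0])


def set_nfsd_thread_count(count, path=NFSD_THREADS_PATH):
    """Ask the kernel for `count` NFS server threads (needs root)."""
    if count < 1:
        # Writing 0 would stop the NFS server, so this sketch refuses it.
        raise ValueError("thread count must be at least 1")
    with open(path, "w") as handle:
        handle.write(f"{count}\n")
```

Because the change applies to the running server, clients do not need to remount, which matches the lack of downtime noted above.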
Unfortunately, at 15:04, we saw another problem. We decided to spend more time investigating a few ideas that had occurred to us before taking the system down again. At 19:00 we tried increasing the number of NFS processes to 128, but that didn't help. At 19:23 we decided to go ahead with switching over to the 14.04 server we'd prepared earlier. We kicked off some backup snapshots of the user data, just in case there were problems, and at 19:38 we started the migration over.
This completed at 19:46, but required a restart of all of the execution servers in order to pick up the new file server. We started this process immediately, and web servers came back online at 19:48, consoles at 19:50, and scheduled tasks at 19:55.
By 20:00 we were satisfied that everything looked good, and so we went back to monitoring.
After the update on Saturday, there were no monitoring glitches at all on Sunday, but we did see one potential problem on Monday at 12:03. However, this blip was only seen from one of our web servers (previous issues affected at least three at a time, and sometimes as many as nine), and it has not been followed by any further outages in the 4 hours since, which is somewhat reassuring.
We're continuing to monitor closely, and are brainstorming hypotheses to explain what might have happened (or perhaps is still happening). Of particular interest is the fact that this issue only affected one of our file servers, despite all of them having been upgraded. One possibility we're considering is that the correlation in timing with the upgrade is simply a red herring -- that instead there's some kind of access pattern, some particular pattern of reads/writes to the storage, which only started at around midday on Thursday after the system update. We're planning possible ways to investigate that should the problem occur again.
Either way, whether the problem is solved now or not, we clearly have much more investigation to do. We'll post again when we have more information.