tl;dr
On Thursday 5 September 2024 we performed some system maintenance. It appeared to have gone well, and was completed at the scheduled time (06:20 UTC), but unfortunately there were unexpected knock-on effects that caused issues later on in the day, and further problems on Saturday 7 September. This post gives the details of why we needed to perform the maintenance, what happened, and what we will do to prevent a recurrence.
Why was the maintenance necessary?
Normally, when we update PythonAnywhere, we need a brief period of downtime to
update all of the systems with the new code. However, this maintenance was for a different purpose.
Over the last year, we have noticed degraded reliability in certain server types that we use (specifically m3-series AWS instances). The technical support team at AWS have told us that there are no known issues with that instance type, but we have found that moving over to newer m5-series instances significantly improves reliability.
As a result, we have (over the last year) been migrating over to the newer instances.
This is a relatively large task, as we were using m3s as file servers. Newly-created file servers have, since February, been using the m5 instance type, but we still had to migrate the data from the old ones over to newer instances.
Simply copying the data from one machine to another would take days to weeks – there’s
a lot of it there! – and people who were on the file servers in question would not
be able to do anything while the copy was happening, which would obviously be
unacceptable.
So our process has been to set things up so that each old m3 instance syncs all of the data to a new m5 one in the background. That process works smoothly, and so the only downtime required is at the end, when we have to swap the servers around so that the m5 becomes the primary file server, and the m3 can be shut down.
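As a rough illustration of that sync-then-swap approach, here is a minimal sketch in Python. The host names, paths and the promotion/decommission steps are assumptions made up for this example; the real process has more moving parts, but the shape is the one described above: repeated incremental syncs while the old server stays live, then one final catch-up and the swap during the maintenance window.

```python
# A minimal sketch of the "sync in the background, then swap" approach
# described above. The host names, paths and the promotion/decommission
# steps are illustrative placeholders, not our actual tooling.
import subprocess

OLD = "old-m3-fileserver"   # hypothetical old-style server
NEW = "new-m5-fileserver"   # hypothetical new-style replacement

def sync_pass():
    # Each rsync pass only transfers files that have changed since the last
    # one, so it can run repeatedly while the old server stays in service.
    subprocess.run(
        ["rsync", "-a", "--delete", f"{OLD}:/storage/", f"{NEW}:/storage/"],
        check=True,
    )

def switchover():
    # Run inside the maintenance window, with the old server quiesced, so
    # the final pass has very little left to copy.
    sync_pass()
    print(f"promote {NEW} to primary")  # placeholder for the real swap step
    print(f"shut down {OLD}")           # placeholder for decommissioning
```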
We have previously done this switchover for our EU system (and, of course, in numerous testing environments), and it has worked without any issues. 5 September was the first such switchover for a file server on our main US-hosted system at www.pythonanywhere.com. It was just one of the file servers (specifically livefile10), as we wanted to minimise issues if there was a problem with the process.
Timeline
All times are in UTC.
Thursday, 5 September
- 06:00 we started the switchover process; our site (and hosted sites) were brought down, the file server was switched over, we did connectivity checks and confirmed that all files were present on the new livefile10.
- 06:20 systems were brought back up again. All systems seemed fine, apart from always-on tasks, which were starting, but somewhat slowly. These take longer than they should to start up after any kind of system maintenance (which is something we're exploring options for fixing), so we kept tracking that, but otherwise believed everything was OK.
Sub-timeline: SQLite
- 10:10 we noted that we had been receiving reports of certain websites not working after the maintenance. There did not appear to be anything obvious in common between them; they were on different file servers, so the problem didn't appear to be directly linked to the change, but it was obviously related in some way.
- 11:15 we finally worked out the connection: the affected websites were those that were using SQLite. We determined that any SQLite databases that had been locked for updates at the time the maintenance started had not had those locks cleared, so any code that tried to use them would hang (see the sketch after this timeline for what that looked like from a website's point of view). Other databases – MySQL and Postgres – were unaffected.
- 11:20 each file server has a “lock manager” – a process that keeps track of which files are locked and which are not. We restarted that service across all file servers, both old- and new-style. The affected SQLite databases that we were aware of started working again immediately.
- 12:30 we received some reports of SQLite databases that were still not unlocked. After investigation, we discovered that restarting the lock manager on the new-style m5 instances fixed the problem for databases stored on those servers, but it did not work for those stored on the old-style m3 servers; they run different versions of Ubuntu, which we believe to be the underlying cause of that difference. We discovered that the only way to get those servers to release their locks was to reboot them. This was clearly suboptimal, because (unlike restarting the lock manager) it would lead to downtime for all accounts on the server in question. Unfortunately there was no other option.
- 12:45 - 15:00 in order to minimise downtime, we rebooted the old servers one at a time. This may seem counter-intuitive, but rebooting one server was expected (correctly) to cause an outage lasting 2-3 minutes, affecting only those accounts on the server in question and – potentially – our own site, while rebooting them in a group could have led to a significantly longer outage across all sites. By 15:00 this process was complete, and SQLite databases were all working again.
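For readers wondering what the SQLite problem looked like from inside an affected web app, here is a minimal sketch; the database path and table are made up for illustration, not taken from anyone's actual code. With the stale lock never released, every write attempt either blocked or, once SQLite's busy timeout expired, failed with a "database is locked" error.

```python
# A minimal sketch of what affected code would have seen; the path and table
# are hypothetical.
import sqlite3

conn = sqlite3.connect("/home/someuser/mysite/db.sqlite3", timeout=5)  # hypothetical path
try:
    conn.execute("UPDATE visits SET count = count + 1 WHERE page = 'home'")  # hypothetical table
    conn.commit()
except sqlite3.OperationalError as exc:
    # With the lock never cleared, this fired on every attempt until the
    # lock manager restart (or server reboot) released it.
    print(f"SQLite error: {exc}")  # typically "database is locked"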
Sub-timeline: Always-on tasks
- 11:00 - 15:00 over this time, we received a number of reports of issues with networking on always-on tasks. This can happen due to capacity issues; if there are too many tasks on a given server, it can make the networking system slow down to a level where (for example) DNS lookups fail, meaning that tasks can’t connect to other servers – for example, external APIs or databases. We investigated and realised that some machines in the cluster were running significantly more always-on tasks than others, which explained the problem, and why only a subset of tasks were affected. (We now believe that this was because when the system came back after the maintenance, it did so asymmetrically – that is, some of the nodes in the cluster came back sooner than others, and started picking up tasks first. This meant that those nodes had picked up more tasks than the others.)
- 15:00 - 18:00 we manually “rebalanced” the always-on task cluster, making sure that each node had the same number of running tasks (the sketch below illustrates the idea). This appeared to fix the issues there.
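As a rough illustration of what "rebalancing" means here, the sketch below works out how many tasks each node would need to gain or shed to even out the load. The node names and counts are invented for the example; the real cluster tooling is, of course, more involved than this.

```python
# A minimal sketch of the rebalancing idea: given the number of running
# always-on tasks per node, work out how many each node should gain (+)
# or shed (-) so that the load is even. Names and counts are made up.
def rebalance_plan(task_counts: dict[str, int]) -> dict[str, int]:
    total = sum(task_counts.values())
    nodes = sorted(task_counts)
    base, extra = divmod(total, len(nodes))
    # The first `extra` nodes take one more task so the totals still add up.
    targets = {node: base + (1 if i < extra else 0) for i, node in enumerate(nodes)}
    return {node: targets[node] - task_counts[node] for node in nodes}

if __name__ == "__main__":
    counts = {"aot-node-1": 180, "aot-node-2": 40, "aot-node-3": 80}  # hypothetical
    print(rebalance_plan(counts))
    # {'aot-node-1': -80, 'aot-node-2': 60, 'aot-node-3': 20}
```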
Friday, 6 September
- Over the course of the day, there were no system issues.
Saturday, 7 September
- A number of other always-on tasks started having network issues again; we're currently investigating the cause, because we don't yet see any way in which it could be connected to the issues on Thursday – but the timing is suspicious.
Next steps
Having determined the causes of the issues, we have solid plans for the next steps.
We still have file servers using the old m3 instances, and in order to make things as reliable as they should be, we'll need to migrate them over to m5s. For each of the issues we saw this time around, there are different solutions:
SQLite
We know that we will need to restart the lock managers across all file servers after the next migration. This should happen during the maintenance window, not afterwards, which leaves us with two options:
- Migrate a subset of the remaining m3s. Once that is done, restart the lock manager on all new-style file servers, and reboot all of the old-style file servers. This would add some time on to the maintenance window, but hopefully no more than ten minutes.
- Migrate all of the remaining m3s in one batch. Once that was done, we could just restart the lock manager across all file servers – there would be no old-style file servers that would require rebooting. This would add at worst a minute or so to the downtime.
Doing smaller batches would be lower-risk – that is, we believe there would be a lower chance of having longer downtime than anticipated. But doing one large batch, if it went well, would mean only one more downtime to get all servers migrated, and it would be a shorter outage too. We are planning to do a significant number of tests prior to making a decision on which way to go with this. However, whichever route we follow, please do be reassured that all data that we store is backed up in a multitude of ways (especially during this migration process) and the risk of data loss is essentially zero.
(Your author is hoping he won’t come to regret those words!)
Always-on tasks
Here, we have a short-term and a long-term plan.
In the short term: the delays in starting up always-on tasks after our normal system updates are much longer than they should be – but they are significantly less problematic than the multiple outages we saw after this recent maintenance. So, next time we do this kind of maintenance, we will shut down and restart the always-on system the way that we do for a normal system update. We appreciate that this is far from ideal, which is why we need a longer-term plan.
In the long term, we’re debating internally two different options, which can be loosely summarised as “improve the architecture we have” and “switch to a new architecture”. The latter isn’t quite as big a step as it might sound, as the new architecture would be based on systems that we already use in other parts of the system. We’ll post more about that in the future.
In general
One area where we could have done better during this system update was in spotting patterns in tech support requests after the maintenance. We received around ten reports of website issues – the ones that we ultimately realised were related to SQLite – over the period 06:20 to 10:10. Unfortunately, because different people were working on different support cases, it took us a while to connect the dots and realise that there was a common theme there. We are looking for a way to avoid that next time around (and, perhaps, as a general thing after our normal system updates).
Summary
Our system maintenance on 5 September had knock-on effects that meant that SQLite databases were locked after the update, and always-on tasks came back in a fashion that caused problems for some of them for a significant amount of time. Fixing the SQLite issue required rebooting a number of file servers, which caused short (~2-3 minute) outages for all users on the affected file servers, even those that were not using SQLite. And fixing the always-on task issue took significantly longer than it should have done.
We’re working hard to make sure that the next file server migration goes considerably more smoothly! We’ll update this blog post with anything we want to share about that.