tl;dr
On 2023-10-09 we had an unplanned outage. While we were preparing our systems for a scheduled system update the following morning, we ran into some issues. These would not have caused problems in themselves, but responding to them led to the accidental termination of the old, running cluster's machines at 15:22 UTC. To avoid additional downtime, we decided to go ahead with the planned update to recover the service. It took longer than expected, but we had all hosted websites up and running by 17:30 UTC; unfortunately, always-on tasks took longer and were only fully working by 21:42 UTC.
Background
On 2023-10-09, we were preparing for the system update that we had announced and planned for the following morning.
When we're deploying a new version of PythonAnywhere, we start an entirely new cluster alongside the old one. That means that on the day of the deploy, we only need to do a few final steps, like switching the public IPs across to the new cluster and running database migrations, to make the new cluster replace the old one.
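To give a rough sense of what that switchover step involves, here is a minimal sketch of moving a public IP from an old cluster's web machine to its replacement. It assumes AWS EC2 Elastic IPs and boto3, and all of the IDs are placeholders; it is an illustration of the idea, not our actual deployment tooling.

```python
import boto3

# Sketch of the "switch public IPs across" step, assuming AWS Elastic IPs.
# The IDs below are placeholders, not real resources.
ec2 = boto3.client("ec2")

OLD_WEB_INSTANCE = "i-0123456789abcdef0"             # web machine in the old cluster
NEW_WEB_INSTANCE = "i-0fedcba9876543210"             # its replacement in the new cluster
PUBLIC_IP_ALLOCATION = "eipalloc-0aaa1111bbb22222c"  # the Elastic IP that customers hit

# Re-pointing the Elastic IP at the new instance is what makes the new
# cluster start serving traffic; AllowReassociation lets it move even
# though it is currently attached to the old machine.
ec2.associate_address(
    AllocationId=PUBLIC_IP_ALLOCATION,
    InstanceId=NEW_WEB_INSTANCE,
    AllowReassociation=True,
)
```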
While spinning up the machines for the new cluster, we noticed that many of them had failed to start because the subnet we were running in could not accommodate two live clusters at the same time.
To work around this, we planned to shut down the machines in the new cluster that had already started and then spin up a fresh cluster in a new subnet.
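For context on the kind of capacity problem involved, here is a hedged sketch of checking whether a subnet has enough free addresses before launching a second cluster into it. It assumes AWS EC2 and boto3; the subnet ID and machine count are made up for illustration.

```python
import boto3

ec2 = boto3.client("ec2")

SUBNET_ID = "subnet-0123456789abcdef0"  # placeholder, not a real subnet
MACHINES_PER_CLUSTER = 40               # illustrative figure only

# AvailableIpAddressCount tells us how many private IPs the subnet still
# has free; running two full clusters side by side needs roughly twice
# a single cluster's share.
subnet = ec2.describe_subnets(SubnetIds=[SUBNET_ID])["Subnets"][0]
free_ips = subnet["AvailableIpAddressCount"]

if free_ips < MACHINES_PER_CLUSTER:
    print(
        f"Only {free_ips} free IPs in {SUBNET_ID}: "
        "a second cluster will not fit, so use a new subnet instead."
    )
```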
Details and Timeline
15:22 UTC
Unfortunately, due to human error, we started terminating machines from the old, running cluster – the one that was running our live service. By the time we realized the error, there was too much damage to the cluster to repair it quickly, so we decided to go ahead with the update that we'd planned for the next day rather than rebuild the old cluster and face a second outage within 24 hours.
15:43 UTC
We finished terminating all the machines in the old live cluster and started spinning up a new one with the updated code. After several minutes, we noticed that some machines were failing to start properly and realized we'd hit an unexpected rate limit. We slowed down the rate at which we were launching machines to stay under that limit and started another cluster.
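The general fix for a launch rate limit is to back off and retry. Purely as an illustration (this assumes AWS EC2's RequestLimitExceeded throttling error and boto3, and the launch parameters are placeholders rather than our real configuration), it looks something like this:

```python
import time

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")


def launch_with_backoff(image_id, instance_type, subnet_id, max_attempts=6):
    """Launch one machine, sleeping and retrying if we're being throttled."""
    delay = 2  # seconds; doubled on every throttled attempt
    for attempt in range(max_attempts):
        try:
            response = ec2.run_instances(
                ImageId=image_id,
                InstanceType=instance_type,
                SubnetId=subnet_id,
                MinCount=1,
                MaxCount=1,
            )
            return response["Instances"][0]["InstanceId"]
        except ClientError as error:
            if error.response["Error"]["Code"] != "RequestLimitExceeded":
                raise  # a real failure, not throttling
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("Still being rate-limited after backing off")
```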
16:47 UTC
Once that cluster was up and mostly working, we were able to do the rest of the deploy and bring the service back up. There were still some issues, but the service was running, albeit in a degraded state.
The main problem with the new cluster was that all of the infrastructure for running always-on tasks had failed to start because we had hit a different, equally unexpected rate limit.
17:30 UTC
Hosted websites, scheduled tasks, consoles and notebooks were up and running.
17:59 UTC
We continued to work through the other background issues we knew about while we started the machines that run always-on tasks.
19:17 UTC
Once we had handled all of those issues and the always-on machines were ready, we began restarting the always-on tasks gradually to avoid overloading the infrastructure.
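"Gradually" here just means restarting in small batches with a pause between them, so the task infrastructure never sees a thundering herd. A minimal sketch of the idea (the restart_task helper, batch size, and pause are hypothetical, not our real tooling):

```python
import time

BATCH_SIZE = 50      # hypothetical batch size
PAUSE_SECONDS = 60   # hypothetical pause between batches


def restart_gradually(tasks, restart_task):
    """Restart always-on tasks in batches rather than all at once."""
    for start in range(0, len(tasks), BATCH_SIZE):
        batch = tasks[start:start + BATCH_SIZE]
        for task in batch:
            restart_task(task)     # hypothetical helper that restarts one task
        time.sleep(PAUSE_SECONDS)  # give the infrastructure time to absorb the batch
```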
21:42 UTC
All always-on tasks were running again.
Conclusion
Naturally, we’ll be taking action to make sure that an issue like this does not happen again. While it’s impossible to remove all chances of human error, we need to work out how to make it harder to perform destructive actions on the currently running live system. We’ll also need to look into ways to reduce downtime when anything like this happens, in particular with respect to always-on tasks.
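As one example of the kind of guard rail available for this (an illustration assuming AWS EC2 and boto3, not a statement of what we will actually do; the tag name and value are hypothetical), instances in the live cluster can be given termination protection, so that an accidental terminate call is rejected unless the protection is explicitly removed first:

```python
import boto3

ec2 = boto3.client("ec2")

# Find the machines tagged as belonging to the live cluster (the tag name
# and value here are hypothetical) and turn on termination protection, so
# an accidental terminate call is rejected by AWS rather than carried out.
reservations = ec2.describe_instances(
    Filters=[{"Name": "tag:cluster-role", "Values": ["live"]}]
)["Reservations"]

for reservation in reservations:
    for instance in reservation["Instances"]:
        ec2.modify_instance_attribute(
            InstanceId=instance["InstanceId"],
            DisableApiTermination={"Value": True},
        )
```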