CPU resetting issues report: 3 - 5 May 2024


tl;dr

We run a number of background processes periodically on our systems; one of them resets each account's CPU usage count, so that you get a fresh allowance every day. Early in the morning of 2024-05-03, on our US-hosted system, that process failed silently.

Unfortunately, we only realized it was not working on the morning of 2024-05-04. Putting a fix in place required another day.

At the same time, our load balancing system was experiencing a DDoS attack by malicious bots, which led to an overall decline in performance.

Some users who noticed the CPU issue assumed that these two separate events were related, which led to confusion.

These issues appeared only on our US-based system – users on our EU system were not affected.

Details

At around 3:00 UTC on 2024-05-03, the server responsible for running several housekeeping tasks stopped running some of its scheduled jobs, including the one responsible for resetting the daily CPU count. It was a new failure mode, and we didn’t have any alerting in place to detect it, or any monitoring that would pick it up. We realized the server was not responding on the morning of 2024-05-04. At that point the server was unreachable, so we had to reboot it. The reboot fixed the underlying issue, and the scheduled jobs started running again. However, as the CPU resetting system had not been working for over a day, almost all our users had their “next CPU reset time” in the past, and the script was overloaded with accounts to process, which meant that it was still not working.

At the same time, our new system for DDoS detection was blocking a flood of malicious bots scanning our site and other PythonAnywhere-hosted sites. We had already been working on fine-tuning the new system, but the stress it was putting on our load balancing system was still large enough to have a noticeable performance impact on our web services.

As some of our users started to notice the performance issues, they also realized their CPU count was over the limit (some of them had been in “the tarpit” since before the failure) and not resetting, despite continuously showing “Resets in 0 minutes”. Although the code running in web apps is excluded from the CPU quota, those two issues were easy to conflate, leading some people to the conclusion that our users’ web apps were not working properly due to the lack of available CPU seconds.

The fix

As the issue developed over the weekend, we didn’t have the whole team available to respond to our users. While we manually reset the CPU count for the users who contacted us via our forums and the support email box, we were also working on fixing the CPU resetting system as a whole. After a while we decided to move the reset time for all users into the future, so that the script would be able to pick them up gradually in consecutive runs. However, this update took about 14 hours to complete, and the system was only fully functional again at around 2:00 UTC on 2024-05-05.
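The idea of moving reset times into the future so that each run only sees a manageable batch can be illustrated with a minimal sketch. This is not the production script: the function names, the 24-hour window, and the even spacing between accounts are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def stagger_reset_times(user_ids, start, window=timedelta(hours=24)):
    """Give each user a reset time at or after `start`, spread evenly over `window`.

    Hypothetical helper: even spacing is an assumption, chosen so that no
    single run of the reset job finds every account due at once.
    """
    user_ids = list(user_ids)
    if not user_ids:
        return {}
    step = window / len(user_ids)
    return {uid: start + i * step for i, uid in enumerate(user_ids)}


def due_for_reset(reset_times, now):
    """Return the users whose reset time has passed, as one job run would see them."""
    return [uid for uid, when in reset_times.items() if when <= now]


if __name__ == "__main__":
    start = datetime(2024, 5, 5, tzinfo=timezone.utc)
    times = stagger_reset_times(["alice", "bob", "carol", "dave"], start, timedelta(hours=4))
    # One hour in, only the accounts scheduled so far are due.
    print(due_for_reset(times, start + timedelta(hours=1)))
```

With reset times spread out like this, each consecutive run processes only the accounts that have become due since the previous run, rather than the whole backlog.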

Meanwhile, we were trying to tune the DDoS detection system so that it would keep its functionality without putting such stress on the load balancing system. We reached a satisfactory solution on the morning of 2024-05-05. On 2024-05-06 we added an even better way of blocking the malicious bots, so that the DDoS detection system could focus on actual distributed denial of service attacks.

We have also added enhanced monitoring to the server running our housekeeping tasks, so it will alert us much earlier (paging the on-call team member) if it ends up in the same failure mode that stopped it from running scheduled jobs.
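One common way to catch a scheduled job that has silently stopped is to record when each job last finished successfully and alert when that timestamp gets too old. The sketch below illustrates that general technique only; it is not a description of the monitoring we deployed, and the job names, thresholds, and the `stale_jobs` function are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed default: a job is considered stale if it has not succeeded for an hour.
DEFAULT_MAX_AGE = timedelta(hours=1)


def stale_jobs(last_success, now, max_age):
    """Return the sorted names of jobs whose last successful run is too old.

    `last_success` maps job name -> datetime of its last successful run.
    `max_age` maps job name -> allowed age; missing jobs use DEFAULT_MAX_AGE.
    """
    return sorted(
        name
        for name, finished_at in last_success.items()
        if now - finished_at > max_age.get(name, DEFAULT_MAX_AGE)
    )


if __name__ == "__main__":
    now = datetime(2024, 5, 4, 9, 0, tzinfo=timezone.utc)
    last_success = {
        "reset_cpu": datetime(2024, 5, 3, 2, 0, tzinfo=timezone.utc),
        "cleanup": now - timedelta(minutes=10),
    }
    # In a real setup, a non-empty result would page the on-call team member.
    print(stale_jobs(last_success, now, {"reset_cpu": timedelta(hours=2)}))
```

A check like this needs to run somewhere other than the server it is watching, since in this incident the housekeeping server itself became unreachable.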
