Post-Mortem on 14 April 2025 Outage

Jannis Fedoruk-Betschki

Today, on 14 April 2025, from approximately 09:33 UTC to around 17:00 UTC, Magic Pages experienced a significant outage affecting all customer websites. This was caused by an issue within the underlying Kubernetes infrastructure that cascaded into an unexpected problem with an external API limit. Here’s a breakdown of what happened and the steps I took to resolve it.

What happened?

Magic Pages runs on a Kubernetes cluster hosted with Hetzner Cloud. Kubernetes has different types of servers (nodes): "control plane" nodes that manage the cluster, and "worker" nodes that run the actual websites. For redundancy, Magic Pages uses three control plane nodes.

Around 09:33 UTC, one of these control plane nodes went offline unexpectedly. Investigating the node showed it had run out of memory. This happens occasionally, and the standard procedure is to remove the faulty node and spin up a replacement. That's why we have redundancy, after all.

I decided to replace it with a slightly larger server instance to provide more memory headroom. I initiated this using the hetzner-k3s tool, which automates creating and managing Kubernetes clusters on Hetzner Cloud.

The deployment process failed, however. That happens. Sometimes API requests time out or fail temporarily, so I restarted the process. It failed again. Unfortunately, the hetzner-k3s tool, while convenient, doesn't provide detailed logs about why underlying Hetzner API calls might be failing.

Around the same time, monitoring alerts started flooding in – customer websites were becoming unreachable. Panic mode kicked in.

After about half an hour of investigation, I found the culprit: Magic Pages had hit the API rate limit imposed by Hetzner Cloud. Hetzner limits projects to 3600 API requests per hour.

I was aware of this limit, but I hadn't fully realized the extent to which my Kubernetes cluster itself relies on the Hetzner API for certain ongoing operations (like networking and some storage management). When the control plane started having issues, these internal components likely increased their API calls trying to self-correct, quickly exhausting the hourly quota.

This meant I couldn't easily deploy new nodes, nor could the existing cluster components function correctly. Things started cascading.

My first thought was to wait for the rate limit to reset. The limit is hourly, so waiting 30 minutes should have freed up around 1800 requests. However, checking the API status after 30 minutes still showed ratelimit-remaining: 0. It took another 15-20 minutes to realize why: the remaining Kubernetes nodes, struggling due to the failing control plane and lack of API access, were continuously trying to use the API, preventing the limit from ever resetting.
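
For context, Hetzner exposes the rate-limit state in the response headers of every Cloud API call, which is how I was watching it. Here is a minimal sketch of that check in Python – the HCLOUD_TOKEN environment variable is simply my assumption for where an API token lives, and the requests library is used for the HTTP call:

```python
import os

import requests

# Any authenticated call against the Hetzner Cloud API returns the current
# rate-limit state in its response headers. Note that the check itself
# consumes one request from the hourly budget.
HCLOUD_TOKEN = os.environ["HCLOUD_TOKEN"]  # assumed: a read-only project API token

response = requests.get(
    "https://api.hetzner.cloud/v1/servers",
    headers={"Authorization": f"Bearer {HCLOUD_TOKEN}"},
    params={"per_page": 1},  # keep the response body small
    timeout=10,
)

# Header names as documented by Hetzner; HTTP headers are case-insensitive,
# so this also matches the lowercase "ratelimit-remaining" mentioned above.
print("limit:    ", response.headers.get("RateLimit-Limit"))
print("remaining:", response.headers.get("RateLimit-Remaining"))
print("reset:    ", response.headers.get("RateLimit-Reset"))
```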

At this point, simply waiting wasn't an option. Trying to scale up a separate cluster to migrate traffic also wasn't feasible, as moving resources also requires API calls that were being blocked.

💡 Update, April 16, 2025: I have just heard back from Hetzner on this issue. There is a bug in the cloud controller that sits between a Kubernetes cluster and Hetzner's cloud. Unfortunately, the root cause of the bug has not been found for several months, though they have provided a workaround that should get things back up much more quickly, should this happen again.

How did I fix it?

With the cluster stuck in a loop of failing operations and constant API calls, I took the drastic step of shutting down the entire Kubernetes cluster by powering off the servers. This was the only way to guarantee a stop to the API requests.

It worked. Monitoring the Hetzner API status, the ratelimit-remaining value finally started climbing. I waited for a while longer to build up a buffer, not wanting to risk hitting the limit again immediately during the recovery process. While waiting, I also contacted Hetzner support to request an increase in our API rate limit for the future.

Once a reasonable number of API requests were available, I began the recovery process and restarted the control plane servers. They came back up successfully. I then wanted to bring the worker nodes back online as well, but ran into another issue.
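
In practice, "came back up successfully" means the nodes reported a Ready condition again. Here is a rough sketch of that check with the official Kubernetes Python client, assuming a working kubeconfig – treat it as illustrative rather than exactly how I verified it at the time:

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (assumes kubectl access to the
# cluster is already working again).
config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    # Each node reports a list of conditions; the "Ready" condition tells us
    # whether its kubelet is healthy and able to run pods.
    ready = next(
        (c.status for c in node.status.conditions if c.type == "Ready"),
        "Unknown",
    )
    print(f"{node.metadata.name}: Ready={ready}")
```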

The worker nodes were immediately overwhelmed. Every website container tried to start simultaneously, metaphorically shouting "me, me, me!" – but for sites to run, their storage volumes needed to be available first.

The storage system I use (Longhorn) is generally self-healing, but it requires a minimum number of its own components to be running correctly, especially after a complete shutdown. Getting those essential Longhorn components started required manual intervention.

Once the core Longhorn components were manually started, the system began its self-healing process, gradually making storage volumes available across the cluster. As storage volumes became available, website containers could start successfully. This happened progressively over the next hour or so.
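
To make "storage volumes became available" a bit more concrete: Longhorn tracks each volume as a Kubernetes custom resource whose status moves back to an attached, healthy state as the system heals. The sketch below watches that progress with the Kubernetes Python client – the longhorn.io group, the v1beta2 version, and the status field names are assumptions based on recent Longhorn releases and may differ on other versions:

```python
from kubernetes import client, config

config.load_kube_config()
crd = client.CustomObjectsApi()

# Longhorn stores its volumes as custom resources in the longhorn-system
# namespace. Group/version and field names are assumptions (recent Longhorn
# releases use longhorn.io/v1beta2); adjust for the installed version.
volumes = crd.list_namespaced_custom_object(
    group="longhorn.io",
    version="v1beta2",
    namespace="longhorn-system",
    plural="volumes",
)

for volume in volumes["items"]:
    status = volume.get("status", {})
    print(
        volume["metadata"]["name"],
        status.get("state"),       # e.g. "attached" or "detached"
        status.get("robustness"),  # e.g. "healthy", "degraded", "faulted"
    )
```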

By around 17:00 UTC, about 95% of sites were back online. The remaining few (including my own blogs, ironically) required some more manual attention and restarts, which were completed shortly after.

The entire outage lasted approximately 7.5 hours.

What worked well?

  1. The initial failure of a single control plane node would, on its own, have been handled by the redundant setup.
  2. Although delayed by tooling limitations, I eventually pinpointed the Hetzner API rate limit as the core issue.
  3. Shutting down the entire cluster, while disruptive, proved effective in stopping the cascading API calls and allowing the rate limit to recover.
  4. The Content Delivery Network – when activated – correctly cached sites and served this "stale" cache.
  5. The database infrastructure was completely unaffected, as it was pulled out of the Kubernetes cluster earlier this year, meaning no data loss on that end.

What could be improved?

  1. I lacked sufficient monitoring of the Hetzner API usage originating from within the running cluster, focusing too much on deployment usage.
  2. The default Hetzner API rate limit (3600/hour) is clearly insufficient for the operational needs of our cluster, especially during failure or recovery scenarios.
  3. The abstraction provided by hetzner-k3s hindered troubleshooting by masking specific API errors.
  4. The cluster's critical dependency on the Hetzner API for core functions (mainly networking) created a vulnerability when that API became unavailable.
  5. The recovery process for Longhorn after a complete, non-graceful cluster shutdown was complex and required manual intervention, extending the outage.

Future Prevention

Based on this incident, I am taking the following steps:

  1. Implement Hetzner API Monitoring: Set up specific monitoring to track API call rates originating from the Kubernetes cluster nodes themselves, providing early warning of high usage (see the sketch after this list).
  2. Increase API Rate Limit: I am in touch with Hetzner to increase the API rate limit.
  3. Reduce API Dependency: I want to investigate how I can reduce the reliance on a single API. hetzner-k3s makes Kubernetes on Hetzner remarkably simple, but if that convenience comes at this cost, I'd need to look elsewhere.
  4. Refine Storage Recovery Plan: Document and test a more robust procedure for recovering the Longhorn storage system after a full cluster shutdown scenario.
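
For the first item, the core is just sampling the same rate-limit headers shown earlier on a schedule and alerting well before the budget runs out. A minimal sketch of that idea – the threshold, interval, and alert hook are placeholders rather than the actual monitoring setup:

```python
import os
import time

import requests

HCLOUD_TOKEN = os.environ["HCLOUD_TOKEN"]  # assumed: a read-only project API token
ALERT_THRESHOLD = 1200  # placeholder: warn when a third of the hourly budget is left
CHECK_INTERVAL = 300    # placeholder: sample every five minutes


def remaining_requests() -> int:
    """Read the RateLimit-Remaining header from a cheap API call."""
    response = requests.get(
        "https://api.hetzner.cloud/v1/servers",
        headers={"Authorization": f"Bearer {HCLOUD_TOKEN}"},
        params={"per_page": 1},
        timeout=10,
    )
    # int(float(...)) in case the header value is not a whole number.
    return int(float(response.headers.get("RateLimit-Remaining", 0)))


def alert(message: str) -> None:
    # Placeholder: this would feed the existing monitoring/alerting stack
    # rather than print to stdout.
    print(f"ALERT: {message}")


while True:
    remaining = remaining_requests()
    if remaining < ALERT_THRESHOLD:
        alert(f"Hetzner API budget low: {remaining} requests left this hour")
    time.sleep(CHECK_INTERVAL)
```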

This outage was disruptive and stressful, both for me and undoubtedly for you. I want to apologise for the downtime and the lack of access to your Ghost sites during this period.

If you still see any issues or have questions, please send me a quick email at help@magicpages.co.

Jannis Fedoruk-Betschki
