Today, on 19 June 2024, from approximately 16:20 GMT to 17:30 GMT, Magic Pages had an outage that affected customer websites. I want to quickly share what happened and how Magic Pages plans to avoid this in the future.
What happened?
I recently switched the DNS handling of the magicpages.co domain from Cloudflare to Bunny.net's DNS service. The thought behind that was quite simple: Bunny already takes care of the content delivery network, so why not get the DNS routing from the same provider? Hopefully that would make things even faster.
The DNS routing ran through Bunny.net for the last three weeks or so. Everything was fine. However, yesterday I noticed that some CNAME records on the magicpages.co domain simply didn't resolve anymore, even though they were still set.
These CNAME records weren't mission-critical, but they were important: they related to email delivery, specifically DKIM signing with Magic Pages' email provider, AWS SES.
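In practice, "the records are there but don't resolve" is easy to verify from the outside. Here is a minimal sketch using dnspython; the record name is a made-up placeholder, since the real SES DKIM selectors are specific to each domain:

```python
# Minimal sketch: check whether a CNAME record actually resolves.
# Requires dnspython (pip install dnspython). The record name below is a
# made-up placeholder; real SES DKIM records use per-domain selectors.
import dns.resolver

record = "selector1._domainkey.magicpages.co"  # hypothetical example name

try:
    answers = dns.resolver.resolve(record, "CNAME")
    for rr in answers:
        print(f"{record} -> {rr.target}")
except dns.resolver.NXDOMAIN:
    print(f"{record} does not exist according to the resolver")
except dns.resolver.NoAnswer:
    print(f"{record} exists, but returned no CNAME answer")
```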
After some trial and error, I figured out that the issues seemed to stem directly from the way Bunny handles these records. So, the next step was simple: switch back to Cloudflare.
Since Cloudflare automatically imports existing DNS records, the switch back should have been quite smooth. Unfortunately, Cloudflare kinda screwed it up, and I didn't notice soon enough.
The core issue was that Cloudflare enabled its own CDN/caching proxy on the imported records. That majorly interfered with the content delivery network I have running with Bunny, as well as the SSL/TLS certificates that were issued for different magicpages.co subdomains.
The Ghost sites running on the Magic Pages infrastructure weren't directly affected by that. Just indirectly: the database server runs on a subdomain of magicpages.co, and all of a sudden that wasn't reachable anymore.
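To make that a bit more tangible: when a record is proxied through Cloudflare, DNS lookups return Cloudflare's own edge IPs instead of pointing at the actual target, in this case Bunny's CDN and the other services behind those subdomains. A quick way to spot that is to check the resolved addresses against Cloudflare's published IP ranges. A rough sketch, with a hypothetical hostname and only a small subset of those ranges:

```python
# Sketch: detect whether a hostname currently resolves to Cloudflare edge IPs,
# which is what happens when the record is "proxied" (orange cloud).
# The hostname is a placeholder; the ranges are a small subset of Cloudflare's
# published list at https://www.cloudflare.com/ips/.
import ipaddress
import socket

CLOUDFLARE_RANGES = [
    ipaddress.ip_network("104.16.0.0/13"),
    ipaddress.ip_network("172.64.0.0/13"),
    ipaddress.ip_network("173.245.48.0/20"),
]

def points_at_cloudflare(hostname: str) -> bool:
    """Return True if any resolved address falls inside a Cloudflare range."""
    addresses = {
        info[4][0]
        for info in socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    }
    return any(
        ipaddress.ip_address(addr) in net
        for addr in addresses
        for net in CLOUDFLARE_RANGES
    )

print(points_at_cloudflare("cdn.magicpages.co"))  # hypothetical subdomain
```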
What did I do to fix this?
To fix this, the plan was simple: re-add the proper DNS records on Cloudflare and deactivate the caching. That was done fairly quickly. The only thing left to do was to restart all Ghost websites.
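For the curious, "deactivating the caching" here means flipping the affected records to "DNS only" instead of proxied, which can also be scripted against the Cloudflare API. A rough sketch; the token, zone ID, and record values are placeholders, not the actual setup:

```python
# Sketch: create a DNS-only (unproxied) CNAME record via the Cloudflare API.
# API_TOKEN, ZONE_ID and the record values are placeholders for illustration.
import requests

API_TOKEN = "..."   # a scoped Cloudflare API token
ZONE_ID = "..."     # the zone ID of magicpages.co

response = requests.post(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/dns_records",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "type": "CNAME",
        "name": "db.magicpages.co",        # hypothetical subdomain
        "content": "some-target.example.com",
        "ttl": 300,
        "proxied": False,                  # "DNS only", no Cloudflare CDN/caching
    },
    timeout=10,
)
response.raise_for_status()
print(response.json()["result"]["id"])
```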
And the restart is where things went south.
The Magic Pages infrastructure is built on multiple servers. Some of them orchestrate sites (managers), others run the sites (workers). The managers and workers know each other's hostnames (e.g. manager-X.magicpages.co) and communicate through them. And maybe you can already see where this is going…
These hostnames had the same problem as the database server: Cloudflare hadn't properly imported the DNS records, so the servers didn't know where to send their requests.
Since the servers all depend on each other (one server can be offline at any time, but all of them being "disconnected" at the same time is kind of an issue), fixing it took quite some time.
But even when they were all online again, the DNS routing still didn't work 100%, since Cloudflare still cached some of the records.
The solution: switching back to Bunny's DNS service, since I knew that it worked.
Once I had done that, things were up and running within a minute. And I'll simply ignore the email issue for now (there is a workaround I can use, so you'll still get emails from me, no worries).
What will I do to avoid that in the future?
The key issue here was the reliance on public hostnames that depend on external DNS. That was a big lesson.
While the infrastructure is now set up as a "high availability" cluster, that isn't very useful if there is still one single point of failure: DNS.
That point of failure needs to go away. So, my next step is to make the communication between the individual servers independent of DNS resolution of their hostnames by implementing an internal network.
This way, the servers will still be able to communicate, even if the magicpages.co domain is gone completely. The Magic Pages website might then be down, but all customer websites will keep running. Success!
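To give a rough idea of what "independent of DNS" means here: the servers would reach each other through addresses on a private network, looked up from something static rather than from public DNS records. A conceptual sketch only; all names and addresses below are made up:

```python
# Conceptual sketch: resolve cluster peers from a static map of private
# network addresses instead of public DNS. Names and IPs are made up.
PRIVATE_PEERS = {
    "manager-1": "10.0.0.11",
    "worker-1": "10.0.0.21",
    "worker-2": "10.0.0.22",
}

def peer_address(name: str) -> str:
    """Look up a peer's private IP; no public DNS involved."""
    try:
        return PRIVATE_PEERS[name]
    except KeyError:
        raise LookupError(f"unknown peer: {name}") from None

print(peer_address("manager-1"))  # -> 10.0.0.11
```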
Implementing this will be fairly easy, so I expect it to be done within a few days.