Today we experienced the longest contiguous downtime in our service's 9-year history, lasting around 3 hours. The cause was a worldwide outage of the CloudFlare content delivery network, of which we are a customer. They have an incident report you can read here.
First of all, we would like to apologise for this downtime. We truly believe we have done everything we can to mitigate downtime, but eventually there is a single point of failure somewhere, and for us that is CloudFlare. Even if we ran our own DNS servers and nameservers, or owned and operated our own IP addresses and autonomous networks, eventually you have to rely on a third party somewhere that has the potential to go down.
The reason we chose CloudFlare to be our sole single point of failure is that the vast majority of our own customers use CloudFlare. Based on the metrics we have, around 80% to 95% of the websites that utilise our API are using CloudFlare. This means that if there is a CloudFlare outage, our own customers are likely experiencing the same outage on their own websites, which reduces the impact of our downtime.
We're one of the millions of websites that went down today, including OpenAI, Spotify, Uber, Twitter/X and even Downdetector.
There are ways in which we could utilise multiple content delivery networks. For example, we could use Microsoft Azure CDN or Amazon AWS CloudFront alongside CloudFlare, both of which also experienced hours-long downtimes in recent weeks. But using multiple CDNs at once simply moves the single point of failure higher up the chain, to the load balancer that chooses which CDN handles your traffic. If that load balancer were to have an outage instead of CloudFlare, our outage would no longer coincide with that of our customers who use CloudFlare and would thus have a larger impact.
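To illustrate the trade-off, here is a minimal sketch of what that selector layer might look like. The hostnames, health-check paths and selection logic are purely hypothetical and not our production setup; the point is simply that whatever component runs this logic becomes the new single point of failure.

```python
# Hypothetical multi-CDN selector (illustrative hostnames and logic only).
# Whatever runs this check is itself the new single point of failure.
import urllib.request

# Candidate CDN endpoints fronting the same origin, in order of preference.
CDN_HEALTH_CHECKS = [
    "https://cloudflare.cdn.example.com/health",   # primary: CloudFlare-fronted
    "https://cloudfront.cdn.example.com/health",   # fallback: CloudFront-fronted
]

def pick_healthy_cdn(timeout: float = 2.0) -> str | None:
    """Return the first CDN whose health endpoint answers 200, or None."""
    for endpoint in CDN_HEALTH_CHECKS:
        try:
            with urllib.request.urlopen(endpoint, timeout=timeout) as response:
                if response.status == 200:
                    return endpoint
        except OSError:
            continue  # unreachable or timed out, try the next CDN
    return None

if __name__ == "__main__":
    chosen = pick_healthy_cdn()
    print(f"Routing traffic via: {chosen or 'no healthy CDN available'}")
```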
We made all of these considerations and researched our options the last time we had a major CloudFlare outage, which lasted 38 minutes in 2019. We thought utilising multiple CDNs would be a simple solution, and we even trialled some, but ultimately we found that we were just trading one single point of failure for another, and that the impact on our customers would be larger if we didn't make CloudFlare alone our single point of failure.
We're writing this blog post with the detailed explanation above because we want to explain not only why we were down, but also what led to the decisions that resulted in us choosing CloudFlare in the first place, and more specifically why we keep them in that position within our infrastructure even though we know they're our single point of failure: of all the options available, CloudFlare has the lowest impact on our customers specifically.
We're sure that CloudFlare will publish a blog post of their own going into specific detail about this outage and what they'll change in the future. We will update this post with a link to their explanation when it's available.
If you would like to get in touch with us for any reason please feel free to use the contact page. Thanks for reading and have a great week.