Introducing a new North American node


Today we have introduced a new server called LUNAR, which replaces our previous server LETO. This new server is three times more performant than any of our previous North American servers, and it's physically identical to the new servers we've deployed in South Asia that we mentioned in our previous blog post.

We've been very pleased with the performance of those two new nodes, which is why we're moving forward with a North American server refresh based on the same hardware platform.

As mentioned above, we haven't merely added LUNAR to our current roster of servers; we've removed LETO and replaced it with LUNAR. We wanted to discuss why we did this in detail, as we think it may be interesting to others building resilient web services like we are.

Firstly, we should explain how LETO came to be and what platform it was based on. When we were seeking to add capacity in North America, we had a self-imposed mandate to acquire a server that not only offered high performance but was also hosted by a different company from the one we were mainly using, so that we could further diversify our infrastructure.

Just as we use three geographically separated data centres in Europe, we wanted the same kind of separation in North America. Finding a server that was fast enough, affordable enough and hosted by a company we weren't already using was quite difficult.

Adding to the pressure, we had traditionally shied away from cloud hosts for a multitude of reasons, including performance, cost and security. In other words, we only used bare-metal servers where we were the only tenant on the machine.

As our search continued, we came across a cloud host in the USA that was offering the latest generation of AMD EPYC-based virtual machines at a very attractive price. We decided to test one out and found the performance was very good. In fact, we were seeing 2x the performance of our other North American bare-metal servers for the same price.

So with performance, cost and diversification of our hosting all accomplished, we decided to bring LETO online. It was to be our first virtualised server node. For the first few months everything was very good, but then we ran into issues.

Firstly, we suffered from random shutdowns. Because we were hosting a virtual machine on someone else's physical infrastructure, from time to time they needed to shut down the hypervisor that runs the virtual machines to perform maintenance. Their maintenance windows never synchronised with our own, and we never received advance notice of when such maintenance would occur.

As a result, our LETO virtual machine would sometimes seemingly crash at random. Our systems take a while to shut down because of how much data they hold in memory, all of which needs to be committed to disk, and this 2-3 minute flushing period was too long for our virtual machine host.
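To make the shutdown problem concrete, here is a minimal sketch of the general pattern of flushing in-memory state to disk when a termination signal arrives. It is not our production code: the names `flush_to_disk`, `handle_sigterm` and `serve`, and the JSON snapshot format, are all assumptions made for illustration.

```python
import json
import os
import signal
import tempfile
import threading

# Set when the operating system asks the process to terminate.
stop_requested = threading.Event()


def handle_sigterm(signum, frame):
    # Do as little as possible inside the handler: just record the request.
    stop_requested.set()


def flush_to_disk(data, path):
    """Write `data` (hypothetical in-memory state) to `path` atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())  # ask the OS to push the bytes to disk
        # Rename only after the write completed, so an interrupted flush
        # leaves the previous snapshot untouched.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def serve(data, path, do_work):
    """Run `do_work` until SIGTERM arrives, then flush before exiting."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    while not stop_requested.is_set():
        do_work(data)
    flush_to_disk(data, path)
```

A handler like this only helps if the process is given enough time to finish the flush. If whatever is shutting the machine down stops waiting before the write completes, the in-memory changes since the last snapshot are lost, which is why writing to a temporary file and renaming it afterwards is a sensible precaution.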

The second and most pressing issue for us was rapidly degrading performance. We're not going to name and shame the host we were using because we know this isn't a problem unique to their service. The broad issue with virtualised infrastructure is that when you share a system with other people, you're at the mercy of what they're doing with their slice of the available compute resources.

When we first got our server, it was clear we didn't have many "noisy neighbours" sharing the resources with us, but over time, as more tenants moved in, their usage began to impact the performance of our virtual machine. Even as we scaled LETO up by doubling its resources (and its price), we were still seeing untenable performance regressions.
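For readers who want to look for noisy neighbours on their own virtual machines, one common signal on Linux guests is CPU "steal" time, the share of time the hypervisor spent running something else while the guest wanted the CPU. The sketch below is only an illustration and not necessarily how we measured the regressions on LETO; the function names are hypothetical, and it assumes a Linux guest that exposes the standard aggregate `cpu` line in `/proc/stat`.

```python
import time


def parse_cpu_times(stat_text):
    """Return (total_ticks, steal_ticks) from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Field order: user nice system idle iowait irq softirq steal ...
            # guest and guest_nice are already counted inside user and nice,
            # so only the first eight fields go into the total.
            total = sum(values[:8])
            steal = values[7] if len(values) > 7 else 0
            return total, steal
    raise ValueError("no aggregate cpu line found")


def steal_percent(before, after):
    """Percentage of CPU time stolen between two (total, steal) samples."""
    total_delta = after[0] - before[0]
    steal_delta = after[1] - before[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * steal_delta / total_delta


def sample_steal(interval=1.0, path="/proc/stat"):
    """Measure steal percentage over `interval` seconds (Linux only)."""
    with open(path) as handle:
        before = parse_cpu_times(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        after = parse_cpu_times(handle.read())
    return steal_percent(before, after)
```

Calling `sample_steal()` periodically and watching the trend over weeks is usually more telling than any single reading, since contention tends to creep up as more tenants arrive on the same host.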

That brought us to the decision to end the LETO experiment and go back to bare-metal infrastructure, where performance is consistent and predictable. So is stability, since we control the hardware fully and can choose when and how we perform maintenance.

As of right now, we no longer have any virtualised infrastructure driving our core services. We only use virtual machines for honeypots and other non-essential services.

Of course, we do think virtualisation has its place. We use it on our bare-metal servers for local software development, as one example. But for a service like ours that needs low latency, consistency and stability, it's just not a good fit at this time.

Thanks for reading and we hope you're all having a wonderful week!