1 year on: Evaluating our move to AMD processors


One year ago in this blog post we shared the news that our infrastructure was changing: we would begin transitioning our servers from processors made by Intel to processors made by AMD.

Specifically, we would be using the Zen 3 based Ryzen 9 5950X 16-core processor from AMD. Before we detail the results of this transition, we wanted to mention that long before we decided to use AMD we had customers emailing us to advocate for AMD's products. This was a common occurrence any time we announced the deployment of new Intel based systems.

The tech community's enthusiasm for AMD has been growing since the first generation Zen products were released in 2017. We too had been monitoring AMD's many successes since the launch of Zen 1, and we consistently looked at the options available to us from our hosting providers.

When the Ryzen 9 5950X became available it offered everything we were looking for, from core and thread counts to clock speeds and ECC memory support, so we knew it was time to transition away from Intel. As luck would have it, our main European host was offering 5950X based servers in all their datacenters.

Concerns we had moving to AMD

But changing processor vendors isn't without risk. We're trusting that AMD maintains full compatibility with the x86_64 instruction set and that all of our software will work and be essentially vendor-agnostic. If Intel were to become competitive in the future we may need to transition back to their products, and we would want to do that without rewriting all of our software.

The final concerns we had were regarding the chiplet architecture of the Zen based processors, the performance of the memory controller and overall system stability, as these Ryzen 9 processors are not validated for 24/7 usage as server processors.

Potential software compatibility problems

Our operating system on these servers is Windows Server 2022. On our previous Intel machines, it was a mix of Windows Server 2012 R2 and Windows Server 2022 as we were transitioning from 2012 R2 due to Microsoft ending support for it in late 2023.

When it comes to compatibility with our operating system there is only good news to report. The process scheduler of Windows Server 2022 is top-notch and fully understands the cache and core hierarchy of Zen 3 based microprocessors and can allocate program threads properly. Everything works and is stable.

Our software stack is mostly PHP, a mix of PHP 8.1 and 8.2. We also have some C and C++ auxiliary programs. All of our code and third party software just worked; we didn't need to recompile anything to increase compatibility or improve performance.

Hardware issues, stability problems and solutions

So that's the software side out of the way. When it comes to hardware, you may note above that we had some concerns about the memory controller, and here we did encounter an issue. All of our servers were equipped with 128GB of memory laid out as 4 x 32GB Dual-Rank 3200MHz ECC DDR4 UDIMMs from Samsung.

The CPU's memory controller and the memory modules we had installed are all rated to operate at 3200MHz, but when operating at this speed (which by the way is JEDEC certified) we had instability on two of our four servers. This instability manifested as complete system lockups and crashes after several days of uptime.

Thankfully, due to our cluster architecture, these crashes did not negatively affect the service and we didn't experience any downtime, but it was a big problem that had to be resolved. In the end, we were able to make all our systems stable by changing the memory frequency from 3200MHz to 2666MHz, which appears to be the recommendation when populating all four slots on a Ryzen motherboard with dual-rank memory modules.

Our opinion on this is that AMD's memory controller, at least on this processor, is weak. In our experience, Intel's memory controllers have been much more stable at higher frequencies, both when using multi-rank modules and when using multiple modules per memory channel.

Reducing the frequency of our memory modules did reduce our theoretical memory bandwidth by roughly 16.7%. This is frustrating, as we are paying for much more expensive memory modules that are running at a lower speed than they should be, but stability is paramount and not something we're willing to compromise on. We've experienced no more instability since making this configuration change.
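
For readers who want to check the bandwidth figure, here is a rough sketch of the arithmetic. It assumes the usual dual-channel DDR4 layout of the 5950X with a 64-bit data bus per channel, and it gives theoretical peak numbers only, not anything we measured. The function names are made up for this illustration.

```python
# Theoretical peak DDR4 bandwidth: transfers per second x bytes per transfer x channels.
# Assumptions (not measurements): 2 memory channels, 64-bit (8-byte) bus per channel.

def peak_bandwidth_gbs(transfer_rate_mts: float,
                       channels: int = 2,
                       bus_width_bits: int = 64) -> float:
    """Return the theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer * channels / 1e9


def percent_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, as a percentage."""
    return (before - after) / before * 100


rated = peak_bandwidth_gbs(3200)    # the speed the modules are rated for
reduced = peak_bandwidth_gbs(2666)  # the speed we settled on for stability
loss = percent_reduction(rated, reduced)

print(f"Rated:   {rated:.1f} GB/s")
print(f"Reduced: {reduced:.1f} GB/s")
print(f"Loss:    {loss:.1f}%")
```

Under these assumptions the drop works out to a little under 17%, in line with the figure above. The real-world impact depends on how memory-bound a workload is.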

Performance

The performance has been exactly as the graphs from our initial announcement showed: wildly faster than our previous setup, with a single one of these servers eclipsing all our other servers combined in not just CPU performance but storage I/O too.

When we switched processors we also eliminated the remaining hard disk drives in our cluster by changing to enterprise U.2 NVMe flash drives from Samsung. These SSDs have been incredible for consistency, as one of the issues we had with our previous architecture was the I/O wait times on the hard disk drives. We tried to mitigate that by using in-memory caching as much as possible, but there are limits to how much data you can store in memory.

The 128GB of memory that we installed hasn't been that impactful from a quantity perspective. We knew that was likely to be the case before we migrated, because our older servers had much less memory and we have always been frugal with what we had.
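
To illustrate the trade-off with in-memory caching mentioned above, here is a minimal sketch of a size-bounded cache that evicts the least recently used entry once it is full. This is not our production code (which is mostly PHP); the class name, the entry limit and the `load` callback are all hypothetical, and it only shows why a cache with a memory ceiling can't hide slow disks completely.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class BoundedCache:
    """A tiny least-recently-used cache with a fixed number of entries."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._items: OrderedDict = OrderedDict()

    def get(self, key: Hashable, load: Callable[[Hashable], object]) -> object:
        """Return the cached value, calling `load` (the slow disk read) on a miss."""
        if key in self._items:
            # Cache hit: mark the entry as the most recently used.
            self._items.move_to_end(key)
            return self._items[key]

        # Cache miss: this is where the I/O wait is paid.
        value = load(key)
        self._items[key] = value
        if len(self._items) > self.max_entries:
            # Over the limit: drop the least recently used entry.
            self._items.popitem(last=False)
        return value


disk_reads = []

def slow_read(key):
    # Stand-in for a hard disk read; records each time it is called.
    disk_reads.append(key)
    return f"data for {key}"

cache = BoundedCache(max_entries=2)
for key in ["a", "b", "a", "c", "b"]:
    cache.get(key, slow_read)

print(disk_reads)
```

Once the working set is larger than the cache, evicted entries have to be read from disk again, so the latency of the underlying storage still matters.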

Lessons for the future and would we go AMD again

The main thing we learned is to scrutinise the spec sheets. The RAM frequency issue with dual-rank modules took us by surprise. Thankfully it was an easy thing to resolve, though diagnosing it as the root cause wasn't so easy.

Since we deployed these 5950X based systems we've also acquired a Zen 3 based 24 core EPYC system with 256GB of memory. This is being used for local research and development, running many virtual machines and a mirror of our live physical infrastructure. We're very happy with this system and indeed all of the AMD systems we operate.

For the foreseeable future we see ourselves buying AMD based systems exclusively, and looking at the state of Intel's product roadmap this seems likely to hold true through to 2026. AMD has recently launched Zen 4 based processors, a high-performance range of products utilising newer DDR5 memory. Although we don't have any of these yet, we could see ourselves deploying one in the near future, either for local development or as a remote node in our cluster.

We hope this post was interesting. We have been asked by a few customers how the upgrade went in specific detail, and we hope this answers those questions. Thanks for reading and have a great weekend!

