Re-architecting Software the Right Way


In computing, there is strong resistance to complete rewrites, and for many good reasons, including cost, time, and the potential for new bugs and regressions in functionality.

Instead, the software industry prefers to do what we call refactoring, where you take existing code and improve it through many small changes over a long period, so that each change can be implemented more easily and its results measured in a controlled manner. In short, it's fast, it's cheap, it gets results and it lowers the potential for problems.

But now and then it may be necessary to completely re-architect a system, and there can be many reasons: you need more performance, you need the code to scale better across more computing resources, new hardware has arrived that runs the old code poorly or not at all, or new libraries, operating systems or execution environments are incompatible with your old code, and so on.

Any of these conditions can precipitate a rewrite.


This is exactly where we found ourselves this year with some of our backend software written since 2017. This software was designed for a specific operating system (Windows Server), which meant we had leveraged Microsoft-specific operating system functions that made our code incompatible with Linux.

We had also designed this software around certain hardware that we had access to at the time, which meant low-core-count processors and slow storage on hard disk drives. As a result, a lot of our backend systems were conservative in how they used the hardware: mostly single-threaded operations and serial data access.

Our current servers average 23.5 CPU threads, whereas when most of the code we're discussing was created, our biggest and best server had only 8. And when it comes to storage, we used to use HDDs that could manage only 800 IOPS; now we're dealing with NVMe SSDs that can handle 1.2 million IOPS.

Wanting our code to be processor-independent and operating-system agnostic while taking better advantage of our latest (and future) hardware necessitated some rewrites. We could certainly refactor some of our old code, and in many smaller cases that is what we did, but for the biggest pieces rewrites were the right way to go.

So how do you re-architect your software the right way?

Firstly, you need to do a full code review of the system you're going to rewrite. This includes reading all of the code, then understanding and documenting every task the code performs and why it performs it. This is paramount because otherwise you will forget to carry functionality from the previous iteration into the rewrite.

Secondly, you want to identify all the deficient parts of the code that you want to improve upon. This could be simply messy or unmanageable code, or code that performs poorly or doesn't meet your goals for now or the future (such as tying you to a particular operating system).

Thirdly, you want to plan how you intend to improve the code to meet the goals you have for the new program. For us that mostly entailed making things multithreaded, making better use of our storage systems' I/O capabilities, and avoiding Windows-specific features or functions that wouldn't be available on Linux. The culmination of all this work is added scalability and flexibility.

Fourth and finally, you write and test the code. We did a lot of testing during development to test our many hypotheses, and this informed our design process. As we learned the capabilities of certain approaches, our design decisions changed along the way.

So let us go over some of these.

1: About two years ago Microsoft ended support for WinCache, an in-memory data store for PHP. We made extensive use of it, so we had to build a replacement, which we call ramcache. It performs the same role and re-implements all of the WinCache functions, and we were also able to extend the functionality: WinCache had an 85MB memory limit, for example, whereas our ramcache has no such limit. We also made it operating system agnostic, meaning it will run on anything, including Linux. A sketch of the compatibility-shim idea follows below.
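To illustrate the compatibility angle, here's a minimal sketch of how a WinCache-style shim can front a replacement store. The wincache_ucache_* names and signatures match the real WinCache extension, but the RamCache class behind them is a hypothetical stand-in, not our actual implementation.

```php
<?php
// Minimal sketch of a WinCache-compatible shim. Only the
// wincache_ucache_* names come from the real WinCache extension;
// the RamCache class is a hypothetical stand-in.
if (!function_exists('wincache_ucache_set')) {
    function wincache_ucache_set(string $key, $value, int $ttl = 0): bool {
        return RamCache::instance()->set($key, $value, $ttl);
    }
    function wincache_ucache_get(string $key, ?bool &$success = null) {
        $value = RamCache::instance()->get($key); // null on miss
        $success = ($value !== null);
        return $value;
    }
    function wincache_ucache_delete(string $key): bool {
        return RamCache::instance()->delete($key);
    }
    function wincache_ucache_exists(string $key): bool {
        return RamCache::instance()->get($key) !== null;
    }
}
```

With a shim like this, code written against WinCache keeps working unmodified while the storage behind it becomes portable across operating systems.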

2: Webhooks. We use a lot of webhooks, mainly for payment-related events, such as sending you an email when a payment is declined or due soon, but we also use webhooks for what we consider time-critical events, such as when you make a change in your Dashboard that must be propagated to all cluster nodes very quickly.

3: Our database synchronisation system. At one time, the processor usage caused by synchronising our databases was as high as 70%. This scaled back as a percentage when we upgraded to systems with faster processors, but it was still very high, and we saw the usage steadily increasing as our customers generated more data per minute. To handle this traffic we developed a new process called Dispatcher, which dramatically reduced processor usage to just 0.1%.

4: Cluster management, node health monitoring & node deployment. Before our rewrite, this heavily relied on Microsoft-specific operating system features, especially the node health monitoring and node deployment features. We've now rewritten all of these to also be operating system agnostic and processor independent.

So let's look at some net results. Previously, our webhooks (where one server sends out a small update to one or more servers in the cluster) took an average of 6 seconds for a full cluster-wide update. The new code, which is multithreaded when it comes to network usage, has reduced this time to just 0.3 seconds. That's a 20x performance improvement.
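To give a flavour of how that kind of parallel network fan-out works in PHP, here's an illustrative sketch using curl_multi to post a webhook to every node at once rather than one at a time. The node URLs and payload shape are hypothetical; this isn't our production code.

```php
<?php
// Illustrative sketch: fan a webhook out to all cluster nodes in
// parallel with curl_multi instead of posting to each node serially.
function broadcast_webhook(array $nodeUrls, array $payload): void {
    $multi = curl_multi_init();
    $handles = [];
    $body = json_encode($payload);

    foreach ($nodeUrls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => $body,
            CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 5,
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[] = $ch;
    }

    // Drive every transfer concurrently until all have completed.
    do {
        $status = curl_multi_exec($multi, $active);
        if ($active) {
            curl_multi_select($multi);
        }
    } while ($active && $status === CURLM_OK);

    foreach ($handles as $ch) {
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
}
```

With concurrent transfers, the total wall-clock time becomes roughly that of the slowest single node rather than the sum of all of them, which is the general principle behind an improvement of that magnitude.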


When it comes to Dispatcher, this was a ginormous change for us, as everything that keeps our servers in sync with one another used our previous system. The old system was so encompassing it didn't even have a name, because it wasn't thought of as one specific object; its code was interspersed with so many other functions and features that it was almost omnipresent throughout our codebase.

This has all changed with Dispatcher, which provides a standardised interface for reading from and writing to our databases, and a framework for our cluster nodes to share data in the most passive (and thus least resource-intensive) way possible: node updates are packaged up, and a single master node for each geographical region is temporarily selected as the collector, processor and distributor of database updates.

You can think of Dispatcher a lot like a train network. Each node operates its own train that constantly travels around the track to all the other nodes, picking up data. Master nodes pick up data from non-masters, process it and carefully decide where in the database it should be inserted. The master then repackages it and presents it to any trains that come by from non-masters, which pick up those updates.


Every few minutes the nodes hold a vote, and the node with the most free resources and best uptime is chosen to act as the master for that geographical region. We prevent conflicts between multiple master nodes distributing updates simultaneously through clearly defined containers, merge-conflict resolution and a database built around a structured one-minute timetable.

Each minute in the real world is accounted for in the database, with a master node attached to it for that specific minute and geographical region. That node, and only that node, can perform maintenance and alterations, unless the other nodes agree to remove it from that minute and assign another node, allowing that node to become the master over the minute. Any node can read from a minute, but only its master can perform alterations and writes.
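Conceptually, the write gate is very simple. Below is a small sketch of the idea; the ledger structure and function names are illustrative rather than Dispatcher's actual interface.

```php
<?php
// Conceptual sketch of the per-minute ownership check described above.
// $minuteLedger maps region => minute timestamp => assigned master node.
function canWriteMinute(array $minuteLedger, string $nodeId, string $region, int $timestamp): bool {
    // Normalise the timestamp to the start of the minute it falls within.
    $minute = $timestamp - ($timestamp % 60);

    // Only the master assigned to this minute and region may alter it.
    return ($minuteLedger[$region][$minute] ?? null) === $nodeId;
}

// Reads need no ownership check: any node may read any minute.
```

Because ownership is scoped to a single minute and a single region, two masters can never write to the same slice of the database at the same time, which is what keeps merge-conflict resolution tractable.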

What all this results in is a high-performance database synchronisation system that scales to however many server nodes we have, and most importantly does so with the least possible processor and storage burden.

While performing all these code rewrites and refactors, we also investigated new technologies and tuned the execution environments for the code we author. To that end, we upgraded to the latest PHP v8.2.6 across our entire service, including all our webpages. During this upgrade we also enabled the JIT compiler (present since PHP v8), as we saw massive improvements in page load times across the site with no regressions.
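For anyone wanting to try the JIT themselves, the relevant php.ini switches look something like the snippet below. The buffer size is an example value, not our exact production configuration.

```ini
; Illustrative php.ini settings for enabling the JIT (available since PHP 8.0).
opcache.enable=1
opcache.jit=tracing          ; the tracing JIT, an alias for 1254
opcache.jit_buffer_size=128M ; the JIT stays off unless this is non-zero
```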

So that's the update for today. We hope you enjoyed this look at what we've been up to.

Thanks for reading and have a wonderful week.


Weekend Topic: Why we use a monolithic architecture


For this blog post, we would like to go over our architecture design for proxycheck.io and explain some of the decisions we've made along the way in building the service. To start: what exactly is a monolithic architecture, and what is the alternative approach?

Monolithic in software pretty much means running all your services on one or more beefy servers, as opposed to breaking your services out into what are commonly referred to as microservices and having them distributed across many smaller servers, or even running them on what is known as "serverless" or "edge" computing infrastructure. The idea behind the microservices approach is that you remove a lot of overhead, like managing an operating system; you instead manage only the specific application you've developed.

The other benefit is that you can scale microservices horizontally, meaning if you need more resources for an application you can simply spin up another copy of the microservice on another system and load balance between them.

This approach, however, does have some caveats. As the number of microservices you have increases, so does the volume of network activity between all the services in your infrastructure. After all, each service needs to obtain, process and share data with the rest of your infrastructure, and the more servers there are sharing that burden, the more there is to keep synchronised.

Database traffic is often overlooked when people turn to these services, but it can become so substantial that you cannot expand horizontally anymore because there aren't enough resources to keep all your services synchronised. In addition to this complexity, there is a creeping increase in cost from all this overhead, which can overshadow the costs you initially expected to incur for the resources you're using to serve customers.

Some good examples of services that moved from microservices to monolithic would be Dropbox, or even Prime Video, which recently shared an interesting blog post about how they reduced their costs by 90% by moving from microservices to a monolithic architecture. And yes, that is Amazon's Prime Video, who were using Amazon's own AWS services to operate their microservices.

To quote Amazon's Prime Video:

"Moving our service to a monolith reduced our infrastructure cost by over 90%. It also increased our scaling capabilities. Today, we’re able to handle thousands of streams and we still have capacity to scale the service even further."

So not only did it save them money, it also increased their ability to scale and helped them support more users with fewer servers.

We have used a monolithic architecture since the very beginning because, although we identified the benefits of microservices, and specifically the use of AWS's EC2 and Azure clouds to scale rapidly, we also identified many drawbacks. Performance for these services on an individual level is not high; that is to say, individual requests perform poorly.

To put it another way, the microservices approach is akin to flying 2,000 hot air balloons instead of 2 jumbo jets. Sure, you can carry double the number of people across those hot air balloons, but the time it takes them to reach their destination will be much longer.

And that was, and continues to be, the crux of the microservices model that has kept us not only on our monolithic trajectory but on our bare-metal one too. When we rent servers we are the only tenant and we get to pick the hardware; we often pick the fastest hardware available, and we have been replacing our older servers with new ones that offer 3x to 4x their performance.

Meanwhile, if you look at the past 5 years of "serverless" computing like EC2, the performance has remained pretty much the same, driven by service providers' desire to maximise the number of customers per unit of available compute resource.

To us, speed matters. If you compare, for instance, our customer dashboard to those of companies that use cloud providers and microservices, you'll find ours loads instantly and populates with data in the blink of an eye, while even some of the largest companies, like OVHCloud, have you sit for upwards of 10 seconds while their customer dashboards populate with information.

Now, we don't think that microservices have no use at all. There are certainly workloads that benefit from this approach, especially data processing that needs a lot of workers and doesn't need instantaneous results, and any workload that can be accelerated by dedicated fixed-function silicon, for example video transcoding, network encryption/decryption and packet routing. All of these tasks make sense for the horizontal growth that serverless/microservices can provide.

But for anything customer-facing where speed and latency are paramount, we just don't see the same benefits: users get frustrated waiting for things to load, the overall performance of the service isn't great, the costs can spiral out of control and the overhead of data synchronisation can be crippling.

We hope this was interesting. We wanted to go a bit more in-depth on this topic because our recent infrastructure posts spurred some customers to message us and ask why we don't use cloud providers and instead continue to use bare metal.

Thanks for reading and have a wonderful weekend.


New North American Nodes


This has been a month full of new servers and today is the last announcement we have regarding servers, we promise.

Over the past day we brought online two new high-end servers, Jupiter and Saturn, within our North American service region.

These are now the highest-performing servers we have serving that region, and each of them is 6.27x more performant than our previous nodes, excluding Lunar, which was added last week and is itself very high-end.

To put the total performance upgrade in perspective, our three new servers (Lunar, Jupiter and Saturn) provide us with 100,000 units of performance, compared with 25,000 units for our old servers (Leto, Cronus, Metis and Nyx). That's a straight 4x performance uplift while moving from four servers to three.

However, we will not be saying goodbye to Cronus, Metis or Nyx just yet, because we have leases on those servers that expire between July and September. Until then, we've tweaked our load balancer so that Cronus, Metis and Nyx handle 25% of North American traffic while our three new servers, Lunar, Jupiter and Saturn, carry the other 75%.

This upgrade is not just to give us breathing room to grow in the future but also to accommodate some very large customers we've picked up since the start of this year. We've seen our North American traffic go from around 35% of our daily mix (Europe being the rest) to 60%. And with the new Asian servers we introduced earlier in the month, our European load has fallen a bit, as some of the Asian traffic our European servers would otherwise handle now goes to those new dedicated servers in the Asian region.

So in short, we needed to bolster our American infrastructure. In addition to this change, we are also making changes to our per-second request limits. Before today, you could make 100 requests per second to any server before receiving a warning, and 125 requests per second before having your requests denied for up to 10 seconds.

Due to our new increased hardware performance, our planned reduction in the number of servers we have (while making each server more powerful) and the very large customers we've been acquiring, we've decided to raise the limits to 175 for the warning and 200 for the hard limit. The limiter still only looks at the previous 10 seconds of your request volume, so you can still exceed these limits if the burst is brief enough.

Essentially, the true limit will be 2,000 requests over 10 seconds, whether all those queries were made in a single second or spread out over the full 10-second period. This is up from the 1,250 limit we imposed previously. And remember, this is per server. So for our North American servers, once we're down to 3 servers (from the current 6), that limit will become 6,000 requests over 10 seconds instead of 3,750.
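For the curious, a limiter like the one described can be sketched as a sliding window over recent request timestamps. This is a simplified illustration using in-process storage, not our actual limiter, which would need a store shared across worker processes.

```php
<?php
// Simplified sketch of a 10-second sliding-window limiter matching the
// numbers described above: 2,000 requests per rolling 10 seconds.
class SlidingWindowLimiter {
    private array $hits = []; // client key => list of request timestamps

    public function __construct(
        private int $limit = 2000,
        private int $windowSeconds = 10
    ) {}

    public function allow(string $clientKey): bool {
        $now = microtime(true);
        $cutoff = $now - $this->windowSeconds;

        // Drop timestamps that have aged out of the rolling window.
        $this->hits[$clientKey] = array_values(array_filter(
            $this->hits[$clientKey] ?? [],
            fn ($t) => $t > $cutoff
        ));

        if (count($this->hits[$clientKey]) >= $this->limit) {
            return false; // over the hard limit, deny the request
        }

        $this->hits[$clientKey][] = $now;
        return true;
    }
}
```

Because the window slides rather than resetting on fixed boundaries, a brief burst is forgiven as soon as it ages past the 10-second mark, matching the behaviour described above.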

These new raised limits will also help with our new South Asia region, where we only have two servers, although the two combined are the equivalent of 8 of our last-generation servers in performance, so they can handle the increased request limits with ease.

We know discussing infrastructure as often as we do is not the norm. Most services prefer to handle these kinds of changes quietly and not share them publicly. We don't want to do that, because we feel it's an interesting part of building any business, and we want you to know that your paid subscriptions are being used to improve the infrastructure that ultimately delivers the service you're using. These new servers won't just increase the volume of traffic we can handle; they'll also lower processing latency, so everyone who uses our service will benefit too.

Thanks for reading and have a great week!


Introducing a new North American node


Today we have introduced a new server called LUNAR, which is replacing our previous server LETO. This new server is three times more performant than any of our previous North American servers, and it's physically identical to the new servers we've deployed in South Asia that we mentioned in our previous blog post.

We've been very pleased with the performance of those two new nodes and that's why we're moving forward with a North American server refresh based on the same hardware platform as those nodes.

As mentioned above, we haven't merely added LUNAR to our current roster of servers; we've removed LETO and replaced it with LUNAR. We wanted to discuss why in detail, as we think it may be interesting to others building resilient web services like we are.

So firstly, we should explain how LETO came to be and what platform it was based on. When we were seeking to add capacity in North America, we had a self-imposed mandate to acquire a server with not just high performance but one hosted by a different company from the one we were mainly using, so that we could further diversify our infrastructure.

Similar to how we use three geographically separated data centres in Europe, we wanted the same in North America. Finding a server that was fast enough, affordable enough and hosted by a company we weren't already using was quite difficult.

Adding to the pressure was the fact that we had traditionally shied away from cloud hosts for a multitude of reasons, including performance, cost and security. To put it another way, we had only ever used bare-metal servers where we're the only tenant on the machine.

As our search continued, we came across a cloud host in the USA that was offering the latest generation of AMD EPYC-based virtual machines at a very attractive price. We decided to test one out and found the performance was very good; in fact, we were seeing 2x the performance of our other North American bare-metal servers for the same price.

So with performance, cost and diversification of our hosting all accomplished, we decided to bring LETO online as our first virtualised server node. And for the first few months everything was very good, but then we ran into issues.

Firstly, we suffered from random shutdowns. Since we were hosting a virtual machine on someone else's physical infrastructure, from time to time they needed to shut down the hypervisor that runs the virtual machines to perform maintenance. Their maintenance windows never synchronised with ours, and we never received advance notice of when such maintenance would occur.

And so our LETO virtual machine would sometimes seemingly crash at random. Our systems take a while to shut down due to how much data they hold in memory that needs to be committed to disk, and this 2-3 minute flushing period was too long for our virtual machine host.

The second and most pressing issue for us was rapidly degrading performance. We're not going to name and shame the host we were using, because we know this isn't a problem unique to their service. The broad issue with virtualised infrastructure is that when you share a system with other people, you're at the mercy of what they're doing with their slice of the available compute resources.

When we first got our server, it was clear we didn't have many "noisy neighbours" sharing the resources with us, but over time, as more tenants moved in, their usage began to impact the performance of our virtual machine. Even after we scaled LETO up by doubling its resources (and price), we were still seeing untenable performance regressions.

And that brought us to the decision to end the LETO experiment and go back to bare-metal infrastructure, where the performance is always consistent and predictable, as is the stability, since we fully control the hardware and can choose when and how we perform maintenance.

As of right now, we no longer have any virtualised infrastructure driving our core services; we're only using virtual machines for honeypots and other non-essential services.

Of course, we do think virtualisation has its place; we use it on our bare-metal servers for local software development, as one example. But for a service like ours that needs low latency, consistency and stability, it's just not a good fit at this time.

Thanks for reading and we hope you're all having a wonderful week!


Introducing South Asian point of presence


Today we're introducing two new high-spec server nodes within the South Asian region. For those who have followed our blog since 2021, you'll know this has been a long time coming, as we have purchased multiple servers as test platforms for the Asian region without ever deploying them in our cluster. All of the servers we tried until now failed our testing for one reason or another.

Today that has finally changed, as we've deployed two very fast servers to the region for our substantial and growing Asian customer base. This means their queries no longer head to Europe, which incurred a latency penalty of around 500ms.

During our testing, we've seen latency results of 30ms for India, 60ms for Singapore, Indonesia, Vietnam & Malaysia, 80ms for South Korea and 100ms for Japan. These new servers will now handle traffic for all of these countries.

They'll also be handling traffic for the Oceania region which includes Australia and New Zealand as well as certain Pacific islands that are closer to our new South Asian nodes than our North American nodes.

The names we've chosen for the new servers are Pulsar and Quasar. This continues our space theme for naming servers, which we adopted after exhausting most of the Greek Titans. Luckily, most of the Titan names we had already chosen also have space phenomena named after them, so they still fit well.

In other infrastructure news, we boosted our North American capacity by 25% earlier this month by doubling the performance of our LETO node. However, this is only a stopgap measure for the growing traffic being generated there, and we intend to replace several North American nodes later in the year.

We've seen enormous growth in North America over the past two years, and on some days it now eclipses our European traffic, which necessitates an infrastructure refresh like the one we performed on our European servers just over a year ago.

So that's the news for today. We hope that if you have servers in the Asian or Oceanic regions you'll appreciate the lower latency and higher throughput now available to you. Thanks for reading and have a wonderful weekend.


Data improvements between March and April


Today we wanted to detail a few significant data changes we've made over the past two months and how they impact the data we serve to you through our API.

Firstly, we've seen a large uptick in residential proxies being used around the internet to scrape websites and perform exploits. Residential proxy networks have outgrown the onion network (TOR) because their users are paid to participate, which differs from the free model that TOR uses.

On TOR, anyone with an internet connection can launch what is called an exit node, and others can use it to proxy their internet traffic for free. We've detected TOR exit nodes right from the beginning. But now, with money involved, these residential proxy networks are growing exponentially. We've managed to find flaws in a few of them, which we've used to list their networks on our API.

In March we added the networks of two of the largest ones. Many customers had emailed us about the fact that we failed to detect a lot of these networks, and so we became aggressive in our pursuit of their network nodes.

To put the scale of these networks in context, in one case we were able to list 15,000 of a network's 17,500 nodes on our API. That one network alone is four times the size of TOR. And while this has pleased the customers who asked for better indexing of these networks, it has come at a cost: false positives.

The reason for the increase in false positives is that these networks rely on users sharing their home and mobile connections, and these are often dynamic. A single subscriber may change IP address upwards of 20 times per day in some circumstances, which means that unless we're constantly evicting addresses from our database, we're going to have false positives.

To work around this problem, we've begun evicting addresses at a much faster rate than we otherwise would, sometimes retaining them for as little as 10 minutes depending on how dynamic we believe the addresses are. And addresses we don't see multiple times within these proxy networks are evicted from our data within an hour.
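In rough terms, the policy amounts to assigning shorter lifetimes to addresses we believe are more dynamic. The sketch below illustrates the shape of such a rule; the thresholds and names are illustrative, not our exact values.

```php
<?php
// Illustrative sketch of dynamism-based eviction: the more dynamic an
// address appears to be, the shorter the time-to-live it receives.
// Thresholds here are examples, not production values.
function evictionTtlSeconds(int $sightings, bool $isHighlyDynamic): int {
    if ($isHighlyDynamic) {
        return 10 * 60;      // very dynamic pools: evict within ~10 minutes
    }
    if ($sightings <= 1) {
        return 60 * 60;      // seen only once: gone within an hour
    }
    return 24 * 60 * 60;     // repeatedly sighted: safe to retain longer
}
```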

When it comes to evicting dynamic addresses in general, we have made significant progress. For example, 90% of mobile (5G/4G/3G etc.) proxies are removed from our data within 10 minutes. We've also categorised hundreds of address ranges we know to be shared via carrier-grade NAT (CG-NAT), because the impact on the users of those networks would be too great to list an address even when a proxy is inhabiting it. A sketch of that suppression check follows below.
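Here's a simplified, IPv4-only sketch of such a check. 100.64.0.0/10 is the real RFC 6598 shared address space commonly used for CG-NAT; the helper itself is hypothetical.

```php
<?php
// Illustrative sketch: suppress listings for addresses inside known
// CG-NAT ranges. IPv4-only for brevity; the range list stands in for
// the curated set described above.
function inCgnatRange(string $ip, array $cgnatCidrs = ['100.64.0.0/10']): bool {
    $addr = ip2long($ip);
    foreach ($cgnatCidrs as $cidr) {
        [$subnet, $bits] = explode('/', $cidr);
        $mask = -1 << (32 - (int)$bits);
        if (($addr & $mask) === (ip2long($subnet) & $mask)) {
            return true; // too many legitimate users share this range to list it
        }
    }
    return false;
}
```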

We've also expanded our VPN detection greatly: we increased the number of hosting providers in our database by 19.4% between March 1st and April 8th, our biggest increase in such a short period. This was largely driven by investigating hundreds of address ranges, and also by our customers supplying us with suspicious addresses and providers through our contact form (and by the way, we read and reply to every message sent to us).

Finally, we've also been working heavily on our disposable email detection. We identified several issues in the internal systems that collect and store disposable addresses, and by fixing those we've been able to vastly increase the number of domains we're adding to our database. We've also built custom tools to automatically obtain disposable domains from many of the most popular services.

So that's the data update. We have also made a few updates to the Dashboard over the same time frame: you will now receive Country and Continent suggestions when creating rules that use those condition types, making it easier to target locations without having to guess how we present their names in our API. This feature is driven by our new resources section found here.

We hope you found this post interesting and we would like to thank all of our customers who have taken the time to write to us about emerging threats, new proxy networks, suspicious addresses and temporary email domains. We very much appreciate your effort to make our data better and more thorough.

Thanks for reading and have a lovely weekend!


Introducing new stat graph with local time and per-minute precision


Today we're launching the first in a series of visualisation improvements for the customer dashboard statistics tab. And specifically, we're starting with the graph of your daily query usage.

There were a few things we wanted to resolve with this redesign.

  1. Add proper timestamps along the bottom of the graph
  2. Improve the resolution of the graph so it shows per-minute trends
  3. Display times and dates in your local timezone
  4. Make the chart interactive through selectable timescales, resolution and zooming
  5. Make the chart easier to visually understand by switching from filled graphs to line graphs
  6. Provide a better floating toolbar that follows your mouse cursor

So to explain where we've gone with the new chart, let's first show you what it looks like displaying the previous 15 days of a test account, set at a similar precision level to our previous chart, meaning it displays only a single data point per 24 hours of data.

[Chart: 15 days of query volume at one data point per day]

As you can see, every day is represented by very large and smooth lines. This is great for an overview, but it doesn't show us the trends within a single day. That is where the new precision dropdown comes in; if we select to view this data in increments of hours, we get a much different view.

[Chart: the same 15 days at hourly precision]

Now we can see where our peaks and valleys are, but the data is so precise over the total 15-day time scale that a lot of the information has become crushed down at the bottom. This is where our new zooming feature comes in: you can simply draw a box over a section of the graph to view that area, like below.

[Chart: zoomed into a selected region of the graph]

Now we get a better picture of what we're looking at. And all of this happens on the page in real time; in fact, the chart loads incredibly quickly even with very large volumes of data. Below is a gif showing the speed and fidelity while zooming into a single day on the graph.

[Animation: zooming into a single day on the graph]

The resolution displayed in the graphs above is one hour, but as you view smaller increments of time (such as the past hour or past day) you're able to select higher levels of precision, right down to viewing per-minute query volumes.

At present we're limiting the historical view to 30 days, but we may increase this in the future; we're storing each customer's query data for 365 days at per-minute resolution, so it's entirely possible for us to offer 90 days or even more.

We hope you'll check out the new chart within the Dashboard and let us know what you think. A lot of work went into this feature, especially the next-generation stat recording and storage which drives the per-minute data behind the chart.

Thanks for reading and have a wonderful weekend!


1 year on: Evaluating our move to AMD processors


One year ago, in this blog post, we shared the news that our infrastructure was changing: we would begin transitioning our servers from processors made by Intel to processors made by AMD.

Specifically, we would be using the Zen 3-based Ryzen 9 5950X 16-core microprocessor from AMD. Before we detail the results of this transition, we first wanted to mention that long before we decided to use AMD, we had customers emailing us and advocating for AMD's products. This was a common occurrence any time we announced the deployment of new Intel-based systems.

The tech community's enthusiasm for AMD has been growing since the first-generation Zen products were released in 2017. We too had been monitoring AMD's many successes since the launch of Zen 1, and we consistently looked at the options available to us from our hosting providers.

When the Ryzen 9 5950X became available and offered everything we were looking for, from core and thread counts to clock speeds and ECC memory support, we knew it was time to transition away from Intel, and as luck would have it, our main European host was offering 5950X-based servers in all their datacenters.

Concerns we had moving to AMD

But changing processor vendors isn't without risk. We're trusting that AMD maintains full compatibility with the x86_64 instruction set and that all of our software will work and remain essentially vendor-agnostic. If Intel were to become competitive again in the future, we might need to transition back to their products, and we would want to do that without rewriting all of our software.

Our final concerns were regarding the chiplet architecture of the Zen-based processors, the performance of the memory controller and overall system stability, as these Ryzen 9 processors are not validated as server processors for 24/7 usage.

Potential software compatibility problems

Our operating system on these servers is Windows Server 2022. Our previous Intel machines ran a mix of Windows Server 2012 R2 and Windows Server 2022, as we were transitioning away from 2012 R2 due to Microsoft ending support for it in late 2023.

When it comes to compatibility with our operating system, there is only good news to report. The process scheduler in Windows Server 2022 is top-notch: it fully understands the cache and core hierarchy of Zen 3 microprocessors and allocates program threads properly. Everything works and is stable.

Our software stack is mostly PHP, a mix of PHP 8.1 and 8.2. We also have some C and C++ auxiliary programs. All of our code and third-party software just worked; we didn't need to recompile anything to increase compatibility or improve performance.

Hardware issues, stability problems and solutions

So that's the software side out of the way. When it comes to hardware, you may note above that we had some concerns about the memory controller, and here we did encounter an issue. All of our servers were equipped with 128GB of memory laid out as 4 x 32GB dual-rank 3200MHz ECC DDR4 UDIMMs from Samsung.

The CPU's memory controller and the memory modules we installed are all rated to operate at 3200MHz, but when operating at this speed (which, by the way, is JEDEC certified) we had instability on two of our four servers. This instability manifested as complete system lockups and crashes after several days of uptime.

Thankfully, due to our cluster architecture, these crashes did not negatively affect the service and we didn't experience any downtime, but it was still a big problem that had to be resolved. In the end, we were able to make all our systems stable by reducing the memory frequency from 3200MHz to 2666MHz, which appears to be the recommendation when populating all four slots on a Ryzen motherboard with dual-rank memory modules.

Our opinion is that AMD's memory controller, at least on this processor, is weak. Intel's memory controllers are much more stable at higher frequencies, when using multi-rank modules and when using multiple modules per memory channel.

Reducing the frequency of our memory modules did reduce our memory bandwidth by 16.66%. That is frustrating, as we are paying for much more expensive memory modules that now run at a lower speed than they should, but stability is paramount and not something we're willing to compromise on. We've experienced no further instability since making this configuration change.

Performance

The performance has been exactly as the graphs in our initial announcement showed: wildly faster than our previous setup, with a single one of these servers eclipsing all our other servers combined in not just CPU performance but storage I/O too.

When we switched processors, we also eliminated the remaining hard disk drives in our cluster by changing to enterprise U.2 NVMe flash drives from Samsung. These SSDs have been incredible for consistency, as one of the issues with our previous architecture was the I/O wait times on the hard disk drives. We tried to mitigate that with in-memory caching as much as possible, but there are limits to how much data you can store in memory.

The 128GB of memory we installed hasn't been that impactful from a quantity perspective, but we knew that was likely before we migrated, since our older servers had much smaller quantities of memory and we had learned to be frugal with what we had.

Lessons for the future and would we go AMD again

The main thing we learned is to scrutinise the spec sheets. The RAM frequency issue with dual-rank modules took us by surprise; thankfully it was an easy thing to resolve, though diagnosing it as the root cause wasn't so easy.

Since we deployed these 5950X-based systems, we've also acquired a Zen 3-based 24-core EPYC system with 256GB of memory, which is being used for local research and development, running many virtual machines and a mirror of our live physical infrastructure. We're very happy with this system, and indeed with all of the AMD systems we operate.

For the foreseeable future we see ourselves buying AMD-based systems exclusively, and looking at the state of Intel's product roadmap, this seems likely to hold true through 2026. AMD has recently launched Zen 4-based processors, a highly performant suite of products utilising newer DDR5 memory, and although we don't have any of these yet, we could see ourselves deploying one, either for local development or as a remote node in our cluster, in the near future.

We hope this post was interesting. A few customers have asked us in specific detail how the upgrade went, and we hope this answers those questions. Thanks for reading and have a great weekend!


New API version, currencies now live

Today we've promoted our January 14th API version from beta to stable, and it's now the default and recommended version of the API. With that, we would like to describe the new currency result in the API in a little more detail.

[Screenshot: three address results showing each country's currency ISO code, name and symbol]

Above is a screenshot showcasing three different address results and the currency information for the countries to which those addresses are assigned. As you can see, we display the ISO code, the name of the currency and the symbol. To activate the currency information, simply supply &asn=1 with your requests.

We're still welcoming any corrections users may have for this data, as there are some disputes about which symbols should be used to represent specific currencies, even within a country's borders, so we're having to make judgement calls on a case-by-case basis.

With this addition to the API, it should now be easier than ever for you to localise your content for specific visitors. We've been asked a lot for more tools to pinpoint a user's local metadata beyond just location information, which is why we added timezones and now currencies, and we're looking to expand the amount of data we offer within this sphere in the future.

Thanks for reading and have a wonderful week!


Introducing Currencies to the API and documentation improvements


Today we've introduced a new version of the API dated 14-Jan-2023 (selectable from within your dashboard) which adds currency information to the API output. This is a beta feature, which is why this new API version is not selected by default for users who have chosen "Latest Version" from the API selector dropdown.

With this new API version when you check an IP and have the asn flag enabled (&asn=1) you'll receive local currency information for the IP you're checking. This includes the ISO code, name and symbol for the currency.

As it's a beta feature, we are looking for your feedback, including bug reports, incorrect data and so forth. With the new feature we've also added a new flag, &cur=0/1, which lets you disable or enable this data in the API result. By default, when using &asn=1, the currency information is shown; if you don't want it, you can supply &cur=0 to disable it.
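As a quick illustration, querying the new version from PHP looks something like the sketch below. The endpoint path and the exact field names in the currency object are assumptions based on the description above, so treat them as illustrative.

```php
<?php
// Illustrative sketch of checking an IP with ASN data (and therefore
// currency information) enabled. The /v2/ path and the currency field
// names are assumptions, not confirmed specifics.
$ip  = '8.8.8.8';
$url = 'https://proxycheck.io/v2/' . urlencode($ip) . '?asn=1&cur=1';

$result = json_decode(file_get_contents($url), true);

// Assumed shape: "currency": { "isocode": "...", "name": "...", "symbol": "..." }
if (isset($result[$ip]['currency'])) {
    $cur = $result[$ip]['currency'];
    echo "{$cur['name']} ({$cur['isocode']}) {$cur['symbol']}\n";
}
```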


In addition to the new currency feature, we've also added a new resources section to the website which we're beginning to fill with useful information that customers may need to better utilise our API or integrate our data with their services. The first resource we're launching with is geographical data, specifically lists of continents and countries that you can expect to see in our API responses.

You can view that new section on the updated API documentation page found here.

That's all the updates we have for you today. We hope you'll take the time to try the new API version and let us know what you think! Thanks for reading and have a great weekend.

