Recently we've put a lot of effort into improving the performance of our API: we've reduced overall query access time, improved network peering to lower network overhead, added new query-caching software, and reformatted how our data is stored.
But over the past few days we've been focusing on the CPU usage of our nodes. With the inference engine running constantly and our API answering millions of queries per day, we found that the CPU usage on our nodes was getting quite high. Here is an image depicting an average 60 seconds on one of our nodes, HELIOS.
As you can see from the graph above, CPU usage sits consistently high at around 55-60%.
To figure out what is causing this consistently high CPU usage, we looked at our performance counters and at the data from Ocebot. What we found is that the high CPU usage isn't caused by API queries directly. Caching and code efficiency on that path are very good, and even several hundred thousand queries a minute weren't producing this kind of load.
Instead, the load is coming from the inference engine (roughly 10-20%) and our database syncing system (25-30%). Combined, it's easy to see how we end up at around 55% usage more or less constantly.
To fix it we've rewritten some core parts of our syncing system. Last month we refactored this system so that data that changes very often enters a local cache and is synced at timed intervals. This coalescing of database updates is much more efficient, because a value that changes hundreds or even thousands of times per minute is synced only once per interval instead of hundreds or thousands of times.
But as our customer base has continued to double every few weeks, the amount of data we need to cache before syncing has grown too. So what we're doing now is staging all cluster database updates in local node caches.
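To make the idea a bit more concrete, here's a rough sketch of the coalescing pattern: a local buffer holds the latest value for each key and a background timer flushes the whole batch in one go. This is a simplified illustration only, not our actual code; the `CoalescingWriteBuffer` name, the `flush_to_db` callback and the flush interval are just placeholders.

```python
import threading
import time

class CoalescingWriteBuffer:
    """Stages updates locally and flushes them to the database on a timer.

    A key that changes many times between flushes is written to the
    database only once, with its latest value.
    """

    def __init__(self, flush_to_db, flush_interval=5.0):
        self._pending = {}                  # key -> latest value
        self._lock = threading.Lock()
        self._flush_to_db = flush_to_db     # callable(dict) that does the real sync
        self._interval = flush_interval
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def update(self, key, value):
        # Overwrite any earlier value for this key; only the latest survives.
        with self._lock:
            self._pending[key] = value

    def _run(self):
        while True:
            time.sleep(self._interval)
            with self._lock:
                batch, self._pending = self._pending, {}
            if batch:
                # One database round trip per interval instead of one per update.
                self._flush_to_db(batch)
```

In other words, thousands of calls to `update("hot_counter", n)` between flushes turn into a single write of the final value, which is where the CPU (and database round-trip) savings come from.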
As for the inference engine, we went in manually and altered parts of the algorithm to remove some learned behaviour that got results, but in an unoptimised way. Artificial learning still has a way to go, or at least our implementation does. This has also lowered CPU usage.
So here is the result of our work:
Now we're seeing much lower average CPU usage: down from 55% to around 7%, with peaks of 10-15%. We're still optimising for CPU usage, but we think we've addressed all the major CPU issues with this update, and we're now looking at other aspects of the service for improvement. The good news is that this kind of work lets us put off purchasing another node for our cluster, which leaves more money for development and partner services instead of the servers that run our infrastructure.
Thanks for reading and have a great day!