Infrastructure Deep Dive

Over the past year a few customers have asked us how our service is structured, what kinds of software we use and what custom solutions we've created to run it. As we're nearing the end of the year we thought now would be a good time to take you inside our software stack.

To get started let's follow a request to our API from your server to ours.

Image description

Firstly, our service sits entirely behind CloudFlare's Content Delivery Network (CDN), so whenever you access our website or API you're first going through CloudFlare. As illustrated, your request first goes to the CloudFlare server closest to your server's physical location. This is done using IP Anycast and is handled entirely by CloudFlare.

Image description

Once your request reaches CloudFlare it then enters their Argo Virtual Private Network (VPN). This is a service CloudFlare offers (for a fee) which uses dedicated network links between CloudFlare servers to lower network latency. Essentially we use Argo as a fast and low latency on-ramp to our servers.

This is what enables us to serve customers who are the furthest away from our cluster nodes while delivering acceptable network latency.

Image description

At this point CloudFlare chooses one of our geographically separated cluster nodes to send the request to. The CloudFlare server closest to that node performs the request, and our answer travels back through the Argo VPN to the CloudFlare server closest to you, which returns it to you.

Image description

But what actually happens inside our server? The above illustration explains. Firstly, all our cluster nodes run Windows Server. We feel that Windows offers a lot of value and performance, and we've found IIS to be quite a competitive webserver, offering low CPU and memory usage while being able to handle an enormous number of connections. That isn't to say we think NGINX isn't good. In fact we use NGINX running on Linux for our honeypots, and even CloudFlare uses NGINX for the connections they make to us.

Second to IIS is of course WinCache, which you can think of as OPcache for Windows. It allows us to keep compiled PHP scripts in memory and re-run them without needing to recompile them on every request. This is very important for performance. You can also store user variables, sessions and relative paths in WinCache, but we don't usually make use of these features and instead rely on our own memcache implementation, which we will detail below.

Third and fourth are of course PHP v7.2 and our code, which is written in PHP. You may wonder why we didn't use something more modern such as Node.js. Well, we feel comfortable with PHP and the latest versions have been quite incredible when it comes to performance and features. Having over a decade of experience with PHP has given us great insight into the language, its quirks and its limitations.

Image description

Above is an illustration of what happens inside our code. We've tried to lay out each step in the program loop. All of this usually happens at or under 7ms for every "full" query (meaning all flags enabled). In the case of a multi-query, a lot of what happens on the left is only done once. This allows the latency per IP checked to go down dramatically when you check multiple addresses in a single query.

We will elaborate on the caching system and machine learning below.

Image description

We talked a bit above about our custom memcache system. You may be wondering why we rolled our own when we could have used memcached or Redis, both of which are quite common and well developed. Well, we found that in the case of memcached its use of UDP as a mechanism to retrieve cached data wasn't consistent enough under high load.

To explain the behaviour we saw: most of the test queries we performed would be answered in under 1ms with memcached, but sometimes we would have queries that took 1 second or even 2.5 seconds. We determined this was caused by its network communication system.

For a website those kinds of one-off latency hiccups are fine, but for our usage those issues would add up fast. After testing with it for more than a month we decided to roll our own system, which relies on inter-process communication similar to an RPC call. Essentially we load an interface within PHP as an extension, and that allows us to store and retrieve data as needed from our custom memcache process, which runs separately on each server node.

Our memcache system also has some features you'll find in Redis, such as being able to write out cached data to persistent storage, network updates to keep multiple instances on different servers consistent, and the ability to store almost any kind of data including strings, arrays, objects and streams.

In addition to those features it can also make use of tiered storage, which means it can store the most frequently touched objects in memory, keep less frequently used objects on an SSD and keep even less frequently used objects on a hard disk. We've found this approach very beneficial for our large machine learning datasets, where we try to pre-process as much information as possible so that query times remain low when utilising our real-time inference engine.
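To make the tiered storage idea concrete, here is a minimal Python sketch. It is not our implementation: the class name `TieredCache`, the slot limits and the use of in-memory dictionaries to stand in for the SSD and hard disk are all assumptions made for illustration. It also demotes by recency (least recently used) rather than by true access frequency, which is a simplification.

```python
from collections import OrderedDict


class TieredCache:
    """Toy three-tier cache: memory -> SSD -> hard disk (all simulated with dicts)."""

    def __init__(self, memory_slots: int, ssd_slots: int):
        # Each tier is kept in least-recently-used order (oldest entry first).
        self.memory = OrderedDict()
        self.ssd = OrderedDict()
        self.disk = OrderedDict()  # Unbounded in this sketch.
        self.memory_slots = memory_slots
        self.ssd_slots = ssd_slots

    def set(self, key, value):
        # Remove any stale copy, then place the fresh value in the fastest tier.
        for tier in (self.memory, self.ssd, self.disk):
            tier.pop(key, None)
        self.memory[key] = value
        self._demote_overflow()

    def get(self, key, default=None):
        for tier in (self.memory, self.ssd, self.disk):
            if key in tier:
                value = tier.pop(key)
                # A hit promotes the object back into memory.
                self.memory[key] = value
                self._demote_overflow()
                return value
        return default

    def _demote_overflow(self):
        # Push the least recently used objects down one tier at a time.
        while len(self.memory) > self.memory_slots:
            old_key, old_value = self.memory.popitem(last=False)
            self.ssd[old_key] = old_value
        while len(self.ssd) > self.ssd_slots:
            old_key, old_value = self.ssd.popitem(last=False)
            self.disk[old_key] = old_value

    def tier_of(self, key):
        # Report which tier currently holds the key, or None if it is absent.
        for name in ("memory", "ssd", "disk"):
            if key in getattr(self, name):
                return name
        return None
```

With two memory slots and two SSD slots, storing five objects leaves the oldest one on the simulated hard disk, and reading it again moves it back into memory while something else is pushed down.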

Image description

Which is a great segue into how our machine learning system works. We don't want to go into too much detail about what specific machine learning framework or models we're using, but we can confirm we're using an open-source library. Above are some of the intelligence gathering methods our inference engine uses in determining whether an IP is operating as a proxy or not.

We do want to elaborate on some of these as they may not be so obvious. Take "High Volume Actioning" for example. Due to our wide visibility over the many customer properties that make use of our API, combined with our own honeypots, we're able to monitor unusual and high volume actions by single IP addresses. This could be signing up for many different websites, posting a lot of comments on forums and blogs, clicking on a lot of ads and so on. These are behaviours that are not unusual on their own but become suspicious when done at a high frequency within a short time frame.
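As a rough illustration of what "high frequency within a short time frame" can mean in code, below is a sliding-window counter in Python. The class name `ActionRateMonitor`, the window length and the action limit are hypothetical values chosen for the example, not the ones our engine uses, and the sketch assumes timestamps arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class ActionRateMonitor:
    """Counts actions per IP inside a sliding time window."""

    def __init__(self, window_seconds: float, max_actions: int):
        self.window_seconds = window_seconds
        self.max_actions = max_actions
        self._events = defaultdict(deque)  # ip -> timestamps of recent actions

    def record(self, ip: str, timestamp: float) -> bool:
        """Record one action; return True if the IP now looks high-volume."""
        events = self._events[ip]
        events.append(timestamp)
        # Drop actions that have fallen out of the window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        return len(events) > self.max_actions


# Example: more than 3 actions within 60 seconds is treated as suspicious.
monitor = ActionRateMonitor(window_seconds=60, max_actions=3)
flags = [monitor.record("203.0.113.7", t) for t in (0, 1, 2, 3)]
```

In this example only the fourth action inside the window trips the check, which mirrors the point above: each action is ordinary by itself and only the rate is unusual.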

Another we wanted to elaborate on is "Vulnerable Service Discovery". A growing number of Internet of Things (IoT) devices are being turned into proxies by criminals. These range from CCTV Cameras to Routers to general "smart" devices like home automation hubs, kitchen appliances and so on.

During its probing and prodding of addresses our system discovers a great many compromised devices which can either be accessed with default credentials or have an unpatched firmware vulnerability that allows an attacker to set up proxy serving software on the device.

Simply having a vulnerable device available isn't going to get an IP flagged as a proxy server but it does hurt that IP's reputation and it will be weighted along with the other data we have for that IP Address when the inference engine makes its final decision.
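The idea that one signal hurts an address's reputation without flagging it by itself can be sketched as a simple weighted sum. This is only a stand-in for the inference engine, which is a machine learning model and not a lookup table. The signal names are taken from the headings in this post, while the weights, the threshold and the function names are invented for the example.

```python
from typing import Iterable

# Hypothetical weights; the real engine's inputs and weighting are not public.
SIGNAL_WEIGHTS = {
    "vulnerable_service": 30,
    "high_volume_actioning": 40,
    "automated_behaviour": 50,
}
PROXY_THRESHOLD = 60  # Assumed cut-off for flagging an address.


def reputation_penalty(signals: Iterable[str]) -> int:
    """Sum the weights of the distinct observed signals, capped at 100."""
    total = sum(SIGNAL_WEIGHTS.get(name, 0) for name in set(signals))
    return min(100, total)


def is_flagged(signals: Iterable[str]) -> bool:
    """An address is flagged only when the combined penalty reaches the threshold."""
    return reputation_penalty(signals) >= PROXY_THRESHOLD
```

Under these made-up numbers `is_flagged(["vulnerable_service"])` is false, while combining it with `"high_volume_actioning"` crosses the threshold.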

Finally we wanted to talk about "Automated Behaviour Detection". Similar to High Volume Actioning, this is where we use our wide visibility and honeypot network to observe addresses performing web crawling, spamming, automated signups, captcha solving and other activity that fits botting behaviour. As bots have become more sophisticated and now actually execute JavaScript within headless browsers, it has become harder to stop them from accessing your web properties.

So detecting this kind of automated behaviour is extremely important and our model is designed to detect all kinds of automated behaviour through wide observation of incoming queries (combined with tag mining from our customer queries) and through our own honeypot network of fake websites, blogs, forums and more.

So we hope that you enjoyed this deep dive into our software stack. Obviously some parts we've had to hold a little close to our chest as we feel they give us a competitive edge (especially with regards to our machine learning) but we think we shared enough to give you some insight into what we're doing and how.

Thanks for reading and have a great week!

New upcoming payment notices

As our service has been offering subscriptions for quite a while now we've come across a few instances where customers forget that they're signed up for a subscription with us or they didn't realise the payments for their subscription are taken automatically as opposed to being paid by the customer manually.

Thankfully these instances where we bill someone without their knowledge are rare and in each case we have always issued an immediate refund once the customer contacts us about the situation.

But to eliminate this problem we've decided to be proactive about it by offering upcoming payment notices. And so within the customer dashboard starting today you'll see a new email toggle (which replaces our never used promotions toggle) which allows you to activate email notices for upcoming payments.

By default all new customers will have this toggled on. If you're an existing customer you'll need to enable it yourself. Also, if you're subscribed to a yearly plan we'll send you a notice regardless of this setting, because those plans are very expensive and we feel it's important that customers who hold a yearly subscription get these notices.

We don't want anyone to forget a payment charge is coming but we know for monthly subscribers receiving two emails every month (a notice of an upcoming payment and the receipt for payment) could get annoying so we've added this email setting toggle for those users who are subscribed monthly. Of course we'll always still send you payment receipts regardless of this setting.

Below is an example of what the email looks like.

Image description

We think it conveys everything succinctly and most importantly lets you know that you can cancel your plan from the dashboard to avert the upcoming charge.

Thanks and we hope everyone had a great Halloween! 🎃

New homepage, footer changes and dropping Google Ads

Today we've launched a brand new homepage with the goal of drawing in more users by showcasing our amazing customer dashboard which we feel is our biggest differentiator in this space and a great asset.

Remaking the face of your website, the home page everyone sees when they visit for the first time, is a daunting task. We've been quite conservative with our changes over the past two years, but today we've taken a big step and we're very happy with how it turned out.

If you're very perceptive you may also have noticed that we've cleaned up our footer navigation across the site by removing some redundant links and visual separators. A more obvious change is our removal of Google Ads.

We're removing all ads across the site because they don't perform well enough to warrant us carrying them. Our product is made for the software developer community, where usage of ad-blocking software is extremely high, which results in very low ad views when compared to our page views.

So from now on we will not be displaying any ads on the site, whether from Google or any other ad network. We'll instead be subsisting purely on the revenue made from selling paid plans.

We hope you like these changes and please do check out the new homepage!

Invoice history added to the Dashboard

This has been an often requested feature: the ability to view and print out past and current invoices. Today we've added the feature to the customer dashboard under the Paid Options tab and this is what it looks like:

Image description

We will be showing your most recent 100 invoices here. As that could become quite a long list we've also added a hide button. To keep the page loading quickly we load the invoice log after the page itself has loaded, so the dashboard won't be slowed down at all by this new feature.

That's it for this update. We hope you enjoy the new addition!

What happened on October 19th?

If you visited your Dashboard yesterday you may have seen a notice at the top explaining we had a very bad server failure on our HELIOS node which had caused many stats related issues. Today we will explain this very unusual failure and what we learned from it.

To begin with, HELIOS had been our longest serving node. We have had that server for many years and it has had some hardware failures in the past, including two failed hard disks. Yesterday's was the most difficult type of failure to deal with from a programmer's perspective: bad memory. To fix it we replaced the motherboard, CPU and memory, so HELIOS is effectively a new server.

When writing any software you are building on a foundation of truths, and what is held in the computer's memory is something you have to trust as that's where all your software is actually living. It's very difficult to program a system to self diagnose a memory issue when the self diagnosis tool itself will likely be affected by the memory problems.

And that is exactly what happened here. Our system is designed to remove malfunctioning nodes from the cluster but in this case HELIOS's bad memory was causing it to re-assert itself. It even tried to remove other nodes from our cluster thinking they were malfunctioning because its own verification systems were so broken it was interpreting their valid health responses as invalid.

The reason this affected our stats processing is that we use an election process to keep our cluster database coherent and to stop conflicts caused by multiple nodes processing the same data at the same time. Every so often the nodes hold a vote and one healthy node is selected to process all of the statistics for a given time period. Due to the HELIOS node's memory issues this voting process did not work as intended.
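For readers curious what an election like this can look like, here is a small Python sketch. It is an illustration under assumptions and not our actual protocol: the function names, the hash-based choice of candidate and the strict majority rule are all made up for the example, and the node names used with it are placeholders.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable, Optional


def preferred_candidate(healthy_nodes: Iterable[str], period: str) -> str:
    """Pick a candidate deterministically so honest nodes agree without coordinating."""

    def rank(node: str) -> str:
        # Hashing the period with the node name rotates the choice between periods.
        return hashlib.sha256(f"{period}:{node}".encode()).hexdigest()

    return min(healthy_nodes, key=rank)


def elect_stats_processor(ballots: Dict[str, str], cluster_size: int) -> Optional[str]:
    """Return the node backed by a strict majority of the whole cluster, or None."""
    if not ballots:
        return None
    winner, votes = Counter(ballots.values()).most_common(1)[0]
    return winner if votes > cluster_size // 2 else None
```

Each node would cast a ballot of the form `{voter: candidate}`. A node with corrupted state may vote for anything, but in this sketch one bad ballot cannot win by itself because the winner needs more than half of the cluster behind it.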

What we learned from this is that we needed a better way to completely lock out malfunctioning nodes from the cluster and we needed more points of reference for nodes to self diagnose issues and preferably to break themselves completely when they discover problems that would need human intervention instead of continuing to harm the cluster by remaining within it.

Today we think we've accomplished both of these goals. Firstly, we've set up a lot of references in our health checks for self diagnosis that weren't there before. This isn't a foolproof solution, but if any of the references are corrupted the node's built-in self management system shouldn't be able to start arguing with the cluster and voting other nodes offline. At the very least, if it still has the working capability to perform votes, it should neuter itself before attempting to vote on other nodes' health status.

Secondly, we've broadened our nodes' ability to lock out bad nodes by revoking the tokens needed to be a part of the cluster group. This means good servers with a consensus can remove the "passwords" a malfunctioning node requires to access the cluster.
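A consensus-based lockout of this kind could be sketched as follows in Python. The class `ClusterTokens`, its method names and the "more than half of all nodes" quorum are assumptions for illustration only. The post does not describe how our tokens are actually issued or revoked.

```python
class ClusterTokens:
    """Tracks which nodes hold a valid cluster token and tallies revocation votes."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.valid = set(nodes)  # Nodes currently holding a valid token.
        self._votes = {}         # target -> set of voters asking for revocation

    def vote_to_revoke(self, voter: str, target: str) -> bool:
        """Register a vote; return True once the target's token is revoked."""
        if target not in self.nodes:
            raise KeyError(target)
        # Votes from nodes without a valid token, and self-votes, are ignored.
        if voter in self.valid and target in self.valid and voter != target:
            self._votes.setdefault(target, set()).add(voter)
            # Only count voters that still hold a token themselves.
            live_votes = self._votes[target] & self.valid
            if len(live_votes) > len(self.nodes) // 2:
                self.valid.discard(target)
        return target not in self.valid
```

With five nodes, a single misbehaving node voting against its peers never reaches the quorum of three, while three healthy nodes agreeing is enough to revoke the bad node's token and make its later votes meaningless.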

A third change we've made is having known good nodes act faster when they are removed from the cluster while still functional, by allowing them to initiate a confidence vote amongst the other nodes. This can be done just a few seconds after removal if the node thinks it's working correctly. Only nodes with perfect health scores over the past 3 minutes are allowed to vote in these decisions, to reduce false positives caused by malfunctioning nodes.

We should also mention that although we only have three nodes listed in the cluster there are in fact five. Two of them do not accept queries and are not front-facing. Instead they work behind the scenes to manage health, settle vote disputes and step in under another node's name if there is a serious enough issue to warrant that.
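The eligibility rule for these confidence votes can be illustrated with a short Python sketch. The 3 minute window comes from the text above. Everything else, including the `NodeHealth` class, the idea that a score of 1.0 means a perfect check and the strict majority needed to pass, is an assumption for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

HEALTH_WINDOW_SECONDS = 180  # "the past 3 minutes"


@dataclass
class NodeHealth:
    name: str
    # (timestamp, score) samples; 1.0 is assumed to mean a perfect health check.
    samples: List[Tuple[float, float]] = field(default_factory=list)

    def eligible_to_vote(self, now: float) -> bool:
        recent = [
            score for ts, score in self.samples
            if now - HEALTH_WINDOW_SECONDS <= ts <= now
        ]
        # A node with no recent samples cannot prove it is healthy.
        return bool(recent) and all(score == 1.0 for score in recent)


def confidence_vote(voters: List[NodeHealth], ballots: Dict[str, bool], now: float) -> bool:
    """Reinstate the candidate if a strict majority of eligible voters say yes."""
    eligible = [v for v in voters if v.eligible_to_vote(now)]
    if not eligible:
        return False
    yes = sum(1 for v in eligible if ballots.get(v.name, False))
    return yes > len(eligible) // 2
```

Ballots from nodes that had any imperfect score inside the window are simply not counted, so a node with faulty memory cannot tip the result in either direction.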

We are of course disappointed that this failure occurred. Many of you contacted support yesterday via live chat to express your concerns and we're very sorry that this happened. We're especially sorry to those of you who received overage notices due to the invalid query amounts that accumulated on your accounts, and we hope you can accept our sincere apology for that. Our hope is that with these changes something like this will never happen again.

Thanks for reading and we hope everyone has a great weekend.

Minor stats issue yesterday evening through to this morning

Just a quick notice: yesterday evening we renewed some of our internal security certificates, and although we set the new certificates to be applied to all three of our server nodes they were in fact only applied to our prometheus node.

Due to this, customer stats including how many queries you've made and your positive detections were not being updated within your dashboard. The good news is that none of these stats were lost, they just weren't being processed. We have now corrected the certificate issue and all of your stats from the affected time period will be reflected accurately within your dashboard.

We're sorry for the inconvenience this caused.