Email Alert Improvements and a CDN Caching Update

Today we've pushed a minor update that brings a quality-of-life improvement for customers who prefer to make one-off payments for service instead of being on our recurring billing system.

From today we'll send you an email a day before your paid plan is due to expire so you don't get caught out. Previously we only sent an email at the moment your plan transitioned from paid to free, which caused some customers to go without full protection until they started a new plan.

Like our previous plan-expiry emails, these are tied to your email preferences within the customer dashboard, specifically the "Important emails related only to my account" toggle.

So if you don't want to receive these kinds of alerts (for example, because you've chosen to cancel a paid plan you no longer need) you can simply toggle that email preference off and we won't send you any such notifications.

The second thing we wanted to talk about is the new CDN caching we introduced a few days ago to help smooth out peak loads and, specifically, to handle DDoS attacks.

The good news is we've been able to significantly reduce load on our cluster by utilising caching at our network partner CloudFlare. In fact, we had another attack on our infrastructure recently and the mitigations held up quite well; service was not affected or disrupted. We've been tweaking the system, and today we pushed live some changes to how it handles unregistered users (those without an API Key) so they have an even smaller impact on our service.

For registered users (both free and paid) this means a better quality of service overall, as fewer resources need to be spent handling unregistered user queries.

Thanks for reading! - We hope everyone is having a great week.

New Statistics Synchroniser

As the proxycheck service has become more popular we've found that our prior syncing system was not living up to expectations. The main problem was high CPU usage, caused by the volume of statistics needing to be processed and synchronised.

To combat this issue, in the past we added update coalescing: each of our servers maintains a local object which stores all the raw statistics for a period of 60 seconds, and then all our servers transfer their objects to the server that has been selected to process statistics for that time period. (The server selected to process statistics is regularly rotated, but to maintain database coherency only one server can perform writes at any one time.)
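A minimal sketch of that coalescing pattern (class and field names are illustrative, not our actual implementation):

```python
import time

class StatsCoalescer:
    """Accumulates raw per-customer counters locally, then hands the
    whole object to the elected writer node once per interval."""

    def __init__(self, flush_interval=60):
        self.flush_interval = flush_interval
        self.counters = {}  # e.g. {customer_id: {"queries": n}}
        self.last_flush = time.monotonic()

    def record(self, customer_id, stat, amount=1):
        # Cheap in-memory increment; no database write per query.
        bucket = self.counters.setdefault(customer_id, {})
        bucket[stat] = bucket.get(stat, 0) + amount

    def maybe_flush(self, send_to_writer):
        # Once per interval, transfer the accumulated object to the
        # single node elected to perform writes, then start fresh.
        if time.monotonic() - self.last_flush >= self.flush_interval:
            payload, self.counters = self.counters, {}
            self.last_flush = time.monotonic()
            send_to_writer(payload)
```

The key property is that thousands of per-query increments collapse into one transfer per server per interval, which is what keeps write load off the database.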

This worked well for the past few years, but as we've grown this method was no longer enough on its own. Whereas at the beginning we could update your stats within 90 seconds of you making an API query, lately some statistics could take up to an hour to show, depending on the load level of the node that originally accepted your API query.

This is quite obviously unacceptable, which is why today we've rewritten the way all statistics are synchronised. We're still using our local-object approach with update coalescing, but we've completely reprogrammed the methods for sending, chunking, checksumming and verifying data.
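To illustrate the chunking-and-verification idea, here is a simplified sketch (the chunk size, hash choice and structure are assumptions for the example, not our wire format):

```python
import hashlib

def make_chunks(payload: bytes, chunk_size: int = 64 * 1024):
    """Split a serialised statistics payload into fixed-size chunks,
    each carrying a checksum the receiver can verify independently."""
    chunks = []
    for offset in range(0, len(payload), chunk_size):
        data = payload[offset:offset + chunk_size]
        chunks.append({
            "seq": len(chunks),
            "sha256": hashlib.sha256(data).hexdigest(),
            "data": data,
        })
    return chunks

def verify_and_join(chunks) -> bytes:
    """Re-assemble chunks in order, rejecting any that fail their checksum."""
    out = []
    for chunk in sorted(chunks, key=lambda c: c["seq"]):
        if hashlib.sha256(chunk["data"]).hexdigest() != chunk["sha256"]:
            raise ValueError(f"chunk {chunk['seq']} failed verification")
        out.append(chunk["data"])
    return b"".join(out)
```

Per-chunk checksums mean a corrupted transfer only needs the bad chunk re-sent, not the whole payload.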

The biggest change this brings is a significant reduction in CPU usage. Synchronising statistics had become such a burden that some of our servers were spending up to 25% of their CPU utilisation on it. With the new system we're seeing only around 5-6% usage when handling the synchronisation of live customer statistics, and that usage occurs over much briefer periods: stats are now synchronised within just 2 to 3 seconds, whereas before even the best-case scenarios took several minutes.

This change is live right now, which means you should see every stat in your dashboard (and on our dashboard API endpoints) update much more quickly.

Thanks for reading and have a great weekend!

Degraded Performance Yesterday and Mitigations

Yesterday between 7:30 PM and 10:30 PM GMT we experienced highly degraded service, with many of your queries taking up to 2.5 seconds to be answered, if not dropped entirely.

This was due to a sustained attack against our infrastructure. In this case we were not the initial target of the attack; we were simply dragged into it because one of our customers uses our service to protect their game server. The individual[s] attacking our customer turned their attacks on us as a way to degrade the service level we were providing, so that their attacks on our customer's game server would be more effective.

The traffic we received was 9.5x higher than we would ever normally experience and was tuned for maximum resource depletion. Although our service did not completely go down, and normal service resumed immediately once the attack stopped, we did suffer severe service disruption, which we intend to mitigate with two changes we have enabled today.

Firstly, we're adjusting our per-second request limiter. Previously it allowed you to make between 100 and 125 requests per second with a resolution of 1 second (per node). We're changing that resolution to 10 seconds. The per-second limit is still the same, but with the wider window our servers can ignore bad requests for a longer period of time, which helps smooth out the kind of per-second peak loads that denial of service attacks create.
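As an illustration of what widening the resolution does, here is a toy fixed-window limiter (the class name, bookkeeping and clock injection are ours for the example, not the production limiter):

```python
import time

class WindowLimiter:
    """Fixed-window request limiter. With a 10-second window, a client
    that exhausts its budget stays blocked until the window rolls over,
    instead of getting a fresh allowance every single second."""

    def __init__(self, limit_per_second=100, resolution=10, clock=time.monotonic):
        self.budget = limit_per_second * resolution  # budget for the whole window
        self.resolution = resolution
        self.clock = clock
        self.counts = {}  # client -> (window_start, request_count)

    def allow(self, client):
        now = self.clock()
        start, count = self.counts.get(client, (now, 0))
        if now - start >= self.resolution:
            start, count = now, 0  # window expired: fresh allowance
        count += 1
        self.counts[client] = (start, count)
        return count <= self.budget
```

The per-second rate permitted is unchanged; what changes is how long an over-limit client keeps being ignored, which is exactly what blunts one-second attack spikes.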

Secondly, we've enabled request caching at our edge CDN (Content Delivery Network). This means every unique request you make will be cached for 10 seconds. The cache is per-customer, so you will never receive cached content generated by another customer. The main benefit is that it allows the same IP Address to be checked multiple times by a single customer without incurring extra requests to our servers.

We've made this second change because when our own customers suffer DDoS attacks they often send the same individual IP Addresses thousands of times a minute to our API, which exhausts their query plans and creates undue load on our servers answering the same queries over and over.
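If you're one of those customers, a small client-side cache avoids this entirely; a minimal sketch, with `fetch` standing in for whatever HTTP call you make to our API:

```python
import time

class TTLCache:
    """Caches proxy-check results per IP for a short TTL, so repeated
    checks of the same address don't cost extra API requests."""

    def __init__(self, fetch, ttl=10, clock=time.monotonic):
        self.fetch = fetch   # function ip -> result (e.g. an HTTP call)
        self.ttl = ttl
        self.clock = clock
        self.store = {}      # ip -> (timestamp, result)

    def check(self, ip):
        now = self.clock()
        hit = self.store.get(ip)
        if hit and now - hit[0] < self.ttl:
            return hit[1]    # fresh cached answer, no API request made
        result = self.fetch(ip)
        self.store[ip] = (now, result)
        return result
```

During an attack, where the same address arrives thousands of times a minute, almost every lookup becomes a local cache hit rather than a spent query.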

We're hopeful that both mitigations will help with future attacks but as always we will monitor the situation closely and alter our strategy as we see fit. We're also planning to add more servers into the cluster to further load balance this kind of peaking traffic.

Thanks for reading.

A Message About Intel's Microarchitectural MDS Vulnerability

Today there has been a revelation in the news about a new attack dubbed ZombieLoad which allows the exfiltration of data held in system memory by processes that shouldn't have access to that data on Intel systems.

Here is a quote from the ZombieLoad website:

While programs normally only see their own data, a malicious program can exploit the fill buffers to get hold of secrets currently processed by other running programs. These secrets can be user-level secrets, such as browser history, website content, user keys, and passwords, or system-level secrets, such as disk encryption keys.

One scenario where this can be exploited with wide-reaching ramifications is cloud hosting. When you rent a Virtual Private Server (VPS) from a cloud host such as Amazon, Digital Ocean, Google, Microsoft and many others, you're actually sharing a virtualised piece of a larger system. And you will have neighbours on that system.

Normally the virtual machines are kept completely isolated from each other with files and memory kept separate by the host system. But with this attack you can break down those secure barriers and peek at what the other guests on the system are doing with their slice of the host resources. You can also peek at the host system itself revealing data encryption keys, root access keys and other data that should remain secret at all times.

Now, the reason we're making this post is that we've already had some customers send us links to the new attack because they're concerned about how it affects us. We want to make clear: we're not affected, because we do not use cloud hosting or virtual private servers for any of our core infrastructure.

All of our nodes within our main cluster are bare metal, meaning we operate the entire physical server; we're not renting just a slice of it. That also means all the customer data we hold is on bare-metal servers and is completely safe from this new vulnerability. While it's true we use VPSs for our honeypots, they do not hold any data beyond incoming attacks.

We specifically use a data-retrieve model for our honeypots where they collect and store attacks and then one of our core servers connects to the honeypot and downloads their data. At no point are any of these honeypots given access to the rest of our infrastructure. They hold no keys or credentials. We treat them the same way we would any untrusted third-party.

Thanks for reading and have a great week.

Spring Updates

This month we've been working mostly behind the scenes on various long-term projects, and today we'd like to share some of them with you so you can get an understanding of how we're improving the service going forward and what investments we're making in our future.

Earlier this month we added a new server similar in nature to STYX which has a specific duty within our infrastructure. This new server is called HADES and it's an orchestration server. It handles all our server management duties including the scheduling of automatic software updates, the setup and provisioning of new servers and continued real-time monitoring of our entire infrastructure among other similar duties.

Essentially, HADES allows us to do in a completely automated way a lot of things that could previously only be done semi-automatically. By having HADES we can deploy a greater number of servers with the least amount of ongoing management burden.

This leads well into what we've been working on with honeypots. If you've visited our status page before, you may have noticed that at the bottom we list 20 of our own honeypots. These are virtual private servers we set up around two years ago to appear attractive to bots hunting for exploitable computers on the internet. The data provided by these honeypots has been invaluable in growing our unique datasets.

But we always want more data, which is why we have been aggressively securing partnerships with sources of unique attack data. One reason this data is useful is that there is a very wide crossover between addresses used for proxying and addresses used for botting.

And our customer base is growing in a direction where it's no longer good enough to detect only proxies and virtual private networks; our customers almost always want to block any non-human visitor to their site or service. They know that bots are a growing nuisance, causing untold damage to infrastructure and loss of earnings.

Be it employee time wasted cleaning web properties of spam comments, click fraud that destroys ad revenue, or devastating automated exploit hunting that results in infrastructure compromises, hurting customer confidence in your business and leading to big financial ramifications.

All of this is why we've been putting a deep focus on what our API today lists as compromised servers. We have an opportunity to make the internet safer and at an affordable price so that no site or service need go unprotected.

And of course these new partnerships we've been making continue to respect our privacy obligations. The only data we're sharing with our new partners comes from our own honeypots and inference engine. We're not giving up any of your data or the data you entrust to us about your own customers.

These changes are all live as of this post and we've already grown our purview of bots by 25% this month and we see that only growing as we strike more partnerships going forward.

Thanks for reading this update and we hope you all have a great week.

Celebrating our 3rd birthday with new API Documentation!

On the 20th of April our service reached its third-year milestone. In that time we've served many billions of queries and added many features. To celebrate the occasion, we decided to redo our API documentation so that it meets the level of quality our service delivers. So let's go through some of the changes we've made.

  1. Each section of the API documentation must be useful; less repeated information makes it faster to read.
  2. Anchor points throughout the document making it possible to link to specific sections.
  3. A side menubar which lets you click straight to the section you're interested in and follows your progress through the document.
  4. A moveable menubar so you can position it wherever you'd like it to be.
  5. Straightforward tables showing feature, limit and other breakdowns with colour formatting.
  6. A test console so you can try all the flags and status codes right in the document itself.
  7. Showcased coding libraries and their features so you don't waste time reinventing the wheel.

So let's get to the screenshots! First we'd like to showcase our new side menus, which not only track your position through the API documentation based on where your cursor is positioned, but are also draggable to anywhere on the screen, where they will stay. This is especially useful for users with small screens who may need the menus to overlay some of the document content.

[Screenshot: the draggable side menus]

The next thing we wanted to illustrate is our new library section where we feature many implementations of our API written by ourselves and other developers that you can use to jump start your own use of our API.

[Screenshot: the featured code libraries]

If you have created a function, class or library for our API that you would like us to feature in our documentation please contact us!

Now if you do decide to build your own client for our API, what better way to familiarise yourself with our formatting and status codes than to try them right in your browser? So we've included a new test console that lets you see every possible answer from our API with a high degree of customisability.

[Screenshot: the test console]

We hope that you'll find these new features highly useful when developing for our API. We know that the documentation has needed these changes for some time and we've been planning to overhaul them since the beginning of this year. We spoke with multiple customers during the design of our new documentation page and they were instrumental in recommending changes and additions.

If you have any feedback please contact us and let us know, we would very much like to hear from you. Thanks for reading and we hope everyone has a great week!

Per Second Request Limits Introduced

Yesterday we introduced per-second request limits to our API to ensure consistent service for all customers. We've had to implement this due to two main situations that have arisen recently.

  1. Our customers who are under a sustained denial of service attack tend to have bursts of extremely high requests per second to our API; often they are checking the same addresses over and over again in quick succession because they are not caching the queries they make to us.
  2. We've been receiving multiple denial of service attacks targeting us directly which are tuned to exhaust our CPU resources.

So to stop these situations from reducing our quality of service, we have introduced a per-customer, per-node, per-second request limit of 100. Now that is a lot of "pers", so I will explain in a little more detail how the limit is actually enforced.

First, we have a soft and hard limit

  1. The soft limit is between 101 and 125 requests per second. (Your request will succeed, with a warning message).
  2. The hard limit starts at 126 requests per second. (Your request will be denied).

Secondly, these limits are per-node

Currently there are 4 nodes in the cluster, which means that if you are able to distribute your requests across multiple IP Addresses you're more likely to evenly load the cluster, which will raise your per-second soft request limit from 100 to 400 and your hard limit from 125 to 500.

Thirdly, the limit is per customer

Each individual customer can make 100 requests per second to each of our nodes or 400 requests per second to the cluster as a whole.

Fourth, the limiter has a resolution of exactly one second

What this means is: if you go over the limit in the current second and receive a warning or denied response from the API, by the next second your queries will be answered again, as the per-second allowance has reset. We're not recording how many queries you make over minutes or hours and then dividing that volume by seconds; this is a truly per-second limit, so you won't be penalised for a short burst of very high requests.
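The thresholds above can be summarised in a few lines (the function name and return labels are illustrative; the numbers come from this post):

```python
def classify(requests_this_second):
    """Map a client's request count in the current one-second window
    to the limiter outcome described in this post."""
    if requests_this_second <= 100:
        return "ok"                 # within the standard allowance
    if requests_this_second <= 125:
        return "ok_with_warning"    # soft limit: answered, with a warning
    return "denied"                 # hard limit: request refused
```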

Fifth, these limits are per request, not per query

What this means is, you can still send multiple IP Addresses to be checked in a single request. The current limit is 10,000 addresses in a single request; if you sent that many, it would count as one request in the second we receive it, not as 10,000 requests.
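On the client side, that means batching addresses rather than sending them one at a time; a simple sketch:

```python
def batch_addresses(addresses, max_per_request=10_000):
    """Group many IP addresses into as few API requests as possible,
    respecting the 10,000-address-per-request ceiling."""
    return [addresses[i:i + max_per_request]
            for i in range(0, len(addresses), max_per_request)]
```

Checking 25,000 addresses this way costs only 3 requests against the per-second limit instead of 25,000.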

Based on our research we believe these limits are some of the most liberal of any API of this type. We've seen services which have limits of less than 100 queries per minute and we're offering 100 per second (or 400 when evenly loading our cluster).

We believe our limits are very reasonable. If you took our largest pre-configured plan of 10.24 million queries per day, you would need to make 118 requests per second over a 24-hour period to use the full plan (when performing one query per request), and it's very likely you would evenly load our cluster during that time, putting you well within the 400 requests-per-second soft limit. But even if all your traffic were always directed at a single node, you would still be under the 126 requests-per-second hard limit.
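The arithmetic behind that 118 figure:

```python
plan_per_day = 10_240_000        # largest pre-configured plan, queries per day
seconds_per_day = 24 * 60 * 60   # 86,400

# Spreading the whole plan evenly over a day needs ~118.5 requests/s
# (at one query per request), matching the figure quoted above.
needed_per_second = plan_per_day / seconds_per_day
assert 118 < needed_per_second < 119
```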

Of course this is our first foray into request limiting and we may alter the limits in the future, rest assured we will fully detail any and all changes. And remember as we add more nodes to the cluster the overall cluster request limit will keep rising.

Thanks for reading and have a great weekend.

API Performance Improvements

Today we have rolled out new versions of our v1 and v2 APIs, with a focus on reducing query processing time, and the improvement is quite drastic.

Prior to today it was common for a full query, meaning one with all flags enabled (VPN, ASN, Inference etc.), to take between 7ms and 14ms to process. Today we've been able to reduce that to between 1ms and 3ms on average.

The way we've been able to accomplish this is by porting back to v1 and v2 some of the changes we created for our as-yet-unannounced v3 API. The changes are mostly structural, dealing with how our code is processed, how it's compiled and cached between queries, and how it's executed again for subsequent queries.

Although we've always reused processes for different queries (the time it takes to set up a new process being too long), we're now doing it more efficiently, with more data retained by these processes so they don't need to reload as much information into memory between queries.

We've also altered the caching system for our code to take a tiered-storage approach. Code files are now loaded from disk into memory by one of our processes, and the opcache retrieves the code from that memory-based file cache before compiling it and storing the resulting opcode in memory. This gives more consistent performance, as the opcache no longer needs to reload code files or check their modified dates on our physical disks; they are instead held in memory by our file-caching process.

This change is important because the opcache checks code files frequently to determine whether a file needs to be re-compiled and cached again; keeping the files in system memory thus keeps performance consistently high over long periods of time.

We're also making more efficient use of the compiled versions of our code by removing comments and other nonessential text from the code files prior to compilation, which makes the compiled versions smaller and faster to run. Finally, in the code itself we've reduced database calls, which cause processor context switching, so we waste fewer CPU cycles gathering and sorting information and spend more time delivering it. This ties into the change mentioned above, where our reused processes store more data in memory for each query to make use of.

So what is the net benefit of the API responding this quickly? Going from ~7ms to ~3ms may not seem like a lot of time to save, but put simply, it allows us to handle 2.3 times more queries per second on each cluster node.
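That multiplier follows directly from the timings, since per-node throughput scales inversely with processing time:

```python
old_ms, new_ms = 7.0, 3.0   # typical full-query processing time, before and after
speedup = old_ms / new_ms   # queries/second per node scales as 1 / processing time
assert round(speedup, 1) == 2.3
```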

It also means we can answer your specific queries faster, so you're not waiting as long for an answer from us, and you can use the API in more time-sensitive deployments. Some of our customers are already so physically close to our infrastructure that they can receive an answer (including network overhead) in under 20ms, so we're getting to the point where every single millisecond we save counts.

Thanks for reading and have a great day!