Improved dashboard query stats

One of the things our customers have asked for is a way to view yesterday's query stats in their dashboard. This is a useful metric because it helps potential customers estimate how many daily queries they will need to purchase if they're going over the 1,000 free queries we provide.

So today we've gone ahead and spruced up the stats view on the dashboard. You can now page through the last 30 days of queries on a day-by-day basis, and you can download your stats as a text file or export them using our JSON Export button. And just like with our positive detection log, you can query the JSON Export API using only your API Key, so it can be integrated into your own control panels.
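
If you'd like to pull these stats into your own tooling, here's a minimal sketch of what that could look like in Python. Note that the exact export URL below is an assumption (modelled on the detections export endpoint covered later in this post), so substitute whatever address the JSON Export button opens for you.

  # Minimal sketch: pulling daily query stats with just an API Key.
  # The "/dashboard/export/usage/" path below is a guess modelled on the
  # detections export URL; use whatever the JSON Export button opens for you.
  import json
  import urllib.request

  API_KEY = "111111-222222-333333-444444"  # replace with your own key
  url = f"https://proxycheck.io/dashboard/export/usage/?json=1&key={API_KEY}"

  with urllib.request.urlopen(url, timeout=10) as response:
      stats = json.load(response)

  print(json.dumps(stats, indent=2))  # day-by-day entries, whatever shape they arrive in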

Here is what the new stats section looks like:

As always, these new features are instantly accessible to all users, whether you're on our free or paid plans. We've also enhanced the subscription UI: if you're subscribed (monthly or yearly) you will now see how much you will be charged on your next billing date. Prior to today we only showed the next billing date itself, which of course isn't much use if you forget what your plan costs.

We hope you all like these changes, as they are a direct result of your feedback. We really appreciate how helpful everyone has been with reporting the issues they find and requesting new features like the enhanced query stats we added today.

Thanks for reading and have a great day!


Customer Q&A

Since the service has been running for quite a while now, I thought it'd be a good time to answer some of the most frequently asked questions we get from customers. First of all, we welcome these questions, so please feel free to keep sending them via our live chat or email ([email protected]).

So let's get into the top questions we receive.

Question: Why is the free service limit so large (1,000 queries per day), and why are the paid plans so affordable?

Answer: We know that there are two types of proxy detection APIs out there: the free kind, where absolutely every query you make is free up to a usually unexplained "reasonable" amount, and the paid kind, where the cost is extremely high, especially for new developers who are simply nurturing an idea into existence and thus don't have the capital needed to purchase expensive subscriptions.

So our free plan is quite generous because we're fighting for market share from the other free players in this category. Our main free competitors have a lot of mindshare because they're free, but the service level offered isn't that great. Some of them didn't, or still don't, offer TLS querying, they are ambiguous about how many queries you can make per day before being cut off, and none of the ones we've found offer cluster redundancy to free customers. There are also questions about how accurate free services are when their creators have no monetary incentive to keep their APIs up to date with the latest detection methods.

These are the areas where we identified we could offer something better to grow our market share: to pull some customers away from the free services and perhaps convert them to paid customers in the process. But even if they don't convert, simply mentioning us online (as developers often do) is more than enough payback for the free service they enjoy.

As for why the paid plans are so affordable, we don't have much overhead because we run proxycheck.io very lean. We've worked hard to write it in a way that scales on mid-range server hardware. If you look at the companies offering paid-only proxy detection, their plans are often several times more costly than ours, but they have enormous overheads with lots of employees. We simply don't; we know that proxy detection is mostly a niche service and not as popular as, for example, geolocation services.

And the other side of that coin is that we believe all developers should have access to the best proxy detection service for the lowest prices. Right now our starting plan offers unlimited concurrent querying and 10,000 daily queries for only $1.99 a month, or $1.59 a month if paid annually. Some of the cheaper paid-only providers start charging at $8.99 a month, and that's just too much for most people's websites.

But rest assured we have a lot of paying customers, the service is profitable and we intend to maintain our current pricing while adding lots of new features and further improving our detection methods.

Question: Why do you only accept Debit/Credit cards and not PayPal? (I want to pay with PayPal)

Answer: This question has come up more often than any other, and to be quite honest PayPal is not an enjoyable company to work with; it's more of a necessary evil. Part of how we're able to offer prices as low as $1.99/$1.59 (annually billed pricing) is that we don't use PayPal.

They take quite a large slice of each transaction. Not only do they take a percentage, they take a set fee as well, and it would simply eat into our revenues too much. We would have to increase our lowest priced plan by an extra dollar.

PayPal also has a nasty habit of closing accounts on a whim, something we can't be dealing with when our business relies on the automatic recurring payments made by our customers.

Naturally we understand why people like to use PayPal as a customer: it keeps your Debit/Credit card information safe because the merchant never sees it. But this is partly why we partnered with Stripe instead. Their card processing fees are much lower than PayPal's and we still never see your card information; only Stripe has it, and they're fully PCI compliant just as PayPal is.

Unfortunately we cannot offer PayPal at this time. Although that has resulted in quite a few lost sales, I think you can understand why, from our perspective, we're reluctant to enter into any partnership with them.

Question: Can I make an app that includes your API and if so can I sell that app to others?

Answer: You sure can. We welcome you to make all manner of software that includes our API, and if it's really good we'll even feature it on our examples webpage, so feel free to shoot us an email when you've made a great app!

Question: How are you detecting/gathering the proxy addresses to be blocked?

Answer: There are a lot of different methods we use, but the two most common are scanning websites all over the internet, all day every day, to find new IP Addresses which are acting as proxy servers, and collecting IP Addresses by testing them with our own inference engine.

The inference engine is a type of machine learning system whereby we set it some goals, a working set of data and a very rough guideline of how to apply evidence-based reasoning to sort that information, in this case separating good IP Addresses from bad ones. Over the past few weeks it has quickly become a major source of proxy server addresses for us, and during our testing we've found 90% of the addresses our inference engine finds are unknown to the other major proxy checking services we've tried.
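
To make "evidence-based reasoning" a little more concrete, here's a deliberately simplified toy in Python. It is nothing like the real inference engine (the signals and weights are invented purely for illustration); it just shows the general idea of sorting an address into good or bad based on accumulated evidence.

  # Toy illustration only: hand-picked signals and weights, not our real engine.
  from dataclasses import dataclass

  @dataclass
  class Evidence:
      open_proxy_ports: int      # e.g. 3128, 8080, 1080 answering
      honeypot_hits: int         # times seen attacking a honeypot
      on_scraped_lists: bool     # appears on public proxy listings
      datacenter_range: bool     # address sits in a hosting range

  def score(e: Evidence) -> float:
      s = 0.0
      s += 0.4 * min(e.open_proxy_ports, 3)
      s += 0.3 * min(e.honeypot_hits, 5)
      s += 1.0 if e.on_scraped_lists else 0.0
      s += 0.5 if e.datacenter_range else 0.0
      return s

  example = Evidence(open_proxy_ports=2, honeypot_hits=0,
                     on_scraped_lists=True, datacenter_range=True)
  print("bad" if score(example) >= 1.5 else "good")  # -> bad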

On top of these major sources we get around 0.25% of our data from third parties, who we either pay for their data or who make it available for free without commercial license stipulations. It's important to realise we cannot have total detection, so we look to third parties when they have reasonable pricing or licensing and offer unique address data that we don't already have. Thanks to our inference engine our database has almost doubled in size, and our reliance on third parties has fallen drastically as a result.

I should also mention we operate 20 honeypots situated in different address ranges around the world, and from these we feed our inference engine data about the attacks being performed on them. We monitor for VNC, RDP, FTP and Telnet attacks as well as email spam, website signup/comment spam and more. These have been a great source of data for us.

Question: Why are you using CloudFlare?

Answer: CloudFlare is a CDN (Content Distribution Network) and they have servers all over the world, as do we. They can offer us great connectivity to our customers anywhere on the globe, even in areas we don't have servers.

The other benefit is that our cluster was designed from the very beginning to work in tandem with CloudFlare, and that is why we're able to offer triple server redundancy without requiring our developers to do anything special with how they query our API. It just works, it's incredibly fast, and whether we have a single server or 2,000 servers in our cluster it will continue to just work.

Question: What's your opinion of x service that denigrated your service?

Answer: Whenever you enter a competitive market you're going to get some disparaging comments from the other established players. We've actually been offering proxy blocking services (for free) since 2009, but it was only in 2016 that we put up the proxycheck.io domain and decided to turn it into a proper business. So what I'm saying is we have a lot of experience, we're not new to this, and frankly our service is already one of, if not the, best, offering developers incredible flexibility in pricing and features.

Question: Your biggest plan isn't big enough for me or I'm worried I will outgrow your service

Answer: Currently our biggest plan is 2.56 Million daily queries for $29.99 a month, or $287.90 a year (a 20% discount for paying annually). However, that's just our largest set plan; if you need twice, thrice or quadruple this amount of queries (or more), simply let us know, as we offer very competitive pricing.

For example, for $39.99 a month you can have 5.12 Million daily queries. We're not suddenly charging extortionate prices for custom plans, so feel free to shoot us an email and we can discuss your needs.

Question: Can I really cancel my plan at any time?

Answer: You sure can. From your dashboard you'll see a new Cancellation button if you're currently subscribed to any plan (monthly or yearly). The best part is that if you cancel before your current plan ends you don't lose what you've already paid for. So you can purchase a one month plan, cancel it after a few minutes and still enjoy an entire month's worth of your paid plan.

We've done this because we know some people are hesitant about automatically renewing subscriptions; you don't want to be caught out by a payment you forgot was coming, so we fully support your ability to cancel a subscription without losing anything you've paid for.

And that's all folks!

We hope you've found this little Q&A useful. All of these questions were put to us by customers, often many times. If you have any other questions please feel free to email us at [email protected]; we aim to answer all emails within 12 hours.


Improving node efficiency

Recently we've been focusing a lot of effort on improving the performance of our API. We've reduced overall query access time, improved network peering to lower our network overhead, added new query caching software and reformatted how our data is stored.

But over the past few days we've been focusing on the CPU usage of our nodes. With the inference engine running constantly and our API having to answer millions of queries per day, we found that the CPU usage on our nodes was getting quite high. Here is an image depicting an average 60 seconds on one of our nodes, HELIOS.

As you can see from the graph above, the CPU usage is consistently high at around 55-60%.

To figure out what was causing this consistently high CPU usage we looked at our performance counters and also the data from Ocebot. What we found was that the high CPU usage isn't being caused by API queries directly: our level of caching and code efficiency is very high there, and the impact of even several hundred thousand queries a minute was not causing these kinds of high load scenarios.

Instead we found it to be caused by the inference engine (about 10-20% load) and our database syncing system (25-30%). Combine those with everything else the nodes are doing and it's easy to sit at around 55% usage all the time.

To fix it we've rewritten some core parts of our syncing system. We did some code refactoring on this system last month so that data that changes very often enters a local cache to be synced at timed intervals. This coalescing of database updates allows for higher efficiency because data that changes very often (hundreds or even thousands of times per minute) is synced only once per interval instead of hundreds or thousands of times.

But what we found is that, as our customer base has continued to double every few weeks, the amount of data we need to cache before syncing has increased too. So what we're doing now is staging all cluster database updates in local node caches.
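
For the curious, here's a rough sketch in Python of the coalescing idea described above. This is not our production code, just an illustration: hot values are written to a local in-memory cache, and a background flush pushes one batched update to the cluster per interval no matter how many times each value changed.

  # Rough sketch of write coalescing, not our production syncing system.
  import threading
  import time

  def sync_to_cluster(batch):
      # stand-in for the real cluster database sync
      print(f"syncing {len(batch)} coalesced keys")

  class CoalescingCache:
      def __init__(self, flush_interval=5.0):
          self._pending = {}
          self._lock = threading.Lock()
          self._interval = flush_interval

      def set(self, key, value):
          with self._lock:
              self._pending[key] = value  # later writes overwrite earlier ones

      def run(self):
          while True:
              time.sleep(self._interval)
              with self._lock:
                  batch, self._pending = self._pending, {}
              if batch:
                  sync_to_cluster(batch)  # one batched write per interval

  cache = CoalescingCache()
  threading.Thread(target=cache.run, daemon=True).start()
  for i in range(10_000):
      cache.set("queries:today", i)  # thousands of updates...
  time.sleep(6)                      # ...but only one sync goes out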

As for the inference engine, we have manually gone in and altered some of the algorithm to remove learned behaviour which got results but in an unoptimised way; artificial learning still has a way to go, or at least our implementation does. This has also resulted in lower CPU usage.

So here is the result of our work:

Now we're seeing a much lower average CPU usage: from 55% down to around 7%, with peaks of 10-15%. We're still optimising for CPU usage, but we think we've hit all the major CPU issues with this update and we're now looking at other aspects of the service for improvement. The good news is that by doing this kind of work we can put off purchasing another node for our cluster, which leaves more money to pay for development and partner services instead of the servers that run our infrastructure.

Thanks for reading and have a great day!


Improved ASN data and lower response times

Earlier today we made a post about our ASN data source having some network issues, causing us to have incomplete ASN data. We have since switched data sources for ASN information, which has resulted in two benefits.

  1. Queries that ask for ASN data are now being answered in 100-200ms instead of 400-600ms.
  2. We now have ASN data for IPv6 addresses.

Previously only IPv4 was supported for ASN lookups, and those took quite a while (relatively speaking) to be answered. We're now using a much better partner for this information, which allows us to store more ASN data on our own servers, resulting in faster and more complete lookups.
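
As a quick illustration, here's how a query asking for ASN data might look from Python. The endpoint path and "asn" flag used here are assumptions for illustration; check the API documentation for the exact query format.

  # Hedged sketch: requesting ASN data alongside a normal proxy check.
  # The "/v2/" path and "asn=1" flag are assumptions for illustration;
  # consult the API documentation for the exact query format.
  import json
  import urllib.request

  API_KEY = "111111-222222-333333-444444"
  ip = "2001:4860:4860::8888"  # IPv6 addresses now return ASN data too
  url = f"https://proxycheck.io/v2/{ip}?key={API_KEY}&asn=1"

  with urllib.request.urlopen(url, timeout=10) as response:
      result = json.load(response)

  print(json.dumps(result, indent=2))  # ASN/provider fields appear when available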

Thanks!


Minor Node Issue with HELIOS

Around 10 hours ago an intermittent syncing issue began with our HELIOS node, whereby it wasn't syncing some of its data with the other nodes in our cluster; stats data and new user registrations specifically were not being synced by this node. This morning we discovered the issue, and it has been corrected by the time you're reading this post.

At no point was any customer data lost, and your query stats should show correctly as of right now. Also at no time was the API giving bad or incomplete data, as the syncing of that information was working correctly at all times.

We're very sorry that this occurred, and we're investigating why HELIOS was not removed from the cluster permanently. Our initial findings seem to indicate it synced up completely a few times and re-entered the cluster, only to fall out of sync almost immediately and then be removed again after a long delay. We will be adjusting our cluster architecture to be more resilient to these kinds of intermittent faults in the future.

In an unrelated event, we're having some latency issues with our ASN data supplier's network, which has resulted in diminished ASN information and higher latency. Expect 4-5 second replies for queries that contain the ASN flag and for the information to be incomplete (sometimes lacking country information). We expect to have this working correctly again soon.

Thank you for your patience and have a great day.


Real time Inference Engine

As we mentioned in our previous blog post about Honeybot, our machine learning inference engine has become so fast at making determinations about IP Addresses that it exhausted our backlog of negative detection data, which subsequently slowed down its self-iteration considerably.

We've now reached a point where our algorithm is able to consistently make an accurate assessment of an IP Address in under 80ms, and so we've decided to add the inference engine to our main detection API for real-time assessment.

What this means is that when you perform a query on our API, our inference engine now examines that IP Address at the same time as our other checks are being performed. Our hope is that we can provide a more accurate real-time detection system instead of only fortifying our data after a query is made.

Our inference engine is still doing exhaustive testing on IP Addresses that have negative results to find proxies we weren't aware of, and our system still performs checks on the surrounding subnet when it is confident there are other bad addresses in that neighbourhood. But all those checks are still done after your query, in addition to the more targeted checks we're now doing in real time.

As of this post the new real-time inference engine is live on our API and being served from every node in our cluster. One thing you should expect is slightly higher latency: previously our average response (after network overhead is removed) was 26ms; with real-time inference that average has increased to 75ms.

We feel this is a good trade-off because we're continually working to reduce latency while also introducing more thorough checking. We're confident we can get back down below 30ms soon, and we will use those extra response time savings to introduce more types of checks.
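
If you want to see how the change affects your own integration, a quick sampler like the one below will give you an average round-trip figure. Keep in mind the 26ms and 75ms figures above exclude network overhead, so measured numbers will be higher, and the endpoint path here is an assumption; adjust it to however you normally query the API.

  # Simple latency sampler (illustrative only). Measured times include
  # network overhead, unlike the server-side averages quoted above.
  import time
  import urllib.request

  API_KEY = "111111-222222-333333-444444"
  url = f"https://proxycheck.io/v2/8.8.8.8?key={API_KEY}"  # path assumed for illustration

  samples = []
  for _ in range(10):
      start = time.perf_counter()
      urllib.request.urlopen(url, timeout=10).read()
      samples.append((time.perf_counter() - start) * 1000)

  print(f"average round-trip: {sum(samples) / len(samples):.1f} ms")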

Thanks for reading and have a great day!


The adventures of Ocebot continue

Earlier this month we told you about our new API testing bot called Ocebot which performs queries on our API and records the results so we can examine the areas of the API that need further optimisation.

I'm pleased to say that since our last update, where an average response on a negative detection took 250ms (including network overhead), we've now got that down to less than half, with average negative detections taking just 119ms (including network overhead).

Previously our average response time for a negative detection, excluding network overhead, was 78ms. Our negative results used to be much faster than this, but as our data set has grown 10-fold over the past year the time needed to access that data has grown with it. With the help of Ocebot we brought our data access times down to 43ms earlier this week and then further down to 22ms just today by optimising our functions and the way we access our database of information. These numbers are after network overhead is removed.

Going from 78ms to 22ms was done by tuning our functions, rewriting ones that weren't performant and multithreading more parts of our checking pipeline. Getting the best performance out of our multitasking system is a priority for us, as we know there is still more we can do here.
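
As a loose illustration of what "multithreading more parts of our checking pipeline" means in practice, here's a small Python sketch that runs several independent checks for one address in parallel rather than one after the other. The check functions are stand-ins, not our real pipeline stages.

  # Illustrative only: run independent checks concurrently for one address.
  from concurrent.futures import ThreadPoolExecutor

  def check_known_proxy_list(ip):   # stand-in for a database lookup
      return False

  def check_tor_exit_nodes(ip):     # stand-in for a TOR exit list lookup
      return False

  def check_inference_engine(ip):   # stand-in for a real-time inference call
      return False

  CHECKS = [check_known_proxy_list, check_tor_exit_nodes, check_inference_engine]

  def is_proxy(ip: str) -> bool:
      with ThreadPoolExecutor(max_workers=len(CHECKS)) as pool:
          results = list(pool.map(lambda check: check(ip), CHECKS))
      return any(results)

  print(is_proxy("203.0.113.10"))  # -> False with these stub checks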

The final thing we did was alter our network routing. We're now doing smart routing to our server nodes, which has significantly reduced the latency you'll experience when interacting with our services. We already use a CDN (Content Distribution Network), but now we're optimising the routes taken by your bits once they hit our CDN partner so that they reach our servers as quickly as possible.

Essentially we've created a wider on-ramp so that customer traffic can get to us faster by using better intermediary networks. This is the major reason behind the overall response time falling from an average of 250ms to 119ms, but our work in reducing API response time is helping here too.

We hope you're enjoying these updates. Keeping the API as fast as possible is important because we're gaining more data per day than we ever have previously. Data sources like our inference engine and honeypots now provide more unique and useful data than our manual scraping efforts, which has increased our database size considerably. Investing in making all that data accessible as quickly as possible is paramount to our service.

Thanks for reading and have a great day!


Introducing Honeybot, the proxycheck.io honeypot

Since we built our inference engine we've been having it examine addresses from our negative detections to find proxies that we missed. We're still doing that, but we've hit a snag: the engine is so fast at making determinations now that we no longer have a backlog of negative detections to work through.

Even with millions of daily queries, the majority of which are negative detections, the engine is simply so fast that it gets through them very quickly. As a result our learning system has started to slow down; it's not iterating on itself as often as it once was because its efficiency has resulted in a lack of data to be processed. Essentially we've bottlenecked its learning capability by not providing enough data.

So we've decided to expand our sources of IP data to further feed the machine learning algorithm behind our inference engine. To accomplish this we have begun renting 20 VPSes around the world which will act as honeypots. Thankfully VPS servers with the low specifications that are perfect for this role are very cheap. In fact all 20 of the VPSes we're now renting cost the same as our new ATLAS node, which is great value.

The way it works is simple: we've created a Linux distribution we're calling Honeybot (just a casual name) which contains various emulated services left wide open to the internet. Think web servers with admin login forms, SSH, FTP, Email, Telnet, RDP, VNC servers and so on. We're currently set up to emulate (with accurate handshakes) more than 120 services.
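
To give a flavour of what a single emulated service looks like, here's a toy Python listener, far simpler than the real Honeybot distro, that presents a fake SMTP banner on one port and records every IP Address that connects.

  # Toy single-service honeypot (nothing like the full Honeybot distro):
  # present a fake banner, capture what the client sends, log the IP.
  import datetime
  import socket

  FAKE_BANNER = b"220 mail.example.com ESMTP ready\r\n"  # pretend to be a mail server

  def run_honeypot(host="0.0.0.0", port=2525):
      with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
          server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
          server.bind((host, port))
          server.listen()
          while True:
              conn, (ip, _) = server.accept()
              with conn:
                  conn.settimeout(5.0)
                  conn.sendall(FAKE_BANNER)
                  try:
                      first_bytes = conn.recv(256)  # whatever the client sends first
                  except socket.timeout:
                      first_bytes = b""
              with open("honeypot.log", "a") as log:
                  log.write(f"{datetime.datetime.utcnow().isoformat()} {ip} {first_bytes!r}\n")

  if __name__ == "__main__":
      run_honeypot()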

All of the honeypots also run a mini version of our cluster database software so that our main nodes can retrieve data from these honeypot servers and process it with our inference engine.

To be clear, we're not simply adding every IP Address that touches these servers to our proxy database. Some IP Addresses, like ones that are specifically trying to gain access to VNC, FTP or SSH by brute-forcing usernames and passwords, will be added to our proxy database straight away. But ones that are not so obvious will be processed by our production inference engine, the one that has learning turned off.

A mirror set of all the IP Addresses touching our honeypots will then be tested on our learning inference engine, even the ones we already know are being used for attacks. The hope here is that we can find common characteristics amongst these addresses that the inference engine can use to better detect proxies itself in the future.

We enabled our new honeypots running our Honeybot distro this morning after much testing yesterday, and already we're seeing huge volumes of attack traffic. It's quite surprising just how quickly we began to see hundreds of connections per server on all manner of services.

The data gleaned from these attacks is already filtering down into our main cluster database and being served by our nodes to customers. Looking at the results so far, we think this is going to be a great opportunity to widen our data's field of view and further enhance our inference engine.

Thanks for reading and we hope you all had a great weekend.


Dashboard Exporter

Last month we added a new feature to the dashboard which lets you view your recent positive detections as determined by our API. Here is a screenshot of this feature:

Since we added it we've had some customers ask us for a more convenient export feature. At first we only allowed you to download your complete recent detections to a text file, but that is mostly useful for human reading and not for easy parsing by computers.

So today we've added a new button, seen in the screenshot above, called JSON Export. When you click this you'll open the most recent 100 entries in a new tab of your web browser, and you'll notice the URL contains your API Key and a limit variable.

This is what the URL structure looks like:

https://proxycheck.io/dashboard/export/detections/?json=1&limit=100&key=111111-222222-333333-444444

If you don't supply your API Key we'll check your browser for a cookie and session token like we do for log downloads and in-dashboard page browsing.

The point of allowing you to specify your API Key in the request URL is so that you can create software on your side which automatically parses your recent positive detections. For example, perhaps you don't want to set up logging on your side for positive detections and would rather have an overview from us. It's a no-fuss, turn-key solution which will allow you to integrate your positive detections into any control panels you may have on your side.

And we've provided the limit variable so you can specify how many recent entries you want to view. If you set it to 0 or remove it entirely we will send you your entire positive detection log.
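
Here's a short Python example that fetches the export URL shown above and pretty-prints the result. The structure of each entry isn't documented in this post, so the sketch doesn't assume any particular fields.

  # Fetch the JSON Export and pretty-print it. Set limit=0 (or drop the
  # parameter) to receive your entire positive detection log.
  import json
  import urllib.request

  API_KEY = "111111-222222-333333-444444"
  LIMIT = 100
  url = (f"https://proxycheck.io/dashboard/export/detections/"
         f"?json=1&limit={LIMIT}&key={API_KEY}")

  with urllib.request.urlopen(url, timeout=10) as response:
      detections = json.load(response)

  print(json.dumps(detections, indent=2))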

You may notice that the URL for this feature starts in a new /export/ directory. It is our intention to expand the kinds of data you can export to include whitelists and blacklists (and also to allow adding entries to these through the API). We'll also be adding account controls to specify whether these sorts of things can be queried by your API Key alone or whether you only want them accessible from within the dashboard itself.

The new export feature does not count towards your normal API queries, so you are free to query it as much as you need to. If you're querying it very frequently, please use the limit variable to receive only the most recent entries that are relevant to you.

And of course this feature is accessible to all our customers whether you're on our free or paid plans.

Have a great day!


proxycheck.io history, where did we come from?

Today I thought, it's such a nice day, why not reflect on our history and tell the story of how proxycheck.io came to be? I often get asked how I came up with the service and how it started. Well, it all started back in 2009 as a side project.

Now you're probably thinking, hold on a second, proxycheck.io started in 2016, so how is the story starting in 2009? Well, back then I (the owner of proxycheck.io) was operating a chat room, similar to an IRC chat channel, and I hosted this room on a shared network with hundreds of other chat rooms.

And one day we started being attacked by automated bots that sent random gibberish into all our chat rooms. At first the attackers were using TOR (The Onion Router) to mask their IP Addresses and to get around all the banning we were doing. Then once we found a way to block TOR they started using SOCKS proxies.

So in 2009 I set out to solve this problem and I built a piece of software called Proxy Blocker. It even had a sweet logo:

The link above goes to the original thread where I posted the first client software for Proxy Blocker way back in November 2009. At the time the client software would download a list of proxies from my web server each day, and when a user entered a chat room using one of those IP Addresses it would kick them out of the room and ban their IP Address for 24 hours.

Over time it gained a lot of complexity and popularity with the chat channels on the network, picking up features such as cross-channel ban sharing, automatically logging people in once they were verified as not being proxies, redirecting non-proxy users to different channels, and much more.

But the main change, which came some time in 2010, was that I switched it from downloading a list of proxy IP Addresses from my server to querying the server directly for each IP encountered. Essentially I had built the first version of proxycheck.io: an API that checked whether an IP Address was operating as a proxy server or not.

From then until 2016 the Proxy Blocker API worked great, and I used it for many other projects, from protecting my forum and other websites to protecting game servers. I also gave the URL for the API out to other developers to use in their own coding projects. But I always had this thought in the back of my head: what if I turned it into a proper service?

I actually nudged a friend of mine, who had helped with Proxy Blocker a few times over the years, to make such a service. I kept telling him it would be a great thing for developers and that he could probably charge for queries to pay for servers and development. He hummed and hawed about it and the service was never made.

So after trying to convince him to do it, I'd actually convinced myself that I should make it instead. To be clear, this is now 2016, and there were various other proxy checking / blocking APIs available, so I was coming into it last. But I had a lot of experience, having already built Proxy Blocker over the previous 7 years. I had a great head start, and I felt that with my unique perspective, having protected many different kinds of services and hundreds of chat channels, I still had a great product developers would want.

And that is how proxycheck.io was born. I bought the domain in 2016, started coding, and within a few days I had the API up and answering queries. About six months later I started offering paid plans and a customer dashboard. So far things are going very well; the service is profitable, which means all our bills are paid and my time spent coding the service is partially being paid back.

We like to think we're a little bit ahead of the competing services in this space because we're offering things like our cluster architecture, the whitelist, the blacklist and query tagging for free to all customers. These are the kinds of features developers want but which take time and knowledge to set up correctly. Having them situated at the API level removes a lot of complexity for our customers and makes our service more attractive, especially to developers who want to get proxies blocked fast without spending a long time on the implementation.

We're loving the response to the service so far. It has been just over a year since we started proxycheck.io, but we've gained a lot of customers and are already handling millions of daily queries. If I had one regret it's that I didn't start the service earlier!

We hope this blog post was interesting, if you have any questions please feel free to contact us!

