The adventures of Ocebot continue

Earlier this month we told you about our new API testing bot called Ocebot, which performs queries on our API and records the results so we can examine the areas of the API that need further optimisation.

I'm pleased to say that since our last update, where an average response on a negative detection took 250ms (including network overhead), we've now got that down to less than half, with average negative detections taking just 119ms (including network overhead).

Previously our average response time for a negative detection without network overhead was 78ms. Negative results used to be much faster than this, but as our data set has grown tenfold over the past year, so has the time needed to access it. With the help of Ocebot we were able to reduce our data access times to 43ms earlier this week, and then further down to 22ms just today, by optimising our functions and the way we access our database of information. These figures exclude network overhead.

Going from 78ms to 22ms was accomplished by tuning our functions, rewriting the ones that weren't performant and multithreading more parts of our checking pipeline. Getting the best performance out of our multitasking system is a priority for us, as we know there is still more we can do here.
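To illustrate the idea of multithreading a checking pipeline, here is a minimal sketch in Python. The check functions and their names are hypothetical placeholders, not our actual code; the point is simply that independent checks can run concurrently, so total latency approaches that of the slowest single check rather than the sum of all of them.

```python
# Hypothetical sketch: run independent pipeline checks concurrently.
# The check functions below are illustrative stubs, not proxycheck.io's real code.
from concurrent.futures import ThreadPoolExecutor

def check_known_proxies(ip):
    return False   # stub: would look the IP up in the proxy database

def check_tor_exit_nodes(ip):
    return False   # stub: would look the IP up in a TOR exit node list

def check_blacklists(ip):
    return False   # stub: would consult any customer blacklists

def run_checks(ip):
    checks = [check_known_proxies, check_tor_exit_nodes, check_blacklists]
    with ThreadPoolExecutor(max_workers=len(checks)) as pool:
        # Each check runs in its own thread; results come back in order.
        results = list(pool.map(lambda check: check(ip), checks))
    return any(results)

print(run_checks("1.2.3.4"))
```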

The final thing we did was alter our network routing. We're now doing smart routing to our server nodes, which has significantly reduced the latency you'll experience when interacting with our services. We already use a CDN (Content Delivery Network), but now we're optimising the routes taken by your bits once they hit our CDN partner so that they touch our servers as quickly as possible.

Essentially we've created a wider on-ramp so that customer traffic can get to us faster by using better intermediary networks. This is the major reason behind the average response time going from 250ms to 119ms, but our work in reducing API processing time is helping here too.

We hope you're enjoying these updates. Keeping the API as fast as possible is important because we're gaining more data per day than we ever have previously. Data sources like our inference engine and honeypots now provide more unique and useful data than our manual scraping efforts, which has increased our database size considerably. Investing in making all that data as quickly accessible as possible is paramount to our service.

Thanks for reading and have a great day!


Introducing Honeybot, the proxycheck.io honeypot

Since we built our inference engine we've been having it examine addresses from our negative detections to find proxies that we missed. We're still doing that, but we've hit a snag: the engine is now so fast at making determinations that we no longer have a backlog of negative detections to work through.

Even with millions of daily queries, the majority of which are negative detections, the engine is simply so fast that it gets through them very quickly. As a result our learning system has started to slow down; it's not iterating on itself as often as it once was because its efficiency has resulted in a lack of data to be processed. Essentially we've bottlenecked its learning capability by not providing enough data.

So we've decided to expand our sources of IP data to further feed the machine learning algorithm behind our inference engine. To accomplish this we have begun renting 20 VPSes around the world which will act as honeypots. Thankfully, VPS servers with low specifications, which are perfect for this role, are very cheap. In fact all 20 of the VPSes we're now renting cost the same as our new ATLAS node, which is great value.

The way it works is simple: we've created a Linux distribution we're calling Honeybot (just a casual name) which contains various emulated services left wide open to the internet. Think web servers with admin login forms, SSH, FTP, email, Telnet, RDP and VNC servers and so on. We're currently set up to emulate (with accurate handshakes) more than 120 services.
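As a rough illustration of what one of these emulated services looks like, here is a minimal sketch in Python of a fake SSH listener that presents a plausible banner and records connecting addresses. It's a hypothetical example written for this post, not part of the actual Honeybot distribution.

```python
# Hypothetical sketch of a single emulated service: a fake SSH banner listener
# that logs every connecting IP for later processing. Not Honeybot's real code.
import socket
from datetime import datetime, timezone

def run_fake_ssh(host="0.0.0.0", port=2222, logfile="connections.log"):
    # Port 2222 is used here so the sketch runs without root privileges.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(64)
    while True:
        client, (ip, _) = server.accept()
        try:
            # Present a believable handshake so scanners think the service is real.
            client.sendall(b"SSH-2.0-OpenSSH_7.4\r\n")
            # Record who connected and when, to be picked up for processing later.
            with open(logfile, "a") as log:
                log.write(f"{datetime.now(timezone.utc).isoformat()} {ip} ssh\n")
        finally:
            client.close()

if __name__ == "__main__":
    run_fake_ssh()
```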

All of the honeypots also run a mini version of our cluster database software so that our main nodes can retrieve data from these honeypot servers and process it with our inference engine.

To be clear, we're not simply adding every IP Address that touches these servers to our proxy database. Some IP Addresses, such as those specifically trying to gain access to VNC, FTP or SSH by brute-forcing login credentials, will be added to our proxy database straight away. But ones that are not so obvious will be processed by our production inference engine, which is the one that has learning turned off.

A mirror set of all the IP Addresses touching our Honeypots will then be tested on our learning inference engine, even the ones we know are being used for attacks. The hope here is that we can find common characteristics amongst these addresses that the inference engine will use to better detect proxies itself in the future.
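To make that flow concrete, here is a hypothetical sketch of the triage in Python. The function names and the set of obvious brute-force behaviours are placeholders for illustration only, not our production code.

```python
# Hypothetical sketch of the honeypot triage described above; all names are placeholders.

OBVIOUS_BRUTE_FORCE = {"ssh", "ftp", "vnc"}      # services where brute-forcing was observed

def add_to_proxy_database(ip):
    print(f"{ip} added to the proxy database")   # stub for illustration

def production_engine_says_proxy(ip):
    return False                                 # stub: non-learning engine's verdict

def feed_learning_engine(ip):
    print(f"{ip} queued for the learning engine")  # stub for illustration

def triage(ip, brute_forced_service=None):
    if brute_forced_service in OBVIOUS_BRUTE_FORCE:
        # Obvious attackers are listed straight away.
        add_to_proxy_database(ip)
    elif production_engine_says_proxy(ip):
        # Less obvious traffic is judged by the production (non-learning) engine.
        add_to_proxy_database(ip)
    # Every address, including known attackers, also feeds the learning engine.
    feed_learning_engine(ip)

triage("198.51.100.9", brute_forced_service="ssh")
```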

We enabled our new honeypots running the Honeybot distro this morning after much testing yesterday, and already we're seeing huge volumes of attack traffic. It's quite surprising just how quickly we began to see hundreds of connections per server across all manner of services.

The data gleaned from these attacks is already filtering down into our main cluster database and being served by our nodes to customers. Looking at the results so far, we think this is going to be a great opportunity to widen our data's field of view and further enhance our inference engine.

Thanks for reading and we hope you all had a great weekend.


Dashboard Exporter

Last month we added a new feature to the dashboard which lets you view your recent positive detections as determined by our API. Here is a screenshot of this feature:

Since we added it we've had some customers ask us for a more convenient export feature. At first we allowed you to download your complete recent detections to a text file, but this is mostly useful for human reading and not easily parsed by computers.

So today we've added a new button, as seen in the screenshot above, called JSON Export. When you click it you'll open the most recent 100 entries in a new tab of your web browser, and you'll notice the URL contains your API Key and a limit variable.

This is what the URL structure looks like:

https://proxycheck.io/dashboard/export/detections/?json=1&limit=100&key=111111-222222-333333-444444

If you don't supply your API Key we'll check your browser for a cookie and session token like we do for log downloads and in-dashboard page browsing.

The point of allowing you to specify your API Key in the request URL is so that you can create software on your side which automatically parses your recent positive detections. For example, perhaps you don't want to set up logging on your side for positive detections and would rather have an overview from us. It's a no-fuss, turn-key solution which will allow you to integrate your positive detections into any control panels you may have on your side.

And we've provided the limit variable so you can specify how many recent entries you want to view. If you set this to 0 or remove it entirely, we will send you your entire positive detection log.
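As an example, here is a minimal sketch in Python of fetching the export programmatically using the URL structure above. The API Key shown is the placeholder from that example URL, and the exact shape of the returned JSON is an assumption, so inspect a real response before relying on any particular field.

```python
# Minimal sketch of pulling recent positive detections from the export endpoint.
# The key below is the placeholder from the example URL; substitute your own.
import requests

EXPORT_URL = "https://proxycheck.io/dashboard/export/detections/"

def fetch_recent_detections(api_key, limit=100):
    # limit=0 (or omitting it) would return the entire positive detection log.
    params = {"json": 1, "limit": limit, "key": api_key}
    response = requests.get(EXPORT_URL, params=params, timeout=10)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    detections = fetch_recent_detections("111111-222222-333333-444444", limit=25)
    print(f"received {len(detections)} entries")
```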

You may notice that the URL for this feature starts in a new /export/ directory. It is our intention to expand the kinds of data you can export to include whitelists and blacklists (and to allow adding entries to these through the API too). We'll also be adding account controls to specify whether these sorts of things can be queried with your API Key alone or whether you only want them accessible from within the dashboard itself.

The new export feature does not count towards your normal API queries, so you are free to query it as much as you need to. If you're querying it very frequently, please use the limit variable to receive only the most recent entries that are relevant to you.

And of course this feature is accessible to all our customers whether you're on our free or paid plans.

Have a great day!


proxycheck.io history, where did we come from?

Today I thought, it's such a nice day, why not reflect on our history and tell the story of how proxycheck.io came to be? I'm often asked how I came up with the service and how it started. Well, it all started back in 2009 as a side project.

Now you're probably thinking: hold on a second, proxycheck.io started in 2016, so how does the story start in 2009? Well, back then I (the owner of proxycheck.io) was operating a chat room, similar to an IRC channel, and I hosted this room on a shared network with hundreds of other chat rooms.

And one day we started being attacked by automated bots that sent random gibberish into all our chat rooms. At first the attackers were using TOR (The Onion Router) to mask their IP Addresses and to get around all the banning we were doing. Then once we found a way to block TOR they started using SOCKS proxies.

So in 2009 I set out to solve this problem and I built a piece of software called Proxy Blocker. It even had a sweet logo:

That link above is to the original thread where I posted the first client software for Proxy Blocker way back in November 2009. At the time the client software would download a list of proxies from my web server each day and then when a user entered a chat room using one of those IP Addresses it would kick them out of the room and ban their IP Address for 24 hours.

Over time it gained a lot of complexity and popularity with the chat channels on the network, picking up features such as cross-channel ban sharing, automatically logging people in once verified as not being proxies, redirecting users to different channels if they weren't proxies and many others.

But the main change, which occurred some time in 2010, was that I switched it from downloading a list of proxy IP Addresses from my server to querying the server directly for each IP encountered. Essentially I had built the first version of proxycheck.io: an API that checked whether an IP Address was operating as a proxy server or not.

From then until 2016 the Proxy Blocker API worked great, and I used it for many other projects, from protecting my forum and other websites to protecting game servers. I also gave the URL for the API out to other developers to use in their own coding projects. But I always had this thought in the back of my head: what if I turned it into a proper service?

I actually nudged a friend of mine, who had helped with Proxy Blocker a few times over the years, to make such a service. I kept telling him it would be a great thing for developers and that he could probably charge for queries to pay for servers and development. He hummed and hawed about it and the service was never made.

So after trying to convince him to do it, I'd actually convinced myself that I should make it instead. Now to be clear, this was 2016 and there were various other proxy checking / blocking APIs available, so I was coming into it last. But I had a lot of experience, having already built Proxy Blocker over the previous 7 years. I had a great head start, and I felt that with my unique perspective from protecting many different kinds of services and hundreds of chat channels I still had a great product developers would want.

And that is how proxycheck.io was born. I bought the domain in 2016, started coding and within a few days I had the API up and answering queries. About 6 months later I started offering paid plans and a customer dashboard. So far things are going very well; the service is profitable, which means all our bills are paid and my time spent coding the service is partially being paid back.

We like to think we're a little bit ahead of the competing services in this space because we're offering things like our cluster architecture, the whitelist, blacklist and query tagging for free to all customers. These are the kinds of features developers want but which take time and knowledge to set up correctly. By having them situated at the API level we remove a lot of complexity for our customers and make our service more attractive, especially to developers who want to get proxies blocked fast without spending a long time on the implementation.

We're loving the response to the service so far. It has been just over a year since we started proxycheck.io, but we've gained a lot of customers and are already handling millions of daily queries. If I have one regret, it's that I didn't start the service earlier!

We hope this blog post was interesting, if you have any questions please feel free to contact us!


Yearly Subscriptions

One of the things our customers said to us when we switched to monthly subscriptions is that some of them just do not want a monthly subscription. They don't want money coming out of their bank account each month, they don't want it on their statements. They'd rather pay for an entire year up front so they don't need to think about it again for a year.

Which is a completely valid perspective; we can understand that. Here at proxycheck.io we have to pay for things like domain names, hosting, password managers and virtual private networks, and we too prefer to pay for a year up front for the same reasons. It also helps that when you pay for a year up front you usually save some money.

So in the interest of choice we've broadened our subscriptions: we now offer both monthly and yearly plans. Essentially we offer a yearly version of every monthly plan with the same query volumes, but if you enter into a yearly subscription you save 20% over holding the equivalent monthly subscription for 12 months.

So you can try the service for a month with the query amount you need, and once you're sure you like the service and it meets your needs you can choose to pay for an entire year up front and save 20%. But if you like the flexibility of paying for the service month-to-month, you can continue to do that too.

We've updated our pricing page to reflect the new plan options, and you can subscribe to a yearly plan from your dashboard right now. We hope you all like the new changes; they are a direct result of your feedback.


New subscription pricing

Since we changed from yearly to monthly subscriptions we've had a lot of feedback from customers who purchased our prior yearly plans. They felt that the new subscriptions were not providing the same value, and they were concerned that when their paid yearly plans ran out they would not be able to afford a monthly subscription with the query amounts they needed.

It was only the holders of our very highest paid tiers that received better value than previously, and they are in the minority; most of our sales were around $120 or below (for an entire year).

So we've listened to your feedback, looked at the numbers and decided that we can lower the monthly prices and create a more linear payment approach that makes sense for smaller developers. Here is our new pricing:

Previously our lowest subscription plan started at $5 per month for 10,000 queries. We now have two plans lower than that: 10K for $1.99 and 20K for $3.99. (Those of you who already subscribed to our monthly plans have been automatically transferred over to the most affordable paid subscription with the same daily query limits you paid for, and you have been refunded the difference.)

Similarly, our most popular middle plan of 80,000 daily queries used to cost $120 a year, but when it became a monthly subscription it became $20 per month, which is $240 a year. With our new prices it becomes just $7.99 a month, which is $95.88 a year: less than half its previous monthly cost and still lower than our previous yearly pricing.

We hope the new pricing will help to ease any fears that the service has become too expensive. Your feedback is invaluable to us; without it we probably would have kept the higher pricing for much longer, and that wouldn't have been a good thing. It's not our intention to shut smaller developers out of our service, we want everyone to be able to protect their service no matter their size.

The new pricing also makes a lot more sense for people on our free tier; it's much easier to accept a jump from FREE to $1.99 than it is to $5. We're not, after all, Netflix or Spotify, and charging $5 for the smallest paid subscription just didn't feel right.

Many people who need just a few more queries than 1,000 are only protecting a hobby, be it a discussion forum, an online computer game or the login forms on their blog. They shouldn't be burdened with paying $5 for something that doesn't make them any money, so we feel the $1.99 price, less than the cost of a coffee a month, is more than enough to satisfy those kinds of needs while still being enough for us to pay for our servers.

Thanks for reading, and as always please feel free to write to us at [email protected], just like many of you already did, which resulted in the lowered pricing we've announced today.


Ocebot update

On July 4th we wrote a post about our new software robot called Ocebot (a combination of the words Ocelot and Bot). Today we'd like to give you some insight into what we discovered as we pore over the past 10 days of data since that post.

Before we get into the data, though, let's just run through the kinds of things Ocebot has been doing (a minimal sketch of this loop follows the list).

  1. Querying the API around once a minute for 10 days straight
  2. Making proxy only and VPN requests
  3. Making queries it already knows the answer to
  4. Making malformed queries to see how the API responds
  5. Forcing the server to take detailed server-side analytics when answering Ocebot queries
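To give a flavour of what such a loop looks like, here is a hypothetical sketch in Python. The endpoint path, query flags and expected answers are assumptions made for illustration; our real tooling is more involved and also triggers the server-side analytics mentioned above.

```python
# Hypothetical sketch of an Ocebot-style test loop. The endpoint path, flags and
# expected answers below are illustrative assumptions, not our production tooling.
import time
import requests

API = "https://proxycheck.io/v1/"           # assumed endpoint format for this sketch

KNOWN_ANSWERS = {
    "8.8.8.8": "no",                        # an address we expect to test negative
}

def timed_query(path, **params):
    start = time.monotonic()
    response = requests.get(API + path, params=params, timeout=10)
    elapsed_ms = (time.monotonic() - start) * 1000
    return response, elapsed_ms

def run_once():
    for ip, expected in KNOWN_ANSWERS.items():
        response, elapsed_ms = timed_query(ip, vpn=1)   # proxy + VPN style request
        print(ip, "expected", expected, "took", round(elapsed_ms, 1), "ms")
    # A deliberately malformed query to record how the API responds to bad input.
    timed_query("not-an-ip-address")

while True:
    run_once()
    time.sleep(60)                          # roughly one pass per minute
```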

So from this we've gleaned a few things. Firstly, the response time of the API is excellent on a positive detection, with most queries being answered under 50ms including network overhead. For negative detections (meaning every single level of check is performed) the average time is 250ms; again this is with network overhead, but without TLS turned on.

The second thing we found is that the response times are very consistent. Our averages aren't changing throughout the day, and we're not seeing much difference between our nodes in the time they take to answer a query, which is a good thing as slow nodes would create inconsistency for our customers.

The third thing we found was some edge cases in our code that could create a high-latency response due to the logging of errors. We're talking in the millisecond range here, but when we're trying to give responses as fast as possible every millisecond counts.

The fourth thing we found was some possible optimisations to our cluster database syncing system. Through the server-side analytics we discovered high CPU usage caused by the encryption of data to be synced to the other nodes in the cluster. Essentially, before we send any data to another node in the cluster through our persistent machine-to-machine data tunnel, we encrypt it with AES-256.

This can be CPU intensive if the data being transferred is always changing and thus requires lots of database updates on other nodes. By looking at the Ocebot data we could see there were a lot of things being synced that didn't need to be: lots of high-activity data alterations that are only really important to the machine handling your API query and are not needed by the other nodes in the cluster.

So what we've done is move some of this data to a local cache on the node handling the request, when that data isn't ever going to be needed by another node.

The other thing we've done concerns data that does need to be shared with other nodes, but not immediately. We've added some granularity to how frequently certain pieces of data are synced so we can benefit from update coalescing, meaning combining multiple smaller database updates into one larger database update that is transferred to other nodes less frequently.
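As a rough illustration of update coalescing, here is a simplified sketch in Python. The class and the flush transport are hypothetical; in reality the batch would be encrypted and sent over the machine-to-machine tunnel described above.

```python
# Simplified, hypothetical sketch of update coalescing: buffer small updates and
# flush them to the other nodes as one larger, less frequent batch.
import threading

def send_to_other_nodes(batch):
    print(f"syncing {len(batch)} coalesced updates")   # stub: real code encrypts and sends

class CoalescingSync:
    def __init__(self, flush_interval=5.0):
        self.pending = {}                 # latest value per key wins
        self.lock = threading.Lock()
        self.flush_interval = flush_interval

    def record_update(self, key, value):
        with self.lock:
            self.pending[key] = value     # overwrite locally instead of syncing immediately

    def flush(self):
        with self.lock:
            batch, self.pending = self.pending, {}
        if batch:
            send_to_other_nodes(batch)
        # Schedule the next flush so updates go out on a fixed cadence.
        timer = threading.Timer(self.flush_interval, self.flush)
        timer.daemon = True
        timer.start()

syncer = CoalescingSync()
syncer.flush()                            # start the flush cycle
syncer.record_update("queries:today", 12345)
```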

By doing it this way we've been able to significantly reduce the CPU usage of our cluster syncing system and thus, hypothetically, increase the nodes' API response throughput in the future when we're closer to full node utilisation.

Our experiments with Ocebot are ongoing, and already we've discovered some incredibly useful information that has directly improved proxycheck.io. Over the next few weeks we will be enhancing Ocebot so it can perform tests on our new Inference Engine, not to judge accuracy but to gauge performance and to make sure it's getting faster at making determinations.

Thanks for reading and have a great day!


Introducing the proxycheck.io inference engine

Prior to today, proxycheck.io's data was scraped from many websites across the globe, the kind that list proxies for sale or for free use. But we've been working on introducing our own inference engine for some time now.

Put simply, this is a type of machine learning where our service gathers information about an IP Address and then, through those evidence-based facts, draws likely conclusions about whether that IP is operating as a proxy server.

At this time we're only putting the positive detections made by the inference engine into our data when it has a confidence level of 100%. In human terms this is the equivalent of an investigator catching a perpetrator in the act of a crime, not making a judgement call or flipping a coin.

We're doing it this way because accuracy is our number one priority; if we're not confident that an IP Address is operating as a proxy server, it's pointless to say it is in our API responses.

The other caveat here is that figuring out if an IP Address is operating as a proxy server or not takes time. The inference engine will get faster over time but to get the kind of extremely accurate detections we care about we have to do the processing after your queries are made.

What this means is that whenever you perform a query on our API that results in a negative detection, that IP Address is placed in a queue to be processed by the inference engine, and if it's determined to be a proxy server it will enter our data. In testing we believe we can accurately process each IP Address within around 5 minutes of the first negative result.
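Here is a hypothetical sketch in Python of that post-query flow. The engine call and data-store functions are stubs for illustration, not proxycheck.io's actual code.

```python
# Hypothetical sketch of the post-query flow: negative detections are queued and
# analysed later, with only 100%-confidence verdicts entering the proxy data.
import queue
import threading

inference_queue = queue.Queue()

def engine_is_certain_proxy(ip):
    return False                               # stub: the real engine gathers evidence

def add_to_proxy_data(ip):
    print(f"{ip} added to proxy data")         # stub for illustration

def handle_api_result(ip, is_proxy):
    if not is_proxy:
        # The customer already has their answer; deeper analysis happens afterwards.
        inference_queue.put(ip)

def inference_worker():
    while True:
        ip = inference_queue.get()
        if engine_is_certain_proxy(ip):
            add_to_proxy_data(ip)
        inference_queue.task_done()

threading.Thread(target=inference_worker, daemon=True).start()
handle_api_result("203.0.113.7", is_proxy=False)
```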

Now obviously, having the IP processed after you've already told us about it and after you've already received a negative result from us isn't that useful to you. But as we're seeing millions of queries a day, and proxy servers are used all over the internet for comment spam, automated signups on forums and click fraud, it means we have been given a giant window from which we can analyse the IP Addresses that matter most.

We could, for example, scan the entire internet address space and detect thousands of proxy servers out of the 4 billion possibilities on IPv4 alone, before we even think about IPv6. But that would be incredibly wasteful of our resources and abusive to the internet at large. By only scanning the addresses that are performing tasks on your services (the same ones proxy servers are used for), we're targeting and training our engine on the data that matters.

During our testing we supplied the engine with 100,000 negative detections from our own API from the past day and we found 0.4% of those addresses to be operating as proxy servers. That's around 400 proxy servers that we previously had no knowledge of that are now detected by our API for the next 90 days minimum.

We're absolutely thrilled by the results, and as our service grows with more developers using the API, the inference engine will become a major source of proxy data for us. At the moment we have two versions: a static, non-learning version which is in production with total confidence from us and zero false positives.

And then we also have a development version which works from the same data as the production version but with learning enabled; results from the development version are not saved into our production database. So over time our inference engine's detection rate will rise from the current 0.4% as it becomes more intelligent through iterative machine learning.

Thanks for reading, we hope you enjoyed this post. If you're watching your API responses, look out for proxy: Yes and type: Inference Engine!


Live Chat and Bug Fixes

Live Chat!

As of a few days ago we're now featuring a live support chat feature on all our webpages. The reason we've done this is so that you can get instant support without needing to use Skype or iMessage.

The best part of our live chat is that it's manned by our developers; we're not outsourcing the support chat. This means you can receive not just pre-sales information from the chat but also account- and payment-level support. We can handle any query through the live chat that previously you would have needed to use our Skype, iMessage or email support for.

But of course the new live chat is optional; we're still offering email, Skype and iMessage support and that's not changing.

Bug Fixing

The other bit of news we wanted to discuss is our Dashboard. A few weeks ago we altered the way the Dashboard is handled server side to make it more secure, but this had some unintended negative effects which didn't show up in our testing. They were mostly just niggly bugs, for example:

  1. Setting/Changing your Password or API Key logged you out of the dashboard after performing the changes
  2. No email was sent if you changed your password (but one was sent if a password was set)
  3. Some errors were not handled correctly causing blank pages

So yesterday we did a full audit of the Dashboard code and tested every feature within it. We found numerous minor issues, mostly visual bugs after certain requests were made. We went to work on all of these problems and solved all of the ones listed above.

We also finally added an account recovery feature which enables you to generate a new password for your account in a secure way in the event you lose access due to a lost password. This has been a planned feature since the moment we added password security to accounts, but we have been working mostly on the API and other new features like account stats, blacklist/whitelist support and so forth.

As of two months ago we have a proper priority-based development ledger which maintains a list of all the features and bugs we still have to implement or fix. The ledger prioritises bugs, and as of this post we have cleared all the bugs it had listed. If you come across any bugs please shoot us an email or a live chat message and we'll get right on them!

For a little insight into what we're working on next: it's our email notices. At the moment they are quite inconsistent in their layout and wording, and we intend to unify all of our emails in appearance.

Thanks for reading and have a great day!


New server ATLAS added to our server cluster

Today we've added a new node to our cluster: a dedicated server we've named ATLAS. This new server is already serving your queries and is viewable on our service status page.

It is our aim for proxycheck.io to always be accessible, which means our goal is to never have any downtime. We're mitigating the risk of downtime by renting servers not only in different data centers but also with different hosting companies and in different countries, and next year we aim to have servers on entirely different continents.

Currently we have PROMETHEUS in the United Kingdom, HELIOS in Germany and now ATLAS in France. The next time we discuss nodes we hope to have a server operational in North America.

As always, our cluster operates transparently to users. You do not need to specify which node your traffic goes to; your queries are routed automatically by us, and our cluster is used to answer all queries, not just paid queries but free ones too.

With the volume of queries we're receiving per day reaching into the millions, we decided to add a third server sooner rather than later. Not because we're maxing out the servers we already had (we were not close to that point due to our efficient API backend), but because we felt it was important to add more redundancy to the cluster as our customer base grows.

We hope you're finding our blog posts interesting; it's an enjoyable way to tell our story and inform people about what we're up to. The service is constantly being worked on behind the scenes, and although you may notice some visual changes to the site here or there, it's mostly the things you don't see which are being worked on most of all.

Thanks for reading and have a great day!

