blog / proxycheck.io

Since we built our inference engine we've been having it examine addresses from our negative detections to find proxies that we missed. We're still doing that, but we've hit a snag: the engine is now so fast at making determinations that we no longer have a backlog of negative detections for it to work through.

Even with millions of daily queries, the majority of which are negative detections, the engine gets through them very quickly. As a result our learning system has started to slow down. It's not iterating on itself as often as it once did because its efficiency has left it short of data to process. Essentially we've bottlenecked its learning capability by not providing enough data.

So we've decided to expand our sources of IP data to further feed the machine learning algorithm behind our inference engine. To accomplish this we have begun renting 20 VPSs around the world which will act as honeypots. Thankfully, low-specification VPSs that are perfect for this role are very cheap. In fact, all 20 of the VPSs we're now renting together cost the same as our new ATLAS node, which is great value.

The way it works is simple. We've created a Linux distribution we're calling Honeybot (just a casual name) which contains various emulated services left wide open to the internet. Think web servers with admin login forms, plus SSH, FTP, email, Telnet, RDP, VNC servers and so on. We're currently set up to emulate (with accurate handshakes) more than 120 services.
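To give a rough idea of what an emulated service looks like, here is a minimal Python sketch of a fake SSH listener. It is not Honeybot's actual code. It only sends an SSH identification string and records the connecting IP address, whereas a real emulation would carry the handshake much further. The banner text and the function names `serve_fake_ssh` and `start_honeypot` are assumptions made up for this illustration.

```python
import socket
import threading

# Hypothetical banner in the standard SSH identification format.
# A realistic emulation would continue the protocol handshake after this.
FAKE_SSH_BANNER = b"SSH-2.0-OpenSSH_8.9\r\n"


def serve_fake_ssh(server_sock, seen_ips, max_connections=1):
    """Accept connections, record each peer IP and send the fake banner."""
    for _ in range(max_connections):
        conn, (ip, _port) = server_sock.accept()
        with conn:
            seen_ips.append(ip)
            conn.sendall(FAKE_SSH_BANNER)


def start_honeypot(host="127.0.0.1", port=0, max_connections=1):
    """Start the listener in a background thread.

    Port 0 asks the operating system for any free port, which keeps the
    sketch safe to run locally. Returns the socket, the shared list of
    seen IPs and the worker thread.
    """
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_sock.bind((host, port))
    server_sock.listen()
    seen_ips = []
    thread = threading.Thread(
        target=serve_fake_ssh,
        args=(server_sock, seen_ips, max_connections),
        daemon=True,
    )
    thread.start()
    return server_sock, seen_ips, thread
```

Calling `start_honeypot()` and connecting to the port reported by `server_sock.getsockname()[1]` returns the banner to the client and appends the client's address to `seen_ips`.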

All of the honeypots also run a mini version of our cluster database software, which lets our main nodes retrieve data from the honeypot servers and process it with our inference engine.

To be clear, we're not simply adding every IP address that touches these servers to our proxy database. Some IP addresses, such as ones specifically trying to gain access to VNC, FTP or SSH by brute forcing login credentials, will be added to our proxy database straight away. Ones that are not so obvious will be processed by our production inference engine, which is the one that has learning turned off.

A mirror set of all the IP addresses touching our honeypots will then be tested on our learning inference engine, even the ones we know are being used for attacks. The hope here is that we can find common characteristics amongst these addresses that the inference engine can use to better detect proxies itself in the future.
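The routing described above can be sketched roughly like this in Python. It is an illustration only, not our production code. The `HoneypotEvent` and `TriageResult` types, the `triage` function and the `brute_force_threshold` of 3 attempts are all assumptions made up for the example, and the IP addresses used with it should come from the reserved documentation ranges.

```python
from dataclasses import dataclass, field

# Services where repeated login attempts are treated as obvious brute forcing.
BRUTE_FORCE_SERVICES = {"vnc", "ftp", "ssh"}


@dataclass
class HoneypotEvent:
    ip: str
    service: str          # e.g. "ssh", "http"
    login_attempts: int   # credential attempts seen during the session


@dataclass
class TriageResult:
    proxy_database: set = field(default_factory=set)      # added straight away
    production_queue: list = field(default_factory=list)  # learning turned off
    learning_queue: list = field(default_factory=list)    # mirror of every IP


def triage(events, brute_force_threshold=3):
    """Split honeypot events into the three destinations.

    The threshold is a hypothetical cut-off for what counts as "obvious".
    """
    result = TriageResult()
    for event in events:
        # Every address is mirrored to the learning engine, attackers included.
        if event.ip not in result.learning_queue:
            result.learning_queue.append(event.ip)

        obvious = (
            event.service in BRUTE_FORCE_SERVICES
            and event.login_attempts >= brute_force_threshold
        )
        if obvious:
            result.proxy_database.add(event.ip)
        elif event.ip not in result.production_queue:
            result.production_queue.append(event.ip)

    # An address flagged as obvious later on no longer needs a production check.
    result.production_queue = [
        ip for ip in result.production_queue if ip not in result.proxy_database
    ]
    return result
```

Passing a list of `HoneypotEvent` objects to `triage` returns the three collections, with obvious brute-force addresses kept out of the production queue but still present in the learning queue.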

We enabled our new honeypots running our Honeybot distro this morning after much testing yesterday, and already we're seeing huge volumes of attack traffic. It's quite surprising just how quickly we began to see hundreds of connections per server on all manner of services.

The data gleaned from these attacks is already filtering down into our main cluster database and being served by our nodes to customers. Looking at the results so far, we think this is going to be a great opportunity to widen our data's field of view and further enhance our inference engine.

Thanks for reading and we hope you all had a great weekend.