Building a better Detective


One of the challenges a service like ours faces is the existence of anonymising services that go out of their way to obscure their infrastructure. So while it's easy to detect most addresses and their suppliers, there's always a small percentage that slips by.

Which is why this past month we created a list of these difficult-to-index suppliers and went about building tools tailored specifically to scan and verify the addresses they offer. Traditionally, when we want to scan a provider, they will offer a webpage of addresses or hostnames, which makes it easy to scan them and correlate what we find across our honeypot network and other scraped websites. This is part of our collect and verify strategy.

But some of these, let's say, hardened providers will mask their addresses behind signup pages, paid memberships or other means. For instance, it's becoming very common for VPN providers to only show you their server addresses once you've signed up and paid for service, and with hundreds of VPN suppliers out there, paying for all those subscriptions isn't really commercially viable.

But even the free providers are becoming more shrewd. Some insert randomly generated addresses within their legitimate address pools to thwart page scraping. Others only show you addresses once you verify you're not a bot by solving a captcha, or require a JavaScript engine to decode the addresses before they're rendered on the webpage.

All of these are things we worked to solve this month with what we're calling our Detective. It's a new module within our custom scraping engine which allows for a lot more thought during collection and processing. The results have been quite promising, with our list of detected proxies and virtual private networks steadily increasing since it went live.

Some of its features include:

  1. Web and non-web collection, covering anonymising services that only offer an application for accessing their network of servers.
  2. A JavaScript engine for solving any kind of proof-of-browser anti-bot measures during address collection.
  3. Captcha solving support using image recognition, with a fallback to human-based solving.
  4. Discarding of bad, fake or generated addresses through time-based observation and frequency of appearance.
  5. Pattern recognition for indexing VPN providers' infrastructure based on a few hand-entered sample hostnames.
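
To give a rough idea of what items 4 and 5 can look like in practice, here is a minimal Python sketch. It is not the Detective's actual code. The function names, the 60% threshold and the example hostnames are all made up for illustration, and it assumes decoy addresses change between page loads while real servers keep reappearing.

```python
import re
from collections import defaultdict


def filter_stable_addresses(snapshots, min_ratio=0.6):
    """Keep addresses seen in at least min_ratio of the scrape snapshots.

    Assumption: generated decoys rarely repeat between scrapes, while
    genuine servers show up again and again over time.
    """
    if not snapshots:
        return set()
    counts = defaultdict(int)
    for snapshot in snapshots:
        for address in set(snapshot):  # count each address once per snapshot
            counts[address] += 1
    threshold = min_ratio * len(snapshots)
    return {addr for addr, seen in counts.items() if seen >= threshold}


def hostname_pattern(samples):
    """Generalise hand-entered sample hostnames into a single regex.

    Hypothetical approach: every run of digits becomes \\d+, everything
    else is kept as literal text.
    """
    templates = set()
    for host in samples:
        parts = re.split(r"(\d+)", host.lower())
        # Odd-indexed parts are the captured digit runs, the rest is literal.
        templates.add("".join(
            r"\d+" if i % 2 else re.escape(part)
            for i, part in enumerate(parts)
        ))
    return re.compile("^(?:" + "|".join(sorted(templates)) + ")$")


# Three scrapes of the same page: two addresses persist, one changes each time.
snapshots = [
    ["198.51.100.7", "203.0.113.9", "192.0.2.44"],
    ["198.51.100.7", "203.0.113.9", "192.0.2.181"],
    ["198.51.100.7", "203.0.113.9", "192.0.2.23"],
]
stable = filter_stable_addresses(snapshots)

# Two sample hostnames are enough to recognise their numbered siblings.
pattern = hostname_pattern(["us-nyc-01.vpn.example", "us-nyc-02.vpn.example"])
is_sibling = bool(pattern.match("us-nyc-17.vpn.example"))
```

In this toy run the two addresses that appear in every scrape survive and the three one-off addresses are dropped, while the hostname pattern matches other numbered hosts under the same naming scheme but not hosts that follow a different scheme. A real system would need longer observation windows and more careful generalisation than simply swapping out digits.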

There are some very well-known providers, constantly being abused, that have employed one or more of the above tactics to make it difficult for services like our own to get a full picture of their infrastructure. The new system we've devised has been able to break all of these approaches.

As always, if you've come across an address, range or service provider we don't yet detect, please contact us. We really do investigate every lead sent to us by customers.

Thanks for reading and have a great week!

