Prior to today, proxycheck.io's data was scraped from many websites across the globe, the kind that list proxies for sale or for free use. But we've been working on introducing our own inference engine for some time now.
Put simply, this is a type of machine learning where our service gathers information about an IP Address and then, through those evidence-based facts, draws likely conclusions about whether that IP is operating as a proxy server.
At this time, we're only putting positive detections made by the inference engine into our data when it has a confidence level of 100%. In human terms, this is the equivalent of an investigator catching a perpetrator in the act of a crime, not making a judgement call or flipping a coin.
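To give a rough idea of what that gate looks like in practice, here's a minimal sketch; the names here are illustrative stand-ins, not our actual internals:

```python
# A rough sketch of the confidence gate described above. The names are
# illustrative, not our actual internals.
CONFIDENCE_THRESHOLD = 1.0  # only 100%-confident positives are committed

def commit_detection(ip, is_proxy, confidence, dataset):
    """Record an IP as a proxy only when the engine is completely certain."""
    if is_proxy and confidence >= CONFIDENCE_THRESHOLD:
        dataset.add(ip)
```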
We're doing it this way because accuracy is our number one priority: if we're not confident that an IP Address is operating as a proxy server, it's pointless to say it is in our API responses.
The other caveat here is that figuring out whether an IP Address is operating as a proxy server takes time. The inference engine will get faster over time, but to achieve the kind of extremely accurate detections we care about, we have to do the processing after your queries are made.
What this means is that whenever you perform a query on our API that results in a negative detection, that IP Address is placed in a queue to be processed by the inference engine, and if it's determined to be a proxy server it will enter our data. In testing, we believe we can accurately process each IP Address within around 5 minutes of the first negative result.
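In simplified form, the flow looks roughly like the sketch below; every name here is a hypothetical stand-in rather than our production code:

```python
import queue

# A simplified sketch of the query-then-analyse flow described above.
pending = queue.Queue()   # negative detections awaiting analysis
known_proxies = set()     # the proxy data our API answers from

def handle_query(ip):
    """Answer an API query immediately; queue unknown IPs for inference."""
    if ip in known_proxies:
        return True       # positive detection from existing data
    pending.put(ip)       # negative result: schedule an inference pass
    return False

def run_inference(ip):
    """Placeholder for the engine itself, which takes around 5 minutes per IP."""
    raise NotImplementedError("the real analysis is internal to proxycheck.io")

def inference_worker():
    """Background worker that promotes certain detections into our data."""
    while True:
        ip = pending.get()
        is_proxy, confidence = run_inference(ip)
        if is_proxy and confidence >= 1.0:
            known_proxies.add(ip)   # future queries now return a positive
        pending.task_done()
```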
Now obviously, having the IP processed after you've already told us about it, and after you've already received a negative result from us, isn't that useful to you. But as we're seeing millions of queries a day, and proxy servers are used all over the internet for comment spam, automated signups on forums and click fraud, we've been given a giant window through which we can analyse the IP Addresses that matter most.
We could, for example, scan the entire internet address space and detect thousands of proxy servers among the 4 billion possibilities on IPv4 alone, before we even think about IPv6. But that would be incredibly wasteful of our resources and abusive to the internet at large. By only scanning the addresses that are performing tasks on your services (the same tasks proxy servers are used for), we're targeting and training our engine on the data that matters.
During our testing, we supplied the engine with 100,000 negative detections from our own API over the past day and found 0.4% of those addresses to be operating as proxy servers. That's around 400 proxy servers we previously had no knowledge of, which are now detected by our API for a minimum of 90 days.
We're absolutely thrilled by the results, and as our service grows with more developers using the API, the inference engine will become a major source of proxy data for us. At the moment we have two versions: a static, non-learning version, which is in production with our total confidence and zero false positives, and a development version, which works from the same data as the production version but with learning enabled; results from the development version are not saved into our production database. Over time, our inference engine's detection rate will rise from the current 0.4% as it becomes more intelligent through iterative machine learning.
Thanks for reading, we hope you enjoyed this post. If you're watching your API responses, look out for proxy: Yes and type: Inference Engine!
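For reference, checking for those fields from a script can be as simple as the sketch below. The endpoint and exact field casing shown are illustrative, so refer to our API documentation for the authoritative request and response format:

```python
import json
import urllib.request

# Example of checking a response for an inference engine detection. The
# endpoint and field casing here are illustrative; see the API docs for
# the authoritative format.
ip = "198.51.100.23"  # placeholder address from a documentation range
with urllib.request.urlopen("https://proxycheck.io/v2/" + ip) as response:
    result = json.load(response)

# A positive made by the inference engine would look roughly like:
#   {"status": "ok", "198.51.100.23": {"proxy": "yes", "type": "Inference Engine"}}
entry = result.get(ip, {})
if str(entry.get("proxy", "")).lower() == "yes":
    print(ip, "flagged as a proxy via", entry.get("type"))
```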