Teaching an old Raven new tricks

Image description

If you've been reading our blog for the past few years you may have seen a post we made in December 2019 where we detailed our inference engine called Raven. This is the software we created that runs not only on each of our nodes for real-time inference but also on separate dedicated hardware tailored specifically for post-processing inference.

Since that post we've changed where and how Raven functions. We broke up our single dedicated inference server (STYX) into many separate servers and repurposed STYX as a distributor of work instead of a processor for Raven. We had to do this due to the service becoming so popular we couldn't process the volume of addresses we were receiving in a time frame that made sense.

This month we've been hard at work improving Raven and the associated infrastructure that supports it. We've reached a scale where traditional databases, storage systems and networking are not scaling for our use case, we want to be able to process many more addresses per second and in a more thorough way which requires more resources at every link in the chain from the way addresses are collected, transported through our infrastructure, processed and delivered back to our cluster nodes.

To this end we've completely changed how addresses are collected from our cluster, it's now multithreaded and scales seamlessly to the volume of data waiting to be picked up. We're also now storing addresses in a high-performance in-memory database served by MariaDB. We're seeing very high transaction throughput combined with extremely low CPU utilisation from MariaDB and in-fact this one change from our prior custom solution reduced CPU usage from 97% to 30% on our work distribution server.

But that's not all, Raven for us is more than just a data analysis tool, it also includes what we call agents which allow it to be extended with plugins that serve as data collectors and data formatters. Essentially a way to feed Raven auxiliary data through a multitude of means. For instance processing firewall logs from our data partners or even agents that probe addresses directly to see if they're running proxy servers.

That last agent we mentioned that probes addresses directly has become a very important tool for Raven because it provides conclusive evidence which helps to reinforce its prior conclusions and thus help it to make better decisions in the future. Another advantage of this particular agent is its ability to find new proxies from where we have no data. This is important because we, like all anti-proxy services, operate a network of scrapers which scour websites that publish proxy and vpn addresses in an attempt to collect as much data about bad addresses as possible.

The problem however is many of these websites have data that overlaps with one another and so there is not many sites publishing proxies that we don't already know about. We spend a lot of time locating new sites and often even if they list thousands of addresses as being seen within the past several minutes we already detect 99.9% to 100% of them. So the ability to seek out unique addresses that have never been published publicly is important if we want to have a full picture which is certainly our goal.

And indeed we do find many unique proxies on our own, in-fact we find hundreds of unique proxies daily that have never and in some cases will never be listed on publicly accessible proxy indexing websites. With how important this agent is to our service we spent the last few days rewriting it to be faster and smarter. We've come up with some subnet searching algorithms that increase the chances of finding bad addresses without needing to scan an entire service providers address range in addition to some other improvements that we're going to keep close to our chest for now due to their trade secret value.

The last piece of the puzzle has been iterating on Ravens inference models. In the past we would collect a subset of important decisions and their outcomes to train Raven. It would actually almost take a month each time. But we've been able to improve the training time by breaking up the data into smaller units which can be iterated on across different computers. In addition to that we upgraded our main workstation that we would traditionally compute these models on which has cut the training time in half. We're now able to produce a new model in 8 days down from the 26 days it took previously which is a significant improvement that allows us to tweak Raven more often.

So that's what we wanted to share with you today. If you often monitor our threats page which is where we post unique proxies we've found that haven't been seen on indexing websites before you may notice a vast increase in the postings over the past 2 days. This is going to continue to ramp up as we further tweak the new software and find the right balance between detection rate and processing throughput.

Thanks for reading and have a great week!