Degraded Performance Yesterday and Mitigations

Yesterday, between 7:30 PM and 10:30 PM GMT, we experienced highly degraded service, with many of your queries taking up to 2.5 seconds to be answered, if they were not dropped entirely.

This was due to a sustained attack against our infrastructure. In this case we were not the initial target of the attack; we were dragged into it because one of our customers uses our service to protect their game server. The individual[s] attacking that customer turned their attacks on us to degrade the service level we were providing, so that their attacks on our customer's game server would be more effective.

The traffic we received was 9.5x higher than we would normally experience and was tuned for maximum resource depletion. Although our service did not go down completely and normal service resumed immediately once the attack stopped, we did suffer severe disruption, which we intend to mitigate with two changes we have enabled today.

Firstly, we're adjusting our per-second request limiter. Previously it allowed you to make between 100 and 125 requests per second, with a resolution of 1 second (per node). We're changing the resolution to 10 seconds. The per-second limit stays the same, but the wider window lets our servers ignore bad requests for a longer period of time, which helps smooth out the kind of per-second peak loads that denial of service attacks create.
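To make the change concrete, here is a minimal sketch of how a fixed-window limiter with a 10-second resolution could behave. This is not our actual implementation; the limit value, client identifiers, and Go code are purely illustrative, with the per-second allowance simply multiplied out across the longer window.

```go
package main

import (
	"sync"
	"time"
)

// windowLimiter is a fixed-window rate limiter. With a 10-second window, a
// client that exceeds its budget stays blocked until the window rolls over,
// so a burst is ignored for up to 10 seconds rather than a single second.
type windowLimiter struct {
	mu       sync.Mutex
	counts   map[string]int // requests seen per client in the current window
	windowAt time.Time      // start of the current window
	window   time.Duration  // resolution, e.g. 10 seconds
	limit    int            // allowed requests per window
}

func newWindowLimiter(perSecond int, window time.Duration) *windowLimiter {
	return &windowLimiter{
		counts:   make(map[string]int),
		windowAt: time.Now(),
		window:   window,
		limit:    perSecond * int(window/time.Second),
	}
}

// Allow reports whether the client may make another request in this window.
func (l *windowLimiter) Allow(client string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	if now.Sub(l.windowAt) >= l.window {
		// Window has rolled over: reset every client's counter.
		l.windowAt = now
		l.counts = make(map[string]int)
	}
	l.counts[client]++
	return l.counts[client] <= l.limit
}

func main() {
	// Illustrative figures only: a 100 req/s allowance tracked over a 10s window.
	limiter := newWindowLimiter(100, 10*time.Second)
	_ = limiter.Allow("customer-api-key")
}
```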

Secondly, we've enabled request caching at our edge CDN (Content Delivery Network). Every unique request you make will now be cached for 10 seconds. The cache is per-customer, so you will never receive cached content generated by another customer. The main benefit is that the same IP address can be checked multiple times by a single customer without incurring repeated requests to our origin servers.
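For illustration, here is a rough sketch of how a per-customer cache with a short TTL behaves. This is not our real CDN configuration; the key format, TTL, endpoint path, and response bodies are assumptions made up for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// edgeCache is a minimal per-customer response cache with a short TTL. Keys
// combine the customer and the exact request, so one customer never sees
// another customer's cached result.
type edgeCache struct {
	mu      sync.Mutex
	entries map[string]cacheEntry
	ttl     time.Duration
}

type cacheEntry struct {
	response string
	storedAt time.Time
}

func newEdgeCache(ttl time.Duration) *edgeCache {
	return &edgeCache{entries: make(map[string]cacheEntry), ttl: ttl}
}

// Get returns a cached response if the same customer made the same request
// within the TTL; repeated lookups of one IP address never reach the origin.
func (c *edgeCache) Get(customer, request string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	entry, ok := c.entries[customer+"|"+request]
	if !ok || time.Since(entry.storedAt) > c.ttl {
		return "", false
	}
	return entry.response, true
}

// Put stores a freshly generated response for this customer and request.
func (c *edgeCache) Put(customer, request, response string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[customer+"|"+request] = cacheEntry{response: response, storedAt: time.Now()}
}

func main() {
	cache := newEdgeCache(10 * time.Second)
	cache.Put("customer-a", "/v1/check/198.51.100.7", `{"proxy":"no"}`)
	if resp, ok := cache.Get("customer-a", "/v1/check/198.51.100.7"); ok {
		fmt.Println("served from cache:", resp)
	}
	// A different customer asking about the same address misses the cache.
	if _, ok := cache.Get("customer-b", "/v1/check/198.51.100.7"); !ok {
		fmt.Println("customer-b gets a fresh, uncached answer")
	}
}
```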

We've made this second change because when our customers suffer DDoS attacks themselves, they often send the same individual IP addresses to our API thousands of times a minute, which exhausts their query plans and creates undue load on our servers answering the same queries over and over.

We're hopeful that both mitigations will help with future attacks, but as always we will monitor the situation closely and alter our strategy as we see fit. We're also planning to add more servers to the cluster to further load balance this kind of peak traffic.

Thanks for reading.
