What happened on October 19th?

If you visited your Dashboard yesterday, you may have seen a notice at the top explaining that we had a serious server failure on our HELIOS node which caused many stats-related issues. Today we'll explain this very unusual failure and what we learned from it.

To begin with, HELIOS has been our longest-serving node. We have had that server for many years and it has suffered some hardware failures in the past, including two failed hard disks. Yesterday's failure was the most difficult type to deal with from a programmer's perspective: bad memory. To fix it we replaced the motherboard, CPU and memory, so HELIOS is effectively a new server.

When writing any software you are building on a foundation of truths, and what is held in the computer's memory is something you have to trust, because that is where all of your software actually lives. It's very difficult to program a system to self-diagnose a memory issue when the self-diagnosis tool itself will likely be affected by the same memory problems.

And that is exactly what happened here. Our system is designed to remove malfunctioning nodes from the cluster, but in this case HELIOS's bad memory caused it to keep re-asserting itself. It even tried to remove other nodes from the cluster, believing they were malfunctioning, because its own verification systems were so broken that it interpreted their valid health responses as invalid.

This affected our stats processing because, to keep our cluster database coherent and to prevent conflicts caused by multiple nodes processing the same data at the same time, we use an election process: every so often the nodes hold a vote and one healthy node is selected to process all of the statistics for a given time period. Due to HELIOS's memory issues, this voting process did not work as intended.
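
For the technically curious, here is a minimal sketch of how an election like this can be tallied. It is illustrative only, not our actual code; the node names and function names are hypothetical stand-ins.

```python
# Illustrative sketch of tallying a stats-processor election.
# Node names and all identifiers here are hypothetical, not our real code.

def tally_election(ballots: dict[str, str], healthy: set[str]) -> str | None:
    """Count ballots cast by healthy nodes for healthy candidates; the
    candidate with a strict majority becomes the sole node that processes
    statistics for this time window. Returns None if there is no majority."""
    counts: dict[str, int] = {}
    for voter, candidate in ballots.items():
        if voter in healthy and candidate in healthy:
            counts[candidate] = counts.get(candidate, 0) + 1
    if not counts:
        return None
    winner, votes = max(counts.items(), key=lambda kv: kv[1])
    eligible = len(healthy & set(ballots))
    return winner if votes > eligible / 2 else None

# A healthy cluster agrees on one processor per window...
print(tally_election({"HELIOS": "ATLAS", "ATLAS": "ATLAS", "ORION": "ATLAS"},
                     healthy={"HELIOS", "ATLAS", "ORION"}))  # -> "ATLAS"
# ...but a node voting on corrupted data can deny that majority.
print(tally_election({"HELIOS": "HELIOS", "ATLAS": "ATLAS", "ORION": "ORION"},
                     healthy={"HELIOS", "ATLAS", "ORION"}))  # -> None
```

That second case is essentially what happened: a node working from corrupted data can prevent the rest of the cluster from reaching a clean decision.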

What we learned is that we needed a better way to completely lock out malfunctioning nodes from the cluster, and we needed more points of reference for nodes to self-diagnose issues, preferably breaking themselves completely when they discover problems that require human intervention instead of continuing to harm the cluster by remaining within it.

Today we think we've accomplished both of these goals. Firstly, we've set up many references in our health checks for self-diagnosis that weren't there before. This isn't a foolproof solution, but if any of the references are corrupted it shouldn't allow the node's built-in self-management system to start arguing with the cluster and voting other nodes offline; or, at least, if the node still retains the capability to vote, it should neuter itself before attempting to vote on other nodes' health status.
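
As a rough illustration of what we mean by "references", here is a minimal sketch of the idea. The names are entirely hypothetical and our real checks are far more extensive.

```python
# Illustrative sketch of reference-based self-diagnosis; names are hypothetical.
import json

def reference_checks() -> dict[str, bool]:
    """Recompute values a healthy machine must always get right and compare
    them against known-good answers. On a box with corrupted memory, some of
    these start failing, which is the signal to stand down."""
    return {
        "arithmetic": sum(range(1, 101)) == 5050,
        "ordering": sorted([3, 1, 2]) == [1, 2, 3],
        "round_trip": json.loads(json.dumps({"node": "example"})) == {"node": "example"},
    }

def may_vote_on_cluster_health() -> bool:
    """A node keeps its right to vote on other nodes' health only while every
    reference check passes; otherwise it neuters itself."""
    return all(reference_checks().values())

if not may_vote_on_cluster_health():
    # Stop participating rather than argue with the cluster.
    raise SystemExit("self-diagnosis failed; withdrawing from cluster votes")
```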

Secondly, we've broadened our nodes' ability to lock out bad nodes by revoking the tokens needed to be part of the cluster group. This means good servers, acting with a consensus, can remove the "passwords" a malfunctioning node requires to access the cluster.
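
Conceptually it works something like the sketch below; the token values, node names and function are hypothetical stand-ins for the real mechanism.

```python
# Illustrative sketch of consensus-based token revocation; everything here
# (names, tokens, thresholds) is a hypothetical stand-in.

cluster_tokens = {
    "HELIOS": "tok-helios-example",
    "ATLAS": "tok-atlas-example",
    "ORION": "tok-orion-example",
}

def revoke_if_consensus(target: str, votes_to_revoke: set[str], members: set[str]) -> bool:
    """Revoke a node's cluster token only when a majority of the *other*
    members agree. Without its token the node can no longer authenticate,
    so it cannot rejoin or vote no matter how insistently it re-asserts itself."""
    voters = members - {target}
    if len(votes_to_revoke & voters) > len(voters) / 2:
        cluster_tokens.pop(target, None)
        return True
    return False

members = {"HELIOS", "ATLAS", "ORION"}
revoke_if_consensus("HELIOS", votes_to_revoke={"ATLAS", "ORION"}, members=members)
print("HELIOS" in cluster_tokens)  # False: its "password" to the cluster is gone
```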

A third change we've made is letting known-good nodes act faster when they are removed from the cluster while still functional, by allowing them to initiate a confidence vote amongst the other nodes. This can happen within just a few seconds of removal if the node believes it is working correctly. Only nodes with perfect health scores over the past 3 minutes are allowed to vote in these decisions, to reduce false positives caused by malfunctioning nodes.
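
Sketched in code, the eligibility rule and the vote look roughly like this. Only the 3-minute window comes from the rule described above; everything else is a hypothetical illustration.

```python
# Illustrative sketch of a confidence vote; only the 3-minute eligibility
# window reflects the real rule, all other names and details are hypothetical.
HEALTH_WINDOW_SECONDS = 3 * 60

def eligible_voters(health_history: dict[str, list[tuple[float, bool]]],
                    now: float) -> set[str]:
    """Only nodes whose every health sample in the last 3 minutes passed may
    vote, which keeps malfunctioning nodes out of the decision."""
    voters = set()
    for node, samples in health_history.items():
        recent = [ok for ts, ok in samples if now - ts <= HEALTH_WINDOW_SECONDS]
        if recent and all(recent):
            voters.add(node)
    return voters

def confidence_vote(candidate: str, ballots: dict[str, bool],
                    voters: set[str]) -> bool:
    """The removed node is readmitted only if a majority of eligible voters
    (excluding the candidate itself) agree it looks healthy from the outside."""
    counted = {n: ok for n, ok in ballots.items() if n in voters and n != candidate}
    return sum(counted.values()) > len(counted) / 2 if counted else False
```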

We should also mention that although we only have three nodes listed in the cluster, there are in fact five nodes. Two of them do not accept queries and are not front-facing; instead they work behind the scenes to monitor health, settle vote disputes and step in under another node's name if there is a serious enough issue to warrant that.
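
For a rough picture of that layout, a role map might look like the sketch below. Apart from HELIOS, the node names here are placeholders rather than our real hostnames.

```python
# Illustrative role map only; apart from HELIOS, node names are placeholders.
CLUSTER_ROLES = {
    "HELIOS": {"front_facing": True, "accepts_queries": True},
    "ATLAS": {"front_facing": True, "accepts_queries": True},
    "ORION": {"front_facing": True, "accepts_queries": True},
    # The two hidden nodes take no queries; they monitor health, settle vote
    # disputes, and can step in under another node's name when necessary.
    "ARBITER-1": {"front_facing": False, "accepts_queries": False},
    "ARBITER-2": {"front_facing": False, "accepts_queries": False},
}
```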

We are of course disappointed that this failure occurred. Many of you contacted support yesterday via live chat to express your concerns, and we're very sorry this happened. We're especially sorry to those of you who received overage notices due to the invalid query amounts that accumulated on your accounts, and we hope you can accept our sincere apology for that. Our hope is that with these changes something like this will never happen again.

Thanks for reading and we hope everyone has a great weekend.

