Today we wanted to talk about some changes we've been making behind the scenes to help us scale to meet the needs of our growing customer base. You may have noticed over the past week that the customer dashboard has not been working as reliably as it should, with changes made within your account not synchronising across all of our infrastructure in a timely manner.
We've also had issues with new users signing up only for their accounts not to be activated by our internal systems, leaving them unable to log in and use their accounts. Both of these issues stem from the same underlying problem: unreliable synchronisation of data amongst our nodes for our highest-traffic database operations.
This includes things like signing up, changing any setting in your account, and creating or modifying user-generated content (custom lists, custom rules, CORS domain entries, etc.). We put these operations in a fast-track lane so that the Dashboard always feels snappy, even though your requests may be served by different servers in our cluster during a single session.
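To give a rough idea of what we mean by a fast-track lane, here's a minimal sketch in Python of draining high-priority account changes to peer nodes ahead of bulk data. The names and structure are made up for illustration; this isn't our actual replication code.

```python
import heapq
import itertools

# Hypothetical priority lanes: account/dashboard writes take the fast lane,
# bulk data (e.g. detection logs) takes the slow lane.
FAST, BULK = 0, 1

class ReplicationQueue:
    """Outbound changes for a peer node; fast-lane changes are always
    drained before any bulk changes, so dashboard edits propagate promptly
    even under heavy background traffic."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a lane

    def enqueue(self, change, lane=BULK):
        heapq.heappush(self._heap, (lane, next(self._counter), change))

    def drain(self, send):
        while self._heap:
            _lane, _, change = heapq.heappop(self._heap)
            send(change)

# Example usage (all names are illustrative):
queue = ReplicationQueue()
queue.enqueue({"table": "detection_logs", "rows": 50_000}, lane=BULK)
queue.enqueue({"table": "account_settings", "user": 42, "key": "theme"}, lane=FAST)
queue.drain(send=lambda change: print("replicating", change))
```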
But as we've grown, more and more traffic has been hitting the dashboard, and this once super-fast lane began to slow to a crawl, culminating this week when changes and additions to customer data outpaced our servers' ability to synchronise those changes in real time.
To solve this problem we looked at the way we've been handling requests to and from the database and identified many pain points. The biggest one was the number of database transactions issued for a single dashboard web request; this covers both loading the dashboard and creating or modifying content within it, and then sharing those changes with the other server nodes.
By restructuring our customer data and coalescing the gathering and saving of that data into single database operations, we've been able to reduce traffic between nodes by a factor of 7 on average when a user accesses their Dashboard, and by a factor of 3 to 5 when they make changes, depending on what those changes are.
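As an illustration of the coalescing idea, here's a small sketch using SQLite. The table and column names are invented for the example and don't reflect our real schema; the point is simply that several per-table queries become one operation that needs to be synchronised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE settings (user_id INTEGER, key TEXT, value TEXT);
    CREATE TABLE custom_lists (user_id INTEGER, name TEXT);
    CREATE TABLE cors_domains (user_id INTEGER, domain TEXT);
""")

def load_dashboard_before(user_id):
    # Old approach: one round-trip (and one replicated transaction) per table.
    settings = conn.execute("SELECT key, value FROM settings WHERE user_id=?", (user_id,)).fetchall()
    lists = conn.execute("SELECT name FROM custom_lists WHERE user_id=?", (user_id,)).fetchall()
    domains = conn.execute("SELECT domain FROM cors_domains WHERE user_id=?", (user_id,)).fetchall()
    return settings, lists, domains

def load_dashboard_after(user_id):
    # New approach: gather everything the dashboard needs in a single query,
    # so only one operation has to be shared between nodes.
    return conn.execute("""
        SELECT 'setting' AS kind, key AS a, value AS b FROM settings WHERE user_id=?
        UNION ALL SELECT 'list', name, NULL FROM custom_lists WHERE user_id=?
        UNION ALL SELECT 'cors', domain, NULL FROM cors_domains WHERE user_id=?
    """, (user_id, user_id, user_id)).fetchall()

conn.execute("INSERT INTO settings VALUES (1, 'theme', 'dark')")
conn.execute("INSERT INTO custom_lists VALUES (1, 'ads')")
print(load_dashboard_before(1))  # three separate queries
print(load_dashboard_after(1))   # one query, one replicated operation
```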
As a result of these changes, the dashboard and user signups are now being handled in real time again. In addition to these changes for our most frequently accessed customer data, we've also been working on the slower database synchronisation we use for big data, which includes things like customer positive detection logs.
One thing we noticed here is that a lot of this data is rarely accessed, but its sheer volume was significantly delaying our ability to bring up new nodes (since all of it has to be synchronised), and the ledger our database keeps to track what has been synchronised and which nodes are missing data was growing very large and becoming a burden for our server nodes to handle.
To solve this problem we've begun to compress all high-impact user data, and as a result we've been able to reclaim 80% of the disk space it used to occupy. This has also had the side-effect of making the data much faster to access: even though we use high-end Solid State Drives on all our servers, storage is still the slowest of the three core components in a server (CPU, RAM and storage). So by decreasing the size of the data we're loading from our disks, we can load it into memory much faster.
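For a sense of the mechanics, here's a small sketch of compressing a batch of rarely-accessed log records with zlib. The record layout and the savings it prints are illustrative, not our actual data or the 80% figure above.

```python
import json
import zlib

# Illustrative only: compress a batch of log entries before writing them to disk.
records = [
    {"ts": 1700000000 + i, "domain": f"host{i}.example", "verdict": "blocked"}
    for i in range(10_000)
]
raw = json.dumps(records).encode("utf-8")
packed = zlib.compress(raw, level=9)

print(f"raw: {len(raw) / 1024:.0f} KiB, compressed: {len(packed) / 1024:.0f} KiB "
      f"({100 * (1 - len(packed) / len(raw)):.0f}% saved)")

# Reading it back is a single decompress call; because far fewer bytes come
# off the disk, the data reaches memory faster despite the CPU cost.
restored = json.loads(zlib.decompress(packed))
assert restored == records
```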
In addition to the compression, we've also altered how our database blocks work. Prior to this week, all data in the database was stored in 8 MiB blocks. This made things mathematically simple, but as user data has grown, the number of blocks has increased to an unmanageable level. Because of this, we've now moved to an adaptive block size between 8 MiB and 64 MiB. Since a server needs an entire block before it can access the data inside, we choose the block size for a specific customer's data based on their data volume, and smaller blocks can transition into larger ones as their data grows.
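Here's a rough sketch of how a sizing rule like this could work. The thresholds below are guesses chosen for illustration, not our production heuristic.

```python
MIN_BLOCK = 8 * 1024 * 1024    # 8 MiB
MAX_BLOCK = 64 * 1024 * 1024   # 64 MiB

def pick_block_size(customer_bytes: int) -> int:
    """Pick a power-of-two block size between 8 MiB and 64 MiB so that a
    customer's data fits in roughly a handful of blocks (assumed rule)."""
    size = MIN_BLOCK
    # Grow the block until the data spans only a few blocks, or we hit the ceiling.
    while size < MAX_BLOCK and customer_bytes > size * 8:
        size *= 2
    return size

for volume_mib in (2, 100, 900):  # customer data volume in MiB
    block = pick_block_size(volume_mib * 1024 * 1024)
    print(f"{volume_mib} MiB of data -> {block // (1024 * 1024)} MiB blocks")
```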
So that's the update for today. We're hoping there won't be too many teething problems, but to be honest with you, the database schema updates that apply to the Dashboard are rather major and were introduced faster than we would have liked due to the serious performance degradation we were seeing. What this means is that there may be bugs, and we ask for your patience and diligence in reporting any you find.
Thanks for reading and have a great week!