Improved data sync pipeline for Corrosion

Corrosion is our global system for replicating SQLite data. When a machine is created on a server, data about the machine gets inserted into Corrosion and sent to every other server in the fleet. We use Corrosion a bit differently than you might expect: it is not the source of truth for the data it holds. Every piece of data inside Corrosion also exists elsewhere. Data for server-specific tables like machines, volumes, and services comes from a bolt db on the host that flyd manages, while global data like apps and ip_assignment comes from our RDS database.

We have a Sidekiq job that gets triggered to pull data from these sources into Corrosion. We call this process of getting data from these sources back into Corrosion a reseed, and it is normally done when Corrosion has issues and the current data is corrupted, or when there's an incompatible upgrade and all the nodes need to start from the same snapshot. We restart Corrosion with the last known good snapshot and then reseed the rows that have been updated since that snapshot back into Corrosion.
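The snapshot-plus-reseed flow above can be sketched in a few lines of Ruby. This is a minimal illustration using in-memory hashes in place of the real stores; the function name, the `updated_at` field, and the hash-based stores are assumptions for the sketch, not Fly's actual code.

```ruby
# Hypothetical sketch: after restoring Corrosion from the last known
# good snapshot, only rows changed since that snapshot need to be
# copied back in from the source of truth (RDS or flyd's bolt db).
def reseed(source_rows, corrosion, snapshot_taken_at)
  source_rows.each do |id, row|
    # Rows untouched since the snapshot are already present in the
    # restored copy, so we skip them.
    next unless row[:updated_at] > snapshot_taken_at

    corrosion[id] = row # upsert the stale-or-missing row
  end
  corrosion
end
```

In the real pipeline the upsert is an SQL insert into Corrosion rather than a hash write, but the shape is the same: restore the snapshot, then replay only what changed after it.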

The thing is, Corrosion's reliability has gotten better and reseeds have become rare. However, on the 16th of July, we upgraded to a new version of Corrosion that changed the way state was tracked, making it incompatible with older versions. This required a reseed. When we triggered the reseed, the job for the apps table was very slow, resulting in an incident. Since the last reseed, the apps table had grown quite large, and the reseed code at the time was too slow to get data into Corrosion quickly. A job for a different table kept hitting an error when inserting a row and restarting. Users saw 404s because their applications were missing from the database. A quick fix was deployed to resolve the issue.

We have now reworked the job for reseeding data into Corrosion so it is faster and can insert the whole table into Corrosion in under twenty minutes. It works like this: a first job quickly collects ranges of ids (the upper and lower bounds for each batch) and queues parallel jobs, one per range, that do the actual selection and insertion into Corrosion. This increases concurrency (we have a rate limiter in place so as not to overwhelm Corrosion) and ensures that a failed insert doesn't stall the others. The reseed job now runs on a weekly basis with better monitoring and alerting, so that path is exercised more often and we'd notice early if something breaks.
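The two-phase job above can be sketched as follows. The function names and the block-based worker are hypothetical stand-ins for the real Sidekiq jobs; the point is the shape: a cheap pass computes id ranges, and each range is processed independently so one bad row only fails its own batch.

```ruby
# Phase 1 (hypothetical sketch): split the table's id space into
# [lo, hi] batches. In the real pipeline this cheap job enqueues one
# Sidekiq job per range.
def id_ranges(min_id, max_id, batch_size)
  ranges = []
  lo = min_id
  while lo <= max_id
    hi = [lo + batch_size - 1, max_id].min
    ranges << [lo, hi]
    lo = hi + 1
  end
  ranges
end

# Phase 2: each range does its own SELECT ... WHERE id BETWEEN lo AND hi
# and inserts into Corrosion. A failure is recorded for retry instead of
# stalling the other ranges; in production these run in parallel behind
# a rate limiter.
def reseed_ranges(ranges)
  failed = []
  ranges.each do |lo, hi|
    begin
      yield lo, hi # select the slice and insert it into Corrosion
    rescue StandardError
      failed << [lo, hi] # isolate the failure; retry this range later
    end
  end
  failed
end
```

Running the ranges as separate jobs is what fixes both failure modes from the incident: throughput comes from concurrency, and a single insert error no longer restarts the whole table's reseed.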
