I Deleted 240k Records from Production DB

…and everything worked better!

Some of you may know about the “regionalization” project that has been going on here at Fly.io; it’s one of the big “medium-term” projects that has been occupying our minds for the better part of the last year. The whole story stems from our big September incident™ last year, which was caused by bad data written to Corrosion, our state propagation system. Corrosion quickly propagated that state to every fly-proxy instance across our entire fleet, locking all of them up. We identified two lessons here:

  1. The proxy deserializes and loads the state of every app, even if it doesn’t need it yet; and
  2. The same broken state is quickly replicated across the entire platform.

(1) was solved by “lazy-loading” state only when it’s needed, and (2) is what the “regionalization” project is intended to solve. Both of these are huge refactorings of our platform, and we’ve broken them down into many small steps, each with incremental reliability benefits. We haven’t been providing a lot of updates on this publicly, but we have sporadically written about some of it before.
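To make (1) a bit more concrete, here’s a minimal sketch of the idea: instead of deserializing every app’s state at startup, the proxy only materializes an app’s state the first time a request for that app actually shows up. All of the names here (`AppState`, `StateCache`, and so on) are made up for illustration; this is not fly-proxy’s actual code.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Illustrative only: whatever state a proxy keeps per app.
#[derive(Clone, Debug)]
struct AppState {
    name: String,
    // ... routing info, certs, machines, etc.
}

/// Eager model: deserialize and hold every app's state up front, so one bad
/// record can poison every proxy that loads it.
/// Lazy model: only materialize an app's state on first use.
struct StateCache {
    loaded: Mutex<HashMap<String, AppState>>,
}

impl StateCache {
    fn new() -> Self {
        Self { loaded: Mutex::new(HashMap::new()) }
    }

    /// Look up an app, deserializing it on first use instead of at startup.
    fn get_or_load(&self, app: &str) -> AppState {
        let mut loaded = self.loaded.lock().unwrap();
        loaded
            .entry(app.to_string())
            .or_insert_with(|| deserialize_from_corrosion(app))
            .clone()
    }
}

/// Stand-in for reading one app's rows out of the local Corrosion replica.
fn deserialize_from_corrosion(app: &str) -> AppState {
    AppState { name: app.to_string() }
}

fn main() {
    let cache = StateCache::new();
    // The first request for "my-app" triggers the load; later ones hit the cache.
    let state = cache.get_or_load("my-app");
    println!("loaded state for {}", state.name);
}
```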

To recap the goal of the project: if I have an app in sjc that is never used outside of the area around sjc, then nothing I do to my app should affect someone trying to access a completely different app from syd. That simply doesn’t make any sense. We’re solving this by spinning up regional clusters of Corrosion in addition to the global one, and storing fine-grained state only within the regional clusters. Of course, some state is inherently global: every fly-proxy needs to know whether an app exists, and where to find its TLS certificates. However, using the same example from before, if a request lands in syd for an app that mainly serves sjc, syd only really needs enough information to figure out that “I should forward this request to somewhere in sjc”. It does not need to know, for example, whether a specific machine is started or healthy.
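Here’s a rough sketch of the routing decision that split implies, written against a made-up `GlobalAppInfo` type; the real data model in fly-proxy is more involved, but the idea is the same: the globally replicated state only needs to answer “does this app exist, and which region should I hand it to?”, while everything fine-grained stays regional.

```rust
/// Illustrative only: the slice of an app's state that is replicated globally.
struct GlobalAppInfo {
    exists: bool,
    has_tls_cert: bool,
    /// Regions where the app actually runs machines.
    regions: Vec<String>,
}

enum RoutingDecision {
    /// Serve from a machine here; this needs the fine-grained regional state.
    ServeLocally,
    /// Hand the request to a proxy in another region and let it pick a machine.
    ForwardTo(String),
    NotFound,
}

/// Decide how to handle a request using only globally replicated state.
/// Machine-level details (started, healthy, ...) never leave the region.
fn route(app: &GlobalAppInfo, local_region: &str) -> RoutingDecision {
    if !app.exists {
        return RoutingDecision::NotFound;
    }
    if app.regions.iter().any(|r| r.as_str() == local_region) {
        RoutingDecision::ServeLocally
    } else if let Some(target) = app.regions.first() {
        RoutingDecision::ForwardTo(target.clone())
    } else {
        RoutingDecision::NotFound
    }
}

fn main() {
    let app = GlobalAppInfo {
        exists: true,
        has_tls_cert: true,
        regions: vec!["sjc".to_string()],
    };
    // A proxy in syd only needs enough global state to know: forward this to sjc.
    match route(&app, "syd") {
        RoutingDecision::ForwardTo(region) => println!("forwarding to {region}"),
        RoutingDecision::ServeLocally => println!("serving locally"),
        RoutingDecision::NotFound => println!("no such app"),
    }
}
```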

An additional benefit of this is that the global Corrosion cluster has to handle much less traffic as we migrate more and more of our systems to this new regionalized model. That translates to quicker state replication and fewer transient consistency issues (so, for example, we will report “your app doesn’t exist” less often right after you have just created an app). A couple of weeks back, we got to a point where no systems depended on health checks stored in the global Corrosion cluster anymore[1]. The health checks table is, in fact, one of the busiest tables in our global Corrosion cluster, because health checks tend to flap a lot for some apps (that’s what they’re designed to do!), and apps which have health checks tend to have more than one (for example, for different HTTP endpoints). Bluegreen deployments also used to generate one-off health checks only to delete them soon after, which was only fixed recently as part of the same project.

So, with all of this in mind, and our platform seemingly stable after we stopped updating health checks stored in the global Corrosion cluster (but not yet deleting them), I finally decided to ship an update that cleans up every health check belonging to a customer app in the global Corrosion cluster. This turned out to be ~240k records, and I got to enjoy watching this line plunge down as the update was being rolled out:

(This is the number of records in the health_checks table in our global Corrosion cluster; at the time of the screenshot the rollout hadn’t completed yet. It ended up settling somewhere around ~10K records, corresponding to health checks internal to our platform, which are not deleted.)
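Corrosion is backed by SQLite, so the net effect of that update is roughly the DELETE sketched below. To be clear, this is illustrative only: the real cleanup shipped as a change to the services that write into Corrosion rather than a hand-run statement, and the `is_platform_internal` column is a stand-in for however we actually distinguish customer health checks from platform-internal ones.

```rust
use rusqlite::{Connection, Result};

/// Illustrative only: purge customer-app health checks from the globally
/// replicated table, keeping the ones internal to the platform.
/// The `is_platform_internal` column is hypothetical.
fn purge_customer_health_checks(conn: &Connection) -> Result<usize> {
    conn.execute(
        "DELETE FROM health_checks WHERE is_platform_internal = 0",
        [],
    )
}

fn main() -> Result<()> {
    let conn = Connection::open("corrosion.db")?;
    let deleted = purge_customer_health_checks(&conn)?;
    println!("deleted {deleted} health check records");
    Ok(())
}
```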

And, as expected, we’re now seeing a lot less synchronization traffic within our global Corrosion cluster:

This, of course, is still a decent amount of synchronization, and I do think we can cut down on this even more. For example, we’re still storing every machine’s configuration and status globally, and there’s no reason this can’t be handled the same way as health checks. Stay tuned as we continue to regionalize our platform and make it more reliable!


  1. Each proxy now only looks at the health checks of machines in its own region. For out-of-region requests, as mentioned above, it forwards the request to the target region; this only happens if a request absolutely cannot be served locally, either because the app has no machines in the local region or because all of them are unhealthy. ↩︎
