How to have less breakages in the complex Fly ecosystem?

lillian · November 20, 2025, 8:22am

a closed employee-only Fly system somewhere where stuff just gets deployed, healthchecked, scaled up, scaled down, proxied, macarooned, corroded, and all the fine things one can do in Fly, 24/7

we do have this! it runs in production (because duplicating all of our infrastructure at this point would be very time-consuming and not representative) and we get reports sent to Slack every night. here’s yesterday’s:

I believe the failures in this picture were related to network instability between regions. that's out of our control, but we're working to improve our alerting so we catch this kind of issue earlier.

we also run similar (“preflight”) tests on flyctl PRs and before releasing a new flyctl version.
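
for anyone curious what a nightly suite like this boils down to, here's a rough sketch. to be clear, this is an illustration only, not Fly's actual tooling: `CheckResult`, `run_checks`, `format_report`, and the two example checks are all hypothetical names, and a real version would call the platform and post the report to Slack instead of printing it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def run_checks(checks: List[Tuple[str, Callable[[], None]]]) -> List[CheckResult]:
    """Run every check; a check fails if it raises, and one failure never stops the rest."""
    results = []
    for name, fn in checks:
        try:
            fn()
            results.append(CheckResult(name, True))
        except Exception as exc:
            results.append(CheckResult(name, False, str(exc)))
    return results


def format_report(results: List[CheckResult]) -> str:
    """Build a short text summary: a pass count, then one line per failure."""
    failed = [r for r in results if not r.passed]
    lines = [f"nightly checks: {len(results) - len(failed)}/{len(results)} passed"]
    for r in failed:
        lines.append(f"  FAIL {r.name}: {r.detail}")
    return "\n".join(lines)


# Hypothetical stand-ins for real checks (deploy an app, wait for healthchecks, ...).
def check_deploy() -> None:
    pass


def check_healthcheck() -> None:
    raise RuntimeError("timed out waiting for region")


report = format_report(
    run_checks([("deploy", check_deploy), ("healthcheck", check_healthcheck)])
)
print(report)
```

the important design choice is that each check is isolated: a failure is recorded and reported, but the remaining checks still run, so one flaky region doesn't hide everything else.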

a lot of our issues these days are “we don’t know what we don’t know”. every time a small thing breaks we learn more about what can break; a lot (but not all) of that ends up as institutional knowledge.

startups tend to “move fast and break things!”, and while we’re certainly not immune to that, I’d like to believe we do try to break things as little as possible.

I think in general we’re going in the direction of having fewer broken things.
we’re slowly replacing legacy systems with newer ones that fit the platform better - this trades long-term structural brokenness (or latency/complexity/unreliability, etc.) for more obvious short-term brokenness, after which the new system should work well.
at least that’s our hope!