How to have fewer breakages in the complex Fly ecosystem?

I should start by saying I am a Fly convert, so this isn’t a complaint. I’m not sure whether this is a philosophical pondering on integration testing at cloud scale, if such a thing is possible, or just an observation on how often little things get broken.

I often see a user report here, and sometimes a helpful member of staff takes a peek, and says “ah yes, that foo shouldn’t bar like that”, and they fix it super-quick. A new version of flyctl appears, and then everyone’s happy… but then it happens with another cog, for another customer, a few days later.

I idly wonder if there could be a closed employee-only Fly system somewhere where stuff just gets deployed, healthchecked, scaled up, scaled down, proxied, macarooned, corroded, and all the fine things one can do in Fly, 24/7. Maybe this is a naive idea with an obvious flaw, but changes to the infrastructure could be deployed there first, so that explosions :collision: can be watched from a fun distance, rather than foobarring in customers’ production envs.

I appreciate that other cloud providers have been going a lot longer, so it’s unfair to compare, and I acknowledge the scrappy speed of Fly is exciting. But could any Fly peeps hereabouts give any thoughts as to how things might settle down in the future, even if that sounds kinda boring? :squinting_face_with_tongue:


In my view, the answer lies in the philosophical dimension.

Only a mathematically sound system can offer provable guarantees. Yet for many software engineers, mathematical soundness feels almost foreign. There is a tendency to draw a line between software and math, even though in reality no such separation exists. It often takes years - sometimes a decade or two after graduation - before one begins to recognize the unmistakable truth: when the components of a system are free from flaws, the system as a whole can be free from defects as well.

Getting to that understanding, however, is rarely painless. The road leading there is paved with trial and error, breaking changes, and the hotfixes needed to mend them.


a closed employee-only Fly system somewhere where stuff just gets deployed, healthchecked, scaled up, scaled down, proxied, macarooned, corroded, and all the fine things one can do in Fly, 24/7

we do have this! it runs in production (because duplicating all of our infrastructure at this point would be very time-consuming and not representative), and we get reports sent to Slack every night. here’s yesterday’s:

[screenshot: nightly test report posted to Slack]

I believe the failures in this picture were related to network instability between regions - something that is out of our control, but we are working to improve our alerting so we catch this kind of issue earlier.

we also run similar (“preflight”) tests on flyctl PRs and before releasing a new flyctl version.
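The nightly loop described above - run each lifecycle check, collect pass/fail results, and post a summary to Slack - can be sketched roughly like this. This is a hypothetical illustration, not Fly's actual harness: the check names, the flyctl invocations, and the `nightly_report` helper are all invented for the example.

```python
# Hypothetical sketch of a nightly smoke-test loop: run each lifecycle
# check as a shell command, treat exit code 0 as a pass, and build a
# Slack-style summary line. The commands below are illustrative only.
import subprocess
from datetime import datetime, timezone

CHECKS = [
    # (name, shell command) - invented examples, not real CI config
    ("deploy",      "flyctl deploy --remote-only"),
    ("healthcheck", "flyctl status"),
    ("scale up",    "flyctl scale count 4"),
    ("scale down",  "flyctl scale count 1"),
]

def run_check(cmd, runner=subprocess.run):
    """Run one check; it passes if the command exits with status 0."""
    return runner(cmd, shell=True, capture_output=True).returncode == 0

def nightly_report(checks, runner=subprocess.run):
    """Run all checks and build a summary message for Slack."""
    results = {name: run_check(cmd, runner) for name, cmd in checks}
    failed = [name for name, ok in results.items() if not ok]
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    if failed:
        summary = (f"{stamp}: {len(failed)}/{len(results)} "
                   f"checks failed: {', '.join(failed)}")
    else:
        summary = f"{stamp}: all {len(results)} checks passed"
    return results, summary
```

The `runner` parameter exists so the loop can be exercised with a stub instead of real deployments; in a real harness the summary string would be posted to a Slack incoming webhook.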


a lot of our issues these days are “we don’t know what we don’t know”. every time a small thing breaks we gain knowledge on what can break; a lot (but not all) of that ends up being institutional knowledge.

startups tend to “move fast and break things!”, and while we’re certainly not immune to that, I’d like to believe we do try to break things as little as possible :slight_smile:

I think in general we’re going in the direction of having fewer broken things.
we’re slowly replacing legacy systems with newer ones that fit the platform better - this trades long-term structural brokenness (or latency, complexity, unreliability, etc.) for more obvious short-term brokenness, after which the new system will work well.
at least that’s our hope!


Nice! I suppose my view is that we hear about quite a few issues in this forum, but they’re still only 0.1% of the big picture, so they’re not really representative of where Fly is on reliability.

Super to hear :relieved_face:
