How to have fewer breakages in the complex Fly ecosystem?

I should start by saying I am a Fly convert, so this isn’t a complaint. I am not sure whether this is a philosophical ponderance upon integration testing at cloud scale, if such a thing is possible, or maybe it’s just an observation on how often little things get broken.

I often see a user report here, and sometimes a helpful member of staff takes a peek, and says “ah yes, that foo shouldn’t bar like that”, and they fix it super-quick. A new version of flyctl appears, and then everyone’s happy… but then it happens with another cog, for another customer, a few days later.

I idly wonder if there could be a closed employee-only Fly system somewhere where stuff just gets deployed, healthchecked, scaled up, scaled down, proxied, macarooned, corroded, and all the fine things one can do in Fly, 24/7. Maybe this is a naive idea with an obvious flaw, but changes to the infrastructure could be deployed there first, so that explosions :collision: can be watched from a fun distance, rather than foobarring in customers’ production envs.

I appreciate that other cloud providers have been going a lot longer, so it’s unfair to compare, and I acknowledge the scrappy speed of Fly is exciting. But could any Fly peeps hereabouts give any thoughts as to how things might settle down in the future, even if that sounds kinda boring? :squinting_face_with_tongue:


In my view, the answer lies in the philosophical dimension.

Only a mathematically sound system can offer provable guarantees. Yet for many software engineers, mathematical soundness feels almost foreign. There is a tendency to draw a line between software and math, even though in reality no such separation exists. It often takes years - sometimes a decade or two after graduation - before one begins to recognize the unmistakable truth: when the components of a system are free from flaws, the system as a whole can be free from defects as well.

Getting to that understanding, however, is rarely painless. The road leading there is paved with trial and error, breaking changes, and the hotfixes needed to mend them.


a closed employee-only Fly system somewhere where stuff just gets deployed, healthchecked, scaled up, scaled down, proxied, macarooned, corroded, and all the fine things one can do in Fly, 24/7

we do have this! it runs in production (because duplicating all of our infrastructure at this point would be very time-consuming and not representative) and we get reports sent to Slack every night. here’s yesterday’s (screenshot of the nightly test report):

the failures in this picture I believe were related to network instability between regions - something that is out of our control, but we are working to improve our alerting so we catch this kind of issue earlier.

we also run similar (“preflight”) tests on flyctl PRs and before releasing a new flyctl version.


a lot of our issues these days are “we don’t know what we don’t know”. every time a small thing breaks we gain knowledge on what can break; a lot (but not all) of that ends up being institutional knowledge.

startups tend to “move fast and break things!”, and while we’re certainly not immune to that, I’d like to believe we do try to break things as little as possible :slight_smile:

I think in general we’re going in the direction of having fewer broken things.
we’re slowly replacing legacy systems with newer ones that fit the platform better - this trades long-term structural brokenness (or latency/complexity/unreliability, etc) for more obvious short-term brokenness after which the new system will work well.
at least that’s our hope!


Nice! I wonder if the right view is that we hear about quite a few issues in this forum, but they’re still only 0.1% of the big picture, and so they’re not really representative of where Fly is on reliability.

Super to hear :relieved_face:


Elsewhere in the forum there was some feedback/discussion about this recent incident. It looks like for the worst of it, deployments were down, and the API was not operational, so machines could not be started, updated, or stopped. Based on the incident timestamps, the problems continued for at least 4.5 hours, and the incident was declared fully resolved at +15 hours.

I appreciate one more breakage isn’t statistically significant, but I’d regard it as the kind of lengthy or worrisome outage that my original post was pondering upon. Could readers have a postmortem? I wonder also if Fly were to say, “yeah, we probably do need to slow down a bit”, that might be reassuring.


I don’t have a detailed postmortem to share, but tl;dr is we accidentally deleted some core production apps, and struggled to bring those apps back up because deploying apps relies on the apps that were deleted. we’re implementing safeguards around this so it won’t happen again.


OK, the “won’t happen again” bit sounds good; thanks Lillian. :star_struck:

My eyebrows are a touch raised, though, that an outage of this severity has not yet been officially commented on at all, other than the notes in the incident software. Do you think you could give the boss a nudge? I don’t yet have customers on my Fly apps, but I know a bunch of people do, so I am asking on their behalf.

tbh with an incident like this I’d expect a postmortem or something. I might be a small customer ($300/mo usage), and luckily I didn’t have any major bug on my side that day; I can’t imagine if I had. Picture a crazy bug where users can withdraw money, and I can’t deploy a new server or shut down my existing one. Previously, when I had a major incident on my side, I relied on `fly scale count 0`, so at least the app was stopped and I could fix the bug in the meantime.