How is reliability?

I used Fly years ago and had one of the worst experiences I’ve ever had. How are things these days?

From the incidents page, it seems like Fly is averaging maybe 25-30 incidents (of varying degrees of severity) per month, and they seem to have abandoned their Infra Log site. That doesn’t seem great, but I certainly want to factor in the community’s impressions.

Thanks!

1 Like

I shared my two cents on this quite recently; here ya go:

https://community.fly.io/t/fly-io-availability/26349/3?u=lubien

Thanks for posting.

1 Like

for what it’s worth, most of the incidents listed on the status page are just a byproduct of the global internet not being as stable as one would imagine.
a lot of PaaS/IaaS providers, while having services across the world, mostly run regions independently of each other. a Fly app running in ams will certainly keep working if sin has a flaky network, but a client in Singapore will have a hard time connecting to that app. likewise, an app replicating a database between lax and syd would probably not be able to do that very well if a shark ate a subsea cable again.

people run lots of different kinds of apps on the platform; most incidents only impact a subset of customers. it’s possible, I won’t say easy, to architect a global app to work around these issues, and we’re happy to help with it. single-region (or couple-nearby-regions) apps will be just fine.

not to say we haven’t had outages for entirely-us reasons! a lot of those tend to be one-time things we learn a lot from.
I wrote something similar to lubien recently here regarding that, notably:

4 Likes

This was one of the nice things about the Infrastructure Log, for those who haven’t seen it before: it tried to ascertain, in retrospect, the scope and severity of each event—in some cases rendering it just as a small, dashed box.

[Image: infra-log severity grid for Aug 25 through Sept 7, 2024; the incident on the 7th is just a barely visible dashed box, whereas the 1st's is solid red, prominent, and large]

Typically, there was also an entire prose paragraph describing the dimensions and nuances in more detail, which was even more valuable.

(In contrast, the status page updates are written mostly in the heat of the moment, as more of a real-time messaging system.)

It seems like the Big Red Box™ days are mostly in the past now, or at least bounded at ≤10% worse than other cloud platforms. Moreover, a lot of the classic rough edges in the service, such as certificates, suspended clocks, and features introduced 3 years ago but still not mentioned in the official docs, have been gradually getting smoothed out over the past few months, which is also encouraging. :tulip:

4 Likes
