It’s no great secret that Fly has had reliability issues related to increased demand (post-Heroku exodus) over the past couple of years.
But it seems to me that things have been pretty solid recently. I’ve gotten a few notifications about dashboard availability but not much regarding deployments or the Machines API, which for me are the heart and soul of why I use Fly.
Is that just me? Did I miss something? Or has the infra/ops team been able to really dig in and lock things down a lot better than before? I’d love it if Fly would publish a blog post about these ongoing efforts.
My experience in the London region has been very solid, but that’s for an app that’s still in development and has no monitoring, and thus is a weak anecdote.
Failure reports from this forum have, in my view, still been rather choppy, though how many of them come from vibe-coders who’re accidentally sabotaging their own apps is rather hard to say. The number of folks needing a self-hosted Postgres rescue due to host corruption is still quite high, but an alarming number of them only have a single node, so I’d be willing to give Fly a pass there too.
What I haven’t seen, though, are the sorts of lengthy global outages that make people pull their hair out. Looking at recent threads on this forum, I’m seeing issues that are either user error or at worst isolated to a region, and they seem to be getting resolved in reasonable time frames.
DB-wise, it looks like Managed Postgres (MPG) has been very stable for folks, and that’s what I’m using. I’m seeing a number of DB issues that are solved just by migrating to MPG, so that gives me peace of mind.