I’m a big fan of Fly.io and a true believer in its potential. Unfortunately, the frequent outages are really showing that Fly is not production ready yet. My application has been down for hours due to a failed deploy (and the inability to redeploy). In the past two and a half months, 36 incidents have been reported here: Fly.io Status - Incident History. I really think the team should be focusing more on the infrastructure than on adding new features. I don’t wanna move away from Fly, but it’s been hard.
You realize that there are different teams that:
- Build new features
- Maintain the platform
Also, they are aware of the problem: Reliability: It's Not Great
We’re in an awkward phase where the company isn’t quite mature enough to support the infrastructure we need to deliver a good developer UX, and we’re going to take the bad with the good until that changes.
I agree with you. If it helps, here’s roughly what the whole company is doing right now:
- Product engineering is entirely focused on reliability and communications
- Support is doing ongoing support work, and pitching in extra to help us brute force customer communications
- Infra ops is entirely focused on reliability, incident management, and adding more people
- Framework teams are still working on Frameworks
There aren’t people working on new features. Reliability work sometimes manifests as features (like the status page). These contribute, though. We’re not working on anything except: stuff that makes apps on our infrastructure more resilient, and stuff that helps us communicate “are we broken or is your app broken?” to y’all.
This week’s outages have been pretty specialized Nomad/Consul/Raft issues that not everyone can address. We have managed to add folks to help out and get them ramped up quicker than I expected, which is helpful. We’re in an architectural hole that we can’t “ops” our way out of. Fingers crossed we can get out from under this stuff any day now.
Like the OP, I LOVE Fly and what you are building, and I’ve been a Fly evangelist in my company. But to be honest, I can’t believe this is happening again. Not even one day ago we had a big downtime for a reason similar to today’s, and it’s already happening again. I really wanted to keep my servers with you. I really love your product and I gave you many chances, but your service is definitely not ready for production usage. These downtimes are driving me and my users crazy.
I’m still rooting for Fly and I sincerely hope you succeed and fix all these issues.