I am writing this because there are so many errors on Fly. My question is: how confident can I be that my app will not go down?
Let’s say I have a set of apps that I want to be highly available. What config should I use? What plan should I use so that Fly gives me the best experience without me having to deal with Fly platform problems?
I will deploy all config and switch to whatever paid plan you ask me to.
You can do some work to mitigate issues in our infrastructure. Mostly it’s just a matter of avoiding moving parts:
Switch to Machine based apps: Most apps use something called Nomad. We have continuous issues with Nomad (mostly because we’re holding it wrong, today because Nomad had an operational failure we haven’t seen before). The new, Machine backed apps have far fewer moving parts. You are less likely to have an app process go away due to something in our infrastructure on Machines.
This is not the simplest switch. You currently have to create new apps to get off Nomad.
Run 2x Machines for every app you care about: Single instance apps are especially brittle in our infrastructure. This is true on Nomad AND Machines. You should run 2+ instances. This will protect from hardware level issues.
In this kind of config, an app that’s running is most likely to stay available. Deploys may still break, since there are a bunch of moving pieces when you deploy an app. Our global proxy could also still fail, and there’s no way you can mitigate that.
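The "run 2+ instances" advice is easy to lose track of after a scale-down or a re-created app, so a small guard in CI can help. This is only a minimal sketch: it assumes you already have a list of machine records as dictionaries (for example parsed from the Machines API) and that each record has a `state` field whose running value is `started`. The function names `count_running` and `has_redundancy` are made up for illustration and are not part of any Fly tooling.

```python
def count_running(machines):
    """Count machine records whose state is "started".

    `machines` is assumed to be a list of dicts with a "state" key;
    records without that key are treated as not running.
    """
    return sum(1 for m in machines if m.get("state") == "started")


def has_redundancy(machines, minimum=2):
    """Return True if at least `minimum` machines are running."""
    return count_running(machines) >= minimum
```

A CI job could fail the build when `has_redundancy(...)` returns False for an app you care about. Note that this only counts instances. It does not tell you whether they sit on different hardware.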
Paid plans won’t change anything about how our system behaves. But they will get you more direct access to engineers here, which is helpful when you’re trying to figure out if a problem is us or you, and occasionally helpful for identifying bugs or operational issues that our monitoring hasn’t really keyed us in to yet.
I will switch to machines based apps. We will re-create existing ones.
The thing that concerns me the most is your own admission: “Deploys may still break”. If I deploy via a CI/CD pipeline and the fly deploy command succeeds, can’t you ensure that the app is actually deployed?
My concern is that flyctl should fail the command if the deploy is going to fail. Not deploying when I want to deploy, and failing the CI pipeline, is at least something that alerts us. Having everything succeed and then going down (which happened today) is not going to work, I guess, for anyone using Fly.
The fly deploy command blocks until the deploy succeeds. For Nomad apps, you can ctrl+c and have the deploy continue in the background. For Machine based apps, the fly deploy command actually does all the orchestration. If the process exits, the deploy stops.
I think what you’re saying, though, is that fly deploy shouldn’t do anything if it can’t succeed. This is roughly how Machine apps work. Detecting “can’t succeed” in Nomad is actually very complicated, and we never quite figured out how to deliver the UX we wanted there.
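For the "everything succeeds and then the app goes down" case raised above, one thing you can do on the CI side, independent of how fly deploy behaves, is run your own check right after the deploy step and fail the pipeline if the app does not answer. The following is a rough sketch, not anything flyctl does for you. The health URL is something your app would have to expose, and `http_ok` and `wait_until_healthy` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections and timeouts.
        return False


def wait_until_healthy(url, attempts=10, delay=3.0, probe=http_ok, sleep=time.sleep):
    """Probe `url` up to `attempts` times, pausing `delay` seconds between tries.

    Returns True as soon as one probe passes, False if all of them fail.
    `probe` and `sleep` are injectable so the loop can be tested offline.
    """
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

In a pipeline, a step after the deploy could call `wait_until_healthy` with your app’s health endpoint (a placeholder such as https://your-app.example/healthz) and exit non-zero when it returns False, so a deploy that "succeeded" but left the app unreachable still turns the build red.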