Automatically Retry Deploy Orchestration With --deploy-retries

Hey everyone,
As of v0.2.94 of flyctl, we now support retrying deploy orchestration! If you want to try this feature right away, you can do so by passing --deploy-retries=5 (or any other number!) to fly deploy.

What is deployment orchestration?
Deploying your app is pretty complex. There are a lot of moving parts behind the scenes, and any of them can fail for a variety of interesting reasons. A few weeks ago, I was mostly working on improving the remote builder experience (I actually plan on continuing to work on remote builders, but no spoilers for my next community post :wink:). Now, I’m working on deployment orchestration.

Basically, this part of deployments is when flyctl is trying to update all your machines, make sure health checks are passing, and make sure that traffic is flowing to the right machines. When deployment orchestration fails, one of two things can happen:

  • in the best case, your app just remains in the state it was before the deploy failed
  • in the worst case, your app is in a gross “in-between state” where some machines are using your new image, and some are using your old one

The correct solution here is normally to just try deploying again, which will usually succeed if you just experienced a small platform hiccup. But why should you have to manually retry deploys at all?

What deploy retries help with
If you enable deploy retries, you should notice fewer issues with acquiring leases, updating machines, and a few more steps during deploys. Generally speaking, small intermittent platform issues should be less noticeable in the future.

What they don’t help with
The two big areas where deploy retries won’t help are remote builder issues, and what I call ‘application issues’. For example, if for whatever reason flyctl is having trouble communicating with your remote builder, then these new changes won’t help (though we’re working on that!).

Alternatively, if your app is failing health checks because your application code has a bug, these changes won’t help either. Using something like canary or bluegreen deploys could help in these cases, though.

This also won’t help when there are major platform issues. For example, in the terrible scenario that the machines API is completely dead, it doesn’t matter how many times we try to communicate with it.
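
To make the distinction concrete, here is a rough Python sketch of a retry loop that retries brief platform hiccups but gives up right away on application failures. This is not flyctl's actual code: `TransientError`, `ApplicationError`, and `deploy_with_retries` are made-up names, and it assumes the retry count means "extra attempts after the first one".

```python
import time


class TransientError(Exception):
    """Hypothetical: a brief platform hiccup (e.g. a lease could not be acquired)."""


class ApplicationError(Exception):
    """Hypothetical: a failure caused by the app itself (e.g. a failing health check)."""


def deploy_with_retries(deploy_once, retries, delay_seconds=0.0):
    """Call deploy_once() until it succeeds, retrying only transient failures.

    Assumption: `retries` counts extra attempts after the first one,
    so retries=0 means a single attempt with no retry.
    """
    attempt = 0
    while True:
        try:
            return deploy_once()
        except TransientError:
            if attempt >= retries:
                raise  # out of retries, e.g. during a prolonged outage
            attempt += 1
            time.sleep(delay_seconds)
        # ApplicationError is deliberately not caught: a retry would fail the same way.


# Usage: two hiccups, then success.
outcomes = iter([TransientError("lease busy"), TransientError("timeout"), "deployed"])


def flaky_deploy():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


print(deploy_with_retries(flaky_deploy, retries=5))  # prints "deployed"
```

If every attempt raises `TransientError`, the loop eventually re-raises it, which matches the "API is completely dead" case: more attempts don't change the outcome.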

Why is the default set to ‘auto’?
The final goal is to enable this feature for all users by default. However, these changes required a large amount of restructuring of how deploys work internally. To make sure that we don’t introduce some horrible bug by accident, we’re slowly rolling out this feature to all users over the course of a week. As of writing, around 5% of you should be getting automatic retries by default. If you want to opt in early, you can add --deploy-retries=5 to fly deploy. If you want to opt out, just set --deploy-retries=0.
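
For anyone curious what a percentage rollout behind an ‘auto’ default can look like, here is a purely illustrative Python sketch. The post doesn't say how flyctl picks the 5% or how many retries ‘auto’ uses, so the hashing scheme, `user_id`, `rollout_percent`, and `auto_retries` are all assumptions made up for this example.

```python
import hashlib


def in_rollout(user_id, percent):
    """Place a user in a stable bucket from 0 to 99 and compare it to the rollout percentage."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def resolve_retries(flag_value, user_id, rollout_percent, auto_retries):
    """Turn a flag value ("auto" or a number) into a concrete retry count.

    Assumption: an explicit number always wins, and "auto" means
    `auto_retries` for users inside the rollout and 0 for everyone else.
    """
    if flag_value == "auto":
        return auto_retries if in_rollout(user_id, rollout_percent) else 0
    return int(flag_value)
```

With this shape, passing an explicit 5 or 0 bypasses the rollout entirely, which is the opt-in and opt-out behavior described above, and raising `rollout_percent` over the week gradually moves more ‘auto’ users onto retries.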
