how do fly.io deploys work?

I’m running into a problem where, whenever I deploy my app, I get a little bit of downtime.

This seems like it must be because I’m configuring my healthchecks wrong, but I couldn’t find an explanation of how fly.io deploys move traffic between the old version and the new version. Is there an explanation somewhere I can read so that I can understand exactly what’s happening?

You’re probably hitting a slow service-discovery propagation issue we’ve been fighting. It will hopefully be fixed for good in a few weeks, but the tl;dr is that we’re probably causing the downtime when you deploy an app with a single VM. You can reduce the impact of this by running fly scale count 2 or more.
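A sketch of the scaling suggestion above (the app name is hypothetical; -a selects which app the command targets):

```shell
# Run two instances so at least one VM stays in service discovery
# while the other is being replaced during a deploy.
fly scale count 2 -a my-app

# Confirm both VMs are up before the next deploy.
fly status -a my-app
```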

When you deploy we:

  1. Start a canary VM (if you’re using a volume, we skip this step)
  2. Remove an old VM from service discovery
  3. Wait 30 seconds
  4. Stop old VM
  5. Start new VM, add it to service discovery
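Since the original question was about health-check configuration: a minimal sketch of the relevant fly.toml section might look like the following (the port and path are assumptions; grace_period gives a freshly started VM time to boot before failed checks count against it):

```toml
[[services]]
  internal_port = 8080   # assumed app port
  protocol = "tcp"

  [[services.http_checks]]
    interval = "10s"
    timeout = "2s"
    grace_period = "5s"  # time the new VM gets to boot before checks apply
    method = "get"
    path = "/healthz"    # assumed health endpoint
```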

fly-proxy relies on service discovery to know where to send requests. The problem is, it frequently takes 60-120s for fly-proxy to detect service discovery changes. So there’s a window when the old VM stops where fly-proxy doesn’t know about the new VM.

Running multiple VMs mitigates this by chance; it buys fly-proxy more time to detect the new VMs.

that makes sense, thanks!

I actually can’t scale my app to 2 because of some design decisions I’ve made (I have a very basic pubsub system that relies on everything being in a single process), but I can live with the minor deploy inconvenience for now.

just in case: is there a way to tell fly to leave the old VM running for longer (like 2 minutes) while the service discovery is catching up?

(edit: never mind, I read your explanation wrong, I don’t think that makes any sense :))

Can your app handle running two VMs at a time for a few minutes during a deploy? That’s an easy enough tweak; I think we may even be able to apply it to all apps running fewer than 3 instances. You’ll need to run fly deploy --strategy canary.
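If the single-process constraint can tolerate a brief overlap, the canary strategy mentioned above would be invoked roughly like this (the app name is hypothetical):

```shell
# Canary: start the new VM first, and only stop the old one
# once the new VM passes its health checks.
fly deploy --strategy canary -a my-app
```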

Hmm, I think that running two at a time would probably be worse. Realistically I don’t plan to deploy that much or have that much traffic so it’s not a big deal :slight_smile:

Ok, well it is pretty irritating so we’ll hopefully get it fixed for you soon anyway. :slight_smile:


I still get a little downtime during deploys, presumably due to the same slow propagation issue.

I’ve tried using 2 VMs, and also a canary deploy, but it still happens.

This thread suggests using 3 VMs may help (Rolling deployments sequence - #12 by amithm7), but since they are generally idle (I’d like a more serverless-style setup) I’d rather not add VMs just for the occasional deploy.

Beyond this (but related), what would be neat would be to have a new set of VMs started and ready to go on deploy, and then manually or automatically switch all (or a percentage) of traffic dynamically to the new environment. That would also help with rollbacks, since if there were any problem you could switch traffic back to the old set. It would depend on how quickly your load balancer updates globally to point at the new/old set, but assuming it was near-instant, that would also avoid downtime during a deploy.

This is already available with the bluegreen deployment strategy, but you’d still need a few VMs for it to overcome the propagation issue.
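A sketch of the bluegreen strategy mentioned here (the app name is hypothetical):

```shell
# Blue/green: boot a full replacement set of VMs, wait for them to
# pass health checks, then shift traffic and retire the old set.
fly deploy --strategy bluegreen -a my-app
```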

Thanks, I’ll try bluegreen and see if it helps, but it sounds like it may still need 3 VMs.

It seems like that will indeed auto-migrate all the traffic, but I was thinking it would be nice to have a way of manually deciding the split (like how a certain mega-corp does it), as then you could roll back near-instantly by simply switching 100% of traffic to the prior deployment, or do a 10/90 split as a test, etc. I don’t believe that’s currently possible, as it reads like only one environment/deployment is running at a time.
