how do fly.io deploys work?

julia · December 10, 2021, 12:09am

I’m running into a problem where, usually when I deploy my app, I have a little bit of downtime.

This seems like it must be because I’m configuring my healthchecks wrong, but I couldn’t find an explanation of how fly.io deploys move traffic between the old version and the new version. Is there an explanation somewhere I can read so that I can understand exactly what’s happening?

kurt · December 10, 2021, 2:01am

You’re probably hitting a slow service propagation issue we’ve been fighting. It will hopefully be fixed for good in a few weeks, but the tldr is we’re probably causing the downtime when you deploy an app with a single VM. You can reduce the impact of this by running fly scale count 2 or more.

When you deploy we:

1. Start a canary VM (if you’re using a volume, we skip this step)
2. Remove an old VM from service discovery
3. Wait 30 seconds
4. Stop old VM
5. Start new VM, add it to service discovery

fly-proxy relies on service discovery to know where to send requests. The problem is, it frequently takes 60-120s for fly-proxy to detect service discovery changes. So there’s a window when the old VM stops where fly-proxy doesn’t know about the new VM.

Running multiple VMs mitigates this by chance: it buys fly-proxy more time to detect the new VMs.
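
To make the timing concrete, here is a rough Python sketch of the single-VM case described above. It is a simplified model, not how fly-proxy is actually implemented: the class name, the `boot_time` parameter, and the assumption that the proxy sees every change after one fixed delay are all illustrative. Only the 30 second wait and the 60-120s propagation range come from the post.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DeployTimeline:
    """Simplified single-VM deploy timeline; all times are in seconds."""

    drain_wait: float = 30.0         # the "Wait 30 seconds" step
    propagation_delay: float = 60.0  # assumed fixed; the post cites 60-120s
    boot_time: float = 0.0           # assumption: the new VM is ready instantly

    def downtime_window(self) -> Tuple[float, float]:
        """Return (start, end) of the span where the proxy has no live VM."""
        # t=0 is when the old VM is removed from service discovery.
        old_vm_stops = self.drain_wait
        # The new VM is started and registered only after the old one stops.
        new_vm_registered = old_vm_stops + self.boot_time
        # The proxy learns about the new VM one propagation delay later.
        new_vm_visible = new_vm_registered + self.propagation_delay
        return old_vm_stops, new_vm_visible

    def downtime_seconds(self) -> float:
        start, end = self.downtime_window()
        return end - start


for delay in (60, 120):
    timeline = DeployTimeline(propagation_delay=delay)
    print(delay, timeline.downtime_window(), timeline.downtime_seconds())
```

Under these assumptions the gap is roughly as long as the propagation delay itself, which is why having other VMs still in service while the new one is being discovered hides the problem.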

julia · December 10, 2021, 2:06am

that makes sense, thanks!

I actually can’t scale my app to 2 because of some design decisions I’ve made (I have a very basic pubsub system that relies on everything being in a single process), but I can live with the minor deploy inconvenience for now.

julia · December 10, 2021, 2:08am

just in case: is there a way to tell fly to leave the old VM running for longer (like 2 minutes) while the service discovery is catching up?

(edit: never mind, I read your explanation wrong, I don’t think that makes any sense :))

kurt · December 10, 2021, 2:14am

Can your app handle running two VMs at a time for a few minutes during a deploy? That’s an easy enough tweak, I think we may even be able to apply it to all apps running fewer than 3 instances. You’ll need to run fly deploy --strategy canary.

julia · December 10, 2021, 2:21am

Hmm, I think that running two at a time would probably be worse. Realistically I don’t plan to deploy that much or have that much traffic, so it’s not a big deal.

kurt · December 10, 2021, 2:23am

Ok, well it is pretty irritating so we’ll hopefully get it fixed for you soon anyway.

greg · December 28, 2021, 6:21pm

I still get a little downtime during deploys, presumably due to the same slow propagation issue.

I’ve tried using 2 VMs, and also using a --canary deploy, but it still happens.

This suggests using 3 VMs may help (Rolling deployments sequence - #12 by amithm7), but since they are generally idle (I’d like a more serverless-style setup) I’d rather not add VMs just for the occasional deploy.

Beyond this (but related), it would be neat if a deploy started a new set of VMs, ready to go, and then let you manually or automatically switch all (or a percentage) of traffic to the new environment. That would also help with rollbacks: if there was any problem, you could switch traffic back to the old one. It would depend on how quickly your load balancer updates globally to point at the new/old set, but assuming it was near-instant, that would also avoid downtime during a deploy.

jsierles · December 28, 2021, 7:30pm

This is already available with the bluegreen deployment strategy, but you’d still need a few VMs for it to overcome the propagation issue.

greg · December 28, 2021, 8:09pm

Thanks, I’ll try bluegreen and see if it helps, but it sounds like it may still need 3 VMs.

It seems like that will indeed auto-migrate all the traffic, but I was thinking it would be nice to have a way of manually deciding the split (like how a certain mega-corp does it), as then you could roll back near-instantly by simply switching 100% of traffic to the prior deployment. Or do a 10/90 split as a test, etc. I don’t believe that’s currently possible, as it reads like only one environment/deployment is running at a time.
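
As a rough illustration of the kind of manual split described here, the following Python sketch routes a percentage of requests to a new deployment and the rest to the prior one. This is not a Fly.io feature or API: `TrafficSplit`, `set_new_percent`, and `route` are hypothetical names, and hashing a request key into 100 buckets is just one assumed way to keep routing stable for a given user.

```python
import hashlib


class TrafficSplit:
    """Hypothetical percentage-based router between two deployments."""

    def __init__(self, new_percent: int = 0) -> None:
        self.new_percent = 0
        self.set_new_percent(new_percent)

    def set_new_percent(self, new_percent: int) -> None:
        """Set the share of traffic (0-100) sent to the new deployment."""
        if not 0 <= new_percent <= 100:
            raise ValueError("new_percent must be between 0 and 100")
        self.new_percent = new_percent

    def route(self, request_key: str) -> str:
        """Return "new" or "old" for a key such as a user or session ID."""
        # Hash the key into a stable bucket in [0, 100) so the same key
        # always lands on the same side for a given split.
        digest = hashlib.sha256(request_key.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return "new" if bucket < self.new_percent else "old"


split = TrafficSplit(new_percent=10)  # 10/90 split as a test
print(split.route("user-42"))
split.set_new_percent(0)              # rollback: all traffic to the prior deployment
print(split.route("user-42"))
```

With a router like this, a rollback is just a call to `set_new_percent(0)`, but how fast it takes effect would still depend on how quickly the change reaches every proxy.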


kurt · February 23, 2022, 2:44am

It’s been a few months, but this is much better now. @amos and @jerome and @thomas spent the better part of 3 months on this problem and we’ve had the results running for a little over a week.

You should be able to run a single app and do a deploy with zero downtime. And when you first launch an app, it’ll most likely work before you even notice.

joevandyk · November 2, 2022, 10:36pm

When I deploy my app, I notice that users are served both the old and new versions during the deploy process, even with the bluegreen strategy. Any ideas?
