I’m running into a problem where, usually, when I deploy my app, I have a little bit of downtime.
This seems like it must be because I’m configuring my healthchecks wrong, but I couldn’t find an explanation of how fly.io deploys move traffic between the old version and the new version. Is there an explanation somewhere I can read so that I can understand exactly what’s happening?
You’re probably hitting a slow service propagation issue we’ve been fighting. It will hopefully be fixed for good in a few weeks, but the tl;dr is that we’re probably causing the downtime when you deploy an app with a single VM. You can reduce the impact of this by scaling to two or more VMs, for example with fly scale count 2.
When you deploy we:
1. Start a canary VM (if you’re using a volume, we skip this step)
2. Remove an old VM from service discovery
3. Wait 30 seconds
4. Stop old VM
5. Start new VM, add it to service discovery
fly-proxy relies on service discovery to know where to send requests. The problem is, it frequently takes 60-120s for fly-proxy to detect service discovery changes. So there’s a window when the old VM stops where fly-proxy doesn’t know about the new VM.
Running multiple VMs mitigates this, though only by chance: it buys fly-proxy more time to detect the new VMs.
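To make the timing concrete, here is a rough sketch of the arithmetic for a single-VM deploy. It is a simplified model based only on the steps and numbers described above, not on fly.io internals: the function name, the boot_s parameter, and the assumption that the proxy needs the full propagation delay after the new VM registers are all illustrative.

```python
def single_vm_downtime(propagation_s: float,
                       boot_s: float = 0.0,
                       drain_wait_s: float = 30.0) -> tuple[float, float]:
    """Return (start, end) of the window with no routable VM.

    Times are in seconds after the old VM is removed from service
    discovery. Assumed model: the old VM stops after drain_wait_s,
    the new VM registers boot_s later, and the proxy only notices
    the new VM propagation_s after it registers.
    """
    old_vm_stops = drain_wait_s
    new_vm_registered = old_vm_stops + boot_s
    proxy_sees_new_vm = new_vm_registered + propagation_s
    return old_vm_stops, proxy_sees_new_vm


if __name__ == "__main__":
    # The 60-120s range is the propagation delay quoted above.
    for propagation in (60.0, 120.0):
        start, end = single_vm_downtime(propagation)
        print(f"propagation={propagation:.0f}s: "
              f"no routable VM from t={start:.0f}s to t={end:.0f}s "
              f"({end - start:.0f}s)")
```

Under this model the gap is roughly the boot time plus the propagation delay, which is why a second VM that stays registered throughout the deploy hides most of it.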
I actually can’t scale my app to 2 because of some design decisions I’ve made (I have a very basic pubsub system that relies on everything being in a single process), but I can live with the minor deploy inconvenience for now.
Can your app handle running two VMs at a time for a few minutes during a deploy? That’s an easy enough tweak, and I think we may even be able to apply it to all apps running fewer than 3 instances. You’ll need to run fly deploy --strategy canary.
Beyond this (but related), it would be neat if a deploy started a new set of VMs, got them ready to go, and then let you manually or automatically switch all (or a percentage) of traffic to the new environment. That would also help with rollbacks: if there was any problem, you could switch traffic back to the old one. It would depend on how quickly your load balancer updates globally to point at the new or old set, but assuming it was near-instant, that would also avoid downtime during a deploy.
Thanks, I’ll try bluegreen and see if it helps, but it sounds like it may still need 3 VMs.
It seems like that will indeed auto-migrate all the traffic, but I was thinking it would be nice to have a way of manually deciding the split (like how a certain mega-corp does it), as then you could roll back near-instantly by simply switching 100% of traffic to the prior deployment. Or do a 10/90 split as a test, etc. I don’t believe that’s currently possible, as it reads like only one environment/deployment is running at a time.
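For illustration, the kind of manual split I have in mind could look something like the sketch below. This is purely hypothetical and not something fly.io offers as far as I know; the deployment names, the weights dictionary, and the pick_deployment function are all made up for the example.

```python
import random


def pick_deployment(weights: dict[str, float], rng: random.Random) -> str:
    """Pick a deployment name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs behave the same

    # Hypothetical 10/90 test split between the new and prior deployments.
    test_split = {"new": 10, "previous": 90}
    picks = [pick_deployment(test_split, rng) for _ in range(10_000)]
    print("share routed to new:", picks.count("new") / len(picks))

    # A rollback is just a weight change: 100% back to the prior deployment.
    rollback = {"new": 0, "previous": 100}
    assert all(pick_deployment(rollback, rng) == "previous" for _ in range(1_000))
```

The point is that a rollback becomes a change of weights rather than a redeploy, so it would be as fast as the weights can propagate.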