You’re probably hitting a slow service propagation issue we’ve been fighting. It will hopefully be fixed for good in a few weeks, but the TL;DR is that we’re probably causing the downtime when you deploy an app with a single VM. You can reduce the impact by running fly scale count 2 (or a higher count).
When you deploy, we:
- Start a canary VM (if you’re using a volume, we skip this step)
- Remove an old VM from service discovery
- Wait 30 seconds
- Stop old VM
- Start new VM, add it to service discovery
fly-proxy relies on service discovery to know where to send requests. The problem is that it frequently takes 60-120s for fly-proxy to detect service discovery changes, so there’s a window after the old VM stops during which fly-proxy doesn’t know about the new VM.
Running multiple VMs mitigates this more or less by chance: while the other VMs are still serving, fly-proxy gets more time to detect the new VMs.
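
If it helps to see the timing, here is a rough back-of-the-envelope sketch of the steps above. It is not how the deploy code actually works. It assumes VMs are replaced one at a time, that boot time is zero, that the canary step has no effect on routing, and that fly-proxy sees every service discovery change after one fixed delay. The function and parameter names are made up for illustration.

```python
def unroutable_window(vm_count, detect_delay_s, drain_wait_s=30.0):
    """Estimate seconds during which fly-proxy has no running VM it knows about.

    Simplified model (all assumptions, not real deploy internals):
    - old VM i is removed from service discovery at i * drain_wait_s
      and stopped drain_wait_s later;
    - its replacement starts instantly and is added to service discovery;
    - fly-proxy notices any discovery change detect_delay_s after it happens.
    """
    # An old VM keeps receiving traffic until it is stopped, or until the
    # proxy notices it was removed, whichever comes first.
    last_old_routable_until = (vm_count - 1) * drain_wait_s + min(
        drain_wait_s, detect_delay_s
    )
    # The first replacement is registered when the first old VM stops,
    # and the proxy learns about it detect_delay_s later.
    first_new_visible_at = drain_wait_s + detect_delay_s
    return max(0.0, first_new_visible_at - last_old_routable_until)


if __name__ == "__main__":
    for vms in (1, 2, 3):
        for delay in (60.0, 120.0):
            gap = unroutable_window(vms, delay)
            print(f"{vms} VM(s), {delay:.0f}s detection delay: ~{gap:.0f}s gap")
```

Under these assumptions a single VM is unreachable for roughly the whole detection delay, and each extra VM shaves about one 30 second wait off the gap. That is why scaling up helps without actually fixing the underlying problem.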