Can you try fly deploy --strategy canary and see if that helps? It may work a little better for you than the rolling deploy, as it starts the new vm alongside the old one for a while, which gives our proxy more time to learn about the new vm before the old one goes away. We can also take a look at your app to see if there’s something we can adjust.
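If you’d rather not pass the flag on every deploy, the strategy can also be set in your app’s fly.toml (a config sketch; check your fly.toml against the current reference before relying on it):

```toml
# fly.toml
[deploy]
  strategy = "canary"   # instead of the default rolling strategy
```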
There is a known slow service-propagation issue for which a fix is being developed. Until then, it can take up to a minute after a deploy before requests are routed to the new vms and handled as expected.
As @zee says, what you need is for an old vm to hang around long enough for your new vm’s route, service, etc. to fully propagate. With only one vm this issue happens at random; as you say, transient. It can be mitigated by running three:
… since by the time the third old vm is replaced, the first new vm is likely ready to go, routing-wise. Once this issue is fixed that won’t be necessary, but for now I’ve found it helps (in my case with the rolling strategy). So if you hit it again, give that a try.
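The reasoning above can be sketched with a toy timeline model. This is a hypothetical simplification, not how Fly’s proxy actually works: assume old vms are replaced one at a time at a fixed interval, and each replacement takes some number of seconds to become routable.

```python
# Toy model of a rolling deploy. Assumption (hypothetical numbers):
# vm i is taken down at i*replace_interval seconds, and its replacement
# only becomes routable `propagation` seconds later.

def routable_over_time(vm_count, replace_interval, propagation):
    """Return True if at least one vm is routable at every second."""
    horizon = vm_count * replace_interval + propagation + 1
    for t in range(horizon):
        routable = 0
        for i in range(vm_count):
            down_at = i * replace_interval
            up_at = down_at + propagation
            # routable before it's taken down, or after its replacement lands
            if t < down_at or t >= up_at:
                routable += 1
        if routable == 0:
            return False
    return True

# one vm: there's a window with nothing to route to
print(routable_over_time(1, replace_interval=10, propagation=15))  # False
# three vms replaced 10s apart: an old or new vm is always routable
print(routable_over_time(3, replace_interval=10, propagation=15))  # True
```

With three vms the replacement windows overlap, so the count of routable instances never drops to zero; with one vm the whole propagation delay is downtime.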
We’ve deployed a tentative fix for this longstanding issue! Now we see updates too quickly.
@lpil There’s a race between the old instance being removed from our state and the new instance being added. If the old instance is removed and we don’t yet know about the new instance, then we’ll close the connection since we have no instance to route it to.
Before, our state would become stale for up to a minute (ughh). This had the side-effect that we thought your app still had an instance running, so we didn’t outright close the connection. Instead, the connection went into our pipeline, which is filled with retry contingencies to make sure we eventually reach your app (or time out if the state is really stale).
Our accept loop does not have a retry for this, and since we’re now seeing new state almost instantly, it’s possible our proxy doesn’t know of any route to your app for a few seconds.
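A minimal sketch of the ordering hazard described above. The names here are hypothetical, not fly-proxy internals: a routing table that applies the “remove old” event before the “add new” event has a window with no route, and an accept path without retries would close the connection in that window.

```python
class RoutingTable:
    """Hypothetical stand-in for the proxy's view of an app's instances."""

    def __init__(self, instances):
        self.instances = set(instances)

    def apply(self, event, instance):
        # state updates arrive as independent add/remove events,
        # with no guaranteed ordering between them
        if event == "remove":
            self.instances.discard(instance)
        elif event == "add":
            self.instances.add(instance)

    def route(self):
        # accept path: no retry, so an empty table means a closed connection
        return next(iter(self.instances), None)

table = RoutingTable({"old-vm"})
table.apply("remove", "old-vm")   # old instance leaves our state...
print(table.route())              # None: connection would be closed
table.apply("add", "new-vm")      # ...new instance arrives moments later
print(table.route())              # 'new-vm': too late for that request
```

With the old stale state, the “remove” effectively arrived late, so route() still returned the old instance and the request fell into the retrying pipeline instead of being closed.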
I’m currently working on a fix for this. I believe the canary deployment strategy fixes this for you right now, but that’s not a good long-term fix.
But … as I read that, the 3+ vm trick would still help right now, as it spreads out the time the deploy takes: there would always be at least one route for the app (either an old route to an “old” vm, or a new one from that deploy). And so … no downtime. Though possibly a bit of eventual consistency, app-wise. But that’s fine (for me anyway).
Ideally, yes, you would get the same result even with only one vm, but, as you say, that’s tricky.