I’m running an Express server with an attached PostgreSQL database. Whenever I deploy with --strategy bluegreen and keep refreshing the URL during the deploy, I’ve noticed a short downtime of ~20s. I’m building an API, and that makes this unacceptable. I’d love to learn how I can fight this downtime. Oh yeah, I’m using a custom Dockerfile.
First, I deploy a new version of my server with flyctl deploy --strategy bluegreen. I keep an eye on flyctl status --watch while it’s deploying, and eventually I see the new instance start up. It passes the health checks, and then both instances run simultaneously for a few seconds, after which the old instance shuts down. What I’ve noticed in the browser is that my requests keep hitting the old instance even after the new instance has passed its health checks. When the old instance shuts down, new requests to the URL hang for about ~20s before being processed by the new instance. I’d like to eliminate those 20 seconds: switch traffic over to the new instance as soon as it passes health checks, and only then shut down the old instance. How do I accomplish this?
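For context, my fly.toml is roughly along these lines (the app name, port, and check path here are illustrative, not my real values):

```toml
app = "my-api"                  # placeholder

[build]
  dockerfile = "Dockerfile"     # custom Dockerfile

[deploy]
  strategy = "bluegreen"

[http_service]
  internal_port = 3000          # Express listens here
  force_https = true

  # The health check that gates the bluegreen cutover
  [[http_service.checks]]
    grace_period = "5s"
    interval = "10s"
    timeout = "2s"
    method = "GET"
    path = "/healthz"
```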
Hi, sorry to revive this old thread, but I’m experiencing the same issue today, exactly as OP described:
1. Deploy with strategy = “bluegreen”.
2. Wait until the “green” deployment is successful.
3. Fly destroys the “blue” deployment.
4. Reloading the browser causes the app to hang for ~20–30 seconds, because it’s still trying to communicate with the “blue” deployment.
5. The logs show the error: machine is in a non-startable state: destroyed
6. After about 30 seconds, browser requests route to the new “green” deployment.
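For completeness, the repro on my end is nothing fancier than this (app name is a placeholder):

```sh
flyctl deploy --strategy bluegreen   # ship the new (green) version
flyctl status --watch                # watch green pass checks, then blue get destroyed

# meanwhile, in a second terminal:
curl -i https://my-app.fly.dev/      # hangs ~20-30s right after blue is destroyed
```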
Is that downtime or slow responses? I can’t quite tell from the chart here.
This should have been much improved in the past few weeks as we’ve moved to a new state propagation system.
We’re still working on the various components in the deployment path to further reduce the slowness experienced during deploys. Ideally there would be none, but that requires a large refactor to bundle deployments at a higher level. Right now, deployments are orchestrated by flyctl (the client), which updates machines one by one according to the deployment strategy. From our systems’ standpoint, machines are simply being stopped and started, and their services deleted and re-added. It’s possible for a node to receive all the “stops” and “deletes” before receiving any of the “starts” and “creates”. If we had a higher-level structure for deployments, or if these changes were all bundled into a “transaction”, we could probably do better.
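To make that concrete, here’s a toy sketch of the ordering hazard (not our actual code; the event shape and machine names are made up):

```ts
// Toy model: a node applies per-machine service events in whatever order they arrive.
type Event =
  | { kind: "create"; machine: string }  // green machine's service registered
  | { kind: "delete"; machine: string }; // blue machine's service removed

const routable = new Set(["blue-1", "blue-2"]); // services the node can route to

function apply(e: Event): void {
  if (e.kind === "create") routable.add(e.machine);
  else routable.delete(e.machine);
  console.log(`${e.kind} ${e.machine} ->`, [...routable]);
}

// Unlucky arrival order: every delete lands before any create.
apply({ kind: "delete", machine: "blue-1" });  // -> [ 'blue-2' ]
apply({ kind: "delete", machine: "blue-2" });  // -> []  nothing to route to; requests hang
apply({ kind: "create", machine: "green-1" }); // -> [ 'green-1' ]
apply({ kind: "create", machine: "green-2" }); // -> routing recovers
```

With a deployment-level “transaction”, the node could apply the creates and deletes atomically and never pass through that empty state.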
If you are getting errors (that aren’t timeouts), then that’s another matter. Timeouts are not great either, but at least they’d be in line with my explanation.