When a bluegreen deploy fails I end up with extra machines

reconbot2 · July 23, 2024, 12:29am

I’ve got an app with 1+N web processes and 1 worker process. I don’t want web downtime during deploys but can tolerate a minute or two of worker downtime.

Currently I’ve got a web and a worker process group and I am using bluegreen deploys. I have a health check on my web process group but none on my worker process group.

If the healthcheck fails during deploy, the new web machines get shut down. However the old worker machine stays running and then I have two worker machines. This seems like a bug.

What’s the best way to architect or configure this app?

charsleysa · July 23, 2024, 10:56am

Hi @reconbot2

We separated our worker into a separate app. It allows for better control.
I don’t think the process group feature really adds any benefit apart from reducing the number of apps and reusing docker images. The tradeoff isn’t really worth it.

reconbot2 · July 23, 2024, 12:09pm

I really like the atomic unit of deploy and rollback, So my services can’t be at a different version from each other. However, if fly doesn’t offer that I guess I’ll have to build it myself?

charsleysa · July 23, 2024, 12:25pm

Depends on your needs but I would say that your workers should be capable to talking to different versions of your app and vice versa.

This is especially helpful as you grow and deploys take longer due to more servers. You don’t want a new version of the app to brake simply because it happens to be talking to one of the old worker machines, and you don’t want one of your workers to break because a new app machine has changed something that the old worker can’t handle.

kurt · July 23, 2024, 2:37pm

This is definitely not the intended behavior.

What happens if you do a rolling or canary deploy and it fails?

reconbot2 · July 23, 2024, 3:08pm

Canary seems to do what I’d expect. It launches a web process and when that doesn’t come up it gets torn down. However when I do a second deployment, I end up with two workers running instead of one?

reconbot2 · July 23, 2024, 3:16pm

Yes the worker and web should be tolerant of mixed deployments. And If I separate out the deployments that I probably also should pull out database migrations from the release commands and then orchestrate the 3 steps my own release process. It’s a new product and I don’t want to be investing in this at the moment.

reconbot2 · July 23, 2024, 3:19pm

And I guess to file another bug, every successful canary release I do seems to add a worker machine

I opened a support issue with the app’s name if you want to grab some details.

kurt · July 23, 2024, 4:19pm

Is it adding a worker machine that runs, or a standby worker machine that’s stopped?

reconbot2 · July 23, 2024, 4:38pm

Canary adds a running worker machine with every deploy.

kwaw · July 26, 2024, 1:13pm

@reconbot2 This should be fixed in Release v0.2.97 · superfly/flyctl · GitHub
The rollback should leave you with the proper count of machines now. LMK if it doesnt!

reconbot2 · July 27, 2024, 2:59am

Bluegreen is doing what I’d expect! (v0.2.99) Thank you.

Topic		Replies	Views
health-check failing during blue-green deployment elixir	3	256	May 9, 2024
New blue-green deployments failing - machines never passing healthchecks	23	178	January 23, 2025
BlueGreen Deploys Fail with Timeout	1	37	January 16, 2025
bluegreen deploy strategy for a non-TCP/HTTP service?	5	145	April 19, 2024
Old instance stopped before new one is healthy on bluegreen deploy	2	348	December 7, 2022

When a bluegreen deploy fails I end up with extra machines

Related topics