New machines seem to start on an unhealthy host

Hello, I’m seeing this message on my apps:

According to the docs, scaling to 0 and then scaling back up should fix the issue by allocating the machines to a new, healthy host.

The thing is, this is of course my production app with a lot of traffic, and I cannot afford any downtime. It’s a bluegreen deployment, and even a ‘secret set’ command is not working, because it fails to start new machines! The new machines stay in the ‘created’ state, and I have to kill them manually with the CLI after the command or deployment fails.

Scaling up is also not working most of the time: it hangs, finishes with an error, and the new machine stays in the ‘created’ state and never goes live. I also have multiple process groups. So the question is: should I scale to 0 and hope the new machines will start working after that? Should I do it process group by process group? Or do I need to scale all process groups to 0, and then try to scale them back up?
I want to avoid scaling to 0 and then not being able to start new machines, because the running machines work fine. The issue is only with new ones :s

Why are new machines trying to start on the unhealthy host?

Greetings,

Greetings

Hm… I would try fly m clone --region=ams at this point. (You may also need an explicit fly m start afterward.)

This is assuming that you’re not using volumes. (Generally you wouldn’t be, with blue-green, but it’s worth mentioning given that it’s the most common cause of this kind of host-pinning behavior.)
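For reference, the clone-then-start suggestion above might look roughly like the sketch below. The app name, source Machine ID, and the `run`, `clone_machine`, and `start_machine` helpers are placeholders made up for illustration; only `fly machine clone`, `fly machine list`, and `fly machine start` are real commands. It defaults to a dry run that just prints what it would execute, so nothing touches production until you opt in.

```shell
#!/bin/sh
# Sketch only: clone a healthy, running Machine rather than relying on `fly scale`.
# APP and SOURCE_MACHINE are placeholders (assumptions), not values from this thread.
APP="${APP:-my-app}"
SOURCE_MACHINE="${SOURCE_MACHINE:-SOURCE_MACHINE_ID}"
REGION="${REGION:-ams}"

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Clone the source Machine into REGION, then list Machines to find the new ID and its state.
clone_machine() {
  run fly machine clone "$SOURCE_MACHINE" --region "$REGION" --app "$APP"
  run fly machine list --app "$APP"
}

# Explicitly start a Machine that did not come up on its own after cloning.
start_machine() {
  run fly machine start "$1" --app "$APP"
}
```

Calling `clone_machine` with the defaults prints the two commands; setting `DRY_RUN=0` runs them for real. If the clone does not come up on its own, pass its ID (taken from the list output) to `start_machine`.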

fly scale can be a little stubborn during host outages…


Aside: You can verify which host a given Machine landed on by looking at 32 bits of its 6PN address.

It is possible for a host to be marked as having an issue while still being eligible to accept new machines. This obviously doesn’t make sense (whether the host actually has an issue or not), and we’ll get it fixed.