I have an app running on two machines. From time to time, both machines get stuck in the “starting” state, and the app becomes completely inaccessible.
The first time this happened was months after the initial deploy, but recently it’s been occurring more and more often. Each time, the only workaround I’ve found is to clone the machines and force-destroy the ones stuck in “starting”.
Has anyone else experienced something similar, or is there a known cause/fix for this behavior?
We’re still investigating the cause of these stuck machines. From what we’ve seen, the problem affects a small number of machines when they wake from a suspended state. As such, changing your app to use a stop instead of a suspend should avoid it altogether.
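As a rough sketch of that switch, assuming the suspend behavior comes from an `auto_stop_machines = "suspend"` line in your `fly.toml` (your config may set it differently, so check the file first), something like the following would flip it to `"stop"`. The function name `switch_suspend_to_stop` is made up for this example; it keeps a `.bak` copy of the original file.

```shell
# Hypothetical helper: rewrite auto_stop_machines = "suspend" to "stop"
# in a fly.toml (defaults to ./fly.toml). Assumes the setting is written
# on a single line in that form; a backup is saved as <file>.bak.
switch_suspend_to_stop() {
  local config="${1:-fly.toml}"
  sed -i.bak \
    's/^\([[:space:]]*auto_stop_machines[[:space:]]*=[[:space:]]*\)"suspend"/\1"stop"/' \
    "$config"
}
```

After editing the file, the change still has to be rolled out with `fly deploy` before it applies to your machines.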
If switching away from suspend is not an option, two things to try if you find a machine in this state:
Run a machine metadata update with fly machine update <machine-id> --yes --metadata foo=bar. The update should force the machine out of the starting state. Any update would do, but a metadata update doesn’t change anything in your actual machine settings, so it’s a good fit for cases like this.
If that fails, clone a fresh machine with fly machine clone and destroy the stuck one.
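The two steps above could be chained roughly like this. It is only a sketch: the function name `unstick_machine` and the `FLY_BIN` override are made up for the example, and it treats a non-zero exit from the update command as "that fails", which may not catch every case.

```shell
# Sketch: try the no-op metadata update first, then fall back to
# clone + force-destroy if the update command itself fails.
# FLY_BIN is a hypothetical override for the CLI binary (defaults to "fly").
unstick_machine() {
  local machine_id="$1"
  local fly="${FLY_BIN:-fly}"

  if [ -z "$machine_id" ]; then
    echo "usage: unstick_machine <machine-id>" >&2
    return 2
  fi

  # Step 1: metadata update; foo=bar is an arbitrary key/value pair.
  if "$fly" machine update "$machine_id" --yes --metadata foo=bar; then
    echo "metadata update succeeded for $machine_id"
    return 0
  fi

  # Step 2: clone a fresh machine, then force-destroy the stuck one.
  # The destroy only runs if the clone succeeded.
  echo "metadata update failed; cloning $machine_id" >&2
  "$fly" machine clone "$machine_id" || return 1
  "$fly" machine destroy "$machine_id" --force
}
```

Usage would be `unstick_machine <machine-id>`. A successful update command does not by itself prove the machine left the starting state, so it is worth checking with `fly machine status <machine-id>` afterwards.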