Edit: These have been fixed, see: Fixed: unreasonably slow resumes of suspended Machines
We’re continuing to investigate the cause of these slow resumes. From what we’ve seen, the problem seems to affect a small number of machines when they wake from a suspended state. As such, changing your app to use a stop instead of a suspend should avoid it altogether.
If switching away from suspend is not an option, two things to try if you find a machine in this state:
- Run a machine metadata update with `fly machine update <machine-id> --yes --metadata foo=bar`. The update should force it out of the starting state. You can update any value, but a metadata update doesn’t change anything in your actual machine settings, so it’s a good fit for cases like this.
- If that fails, clone a fresh machine with `fly machine clone` and destroy the stuck one.
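
If you want to script the two steps, here is a rough sketch rather than an official tool. The function name `unstick_machine` is made up for illustration. It assumes that `fly machine update` exits non-zero when the update doesn't go through, and that you're fine with the stuck machine being destroyed using `--force` once the clone succeeds. A successful exit from the update does not by itself prove the machine left the starting state, so check it afterwards.

```shell
# Hypothetical helper: try the metadata update first, then fall back to
# cloning the machine and destroying the stuck original.
unstick_machine() {
  machine_id="$1"

  # Step 1: a metadata update that leaves the machine's real settings untouched.
  # Assumption: a non-zero exit status means the update did not go through.
  if fly machine update "$machine_id" --yes --metadata foo=bar; then
    echo "metadata update sent to $machine_id"
    return 0
  fi

  # Step 2: clone a fresh machine, and only destroy the stuck one
  # if the clone succeeded. --force is an assumption for a machine
  # that is not in a stopped state.
  fly machine clone "$machine_id" && fly machine destroy "$machine_id" --force
}
```

Call it as `unstick_machine <machine-id>`, add `--app <app-name>` to the `fly` calls if you run it outside the app directory, and confirm with `fly machine status <machine-id>` that the machine actually started.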