Another one for persistent "machine still attempting to start" errors

Edit: These have been fixed, see: Fixed: unreasonably slow resumes of suspended Machines

We’re still investigating the cause of these errors. From what we’ve seen, they affect a small number of machines when waking from a suspended state, so changing your app to use stop instead of suspend should avoid them altogether.

If switching away from suspend is not an option, there are two things to try if you find a machine in this state:

  1. Run a machine metadata update with fly machine update <machine-id> --yes --metadata foo=bar. The update should force the machine out of the starting state. Updating any value would work, but a metadata update doesn’t change any of the machine’s actual settings, so it’s a good fit for cases like this.
  2. If that fails, clone a fresh machine with fly machine clone and destroy the stuck one.
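
The two steps above can be sketched as a small shell function. This is only an illustration, not an official script: it assumes that a non-zero exit status from fly machine update means the update failed, that --force is wanted when destroying a machine that is stuck, and that FLY_BIN and unstick_machine are hypothetical names introduced here.

```shell
#!/bin/sh
# Hypothetical override for the CLI binary; defaults to the real `fly`.
FLY_BIN="${FLY_BIN:-fly}"

# unstick_machine <machine-id>
# Tries the metadata update first, then falls back to clone + destroy.
unstick_machine() {
    machine_id="$1"
    if [ -z "$machine_id" ]; then
        echo "usage: unstick_machine <machine-id>" >&2
        return 2
    fi

    # Step 1: a metadata-only update, which leaves the machine's settings alone.
    # Assumption: exit status 0 means the update went through.
    if "$FLY_BIN" machine update "$machine_id" --yes --metadata foo=bar; then
        return 0
    fi

    # Step 2: clone a fresh machine, and only destroy the stuck one
    # if the clone succeeded.
    "$FLY_BIN" machine clone "$machine_id" || return 1
    # Assumption: --force is used because the stuck machine is not stopped.
    "$FLY_BIN" machine destroy "$machine_id" --force
}
```

Run it from the app’s directory as unstick_machine <machine-id>. A successful exit status from the update does not by itself prove the machine has left the starting state, so it is worth checking the machine afterwards.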