Another one for persistent "machine still attempting to start" errors

For months, across multiple apps, machines, and regions, I have seen machines eventually reach a state in which they fail to resume from suspension, with the following message:

[error] [PM01] machines API returned an error: "machine still attempting to start"

There is no further information, and the only fix is to scale to zero to remove the dead machines and then scale back up. I see multiple reports of the same problem throughout this forum, not one of which is answered.

I have seen it on both Astro web server apps and Node.js API apps, which seemingly have very little in common. It only seems to start occurring after I leave the app alone (no new deployments) for some time. Once it has started to happen on a machine, redeploying does not fix it; only scaling down and back up does. These apps are all very low traffic and have suspension enabled, which suggests to me that repeated suspend/resume cycles eventually trigger a bug in Fly's systems that permanently breaks resume functionality. I have observed the behavior on both single-machine and multi-machine deployments: one by one the machines bite the dust until they're all dead and I have to manually intervene. Other than that, I have no idea, because no other errors are ever logged.

Does anyone else have an idea about this, or has anyone found a resolution to it? I've been really happy with every other aspect of Fly, but this issue (which just occurred again today on a brand new app that is only 6 days old) has me about at the end of my rope.

Confirmed. I'm also experiencing this, along with ridiculously long start times for suspended machines.

Edit: These have been fixed, see: Fixed: unreasonably slow resumes of suspended Machines

We’re continuing to investigate the cause of these. From what we’ve seen, it seems to impact a small number of machines when waking from a suspended state. As such, changing your app to use a stop instead of a suspend should avoid them altogether.
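For reference, here is a minimal sketch of what that change might look like in fly.toml. It assumes the app is configured through an [http_service] section; the port is a placeholder, and your existing values will differ. Only the auto_stop_machines line is the actual change.

```toml
# fly.toml (fragment) - assumed layout, adjust to your app
[http_service]
  internal_port = 8080          # placeholder: use your app's port
  auto_stop_machines = "stop"   # previously "suspend"
  auto_start_machines = true
  min_machines_running = 0
```

Stopped machines cold-boot on the next request instead of resuming from a snapshot, so expect somewhat slower wake-ups in exchange for avoiding the resume path.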

If switching away from suspend is not an option, two things to try if you find a machine in this state:

  1. Run a machine metadata update with fly machine update <machine-id> --yes --metadata foo=bar. The update should force it out of the starting state. You can update any value, but a metadata update doesn't change anything in your actual machine settings, so it's a good fit for cases like this.
  2. If that fails, clone a fresh machine with fly machine clone and destroy the stuck one.
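
The two steps above could be sketched as a small shell function. This is only an illustration, assuming flyctl is installed and authenticated: the function name unstick_machine is made up for this example, and the --app and --force flags are additions that the steps above do not mention. A successful exit from the metadata update does not by itself prove the machine has left the starting state, so check it afterward.

```shell
#!/usr/bin/env bash
# Sketch only: try the metadata update first, then fall back to
# clone-and-destroy. unstick_machine is a hypothetical helper name.

unstick_machine() {
  local app="$1" machine_id="$2"

  # Step 1: harmless metadata update intended to nudge the machine
  # out of the starting state (command taken from the steps above).
  if fly machine update "$machine_id" --app "$app" --yes --metadata foo=bar; then
    echo "metadata update accepted for $machine_id"
    return 0
  fi

  # Step 2: the update failed, so clone a fresh machine and only
  # destroy the stuck one if the clone succeeded.
  echo "metadata update failed; cloning $machine_id" >&2
  fly machine clone "$machine_id" --app "$app" || return 1

  # --force is an assumption here: the stuck machine may not be in a
  # stopped state, which a plain destroy would normally require.
  fly machine destroy "$machine_id" --app "$app" --force
}
```

Usage would look like: unstick_machine <app-name> <machine-id>. Destroying only after a successful clone means a failed clone never leaves the app with fewer machines than it started with.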

Same here! Happening a lot today!


For closure, this has now been fixed. Please see Fixed: unreasonably slow resumes of suspended Machines
