"No GPUs available to fulfill request", unable to restart/start GPU machines in region ord

quanrong · September 5, 2025, 12:28pm

For about a day now, stopped L40s machines in region “ord” have frequently failed to start and then stay in a failed state. Trying to restart them manually often results in:

Error: failed to restart machine XXXXX: could not stop machine XXXXX: failed to restart VM XXXXX: unknown: could not reserve resource for machine: no GPUs available to fulfill request

This is very disruptive. Nothing has changed on our end; it was working fine before. Is there a known reason for this?

jfent · September 5, 2025, 3:15pm

no GPUs available to fulfill request

Unfortunately this simply means that there isn’t enough GPU capacity on the host where your machine was placed. You can see how much capacity is available by querying the Machines API:

# Sept 5 2025, 18:12 UTC
$ curl -s 'https://api.machines.dev/v1/platform/regions?size=l40s' | jq -c '.Regions[]|[.code, .capacity]'
["ord",7]

Destroying and recreating in ord may work to get the machine placed on a different GPU host, although this is admittedly a roll of the dice.

If you’re able to use other GPU types, there is slightly more capacity for those:

$ curl -s 'https://api.machines.dev/v1/platform/regions?size=a100-80gb' | jq -c '.Regions[]|[.code, .capacity]'
["ams",7]
["iad",11]
["sjc",10]
["syd",8]

$ curl -s 'https://api.machines.dev/v1/platform/regions?size=a10' | jq -c '.Regions[]|[.code, .capacity]'
["ord",14]

$ curl -s 'https://api.machines.dev/v1/platform/regions?size=a100-40gb' | jq -c '.Regions[]|[.code, .capacity]'
["ord",22]
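To avoid running one command per GPU type, the queries above can be wrapped in a small loop. This is only a sketch: the endpoint and the jq filter are taken from the commands above, while the function name `gpu_capacity` and the idea of prefixing each row with the size are illustrative assumptions, not part of any Fly tooling. It needs `curl` and `jq` installed.

```shell
# Sketch: print [size, region, capacity] rows for each GPU size passed in.
# gpu_capacity is a hypothetical helper name; the URL and the base jq filter
# come from the commands earlier in this post.
gpu_capacity() {
  local size
  for size in "$@"; do
    # --arg exposes the shell variable to jq as $size so each row is labeled.
    curl -s "https://api.machines.dev/v1/platform/regions?size=${size}" \
      | jq -c --arg size "$size" '.Regions[] | [$size, .code, .capacity]'
  done
}

# Usage: gpu_capacity l40s a10 a100-40gb a100-80gb
```

The numbers are a point-in-time snapshot, so they are best treated as a hint rather than a guarantee that a start will succeed.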

halfer · September 5, 2025, 6:25pm

I seem to recall that errors for non-GPU machines actually explicitly mention region/capacity issues, which helps engineers understand that it is not a platform fault per se. Is the error report different between the two machine types?

jfent · September 5, 2025, 6:48pm

I forget what the non-GPU machines capacity error wording is specifically, but I don’t think it’s materially different from “no GPUs available to fulfill request”.

I think you used to get a similar message if you tried to start a GPU machine in a region with no GPUs, but this user wasn’t experiencing that problem (and I believe we updated that error message to point users to GPU-enabled regions anyway).

halfer · September 5, 2025, 6:58pm

Thanks. I wonder if I am thinking of failed volume creations due to lack of capacity; I did a forum search and found the error is:

failed to create volume: no capacity available in sjc

We still get people here asking what the problem is but, IMO, this error wording is clearer that they should fall back to another region.

jfent · September 5, 2025, 7:16pm

Falling back to another region is almost never possible for GPUs - as you can see from the API query responses above, only one GPU kind is offered in more than one region.

Additionally, the error you quoted is one seen on create. OP was discussing an error that is only seen on start, i.e. the machine/volume have already been placed on a host, so it’s not necessarily the case that the entire region is out of capacity, just that particular host at that particular time. Trying to start the machine later may succeed, because at that point the host may have spare GPU capacity.
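Since a later start attempt may succeed once the host has spare GPU capacity, one option is to retry with an increasing delay. The following is a minimal sketch, not an official recommendation: `retry_later` is a hypothetical helper, the attempt count and delays are arbitrary, and it retries on any non-zero exit status rather than only on the GPU capacity error.

```shell
# retry_later MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds, doubling the delay
# after each failure, and gives up after MAX_ATTEMPTS attempts.
retry_later() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Usage (XXXXX is a placeholder machine ID, as in the error above):
#   retry_later 6 60 fly machine start XXXXX
```

Doubling the delay keeps the retries from hammering the API while the host is busy; whether capacity frees up within that window depends entirely on what else is running on the host.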

quanrong · September 5, 2025, 7:31pm

Yes, that’s what confused me: most of the time it works fine, but then out of the blue all 3 GPU servers fail to wake up from the stopped state, and at some point it works again. It would be nice to have a more informative error message.

jfent · September 5, 2025, 7:42pm

I hear ya both, I’m just not sure what y’all think wasn’t clear in the original message.

How would you change/replace “could not reserve resource for machine: no GPUs available to fulfill request” in a way that would’ve made it obvious to you what the problem was?

halfer · September 5, 2025, 7:53pm

Ah yes, a fair point. I think I would give way to your position here, Jacob.

That said, error messages will probably always be found wanting by someone, so a suggestion occurs to me: Fly error messages in general could contain a shortlink to an expanded help section with useful pointers like the ones in this thread, which of course won’t fit into a short error message.