For about a day now, stopped L40s machines in region “ord” have frequently failed to start and stayed in a failed state. Trying to restart them manually often results in:
Error: failed to restart machine XXXXX: could not stop machine XXXXX: failed to restart VM XXXXX: unknown: could not reserve resource for machine: no GPUs available to fulfill request
This is very disruptive. Nothing has changed on our end, and it was working fine before. Is there a known reason for this?
Unfortunately, this simply means that there isn’t enough GPU capacity on the host where your machine was placed. You can see how much capacity is available by querying the Machines API:
I seem to recall that errors for non-GPU machines actually explicitly mention region/capacity issues, which helps engineers understand that it is not a platform fault per se. Is the error report different between the two machine types?
I forget the specific wording of the capacity error for non-GPU machines, but I don’t think it’s materially different from “no GPUs available to fulfill request”.
I think you used to get a similar message if you tried to start a GPU machine in a region with no GPUs, but this user wasn’t experiencing that problem (and I believe we updated that error message to point users to GPU-enabled regions anyway).
This is almost always not possible for GPUs - as you can see from the API query responses above, only one GPU kind is offered in more than 1 region.
Additionally, the error you quoted is one seen on create. OP was discussing an error that is only seen on start, i.e. the machine/volume have already been placed on a host, so it’s not necessarily the case that the entire region is out of capacity, just that particular host at that particular time. Trying to start the machine later may succeed, because at that point the host may have spare GPU capacity.
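Since a later start attempt may succeed once the host has spare GPU capacity, one workaround is to retry the start with a growing delay. Below is a minimal sketch of that idea, not an official recipe. `NoCapacityError` and the `start` callable are hypothetical stand-ins for however you start the machine and detect the "no GPUs available to fulfill request" error; the attempt count and delays are arbitrary assumptions.

```python
import time
from typing import Callable


class NoCapacityError(Exception):
    """Hypothetical error: the host has no free GPU for this machine right now."""


def start_with_backoff(
    start: Callable[[], None],
    attempts: int = 5,
    base_delay: float = 30.0,
    max_delay: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call start() until it succeeds and return the number of attempts used.

    start is a caller-supplied function (an assumption of this sketch) that
    raises NoCapacityError when the host cannot reserve a GPU.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            start()
            return attempt
        except NoCapacityError:
            if attempt == attempts:
                raise  # out of attempts: surface the capacity error
            sleep(delay)
            # Double the wait each time, capped at max_delay.
            delay = min(delay * 2, max_delay)
    raise ValueError("attempts must be at least 1")
```

To use it, pass a function that starts the machine by whatever means you already use and raises `NoCapacityError` when the response contains the capacity message. Any other error propagates immediately, so genuine platform faults are not hidden by the retries.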
Yes, that’s what confused me: most of the time it works fine, but then out of the blue all 3 GPU servers fail to wake up from the stopped state, and at some point it works again. It would be nice to have a more informative error message.
I hear ya both, I’m just not sure what y’all think wasn’t clear in the original message.
How would you change/replace “could not reserve resource for machine: no GPUs available to fulfill request” in a way that would’ve made it obvious to you what the problem was?
Ah yes, a fair point. I think I would give way to your position here, Jacob.
That said, error messages will always be found wanting by someone, so a suggestion occurs to me: Fly error messages in general could contain a shortlink to an expanded help section with useful pointers like the ones in this thread, which of course won’t fit into a brief error message.