Machines consistently failing to restart

koz · February 14, 2026, 5:16pm

Part of my app is a set of worker machines that turn off and then get restarted when new jobs arrive. I am consistently seeing the following error:

17:11:15 Virtual machine exited abruptly
17:11:15 machine restart policy set to 'no', not restarting
17:11:19 INFO Starting init (commit: 350f2667)...
17:11:27 ERROR Error: an unhandled error occurred: No such file or directory (os error 2)

This never resolves, and I have to create new machines to get my workers back online. These machines all have volumes they mount on startup, and they are all GPU instances. It’s making it impossible to operate my site: every time my VMs turn off, it is likely they will never restart, and my users are complaining about the interruptions this is causing.

halfer · February 14, 2026, 6:00pm

Do you mean they exit naturally, or are configured to auto-suspend? Could we see your YAML config?

I wonder if this is either the machine not finding the volume or an application-level problem that needs debugging on your side. Is it worth running a demo on a non-GPU spec to see if that has the same problem? I appreciate that you won’t be able to run your workload there, but I should think you could emulate it in a nonprod instance.
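To make the emulation idea concrete, here is a minimal sketch of a stand-in worker in Python. It is only an assumption about the shape of your job server: it polls for work, exits once it has been idle for a while, and does nothing GPU-specific. The names `run_worker` and `fetch_job` and the in-memory queue are hypothetical placeholders, not anything from your app.

```python
import time
from typing import Callable, Optional


def run_worker(
    fetch_job: Callable[[], Optional[str]],
    idle_timeout: float = 60.0,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll for jobs and return the number handled once idle for idle_timeout seconds."""
    handled = 0
    last_activity = clock()
    while clock() - last_activity < idle_timeout:
        job = fetch_job()
        if job is None:
            # Nothing to do: wait a little, then re-check the idle deadline.
            sleep(poll_interval)
            continue
        print(f"processing {job}")  # stand-in for the real workload
        handled += 1
        last_activity = clock()  # any work resets the idle timer
    return handled


if __name__ == "__main__":
    # Hypothetical stand-in queue: two fake jobs, then nothing.
    pending = ["job-1", "job-2"]
    count = run_worker(
        lambda: pending.pop(0) if pending else None,
        idle_timeout=3.0,
    )
    print(f"idle timeout reached after {count} jobs, exiting")
    # Falling off the end exits with status 0, i.e. a natural exit.
```

If a machine running something like this on a non-GPU spec, with the same volume mount, also fails to come back after exiting, that would point away from your application code.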

koz · February 16, 2026, 12:39pm

They exit naturally. They are job processing nodes, so they come online, look for work, and then exit after a period of idleness. They are not set to restart; I have web server nodes that start them when a new job is submitted or detected in the job queue. The relevant config is:

primary_region = 'ord'

[build]

[processes]
  jobs = "./jobserver.sh"

[[mounts]]
  source = 'jobs'
  destination = '/var/lib/data'
  initial_size = '30gb'
  processes = ['jobs']

[[restart]]
  policy = 'never'
  processes = ['jobs']

[[vm]]
  memory = '8gb'
  cpu_kind = 'performance'
  cpus = 2
  gpus = 1
  gpu_kind = 'a100-pcie-40gb'
  processes = ['jobs']

I don’t think it’s application side, as this error occurs before the Firecracker VM even starts, and it has generally been working for a while. The only way I am able to fix it is by deleting the volume and creating a new one. The volumes just store my uv cache, so the data is ephemeral, but I need the volume because the uv cache is too large for the Docker images.