failed to launch VM: insufficient resources to create new machine

Hi, I’ve been experiencing this error for some hours now:

Failed: error creating a new machine: failed to launch VM: insufficient resources to create a new machine with existing volume ‘vol_xxxxxxxxxxxxxx’

I’ve tried to switch to other regions (cloning the required volumes, of course), but none of them have been successful for me. I always get the same error.

My deployment requires GPUs, so my region options are limited. According to the CLI ‘fly platform regions’ output, the ams region has available capacity; however, the error persists.

Is anyone experiencing this as well?

Hi… It would probably help to post the exact commands that you’ve been trying, since some ways of creating Machines will stubbornly cling to older volumes, etc.

Also, is it complaining about the same volume ID each time, and the like?

‘fly platform regions’ just shows capacity for its default Machine size (which is performance-1x, non-GPU), whereas GPU hosts are a completely disjoint subset, as I understand it.

You can try to use the underlying Machines API to query instead for the desired GPU flavor, but that part of the infrastructure is suspect lately (unfortunately). For example, iad is currently showing negative capacity (for the default performance-1x).

Oh, I see.

The command I’m using is: fly deploy

With this fly.toml

app = 'my-app-name'
primary_region = 'iad'
vm.size = 'a100-80gb'

[build]
image = 'myorg/an-image'

[deploy]
wait_timeout = "10m"

[env]
  SOME VARS HERE...

[[mounts]]
  source = 'data'
  destination = '/app/data'
  initial_size = '5gb'

[http_service]
  internal_port = 8000
  force_https = true
  auto_stop_machines = 'stop'
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']

The volume ID changes if I use different regions (the primary_region field), but if I use the same region between runs, the command always refers to the same volume ID (the one assigned to the ‘data’ mount in the toml file for the region being attempted).

Thanks… Capacity looks super-tight for a100-80gb right now, assuming that the API actually is still accurate:

region capacity
ams 3
iad 0
sjc 0
syd 8
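If you end up with a listing like the one above and want only the regions that still have room, a small filter is enough. This is just a sketch: the function name regions_with_capacity is made up for illustration, and it assumes the two-column "region capacity" layout shown here rather than any particular API response format.

```shell
#!/bin/sh
# Hypothetical helper: reads a "region capacity" table on stdin and prints
# the region codes whose capacity is greater than zero.
regions_with_capacity() {
  # Skip the header row (NR > 1), keep rows where the second column is positive.
  awk 'NR > 1 && $2 > 0 { print $1 }'
}

# Usage with the numbers posted in this thread:
regions_with_capacity <<'EOF'
region capacity
ams 3
iad 0
sjc 0
syd 8
EOF
```

With the table above, this prints ams and syd, one per line.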

Also2, what command were you using to clone the volume? Remember that you (non-intuitively) have to say in advance that it’s going to be for a GPU Machine, :dragon:

:hushed_face:

I was using this to clone the volume: fly volumes fork vol_xxxxxxxxxxx --region zzz

I saw that the ord region shows a capacity of 19 for the a100-40gb size, so I’ll recreate the volume in that region using the --vm-gpu-kind parameter and retry the deploy with the a100-40gb size. I already tried this other size, but maybe the key is in the way the volume was forked (I honestly doubt it, but who knows…)

Well… that did the trick. The ·$%&# thing just created the machine… :zany_face:

I destroyed the previous volume in the ord region and re-forked it, setting the GPU kind to ‘a100-40gb’ (the one available in that region).

Then I switched my deployment to ord region and set vm.size to ‘a100-40gb’
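For anyone landing here later, the sequence that worked can be sketched roughly as follows. It is a dry run by default (it only prints the commands, and runs them only if RUN=1 is set). Both volume IDs are placeholders, the run helper is made up for this sketch, and ‘a100-40gb’ is simply the value reported in this thread, so check ‘fly volumes fork --help’ for the GPU kinds your flyctl version accepts.

```shell
#!/bin/sh
# Sketch of the fix from this thread. Prints the commands unless RUN=1 is set.
SRC_VOLUME="vol_xxxxxxxxxxx"   # placeholder: the volume being forked
OLD_FORK="vol_yyyyyyyyyyy"     # placeholder: the earlier ord fork made without a GPU kind
REGION="ord"
GPU_KIND="a100-40gb"           # value used in this thread

# Hypothetical helper: echo the command in dry-run mode, execute it when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# 1. Remove the fork that was created without a GPU kind.
run fly volumes destroy "$OLD_FORK"
# 2. Fork again, stating in advance that the volume is for a GPU Machine.
run fly volumes fork "$SRC_VOLUME" --region "$REGION" --vm-gpu-kind "$GPU_KIND"
# 3. Deploy, after setting primary_region = 'ord' and vm.size = 'a100-40gb' in fly.toml.
run fly deploy
```

The point of step 2 is that the fork has to be created with the GPU kind up front; a volume forked without it apparently cannot be attached to a GPU Machine, which is what produced the "insufficient resources" error here.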

fly deploy output:

==> Verifying app config
Validating fly.toml
✓ Configuration is valid
--> Verified app config
==> Building image
Searching for image 'myorg/an-image' remotely...
image found: img_xxxxxxxxxxxxxxxx

Watch your deployment at https://fly.io/apps/my-app/monitoring

INFO Using wait timeout: 10m0s lease timeout: 13s delay between lease refreshes: 4s

Process groups have changed. This will:
 * create 1 "app" machine

No machines in group app, launching a new machine
-------
 ⠏ Machine xxxxxxxxxxxxxxxx [app] was created

The machine has been created, and in Grafana I can see:

Pulling container image docker-hub-mirror.fly.io/xxxxxxxxxxxxxxx

And my image boot messages…

All right!! Thank you so much! :partying_face:
