Update on `could not reserve resource for machine` error

We’re getting reports of customers hitting `could not reserve resource for machine` errors when trying to deploy in certain regions (mad and nrt being the most affected at the moment). We want to be open with you about what’s happening and why, and what we’re doing to fix it.

What’s happening?

We have a few hosts in specific regions that are at or near capacity. If you (or your deployment) try to provision more resources on a host that doesn’t have the resources to spare, you will see this error. If you’re not already on one of these hosts, you should be unaffected, even within the same region.

If you have Machines on one of these hosts and they have auto-stopped, the host may not have enough CPU or memory free to start them up again when you deploy, or when the proxy tries to wake them up.

You might also hit this if you are cloning a Machine with a volume, or increasing the RAM or CPU specs on existing Machines.

If you’re deploying only to update the image, and your Machines are already running, you should not hit any resource errors.

What’s causing it?

This might seem simple at first glance (just get more hosts!), but the root issue is more complicated.

Just because we provision compute capacity doesn’t mean it gets used evenly. When we look at CPU usage on some of the “full” hosts, they look just fine. On top of that, the regional capacity is often fine as well, meaning a Machine could deploy to a different host in the same region with no issues. So this isn’t so much an issue with capacity as an issue with rebalancing the workload among hosts in a region.

So what changed? The biggest difference is the shift from primarily Nomad to primarily Machines apps. Machines made it much easier to scale to zero, which means (you guessed it) we do a lot more stopping and starting of Machines versus leaving allocs alone. We also try to recreate Machines on the same host to speed up deployment. Combined with scale to zero, this means a host can end up without enough free capacity to restart a stopped Machine when it’s time to update it.

Nomad would fail here too, but it would fail differently, in a less visible (and less frequent) way. Instead of stuck allocations, Machines deployments fail quickly, surfacing the dreaded `could not reserve resource for machine` error.

What can you do?

Your best action for the moment is to get your Machine (and volume, if you’re using one) onto another host in the same region. Make sure you’re using the most recent flyctl (there’s some tasty new volume forking spice in there for you!).
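
If you’re not sure what version you have, you can check (and, depending on how you installed flyctl, update in place) from the CLI itself; if you installed through a package manager, upgrade it that way instead:

fly version          # confirm what you're running
fly version update   # self-update, if you installed via our install script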

1. Move Machines without volumes

If your app doesn’t use volumes (and so isn’t storing data you need to keep), the easiest solution is to upgrade to flyctl v0.1.72 or newer, and redeploy. If the host doesn’t have the resources to start a Machine in place, it will now try replacing it with a new Machine on a different host.
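
In practice that’s just a redeploy once you’re on a new enough flyctl; <app-name> below is a placeholder for your app (or run it from your app’s directory and drop the flag):

fly deploy -a <app-name>    # flyctl will try a different host if the current one is full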

If you prefer to surgically replace specific Machines, you can fly machine clone the one you want to move, then destroy the Machine on the old host.
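
As a rough sequence (the Machine ID below is a placeholder; look yours up first):

fly machines list                 # note the ID of the Machine on the full host
fly machine clone <machine-id>    # create a copy; placement is up to the platform
fly machine destroy <machine-id>  # remove the original; stop it first if it's still running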

2. Move Machines with attached volumes

If your app takes care of replication, you can clone the old Machine — fly machine clone will provision an empty volume of the same size — and then destroy the original Machine. If the Machine you’re about to destroy is a cluster leader and is still running, it’s worth failing over to a different leader.
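
Sketching that out, with <machine-id> standing in for the Machine you’re moving, and your app handling replication onto the new (empty) volume:

fly machine clone <machine-id>    # new Machine with a fresh, empty volume of the same size
fly machine destroy <machine-id>  # once replication has caught up and any leader has failed over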

If the app doesn’t do any data replication, but the data you need is in a volume snapshot, you can clone an existing Machine in your app, using the --from-snapshot option to populate the new volume from a snapshot of the original. Then delete the old Machine.
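
That might look like this, with the volume, snapshot, and Machine IDs as placeholders you’d look up along the way:

fly volumes list                              # find the volume attached to the old Machine
fly volumes snapshots list <volume-id>        # pick a recent snapshot of it
fly machine clone <machine-id> --from-snapshot <snapshot-id>
fly machine destroy <machine-id>              # remove the old Machine once the clone looks good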

Once you’re confident that your new Machines have access to the correct data on their volumes, you can delete the unused volumes so that you won’t be charged for them.
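
For example (destroying a volume is permanent, so double-check the ID first; <volume-id> is a placeholder):

fly volumes list                 # volumes with no attached Machine are the leftovers
fly volumes destroy <volume-id>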

3. Keep Machines from auto-stopping

As a temporary preventive measure, you can turn off auto-stop (i.e. scale to zero) on your app by setting the following in your fly.toml and deploying:

[http_service]
...
auto_start_machines = true
auto_stop_machines = false
...
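
Then redeploy to apply it; the proxy will keep starting Machines on demand, it just won’t stop them anymore:

fly deploy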

What are we doing to fix it?

The good news is we’ve been working on it!

We’re starting with a simple fix to detect the error in flyctl and replace the Machine with one on a different host in the same region as part of fly deploy.

This has limitations, but we plan to extend it to more cases, like apps with volumes, soon after.

We’re also working on a longer-term solution to rebalancing regions. This is a hard problem, and we’ll keep you updated on the progress.
