Postgres database down after v2 migration: "unable to use requested volume due to capacity constraints"

We have two Postgres databases that were disabled yesterday as part of the migration to v2.

When I run "fly deploy -a xxx-db-production --image flyio/postgres:13 --local-only", I get an error related to volume capacity:

This deployment will:
 * create 1 "app" machine

No machines in group app, launching a new machine

 ✖ Failed: error creating a new machine: failed to launch VM: unable to use requested volume, 'vol_okgj545083xry2wz' due to capacity constraints
Error: error creating a new machine: failed to launch VM: unable to use requested volume, 'vol_okgj545083xry2wz' due to capacity constraints (Request ID: 01HE94VFRNWR1VPWE5XQYYM0ZQ-lhr)

Right now we’re stuck with the database (and website) down. Does anyone know if the capacity constraints error is due to a configuration issue on our side, or an issue on Fly’s side?

And it’s a bit concerning that when I view the database’s volumes it says 0MB used for every volume :grimacing:

Hi there,

Can you check with fly volumes list and see if the information there looks OK?

I checked on my side and I do see various amounts of sector allocation in your volumes. They were NOT deleted; I’m fairly certain it’s just a display issue with the web UI. The output of fly volumes list would help confirm.

The capacity constraint issue might be because the host in which your volume resides no longer has enough resources to allocate your machine - since the volume itself cannot be moved, this creates the situation you have observed here.

I think the safest way to proceed here is to try to create a new database app from your most recent data snapshot; that way we leave the existing app and its volume untouched and don’t risk getting it into an even more wedged state, and new instances will be created by default on a host with enough capacity.

fly volumes list -a your-app should give you a list of volumes. If possible, choose the one that was attached to the primary unit. If that’s hard to determine, you can choose the oldest one. If the procedure doesn’t work with that one, you can just retry with a different volume - the procedure is non-destructive to volumes or snapshots.
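If the primary is hard to identify and you fall back to the oldest volume, a small helper can pick it out. This is only a sketch: the function name oldest_volume is made up for illustration, and it expects one "CREATED_AT VOLUME_ID" pair per line on standard input, with ISO 8601 timestamps in the same time zone so that a plain text sort is also a chronological sort.

```shell
# Hypothetical helper: print the ID of the oldest volume.
# Input: one "CREATED_AT VOLUME_ID" pair per line on stdin.
# ISO 8601 timestamps in the same time zone sort chronologically as plain text.
oldest_volume() {
  LC_ALL=C sort | head -n 1 | awk '{ print $2 }'
}
```

One way to feed it, assuming jq is installed and that the JSON output of fly volumes list carries id and created_at fields (worth checking against your own output first):
fly volumes list -a your-app --json | jq -r '.[] | "\(.created_at) \(.id)"' | oldest_volume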

If that checks out, you can get the list of snapshots for that volume:
fly volumes snapshot list vol_XXXXX

From that list, take the most recent snapshot, whose ID looks like vs_yyyyyyyyyyyy, and use that ID to create a new Postgres app. I’m feeding it all the configuration parameters at once. Note that this is a single command on one line; Discourse might display it split across two or more lines:

fly postgres create --name your-db-restored --vm-size performance-4x --volume-size 50 --initial-cluster-size 3 --region REGION --org your-organization --image-ref --stolon --snapshot-id vs_yyyyyyyyyyyy

What we’re doing here:

  • Creating a new Postgres app (change the name if preferred)

  • Selecting a performance-4x size, which defaults to 8 GB RAM; you can tweak this as desired and/or adjust VM size and memory once the database is rescued.

  • Selecting a 50-GB volume

  • The new cluster will have three units.

  • I’m pointing to the latest Postgres 13 image to match the old one.

  • I’m creating the new Postgres app with Stolon-based replication, which should mimic what your old one was using.

  • Finally and most importantly, we’re asking it to initialize the volume with the data from the snapshot we identified earlier.
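Because the command has to stay on one line when pasted, it can help to wrap it in a small shell function that takes the two values most likely to change. The following is a sketch, not an official procedure: restore_from_snapshot and run are hypothetical names, DRY_RUN is an invented switch that prints the command instead of running it, and the image reference is something you pass in rather than a value assumed here. Every flag is taken from the command above.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper around the restore command from the post.
#   $1: snapshot ID (looks like vs_yyyyyyyyyyyy)
#   $2: Postgres 13 image reference to pass to --image-ref
restore_from_snapshot() {
  snapshot_id="${1:-}"
  image_ref="${2:-}"

  # Guard against pasting a volume ID (vol_...) where a snapshot ID belongs.
  case "$snapshot_id" in
    vs_*) ;;
    *)
      echo "expected a snapshot ID starting with vs_, got: $snapshot_id" >&2
      return 1
      ;;
  esac

  if [ -z "$image_ref" ]; then
    echo "missing image reference" >&2
    return 1
  fi

  # Name, REGION, organization and sizes are the placeholders from the post; adjust them.
  run fly postgres create \
    --name your-db-restored \
    --vm-size performance-4x \
    --volume-size 50 \
    --initial-cluster-size 3 \
    --region REGION \
    --org your-organization \
    --image-ref "$image_ref" \
    --stolon \
    --snapshot-id "$snapshot_id"
}
```

Calling it with DRY_RUN=1 set first shows the exact one-line command for review; unset DRY_RUN to actually run it.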

What we want here is to start a new, fresh database server with mostly the same configuration as your old one, so it can understand the data in your existing volume. Instead of messing with the actual volume, we’re restoring from a snapshot.

Once this is completed, wait a bit for the new instance to come up and try connecting to it using fly postgres connect -a your-app, and check that your existing data is still there.

From here, you can reconfigure your app to connect to the new database, and once you’re certain things are working well, you can delete the previous database app and tune the new one to work as intended.
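Reconfiguring usually means pointing the app's connection string at the new database. As a sketch only: the helper below (database_url is a hypothetical name) assembles a Postgres URL, assuming that your app reads its connection string from a DATABASE_URL secret, that the new database app is reachable from your other apps at your-db-restored.internal on port 5432, and that the user name and password contain no characters that need URL-encoding. The credentials and database name are placeholders.

```shell
# Hypothetical helper: print a Postgres connection URL.
#   $1 user, $2 password, $3 database app name, $4 database name
# Assumes the database app is reachable at <app>.internal on port 5432
# and that user and password need no URL-encoding.
database_url() {
  printf 'postgres://%s:%s@%s.internal:5432/%s\n' "$1" "$2" "$3" "$4"
}
```

The result can then be stored as a secret on the app that uses the database, for example (your-web-app is a placeholder):
fly secrets set DATABASE_URL="$(database_url USER PASSWORD your-db-restored DBNAME)" -a your-web-app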

Let us know if this works.

- Daniel

I’m running into the same issue too. “the host in which your volume resides no longer has enough resources to allocate your machine” - why does this happen? Is this just due to Fly underestimating the resources used by all the volumes on the same host?

Is there anything we can do to mitigate this issue? For example, if I keep the app always running with min_machines_running=1, will this guarantee that my volume will always be mounted and running?

@roadmr these are great instructions! We were able to help these folks out as they also emailed support, but this is good for others to see.
