Can I create a new machine in ORD to attach my PG volume to?

      • Machine: 4d89213c256138
      • Region: ORD
      • Issue: VM resize (shared-cpu-1x → performance-1x) caused EXT4 filesystem corruption (bad block bitmap checksum). Postgres aborts on startup.
      • Snapshot restore stuck: vol_vgjzndkg6ypp56pv in “restoring” state for 35+ minutes
      • Can’t launch new machines: ORD returning “insufficient resources” for ALL sizes including shared-cpu-1x
      • Fork volume worked (vol_493z1dyzj6l3jdz4) but can’t attach to a machine — no ORD capacity
      • Production database is down
      • Snapshot IDs: vs_7PZqQeaq75G1h9AqVD2laNbz (1hr), vs_DzQY5okY4l1Gc8GNv37Xxe5 (1 day), vs_jQxD2A8Dbg1BclXZ4zNJ6G0D (2 days)
      • Single node, no cluster — just one machine with one volume

      Need either ORD capacity freed up or help completing the snapshot restore.

Is this a self-hosted Postgres? If so, is it in a cluster, and how many nodes do you have in the cluster? I believe folks in this forum have rebuilt a single unhealthy node in a cluster from other healthy nodes.

Hey halfer - thanks for the response. Single node, no cluster — just one machine (4d89213c256138) with one volume. The volume has filesystem corruption (EXT4 bad block bitmap checksum)
and Postgres won’t start. I have a clean forked volume (vol_493z1dyzj6l3jdz4) ready to go but can’t launch any machines in ORD — getting
“insufficient resources” errors on all VM sizes including shared-cpu-1x. Need either ORD capacity or help with the stuck snapshot restore (35+ min
in “restoring” state).

OK, a clean forked volume is a good start; at least you have not lost your data. I also assume you have snapshots configured and tested as working.

Is this a good time to boot up a managed Postgres instance and import your data into that?

If not, and if ORD is out of capacity, consider spinning up something in a neighbouring region for now. I assume getting back online is more important than region-related latency issues.

Thanks — yes, the forked volume gives me confidence the data is intact. I have snapshots but haven’t been able to test them since the restore has
been stuck for 40+ min.

I’m considering spinning up a fresh Postgres in a nearby region (DFW or IAD) to get back online, but the forked volume with my data is stuck in
ORD. So I’d come up online with an empty database. Is there a way to move or restore data across regions, or do I need to wait for ORD capacity to
come back so I can read from the forked volume first?

Open to managed Postgres suggestions too — what would you recommend?

I don’t use PG on Fly. However, I think people have had a good experience with Fly’s MPG. It is way, way safer than running a single node on a Fly app; people getting burned by that configuration is commonplace here. Fly MPG seems to be supported in your region.

How much data do you have in GB? Would it take you a long time to move the data to another region?

Thanks — only about 583MB of data, so moving regions would be quick once I can read from the forked volume. The problem is I can’t launch any machine in ORD to access it.

MPG sounds like the right move long-term. Is that fly postgres create with the --flex flag, or is there a separate setup? And can I restore from a volume snapshot into MPG?

I think fly postgres is for the self-host option; you want flyctl mpg.

I am surprised ORD is completely full. I will see if I can create a machine there. Update: I am struggling to find an image to launch, but the capacity number does look problematic.

That would be great. If there’s capacity, I can get my data back and then move to MPG.

Hmm, I assume negative capacity means there is a problem:

$ curl -s 'https://api.machines.dev/v1/platform/regions?size=shared-cpu-1x' | jq -c '.Regions[]|[.code,.capacity]' | head -15
["ams",-13386]
["arn",7667]
["bom",1150]
["cdg",16463]
["dfw",-5210]
["ewr",-4051]
["fra",-8822]
["gru",11250]
["iad",-65322]
["jnb",2828]
["lax",-5298]
["lhr",7722]
["nrt",-5285]
["ord",-52415]
["sin",-3931]
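
To pick out only the regions reporting negative capacity, something like the following might help. It is a minimal sketch that assumes the response shape seen in the command above (a top-level `Regions` array whose entries have `code` and `capacity` fields); the function name `negative_capacity` is made up for illustration, and `jq` needs to be installed.

```shell
#!/bin/sh
# Hypothetical helper: reads the JSON body of the regions endpoint on stdin
# and prints "code<TAB>capacity" for each region with negative capacity,
# most negative first. Assumes the .Regions[].code / .capacity shape.
negative_capacity() {
  jq -r '.Regions
         | map(select(.capacity < 0))
         | sort_by(.capacity)
         | .[]
         | "\(.code)\t\(.capacity)"'
}

# Example usage (needs network access):
#   curl -s 'https://api.machines.dev/v1/platform/regions?size=shared-cpu-1x' | negative_capacity
```

This only filters and sorts whatever numbers the endpoint returns; it says nothing about whether those numbers reflect real capacity.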

The capacity API is no longer accurate, unfortunately. See this other recent forum thread for more context:

https://community.fly.io/t/persistent-could-not-reserve-resource-error-in-ord/27356

As a small side note, there are some suggestions for doing so in the other thread, for those who happened to find this one first.

(And it looks like that approach did work in this case.)
