Postgres unreachable due to irreparable damage on host machine

I’m hoping someone from fly.io support can help with an issue affecting the db for my hobby project, which was caused by a fly.io hardware error.

My Postgres volume vol_4oj7858k9qw6w59r (app: backyardgarden-pg, region: sjc, zone: 1527) is stranded due to hardware damage on that host. The volume is in created state and currently unattached, but I cannot create a new machine in zone 1527 – I get “insufficient resources” (Request ID: 01KRS7MQB4WQEGBKD8W7GPPW01-ord).

I need either:

  1. The volume migrated to a healthy zone so I can attach a machine and pg_dump, or
  2. A raw snapshot/export of the volume data created by your team directly

(looks like there are no snapshots of this volume)

I think volumes have five days of snapshots configured by default. Did you turn them off?

No, I haven’t knowingly turned off snapshots. Is it possible the snapshots are stored on the same host and just aren’t showing up? Or is there another way to check the snapshot config?

Or does that default not apply to unmanaged postgres instances?

Good questions, not sure. I hope the snapshots aren’t on the same host; that would be a counterproductive design decision, for obvious reasons. I’d do this next:

flyctl volumes ls

Then, if you can get a volume ID from that:

flyctl volumes snapshots ls <volume-id>

Could you put the output of both of those here?

Snapshots are stored on S3, last I heard.

There are nuances with listing them, however, although I don’t recall the details…

(It’d be best to be as specific as possible about the exact commands that you’re trying, as @halfer is generally indicating above.)

The ls gives:

 ID                    │ STATE   │ NAME    │ SIZE │ REGION │ ZONE │ ENCRYPTED │ ATTACHED VM │ CREATED AT  
 vol_4oj7858k9qw6w59r* │ created │ pg_data │ 1GB  │ sjc    │ 1527 │ true      │             │ 1 month ago 

* These volumes' hosts could not be reached.

and the snapshots ls gives:

No snapshots available for volume vol_4oj7858k9qw6w59r

Here’s the gotcha that I was thinking of…

(It weirdly says no snapshots available if you omit the -a <my_app_name>, rather than just reporting an error.)
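For anyone landing here later, a minimal sketch of the call with the app name included, using the volume ID and app name from this thread. The run helper and the DRY_RUN variable are my own additions so that the script only prints the command unless you opt in; they are not part of flyctl.

```shell
#!/bin/sh
# Values taken from the posts above.
VOLUME_ID="vol_4oj7858k9qw6w59r"
APP_NAME="backyardgarden-pg"

# Hypothetical helper: print the command by default,
# execute it only when DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# List snapshots WITH the app name; omitting -a is what produced
# the misleading "No snapshots available" message in this thread.
list_snapshots() {
  run flyctl volumes snapshots ls "$VOLUME_ID" -a "$APP_NAME"
}

list_snapshots
```

Run it with DRY_RUN=0 to actually query Fly instead of just printing the command.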

This host has been offline for a week or two (can’t remember off the top of my head). We had a host issue posted saying it would be down for a while, and only recently switched it to permanent loss once we found out it couldn’t be fixed. The snapshots might have expired in that time?

Ah! including the -a <my_app_name> shows snapshots! Thanks for that

Good spot from @mayailurus. Assuming you have the app name set in your TOML config, and assuming that config is in the root folder of your project, just cd to that folder, and you can drop the -a spec.

Hopefully, now that you can see snapshots, you can recover from the most recent one. I’d say that NOW is a good time to arrange off-site backups, so you don’t get caught by this risk again. Also, if you only have the default number of daily snapshots, change it to something higher.
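A rough sketch of those three steps, in case it helps. Treat it as a starting point rather than a tested recipe: the snapshot ID, the new app name, the new volume ID, the local port 15432, the 14-day retention, and the database name postgres are all placeholders or assumptions of mine, and the run helper is hypothetical (it only prints commands unless DRY_RUN=0). Please check each flag against flyctl help before running anything for real.

```shell
#!/bin/sh
# Hypothetical helper: print the command by default,
# execute it only when DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# 1. Restore: create a new Postgres app from a snapshot.
#    $1 = snapshot ID (from the snapshots ls output), $2 = new app name.
restore_from_snapshot() {
  run flyctl postgres create --name "$2" --region sjc --snapshot-id "$1"
}

# 2. Keep more daily snapshots on the new volume.
#    $1 = volume ID, $2 = retention in days, $3 = app name.
set_retention() {
  run flyctl volumes update "$1" --snapshot-retention "$2" -a "$3"
}

# 3a. Forward a local port to the database. This stays in the
#     foreground, so run it in a separate terminal. $1 = app name.
open_proxy() {
  run flyctl proxy 15432:5432 -a "$1"
}

# 3b. Off-site backup: dump through the proxy to a local file.
#     $1 = output file. Assumes the default "postgres" user and database.
dump_offsite() {
  run pg_dump -h localhost -p 15432 -U postgres -Fc -f "$1" postgres
}

# Print the plan with placeholder values.
restore_from_snapshot "<snapshot-id>" "backyardgarden-pg-restored"
set_retention "<new-volume-id>" 14 "backyardgarden-pg-restored"
open_proxy "backyardgarden-pg-restored"
dump_offsite "backyardgarden.dump"
```

Once the dump file is on your own machine, copy it somewhere that is not Fly, and schedule the dump so it does not depend on you remembering.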