Can't recover from volume snapshots on MAD region

Hey!

I received an email saying that a host has suffered irreparable hardware damage and that I need to migrate Fly Machines to other hosts and restore volumes from backups. This happened 1 month ago in the MAD region, but I can’t recover my data by following the instructions. Am I missing something?

I tried to follow the 2 commands they suggest to start a new volume/app from the volume snapshot.

fly volumes create pg_data \
  --snapshot-id vs_XvaK5wyKZk9lcJXg8PADZ \
  --size 10 \
  -a gymious-prod-db
fly postgres create \
  --snapshot-id vs_XvaK5wyKZk9lcJXg8PADZ \
  --volume-size 10

The first command ran without any problem, except that the volume it created is empty.
The image below shows both volumes: the old one and the newly created, empty one in the CDG region.

The second command failed with the following output.

fly postgres create \
  --snapshot-id vs_XvaK5wyKZk9lcJXg8PADZ \
  --volume-size 10

Creating postgres cluster in organization personal

Creating app...

Error: "Failed to resolve source machine from snapshot."

Could this be related to the machine attached to the volume? The machine is dead but I can’t delete or attach another machine to this volume.


The MAD machine can’t be stopped or deleted.

Did I just lose 10GB of data?

Is there any way to recover the data from the snapshots?

Error: “Failed to resolve source machine from snapshot.”

Could this be related to the machine attached to the volume? The machine is dead […]

I think you’re right. When restoring from a snapshot, we determine which Postgres app image to use by resolving the last Machine associated with the volume, and then using that Machine’s image. I figure the host being unreachable is what’s causing the image resolution issue in your case.

If you do fly postgres create --snapshot-id <your-snapshot> --image-ref <your-image-version>, this avoids the need to derive an image from a source Machine and hopefully gets you further.

The Backup, Restores, & Snapshots doc walks through identifying your Postgres image version and restoring into a new cluster with the same image.

Thanks @leslie, it worked.

But now I can’t see any data on the DB.

Is it possible that the snapshot is empty?

This is the new volume

From your screenshot, it looks like there’s enough usage in your volume that I believe it’s non-empty.

One reason your restored app may be a blank slate is if it runs a very different image than your original - specifically, if one uses flyio/postgres-flex and the other uses flyio/postgres.

Plenty changes between these two implementations, and they also write data to different paths (i.e. /data/postgresql vs /data/postgres). If your original cluster writes to one location but your new cluster expects to find data elsewhere, that could explain why your restored app is empty even though the volume actually holds data.
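
As a rough illustration of the path mismatch described above, here is a minimal sketch that reports which of the two data directories exist under a volume's mount point. It is meant to be run inside a reachable Machine (for example after `fly ssh console --app <postgres-app-name>`). The function name `find_pg_dirs` is hypothetical, and the `/data` mount point is an assumption taken from the paths quoted above.

```shell
# Hypothetical helper: print whichever of the two data directories
# mentioned above exist under the given root (default: /data).
find_pg_dirs() {
  root="${1:-/data}"
  for d in postgresql postgres; do
    # Only print directories that are actually present.
    [ -d "$root/$d" ] && echo "$root/$d"
  done
  return 0
}

# Check the assumed mount point; prints nothing if neither directory exists.
find_pg_dirs /data
```

If the directory that shows up is not the one the running image expects, the restored cluster would start from a blank slate even though the volume holds data.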

As a next step, I suggest double checking that both of your Postgres apps are running the same image. Run this against both and compare the image versions:

# Note the "Repository" and "Tag" columns
fly image show --app <postgres-app-name>

If there’s an image mismatch, you’ll have better luck restoring from the same snapshot if you recreate your new Postgres app using the same image as your original cluster.

The problem is that machine 3d8dd10b925238, the original machine attached to the affected volume, isn’t reachable and doesn’t show any information. Running your command displays N/A.

Is there any other way to retrieve that information?

I tested with flyio/postgres-flex:17 and got an error requesting version 15. I tried flyio/postgres-flex:15 and it didn’t work. I tried flyio/postgres:14.6 and everything worked, but I can’t access the data.

Thanks for trying that and for testing those different versions.

Since fly image show isn’t working here, I’m not sure if there is a user-facing way to fetch your image details when a host is down. In the meantime, I can see in our backend that your unreachable Machine runs flyio/postgres-flex:15.2.

Going back to your new empty cluster, it sounds like (from your last post) it may be running flyio/postgres:14.6. If that’s right, then I suggest retrying your snapshot restore with the matching image:

fly postgres create \
  --snapshot-id <your-snapshot-id> \
  --volume-size 10 \
  --image-ref flyio/postgres-flex:15.2

Specifying postgres-flex:15 (without minor version) defaults to the newest image version in this release, postgres-flex:15.10. This may have caused the issues when you tried 15 earlier.
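
To make the difference between a floating tag and a pinned one concrete, here is a small sketch that checks whether an image reference carries a major.minor tag before it is passed to `--image-ref`. The helper `is_pinned_ref` is hypothetical and not part of flyctl, and the pattern is a deliberately loose check rather than a full tag parser.

```shell
# Hypothetical helper: succeeds only when the tag (the text after the
# last colon) looks like major.minor, e.g. 15.2.
is_pinned_ref() {
  tag="${1##*:}"
  case "$tag" in
    [0-9]*.[0-9]*) return 0 ;;
    *) return 1 ;;
  esac
}

for ref in flyio/postgres-flex:15 flyio/postgres-flex:15.2; do
  if is_pinned_ref "$ref"; then
    echo "$ref: pinned to a minor version"
  else
    echo "$ref: floating tag, may resolve to a newer minor version"
  fi
done
```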

I hope this helps!

Thanks for helping, @leslie. We have our data back 🙂

Just FYI for anyone coming across this in the future, it should be possible to find the correct image by querying our GraphQL API: https://community.fly.io/t/region-deprecated-empty-postgresql-backups/25830/6
