I received an email saying that a host has suffered irreparable hardware damage and that I need to migrate my Fly Machines to other hosts and restore volumes from backups. This happened a month ago in the MAD region, but I can’t recover by following the instructions. Am I missing something?
The first command ran without any problem, except that the volume it created is empty.
The image below shows both volumes: the old one and the newly created, empty one in the CDG region.
Error: “Failed to resolve source machine from snapshot.”
Could this be related to the machine attached to the volume? The machine is dead […]
I think you’re right. When restoring from a snapshot, we determine which Postgres app image to use by resolving the last Machine associated with the volume, and then using that Machine’s image. I figure the host being unreachable is what’s causing the image resolution issue in your case.
If you run fly postgres create --snapshot-id <your-snapshot> --image-ref <your-image-version>, you avoid the need to derive an image from a source Machine, which hopefully gets you further.
The Backup, Restores, & Snapshots doc walks through identifying your Postgres image version and restoring into a new cluster with the same image.
From your screenshot, the volume shows enough usage that I believe it’s non-empty.
One reason your restored app may be a blank slate is if it runs a very different image than your original - specifically, if one uses flyio/postgres-flex and the other uses flyio/postgres.
A lot changes between these two implementations, and they also write data to different paths (i.e., /data/postgresql vs. /data/postgres). If your original cluster writes to one location but your new cluster expects to find data elsewhere, that could explain why your restored app is empty despite the volume actually having data.
As a next step, I suggest double checking that both of your Postgres apps are running the same image. Run this against both and compare the image versions:
# Note the "Repository" and "Tag" columns
fly image show --app <postgres-app-name>
If there’s an image mismatch, you’ll have better luck restoring from the same snapshot if you recreate your new Postgres app using the same image as your original cluster.
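As a rough illustration of that comparison, here is a minimal shell sketch. The helpers repo_of and same_image_family are hypothetical (they are not part of flyctl), and the two image references are illustrative values; it only compares the repository part of each reference, since that is where the flyio/postgres-flex vs. flyio/postgres difference shows up.

```shell
# Hypothetical helpers, not part of flyctl.
# repo_of prints an image reference without its trailing ":tag".
repo_of() {
  printf '%s\n' "${1%:*}"
}

# same_image_family succeeds when both references share a repository.
same_image_family() {
  [ "$(repo_of "$1")" = "$(repo_of "$2")" ]
}

# Illustrative values; substitute the Repository and Tag from `fly image show`.
original="flyio/postgres-flex:15.2"
restored="flyio/postgres:14.6"

if same_image_family "$original" "$restored"; then
  echo "same repository"
else
  echo "repository mismatch: $(repo_of "$original") vs $(repo_of "$restored")"
fi
```

With the illustrative values above, the check reports a mismatch, which is the situation where a restored app may look empty even though the volume holds data.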
The problem is that machine 3d8dd10b925238, the original machine attached to the affected volume, isn’t reachable and doesn’t show any information. Running your command displays N/A.
Is there any other way to retrieve that information?
I tested with flyio/postgres-flex:17 and got an error requesting version 15. I tried flyio/postgres-flex:15 and it didn’t work. I tried flyio/postgres:14.6 and everything worked, but I can’t access the data.
Thanks for trying that and for testing those different versions.
Since fly image show isn’t working here, I’m not sure if there is a user-facing way to fetch your image details when a host is down. In the meantime, I can see in our backend that your unreachable Machine runs flyio/postgres-flex:15.2.
Going back to your new empty cluster, it sounds like (from your last post) it may be running flyio/postgres:14.6. If that’s right, then I suggest retrying your snapshot restore with the image that matches your original Machine, flyio/postgres-flex:15.2, passed via --image-ref.
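Reusing the flags from earlier in the thread, that retry would presumably look like the sketch below. The restore_cmd helper and the snapshot ID value are hypothetical placeholders; the helper only prints the command so it can be reviewed before you run it yourself.

```shell
# Placeholder: substitute the ID of the snapshot you are restoring from.
SNAPSHOT_ID="your-snapshot-id"
# Image reported in this thread for the unreachable Machine, including its minor version.
IMAGE_REF="flyio/postgres-flex:15.2"

# Hypothetical helper: prints the restore command instead of executing it.
restore_cmd() {
  printf 'fly postgres create --snapshot-id %s --image-ref %s\n' "$1" "$2"
}

restore_cmd "$SNAPSHOT_ID" "$IMAGE_REF"
```

Printing first is a deliberate choice here: fly postgres create provisions a new cluster, so it is worth confirming the snapshot ID and image before running the printed command.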
Specifying postgres-flex:15 (without minor version) defaults to the newest image version in this release, postgres-flex:15.10. This may have caused the issues when you tried 15 earlier.
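To catch this before a restore, a small shell check can flag tags that omit the minor version. This is only a sketch: is_pinned is a hypothetical helper (not a flyctl feature), and it assumes purely numeric tags like the ones in this thread.

```shell
# Hypothetical helper: succeeds when the tag after the last ":" looks like
# MAJOR.MINOR (for example 15.2) and fails for a bare major version or no tag.
is_pinned() {
  tag="${1##*:}"
  case "$tag" in
    ""|.*|*.|*[!0-9.]*) return 1 ;;  # empty, malformed, or non-numeric tag
    *.*) return 0 ;;                 # contains a minor version
    *) return 1 ;;                   # bare major version such as 15
  esac
}

for ref in flyio/postgres-flex:15 flyio/postgres-flex:15.2; do
  if is_pinned "$ref"; then
    echo "$ref: pinned"
  else
    echo "$ref: not pinned, may resolve to a newer minor release"
  fi
done
```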