Issue Getting Data After Host Machine Crash

Howdy,

Having a bit of an issue. I have an admin app that we do not look at every day. It turns out on April 2nd the host machine went down. I got the following message in the fly.io console. I did not get any emails from fly.io telling me about this, nor do I see any activity in the activity stream.

2024-04-02 00:13:07 UTC: A server hosting some of your apps has suffered irreparable hardware damage. Please migrate your Fly Machines to other hosts and restore volumes from any backups.

I discovered this issue yesterday. My app was up, but in a different datacenter. It seems that when the host machine went down, the app somehow moved from ORD to SEA. The volume, however, did not. So the app was up but not operating.

I also noticed there were two different volumes in the original datacenter. So maybe one got copied? When I go to the volumes tab in the fly.io UI, it just spins until it errors out. I can’t even see my volumes there.

I redeployed the app to the ORD datacenter and reconnected it to one of the volumes. The app starts, but the volume seems empty. I then tried again with the second volume. There is something wrong with this one: it errors out when I try to boot the app, it errors out when I try to list the snapshots, and I can’t fork it.

I then restored from a snapshot of the first volume. Same result.

Snapshots have continued to be made since then, but because this all started on April 2nd, I can’t restore a snapshot from that far back.

Does anyone have any idea how I can get my database back?

Thanks.

Hi Spicer, I hate to be the bearer of bad news, but I’m afraid your data is very likely gone. This is actually the first time we’ve had hardware failures like this, and we’re still working out the kinks in how to respond. Yes, we need to send out emails sooner. We also need to emphasize still more that we view Fly Volumes as persistent storage, but not durable long-term storage; as we say in the docs:

Create and store backups: If you only have a single copy of your data on a single volume, and that drive fails, then the data is lost. Fly.io takes daily snapshots and retains them for 5 days, but the snapshots shouldn’t be your primary backup method.

But if that comes as a surprise to you, then we didn’t succeed in setting your expectations about the platform correctly, and that’s something we should be better about.
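To make the retention window from the docs quote concrete, here is a small date calculation. It is only a sketch: it assumes one snapshot per day, a strict five-day cutoff, and April 22nd as today. The function names are made up for illustration, and the exact boundary behavior of the platform may differ by a day.

```python
from datetime import date, timedelta

RETENTION_DAYS = 5  # "retains them for 5 days", per the docs quote


def oldest_restorable(today: date, retention_days: int = RETENTION_DAYS) -> date:
    """Earliest snapshot date that could still exist (assumed strict cutoff)."""
    return today - timedelta(days=retention_days)


def restore_deadline(failure_day: date, retention_days: int = RETENTION_DAYS) -> date:
    """Last day on which a snapshot from the failure day could still be restored."""
    return failure_day + timedelta(days=retention_days)


failure_day = date(2024, 4, 2)   # host failure reported in the console
today = date(2024, 4, 22)        # assumed date of this reply

print(oldest_restorable(today))       # 2024-04-17
print(restore_deadline(failure_day))  # 2024-04-07
```

Under those assumptions, any snapshot holding pre-failure data would have aged out around April 7th, about two weeks before the problem was noticed.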

None of this is helpful for getting your data back though. The odds are not good, but here are some long shots. There is still a volume vol_kgj5450wwzqry2wz which is in ord, was created in 2023, and has some data on it. That will not be exactly the same as the volume you were using, but maybe it has some of your data. You might see if you can attach a Machine to that. Your app has other volumes, but they all look like they were created today, on April 22nd. They wouldn’t have your data.

If you’re trying to recover exactly the data on vol_0enxv3y9nko48okp, then, if it’s possible at all, it will require thinking outside the box. The database is gone on the Fly.io side, but maybe you had it dumped locally at some point? Or, if you echoed query results to your logs, maybe you could recreate your data from the logs? Try to think of any place that might have seen data cached or echoed, and try to recover it from there. I wish you the best of luck.

@john-fly Thanks for your thoughtful response. I have to admit that when I first realized what was going on, I was very mad. To your point, I took snapshots for granted. I just assumed they were for backups. Glad I am learning this now on something we can easily recover from. Good news is losing the data is not the end of the world. I will try your suggestions tho. Maybe there is hope. Thank you for the suggestions.

I went ahead and set up backups. I used the strategy in the post below (maybe this will be helpful for someone). I am sure we could do the same thing with fly.io scheduled machines. I just did not find any easy-to-follow documentation.

https://rodolfosilva.com/github-actions/automating-flyio-database-backups-with-github-actions/
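For anyone who would rather run a backup as a plain script (for example from a scheduled job), here is a minimal sketch. It is not the method from the linked post. It assumes the database is a single SQLite file on the volume, which may not match your setup, and the function name, file naming scheme, and `keep` parameter are all made up for illustration. It uses Python's built-in `sqlite3` backup API to take a consistent copy, then prunes older copies.

```python
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def backup_sqlite(db_path: Path, backup_dir: Path, keep: int = 14) -> Path:
    """Write a timestamped copy of db_path into backup_dir and prune old copies."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backup_dir.mkdir(parents=True, exist_ok=True)

    # Fixed-width UTC timestamp, so file names sort chronologically.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = backup_dir / f"{db_path.stem}-{stamp}.sqlite3"

    # Connection.backup() copies a consistent snapshot even while the
    # source database is in use, unlike a raw file copy.
    source = sqlite3.connect(db_path)
    dest = sqlite3.connect(target)
    try:
        source.backup(dest)
    finally:
        dest.close()
        source.close()

    # Keep only the newest `keep` backups for this database.
    backups = sorted(backup_dir.glob(f"{db_path.stem}-*.sqlite3"))
    for old in backups[:-keep]:
        old.unlink()
    return target
```

The important part is not shown here: the resulting file has to be copied somewhere off the volume and off the host (object storage, another provider, your own machine), otherwise it is lost along with the volume in exactly the kind of failure described above.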

Thanks again @john-fly for the thoughtful response. Good to know the fly team has a good support team behind it.
