volumes disappeared after hardware failure

I recently got a notification about a hardware failure on the host running my VM.


I was thinking about moving it to another region, but I am stuck on migrating the volumes.

The Fly.io admin panel shows an error with no additional description.

Is it correct that the volumes, along with their snapshots, are lost? Or are the snapshots stored somewhere else, so they can be restored?

Hi Konstantin,

Restoring a volume from snapshot needs to be done from the command line.

If you run `flyctl volumes list -a <YOUR_APP_NAME>`, you’ll see the volumes associated with your app (and in this situation, at least one of them will have an asterisk indicating that flyctl could not connect to the host).

To see your snapshots, run `flyctl volumes snapshots list <THE VOLUME ID YOU FOUND ABOVE>`. This will show you the snapshots of that volume.

To create a new volume from a snapshot, pass the snapshot ID to `flyctl volumes create <NEW VOLUME NAME> --snapshot-id <SNAPSHOT ID>`.
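The three steps above can be collected into a small shell sketch. The function names, the `DRY_RUN` switch, and the example app, volume, snapshot, and region values are illustrative assumptions, not part of flyctl; only the `fly volumes` subcommands and the `-a`, `--snapshot-id`, and `--region` flags come from the CLI itself.

```shell
# Sketch only: wrappers around the flyctl steps described above.
# Set DRY_RUN=1 to print each command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: list the app's volumes to find the affected volume ID.
# Argument: app name
list_volumes() {
  run fly volumes list -a "$1"
}

# Step 2: list the snapshots that exist for that volume.
# Argument: volume ID
list_snapshots() {
  run fly volumes snapshots list "$1"
}

# Step 3: create a new volume from a snapshot.
# Arguments: app name, new volume name, snapshot ID, region
restore_volume() {
  run fly volumes create "$2" --snapshot-id "$3" --region "$4" -a "$1"
}
```

For example, `DRY_RUN=1 restore_volume my-app data snap_123 ams` prints the command it would run, where `my-app`, `data`, `snap_123`, and `ams` are placeholder values. Running with `DRY_RUN=1` first is a cheap way to check the arguments before anything is created.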

Let me know if you have any more questions.

fly volumes snapshots list vol_xme14981336vowpl


hm, it shows that there are no snapshots available :frowning:

Ah, I see. I hate to be the bearer of bad news, but it looks like all the snapshots have expired. By default, snapshots expire after 5 days, because we view Fly Volumes as persistent storage but not as durable long-term storage. As we say in the docs:

Create and store backups: If you only have a single copy of your data on a single volume, and that drive fails, then the data is lost. Fly.io takes daily snapshots and retains them for 5 days, but the snapshots shouldn’t be your primary backup method.

But if that comes as a surprise to you, then we didn’t succeed in setting your expectations about the platform correctly, and that’s something we should be better about.

Your data still exists on the disks of the host server, and if we can get it back in service then you should see your data come back. But if the server cannot be restarted, then I’m afraid your data is gone, at least on the Fly Platform. Recovering exactly the data on vol_xme14981336vowpl, if it is possible at all, will require thinking outside the box. The volume is currently unavailable on the Fly.io side, but maybe you dumped it locally at some point? Or, if you echoed query results to your logs, maybe you could recreate your data from the logs? Try to think of any place where the data might have been cached or echoed, and try to recover it from there. I wish you the best of luck.

I will partially restore the data from a local cache (it will take 1-2 weeks to piece it together). Is there a way to get a notification when the disk is restored?

You should get an email, and I will follow up with a manual email when I learn that your host has been restored.

Actually, I found out that there was no notification about the service disruption. That means there was no way to restore the volumes unless I had set up external monitoring to learn about the problem before the snapshots expired. I hope I will get a notification about the host restoration :confused:

Actually, today the server came back online with all my data. That allowed me to migrate the data to another datacenter, and everything now works like a charm. Big thanks for the support!

Hi Konstantin, I’m glad we got everything restored, and I just wanted to follow up with a few clarifying points for you and anyone else who reads this about why this happened and how to prevent this from happening again.

This started with a single host server outage. We don’t regard single host failures as a failure of the Fly Platform, which is why we often say in the docs and CLI (and are looking for more places to repeat it) that the intended way to use the Fly Platform is to run Apps with two or more Machines. I will quote this again from the docs:

Create and store backups: If you only have a single copy of your data on a single volume, and that drive fails, then the data is lost. Fly.io takes daily snapshots and retains them for 5 days, but the snapshots shouldn’t be your primary backup method.

It looks like the software you’re running is Vaultwarden, and after a quick look I couldn’t tell whether Vaultwarden offers a “High Availability” configuration. If not, it will be difficult to run it as a unified cluster. But even if you have to run this as a single Machine, you should have an automatic procedure to make your own backups.
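As a rough illustration of what such an automatic procedure might look like, here is a minimal shell sketch that archives a data directory and keeps only the newest few archives. The function name `make_backup`, the archive naming scheme, and the default of 7 retained copies are assumptions made for this example, not anything Fly.io or Vaultwarden prescribes. The destination directory should sit outside the source directory, and the archive still has to be copied off the Machine (for example to object storage) to survive a host failure.

```shell
# Sketch only: make_backup SRC_DIR DEST_DIR [KEEP]
# Archives SRC_DIR into DEST_DIR/backup-<UTC timestamp>.tar.gz and
# deletes all but the newest KEEP archives (default 7, an assumed value).
make_backup() {
  src=$1
  dest=$2
  keep=${3:-7}
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  archive="$dest/backup-$stamp.tar.gz"

  mkdir -p "$dest" || return 1
  # Archive the contents of SRC_DIR with paths relative to it.
  tar -czf "$archive" -C "$src" . || return 1

  # Timestamped names sort chronologically, so a reverse sort puts the
  # newest first; everything after the first KEEP entries is removed.
  ls -1 "$dest"/backup-*.tar.gz | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
      rm -f -- "$old"
    done

  # Print the path of the new archive so a caller can upload it.
  echo "$archive"
}
```

A scheduled job could call something like `make_backup /data /backups 7` and then upload the printed path; both paths here are placeholders. If the directory holds a live database, a plain file copy may not be consistent, so the database's own dump or backup tool is the safer source for the archive.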

But the second step towards this problem was our mistake. You didn’t receive any notification that the host server was offline. We should have sent one. The reason for this mistake is that we were in the middle of releasing a new, more automatic method of sending notification emails. Because of this change in procedure, there was a mix-up among our staff about what needed to be done to send notification emails, and as a result no emails were sent. If we had sent emails, you could have taken action in time to restore from a Fly.io Volume snapshot, instead of waiting for the server to come back online.

I apologize for this mistake. There should have been notification emails. But I hope you can see there was a specific cause for the failure to send emails, and you can trust that you will receive host failure notification emails in the future. With notification emails and multi-machine deployment or custom backups, your data will be secure.
