Recovery options for bad volumes?

Hello! I recently ran into an issue with one of my apps - I think a problem with the underlying machine and/or volume is responsible, and I’d like to find out what my options are (if any) in terms of data recovery.

The symptoms I ran into were:

  • Requests to the application were returning 504s.
  • fly deploy failed to complete due to timeouts (it printed repeated 408 errors). I ran fly machine destroy and retried the deploy to force creation of a new Machine, but it hit the same problem, leaving Machines stuck in a created state.
  • Creating a Machine by hand with fly machine run node:22.10.0-slim --shell -v $VOL_ID:/data also times out, with the same “machine stuck in created” behavior. Running fly machine run node:22.10.0-slim --shell without attaching a volume, or attaching a brand-new volume I’d just created, does work.
  • There are multiple other volumes for the app stuck in an enabling_remote_export state - I’m guessing these correspond to snapshot creation.
  • The volume doesn’t seem to have any snapshots (checked roughly as sketched below this list) - it looks like the automatic snapshotting will still delete old snapshots even when new ones aren’t getting successfully created. :grimacing:
  • Forking the volume did create a new volume, but it’s stuck in an enabling_remote_export state.
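
For reference, here’s roughly how I was checking volume state and snapshots (the app name and volume ID are placeholders, not my real ones):

    # Show a volume’s current state (the stuck ones report enabling_remote_export here)
    fly volumes show vol_1234567890 -a my-app

    # List a volume’s snapshots - for my data volume this came back empty
    fly volumes snapshots list vol_1234567890 -a my-app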

I fortunately have a three-month-old backup, so not all is lost - sadly, my backup process ran on a scheduled Fly Machine that decided to just…stop running its daily schedule three months ago.
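
For context, the backup job was just a daily scheduled Machine along these lines (the image name, app name, and volume ID are illustrative, not my exact setup):

    # A Machine created with --schedule daily is restarted by Fly roughly once a day;
    # mine silently stopped being rescheduled three months ago.
    # (my-backup-image is a placeholder for an image whose entrypoint tars up
    #  /data and ships the archive somewhere off-Fly)
    fly machine run my-backup-image \
      -a my-app \
      --schedule daily \
      -v vol_1234567890:/data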

I did read the warning about using volumes, and I’m wishing I’d had better monitoring (guess what I set up yesterday!), but I’m hoping folks here can give me some ideas on anything I could try to recover my data. Thanks!

Hm… It looks like you’ve tried all the classics already, :adhesive_bandage:… The only other ideas that really come to mind are:

  • Try fly vol list --all to make sure that the snapshots weren’t tied to a now-deleted volume. (This can happen when a Machine is auto-migrated, if I understand correctly.)
  • Check all volumes for snapshots, more broadly, if you haven’t already. Sometimes an individual physical host machine gets glitchy, but the others in your cluster are ok.
  • Fork to a completely different region. A few regions have been having capacity crunches lately, so it may be that your attempted forks keep landing on the same shaky server. (Rough versions of all three ideas are sketched below.)
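
Untested and off the top of my head - the app name, volume ID, and region are placeholders, and I’m assuming fork accepts --region (check fly volumes fork --help):

    # 1. Include destroyed volumes, in case the snapshots were tied to one
    fly volumes list -a your-app --all

    # 2. Check every volume for surviving snapshots
    #    (assumes the --json output is an array of objects with an "id" field)
    for vol in $(fly volumes list -a your-app --all --json | jq -r '.[].id'); do
      fly volumes snapshots list "$vol" -a your-app
    done

    # 3. Fork into a completely different region, to dodge a possibly-bad host
    fly volumes fork vol_1234567890 -a your-app --region ams

If any snapshot does turn up, fly volumes create new-vol --snapshot-id <id> should be able to restore from it.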

Hope this helps a little!

Thanks for the tips! Unfortunately, nothing panned out for me :frowning:
