Recovery options for bad volumes?

Hello! I recently ran into an issue with one of my apps - I think a problem with the underlying machine and/or volume is responsible, and I’d like to find out what my options are (if any) in terms of data recovery.

The symptoms I ran into were:

  • Requests to the application were returning 504s.
  • fly deploy failed to complete due to timeouts (it printed repeated 408 errors). I ran fly machine destroy and retried the deploy to force creation of a new Machine, but it hit the same problem, leaving Machines stuck in a created state.
  • Creating a Machine by hand with fly machine run node:22.10.0-slim --shell -v $VOL_ID:/data also times out, with the same “machine stuck in created” behavior. Running fly machine run node:22.10.0-slim --shell without attaching a volume, or attaching a brand-new volume I’d just created, does work.
  • There are multiple other volumes for the app stuck in an enabling_remote_export state - I’m guessing these correspond to snapshot creation.
  • The volume doesn’t seem to have any snapshots (checked roughly as sketched below this list) - it looks like the automatic snapshotting will still delete old snapshots even when new ones aren’t getting successfully created. :grimacing:
  • Forking the volume did create a new volume, but it’s stuck in an enabling_remote_export state.
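
For reference, here’s roughly how I was checking volume state and snapshots (the app name and volume ID are placeholders, not my real ones):

    # Show a volume’s current state (the stuck ones report enabling_remote_export here)
    fly volumes show vol_1234567890 -a my-app

    # List a volume’s snapshots - for my data volume this came back empty
    fly volumes snapshots list vol_1234567890 -a my-app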

I fortunately have a three-month-old backup, so not all is lost - sadly, my backup process ran on a scheduled Fly Machine that decided to just…stop running its daily schedule three months ago.
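
For context, the backup job was just a daily scheduled Machine along these lines (the image name, app name, and volume ID are illustrative, not my exact setup):

    # A Machine created with --schedule daily is restarted by Fly roughly once a day;
    # mine silently stopped being rescheduled three months ago.
    # (my-backup-image is a placeholder for an image whose entrypoint tars up
    #  /data and ships the archive somewhere off-Fly)
    fly machine run my-backup-image \
      -a my-app \
      --schedule daily \
      -v vol_1234567890:/data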

I did read the warning about using volumes, and I’m wishing I’d had better monitoring (guess what I set up yesterday!), but I’m hoping folks here can give me some ideas on anything I could try to recover my data. Thanks!

Hm… It looks like you’ve tried all the classics already, :adhesive_bandage:… The only other ideas that really come to mind are:

  • Try fly vol list --all to make sure that the snapshots weren’t tied to a now-deleted volume. (This can happen when a Machine is auto-migrated, if I understand correctly.)
  • Check all volumes for snapshots, more broadly, if you haven’t already. Sometimes an individual physical host machine gets glitchy, but the others in your cluster are ok.
  • Fork to a completely different region. A few regions have been having capacity crunches lately, so it may be that your attempted forks keep landing on the same shaky server. (Rough versions of all three ideas are sketched below.)
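
Untested and off the top of my head - the app name, volume ID, and region are placeholders, and I’m assuming fork accepts --region (check fly volumes fork --help):

    # 1. Include destroyed volumes, in case the snapshots were tied to one
    fly volumes list -a your-app --all

    # 2. Check every volume for surviving snapshots
    #    (assumes the --json output is an array of objects with an "id" field)
    for vol in $(fly volumes list -a your-app --all --json | jq -r '.[].id'); do
      fly volumes snapshots list "$vol" -a your-app
    done

    # 3. Fork into a completely different region, to dodge a possibly-bad host
    fly volumes fork vol_1234567890 -a your-app --region ams

If any snapshot does turn up, fly volumes create new-vol --snapshot-id <id> should be able to restore from it.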

Hope this helps a little!

Thanks for the tips! Unfortunately, nothing panned out for me :frowning:
