Can't load site anymore

I had an instance running for months and now the site won’t load anymore. The certificates are still valid for another 2 days, the volume is not full, and the metrics are showing nothing. I tried restarting the instance, but that stalled too with no information. Logs show nothing.

What can I do?
Thanks

I tracked down the issue: your volume was on a host that went down hard a few weeks ago. We migrated all volumes from that host, but it appears yours fell through the cracks.

If a volume doesn’t exist on any host, our scheduler can’t place your app.

I’m consulting with the team on what our options are here. We should have a backup of your volume.

To get back up more quickly, if your data can be re-created easily, I’d recommend creating a new volume for your app. Placement should then be possible and your app will come back up. Sorry if this sounds insane. We value the data our users host in volumes; I’m just giving you a quick workaround in case your data turns out to be disposable (it is for some of our users).

Thank you, I do need my data back though, I don’t have any other backups. When do you think I should receive that?

I checked with the team, and that drive failure happened over a month ago (Fly.io Status - YYZ - application host failure), which means your backup has been rotated out by now.

Unfortunately, we can’t recover the data.

If it’s not data you can reproduce, it is safer to have redundancy in your setup (at least 2 volumes, ideally in different regions).

This is extremely disappointing and highly unprofessional. I would’ve expected an email or some other notification that drives linked to my app had gone down. I will be moving off of Fly because of this lack of communication and disregard for the importance of users’ data.

I’m sorry to hear that.

That’s definitely something we need to work on. We have a status page, but it doesn’t include personalized information, and not everybody subscribes to its updates.

We care a lot about our users’ data. Unfortunately, your volume fell through the cracks: when the server crashed and we migrated all volumes off it, the migration of yours did not complete successfully. Volumes are only as resilient as the drive they’re on, which is a physical NVMe drive on a specific host, and we do not offer redundancy automatically. Users are expected to implement their own redundancy. Often that means running a PostgreSQL cluster with at least 2 nodes; other times it’s a SQLite database replicated with Litestream or LiteFS. This is harder with things that aren’t databases or don’t have built-in replication capabilities.
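For anyone reading later with a SQLite database on a single volume, here is a minimal sketch of a periodic snapshot, assuming Python is available in the app image. It is not Litestream or LiteFS and does not replicate continuously; it only writes a consistent, timestamped copy using the standard library's online backup API. The function name `snapshot_sqlite` and the paths are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def snapshot_sqlite(db_path: str, backup_dir: str) -> Path:
    """Write a consistent, timestamped copy of a SQLite database.

    Hypothetical helper for illustration: db_path and backup_dir are
    whatever paths your app uses; nothing here is Fly-specific.
    """
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # UTC timestamp (with microseconds) keeps snapshot names unique and sortable.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = target_dir / f"{Path(db_path).stem}-{stamp}.sqlite3"

    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(str(target))
    try:
        # Connection.backup uses SQLite's online backup API, so the copy is
        # consistent even if the source database is being written to.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
    return target
```

A snapshot only helps if it ends up somewhere other than the volume it came from, so the resulting file would still need to be copied off the host (for example to object storage or to a machine in another region) by whatever scheduler runs this.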

Still, I understand your decision. We’re going to adjust the next time this happens.

@jerome, thank you for providing the detailed information about the volume resiliency. It would be cool to frame the expectations somewhere in the documentation, and maybe to have an option for an automatic volume redundancy (for extra money).

Running periodic backups is one thing, but having built-in volume redundancy would allow customers to quickly and automatically recover in case of a failure.

Just FYI, Azure offers several storage redundancy options, including:

  • LRS - locally redundant storage: three copies within a single data center
  • ZRS - zone-redundant storage: copies spread across three availability zones in the same region
  • GRS - geo-redundant storage: like LRS, plus a copy in a second region

The data redundancy mode is just a knob in the storage settings. Customers can select any storage redundancy according to their business needs.

Yes, it's not the easiest feature to implement, but maybe something to consider in the future using something like Ceph.

Different kinds of redundancy for volumes are in our mid-to-long-term plans. It’s very expensive :smiley:

This is as much info as we have right now since we haven’t been spending a lot of time on volumes as of late (not since backups, restores and rewriting a lot of code to make the whole process more resilient).

These events that result in the decommissioning of hosts are a cause of deployment failures for us, due to zombie machines lurking around seemingly forever. I hope there’s work going on to address this issue because, from what I observe, in a fleet as large as Fly’s such events are going to be the norm (for instance, over the past few months I have had to deal with zombies at least once a month). The debt always comes due; whether anyone chooses to believe in it or not is immaterial.

Re: Volumes: Apart from retaining backups for longer than a month, could you please let users export backups in the interim? See: flyctl import / export volume snapshots · Issue #1296 · superfly/flyctl · GitHub

And, thanks for the thoughtful and reassuring responses above. Appreciate it