EZE host down for 11+ hours - app & DB completely offline

Hi,
My app and database on Fly.io (EZE) have been offline for over 11 hours due to a hardware failure on the host. The status page says it may take a while to fix, but there’s no ETA.
I’ve read the guide on moving to another host/region, but in my case the DB is self-managed and I’m struggling to restore the volume or snapshot in another region to get out of downtime.

Has anyone gone through something similar and can point me in the right direction?
Also, any update from Fly on when this will be resolved, or is there a more direct contact for urgent issues? We really need more info to plan next steps.

Hi… Sorry to hear you’ve been having trouble… If it was a single-Machine database, then the following is generally the resolution:

https://fly.io/docs/postgres/managing/backup-and-restore/#restoring-from-a-snapshot

(Many people have used it successfully in the past.)

Otherwise, it’s best to restore from one of the surviving replicas, using the volume-forking technique (with explicit volume ID).
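For anyone following along, here is a rough sketch of how the volume-forking route might look from the CLI. The app name, volume ID, and new app name are hypothetical placeholders, and the `run` helper is just an illustration that prints each command instead of executing it. It assumes the volume ID you pass belongs to a replica on a host that is still up.

```shell
#!/bin/sh
# Dry-run helper: prints each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical placeholders; substitute your own values.
SRC_APP="${SRC_APP:-my-db}"             # existing Postgres app
SRC_VOLUME="${SRC_VOLUME:-vol_XXXXXXXX}" # volume of a surviving replica
NEW_APP="${NEW_APP:-my-db-restored}"     # name for the new cluster

# List the app's volumes, then the snapshots of the chosen volume,
# so the explicit IDs can be copied from the output.
list_recovery_points() {
  run fly volumes list -a "$SRC_APP"
  run fly volumes snapshots list "$SRC_VOLUME"
}

# Create a new cluster forked from one explicit volume.
fork_from_replica() {
  run fly postgres create --fork-from "$SRC_APP:$SRC_VOLUME" -n "$NEW_APP"
}

list_recovery_points
fork_from_replica
```

Run it as-is to review the commands, then set `DRY_RUN=0` once the IDs look right.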

Or, if you did attempt one of those already, it would be prudent to post the exact command that you tried as well as the full error message that you got in return. The </> button in the toolbar can be used to get an area suitable for pasting code, terminal output, etc.

Hope this helps!


Aside: This recent thread on eze specifically might also turn out to be relevant:

https://community.fly.io/t/failed-to-create-volume-no-capacity-available-in-eze/25509

Hello, first of all, thank you very much for your prompt response.

I just tried both options (I had already tried other things, but those two links didn’t work).

Unfortunately, I’m getting a timeout on both (I’ve attached the response), even when the new DB is created in another region. Do you think they’ll fix this at some point? Or should I give up on my DB and my app?

fly postgres create --snapshot-id vs_V7o3B8D986bbujl0pmoXXXXX --image-ref postgres:15

Unmanaged Fly Postgres is not supported by Fly.io Support and users are responsible for operations, management, and disaster recovery. If you’d like a managed, supported solution, try ‘fly mpg’ (Managed Postgres).
Please visit  for more information about Managed Postgres.

Machine 7811076c53d008 is created
==> Monitoring health checks
  Waiting for 7811076c53d008 to become healthy (stopped, 0/3)
Error: context deadline exceeded

And the fork attempt:

fly pg create --initial-cluster-size 3 --fork-from abogalia-db:vol_4m8zjy6p0ojXXXXX -n abogalia6

Provisioning 1 of 1 machines with image flyio/postgres-flex:17.2@sha256:f4301ae20d193ab3c3539eb9df9a8f8d3736dd331aeec1bfb2e34b539dc353c5
Waiting for machine to start…
Error: timeout reached waiting for machine’s state to change
The machine e82d930b444438 took more than 5m0s to reach “started”

I know it was my mistake for not having the DB in another region, or for not having it managed by Fly.io, but unfortunately, without my DB and my users’ information, my startup is finished.
Is there any other solution to download the snapshot and set up a Postgres application elsewhere (GCP, AWS, or even my local server), at least so my users can use the app again?

you can add paid support to email us directly. It won’t get the host back up quicker (our infrastructure team is already doing all they can), but we will be able to help with using the snapshot to restore your database.

No, if the snapshot still exists then you’ll typically be able to get the data back, except for the most recent ~24 hours, one way or another.

(It looks like you may have just had a typo in the --image-ref in your attempt above.)
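One possible reading of that remark, sketched below: the fork output earlier in the thread shows the Postgres image being pulled as `flyio/postgres-flex:<tag>`, whereas the snapshot attempt passed `postgres:15`. Assuming that repository is the intended one, the restore might look like this. The snapshot ID and the `15.x` tag are hypothetical placeholders (pick a real tag matching the Postgres major version of the snapshot), and the `run` helper only prints the command by default.

```shell
#!/bin/sh
# Dry-run helper: prints each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical placeholders; substitute your own values.
SNAPSHOT_ID="${SNAPSHOT_ID:-vs_XXXXXXXX}"
# Assumption: "15.x" stands in for a real tag; it is not itself a valid tag.
PG_TAG="${PG_TAG:-15.x}"
# Assumption: same image repository as in the fork output above.
IMAGE_REF="${IMAGE_REF:-flyio/postgres-flex:${PG_TAG}}"

# Restore a new Postgres app from the snapshot using the chosen image.
restore_from_snapshot() {
  run fly postgres create --snapshot-id "$SNAPSHOT_ID" --image-ref "$IMAGE_REF"
}

restore_from_snapshot
```

If the restore still times out with a valid image reference, the full output is worth posting, since the cause is then something other than the image.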

@lillian’s suggestion is best if you have an actual business running on Fly.io, though…

Hi @lillian, thank you so much for responding. This is the first response from someone on the fly.io team in 12 hours. At least it calms me down a bit to know that they ARE reviewing it.
Do you know if they have an ETA?

On the other hand, after more than 12 hours of being offline, it doesn’t seem right to me to have to pay for an answer or help, but I’ll do it. My only concern is that, even though it’s urgent and not my fault, they might take up to 36 hours to respond. Why would I pay $30 and then have to wait 36 hours for a response (not to mention that we’ll probably have to ping-pong the answers)? It’ll feel like forever.

@lillian @mayailurus

It’s funny, I’m trying to sign up for paid support and it fails with these errors.

This product is not available at this time. Please contact billing@fly.io

We had trouble setting up your product. Please contact billing@fly.io.

I want to cry!

Hrm… This really is a string of bad luck…

(Definitely email that address, though.)

Same here,
I’m trying to redeploy the instances as they suggested. For some reason it allows me to do so in one project, but in the other I get this CLI error:

Error: returned error 500: {"data":{},"errors":[{"message":"You hit a Fly API error with request ID: 01K2N0MXGD5CH35CVNP0PCYH79-iad","extensions":{"code":"SERVER_ERROR","fly_request_id":"01K2N0MXGD5CH35CVNP0PCYH79-iad"}}]}

Hi @Rolon, in the project where it did let you, how did you do it?

That’s the frustrating part: I didn’t do anything differently. I created a new volume from a snapshot, force-killed the non-responding Fly machine, and redeployed, and it worked without any problem. Later, I did the same with my database project/instance and got the error I mentioned.
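The exact commands weren't posted, so the following is only a guess at what those three steps could look like with flyctl. All names, IDs, and the region are hypothetical placeholders, and the `run` helper prints each command rather than executing it unless told otherwise.

```shell
#!/bin/sh
# Dry-run helper: prints each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical placeholders; substitute your own values.
APP="${APP:-my-app}"
VOLUME_NAME="${VOLUME_NAME:-data}"            # should match the mount source in fly.toml
SNAPSHOT_ID="${SNAPSHOT_ID:-vs_XXXXXXXX}"
REGION="${REGION:-gru}"                       # a region other than the affected one
STUCK_MACHINE="${STUCK_MACHINE:-0000000000000}"

# 1. New volume from the snapshot, 2. remove the unresponsive Machine,
# 3. redeploy so a fresh Machine attaches to the new volume.
recover_app() {
  run fly volumes create "$VOLUME_NAME" --snapshot-id "$SNAPSHOT_ID" --region "$REGION" -a "$APP" &&
  run fly machine destroy "$STUCK_MACHINE" --force -a "$APP" &&
  run fly deploy -a "$APP"
}

recover_app
```

The `&&` chaining stops the sequence at the first failing step, which matters once `DRY_RUN=0` is set and the commands run for real.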

As a general tip, if you say more about what you are attempting, then people have a better chance of noticing a difference or a gotcha that they themselves ran into once. E.g., what is the database, what was the exact command that failed, …?

(Or if you’d like someone to poke around directly in your app settings, etc., then use @lillian’s suggestion. We here in the community forum generally can’t do that.)

Yes, you’re right. Basically, I followed this guide:

Hey, sorry for these poor/misleading error messages. I think the problem was that your organisation didn’t have an associated payment method yet.

It looks like you figured out the solution to get things working as I can see that that org has a support email now.

I’ll push a fix for this odd error message.
