Hi,
My app and database in Fly.io (EZE) have been offline for over 11 hours due to a hardware failure on the host. The status page says it may take a while to fix, but there’s no ETA.
I’ve read the guide on moving to another host/region, but in my case the DB is self-managed and I’m struggling to restore the volume or snapshot in another region to get out of downtime.
Has anyone gone through something similar and can point me in the right direction?
Also, any update from Fly on when this will be resolved, or is there a more direct contact for urgent issues? We really need more info to plan next steps.
(Many people have used it successfully in the past.)
Otherwise, it’s best to restore from one of the surviving replicas, using the volume-forking technique (with explicit volume ID).
Or, if you did attempt one of those already, it would be prudent to post the exact command that you tried as well as the full error message that you got in return. The </> button in the toolbar can be used to get an area suitable for pasting code, terminal output, etc.
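For the volume-forking route mentioned above, here is a minimal dry-run sketch of the two commands involved. It only prints them, so nothing is touched until you copy them out. The app and volume names are placeholders I made up (`my-db`, `vol_XXXXXXXX`, `my-db-restored`), not values from your account; the `app:volume` form of `--fork-from` is the "explicit volume ID" variant.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the flyctl commands for forking an unmanaged Postgres
# volume instead of running them. SRC_APP, SRC_VOLUME and NEW_APP are
# placeholders (assumptions); substitute your own values.
set -euo pipefail

# Print the command that lists the source app's volumes, so you can pick
# the ID of a volume that is NOT on the failed host.
list_volumes_cmd() {
  printf 'fly volumes list -a %s\n' "$1"
}

# Print the fork command. The explicit "app:volume" form pins the fork to
# one specific source volume.
fork_cmd() {
  local src_app="$1" src_volume="$2" new_app="$3"
  printf 'fly pg create --fork-from %s:%s -n %s\n' "$src_app" "$src_volume" "$new_app"
}

list_volumes_cmd "${SRC_APP:-my-db}"
fork_cmd "${SRC_APP:-my-db}" "${SRC_VOLUME:-vol_XXXXXXXX}" "${NEW_APP:-my-db-restored}"
```

Run it as-is to see the two commands, or set `SRC_APP`, `SRC_VOLUME` and `NEW_APP` in the environment first. Whether the fork succeeds still depends on the source volume being reachable.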
Hope this helps!
Aside: This recent thread on eze specifically might also turn out to be relevant:
Hello, first of all, thank you very much for your prompt response.
I just tried both options (I had already tried other things, but those two links didn’t work).
Unfortunately, I’m getting a timeout on both, even when the new DB is created in another region (I’ve attached the responses). Do you think they’ll fix this at some point? Or should I give up on my DB and my app?
fly postgres create --snapshot-id vs_V7o3B8D986bbujl0pmoXXXXX --image-ref postgres:15
Unmanaged Fly Postgres is not supported by Fly.io Support and users are responsible for operations, management, and disaster recovery. If you’d like a managed, supported solution, try ‘fly mpg’ (Managed Postgres). Please visit for more information about Managed Postgres.
Machine 7811076c53d008 is created
==> Monitoring health checks
Waiting for 7811076c53d008 to become healthy (stopped, 0/3)
Error: context deadline exceeded
And the fork attempt:
fly pg create --initial-cluster-size 3 --fork-from abogalia-db:vol_4m8zjy6p0ojXXXXX -n abogalia6
Provisioning 1 of 1 machines with image flyio/postgres-flex:17.2@sha256:f4301ae20d193ab3c3539eb9df9a8f8d3736dd331aeec1bfb2e34b539dc353c5
Waiting for machine to start…
Error: timeout reached waiting for machine’s state to change
The machine e82d930b444438 took more than 5m0s to reach “started”
I know it was my mistake for not having the DB in another region, or for not having it managed by Fly.io, but unfortunately, without my DB and my users’ information, my startup is finished.
Is there any other solution to download the snapshot and set up a Postgres application elsewhere (GCP, AWS, or even my local server), at least so my users can use the app again?
You can add paid support to email us directly. It won’t get the host back up quicker (our infrastructure team is already doing all they can), but we will be able to help with using the snapshot to restore your database.
Hi @lillian, thank you so much for responding. This is the first response from someone on the fly.io team in 12 hours. At least it calms me down a bit to know that they ARE reviewing it.
Do you know if they have an ETA?
On the other hand, after more than 12 hours of being offline, it doesn’t seem right to me to have to pay for an answer or help, but I’ll do it. My only question, since it’s urgent and not my fault, is whether they would respond in less than 36 hours. Why would I pay $30 and then have to wait 36 hours for a response (not to mention that we’ll probably have to ping-pong the answers)? It’ll feel like forever.
Same here,
I’m trying to redeploy the instances as they suggested. For some reason it allows me to do so in one project, but in the other I get this CLI error:
Error: returned error 500: {"data":{},"errors":[{"message":"You hit a Fly API error with request ID: 01K2N0MXGD5CH35CVNP0PCYH79-iad","extensions":{"code":"SERVER_ERROR","fly_request_id":"01K2N0MXGD5CH35CVNP0PCYH79-iad"}}]}
That’s the frustrating part: I didn’t do anything differently. I created a new volume from a snapshot, force-killed the non-responding Fly machine, and redeployed, and it worked without any problem. Later, I did the same with my database project/instance and got the error I mentioned.
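For anyone comparing steps, that sequence can be sketched roughly as below. This is a dry run that only prints the commands, and it is a reconstruction under assumptions rather than the exact commands that were run: every name and ID (`my-app`, `gru`, `data`, `vs_XXXXXXXX`, the machine ID) is hypothetical, `fly machine destroy --force` is just one way to "force-kill" a machine, and the volume name is assumed to match the mount source in `fly.toml`.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print a snapshot-restore sequence instead of executing it.
# All names and IDs below are hypothetical placeholders (assumptions).
set -euo pipefail

restore_plan() {
  local app="$1" region="$2" volume_name="$3" snapshot_id="$4" machine_id="$5"
  # 1. Create a new volume in a healthy region, populated from the snapshot.
  printf 'fly volumes create %s --snapshot-id %s --region %s -a %s\n' \
    "$volume_name" "$snapshot_id" "$region" "$app"
  # 2. Force-remove the machine that is stuck on the failed host.
  printf 'fly machine destroy %s --force -a %s\n' "$machine_id" "$app"
  # 3. Redeploy so that a new machine is created and can mount the new volume.
  printf 'fly deploy -a %s\n' "$app"
}

restore_plan "${APP:-my-app}" "${REGION:-gru}" "${VOLUME_NAME:-data}" \
  "${SNAPSHOT_ID:-vs_XXXXXXXX}" "${MACHINE_ID:-0000000000000}"
```

Printing first makes it easy to diff the plan for the project that worked against the one that returned the 500 error.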
As a general tip, if you say more about what you are attempting, then people have a better chance of noticing a difference or a gotcha that they themselves ran into once. E.g., what is the database, what was the exact command that failed, …?
(Or if you’d like someone to poke around directly in your app settings, etc., then use @lillian’s suggestion. We here in the community forum generally can’t do that.)