fly pg suspended - no machine found?

Hello
I had a long-running service: the app and the pg database.
It appears the pg ran out of space and crashed, based on the email.

I’m trying to restart it, but I can’t find the machine name.
The pg is marked as “suspended” and all IPs are assigned, but there are no machines listed under machines.
Where did it go? How can I get it up and running?

a) Either fly removed the machine or somehow deleted it, or there is a bug and it’s hidden somewhere? (Could someone recover it?)
or
b) The default pg instance doesn’t list its machine under the machines tab, and there is a different way to get the machine id since it’s a PostgreSQL database?

Thanks
Lucas

Can you share the app name?

Hello,
The app name is:
wwwhww

Thanks
Lucas

Hi @lucasmanual ! I’ve been checking what happened with your Postgres instance.

Unfortunately, your application was allocated to a host that had hardware issues. We tried to recover the host, but that proved impossible, and it was finally decommissioned on May 3rd.

Based on what I can see in our records, it looks like your Postgres database was deployed as a single instance. This means the only copy of your data was in the volume attached to the host that was decommissioned. I’m afraid that, at that time, volume snapshots were retained for only 5 days, so they are no longer available.

I hope you have backups of your data that can be restored into a new Postgres instance.

I’d recommend creating the new cluster using one of the Production (High Availability) options, so that your data can survive a host failure.

Kind regards,
Andrés.

Hello
Can you say more about “volume snapshots retained for 5 days”?

a) Assuming the host had a permanent hardware problem, wouldn’t it be standard procedure to move the “image” to a different instance that doesn’t have a hardware problem and restart it? (I assume that image can be run elsewhere?)

b) If an instance dies due to hardware failure, wouldn’t it make more sense to keep it for 30 days, or to keep it indefinitely and charge customers storage fees for the non-attached instance volume?

c) More hypothetical, but I assume the same thing can happen with the docker image of my web application, correct? Since I only run 1 instance of it?

d) Same hypothetical: assuming I have replication turned on and a total of 2 instances, you are saying you can’t guarantee uptime if both instances have hardware problems?

Thanks for clarifying.
Lucas

Hi @lucasmanual ! Sure, let me try to answer your questions below.

Fly Postgres is not managed Postgres. We take daily snapshots of volumes, and this is also the case for Fly Postgres volumes. At the time of the incident, each snapshot was retained for 5 days and then discarded. Because of this, once the server had been stopped for 5 days, all of its snapshots had been discarded.
This changed recently, as you can see in this Fresh Produce post.
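To make the retention window concrete, here is a minimal Python sketch of the timeline described above. It is only an illustration, not how the platform is implemented: the function name `remaining_snapshots`, the dates, and the rule that a snapshot taken on day d is kept for exactly `retention_days` days are all assumptions, and it assumes the volume had been running long enough to have a snapshot for every day in the window.

```python
from datetime import date, timedelta


def remaining_snapshots(stopped_on: date, today: date, retention_days: int) -> list[date]:
    """Return the snapshot dates still retained on `today`.

    Assumptions (hypothetical model, not the platform's actual logic):
    - one snapshot is taken per day, up to and including `stopped_on`;
    - a snapshot is discarded once it is `retention_days` days old.
    """
    kept = []
    for age in range(retention_days):
        taken = today - timedelta(days=age)
        # No new snapshots are taken after the machine stops.
        if taken <= stopped_on:
            kept.append(taken)
    return kept


# Hypothetical dates, chosen only for illustration.
stopped = date(2024, 4, 1)
for day in (date(2024, 4, 3), date(2024, 4, 5), date(2024, 4, 6)):
    left = remaining_snapshots(stopped, day, retention_days=5)
    print(day, "snapshots left with 5-day retention:", len(left))

# With a longer retention the same stopped volume still has snapshots later on.
print(len(remaining_snapshots(stopped, date(2024, 4, 6), retention_days=30)))
```

Under this model, a volume that stops producing snapshots has nothing left to restore from once the retention period has passed, which is why a longer retention gives more time to react.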

Regarding a): the platform makes it easy to start new Machines on other hosts, but Fly Volumes are attached to the physical host and their data is not replicated elsewhere. In this regard, Fly differs from providers whose storage is traditionally network-attached.

Regarding b): this makes a lot of sense and, because of these kinds of incidents, we built it. You can now specify --snapshot-retention for your snapshots.

Regarding c) and d): we recommend running multiple instances of your application, ideally in different regions, to increase availability.
That said, hardware fails and, even though it is unlikely, if you have two instances of your application running on different hosts and both hosts fail at the exact same time, we cannot guarantee uptime.
Multiple hosts failing at the exact same point in time is rare. What I’d recommend is that, as soon as you receive an email notification about a host being unavailable, you scale your app up by one more machine right away; this will place the new machine on a different host. Once the host issue is resolved, you can scale back down.
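As a rough illustration of why simultaneous failures are rare, here is a small Python sketch. It assumes host failures are independent and uses a made-up per-host probability of being down in a given time window; both the function `prob_all_hosts_down` and the numbers are hypothetical and are not Fly measurements.

```python
def prob_all_hosts_down(p_host_down: float, hosts: int) -> float:
    """Probability that every host is down in the same time window.

    Assumes failures are independent, which is a simplification:
    hosts sharing a rack, region, or power feed can fail together.
    """
    if not 0.0 <= p_host_down <= 1.0:
        raise ValueError("p_host_down must be between 0 and 1")
    if hosts < 1:
        raise ValueError("hosts must be at least 1")
    return p_host_down ** hosts


# Hypothetical figure: a 1% chance that a given host is down in some window.
p = 0.01
for n in (1, 2, 3):
    print(f"{n} host(s): chance that all are down = {prob_all_hosts_down(p, n):.6f}")
```

Each extra machine on a different host multiplies the chance of a full outage by the per-host failure probability, so the risk shrinks quickly but never reaches zero.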

I hope my answers make sense; please let me know if you need further clarification.
