Database is down. Fly dashboard is down

lukasalvarezdev · May 24, 2024, 5:14pm

I got this error a few hours ago:

Health check for your postgres vm has failed. Your instance has hit resource limits. Upgrading your instance / volume size or reducing your usage might help.

But I can’t even scale the machines. Neither the fly.io console nor the dashboard is working. It’s the second time this has happened in two days. I literally can’t do anything and my whole app is down. Email support takes hours to respond, so I’m hoping someone sees this sooner.

jnankin · May 24, 2024, 5:38pm

We are also experiencing a database outage, and an incident was opened over 20 hours ago. The public status page says everything is green, but our database is completely dead. All we have is email support. Machines are not visible in the console, and our volume says 0 space is used (or it says volumes can’t be displayed). This is highly concerning. Our app is completely unusable.

jnankin · May 24, 2024, 5:39pm

[Screenshot: Screenshot_20240524_123854]

lukasalvarezdev · May 24, 2024, 5:39pm

Update: I followed the guide “Troubleshoot apps when a host is unavailable” (Fly Docs) and created a new volume, but my database is still down.

jnankin · May 24, 2024, 5:41pm

jnankin · May 24, 2024, 5:48pm

Also just saw this from 5h ago: “Hardware failure in LHR two of my DBs down for 24 days as a result”

There’s still no response. Hope we don’t have to wait 24 days too…

john-fly · May 24, 2024, 5:56pm

Hello everyone. A couple of points have come up here, and I’ll try to address them all:

Can you share more about what you’re seeing when you say “Fly dashboard is down”? We don’t have other indicators that that’s the case.
Aside from the possibility that the dashboard is down, the Status Page has not been updated because there is not a problem with the Fly Platform globally or regionally. There is an issue with a single host server in dfw which is what both of you are seeing.
This host has a networking issue which will be fixed shortly.
Because hardware issues can strike unexpectedly, we encourage everyone to run their PG clusters in a high-availability configuration, which means a minimum of three nodes in a cluster. With fewer than that, a single host going offline will knock your cluster offline because it loses quorum; that is what’s happening here.
When the host comes back online, you can add more nodes to your PG cluster with fly machine clone. Bringing your total Machine count up to 3 will allow you to withstand a single-host issue such as this.
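
The three-node recommendation above can be checked with a little arithmetic. This is only a minimal sketch, assuming the cluster needs a simple majority of its nodes to keep quorum; the thread does not spell out the exact rule Fly Postgres uses, and the function names are hypothetical, chosen for illustration.

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority (assumed rule)."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can go offline while a majority still remains."""
    return nodes - quorum_size(nodes)


# Compare the cluster sizes discussed in the thread.
for n in (1, 2, 3):
    print(f"{n} node(s): quorum={quorum_size(n)}, "
          f"can lose {tolerated_failures(n)} host(s)")
```

Under that assumption, one- and two-node clusters cannot lose any host without losing quorum, while a three-node cluster can lose one.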

jnankin · May 24, 2024, 5:59pm

Pretty sure you have screenshots above.

john-fly · May 24, 2024, 6:12pm

Okay, the screenshots above are what the dashboard is currently designed to look like when a host goes offline. I just wanted to make sure that fly.io/dashboard was not itself returning a 500 error code, and that does not appear to be the case.

jnankin · May 24, 2024, 6:12pm

Couple of additional points:

We’d have to pay 3x the price to avoid 20 hours of downtime? Even if a single host goes down, 20 hours to get something back online seems pretty excessive for any modern-day hosting service.
There was not even a way to know if our data still existed, and snapshots were not available either, so in this case we couldn’t bring up a new database from a snapshotted volume. Since we no longer trust that saved data won’t disappear, we now have to resort to downloading snapshots ourselves going forward. Seeing a volume reported as using 0 MB, with no access to snapshots either, is terrifying.

john-fly · May 24, 2024, 6:30pm

Regarding outage time, this host server wasn’t down for 20 hours. It went offline at approximately 2024-05-23T20:50:00Z and was restored to service by 2024-05-23T21:15:00Z. Then it went offline just now at 2024-05-24T16:45:00Z and was restored to service at 2024-05-24T18:00:00Z (evidently there’s some networking trouble on this host). When we began maintenance on the host, we restored the message from yesterday with the begin date of 20 hours ago. We should have set up a new maintenance message; I apologize for the confusion.
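
To put the two windows side by side, here is a minimal sketch that computes their lengths from the timestamps quoted above. The start times are described as approximate, so the results are approximate too; the helper name `parse_utc` and the variable names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Timestamp layout used in the post: ISO 8601 with a trailing "Z" for UTC.
FMT = "%Y-%m-%dT%H:%M:%SZ"


def parse_utc(ts: str) -> datetime:
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string into a timezone-aware UTC datetime."""
    return datetime.strptime(ts, FMT).replace(tzinfo=timezone.utc)


# (offline, restored) pairs as quoted in the post.
OUTAGES = [
    ("2024-05-23T20:50:00Z", "2024-05-23T21:15:00Z"),
    ("2024-05-24T16:45:00Z", "2024-05-24T18:00:00Z"),
]

durations = [parse_utc(end) - parse_utc(start) for start, end in OUTAGES]
total = sum(durations, timedelta())

for (start, end), length in zip(OUTAGES, durations):
    print(f"{start} -> {end}: {length}")
print(f"total offline time: {total}")
```

With those timestamps, the two windows come to about 25 and 75 minutes, or roughly 1 hour 40 minutes in total.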

Regarding the dashboard UI during a host disconnection, that’s a good suggestion; thank you. I will pass along to the design team that the dashboard UI could be clearer when a host is disconnected.

Finally, regarding clustering, the Fly Platform is not a traditional VPS product. The Fly Platform is designed to make it easy to launch multiple VMs, move them around, scale them, etc. If you know you don’t intend to grow beyond a single VM, you will probably be better served at this time by a more traditional VPS provider.

system · May 31, 2024, 6:31pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
