Health check for your postgres vm has failed. Your instance has hit resource limits. Upgrading your instance / volume size or reducing your usage might help.
But I can’t even scale the machines. Neither the fly.io console nor the dashboard is working. This is the second time it has happened in two days. I literally can’t do anything and my whole app is down. Email support takes hours to respond, so I’m hoping someone sees this sooner.
We are also experiencing a database outage, and an incident was opened over 20 hours ago. The public status page says everything is green, but our database is completely dead. All we have is email support. Machines are not visible in the console, and our volume shows 0 space used (or the dashboard says volumes can’t be displayed). This is highly concerning. Our app is completely unusable.
Hello everyone. A couple of points have come up here, and I’ll try to address them all:
Can you share more about what you’re seeing when you say “Fly dashboard is down”? We don’t have other indicators that that’s the case.
Aside from the possibility that the dashboard is down, the Status Page has not been updated because there is not a problem with the Fly Platform globally or regionally. There is an issue with a single host server in dfw which is what both of you are seeing.
This host has a networking issue which will be fixed shortly.
Because hardware issues can strike unexpectedly, we encourage everyone to run their PG clusters in a high-availability configuration, which means a minimum of three nodes in a cluster. With fewer than that, a single host going offline will knock your cluster offline because it loses quorum; that is what’s happening here.
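To illustrate the quorum point, here is a minimal sketch of the arithmetic. It assumes a simple majority rule (more than half of the configured nodes must be reachable) and that each node sits on a different host; the function names are made up for this example and are not part of any Fly tooling.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, nodes_online: int) -> bool:
    """True if the reachable nodes still form a majority of the cluster."""
    return nodes_online >= quorum_size(cluster_size)


# Lose exactly one host from clusters of one, two, and three nodes.
for size in (1, 2, 3):
    remaining = size - 1
    status = "keeps quorum" if has_quorum(size, remaining) else "loses quorum"
    print(f"{size}-node cluster, 1 host down: {remaining} online, {status}")
```

Under that assumption, a three-node cluster is the smallest one that still has a majority after losing a single host, which is why two nodes are not enough.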
Okay, the screenshots above show what the dashboard is currently designed to look like when a host goes offline. I just wanted to make sure that fly.io/dashboard was not itself returning a 500 error code, and it looks like it is not.
We’d have to pay 3x the price to avoid 20 hours of downtime? Even if a single host goes down, 20 hours to get something back online seems pretty excessive for any modern-day hosting service.
There was not even a way to know whether our data still existed, and snapshots were not available either, so we couldn’t spin up a new database from a snapshotted volume. Since we no longer trust that saved data won’t disappear, we now have to resort to downloading snapshots ourselves going forward. Seeing a volume reported as using 0 MB, with no access to snapshots, is terrifying.
Regarding outage time, this host server wasn’t down for 20 hours. It went offline at approximately 2024-05-23T20:50:00Z and was restored to service by 2024-05-23T21:15:00Z. It then went offline again today at 2024-05-24T16:45:00Z and was restored to service at 2024-05-24T18:00:00Z (evidently there’s some networking trouble on this host). When we began maintenance on the host, we restored yesterday’s message, which still carried its start date of 20 hours earlier. We should have set up a new maintenance message; I apologize for the confusion.
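For anyone who wants to check the durations, here is a quick sketch that computes them from the timestamps quoted above. The times are approximate, so treat the results as rough figures; the helper names are just for illustration.

```python
from datetime import datetime


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' as an aware UTC datetime."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def outage_minutes(start: str, end: str) -> int:
    """Length of an outage window in whole minutes."""
    return int((parse_utc(end) - parse_utc(start)).total_seconds() // 60)


# (offline, restored) pairs as quoted in this post
outages = [
    ("2024-05-23T20:50:00Z", "2024-05-23T21:15:00Z"),
    ("2024-05-24T16:45:00Z", "2024-05-24T18:00:00Z"),
]

durations = [outage_minutes(start, end) for start, end in outages]
total_minutes = sum(durations)

print(durations)       # minutes per outage window
print(total_minutes)   # combined minutes offline
```

With the timestamps as given, that works out to roughly 25 minutes and 75 minutes, or about 100 minutes offline in total.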
Regarding the dashboard UI during a host disconnection, that’s a good suggestion; thank you. I will let the design team know that the dashboard UI could be clearer when a host is disconnected.
Finally, regarding clustering, the Fly Platform is not a traditional VPS product. The Fly Platform is designed to make it easy to launch multiple VMs, move them around, scale them, etc. If you know you don’t intend to grow beyond a single VM, you will probably be better served at this time by a more traditional VPS provider.