I’m facing a critical issue with my Postgres cluster (raiz-production) in the GRU region. My application is completely down due to this.
Current State:
My cluster has 2 healthy replicas, but no primary leader.
The application cannot function as there is no writable primary.
Problem Analysis: The root cause seems to be an internal DNS resolution failure. The machines in the cluster cannot find each other on the internal network, which prevents a leader from being elected and also prevents new machines from being provisioned.
I managed to get this specific error from the logs of one of the replicas (e82d44df264e28):
failed to establish connection to primary: failed to connect to 'host=d891dddce06668.vm.raiz-production.internal...': hostname resolving error [lookup d891dddce06668.vm.raiz-production.internal on [fdaa::3]:53: no such host]
History: This started with an unhealthy primary. I have already destroyed the unhealthy machine, but the cluster never recovered because of this underlying networking issue. All attempts to manually failover or clone new machines are failing.
Could you please investigate the internal networking and DNS resolution for my app raiz-production in the GRU region? I’m completely stuck.
DNS isn’t failing; it legitimately can’t resolve the address of your primary machine because you’ve destroyed it. Postgres-flex needs a 3-node cluster in order to fail over. If you’ve destroyed the primary machine, the cluster won’t be able to elect a leader and your remaining nodes will keep looking for the old primary.
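If you want to confirm that for yourself, you can run a lookup from inside one of the surviving replicas: a live machine’s hostname should resolve, while the destroyed primary’s won’t. A minimal sketch, assuming getent is available in the image (dig or nslookup would show the same thing), using the machine IDs from your post:

    # open a shell on one of the healthy replicas (-s lets you pick the machine)
    fly ssh console -a raiz-production -s

    # inside the machine: a surviving replica's hostname should resolve...
    getent hosts e82d44df264e28.vm.raiz-production.internal

    # ...while the destroyed primary's hostname returns nothing
    getent hosts d891dddce06668.vm.raiz-production.internal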
Destroying an active primary is a fairly strong measure. What issues were you seeing on the primary machine that led you to that?
If the replica nodes are caught up, you might still be able to recover the cluster by manually promoting one of them. However, it’s entirely possible that destroying the live primary left the replicas in an inconsistent state.
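Very roughly, a manual promotion would look like the sketch below. This assumes the standard postgres-flex image layout; the /data/repmgr.conf path and the postgres user are assumptions, so adjust them to whatever your image actually uses, and only promote if the cluster listing shows the replica as attached and caught up.

    # open a shell on the replica you want to promote
    fly ssh console -a raiz-production -s

    # inside the machine: check what repmgr thinks the cluster looks like
    # (config path assumed from the postgres-flex image layout)
    su postgres -c "repmgr -f /data/repmgr.conf cluster show"

    # if the replica looks healthy and caught up, promote it to primary
    su postgres -c "repmgr -f /data/repmgr.conf standby promote"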
If you haven’t deleted the volume from your former primary, my recommendation is to fork a new cluster from that volume. If you have destroyed it, restoring from a backup or volume snapshot may be the easiest path, provided you have a recent one. The guide is here: Fork a volume from a Postgres app · Fly Docs
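For reference, the fork path is roughly the sketch below; the exact --fork-from syntax is from memory, so double-check it against the linked guide, and <vol_id> is a placeholder for whichever volume the listing shows was attached to the old primary.

    # see whether the old primary's volume is still around
    fly volumes list -a raiz-production

    # spin up a new Postgres cluster forked from that volume
    fly postgres create --fork-from raiz-production:<vol_id>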
You’re right that destroying the primary was a strong measure, but we were in a critical situation with the application completely down.
The primary machine (d8d9779fe36748) was consistently in a critical state, with only 1/3 or 2/3 health checks passing. The logs were showing vm critical: context deadline exceeded (Client.Timeout exceeded while awaiting headers). This was causing the entire application to hang on loading screens.
We tried restarting and failing over, but the cluster was unresponsive. The primary was so unstable that it blocked any recovery action. Destroying it was a last resort to break the stalemate.
I’m now checking if the volume from the old primary still exists, as you suggested. I will post the output of fly volumes list here shortly.