Unhealthy DB cluster, multiple zombie nodes

I have an HA Postgres cluster in the IAD region. I would have assumed that if anything went wrong with a node, the cluster could elect a new primary.

Overnight, though, the cluster became unhealthy: it has 2 zombie nodes in the primary region.
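
For reference, I've been watching the cluster with the standard machine and health-check listings (both are stock flyctl commands; the zombie nodes just sit there with failing checks):

fly status --app my-app
fly checks list --app my-app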

I’ve tried restarting them, with no luck; the operation just times out:

flyctl machine restart machine-id --app my-app

I’ve tried doing a failover:

fly postgres failover --app my-app

But I get this error:

Error: no active leader found

I was then going to bring up a whole new cluster to replace it, but I can’t do a DB dump because the node is not responsive:

fly proxy 15432:5432 --app my-app

then in another window:

pg_dump postgres://postgres:$DB_PASS@localhost:15432/my_db --verbose --format=custom > ./latest.dump

But this fails with:

pg_dump: error: connection to server at "localhost" (::1), port 15432 failed: Connection refused

I’ve used this method successfully many times in the past.

How can I get my DB cluster back to a healthy state? Has anyone been here before who can help?


Not sure why it got into this state, no CPU/Disk issues, very confusing.

This happened to me too, after the last incident (Fly.io Status - Elevated errors and connectivity problems).
The primary became a zombie; the other one is a replica.
Waiting for a solution…


OK that timing lines up perfectly. Thanks for showing me this!

That incident says it’s resolved, so I’ve got a support ticket open with fly.io … really hoping to hear from them soon :frowning:


I experienced the same thing yesterday, 4/23. I was able to get both of my pg machines back up with a restart. However, every time they scaled to zero they would not start back up successfully because of a zombie.lock. For now I've disabled scaling to zero (command below) and they have stayed healthy.
All that to say, I am also experiencing this issue as of the morning of 4/23.
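
In case it's useful, disabling scale to zero was just a flag on each pg machine. I'm using the --autostop flag on fly machine update here; the flag name and accepted values can differ between flyctl versions, so check fly machine update --help first:

fly machine update <pg-machine-id> --autostop=off --app my-app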


I got my cluster healthy again by doing the following:

  1. Creating 2 new replicas in the primary region

fly machine clone healthy-machine-id --region iad --app my-app

  2. Force destroying the 2 zombie nodes (this leaves the data intact on a detached volume)

fly m destroy --force zombie-machine-id --app my-app

  3. Attaching new instances to the detached volumes

fly machine clone healthy-machine-id --region iad --app my-app --attach-volume zombie-volume-id

These zombie machines came back up but were still in the zombie state; at some point in this process, though, a leader election took place and one of the new nodes I’d created was elected as the primary.
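
To confirm which node had become the primary, the role shows up in the per-machine health checks (exact check names depend on the postgres-flex image version):

fly checks list --app my-app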

I ended up force destroying the zombies again, but I’ve been left with a healthy cluster now.

I hope this helps others in the same situation.


Really appreciate you sharing your solution. Unfortunately I had to take a couple of extra steps and documented them here, hope it helps someone: Here's how to fix an unreachable (2 Zombie, 1 Replica) HA Postgres Cluster


Here is the response from fly.io support for this issue. It might help others:


For zombie locked clusters, the appropriate troubleshooting steps would be roughly:

  • Find last elected primary node

  • Remove all machines other than previous primary

  • SSH in and remove the old nodes from repmgr (as described here)

  • SSH in and remove the zombie lock files from the machine (rough sketch after this list)

  • Restart the node

  • If healthy, scale cluster back up
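
To make the SSH and restart steps concrete, this is roughly what they translate to on the CLI (my own sketch, not support's wording; the /data/zombie.lock path comes from the fencing doc linked below, while the repmgr config path and unregister flags are my assumptions, so double-check them against your image before running anything):

fly ssh console --app my-app --machine <previous-primary-machine-id>
# inside the machine:
su postgres -c "repmgr -f /data/repmgr.conf cluster show"
su postgres -c "repmgr -f /data/repmgr.conf standby unregister --node-id=<stale-node-id>"
rm /data/zombie.lock
exit
fly machine restart <previous-primary-machine-id> --app my-app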

It’s unlikely failover/restart will work when the cluster is in any sort of bad state or zombie lock situation.

In general the zombie.lock occurs when the member has been fenced as Flex is unable to confirm that the booting/running primary is the actual primary. This can happen in the case of a network partition, when the nodes lose contact with each other. We have some info about it here: postgres-flex/docs/fencing.md at master · fly-apps/postgres-flex · GitHub

If you would prefer to migrate to a new cluster, you could also try restoring your postgres using a volume fork. You can do so using fly pg create --fork-from, specifying the volume ID of the previous primary.
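
For the fork route, the invocation would look something like this (the app name, region, and volume ID are placeholders, and I'm going off the note above for the --fork-from syntax, so verify against fly postgres create --help):

fly postgres create --name my-app-restored --region iad --fork-from my-app:<previous-primary-volume-id>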

