Unhealthy DB cluster, multiple zombie nodes

I have an HA Postgres cluster in the IAD region. I would have assumed that if anything went wrong with a node, the cluster could elect a new primary.

Overnight, though, the cluster became unhealthy: it has 2 zombie nodes in the primary region.
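
For reference, I've been watching the cluster with the standard machine and health-check listings (both are stock flyctl commands; the zombie nodes just sit there with failing checks):

fly status --app my-app
fly checks list --app my-app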

I’ve tried restarting them, with no luck; the operation just times out:

flyctl machine restart machine-id --app my-app

I’ve tried doing a failover:

fly postgres failover --app my-app

But I get this error:

Error: no active leader found

I was then going to bring up a whole new cluster to replace it, but I can’t do a DB dump because the node is not responsive:

fly proxy 15432:5432 --app my-app

then in another window:

pg_dump postgres://postgres:$DB_PASS@localhost:15432/my_db --verbose --format=custom > ./latest.dump

But this fails with:

pg_dump: error: connection to server at "localhost" (::1), port 15432 failed: Connection refused

I’ve used this method successfully many times in the past.

How can I get my DB cluster back to a healthy state? Has anyone been here before who can help?


Not sure why it got into this state, no CPU/Disk issues, very confusing.

This happened to me too, after the last incident (Fly.io Status - Elevated errors and connectivity problems).
The primary became a zombie; the other one is a replica.
Waiting for a solution…


OK that timing lines up perfectly. Thanks for showing me this!

That incident says it’s resolved, so I’ve got a support ticket open with fly.io … really hoping to hear from them soon :frowning:


I experienced the same thing yesterday, 4/23. I was able to get both of my pg machines back up with a restart. However, every time they scaled to zero they would not start back up successfully because of a zombie.lock. For now I've disabled scaling to zero (command below) and they have stayed healthy.
All that to say, I am also experiencing this issue as of the morning of 4/23.
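
In case it's useful, disabling scale to zero was just a flag on each pg machine. I'm using the --autostop flag on fly machine update here; the flag name and accepted values can differ between flyctl versions, so check fly machine update --help first:

fly machine update <pg-machine-id> --autostop=off --app my-app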


I got my cluster healthy again by doing the following:

  1. Creating 2 new replicas in the primary region

fly machine clone healthy-machine-id --region iad --app my-app

  2. Force destroying the 2 zombie nodes (this leaves the data intact on a detached volume)

fly m destroy --force zombie-machine-id --app my-app

  3. Attaching new instances to the detached volumes

fly machine clone healthy-machine-id --region iad --app my-app --attach-volume zombie-volume-id

These zombie machines came back up but were still in the zombie state; at some point in this process, though, a leader election took place and one of the new nodes I’d created was elected as the primary.
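
To confirm which node had become the primary, the role shows up in the per-machine health checks (exact check names depend on the postgres-flex image version):

fly checks list --app my-app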

I ended up force destroying the zombies again, but I’ve been left with a healthy cluster now.

I hope this helps others in the same situation.


Really appreciate you sharing your solution. Unfortunately I had to take a couple of extra steps and documented them here, hope it helps someone: Here's how to fix an unreachable (2 Zombie, 1 Replica) HA Postgres Cluster


Here is the response from fly.io support for this issue. It might help others:


For zombie locked clusters, the appropriate troubleshooting steps would be roughly:

  • Find last elected primary node

  • Remove all machines other than previous primary

  • SSH in and remove the old nodes from repmgr (as described here)

  • SSH in and remove the zombie lock files from the machine (rough sketch after this list)

  • Restart the node

  • If healthy, scale cluster back up
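
To make the SSH and restart steps concrete, this is roughly what they translate to on the CLI (my own sketch, not support's wording; the /data/zombie.lock path comes from the fencing doc linked below, while the repmgr config path and unregister flags are my assumptions, so double-check them against your image before running anything):

fly ssh console --app my-app --machine <previous-primary-machine-id>
# inside the machine:
su postgres -c "repmgr -f /data/repmgr.conf cluster show"
su postgres -c "repmgr -f /data/repmgr.conf standby unregister --node-id=<stale-node-id>"
rm /data/zombie.lock
exit
fly machine restart <previous-primary-machine-id> --app my-app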

It’s unlikely failover/restart will work when the cluster is in any sort of bad state or zombie lock situation.

In general the zombie.lock occurs when the member has been fenced as Flex is unable to confirm that the booting/running primary is the actual primary. This can happen in the case of a network partition, when the nodes lose contact with each other. We have some info about it here: postgres-flex/docs/fencing.md at master · fly-apps/postgres-flex · GitHub

If you would prefer to migrate to a new cluster, you could also try restoring your postgres using a volume fork. You can do so using fly pg create --fork-from, specifying the volume ID of the previous primary.
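
For the fork route, the invocation would look something like this (the app name, region, and volume ID are placeholders, and I'm going off the note above for the --fork-from syntax, so verify against fly postgres create --help):

fly postgres create --name my-app-restored --region iad --fork-from my-app:<previous-primary-volume-id>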

