Here's how to fix an unreachable (2 Zombie, 1 Replica) HA Postgres Cluster

Hi all - due to the recent downtime, my 3-node (…erm HA???) Postgres Cluster became unreachable with 2 Zombie Nodes and 1 Replica Node. With no primary to clone from, I was stuck. I spent an ungodly amount of time trying to figure out how to fix this, and now that I have, I figured I would share my solution for anyone who deals with this in the future…and for me lol.

I’m sad that the touted Fly “Highly Available” solution failed to keep my database highly available, but alas, here we are. If any Fly support or expert repmgr whizzes here have suggestions for improving this, please let me know and I will update. Anyways, let’s commence!

  • Run Fly Status

fly status --app [APP-NAME]

  • Destroy the zombie nodes

fly machine destroy [MACHINE-ID] --force

  • SSH in to the remaining replica node

fly ssh console --app [APP-NAME]

  • Switch user to Postgres

su postgres

  • View the broken cluster

repmgr -f /data/repmgr.conf cluster show

  • Promote your remaining server as Primary

repmgr -f /data/repmgr.conf standby promote

  • Force the Status of this new Node to Primary

repmgr -f /data/repmgr.conf primary register --force

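  • Unregister the old primary and the other standby node (if repmgr reports that the node is not a standby server, as it will for the old primary, use primary unregister in place of standby unregister)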

repmgr -f /data/repmgr.conf standby unregister --node-id [NODE-ID]

  • You should now see only one Primary / Running node in your cluster

  • Restart your machine

fly machine restart [MACHINE-ID]

  • I had to wait a few minutes for this machine to go from zombie to primary, which was scary because it was previously considered a replica, but the zombie.lock file in /data disappeared after a few minutes.

  • Delete the old volumes from the other two broken machines (ALERT: NOT THE VOLUME THAT IS CURRENTLY ATTACHED. That will be the only volume left with your data, so please be careful.)

fly volumes destroy [VOLUME-ID]

  • Create two new volumes in separate zones (important, since we want drives in other zones if one zone dies…following the instructions from Fly and just cloning the machine clones a volume in the same zone, which we don't want…)

fly volumes create pg_data --app [APP-NAME] --region [REGION] --require-unique-zone --size [YOUR-SIZE]

  • Create new machines referencing the new volumes you've created and the running primary

fly machine clone [MACHINE-ID] --attach-volume [VOLUME-ID] --region [REGION] --app [APP-NAME]
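For the restart step, where the zombie.lock file in /data took a few minutes to disappear, a small polling loop can save you from re-checking by hand. This is only a sketch: `wait_for_lock_gone` is a hypothetical helper name, and the timeout and interval values are arbitrary assumptions, not anything Fly or repmgr prescribes.

```shell
# Hypothetical helper: poll until a lock file disappears or a timeout expires.
# Arguments: path to the lock file, timeout in seconds (default 300),
# poll interval in seconds (default 5).
wait_for_lock_gone() {
    lock_path=$1
    timeout=${2:-300}
    interval=${3:-5}
    waited=0
    while [ -e "$lock_path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            # Give up and report failure so the caller can investigate.
            echo "still present after ${waited}s: $lock_path" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "gone: $lock_path"
}

# Example, run inside the SSH session on the surviving machine:
#   wait_for_lock_gone /data/zombie.lock 600 10
```

If the function times out, the lock is still there and the machine has probably not left the zombie state yet, so check `cluster show` again before moving on.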

Whew, you should be back up and running. Ideally I would like a Fly CLI command to switch the leader instead of having to do all this nonsense. :point_right: :point_left:
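Since destroying the wrong volume is the one mistake in these steps that can't be undone, it may be worth generating the destroy commands rather than typing them. A minimal sketch, assuming you have already noted the ID of the volume attached to the surviving primary: `print_volume_destroys` is a hypothetical helper that only prints commands and never runs them.

```shell
# Hypothetical helper: print a destroy command for every volume ID given,
# except the one that must be kept. Nothing is executed; review the output
# and run the lines by hand.
# Arguments: the volume ID to keep, then all candidate volume IDs.
print_volume_destroys() {
    keep=$1
    shift
    for vol in "$@"; do
        if [ "$vol" = "$keep" ]; then
            echo "# skipping $vol (attached to the surviving primary)"
        else
            echo "fly volumes destroy $vol"
        fi
    done
}

# Example with placeholder IDs:
#   print_volume_destroys [KEEP-VOLUME-ID] [VOLUME-ID-1] [KEEP-VOLUME-ID] [VOLUME-ID-2]
```

Printing instead of executing is deliberate here: you get one last look at the list before anything is deleted.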

This saved my bacon right now!! Bookmarking for the future. Thank you for these detailed steps! :bowing_man:

You’re welcome! Glad it helped :slight_smile:

This is amazing. Thanks for the concise guide, and 100% Fly team, we need a command to promote primaries and clean up any detached and/or failed nodes!

Thanks for the guide.

I get an error trying to unregister the faulty primary:

ERROR: node [NODE-ID] is not a standby server.

Any advice?

hmm can you try running this step again

repmgr -f /data/repmgr.conf cluster show

and print the output here?

Managed to figure it out. Because the faulty node was a primary, changing your unregister command to the following worked:

repmgr -f /data/repmgr.conf primary unregister --node-id [NODE-ID]
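To avoid hitting that error in the first place, the unregister command can be picked from the role that `cluster show` reports for the dead node. A rough sketch: `unregister_cmd` is a hypothetical helper, it only prints the command, and it assumes the role is given as `primary` or `standby`.

```shell
# Hypothetical helper: print the repmgr unregister command matching a node's role.
# Arguments: role ("primary" or "standby", as listed by `cluster show`), node ID.
unregister_cmd() {
    role=$1
    node_id=$2
    case "$role" in
        primary|standby)
            echo "repmgr -f /data/repmgr.conf $role unregister --node-id $node_id"
            ;;
        *)
            # Anything else is refused rather than guessed at.
            echo "unknown role: $role" >&2
            return 1
            ;;
    esac
}

# Example with a placeholder node ID:
#   unregister_cmd primary [NODE-ID]
```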

Thanks for your guide. Was in a wild panic.

You saved me!
There were a couple of times I had to specify the app using <command> -a <pg-db-name> because it was applying the changes to the application I was using and not the database app.
But it ended up working, Thanks!!
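One way to avoid that mix-up is a tiny wrapper that always pins the database app. This is just a sketch: `flydb`, `PG_APP`, and `FLY_BIN` are made-up names for illustration, `my-pg-db` is a placeholder, and it assumes the subcommand you call accepts `--app` (not every fly subcommand necessarily does).

```shell
# Name of the Postgres app (placeholder; replace with your own).
PG_APP="${PG_APP:-my-pg-db}"
# Command to invoke; overridable so the wrapper can be dry-run with `echo`.
FLY_BIN="${FLY_BIN:-fly}"

# Hypothetical wrapper: run a fly subcommand with the database app pinned,
# so it never falls back to the app in the current directory.
flydb() {
    "$FLY_BIN" "$@" --app "$PG_APP"
}

# Examples:
#   flydb status
#   FLY_BIN=echo flydb machine restart [MACHINE-ID]   # prints instead of running
```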