Here's how to fix an unreachable (2 Zombie, 1 Replica) HA Postgres Cluster

uncvrd · April 27, 2024, 4:09am

Hi all - due to the recent downtime, my 3-node (…erm HA???) Postgres Cluster became unreachable with 2 Zombie Nodes and 1 Replica Node. With no primary to clone from, I was stuck. I spent an ungodly amount of time trying to figure out how to fix this, and now that I have, I figured I would share my solution for anyone who has to deal with this in the future…and for me lol.

I’m sad that the touted Fly “Highly Available” solution failed to keep my database highly available, but alas, here we are. If any Fly support or expert repmgr whizzes here have suggestions for improving this, please let me know and I will update. Anyways, let’s commence!

Run Fly Status

fly status --app [APP-NAME]

Destroy the zombie nodes

fly machine destroy [MACHINE-ID] --force

SSH in to the remaining replica node

fly ssh console --app [APP-NAME]

Switch user to Postgres

su postgres

View the broken cluster

repmgr -f /data/repmgr.conf cluster show

Promote your remaining server as Primary

repmgr -f /data/repmgr.conf standby promote

Force the Status of this new Node to Primary

repmgr -f /data/repmgr.conf primary register --force

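Unregister the old primary and the other standby node, using the node IDs from the cluster show output (if repmgr rejects the old primary as "not a standby server", use primary unregister with the same flags, as reported further down this thread)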

repmgr -f /data/repmgr.conf standby unregister --node-id [NODE-ID]
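
The repmgr steps above can be collected into one small helper. This is only a sketch: it prints the commands from this guide in order rather than running them, so you can review them first. The function name `repmgr_recovery_plan` and the `REPMGR_CONF` variable are made up for illustration, and the node IDs you pass in are whatever `cluster show` reports for your dead nodes.

```shell
#!/bin/sh
# Sketch only: prints the repmgr recovery steps from this guide, in order.
# Nothing is executed. Run the printed commands as the postgres user on the
# surviving replica. REPMGR_CONF is a hypothetical override for the config path.
REPMGR_CONF="${REPMGR_CONF:-/data/repmgr.conf}"

repmgr_recovery_plan() {
  # "$@" = node IDs of the dead nodes to unregister
  echo "repmgr -f $REPMGR_CONF cluster show"
  echo "repmgr -f $REPMGR_CONF standby promote"
  echo "repmgr -f $REPMGR_CONF primary register --force"
  for node_id in "$@"; do
    # A node that was the old primary may need "primary unregister" instead.
    echo "repmgr -f $REPMGR_CONF standby unregister --node-id $node_id"
  done
  # Final check: only one primary / running node should remain.
  echo "repmgr -f $REPMGR_CONF cluster show"
}
```

For example, `repmgr_recovery_plan 2 3` would print the plan for dead nodes with (hypothetical) IDs 2 and 3.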

You should now see only one Primary / Running node in your cluster.

Restart your machine

fly machine restart [MACHINE-ID]

I had to wait a few minutes for this machine to go from zombie → primary, which was scary because this was previously considered a replica, but the zombie.lock file in /data disappeared after a few minutes.
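
If you would rather not keep checking by hand, something like the following could be run inside the machine to wait for the lock file to go away. It is a rough sketch: the `/data/zombie.lock` path is the one mentioned above, while the function name and the timeout and interval defaults are arbitrary assumptions.

```shell
#!/bin/sh
# Sketch only: poll until the zombie lock file disappears, or give up.
# Defaults: /data/zombie.lock (from this post), 600s timeout, 5s interval
# (both numbers are arbitrary assumptions).
wait_for_lock_gone() {
  lock_file="${1:-/data/zombie.lock}"
  timeout_s="${2:-600}"
  interval_s="${3:-5}"
  waited=0
  while [ -e "$lock_file" ]; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "still locked after ${waited}s: $lock_file" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  echo "lock gone after ${waited}s: $lock_file"
}
```

Calling `wait_for_lock_gone` with no arguments uses the defaults; a non-zero exit status means the file was still there when the timeout ran out.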

Delete the old volumes from the other two broken machines (ALERT: NOT THE VOLUME THAT IS CURRENTLY ATTACHED. That will be the only volume left with your data, so please be careful)

fly volumes destroy [VOLUME-ID]

Create two new volumes in separate zones (important: if a zone dies, we want drives in other zones…Fly's instructions to just clone the machine will clone the volume into the same zone, which we don’t want…)

fly volumes create pg_data --app [APP-NAME] --region [REGION] --require-unique-zone --size [YOUR-SIZE]

Create new machines referencing the new volumes you’ve created and the running primary

fly machine clone [MACHINE] --attach-volume [VOLUME-ID] --region [REGION] --app [APP-NAME]
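
Since the last two steps have to be repeated once per new replica, here is a sketch that prints the create-volume and clone-machine pair for each one. It only echoes the commands from this guide; the function name `replica_plan` and its argument order are invented for illustration, and the volume ID stays a placeholder because it is only known after `fly volumes create` has run.

```shell
#!/bin/sh
# Sketch only: prints the volume + clone commands once per new replica.
# Nothing is executed. All arguments are placeholders you supply.
replica_plan() {
  app="$1"              # name of the Postgres app
  region="$2"           # region for the new volumes and machines
  size_gb="$3"          # volume size
  primary_machine="$4"  # machine ID of the running primary
  count="${5:-2}"       # number of replicas to add (default 2)
  i=1
  while [ "$i" -le "$count" ]; do
    echo "# replica $i"
    echo "fly volumes create pg_data --app $app --region $region --require-unique-zone --size $size_gb"
    # Fill in [VOLUME-ID] from the output of the create command above.
    echo "fly machine clone $primary_machine --attach-volume [VOLUME-ID] --region $region --app $app"
    i=$((i + 1))
  done
}
```

Running the printed pairs one replica at a time makes it easier to copy the right volume ID into each clone command.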

Whew, you should be back up and running. Ideally I would like a Fly CLI command to switch the leader instead of having to do all this nonsense.

bluecoffeecreative · May 14, 2024, 3:55am

This saved my bacon right now!! Bookmarking for the future. Thank you for these detailed steps!

uncvrd · May 15, 2024, 9:00pm

You’re welcome! Glad it helped

AshGuy · May 17, 2024, 12:03am

This is amazing. Thanks for the concise guide, and 100% Fly team, we need a command to promote primaries and clean up any detached and/or failed nodes!

Helmut · May 17, 2024, 10:23am

Thanks for the guide.

I get an error trying to unregister the faulty primary:

ERROR: node [NODE-ID] is not a standby server.

Any advice?

uncvrd · May 17, 2024, 10:04pm

hmm can you try running this step again

repmgr -f /data/repmgr.conf cluster show

and print the output here?

Helmut · May 18, 2024, 11:54am

Managed to figure it out. Because the faulty node was a primary, changing the unregister command to the following worked:

repmgr -f /data/repmgr.conf primary unregister --node-id [NODE-ID]
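
To avoid guessing which variant applies, the subcommand can be picked from the role that `cluster show` lists for the dead node. A minimal sketch, assuming you read the role off that output yourself; the helper name `unregister_cmd` and the `REPMGR_CONF` variable are made up, and it only prints the command.

```shell
#!/bin/sh
# Sketch only: print the matching unregister command for a dead node.
# role    = "primary" or "standby", as shown for that node by `cluster show`
# node_id = the node's ID from the same output
unregister_cmd() {
  role="$1"
  node_id="$2"
  case "$role" in
    primary|standby)
      echo "repmgr -f ${REPMGR_CONF:-/data/repmgr.conf} $role unregister --node-id $node_id"
      ;;
    *)
      echo "unknown role: $role" >&2
      return 1
      ;;
  esac
}
```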

Thanks for your guide. Was in a wild panic.

praptolium · June 19, 2024, 2:01pm

You saved me!
There were a couple of times I had to specify the app using <command> -a <pg-db-name> because it was applying the changes to the application I was using and not the database app.
But it ended up working, Thanks!!
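
One way to guard against that is a tiny wrapper that always appends the database app name. This is a sketch under assumptions: `pgfly`, `PG_APP` and `FLY_BIN` are names invented here, not anything Fly provides, and setting `FLY_BIN=echo` turns it into a dry run that just prints the arguments.

```shell
#!/bin/sh
# Sketch only: run fly with --app always set to the Postgres app.
# PG_APP  = name of your database app (hypothetical variable, must be set)
# FLY_BIN = binary to invoke, defaults to fly; set to echo for a dry run
pgfly() {
  # Abort if PG_APP is unset instead of silently targeting another app.
  : "${PG_APP:?set PG_APP to the name of your Postgres app}"
  "${FLY_BIN:-fly}" "$@" --app "$PG_APP"
}
```

With `PG_APP` exported, `pgfly status` or `pgfly ssh console` then always target the database app rather than whatever app the current directory points at.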

Accent24 · January 25, 2025, 6:43am

@uncvrd Thank you very much!!! I’ve been struggling for several weeks now.

It all started with my app going down. The PG VM just stopped responding and wouldn’t start. I created a new app and launched it with the image of the old DB, but within a few days it stopped working again.

Your guide helped me fix my PG cluster, and I managed to copy it to a different region as well! However, I failed to run

repmgr -f /data/repmgr.conf standby unregister --node-id [NODE-ID]

I am not sure why, but it kept telling me ‘No standby clusters available’

wobbleburger · April 1, 2025, 5:29am

Lifesaver!!!

tdoermann · May 8, 2025, 7:23pm

You saved us! Thank you so much.

tjhorner · May 9, 2025, 12:22am

Just recovered my cluster with your guide, thank you! This should really be part of the official docs.

shreddish · June 5, 2025, 4:27pm

massive save! not sure why there aren’t official docs like this. additionally, why isn’t there a way to do this from the fly.io api? it seems very easy to get into zombie states when i’m trying to add/remove replicas

mayailurus · June 5, 2025, 5:10pm

Glad to hear you got something working again!

I don’t disagree, in ideal terms, but these images have all been officially deprecated now, after having spent ~2 years in a kind of limbo, and the flyctl postgres commands themselves will be removed entirely at some point.

Unmanaged Fly Postgres is deprecated in favor of fly mpg (Managed Postgres). Please visit https://fly.io/docs/mpg/overview/ for more information about Managed Postgres.

So, things will generally be getting more do-it-yourself and increasingly rough sailing over time, rather than less…
