Hi all - due to the recent downtime, my 3-node (…erm HA???) Postgres cluster became unreachable, with 2 zombie nodes and 1 replica node. With no primary to clone from, I was stuck. I spent an ungodly amount of time trying to figure out how to fix this, and now that I have, I figured I would share my solution for anyone who runs into this in the future…and for me lol.
I’m sad that the touted Fly “Highly Available” solution failed to keep my database highly available, but alas, here we are. If any Fly support folks or expert repmgr whizzes here have suggestions for improving this, please let me know and I will update. Anyways, let’s commence!
- Run Fly Status
fly status --app [APP-NAME]
- Destroy the zombie nodes
fly machine destroy [MACHINE-ID] --force
- SSH in to the remaining replica node
fly ssh console --app [APP-NAME]
- Switch user to Postgres
su postgres
- View the broken cluster
repmgr -f /data/repmgr.conf cluster show
- Promote your remaining server as Primary
repmgr -f /data/repmgr.conf standby promote
- Force the Status of this new Node to Primary
repmgr -f /data/repmgr.conf primary register --force
- Unregister the old primary and the other standby node
repmgr -f /data/repmgr.conf standby unregister --node-id [NODE-ID]
- You should now see only one Primary / Running node in your cluster
- Restart your machine
fly machine restart [MACHINE-ID]
- I had to wait a few minutes for this machine to go from zombie → primary, which was scary because it was previously considered a replica, but the zombie.lock file in /data disappeared after a few minutes.
- Delete the old volumes from the other two broken machines (ALERT: NOT THE VOLUME THAT IS CURRENTLY ATTACHED; that will be the only volume left with your data, so please be careful)
fly volumes destroy [VOLUME-ID]
- Create two new volumes in separate zones (important, since if a zone dies we want drives in other zones…Fly's instructions to just clone the machine will clone the volume into the same zone, which we don't want…)
fly volumes create pg_data --app [APP-NAME] --region [REGION] --require-unique-zone --size [YOUR-SIZE]
- Create new machines referencing the new volumes you've created and the running primary
fly machine clone [PRIMARY-MACHINE-ID] --attach-volume [VOLUME-ID] --region [REGION] --app [APP-NAME]
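To avoid retyping the last two steps for each replica, they could be wrapped in a couple of small shell helpers. This is a rough sketch, not an official Fly workflow: `run`, `create_replica_volumes`, and `clone_onto_volume` are made-up names, the flags are exactly the ones used in the steps above, and by default it only prints the commands (set `DRY_RUN=0` to actually run them).

```shell
# run: print the command by default; set DRY_RUN=0 to execute it for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

# create_replica_volumes APP REGION SIZE_GB [COUNT]   (hypothetical helper)
# Creates COUNT (default 2) pg_data volumes, each required to land in a unique zone.
create_replica_volumes() (
  if [ "$#" -lt 3 ]; then
    echo "usage: create_replica_volumes APP REGION SIZE_GB [COUNT]" >&2
    exit 2
  fi
  app="$1"
  region="$2"
  size="$3"
  count="${4:-2}"
  i=0
  while [ "$i" -lt "$count" ]; do
    run fly volumes create pg_data --app "$app" --region "$region" \
      --require-unique-zone --size "$size" || exit 1
    i=$((i + 1))
  done
)

# clone_onto_volume APP REGION PRIMARY_MACHINE_ID VOLUME_ID   (hypothetical helper)
# Clones the running primary onto one of the freshly created volumes.
clone_onto_volume() (
  if [ "$#" -ne 4 ]; then
    echo "usage: clone_onto_volume APP REGION PRIMARY_MACHINE_ID VOLUME_ID" >&2
    exit 2
  fi
  run fly machine clone "$3" --attach-volume "$4" --region "$2" --app "$1"
)
```

The volume IDs come from the output of the create step, so the clone helper is called once per new volume, for example `clone_onto_volume [APP-NAME] [REGION] [MACHINE-ID] [VOLUME-ID]`.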
Whew, you should be back up and running. Ideally I would like a Fly CLI command to switch the leader instead of having to do all this nonsense.
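Until something like that exists, the repmgr part of the steps above (promote, force-register as primary, unregister the dead nodes) could be wrapped in a small helper that runs inside the surviving machine as the postgres user. This is only a sketch under assumptions: `run` and `promote_survivor` are made-up names, the config path is the one used in the steps above, and it just prints the commands unless `DRY_RUN=0` is set.

```shell
# Path to the repmgr config, as used in the steps above; override if yours differs.
REPMGR_CONF="${REPMGR_CONF:-/data/repmgr.conf}"

# run: print the command by default; set DRY_RUN=0 to execute it for real.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

# promote_survivor NODE-ID [NODE-ID...]   (hypothetical helper)
# Promotes the local standby, forces its registration as primary,
# then unregisters each dead node ID passed as an argument.
promote_survivor() (
  if [ "$#" -eq 0 ]; then
    echo "usage: promote_survivor NODE-ID [NODE-ID...]" >&2
    exit 2
  fi
  # Show the broken cluster first; this may fail on a broken cluster, so do not abort on it.
  run repmgr -f "$REPMGR_CONF" cluster show
  run repmgr -f "$REPMGR_CONF" standby promote || exit 1
  run repmgr -f "$REPMGR_CONF" primary register --force || exit 1
  for node_id in "$@"; do
    run repmgr -f "$REPMGR_CONF" standby unregister --node-id "$node_id" || exit 1
  done
  # Should now list a single primary / running node.
  run repmgr -f "$REPMGR_CONF" cluster show
)
```

Calling `promote_survivor 2 3` (2 and 3 being example node IDs taken from the cluster show output) prints the six repmgr commands in order; once the output looks right, `DRY_RUN=0 promote_survivor 2 3` runs them.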