A solution for Zombie locks in a Postgres cluster

adelvalle · May 21, 2024, 5:51pm

Recently I had two incidents (15 days apart) where the PG cluster entered a Zombie lock state. This was the production cluster and as such the application was unavailable for customers.

The proposed solutions found here and here didn’t apply to my case, as the repmgr database had also been wiped out (I don’t know why). In the end, my solution was to fork a new cluster from a volume.

These solutions, in all cases, require manual intervention from the admin or account owner, and in my case that is a one-man operation.

What I am interested in knowing is whether there is a definitive, automatic solution to this issue. Needless to say, having these incidents occur randomly like this defeats the purpose of an HA PG cluster, because the failover mechanism stops working altogether.

I have only one cluster in one region because of costs, and I don’t even know if having a second one in another region helps. I know Fly is not managed PG, but the reason for the lock (or failure to select or identify a cluster leader) comes from a service developed by Fly, not from PG.

shaun · May 21, 2024, 6:24pm

A couple recommendations:

Ensure your volumes are in separate zones

You can confirm this by running fly volumes list. If 2 of your 3 volumes happen to be on the same host, there’s a 50% chance that an outage will break quorum.
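To make the 50% figure concrete, here is a minimal sketch of the arithmetic. It assumes quorum means a strict majority of members, that exactly one host fails at a time, and that each host is equally likely to be the one that fails. The host labels and function names are hypothetical and are not part of flyctl or postgres-flex.

```python
def quorum_survives(placement, failed_host):
    """Return True if a strict majority of members is still up after failed_host goes down."""
    total = len(placement)
    remaining = sum(1 for host in placement if host != failed_host)
    return remaining > total // 2


def fraction_of_outages_breaking_quorum(placement):
    """Fraction of single-host outages that leave the cluster without a majority."""
    hosts = sorted(set(placement))
    broken = [h for h in hosts if not quorum_survives(placement, h)]
    return len(broken) / len(hosts)


# One entry per cluster member (volume); the host names are made up for illustration.
shared = ["host-a", "host-a", "host-b"]   # 2 of 3 volumes on the same host
spread = ["host-a", "host-b", "host-c"]   # every volume on its own host

print(fraction_of_outages_breaking_quorum(shared))  # 0.5
print(fraction_of_outages_breaking_quorum(spread))  # 0.0
```

With two volumes sharing a host, losing that host takes out two of three members, so one of the two possible host failures breaks quorum. With the volumes fully spread out, any single host failure still leaves two of three members.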

Work to understand how fencing is handled in this implementation.

This is probably the best resource: postgres-flex/docs/fencing.md at master · fly-apps/postgres-flex · GitHub

Test common failure scenarios and ensure things respond the way you’d expect them to.

Setting up a staging environment and manually breaking the cluster is a good way to get a feel for how things work. If you come across a situation where things are not recovering when you feel they should, offering something that’s reproducible will greatly help us provide specific guidance and/or address any potential bugs.

uncvrd · May 21, 2024, 7:03pm

It’s also worth noting that following the horizontal scaling steps in Fly’s documentation (Horizontal Scaling · Fly Docs) does not create volumes in new zones.

Simply cloning the machine creates the new volume in the same zone, which seems to be causing some confusion. That is why I needed to run these two steps separately in my solution:

fly volumes create pg_data --app [APP-NAME] --region [REGION] --require-unique-zone --size [YOUR-SIZE]

fly machine clone [MACHINE] --attach-volume [VOLUME-ID] --region [REGION] --app [APP-ID]

shaun · May 21, 2024, 7:06pm

That’s a good note! I will take a look at that!

shaun · May 21, 2024, 7:09pm

@uncvrd Looks like this has been updated recently. Are you running the latest version of flyctl?

uncvrd · May 22, 2024, 1:58am

This was the case on April 26th when my Fly cluster failed, but if this has been patched since then, that’s great news!

system · May 29, 2024, 1:58am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
