Recently I had two incidents (15 days apart) where the PG cluster entered a Zombie lock state. This was the production cluster and as such the application was unavailable for customers.
The proposed solutions found here and here didn’t apply to my case, as the repmgr database had also been wiped out (I don’t know why). So the solution was to fork a new cluster from a volume.
These solutions, in all cases, require manual intervention from the admin or account owner, and in my case this is a one-man operation.
So what I am interested in knowing is whether there is a definitive, automatic solution to this issue. Needless to say, having these incidents occur randomly like this defeats the purpose of an HA PG cluster (because the failover mechanism stops working altogether).
I have only one cluster in one region because of costs, and I don’t even know whether having a second one in another region would help. I know Fly is not managed PG, but the reason for the lock (or the failure to select or identify a cluster leader) comes from a service developed by Fly, not from PG.
You can confirm this by running fly volumes list. If you happen to have 2 of your 3 volumes on the same host, there’s a 50% chance an outage will break quorum.
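As a rough illustration of the quorum arithmetic, here is a minimal Python sketch. It assumes you have copied the zone (or host) of each volume by hand from the fly volumes list output into a list of dicts; the "name" and "zone" keys and the quorum_risk function are hypothetical names used for illustration, not part of any Fly tooling.

```python
from collections import Counter


def quorum_risk(volumes):
    """Return the zones whose loss alone would leave the cluster without quorum.

    `volumes` is a list of dicts with a hypothetical "zone" key, filled in
    by hand from the `fly volumes list` output.
    """
    total = len(volumes)
    quorum = total // 2 + 1  # strict majority of cluster members
    per_zone = Counter(v["zone"] for v in volumes)
    # A zone is risky if the members left after losing it are fewer than quorum.
    return sorted(
        zone for zone, count in per_zone.items() if total - count < quorum
    )


# Example data (made up): two of three volumes share a zone.
crowded = [
    {"name": "pg_data", "zone": "zone-a"},
    {"name": "pg_data", "zone": "zone-a"},
    {"name": "pg_data", "zone": "zone-b"},
]
spread = [
    {"name": "pg_data", "zone": "zone-a"},
    {"name": "pg_data", "zone": "zone-b"},
    {"name": "pg_data", "zone": "zone-c"},
]

print(quorum_risk(crowded))
print(quorum_risk(spread))
```

With three members, quorum is two, so any zone holding two of the three volumes is a single point of failure for leader election. Spreading the volumes over three zones means no single zone outage can break quorum.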
Work to understand how fencing is handled in this implementation.
Test common failure scenarios and ensure things respond the way you’d expect them to.
Setting up a staging environment and manually breaking the cluster is a good way to get a feel for how things work. If you come across a situation where things are not recovering when you feel they should, offering something that’s reproducible will greatly help us provide specific guidance and/or address any potential bugs.
It’s also worth noting that the procedure in Fly’s documentation does not create volumes in new zones when scaling horizontally, as described here: Horizontal Scaling · Fly Docs
Simply cloning the machine will create volumes in the same zone, which seems to be causing some confusion. That is why I needed to define these two steps separately in my solution.