I experienced the same thing yesterday, 4/23. I was able to get both of my pg machines back up with a restart. However, every time they scaled to zero they failed to start back up because of a zombie.lock file. For now I've disabled scaling to zero and they have stayed healthy.
All that to say, I am also experiencing this issue as of the morning of 4/23.
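If you want to confirm you're hitting the same failure mode before restarting anything (my-app below is a placeholder for your Postgres app name), list the machines and their current states:
fly machine list --app my-app
Then tail the logs; in my case the failed boots referenced the zombie.lock file:
fly logs --app my-app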
I got my cluster healthy again by doing the following:
Creating 2 new replicas in the primary region
fly machine clone healthy-machine-id --region iad --app my-app
Force destroying the 2 zombie nodes (this leaves their data intact in detached volumes)
fly m destroy --force zombie-machine-id --app my-app
Attaching new instances to the detached volumes
fly machine clone healthy-machine-id --region iad --app my-app --attach-volume zombie-volume-id
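If you're not sure which volume ID to pass to --attach-volume, the detached volumes from the destroyed machines still show up in the volume list, with no machine attached:
fly volumes list --app my-app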
The new machines attached to those volumes came back still in the zombie state, but at some point in this process a leader election took place and one of the new nodes I'd created was elected as the primary.
I ended up force destroying the zombies again, and I've been left with a healthy cluster now.
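If you want to verify which node ended up as primary, you can SSH into one of the machines and ask repmgr directly. The repmgr.conf path below is where postgres-flex keeps it by default; adjust if your setup differs:
fly ssh console --app my-app
su postgres -c "repmgr -f /data/repmgr.conf cluster show"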
Here is the response from fly.io support for this issue; it might help others. I've added my own rough command-level sketch of these steps after the list.
For zombie locked clusters, the appropriate troubleshooting steps would be roughly:
Find last elected primary node
Remove all machines other than previous primary
SSH in and remove the old nodes from repmgr (as described here)
SSH in and remove the zombie lock files from the machine
Restart the node
If healthy, scale cluster back up
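Here's my sketch of what those steps can look like from the CLI, assuming an app named my-app, a previous-primary machine that's still around, and postgres-flex's default paths (/data/repmgr.conf and /data/zombie.lock); treat the paths and the exact repmgr invocation as assumptions to verify against your own cluster and the docs linked below.
Remove all machines other than the previous primary (their volumes stay behind, detached):
fly machine destroy --force other-machine-id --app my-app
SSH into the previous primary:
fly ssh console --app my-app
List the cluster members and note the node IDs of the dead ones:
su postgres -c "repmgr -f /data/repmgr.conf cluster show"
Unregister each dead node (use primary unregister instead if the old node was recorded as a primary):
su postgres -c "repmgr -f /data/repmgr.conf standby unregister --node-id 2"
Remove the zombie lock file and exit:
rm /data/zombie.lock
exit
Restart the node:
fly machine restart previous-primary-machine-id --app my-app
If it comes back healthy, scale back up by cloning it:
fly machine clone previous-primary-machine-id --region iad --app my-app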
It’s unlikely failover/restart will work when the cluster is in any sort of bad state or zombie lock situation.
In general, the zombie.lock file appears when the member has been fenced because Flex is unable to confirm that the booting/running primary is the actual primary. This can happen in the case of a network partition, when the nodes lose contact with each other. We have some info about it here: postgres-flex/docs/fencing.md at master · fly-apps/postgres-flex · GitHub
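Per my reading of that fencing doc, the contents of the lock file tell you how the fencing resolved, so it's worth inspecting before you delete it (path assumed from the postgres-flex defaults):
fly ssh console --app my-app
cat /data/zombie.lock
If the file contains a hostname, Flex was able to identify the real primary and the member should rejoin as a standby on the next boot; if it's empty, quorum couldn't be met and the manual cleanup above is needed.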
If you would prefer to migrate to a new cluster, you could also try restoring your postgres using a volume fork. You can do so using the fly pg create --fork-from flag, specifying the volume ID of the previous primary.
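For reference, the fork flow looks roughly like this: first find the volume ID of the previous primary, then pass it to --fork-from. I believe the flag accepts the source app name optionally suffixed with a volume ID, but double-check the current syntax with fly pg create --help:
fly volumes list --app my-app
fly pg create --name my-app-restored --region iad --fork-from my-app:vol_1234567890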