I have a failing Postgres cluster member (non-leader) in my primary region. Fly vm stop won’t shut it down, and it’s failing two of its health checks.
What I want to do is stop it, remove the faulty volume, clone the leader volume, and attach it to the vm / a new vm which will then become a replica.
Alas, I am on the non-machines Postgres setup, so I’m not sure what to do.
Also, if someone from fly is reading this: can you please follow through on support tickets. I opened a ticket on Friday when I realised the machine was misbehaving, and was told it was being escalated. It’s now Tuesday morning here, and my db isn’t writable because my read replica setup is trying to broadcast writes to the faulty machine (this last part is arguably my own fault, and I could fail over to a new primary region, but maybe let’s just fix one problem at a time)
Hey @shaun, I deleted the volume and tried to restart the cluster. It’s now stuck in performing a failover, and the web UI shows the read replica unable to find keeper info (db cluster seems to be in read-only mode):
2023-04-18T00:09:57.028 app[f02ac5e0] ams [info] sentinel | 2023-04-18T00:09:57.027Z WARN cmd/sentinel.go:276 no keeper info available {"db": "51866363", "keeper": "28df288fe2"}
Those should clear on their own, however, there seems to be an edge case where they don’t and it can lead to unnecessary WAL retention. To remove these, you can ssh into your VM you can use the stolonctl to remove the inactive keepers.
fly ssh console --app <app-name>
# List the cluster status
stolonctl status
# Remove the member(s) that are no longer running.
stolonctl removekeeper <keeper-id>
That showed the db was in read-only mode, and I could reset it, but it wouldn’t stick. I wondered whether the db was in read-only mode due to a failed migration to v2, using migrate-to-v2 (which is what caused all this in the first place), so I’ve scaled down to a single region, run migrate-to-v2 again, cleaned up the stale keepers again, and now I have a healthy, writable v2 cluster containing a leader and replica.
When I list the volumes, I see two new volumes of type pg_data_machines (one for each running vm), and two original volumes (pg_data). Is it safe two remove the older volumes since they’re no longer attached? Also, the new volumes aren’t showing as encrypted, whereas the old ones are.