Replace flapping pg cluster member

I have a failing Postgres cluster member (a non-leader) in my primary region. Running fly vm stop won't shut it down, and it's failing two of its health checks.
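
For context, this is roughly how I've been looking at it (just a sketch; the app name and VM ID are placeholders, and the exact commands may differ on the Nomad setup):

# Overall app/instance status
fly status --app <app-name>

# The health checks (two of them are failing on the bad member)
fly checks list --app <app-name>

# Details for the misbehaving vm
fly vm status <vm-id> --app <app-name>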

What I want to do is stop it, remove the faulty volume, clone the leader's volume, and attach the clone to that VM (or a new one), which would then become a replica.
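
Roughly what I had in mind, going by the docs (just a sketch; the volume/snapshot IDs, region and size are placeholders, and restoring from a snapshot is the closest thing I can see to "cloning" the leader's volume on this setup):

# Remove the faulty volume once its vm is stopped
fly volumes destroy <faulty-volume-id> --app <app-name>

# Find a recent snapshot of the leader's volume
fly volumes snapshots list <leader-volume-id>

# Create a new volume from that snapshot
fly volumes create pg_data --snapshot-id <snapshot-id> --region <region> --size <size-gb> --app <app-name>

# Scale back up so a new vm gets placed on the new volume and joins as a replica
fly scale count 2 --app <app-name>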

Alas, I am on the non-machines Postgres setup, so I’m not sure what to do.

Also, if someone from fly is reading this: can you please follow through on support tickets. I opened a ticket on Friday when I realised the machine was misbehaving, and was told it was being escalated. It’s now Tuesday morning here, and my db isn’t writable because my read replica setup is trying to broadcast writes to the faulty machine (this last part is arguably my own fault, and I could fail over to a new primary region, but maybe let’s just fix one problem at a time)

Hey @shugel,

You should be able to just delete the faulty volume, and Nomad should automatically allocate a new one.
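
Something along these lines (the IDs are placeholders; double-check which volume belongs to the faulty member before destroying anything):

# Find the volume attached to the faulty member
fly volumes list --app <app-name>

# Delete it
fly volumes destroy <volume-id> --app <app-name>

# If Nomad doesn't reschedule the member on its own, create a replacement
# volume in the same region and restart the app
fly volumes create pg_data --region <region> --size <size-gb> --app <app-name>
fly apps restart <app-name>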

Hey @shaun, I deleted the volume and tried to restart the cluster. It's now stuck performing a failover, and the web UI shows the read replica unable to find keeper info (the db cluster seems to be in read-only mode):

2023-04-18T00:09:57.028 app[f02ac5e0] ams [info] sentinel | 2023-04-18T00:09:57.027Z WARN cmd/sentinel.go:276 no keeper info available {"db": "51866363", "keeper": "28df288fe2"}

Those should clear on their own; however, there's an edge case where they don't, and it can lead to unnecessary WAL retention. To remove them, you can SSH into your VM and use stolonctl to remove the inactive keepers.

fly ssh console --app <app-name>

# List the cluster status
stolonctl status

# Remove the member(s) that are no longer running.
stolonctl removekeeper <keeper-id>
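
The keeper ID to remove is the UID from the sentinel warning (it also shows up in the keepers section of the status output, so you can confirm it's the one that's no longer running). For the log line above, that would be, for example:

stolonctl removekeeper 28df288fe2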

That’s cleared the keeper error, but the entire cluster still appears to be in read-only mode:

If I try to write through the app, I get this on the db side (lhr is the primary region):

app[1fb3c3cf] lhr [info] keeper | 2023-04-18 00:50:44.169 UTC [4041] ERROR: cannot execute UPDATE in a read-only transaction

I also can't commit transactions through e.g. DataGrip if I connect to the db locally over WireGuard.

Did you do anything to enable read-only mode on your primary?

Connect to your database using fly pg connect --database <db-name>.

Run the following to see if your database is set to read-only:

show default_transaction_read_only;

If this comes back as on, you can adjust it by running:

SET default_transaction_read_only=false;
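
One thing to keep in mind: SET only changes the current session, so it won't survive a reconnect. It's also worth confirming you're actually connected to the primary rather than a replica, since a replica is always read-only regardless of that setting. A quick way to check, assuming you can reach the database with psql (for example over fly proxy 5432 --app <db-app-name>; the connection string is a placeholder):

# "t" here means you're on a standby/replica, not the primary
psql "<connection-string>" -c "SELECT pg_is_in_recovery();"

# What the current session actually resolves to
psql "<connection-string>" -c "SHOW transaction_read_only;"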

That showed the db was in read-only mode, and I could reset it, but it wouldn't stick. I wondered whether the db was read-only because of a failed migration to v2 using migrate-to-v2 (which is what caused all this in the first place). So I scaled down to a single region, ran migrate-to-v2 again, cleaned up the stale keepers again, and now I have a healthy, writable v2 cluster with a leader and a replica.
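
For anyone who finds this later, the re-run itself was roughly this (from memory; the app name is a placeholder, and how you scale down to a single region depends on where your volumes live):

# Check where the instances and volumes currently are
fly status --app <app-name>
fly volumes list --app <app-name>

# After scaling down to the primary region, re-run the migration
# (I ran this from the app's config directory)
fly migrate-to-v2

# Then ssh in and clear the stale keepers again, as above
fly ssh console --app <app-name>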

When I list the volumes, I see two new volumes of type pg_data_machines (one for each running vm) and the two original volumes (pg_data). Is it safe to remove the older volumes, since they're no longer attached? Also, the new volumes aren't showing as encrypted, whereas the old ones are.
