Replace flapping pg cluster member

I have a failing Postgres cluster member (a non-leader) in my primary region. Running fly vm stop won't shut it down, and it's failing two of its health checks.
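
For context, this is roughly how I've been looking at it (just a sketch; the app name and VM ID are placeholders, and the exact commands may differ on the Nomad setup):

# Overall app/instance status
fly status --app <app-name>

# The health checks (two of them are failing on the bad member)
fly checks list --app <app-name>

# Details for the misbehaving vm
fly vm status <vm-id> --app <app-name>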

What I want to do is stop it, remove the faulty volume, clone the leader's volume, and attach the clone to that VM (or a new one), which would then become a replica.
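
Roughly what I had in mind, going by the docs (just a sketch; the volume/snapshot IDs, region and size are placeholders, and restoring from a snapshot is the closest thing I can see to "cloning" the leader's volume on this setup):

# Remove the faulty volume once its vm is stopped
fly volumes destroy <faulty-volume-id> --app <app-name>

# Find a recent snapshot of the leader's volume
fly volumes snapshots list <leader-volume-id>

# Create a new volume from that snapshot
fly volumes create pg_data --snapshot-id <snapshot-id> --region <region> --size <size-gb> --app <app-name>

# Scale back up so a new vm gets placed on the new volume and joins as a replica
fly scale count 2 --app <app-name>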

Alas, I am on the non-machines Postgres setup, so I’m not sure what to do.

Also, if someone from fly is reading this: can you please follow through on support tickets. I opened a ticket on Friday when I realised the machine was misbehaving, and was told it was being escalated. It’s now Tuesday morning here, and my db isn’t writable because my read replica setup is trying to broadcast writes to the faulty machine (this last part is arguably my own fault, and I could fail over to a new primary region, but maybe let’s just fix one problem at a time)

Hey @shugel,

You should be able to just delete the faulty volume, and Nomad should automatically allocate a new one.
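
Something along these lines (the IDs are placeholders; double-check which volume belongs to the faulty member before destroying anything):

# Find the volume attached to the faulty member
fly volumes list --app <app-name>

# Delete it
fly volumes destroy <volume-id> --app <app-name>

# If Nomad doesn't reschedule the member on its own, create a replacement
# volume in the same region and restart the app
fly volumes create pg_data --region <region> --size <size-gb> --app <app-name>
fly apps restart <app-name>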

Hey @shaun, I deleted the volume and tried to restart the cluster. It's now stuck performing a failover, and the web UI shows the read replica unable to find keeper info (the db cluster seems to be in read-only mode):

2023-04-18T00:09:57.028 app[f02ac5e0] ams [info] sentinel | 2023-04-18T00:09:57.027Z WARN cmd/sentinel.go:276 no keeper info available {"db": "51866363", "keeper": "28df288fe2"}

Those should clear on their own; however, there's an edge case where they don't, and it can lead to unnecessary WAL retention. To remove them, you can SSH into your VM and use stolonctl to remove the inactive keepers.

fly ssh console --app <app-name>

# List the cluster status
stolonctl status

# Remove the member(s) that are no longer running.
stolonctl removekeeper <keeper-id>
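
The keeper ID to remove is the UID from the sentinel warning (it also shows up in the keepers section of the status output, so you can confirm it's the one that's no longer running). For the log line above, that would be, for example:

stolonctl removekeeper 28df288fe2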

That’s cleared the keeper error, but the entire cluster still appears to be in read-only mode:

If I try to write through the app, I get this on the db side (lhr is the primary region):

app[1fb3c3cf] lhr [info] keeper | 2023-04-18 00:50:44.169 UTC [4041] ERROR: cannot execute UPDATE in a read-only transaction

I also can't commit transactions through e.g. DataGrip if I connect to the db locally over WireGuard.

Did you do anything to enable read-only mode on your primary?

Connect to your database using fly pg connect --database <db-name>.

Run the following to see if your database is set to read-only:

show default_transaction_read_only;

If this comes back as on, you can adjust it by running:

SET default_transaction_read_only=false;
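
One thing to keep in mind: SET only changes the current session, so it won't survive a reconnect. It's also worth confirming you're actually connected to the primary rather than a replica, since a replica is always read-only regardless of that setting. A quick way to check, assuming you can reach the database with psql (for example over fly proxy 5432 --app <db-app-name>; the connection string is a placeholder):

# "t" here means you're on a standby/replica, not the primary
psql "<connection-string>" -c "SELECT pg_is_in_recovery();"

# What the current session actually resolves to
psql "<connection-string>" -c "SHOW transaction_read_only;"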

That showed the db was in read-only mode, and I could reset it, but it wouldn't stick. I wondered whether the db was read-only because of a failed migration to v2 using migrate-to-v2 (which is what caused all this in the first place). So I scaled down to a single region, ran migrate-to-v2 again, cleaned up the stale keepers again, and now I have a healthy, writable v2 cluster with a leader and a replica.
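
For anyone who finds this later, the re-run itself was roughly this (from memory; the app name is a placeholder, and how you scale down to a single region depends on where your volumes live):

# Check where the instances and volumes currently are
fly status --app <app-name>
fly volumes list --app <app-name>

# After scaling down to the primary region, re-run the migration
# (I ran this from the app's config directory)
fly migrate-to-v2

# Then ssh in and clear the stale keepers again, as above
fly ssh console --app <app-name>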

When I list the volumes, I see two new volumes of type pg_data_machines (one for each running vm) and the two original volumes (pg_data). Is it safe to remove the older volumes, since they're no longer attached? Also, the new volumes aren't showing as encrypted, whereas the old ones are.
