Communication with Postgres Cluster dead?

I am seeing a bunch of this stuff:

2024-07-24T18:29:18.082 app[48ed16dc565468] ord [info] admin | [WARN] Failed to connect to fdaa:1:1dfc:a7b:96:6bf8:a15a:2

2024-07-24T18:29:28.082 app[48ed16dc565468] ord [info] admin | [WARN] Failed to connect to fdaa:1:1dfc:a7b:69:efe4:3049:2

2024-07-24T18:29:28.082 app[48ed16dc565468] ord [info] admin | Voting member(s): 3, Active: 1, Inactive: 2, Conflicts: 0

2024-07-24T18:29:38.151 app[81137ea99d16d8] ord [info] proxy | [WARNING] (434) : Server bk_db/pg1 was DOWN and now enters maintenance (unspecified DNS error).

2024-07-24T18:29:38.151 app[81137ea99d16d8] ord [info] proxy | [WARNING] (434) : Server bk_db/pg2 was DOWN and now enters maintenance (unspecified DNS error).

[PP02] could not proxy TCP data to/from instance: failed to copy (direction=client->server, error=Transport endpoint is not connected

My app server cannot connect and I cannot clone any of my DB hosts to try and move them to new machines.

The logs make this look like Postgres is in a bad state on multiple Machines. Cloning them probably won’t help, and may actually make things worse.

It appears that 1 of the 3 original voting members is still working, but it’s in read-only mode.
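To make the reasoning concrete, here is a minimal sketch that parses the `Voting member(s)` status line from the logs above and checks whether the active members form a strict majority. It assumes the read-only state comes from a majority (quorum) check, which the thread does not confirm. The helper names `parse_status` and `has_quorum` are made up for illustration and are not part of any Fly.io tooling.

```python
import re

# Matches the admin status line seen in the logs above.
STATUS_RE = re.compile(
    r"Voting member\(s\): (\d+), Active: (\d+), Inactive: (\d+), Conflicts: (\d+)"
)

def parse_status(line):
    """Return the member counts from a status log line, or None if absent."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    voting, active, inactive, conflicts = map(int, m.groups())
    return {"voting": voting, "active": active,
            "inactive": inactive, "conflicts": conflicts}

def has_quorum(voting, active):
    """Assumption: quorum means a strict majority of voting members is active."""
    return active > voting // 2

line = ("2024-07-24T18:29:28.082 app[48ed16dc565468] ord [info] admin | "
        "Voting member(s): 3, Active: 1, Inactive: 2, Conflicts: 0")
status = parse_status(line)

print(has_quorum(status["voting"], status["active"]))  # False: 1 of 3 is not a majority
print(has_quorum(1, 1))  # True: a single remaining voting member is its own majority
```

Under that assumption, removing the two inactive members shrinks the voting set to one, which is why the surviving node could become writable again.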

What I would do is remove the unhealthy / inactive Machines and see if you can get the currently healthy one into a writable state (if it’s the only one running, it will be).

Then dig through the logs and see if there’s any indication of what caused this. My guess would be OOMs or something similar, in which case you may need more RAM.
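If you save the logs to a file, a small filter like the one below can help narrow the search. This is only a sketch: it assumes OOM kills show up as the usual Linux kernel phrases ("Out of memory", "oom-killer", "oom_kill"), and the function name `find_oom_lines` is hypothetical. The exact wording in your logs may differ.

```python
import re

# Assumption: OOM events contain one of the common Linux kernel phrases.
OOM_RE = re.compile(r"out of memory|oom[-_ ]?kill", re.IGNORECASE)

def find_oom_lines(lines):
    """Return the log lines that look like out-of-memory events."""
    return [line.rstrip("\n") for line in lines if OOM_RE.search(line)]

# The lines pasted in this thread contain no OOM markers, so nothing matches.
thread_lines = [
    "admin | [WARN] Failed to connect to fdaa:1:1dfc:a7b:96:6bf8:a15a:2",
    "admin | Voting member(s): 3, Active: 1, Inactive: 2, Conflicts: 0",
]
print(find_oom_lines(thread_lines))  # []
```

To run it over a saved log, pass an open file object, for example `find_oom_lines(open("pg.log"))`, where `pg.log` is a placeholder file name.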

Oh I just saw JP helping in your support ticket, gonna close this in favor of that.