Postgres cluster broken since last Fly migration

We have a Fly Postgres cluster with 3 nodes running in production. Our app has been down since earlier today because it can no longer connect to a primary instance (all nodes have become replicas). Resource usage (CPU load, memory, disk) has stayed well below the limits.

I’m guessing this is related to the automated PG migration that ran on Jun 25 2024 at 22:22 UTC?

VERSION STATUS          DESCRIPTION     USER            DATE (UTC)              DOCKER IMAGE
v3      complete        Release         john@fly.io     Jun 25 2024 22:26       docker-hub-mirror.fly.io/flyio/postgres-flex:15.3
v2      failed          Release         john@fly.io     Jun 25 2024 22:22       docker-hub-mirror.fly.io/flyio/postgres-flex:15.3
v1      complete        Release         zafer@algora.io Mar 10 2024 17:37       registry-1.docker.io/flyio/postgres-flex:15.3

I tried to force a failover, but Fly rejects it with "no active leader found". I can also no longer connect to the database with the CLI, or run pg_dump through a Fly proxy. I even created a completely new Postgres cluster with 1 primary node by forking one of the existing volumes, but that didn’t work either.

It looks like an issue with the repmgr connection:

repmgrd  | [2024-06-25 10:53:59] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
repmgrd  | [2024-06-25 10:53:59] [INFO] connecting to database "host=****:*:****:***:***:****:****:* port=5433 user=repmgr dbname=repmgr connect_timeout=5"
repmgrd  | [2024-06-25 10:53:59] [ERROR] connection to database failed
repmgrd  | [2024-06-25 10:53:59] [DETAIL]
repmgrd  | connection to server at "****:*:****:***:***:****:****:*", port 5433 failed: Connection refused
repmgrd  | Is the server running on that host and accepting TCP/IP connections?
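
In case it helps anyone reproduce this: "Connection refused" on port 5433 means nothing accepted the TCP connection on that node at that moment. Below is a minimal Python sketch of that check, using only the standard library. The function name `port_accepts_connections` is my own, and the host you pass in is whatever private address the node has (masked in the log above). It only tests TCP reachability, not whether Postgres is a primary or healthy.

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host (IPv4 or IPv6) and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts, and name-resolution failures.
        return False
```

Run from a machine inside the same private network, `port_accepts_connections(node_address, 5433)` returning False would match what repmgrd reports; `node_address` here is a placeholder for the node's real address. The 5-second default mirrors the `connect_timeout=5` in the repmgr connection string.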
