We have a Fly Postgres cluster with 3 nodes running in production. Our app has been down today because it can no longer connect to a primary instance (all nodes have become replicas). Our resource usage (CPU load, memory, disk) has been well below the limits.
I’m guessing this has to do with the automated PG migration that was run on Jun 25 2024 at 22:22 UTC?
VERSION  STATUS    DESCRIPTION  USER             DATE (UTC)         DOCKER IMAGE
v3       complete  Release      john@fly.io      Jun 25 2024 22:26  docker-hub-mirror.fly.io/flyio/postgres-flex:15.3
v2       failed    Release      john@fly.io      Jun 25 2024 22:22  docker-hub-mirror.fly.io/flyio/postgres-flex:15.3
v1       complete  Release      zafer@algora.io  Mar 10 2024 17:37  registry-1.docker.io/flyio/postgres-flex:15.3
I have attempted to force a failover, but Fly rejects that with "no active leader found". I’m also no longer able to connect to the database with the CLI or with pg_dump through a Fly proxy. I have even created a completely new Postgres cluster with 1 primary node by forking one of the existing volumes, but that didn’t work either.
It seems to be an issue with the repmgr connection:
2024-06-25 10:53:59.358 repmgrd | Is the server running on that host and accepting TCP/IP connections?
2024-06-25 10:53:59.358 repmgrd | connection to server at "****:*:****:***:***:****:****:*", port 5433 failed: Connection refused
2024-06-25 10:53:59.358 repmgrd | [2024-06-25 10:53:59] [DETAIL]
2024-06-25 10:53:59.358 repmgrd | [2024-06-25 10:53:59] [ERROR] connection to database failed
2024-06-25 10:53:59.358 repmgrd | [2024-06-25 10:53:59] [INFO] connecting to database "host=****:*:****:***:***:****:****:* port=5433 user=repmgr dbname=repmgr connect_timeout=5"
2024-06-25 10:53:59.358 repmgrd | [2024-06-25 10:53:59] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
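In case it helps anyone reproduce the "Connection refused" part independently of repmgr, here is a minimal sketch of a plain TCP reachability check against port 5433 (the port from the log above). The function name can_connect and the placeholder host are my own and purely illustrative; the real host would be the node's private address, and the script would have to run somewhere that can reach the private network. It only tells you whether something is listening on the port, not whether Postgres is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host (IPv4 or IPv6) and tries each address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical placeholder: replace with the node's private address.
    host = "::1"
    port = 5433  # port taken from the repmgrd log lines above
    state = "accepting connections" if can_connect(host, port) else "not reachable"
    print(f"{host} port {port}: {state}")
```

A result of "not reachable" here would match what repmgrd reports, and would suggest that Postgres itself is not listening on that node rather than repmgr being misconfigured.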