Both the primary and the replica report as healthy. The primary is still running, and the app can read from and write to it. However, the replica cannot reach the primary. Both fly postgres connect
and SSH into the primary time out, while SSH into the replica works.
Output of repmgr -f /data/repmgr.conf cluster show:
WARNING: following issues were detected
- unable to connect to node "2865551b90d7d8" (ID: 1165987592)'s upstream node "18570e5b201018" (ID: 1709151374)
- unable to determine if node "2865551b90d7d8" (ID: 1165987592) is attached to its upstream node "18570e5b201018" (ID: 1709151374)
- unable to connect to node "18570e5b201018" (ID: 1709151374)
- node "18570e5b201018" (ID: 1709151374) is registered as an active primary but is unreachable
The following log is from the replica:
2025-02-15T16:48:59.517 app[2865551b90d7d8] ewr [info] repmgrd | Running...
2025-02-15T16:48:59.524 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:48:59] [NOTICE] repmgrd (repmgrd 5.4.1) starting up
2025-02-15T16:48:59.524 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:48:59] [INFO] connecting to database "host=2865551b90d7d8.vm.***db.internal port=5433 user=repmgr dbname=repmgr connect_timeout=5"
2025-02-15T16:48:59.541 app[2865551b90d7d8] ewr [info] repmgrd | INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
2025-02-15T16:48:59.541 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:48:59] [NOTICE] starting monitoring of node "2865551b90d7d8" (ID: 1165987592)
2025-02-15T16:48:59.541 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:48:59] [INFO] "connection_check_type" set to "ping"
2025-02-15T16:49:03.726 app[2865551b90d7d8] ewr [info] failed post-init: failed to migrate node name: failed to establish connection to primary: failed to connect to `host=18570e5b201018.vm.***db.internal user=repmgr database=repmgr`: dial error (timeout: context deadline exceeded). Retrying...
2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:49:04] [ERROR] connection to database failed
2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:49:04] [DETAIL]
2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd | connection to server at "18570e5b201018.vm.***db.internal" (fdaa:***), port 5433 failed: timeout expired
2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd |
2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:49:04] [DETAIL] attempted to connect using:
2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd | user=repmgr connect_timeout=5 dbname=repmgr host=18570e5b201018.vm.***db.internal port=5433 fallback_application_name=repmgr options=-csearch_path=
2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:49:04] [ERROR] unable connect to upstream node (ID: 1709151374), terminating
2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:49:04] [HINT] upstream node must be running before repmgrd can start
2025-02-15T16:49:04.549 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:49:04] [INFO] repmgrd terminating...
2025-02-15T16:49:04.553 app[2865551b90d7d8] ewr [info] repmgrd | exit status 6
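Since the errors in the log above are all connection timeouts, one way to narrow this down might be to check from the replica whether the primary's port accepts TCP connections at all. Below is a minimal Python sketch of such a check. It is not part of the Fly or repmgr tooling: the function name can_connect is made up for illustration, it assumes Python is available wherever it is run, and the host has to be passed on the command line. The default port 5433 and the 5-second timeout simply mirror the values in the log.

```python
import socket
import sys


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration only. It checks TCP reachability,
    not whether Postgres itself accepts the login.
    """
    try:
        # create_connection resolves the host name and tries each address in turn
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers DNS failures, refused connections, and timeouts
        return False


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: check_port.py HOST [PORT]")
    host = sys.argv[1]
    # 5433 is the port that appears in the repmgrd log; override it if needed
    port = int(sys.argv[2]) if len(sys.argv) > 2 else 5433
    state = "reachable" if can_connect(host, port) else "NOT reachable"
    print(f"{host}:{port} is {state}")
```

If this reports the port as not reachable while the app can still write to the primary, that would suggest the problem is on the private network path between the two machines rather than in Postgres or repmgr itself.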
Does anyone know how to resolve this? The issue appeared suddenly after the cluster had been running for about 20 days. Thanks for the help.