Primary unreachable from replica, but is running fine

Both the primary and the replica are healthy: the primary is still running, and the app can read from and write to it. However, the replica cannot reach the primary. Both fly postgres connect and SSH into the primary time out, while SSH into the replica works.

Output of repmgr -f /data/repmgr.conf cluster show, run on the replica:

WARNING: following issues were detected
  - unable to connect to node "2865551b90d7d8" (ID: 1165987592)'s upstream node "18570e5b201018" (ID: 1709151374)
  - unable to determine if node "2865551b90d7d8" (ID: 1165987592) is attached to its upstream node "18570e5b201018" (ID: 1709151374)
  - unable to connect to node "18570e5b201018" (ID: 1709151374)
  - node "18570e5b201018" (ID: 1709151374) is registered as an active primary but is unreachable

The following log is from the replica:

2025-02-15T16:48:59.517 app[2865551b90d7d8] ewr [info] repmgrd | Running...

2025-02-15T16:48:59.524 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:48:59] [NOTICE] repmgrd (repmgrd 5.4.1) starting up

2025-02-15T16:48:59.524 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:48:59] [INFO] connecting to database "host=2865551b90d7d8.vm.***db.internal port=5433 user=repmgr dbname=repmgr connect_timeout=5"

2025-02-15T16:48:59.541 app[2865551b90d7d8] ewr [info] repmgrd | INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid

2025-02-15T16:48:59.541 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:48:59] [NOTICE] starting monitoring of node "2865551b90d7d8" (ID: 1165987592)

2025-02-15T16:48:59.541 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:48:59] [INFO] "connection_check_type" set to "ping"

2025-02-15T16:49:03.726 app[2865551b90d7d8] ewr [info] failed post-init: failed to migrate node name: failed to establish connection to primary: failed to connect to `host=18570e5b201018.vm.***db.internal user=repmgr database=repmgr`: dial error (timeout: context deadline exceeded). Retrying...

2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:49:04] [ERROR] connection to database failed

2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:49:04] [DETAIL]

2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd | connection to server at "18570e5b201018.vm.***db.internal" (fdaa:***), port 5433 failed: timeout expired

2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd |

2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:49:04] [DETAIL] attempted to connect using:

2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd | user=repmgr connect_timeout=5 dbname=repmgr host=18570e5b201018.vm.***db.internal port=5433 fallback_application_name=repmgr options=-csearch_path=

2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:49:04] [ERROR] unable connect to upstream node (ID: 1709151374), terminating

2025-02-15T16:49:04.548 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:49:04] [HINT] upstream node must be running before repmgrd can start

2025-02-15T16:49:04.549 app[2865551b90d7d8] ewr [info] repmgrd | [2025-02-15 16:49:04] [INFO] repmgrd terminating...

2025-02-15T16:49:04.553 app[2865551b90d7d8] ewr [info] repmgrd | exit status 6
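
Since the log shows a timeout on port 5433 rather than an authentication or Postgres error, a plain TCP probe from the replica may help separate a network problem from a database problem. Below is a minimal sketch, assuming Python 3 is available on the machine (I have not confirmed that it is on the Postgres image). The function name can_connect and the script name are hypothetical; the host and port are passed in by hand.

```python
import socket
import sys


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 probe.py <primary-hostname> 5433
    host, port = sys.argv[1], int(sys.argv[2])
    print("reachable" if can_connect(host, port) else "unreachable")
```

If this reports unreachable for the primary's internal hostname on port 5433 while the app can still write to the primary, that would point at the private network path between the two machines rather than at Postgres or repmgr.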

Does anyone know how to resolve this? The issue appeared suddenly after the cluster had been running for about 20 days. Thanks for the help.
