Connection Issues on Fly Postgres in the Singapore Region

Hi, my DB suddenly started reporting this status:

500 Internal Server Error failed to connect to local node: failed to connect to host=fdaa:2:5d0c:a7b:18a:cd3a:c69a:2 user=repmgr database=repmgr: dial error (dial tcp [fdaa:2:5d0c:a7b:18a:cd3a:c69a:2]:5433: connect: connection refused)

When I try to restart it, I get:

fly postgres restart --app mydbapp
Error: no active leader found

Can anyone help?

A bit off topic: at the same time, I noticed that the command "mix deps.compile" has also become really slow, when it usually finishes within a few seconds to a couple of minutes.

More information from the health checks on the primary node:

pg
500 Internal Server Error
[✓] connections: 12 used, 3 reserved, 500 max (31.14ms)
[✗] cluster-locks: zombie.lock detected (27.97µs)
[✓] disk-capacity: 25.8% - readonly mode will be enabled at 90.0% (22.89µs)

vm
[✓] checkDisk: 28.99 GB (74.2%) free space on /data/ (61.27µs)
[✓] checkLoad: load averages: 3.31 1.40 0.12 (95.32µs)
[✓] memory: system spent 0s of the last 60s waiting on memory (47.07µs)
[✓] cpu: system spent 1.54s of the last 60s waiting on cpu (32.7µs)
[✓] io: system spent 0s of the last 60s waiting on io (27.52µs)

role
zombie
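
For anyone skimming the check output above, here is a minimal Python sketch that pulls out only the failing checks. The line format is inferred purely from the output pasted in this post, and the names SAMPLE, CHECK_LINE and failing_checks are made up for illustration; this is not part of flyctl.

```python
import re

# Sample copied from the "pg" health-check output pasted above.
SAMPLE = """\
[✓] connections: 12 used, 3 reserved, 500 max (31.14ms)
[✗] cluster-locks: zombie.lock detected (27.97µs)
[✓] disk-capacity: 25.8% - readonly mode will be enabled at 90.0% (22.89µs)
"""

# Assumed layout: "[mark] name: detail (duration)".
CHECK_LINE = re.compile(
    r"^\[(?P<mark>[✓✗])\]\s+(?P<name>[\w-]+):\s+(?P<detail>.*?)\s+\((?P<took>[^()]*)\)$"
)


def failing_checks(text):
    """Return (name, detail) pairs for every check marked with ✗."""
    failures = []
    for line in text.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match and match.group("mark") == "✗":
            failures.append((match.group("name"), match.group("detail")))
    return failures


print(failing_checks(SAMPLE))
```

In this output the only failing check is cluster-locks, which matches the "zombie" role reported for the node.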

I have also already tried stopping and starting the machines, still with no luck.

2024-01-05T08:37:37.095 app[4d8917eec74e98] sin [info] failed post-init: unrecoverable zombie. Retrying...

2024-01-05T08:37:37.095 app[4d8917eec74e98] sin [info] [ERROR] Manual intervention required.

2024-01-05T08:37:37.095 app[4d8917eec74e98] sin [info] [ERROR] If a new primary has been established, consider adding a new replica with fly machines clone <primary-machine-id> and then remove this member.

2024-01-05T08:37:37.095 app[4d8917eec74e98] sin [info] [ERROR] Sleeping for 5 minutes.

2024-01-05T08:37:38.214 app[4d8917eec74e98] sin [info] postgres | 2024-01-05 08:37:38.213 UTC [381] LOG: checkpoint starting: time

2024-01-05T08:37:38.341 app[4d8917eec74e98] sin [info] postgres | 2024-01-05 08:37:38.340 UTC [381] LOG: checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.104 s, sync=0.001 s, total=0.128 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=1 kB

2024-01-05T08:37:46.310 app[4d8917eec74e98] sin [info] repmgrd | [2024-01-05 08:37:46] [INFO] monitoring primary node "fdaa:2:5d0c:a7b:188:d9ba:4480:2" (ID: 1946318029) in normal state

2024-01-05T08:37:46.551 app[4d8917eec74e98] sin [info] monitor | [WARN] Failed to connect to fdaa:2:5d0c:a7b:18a:cd3a:c69a:2

2024-01-05T08:37:46.551 app[4d8917eec74e98] sin [info] monitor | Voting member(s): 2, Active: 1, Inactive: 1, Conflicts: 0

2024-01-05T08:38:16.621 app[4d8917eec74e98] sin [info] monitor | [WARN] Failed to restart haproxy on member fdaa:2:5d0c:a7b:18a:cd3a:c69a:2: Get "http://[fdaa:2:5d0c:a7b:18a:cd3a:c69a:2]:5500/commands/admin/haproxy/restart": dial tcp [fdaa:2:5d0c:a7b:18a:cd3a:c69a:2]:5500: i/o timeout

2024-01-05T08:38:16.665 app[4d8917eec74e98] sin [info] monitor | clusterStateMonitorTick failed with: primary has been quarantined: unable to confirm we are the true primary

2024-01-05T08:38:16.669 app[4d8917eec74e98] sin [info] proxy | [NOTICE] (1413) : haproxy version is 2.8.3-1~bpo12+1

2024-01-05T08:38:16.669 app[4d8917eec74e98] sin [info] proxy | [NOTICE] (1413) : path to executable is /usr/sbin/haproxy

2024-01-05T08:38:16.669 app[4d8917eec74e98] sin [info] proxy | [ALERT] (1413) : Current worker (1415) exited with code 143 (Terminated)

2024-01-05T08:38:16.669 app[4d8917eec74e98] sin [info] proxy | [WARNING] (1413) : All workers exited. Exiting... (0)

2024-01-05T08:38:16.669 app[4d8917eec74e98] sin [info] proxy | Process exited 0

2024-01-05T08:38:16.669 app[4d8917eec74e98] sin [info] proxy | restarting in 1s [attempt 2]

2024-01-05T08:38:17.670 app[4d8917eec74e98] sin [info] proxy | Running...

2024-01-05T08:38:17.855 app[4d8917eec74e98] sin [info] proxy | [NOTICE] (2318) : New worker (2320) forked

2024-01-05T08:38:17.855 app[4d8917eec74e98] sin [info] proxy | [NOTICE] (2318) : Loading success.

2024-01-05T08:38:17.864 app[4d8917eec74e98] sin [info] proxy | [WARNING] (2320) : bk_db/pg1 changed its IP from (none) to fdaa:2:5d0c:a7b:188:d9ba:4480:2 by flydns/dns1.

2024-01-05T08:38:17.864 app[4d8917eec74e98] sin [info] proxy | [WARNING] (2320) : Server bk_db/pg1 ('sin.ctrid-tsdb.internal') is UP/READY (resolves again).

2024-01-05T08:38:17.864 app[4d8917eec74e98] sin [info] proxy | [WARNING] (2320) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.

2024-01-05T08:38:19.684 app[4d8917eec74e98] sin [info] proxy | [WARNING] (2320) : Backup Server bk_db/pg is DOWN, reason: Layer7 invalid response, info: "HTTP content check did not match", check duration: 10ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

2024-01-05T08:38:19.875 app[4d8917eec74e98] sin [info] proxy | [WARNING] (2320) : Server bk_db/pg1 is DOWN, reason: Layer7 invalid response, info: "HTTP content check did not match", check duration: 15ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

2024-01-05T08:38:19.875 app[4d8917eec74e98] sin [info] proxy | [ALERT] (2320) : backend 'bk_db' has no server available!
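
To make the "Voting member(s): 2, Active: 1, Inactive: 1" monitor line above easier to reason about, here is a minimal Python sketch that parses it and checks whether the active members form a majority. The simple-majority rule (more than half of the voting members) is an assumption on my part about how such a check typically works, not something confirmed by the logs, and MONITOR_LINE and has_majority are hypothetical names.

```python
import re

# Matches the membership summary printed by the monitor process in the logs above.
MONITOR_LINE = re.compile(
    r"Voting member\(s\): (\d+), Active: (\d+), Inactive: (\d+), Conflicts: (\d+)"
)


def has_majority(line):
    """Return True if the active members are a strict majority of the voters.

    Assumption: quorum is floor(voting / 2) + 1, a plain majority rule.
    """
    match = MONITOR_LINE.search(line)
    if match is None:
        raise ValueError("not a monitor membership line")
    voting, active, _inactive, _conflicts = map(int, match.groups())
    quorum = voting // 2 + 1
    return active >= quorum


line = "monitor | Voting member(s): 2, Active: 1, Inactive: 1, Conflicts: 0"
print(has_majority(line))  # False: 1 active out of 2 voters is below a quorum of 2
```

Under that assumption, a two-member cluster cannot confirm a primary once either member is unreachable, which would be consistent with the "unable to confirm we are the true primary" message in the logs.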
