fly postgres attach - Error: no active leader found

Hi,

I use review apps in my CI pipeline, and for them I have a postgres app that is configured to scale to zero. After creating a new review app in my deploy script, I need to attach it to the cluster. Since the db might have scaled to zero, I wake every machine in the postgres app and then wait for them to reach the “started” state by running a couple of commands:

flyctl machine list --app ${{ env.DB_APP_NAME }} --json | jq -r '.[0].id' | xargs -n 1 flyctl machine start --app ${{ env.DB_APP_NAME }}

then

until flyctl machine list --app ${{ env.DB_APP_NAME }} --json | jq '.[0].state' | grep -q "started"; do echo "waiting on postgres machine startup"; sleep 1; done

Then I try to attach the new app and get the error “Error: no active leader found”. If I run flyctl postgres attach manually after the pipeline fails, it works as expected. What is the best way anyone has found to handle this scenario? Just put in a wait and hope for the best? Keep trying the attach command until it succeeds? Thanks for your help.
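
By “keep trying the attach command”, I mean something like this untested sketch, where REVIEW_APP_NAME is just a placeholder for whatever the new review app is called:

until flyctl postgres attach ${{ env.DB_APP_NAME }} --app "$REVIEW_APP_NAME"; do echo "no leader yet, retrying attach in 5s"; sleep 5; done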

I added some wait time between commands and that seems to work, but I’m still wondering if there is a better way?

Hm… What do DB_APP_NAME’s logs show during that time?

Is it three consecutive leader elections, 🐉?

(I think multi-node clusters are mostly designed to be run non-stop…)

There’s only one node, but the logs look like this after some light editing:

2024-10-27T00:54:08.801 app[6e825e4c05d448] sea [info] 2024-10-27T00:54:08.801866536 [01JA9FXHS1991MG57NRQPYRGG5:main] Running Firecracker v1.7.0
2024-10-27T00:54:09.207 app[6e825e4c05d448] sea [info] [ 0.264722] PCI: Fatal: No config space access function found
2024-10-27T00:54:09.539 app[6e825e4c05d448] sea [info] INFO Starting init (commit: 04656915)...
2024-10-27T00:54:09.596 app[6e825e4c05d448] sea [info] INFO Checking filesystem on /data
2024-10-27T00:54:09.599 app[6e825e4c05d448] sea [info] /dev/vdc: clean, 2246/65280 files, 30541/261120 blocks
2024-10-27T00:54:09.601 app[6e825e4c05d448] sea [info] INFO Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755
2024-10-27T00:54:09.606 app[6e825e4c05d448] sea [info] INFO Resized /data to 1069547520 bytes
2024-10-27T00:54:09.628 app[6e825e4c05d448] sea [info] INFO Preparing to run: `start` as root
2024-10-27T00:54:09.640 app[6e825e4c05d448] sea [info] INFO [fly api proxy] listening at /.fly/api
2024-10-27T00:54:09.683 runner[6e825e4c05d448] sea [info] Machine started in 995ms
2024-10-27T00:54:09.789 app[6e825e4c05d448] sea [info] 2024/10/27 00:54:09 INFO SSH listening listen_address=[*********:2]:22 dns_server=[fdaa::3]:53
2024-10-27T00:54:09.987 app[6e825e4c05d448] sea [info] Configured scale to zero with duration of 1h0m0s
2024-10-27T00:54:09.988 app[6e825e4c05d448] sea [info] postgres | Running...
2024-10-27T00:54:10.144 app[6e825e4c05d448] sea [info] proxy | Running...
2024-10-27T00:54:10.144 app[6e825e4c05d448] sea [info] repmgrd | Running...
2024-10-27T00:54:10.144 app[6e825e4c05d448] sea [info] monitor | Running...
2024-10-27T00:54:10.144 app[6e825e4c05d448] sea [info] admin | Running...
2024-10-27T00:54:10.144 app[6e825e4c05d448] sea [info] exporter | Running...
2024-10-27T00:54:10.144 app[6e825e4c05d448] sea [info] monitor | Waiting for Postgres to be ready...
2024-10-27T00:54:10.201 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:10] [NOTICE] repmgrd (repmgrd 5.4.1) starting up
2024-10-27T00:54:10.201 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:10] [INFO] connecting to database "host=*********:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5"
2024-10-27T00:54:10.202 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:10] [ERROR] connection to database failed
2024-10-27T00:54:10.202 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:10] [DETAIL]
2024-10-27T00:54:10.202 app[6e825e4c05d448] sea [info] repmgrd | connection to server at "*********:2", port 5433 failed: Connection refused
2024-10-27T00:54:10.202 app[6e825e4c05d448] sea [info] repmgrd | Is the server running on that host and accepting TCP/IP connections?
2024-10-27T00:54:10.202 app[6e825e4c05d448] sea [info] repmgrd |
2024-10-27T00:54:10.202 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:10] [DETAIL] attempted to connect using:
2024-10-27T00:54:10.202 app[6e825e4c05d448] sea [info] repmgrd | user=repmgr connect_timeout=5 dbname=repmgr host=*********:2 port=5433 fallback_application_name=repmgr options=-csearch_path=
2024-10-27T00:54:10.206 app[6e825e4c05d448] sea [info] repmgrd | exit status 6
2024-10-27T00:54:10.206 app[6e825e4c05d448] sea [info] repmgrd | restarting in 5s [attempt 1]
2024-10-27T00:54:10.248 health[6e825e4c05d448] sea [warn] Health check for your postgres database is warning. Your database might be malfunctioning.
2024-10-27T00:54:10.248 health[6e825e4c05d448] sea [warn] Health check for your postgres vm is warning. Your instance might be hitting resource limits.
2024-10-27T00:54:10.248 health[6e825e4c05d448] sea [warn] Health check for your postgres role is warning. Your cluster's membership might be affected.
2024-10-27T00:54:10.273 app[6e825e4c05d448] sea [info] proxy | [NOTICE] (347) : New worker (382) forked
2024-10-27T00:54:10.275 app[6e825e4c05d448] sea [info] proxy | [NOTICE] (347) : Loading success.
2024-10-27T00:54:10.287 app[6e825e4c05d448] sea [info] proxy | [WARNING] (382) : bk_db/pg1 changed its IP from (none) to *********:2 by flydns/dns1.
2024-10-27T00:54:10.287 app[6e825e4c05d448] sea [info] proxy | [WARNING] (382) : Server bk_db/pg1 ('sea.************.internal') is UP/READY (resolves again).
2024-10-27T00:54:10.287 app[6e825e4c05d448] sea [info] proxy | [WARNING] (382) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2024-10-27T00:54:10.294 app[6e825e4c05d448] sea [info] postgres | 2024-10-27 00:54:10.291 UTC [345] LOG: starting PostgreSQL 16.4 (Ubuntu 16.4-1.pgdg24.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 13.2.0-23ubuntu4) 13.2.0, 64-bit
2024-10-27T00:54:10.295 app[6e825e4c05d448] sea [info] postgres | 2024-10-27 00:54:10.294 UTC [345] LOG: listening on IPv4 address "0.0.0.0", port 5433
2024-10-27T00:54:10.295 app[6e825e4c05d448] sea [info] postgres | 2024-10-27 00:54:10.294 UTC [345] LOG: listening on IPv6 address "::", port 5433
2024-10-27T00:54:10.297 app[6e825e4c05d448] sea [info] postgres | 2024-10-27 00:54:10.297 UTC [345] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433"
2024-10-27T00:54:10.304 app[6e825e4c05d448] sea [info] postgres | 2024-10-27 00:54:10.303 UTC [386] LOG: database system was shut down at 2024-10-26 21:33:52 UTC
2024-10-27T00:54:10.319 app[6e825e4c05d448] sea [info] postgres | 2024-10-27 00:54:10.319 UTC [345] LOG: database system is ready to accept connections
2024-10-27T00:54:10.847 health[6e825e4c05d448] sea [error] Health check for your postgres database has failed. Your database is malfunctioning.
2024-10-27T00:54:11.064 app[6e825e4c05d448] sea [info] Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2024-10-27T00:54:11.196 app[6e825e4c05d448] sea [info] proxy | [NOTICE] (347) : haproxy version is 2.8.5-1ubuntu3
2024-10-27T00:54:11.196 app[6e825e4c05d448] sea [info] proxy | [NOTICE] (347) : path to executable is /usr/sbin/haproxy
2024-10-27T00:54:11.196 app[6e825e4c05d448] sea [info] proxy | [ALERT] (347) : Current worker (382) exited with code 143 (Terminated)
2024-10-27T00:54:11.196 app[6e825e4c05d448] sea [info] proxy | [WARNING] (347) : All workers exited. Exiting... (0)
2024-10-27T00:54:11.197 app[6e825e4c05d448] sea [info] proxy | Process exited 0
2024-10-27T00:54:11.197 app[6e825e4c05d448] sea [info] proxy | restarting in 1s [attempt 1]
2024-10-27T00:54:12.197 app[6e825e4c05d448] sea [info] proxy | Running...
2024-10-27T00:54:12.238 app[6e825e4c05d448] sea [info] proxy | [NOTICE] (406) : New worker (408) forked
2024-10-27T00:54:12.238 app[6e825e4c05d448] sea [info] proxy | [NOTICE] (406) : Loading success.
2024-10-27T00:54:12.241 app[6e825e4c05d448] sea [info] proxy | [WARNING] (408) : bk_db/pg1 changed its IP from (none) to *********:2 by flydns/dns1.
2024-10-27T00:54:12.241 app[6e825e4c05d448] sea [info] proxy | [WARNING] (408) : Server bk_db/pg1 ('sea.********.internal') is UP/READY (resolves again).
2024-10-27T00:54:12.241 app[6e825e4c05d448] sea [info] proxy | [WARNING] (408) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2024-10-27T00:54:15.157 app[6e825e4c05d448] sea [info] monitor | Postgres is ready to accept connections. Starting monitor...
2024-10-27T00:54:15.207 app[6e825e4c05d448] sea [info] repmgrd | Running...
2024-10-27T00:54:15.213 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:15] [NOTICE] repmgrd (repmgrd 5.4.1) starting up
2024-10-27T00:54:15.213 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:15] [INFO] connecting to database "host=*********:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5"
2024-10-27T00:54:15.227 app[6e825e4c05d448] sea [info] repmgrd | INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
2024-10-27T00:54:15.227 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:15] [NOTICE] starting monitoring of node "*********:2" (ID: 1328154310)
2024-10-27T00:54:15.227 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:15] [INFO] "connection_check_type" set to "ping"
2024-10-27T00:54:15.227 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:15] [NOTICE] monitoring cluster primary "*********:2" (ID: 1328154310)
2024-10-27T00:54:15.699 health[6e825e4c05d448] sea [info] Health check for your postgres role is now passing.
2024-10-27T00:54:16.904 health[6e825e4c05d448] sea [info] Health check for your postgres vm is now passing.
2024-10-27T00:54:25.974 health[6e825e4c05d448] sea [info] Health check for your postgres database is now passing. 

Thanks… There’s a roughly 6-second gap between the machine reporting started and the role health check passing, and I suspect that’s what’s tripping you up…

(I tried with a throwaway database and also saw a gap between fly m start announcing started and health checks actually passing—albeit not such a large one.)

Maybe try polling fly checks list -j instead?

It might also be worth checking whether the Consul API has a way to subscribe to leadership† events, which would allow you to avoid polling altogether.

(LiteFS does have such a stream, for example.)
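
If Consul does track it, its HTTP API supports long-polling (“blocking queries”) on KV reads, so in principle you could watch a leader key instead of sleeping and re-checking. A rough, untested sketch, where the key path is entirely made up and CONSUL_URL is assumed to point at your cluster’s Consul endpoint:

# hypothetical key; the request hangs until the value changes or 60s elapse, then the base64-encoded value is decoded
curl -s "$CONSUL_URL/v1/kv/my-pg-cluster/leader?index=${LAST_INDEX}&wait=60s" | jq -r '.[0].Value' | base64 -d

(LAST_INDEX would be the X-Consul-Index, or ModifyIndex, returned by the previous read.)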

Hope this helps a little!


†It’s not clear to me whether Consul does actually know the leader in PG Flex; it would be convenient here if it did.

Great, thank you for the insights. I appreciate the help. I will play around with this some more and see if I can make it work.

Update: waiting on the health checks seems to be pretty consistent:

until flyctl checks list --app ${{ env.DB_APP_NAME }} --json | jq -e '[.[][] | select(.status!="passing")] | any | not'; do echo "waiting on postgres machine startup"; sleep 1; done
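
One further tweak I’m considering (untested) is capping the loop so the job fails outright if the checks never pass, instead of hanging the pipeline:

tries=0; until flyctl checks list --app ${{ env.DB_APP_NAME }} --json | jq -e '[.[][] | select(.status!="passing")] | any | not'; do tries=$((tries+1)); [ "$tries" -gt 120 ] && exit 1; echo "waiting on postgres machine startup"; sleep 1; done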