fly postgres attach - Error: no active leader found

Hi,

I use review apps in my CI pipeline, and for them I have a postgres app that is configured to scale to zero. After creating a new review app in my deploy script, I need to attach it to the cluster. Since the db might have scaled to zero, I wake every machine in the postgres app and then wait for them to reach the “started” state by running a couple of commands:

flyctl machine list --app ${{ env.DB_APP_NAME }} --json | jq -r '.[0].id' | xargs -n 1 flyctl machine start --app ${{ env.DB_APP_NAME }}

then

until flyctl machine list --app ${{ env.DB_APP_NAME }} --json | jq '.[0].state' | grep -q "started"; do echo "waiting on postgres machine startup"; sleep 1; done

Then I try to attach the new app and get the error “Error: no active leader found”. If I run flyctl postgres attach manually after the pipeline fails, it works as expected. What is the best way anyone has found to handle this scenario? Just put in a wait and hope for the best? Keep trying the attach command until it succeeds? Thanks for your help.
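
By “keep trying the attach command”, I mean something like this untested sketch, where REVIEW_APP_NAME is just a placeholder for whatever the new review app is called:

until flyctl postgres attach ${{ env.DB_APP_NAME }} --app "$REVIEW_APP_NAME"; do echo "no leader yet, retrying attach in 5s"; sleep 5; done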

I added some wait time between commands and that seems to work, but I’m still wondering if there is a better way?

Hm… What do DB_APP_NAME’s logs show during that time?

Is it three consecutive leader elections, 🐉?

(I think multi-node clusters are mostly designed to be run non-stop…)

There’s only one node, but the logs look like this after some light editing:

2024-10-27T00:54:08.801 app[6e825e4c05d448] sea [info] 2024-10-27T00:54:08.801866536 [01JA9FXHS1991MG57NRQPYRGG5:main] Running Firecracker v1.7.0
2024-10-27T00:54:09.207 app[6e825e4c05d448] sea [info] [ 0.264722] PCI: Fatal: No config space access function found
2024-10-27T00:54:09.539 app[6e825e4c05d448] sea [info] INFO Starting init (commit: 04656915)...
2024-10-27T00:54:09.596 app[6e825e4c05d448] sea [info] INFO Checking filesystem on /data
2024-10-27T00:54:09.599 app[6e825e4c05d448] sea [info] /dev/vdc: clean, 2246/65280 files, 30541/261120 blocks
2024-10-27T00:54:09.601 app[6e825e4c05d448] sea [info] INFO Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755
2024-10-27T00:54:09.606 app[6e825e4c05d448] sea [info] INFO Resized /data to 1069547520 bytes
2024-10-27T00:54:09.628 app[6e825e4c05d448] sea [info] INFO Preparing to run: `start` as root
2024-10-27T00:54:09.640 app[6e825e4c05d448] sea [info] INFO [fly api proxy] listening at /.fly/api
2024-10-27T00:54:09.683 runner[6e825e4c05d448] sea [info] Machine started in 995ms
2024-10-27T00:54:09.789 app[6e825e4c05d448] sea [info] 2024/10/27 00:54:09 INFO SSH listening listen_address=[*********:2]:22 dns_server=[fdaa::3]:53
2024-10-27T00:54:09.987 app[6e825e4c05d448] sea [info] Configured scale to zero with duration of 1h0m0s
2024-10-27T00:54:09.988 app[6e825e4c05d448] sea [info] postgres | Running...
2024-10-27T00:54:10.144 app[6e825e4c05d448] sea [info] proxy | Running...
2024-10-27T00:54:10.144 app[6e825e4c05d448] sea [info] repmgrd | Running...
2024-10-27T00:54:10.144 app[6e825e4c05d448] sea [info] monitor | Running...
2024-10-27T00:54:10.144 app[6e825e4c05d448] sea [info] admin | Running...
2024-10-27T00:54:10.144 app[6e825e4c05d448] sea [info] exporter | Running...
2024-10-27T00:54:10.144 app[6e825e4c05d448] sea [info] monitor | Waiting for Postgres to be ready...
2024-10-27T00:54:10.201 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:10] [NOTICE] repmgrd (repmgrd 5.4.1) starting up
2024-10-27T00:54:10.201 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:10] [INFO] connecting to database "host=*********:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5"
2024-10-27T00:54:10.202 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:10] [ERROR] connection to database failed
2024-10-27T00:54:10.202 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:10] [DETAIL]
2024-10-27T00:54:10.202 app[6e825e4c05d448] sea [info] repmgrd | connection to server at "*********:2", port 5433 failed: Connection refused
2024-10-27T00:54:10.202 app[6e825e4c05d448] sea [info] repmgrd | Is the server running on that host and accepting TCP/IP connections?
2024-10-27T00:54:10.202 app[6e825e4c05d448] sea [info] repmgrd |
2024-10-27T00:54:10.202 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:10] [DETAIL] attempted to connect using:
2024-10-27T00:54:10.202 app[6e825e4c05d448] sea [info] repmgrd | user=repmgr connect_timeout=5 dbname=repmgr host=*********:2 port=5433 fallback_application_name=repmgr options=-csearch_path=
2024-10-27T00:54:10.206 app[6e825e4c05d448] sea [info] repmgrd | exit status 6
2024-10-27T00:54:10.206 app[6e825e4c05d448] sea [info] repmgrd | restarting in 5s [attempt 1]
2024-10-27T00:54:10.248 health[6e825e4c05d448] sea [warn] Health check for your postgres database is warning. Your database might be malfunctioning.
2024-10-27T00:54:10.248 health[6e825e4c05d448] sea [warn] Health check for your postgres vm is warning. Your instance might be hitting resource limits.
2024-10-27T00:54:10.248 health[6e825e4c05d448] sea [warn] Health check for your postgres role is warning. Your cluster's membership might be affected.
2024-10-27T00:54:10.273 app[6e825e4c05d448] sea [info] proxy | [NOTICE] (347) : New worker (382) forked
2024-10-27T00:54:10.275 app[6e825e4c05d448] sea [info] proxy | [NOTICE] (347) : Loading success.
2024-10-27T00:54:10.287 app[6e825e4c05d448] sea [info] proxy | [WARNING] (382) : bk_db/pg1 changed its IP from (none) to *********:2 by flydns/dns1.
2024-10-27T00:54:10.287 app[6e825e4c05d448] sea [info] proxy | [WARNING] (382) : Server bk_db/pg1 ('sea.************.internal') is UP/READY (resolves again).
2024-10-27T00:54:10.287 app[6e825e4c05d448] sea [info] proxy | [WARNING] (382) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2024-10-27T00:54:10.294 app[6e825e4c05d448] sea [info] postgres | 2024-10-27 00:54:10.291 UTC [345] LOG: starting PostgreSQL 16.4 (Ubuntu 16.4-1.pgdg24.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 13.2.0-23ubuntu4) 13.2.0, 64-bit
2024-10-27T00:54:10.295 app[6e825e4c05d448] sea [info] postgres | 2024-10-27 00:54:10.294 UTC [345] LOG: listening on IPv4 address "0.0.0.0", port 5433
2024-10-27T00:54:10.295 app[6e825e4c05d448] sea [info] postgres | 2024-10-27 00:54:10.294 UTC [345] LOG: listening on IPv6 address "::", port 5433
2024-10-27T00:54:10.297 app[6e825e4c05d448] sea [info] postgres | 2024-10-27 00:54:10.297 UTC [345] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433"
2024-10-27T00:54:10.304 app[6e825e4c05d448] sea [info] postgres | 2024-10-27 00:54:10.303 UTC [386] LOG: database system was shut down at 2024-10-26 21:33:52 UTC
2024-10-27T00:54:10.319 app[6e825e4c05d448] sea [info] postgres | 2024-10-27 00:54:10.319 UTC [345] LOG: database system is ready to accept connections
2024-10-27T00:54:10.847 health[6e825e4c05d448] sea [error] Health check for your postgres database has failed. Your database is malfunctioning.
2024-10-27T00:54:11.064 app[6e825e4c05d448] sea [info] Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2024-10-27T00:54:11.196 app[6e825e4c05d448] sea [info] proxy | [NOTICE] (347) : haproxy version is 2.8.5-1ubuntu3
2024-10-27T00:54:11.196 app[6e825e4c05d448] sea [info] proxy | [NOTICE] (347) : path to executable is /usr/sbin/haproxy
2024-10-27T00:54:11.196 app[6e825e4c05d448] sea [info] proxy | [ALERT] (347) : Current worker (382) exited with code 143 (Terminated)
2024-10-27T00:54:11.196 app[6e825e4c05d448] sea [info] proxy | [WARNING] (347) : All workers exited. Exiting... (0)
2024-10-27T00:54:11.197 app[6e825e4c05d448] sea [info] proxy | Process exited 0
2024-10-27T00:54:11.197 app[6e825e4c05d448] sea [info] proxy | restarting in 1s [attempt 1]
2024-10-27T00:54:12.197 app[6e825e4c05d448] sea [info] proxy | Running...
2024-10-27T00:54:12.238 app[6e825e4c05d448] sea [info] proxy | [NOTICE] (406) : New worker (408) forked
2024-10-27T00:54:12.238 app[6e825e4c05d448] sea [info] proxy | [NOTICE] (406) : Loading success.
2024-10-27T00:54:12.241 app[6e825e4c05d448] sea [info] proxy | [WARNING] (408) : bk_db/pg1 changed its IP from (none) to *********:2 by flydns/dns1.
2024-10-27T00:54:12.241 app[6e825e4c05d448] sea [info] proxy | [WARNING] (408) : Server bk_db/pg1 ('sea.********.internal') is UP/READY (resolves again).
2024-10-27T00:54:12.241 app[6e825e4c05d448] sea [info] proxy | [WARNING] (408) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2024-10-27T00:54:15.157 app[6e825e4c05d448] sea [info] monitor | Postgres is ready to accept connections. Starting monitor...
2024-10-27T00:54:15.207 app[6e825e4c05d448] sea [info] repmgrd | Running...
2024-10-27T00:54:15.213 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:15] [NOTICE] repmgrd (repmgrd 5.4.1) starting up
2024-10-27T00:54:15.213 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:15] [INFO] connecting to database "host=*********:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5"
2024-10-27T00:54:15.227 app[6e825e4c05d448] sea [info] repmgrd | INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
2024-10-27T00:54:15.227 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:15] [NOTICE] starting monitoring of node "*********:2" (ID: 1328154310)
2024-10-27T00:54:15.227 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:15] [INFO] "connection_check_type" set to "ping"
2024-10-27T00:54:15.227 app[6e825e4c05d448] sea [info] repmgrd | [2024-10-27 00:54:15] [NOTICE] monitoring cluster primary "*********:2" (ID: 1328154310)
2024-10-27T00:54:15.699 health[6e825e4c05d448] sea [info] Health check for your postgres role is now passing.
2024-10-27T00:54:16.904 health[6e825e4c05d448] sea [info] Health check for your postgres vm is now passing.
2024-10-27T00:54:25.974 health[6e825e4c05d448] sea [info] Health check for your postgres database is now passing. 

Thanks… There’s a roughly 6-second gap between the machine reporting started and the role health check passing, and I suspect that’s what’s tripping you up…

(I tried with a throwaway database and also saw a gap between fly m start announcing started and health checks actually passing—albeit not such a large one.)

Maybe try polling fly checks list -j instead?

It might also be worth checking whether the Consul API has a way to subscribe to leadership† events, which would allow you to avoid polling altogether.

(LiteFS does have such a stream, for example.)
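
If Consul does track it, its HTTP API supports long-polling (“blocking queries”) on KV reads, so in principle you could watch a leader key instead of sleeping and re-checking. A rough, untested sketch, where the key path is entirely made up and CONSUL_URL is assumed to point at your cluster’s Consul endpoint:

# hypothetical key; the request hangs until the value changes or 60s elapse, then the base64-encoded value is decoded
curl -s "$CONSUL_URL/v1/kv/my-pg-cluster/leader?index=${LAST_INDEX}&wait=60s" | jq -r '.[0].Value' | base64 -d

(LAST_INDEX would be the X-Consul-Index, or ModifyIndex, returned by the previous read.)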

Hope this helps a little!


†It’s not clear to me whether Consul does actually know the leader in PG Flex; it would be convenient here if it did.

Great, thank you for the insights. I appreciate the help. I will play around with this some more and see if I can make it work.

Update: waiting on the health checks seems to be pretty consistent:

until flyctl checks list --app ${{ env.DB_APP_NAME }} --json | jq -e '[.[][] | select(.status!="passing")] | any | not'; do echo "waiting on postgres machine startup"; sleep 1; done
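
One further tweak I’m considering (untested) is capping the loop so the job fails outright if the checks never pass, instead of hanging the pipeline:

tries=0; until flyctl checks list --app ${{ env.DB_APP_NAME }} --json | jq -e '[.[][] | select(.status!="passing")] | any | not'; do tries=$((tries+1)); [ "$tries" -gt 120 ] && exit 1; echo "waiting on postgres machine startup"; sleep 1; done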