fly machine clone fails after barman postgres recovery

I performed a Point-in-Time Recovery using barman and postgres flex (following the instructions). Everything seemed to work, although I had to manually run barman -q cron before WAL receiving would start. I was able to clone my primary pg before the barman recovery, but I can't afterwards. Here's my (seemingly successful) barman check after the recovery:

➜ mydir git:(main) ✗ fly pg barman check -a mydir-dev-pg-flex
Connecting to fdaa:1:2ff7:a7b:1d4:6e63:fdce:2… complete
Server pg:
PostgreSQL: OK
superuser or standard user with backup privileges: OK
PostgreSQL streaming: OK
wal_level: OK
replication slot: OK
directories: OK
retention policy settings: OK
backup maximum age: OK (no last_backup_maximum_age provided)
backup minimum size: OK (29.6 MiB)
wal maximum age: OK (no last_wal_maximum_age provided)
wal size: OK (32.0 MiB)
compression settings: OK
failed backups: OK (there are 0 failed backups)
minimum redundancy requirements: OK (have 1 backups, expected at least 0)
pg_basebackup: OK
pg_basebackup compatible: OK
pg_basebackup supports tablespaces mapping: OK
systemid coherence: OK
pg_receivexlog: OK
pg_receivexlog compatible: OK
receive-wal running: OK
archiver errors: OK
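
For reference, getting WAL receiving going in the first place looked roughly like this (the exact ssh/selection flags are from memory, so treat it as a sketch rather than a recipe):

$ fly ssh console -a mydir-dev-pg-flex -s    # pick the barman Machine from the list
Then, inside the barman Machine:
$ barman -q cron       # kicks off receive-wal and the other background tasks
$ barman check pg      # after this, "receive-wal running" reports OK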

Then I try to clone my primary pg to scale horizontally, and it fails at health check 1/3.

➜ mydir git:(main) ✗ fly machine clone 3287356b764185 --region yul --app mydir-dev-pg-flex
Cloning Machine 3287356b764185 into region yul
Volume 'pg_data' will start empty
Provisioning a new Machine with image flyio/postgres-flex:15.6@sha256:92917e5770cef6666dddbf17061c2f95b9e19b8155be9a8ce8c35e09e5381167…
Machine 48ed431f717568 has been created…
Waiting for Machine 48ed431f717568 to start…
Waiting for 48ed431f717568 to become healthy (started, 1/3)
Error: error while watching health checks: context deadline exceeded

The machine status of the failed clone reveals:

pg	critical	500 Internal Server Error failed to resolve local member role: no rows in result set	2024-04-11 01:59:53
vm	passing	[✓] checkDisk: 872.36 MB (88.5%) free space on /data/ (54.82µs) [✓] checkLoad: load averages: 0.00 0.00 0.00 (61.14µs) [✓] memory: system spent 0s of the last 60s waiting on memory (38.13µs) [✓] cpu: system spent 444ms of the last 60s waiting on cpu (26.95µs) [✓] io: system spent 288ms of the last 60s waiting on io (12.8µs)	2024-04-11 01:59:47
role	critical	500 Internal Server Error failed to check role: no rows in result set
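
My reading of the "no rows in result set" errors is that the new Machine never shows up in repmgr's metadata. Something like the following should list the registered nodes (the fly pg connect flags and the repmgr.nodes columns are from memory and the repmgr docs, so consider it a sketch):

$ fly postgres connect -a mydir-dev-pg-flex -d repmgr
repmgr=# SELECT node_id, node_name, type, active FROM repmgr.nodes;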

Any help would be greatly appreciated. I'm not confident about moving this to production if I can't get the basics working in dev. Thanks!

And this is what my failing replica shows in the logs:

$ fly logs -a mydir-dev-pg-flex -i 1857931f422e68
Waiting for logs…
2024-04-11T15:30:15.607 runner[1857931f422e68] yul [info] Pulling container image docker-hub-mirror.fly.io/flyio/postgres-flex:15.6
2024-04-11T15:30:15.860 runner[1857931f422e68] yul [info] Successfully prepared image docker-hub-mirror.fly.io/flyio/postgres-flex:15.6 (253.11342ms)
2024-04-11T15:30:15.881 runner[1857931f422e68] yul [info] Setting up volume 'pg_data'
2024-04-11T15:30:15.881 runner[1857931f422e68] yul [info] Uninitialized volume 'pg_data', initializing…
2024-04-11T15:30:15.881 runner[1857931f422e68] yul [info] Encrypting volume
2024-04-11T15:30:21.593 runner[1857931f422e68] yul [info] Opening encrypted volume
2024-04-11T15:30:23.529 runner[1857931f422e68] yul [info] Formatting volume
2024-04-11T15:30:23.874 runner[1857931f422e68] yul [info] Configuring firecracker
2024-04-11T15:30:24.305 app[1857931f422e68] yul [info] [ 0.142977] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!
2024-04-11T15:30:24.333 app[1857931f422e68] yul [info] [ 0.148635] PCI: Fatal: No config space access function found
2024-04-11T15:30:24.511 app[1857931f422e68] yul [info] INFO Starting init (commit: 5b8fb02)…
2024-04-11T15:30:24.576 app[1857931f422e68] yul [info] INFO Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
2024-04-11T15:30:24.579 app[1857931f422e68] yul [info] INFO Resized /data to 42945478656 bytes
2024-04-11T15:30:24.579 app[1857931f422e68] yul [info] INFO Preparing to run: docker-entrypoint.sh start as root
2024-04-11T15:30:24.586 app[1857931f422e68] yul [info] INFO [fly api proxy] listening at /.fly/api
2024-04-11T15:30:24.592 app[1857931f422e68] yul [info] 2024/04/11 15:30:24 listening on [fdaa:1:2ff7:a7b:144:8311:82e5:2]:22 (DNS: [fdaa::3]:53)
2024-04-11T15:30:25.133 runner[1857931f422e68] yul [info] Machine created and started in 9.547s
2024-04-11T15:30:25.920 app[1857931f422e68] yul [info] Provisioning standby
2024-04-11T15:30:25.998 app[1857931f422e68] yul [info] repmgr -h fdaa:1:2ff7:a7b:1d4:8afa:2bf7:2 -p 5433 -d repmgr -U repmgr -f /data/repmgr.conf standby clone -c -F
2024-04-11T15:30:27.633 app[1857931f422e68] yul [info] postgres | Running…
2024-04-11T15:30:27.633 app[1857931f422e68] yul [info] monitor | Running…
2024-04-11T15:30:27.640 app[1857931f422e68] yul [info] proxy | Running…
2024-04-11T15:30:27.641 app[1857931f422e68] yul [info] repmgrd | Running…
2024-04-11T15:30:27.641 app[1857931f422e68] yul [info] exporter | Running…
2024-04-11T15:30:27.641 app[1857931f422e68] yul [info] admin | Running…
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:27] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:27] [INFO] connecting to database "host=fdaa:1:2ff7:a7b:144:8311:82e5:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5"
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:27] [ERROR] connection to database failed
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:27] [DETAIL]
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | connection to server at "fdaa:1:2ff7:a7b:144:8311:82e5:2", port 5433 failed: Connection refused
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | Is the server running on that host and accepting TCP/IP connections?
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd |
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:27] [DETAIL] attempted to connect using:
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | user=repmgr connect_timeout=5 dbname=repmgr host=fdaa:1:2ff7:a7b:144:8311:82e5:2 port=5433 fallback_application_name=repmgr options=-csearch_path=
2024-04-11T15:30:27.688 app[1857931f422e68] yul [info] repmgrd | exit status 6
2024-04-11T15:30:27.688 app[1857931f422e68] yul [info] repmgrd | restarting in 5s [attempt 1]
2024-04-11T15:30:27.711 app[1857931f422e68] yul [info] monitor | Monitoring dead members
2024-04-11T15:30:27.711 app[1857931f422e68] yul [info] monitor | Monitoring cluster state
2024-04-11T15:30:27.711 app[1857931f422e68] yul [info] monitor | Monitoring replication slots
2024-04-11T15:30:27.713 app[1857931f422e68] yul [info] monitor | Pruning every 24h0m0s…
2024-04-11T15:30:27.754 app[1857931f422e68] yul [info] proxy | [NOTICE] (359) : New worker (381) forked
2024-04-11T15:30:27.754 app[1857931f422e68] yul [info] proxy | [NOTICE] (359) : Loading success.
2024-04-11T15:30:27.765 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : bk_db/pg1 changed its IP from (none) to fdaa:1:2ff7:a7b:1d4:8afa:2bf7:2 by flydns/dns1.
2024-04-11T15:30:27.765 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : Server bk_db/pg1 ('yyz.mydir-dev-pg-flex.internal') is UP/READY (resolves again).
2024-04-11T15:30:27.765 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2024-04-11T15:30:27.765 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : bk_db/pg2 changed its IP from (none) to fdaa:1:2ff7:a7b:1d4:d9f5:198b:2 by DNS cache.
2024-04-11T15:30:27.765 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : Server bk_db/pg2 ('yyz.mydir-dev-pg-flex.internal') is UP/READY (resolves again).
2024-04-11T15:30:27.765 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : Server bk_db/pg2 administratively READY thanks to valid DNS answer.
2024-04-11T15:30:27.798 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:27.797 UTC [346] LOG: starting PostgreSQL 15.6 (Debian 15.6-1.pgdg120+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-04-11T15:30:27.798 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:27.798 UTC [346] LOG: listening on IPv4 address "0.0.0.0", port 5433
2024-04-11T15:30:27.798 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:27.798 UTC [346] LOG: listening on IPv6 address "::", port 5433
2024-04-11T15:30:27.800 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:27.800 UTC [346] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433"
2024-04-11T15:30:27.805 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:27.803 UTC [386] LOG: database system was interrupted; last known up at 2024-04-11 15:30:26 UTC
2024-04-11T15:30:27.950 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : Server bk_db/pg2 is DOWN, reason: Layer7 invalid response, info: "HTTP content check did not match", check duration: 16ms. 1 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-04-11T15:30:28.408 app[1857931f422e68] yul [info] postgres | ERROR: Remote 'barman get-wal' command has failed!
2024-04-11T15:30:28.420 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:28.420 UTC [386] LOG: entering standby mode
2024-04-11T15:30:28.420 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:28.420 UTC [386] LOG: starting backup recovery with redo LSN 0/D000028, checkpoint LSN 0/D000060, on timeline ID 2
2024-04-11T15:30:28.637 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:28.637 UTC [406] FATAL: the database system is starting up
2024-04-11T15:30:28.637 app[1857931f422e68] yul [info] failed post-init: failed to establish connection to local node: failed to connect to host=fdaa:1:2ff7:a7b:144:8311:82e5:2 user=postgres database=postgres: server error (FATAL: the database system is starting up (SQLSTATE 57P03)). Retrying…
2024-04-11T15:30:28.898 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:28.897 UTC [386] LOG: restored log file "00000002.history" from archive
2024-04-11T15:30:29.576 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:29.575 UTC [417] FATAL: the database system is starting up
2024-04-11T15:30:29.578 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : Backup Server bk_db/pg is DOWN, reason: Layer7 invalid response, info: "HTTP content check did not match", check duration: 7ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-04-11T15:30:29.583 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:29.583 UTC [386] LOG: restored log file "00000002000000000000000D" from archive
2024-04-11T15:30:29.597 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:29.597 UTC [386] LOG: redo starts at 0/D000028
2024-04-11T15:30:29.637 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:29.636 UTC [421] FATAL: the database system is not yet accepting connections
2024-04-11T15:30:29.637 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:29.636 UTC [421] DETAIL: Consistent recovery state has not been yet reached.
2024-04-11T15:30:29.637 app[1857931f422e68] yul [info] failed post-init: failed to establish connection to local node: failed to connect to host=fdaa:1:2ff7:a7b:144:8311:82e5:2 user=postgres database=postgres: server error (FATAL: the database system is not yet accepting connections (SQLSTATE 57P03)). Retrying…
2024-04-11T15:30:30.275 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:30.275 UTC [386] LOG: restored log file "00000002000000000000000E" from archive
2024-04-11T15:30:30.283 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:30.282 UTC [386] LOG: completed backup recovery with redo LSN 0/D000028 and end LSN 0/D000100
2024-04-11T15:30:30.283 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:30.283 UTC [386] LOG: consistent recovery state reached at 0/D000100
2024-04-11T15:30:30.283 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:30.283 UTC [346] LOG: database system is ready to accept read-only connections
2024-04-11T15:30:30.283 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:30.283 UTC [386] LOG: recovery stopping before commit of transaction 959, time 2024-04-11 15:30:27.57071+00
2024-04-11T15:30:30.283 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:30.283 UTC [386] LOG: pausing at the end of recovery
2024-04-11T15:30:30.283 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:30.283 UTC [386] HINT: Execute pg_wal_replay_resume() to promote.
2024-04-11T15:30:30.690 app[1857931f422e68] yul [info] Registering standby
2024-04-11T15:30:32.688 app[1857931f422e68] yul [info] repmgrd | Running…
2024-04-11T15:30:32.718 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:32] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
2024-04-11T15:30:32.718 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:32] [INFO] connecting to database "host=fdaa:1:2ff7:a7b:144:8311:82e5:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5"
2024-04-11T15:30:32.742 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:32] [ERROR] no metadata record found for this node - terminating
2024-04-11T15:30:32.742 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:32] [HINT] check that 'repmgr (primary|standby|witness) register' was executed for this node
2024-04-11T15:30:32.742 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:32] [INFO] repmgrd terminating…
2024-04-11T15:30:32.743 app[1857931f422e68] yul [info] repmgrd | exit status 1
2024-04-11T15:30:32.743 app[1857931f422e68] yul [info] repmgrd | restarting in 5s [attempt 2]
2024-04-11T15:30:42.788 app[1857931f422e68] yul [info] repmgrd | Running…
2024-04-11T15:30:42.820 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:42] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
2024-04-11T15:30:42.820 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:42] [INFO] connecting to database "host=fdaa:1:2ff7:a7b:144:8311:82e5:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5"
2024-04-11T15:30:42.834 app[1857931f422e68] yul [info] repmgrd | exit status 1
2024-04-11T15:30:42.834 app[1857931f422e68] yul [info] repmgrd | restarting in 5s [attempt 4]
2024-04-11T15:30:42.834 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:42] [ERROR] no metadata record found for this node - terminating
2024-04-11T15:30:42.834 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:42] [HINT] check that 'repmgr (primary|standby|witness) register' was executed for this node
2024-04-11T15:30:42.834 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:42] [INFO] repmgrd terminating…
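
From the "pausing at the end of recovery" lines, I assume the clone is stuck at a leftover PITR recovery target instead of becoming a streaming standby. Something like this, run against the stuck clone, should confirm it (untested sketch; the auth and flags may differ):

$ fly ssh console -a mydir-dev-pg-flex -s     # pick the stuck clone
Then, inside the Machine:
$ psql -p 5433 -U postgres -c "SELECT pg_is_wal_replay_paused();"   # true would mean replay is paused
$ psql -p 5433 -U postgres -c "SHOW recovery_target_time;"          # any recovery target carried over from the barman restore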

I'm clearly missing something here, but a workaround is to create a new Postgres app and import from the recovered primary's connection string (roughly sketched below). Then I can create a new barman and scale. However, I lose my old barman backups if I want to clean up the abandoned pg app. Poor experience :confused:
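
For completeness, the workaround looks roughly like this (the new app name and connection string are placeholders, and the exact flags are from memory):

$ fly postgres create --name mydir-dev-pg-flex-new --region yul
$ fly postgres import "$RECOVERED_PRIMARY_CONNECTION_STRING" --app mydir-dev-pg-flex-new
After that I can set up a fresh barman for the new app and scale it out.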
