fly machine clone fails after barman postgres recovery

I performed a Point-in-Time Recovery using barman and postgres flex (following the instructions). Everything seemed to work, although I had to manually run barman -q cron before WAL receiving would start. I was able to clone my primary pg before the barman recovery, but I can't afterwards. Here's my (seemingly successful) barman check after the recovery:

➜ mydir git:(main) ✗ fly pg barman check -a mydir-dev-pg-flex
Connecting to fdaa:1:2ff7:a7b:1d4:6e63:fdce:2… complete
Server pg:
PostgreSQL: OK
superuser or standard user with backup privileges: OK
PostgreSQL streaming: OK
wal_level: OK
replication slot: OK
directories: OK
retention policy settings: OK
backup maximum age: OK (no last_backup_maximum_age provided)
backup minimum size: OK (29.6 MiB)
wal maximum age: OK (no last_wal_maximum_age provided)
wal size: OK (32.0 MiB)
compression settings: OK
failed backups: OK (there are 0 failed backups)
minimum redundancy requirements: OK (have 1 backups, expected at least 0)
pg_basebackup: OK
pg_basebackup compatible: OK
pg_basebackup supports tablespaces mapping: OK
systemid coherence: OK
pg_receivexlog: OK
pg_receivexlog compatible: OK
receive-wal running: OK
archiver errors: OK
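
For reference, getting WAL receiving going in the first place looked roughly like this (the exact ssh/selection flags are from memory, so treat it as a sketch rather than a recipe):

$ fly ssh console -a mydir-dev-pg-flex -s    # pick the barman Machine from the list
Then, inside the barman Machine:
$ barman -q cron       # kicks off receive-wal and the other background tasks
$ barman check pg      # after this, "receive-wal running" reports OK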

Then I try to clone my primary pg to scale horizontally, and it fails at health check 1/3.

➜ mydir git:(main) ✗ fly machine clone 3287356b764185 --region yul --app mydir-dev-pg-flex
Cloning Machine 3287356b764185 into region yul
Volume 'pg_data' will start empty
Provisioning a new Machine with image flyio/postgres-flex:15.6@sha256:92917e5770cef6666dddbf17061c2f95b9e19b8155be9a8ce8c35e09e5381167…
Machine 48ed431f717568 has been created…
Waiting for Machine 48ed431f717568 to start…
Waiting for 48ed431f717568 to become healthy (started, 1/3)
Error: error while watching health checks: context deadline exceeded

The machine status of the failed clone reveals:

pg	critical	500 Internal Server Error failed to resolve local member role: no rows in result set	2024-04-11 01:59:53
vm	passing	[✓] checkDisk: 872.36 MB (88.5%) free space on /data/ (54.82µs) [✓] checkLoad: load averages: 0.00 0.00 0.00 (61.14µs) [✓] memory: system spent 0s of the last 60s waiting on memory (38.13µs) [✓] cpu: system spent 444ms of the last 60s waiting on cpu (26.95µs) [✓] io: system spent 288ms of the last 60s waiting on io (12.8µs)	2024-04-11 01:59:47
role	critical	500 Internal Server Error failed to check role: no rows in result set
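
My reading of the "no rows in result set" errors is that the new Machine never shows up in repmgr's metadata. Something like the following should list the registered nodes (the fly pg connect flags and the repmgr.nodes columns are from memory and the repmgr docs, so consider it a sketch):

$ fly postgres connect -a mydir-dev-pg-flex -d repmgr
repmgr=# SELECT node_id, node_name, type, active FROM repmgr.nodes;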

Any help would be greatly appreciated. I'm not confident about moving this to production if I can't get the basics working in dev. Thanks!

And this is what my failing replica shows in the logs:

$ fly logs -a mydir-dev-pg-flex -i 1857931f422e68
Waiting for logs…
2024-04-11T15:30:15.607 runner[1857931f422e68] yul [info] Pulling container image docker-hub-mirror.fly.io/flyio/postgres-flex:15.6
2024-04-11T15:30:15.860 runner[1857931f422e68] yul [info] Successfully prepared image docker-hub-mirror.fly.io/flyio/postgres-flex:15.6 (253.11342ms)
2024-04-11T15:30:15.881 runner[1857931f422e68] yul [info] Setting up volume 'pg_data'
2024-04-11T15:30:15.881 runner[1857931f422e68] yul [info] Uninitialized volume 'pg_data', initializing…
2024-04-11T15:30:15.881 runner[1857931f422e68] yul [info] Encrypting volume
2024-04-11T15:30:21.593 runner[1857931f422e68] yul [info] Opening encrypted volume
2024-04-11T15:30:23.529 runner[1857931f422e68] yul [info] Formatting volume
2024-04-11T15:30:23.874 runner[1857931f422e68] yul [info] Configuring firecracker
2024-04-11T15:30:24.305 app[1857931f422e68] yul [info] [ 0.142977] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!
2024-04-11T15:30:24.333 app[1857931f422e68] yul [info] [ 0.148635] PCI: Fatal: No config space access function found
2024-04-11T15:30:24.511 app[1857931f422e68] yul [info] INFO Starting init (commit: 5b8fb02)…
2024-04-11T15:30:24.576 app[1857931f422e68] yul [info] INFO Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
2024-04-11T15:30:24.579 app[1857931f422e68] yul [info] INFO Resized /data to 42945478656 bytes
2024-04-11T15:30:24.579 app[1857931f422e68] yul [info] INFO Preparing to run: docker-entrypoint.sh start as root
2024-04-11T15:30:24.586 app[1857931f422e68] yul [info] INFO [fly api proxy] listening at /.fly/api
2024-04-11T15:30:24.592 app[1857931f422e68] yul [info] 2024/04/11 15:30:24 listening on [fdaa:1:2ff7:a7b:144:8311:82e5:2]:22 (DNS: [fdaa::3]:53)
2024-04-11T15:30:25.133 runner[1857931f422e68] yul [info] Machine created and started in 9.547s
2024-04-11T15:30:25.920 app[1857931f422e68] yul [info] Provisioning standby
2024-04-11T15:30:25.998 app[1857931f422e68] yul [info] repmgr -h fdaa:1:2ff7:a7b:1d4:8afa:2bf7:2 -p 5433 -d repmgr -U repmgr -f /data/repmgr.conf standby clone -c -F
2024-04-11T15:30:27.633 app[1857931f422e68] yul [info] postgres | Running…
2024-04-11T15:30:27.633 app[1857931f422e68] yul [info] monitor | Running…
2024-04-11T15:30:27.640 app[1857931f422e68] yul [info] proxy | Running…
2024-04-11T15:30:27.641 app[1857931f422e68] yul [info] repmgrd | Running…
2024-04-11T15:30:27.641 app[1857931f422e68] yul [info] exporter | Running…
2024-04-11T15:30:27.641 app[1857931f422e68] yul [info] admin | Running…
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:27] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:27] [INFO] connecting to database "host=fdaa:1:2ff7:a7b:144:8311:82e5:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5"
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:27] [ERROR] connection to database failed
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:27] [DETAIL]
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | connection to server at "fdaa:1:2ff7:a7b:144:8311:82e5:2", port 5433 failed: Connection refused
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | Is the server running on that host and accepting TCP/IP connections?
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd |
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:27] [DETAIL] attempted to connect using:
2024-04-11T15:30:27.687 app[1857931f422e68] yul [info] repmgrd | user=repmgr connect_timeout=5 dbname=repmgr host=fdaa:1:2ff7:a7b:144:8311:82e5:2 port=5433 fallback_application_name=repmgr options=-csearch_path=
2024-04-11T15:30:27.688 app[1857931f422e68] yul [info] repmgrd | exit status 6
2024-04-11T15:30:27.688 app[1857931f422e68] yul [info] repmgrd | restarting in 5s [attempt 1]
2024-04-11T15:30:27.711 app[1857931f422e68] yul [info] monitor | Monitoring dead members
2024-04-11T15:30:27.711 app[1857931f422e68] yul [info] monitor | Monitoring cluster state
2024-04-11T15:30:27.711 app[1857931f422e68] yul [info] monitor | Monitoring replication slots
2024-04-11T15:30:27.713 app[1857931f422e68] yul [info] monitor | Pruning every 24h0m0s…
2024-04-11T15:30:27.754 app[1857931f422e68] yul [info] proxy | [NOTICE] (359) : New worker (381) forked
2024-04-11T15:30:27.754 app[1857931f422e68] yul [info] proxy | [NOTICE] (359) : Loading success.
2024-04-11T15:30:27.765 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : bk_db/pg1 changed its IP from (none) to fdaa:1:2ff7:a7b:1d4:8afa:2bf7:2 by flydns/dns1.
2024-04-11T15:30:27.765 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : Server bk_db/pg1 ('yyz.mydir-dev-pg-flex.internal') is UP/READY (resolves again).
2024-04-11T15:30:27.765 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2024-04-11T15:30:27.765 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : bk_db/pg2 changed its IP from (none) to fdaa:1:2ff7:a7b:1d4:d9f5:198b:2 by DNS cache.
2024-04-11T15:30:27.765 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : Server bk_db/pg2 ('yyz.mydir-dev-pg-flex.internal') is UP/READY (resolves again).
2024-04-11T15:30:27.765 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : Server bk_db/pg2 administratively READY thanks to valid DNS answer.
2024-04-11T15:30:27.798 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:27.797 UTC [346] LOG: starting PostgreSQL 15.6 (Debian 15.6-1.pgdg120+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-04-11T15:30:27.798 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:27.798 UTC [346] LOG: listening on IPv4 address "0.0.0.0", port 5433
2024-04-11T15:30:27.798 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:27.798 UTC [346] LOG: listening on IPv6 address "::", port 5433
2024-04-11T15:30:27.800 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:27.800 UTC [346] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433"
2024-04-11T15:30:27.805 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:27.803 UTC [386] LOG: database system was interrupted; last known up at 2024-04-11 15:30:26 UTC
2024-04-11T15:30:27.950 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : Server bk_db/pg2 is DOWN, reason: Layer7 invalid response, info: "HTTP content check did not match", check duration: 16ms. 1 active and 1 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-04-11T15:30:28.408 app[1857931f422e68] yul [info] postgres | ERROR: Remote 'barman get-wal' command has failed!
2024-04-11T15:30:28.420 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:28.420 UTC [386] LOG: entering standby mode
2024-04-11T15:30:28.420 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:28.420 UTC [386] LOG: starting backup recovery with redo LSN 0/D000028, checkpoint LSN 0/D000060, on timeline ID 2
2024-04-11T15:30:28.637 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:28.637 UTC [406] FATAL: the database system is starting up
2024-04-11T15:30:28.637 app[1857931f422e68] yul [info] failed post-init: failed to establish connection to local node: failed to connect to host=fdaa:1:2ff7:a7b:144:8311:82e5:2 user=postgres database=postgres: server error (FATAL: the database system is starting up (SQLSTATE 57P03)). Retrying…
2024-04-11T15:30:28.898 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:28.897 UTC [386] LOG: restored log file "00000002.history" from archive
2024-04-11T15:30:29.576 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:29.575 UTC [417] FATAL: the database system is starting up
2024-04-11T15:30:29.578 app[1857931f422e68] yul [info] proxy | [WARNING] (381) : Backup Server bk_db/pg is DOWN, reason: Layer7 invalid response, info: "HTTP content check did not match", check duration: 7ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2024-04-11T15:30:29.583 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:29.583 UTC [386] LOG: restored log file "00000002000000000000000D" from archive
2024-04-11T15:30:29.597 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:29.597 UTC [386] LOG: redo starts at 0/D000028
2024-04-11T15:30:29.637 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:29.636 UTC [421] FATAL: the database system is not yet accepting connections
2024-04-11T15:30:29.637 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:29.636 UTC [421] DETAIL: Consistent recovery state has not been yet reached.
2024-04-11T15:30:29.637 app[1857931f422e68] yul [info] failed post-init: failed to establish connection to local node: failed to connect to host=fdaa:1:2ff7:a7b:144:8311:82e5:2 user=postgres database=postgres: server error (FATAL: the database system is not yet accepting connections (SQLSTATE 57P03)). Retrying…
2024-04-11T15:30:30.275 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:30.275 UTC [386] LOG: restored log file "00000002000000000000000E" from archive
2024-04-11T15:30:30.283 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:30.282 UTC [386] LOG: completed backup recovery with redo LSN 0/D000028 and end LSN 0/D000100
2024-04-11T15:30:30.283 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:30.283 UTC [386] LOG: consistent recovery state reached at 0/D000100
2024-04-11T15:30:30.283 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:30.283 UTC [346] LOG: database system is ready to accept read-only connections
2024-04-11T15:30:30.283 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:30.283 UTC [386] LOG: recovery stopping before commit of transaction 959, time 2024-04-11 15:30:27.57071+00
2024-04-11T15:30:30.283 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:30.283 UTC [386] LOG: pausing at the end of recovery
2024-04-11T15:30:30.283 app[1857931f422e68] yul [info] postgres | 2024-04-11 15:30:30.283 UTC [386] HINT: Execute pg_wal_replay_resume() to promote.
2024-04-11T15:30:30.690 app[1857931f422e68] yul [info] Registering standby
2024-04-11T15:30:32.688 app[1857931f422e68] yul [info] repmgrd | Running…
2024-04-11T15:30:32.718 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:32] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
2024-04-11T15:30:32.718 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:32] [INFO] connecting to database "host=fdaa:1:2ff7:a7b:144:8311:82e5:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5"
2024-04-11T15:30:32.742 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:32] [ERROR] no metadata record found for this node - terminating
2024-04-11T15:30:32.742 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:32] [HINT] check that 'repmgr (primary|standby|witness) register' was executed for this node
2024-04-11T15:30:32.742 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:32] [INFO] repmgrd terminating…
2024-04-11T15:30:32.743 app[1857931f422e68] yul [info] repmgrd | exit status 1
2024-04-11T15:30:32.743 app[1857931f422e68] yul [info] repmgrd | restarting in 5s [attempt 2]
2024-04-11T15:30:42.788 app[1857931f422e68] yul [info] repmgrd | Running…
2024-04-11T15:30:42.820 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:42] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
2024-04-11T15:30:42.820 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:42] [INFO] connecting to database "host=fdaa:1:2ff7:a7b:144:8311:82e5:2 port=5433 user=repmgr dbname=repmgr connect_timeout=5"
2024-04-11T15:30:42.834 app[1857931f422e68] yul [info] repmgrd | exit status 1
2024-04-11T15:30:42.834 app[1857931f422e68] yul [info] repmgrd | restarting in 5s [attempt 4]
2024-04-11T15:30:42.834 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:42] [ERROR] no metadata record found for this node - terminating
2024-04-11T15:30:42.834 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:42] [HINT] check that 'repmgr (primary|standby|witness) register' was executed for this node
2024-04-11T15:30:42.834 app[1857931f422e68] yul [info] repmgrd | [2024-04-11 15:30:42] [INFO] repmgrd terminating…
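
From the "pausing at the end of recovery" lines, I assume the clone is stuck at a leftover PITR recovery target instead of becoming a streaming standby. Something like this, run against the stuck clone, should confirm it (untested sketch; the auth and flags may differ):

$ fly ssh console -a mydir-dev-pg-flex -s     # pick the stuck clone
Then, inside the Machine:
$ psql -p 5433 -U postgres -c "SELECT pg_is_wal_replay_paused();"   # true would mean replay is paused
$ psql -p 5433 -U postgres -c "SHOW recovery_target_time;"          # any recovery target carried over from the barman restore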

I'm clearly missing something here, but a workaround is to create a new Postgres app and import from the recovered primary's connection string (roughly sketched below). Then I can create a new barman and scale. However, I lose my old barman backups if I want to clean up the abandoned pg app. Poor experience :confused:
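
For completeness, the workaround looks roughly like this (the new app name and connection string are placeholders, and the exact flags are from memory):

$ fly postgres create --name mydir-dev-pg-flex-new --region yul
$ fly postgres import "$RECOVERED_PRIMARY_CONNECTION_STRING" --app mydir-dev-pg-flex-new
After that I can set up a fresh barman for the new app and scale it out.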
