My Postgres app went down over the night and I can’t recover it. I’d really like to restore this without creating a new app from a snapshot, because that would be 18 hours of data loss, minimum. I’ve tried:
- pg restart
- image update
- scaling up
- extending the volume (this didn’t work because
Error failed to extend volume: upstream service is unavailable
)
Now it’s stuck in a “Dead” state: Failed deployment v5 Failed due to unhealthy allocations.
Here are the logs from the last failed deployment:
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "maintenance_work_mem": "64MB",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "max_connections": "300",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "max_parallel_workers": "8",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "max_parallel_workers_per_gather": "2",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "max_worker_processes": "8",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "random_page_cost": "1.1",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "shared_buffers": "128MB",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "wal_compression": "on",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "work_mem": "8MB"
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] },
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "maxStandbysPerSender": 50,
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "deadKeeperRemovalInterval": "1h"
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] }
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] generated new config
2023-01-10T14:18:05.156 app[76f8ad74] ord [info] keeper | Running...
2023-01-10T14:18:05.163 app[76f8ad74] ord [info] sentinel | Running...
2023-01-10T14:18:05.163 app[76f8ad74] ord [info] proxy | Running...
2023-01-10T14:18:05.167 app[76f8ad74] ord [info] exporter | Running...
2023-01-10T14:18:05.386 app[76f8ad74] ord [info] exporter | INFO[0000] Starting Server: :9187 source="postgres_exporter.go:1837"
2023-01-10T14:18:05.431 app[76f8ad74] ord [info] proxy | [WARNING] 009/141805 (545) : parsing [/fly/haproxy.cfg:38]: Missing LF on last line, file might have been truncated at position 96. This will become a hard error in HAProxy 2.3.
2023-01-10T14:18:05.466 app[76f8ad74] ord [info] proxy | [NOTICE] 009/141805 (545) : New worker #1 (569) forked
2023-01-10T14:18:06.156 app[76f8ad74] ord [info] checking stolon status
2023-01-10T14:18:06.488 app[76f8ad74] ord [info] proxy | [WARNING] 009/141806 (569) : bk_db/pg1 changed its IP from (none) to fdaa:0:6d7c:a7b:9ada:1:198f:2 by flydns/dns1.
2023-01-10T14:18:06.488 app[76f8ad74] ord [info] proxy | [WARNING] 009/141806 (569) : Server bk_db/pg1 ('ord.unify-db.internal') is UP/READY (resolves again).
2023-01-10T14:18:06.488 app[76f8ad74] ord [info] proxy | [WARNING] 009/141806 (569) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2023-01-10T14:18:12.248 app[76f8ad74] ord [info] proxy | [WARNING] 009/141812 (569) : Backup Server bk_db/pg is DOWN, reason: Layer7 timeout, check duration: 5001ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2023-01-10T14:18:12.429 app[76f8ad74] ord [info] proxy | [WARNING] 009/141812 (569) : Server bk_db/pg1 is DOWN, reason: Layer7 timeout, check duration: 5000ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2023-01-10T14:18:12.429 app[76f8ad74] ord [info] proxy | [NOTICE] 009/141812 (569) : haproxy version is 2.2.9-2+deb11u3
2023-01-10T14:18:12.429 app[76f8ad74] ord [info] proxy | [NOTICE] 009/141812 (569) : path to executable is /usr/sbin/haproxy
2023-01-10T14:18:12.429 app[76f8ad74] ord [info] proxy | [ALERT] 009/141812 (569) : backend 'bk_db' has no server available!
2023-01-10T14:19:06.622 app[76f8ad74] ord [info] keeper | 2023-01-10T14:19:06.620Z FATAL cmd/keeper.go:2118 cannot create keeper: cannot create store: cannot create kv store: Put "https://consul-iad-2.fly-shared.net/v1/catalog/register?wait=5000ms": EOF
2023-01-10T14:19:06.630 app[76f8ad74] ord [info] panic: error checking stolon status: cannot create kv store: Put "https://consul-iad-2.fly-shared.net/v1/catalog/register?wait=5000ms": EOF
2023-01-10T14:19:06.630 app[76f8ad74] ord [info] : exit status 1
2023-01-10T14:19:06.630 app[76f8ad74] ord [info] goroutine 9 [running]:
2023-01-10T14:19:06.630 app[76f8ad74] ord [info] main.main.func2(0xc0000d0000, 0xc000084a80)
2023-01-10T14:19:06.630 app[76f8ad74] ord [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:81 +0x72c
2023-01-10T14:19:06.630 app[76f8ad74] ord [info] created by main.main
2023-01-10T14:19:06.630 app[76f8ad74] ord [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:72 +0x43b
2023-01-10T14:19:07.080 app[76f8ad74] ord [info] Starting clean up.
2023-01-10T14:19:07.081 app[76f8ad74] ord [info] Umounting /dev/vdc from /data
2023-01-10T14:19:14.178 runner[76f8ad74] ord [info] Starting instance
2023-01-10T14:19:23.801 runner[76f8ad74] ord [info] Configuring virtual machine
2023-01-10T14:19:23.803 runner[76f8ad74] ord [info] Pulling container image
2023-01-10T14:19:24.008 runner[76f8ad74] ord [info] Unpacking image
2023-01-10T14:19:24.021 runner[76f8ad74] ord [info] Preparing kernel init
2023-01-10T14:19:24.412 runner[76f8ad74] ord [info] Setting up volume 'pg_data'
2023-01-10T14:19:25.183 runner[76f8ad74] ord [info] Configuring firecracker
2023-01-10T14:19:25.222 runner[76f8ad74] ord [info] Starting virtual machine
2023-01-10T14:19:25.963 app[76f8ad74] ord [info] Starting init (commit: f447594)...
2023-01-10T14:19:26.020 app[76f8ad74] ord [info] Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755
2023-01-10T14:19:26.033 app[76f8ad74] ord [info] Preparing to run: `docker-entrypoint.sh start` as root
2023-01-10T14:19:26.071 app[76f8ad74] ord [info] 2023/01/10 14:19:26 listening on [fdaa:0:6d7c:a7b:9ada:1:198f:2]:22 (DNS: [fdaa::3]:53)
2023-01-10T14:19:26.218 app[76f8ad74] ord [info] cluster spec filename /fly/cluster-spec.json
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] system memory: 512mb vcpu count: 1
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] {
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "initMode": "existing",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "existingConfig": {
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "keeperUID": "9ada1198f2"
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] },
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "pgParameters": {
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "archive_command": "if [ $ENABLE_WALG ]; then /usr/local/bin/wal-g wal-push \"%p\"; fi",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "archive_mode": "on",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "archive_timeout": "60",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "effective_cache_size": "384MB",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "effective_io_concurrency": "200",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "maintenance_work_mem": "64MB",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "max_connections": "300",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "max_parallel_workers": "8",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "max_parallel_workers_per_gather": "2",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "max_worker_processes": "8",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "random_page_cost": "1.1",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "shared_buffers": "128MB",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "wal_compression": "on",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "work_mem": "8MB"
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] },
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "maxStandbysPerSender": 50,
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "deadKeeperRemovalInterval": "1h"
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] }
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] generated new config
2023-01-10T14:19:26.221 app[76f8ad74] ord [info] keeper | Running...
2023-01-10T14:19:26.224 app[76f8ad74] ord [info] sentinel | Running...
2023-01-10T14:19:26.224 app[76f8ad74] ord [info] proxy | Running...
2023-01-10T14:19:26.229 app[76f8ad74] ord [info] exporter | Running...
2023-01-10T14:19:26.526 app[76f8ad74] ord [info] exporter | INFO[0000] Starting Server: :9187 source="postgres_exporter.go:1837"
2023-01-10T14:19:26.557 app[76f8ad74] ord [info] proxy | [WARNING] 009/141926 (545) : parsing [/fly/haproxy.cfg:38]: Missing LF on last line, file might have been truncated at position 96. This will become a hard error in HAProxy 2.3.
2023-01-10T14:19:26.594 app[76f8ad74] ord [info] proxy | [NOTICE] 009/141926 (545) : New worker #1 (569) forked
2023-01-10T14:19:27.220 app[76f8ad74] ord [info] checking stolon status
2023-01-10T14:19:27.610 app[76f8ad74] ord [info] proxy | [WARNING] 009/141927 (569) : bk_db/pg1 changed its IP from (none) to fdaa:0:6d7c:a7b:9ada:1:198f:2 by flydns/dns1.
2023-01-10T14:19:27.610 app[76f8ad74] ord [info] proxy | [WARNING] 009/141927 (569) : Server bk_db/pg1 ('ord.unify-db.internal') is UP/READY (resolves again).
2023-01-10T14:19:27.610 app[76f8ad74] ord [info] proxy | [WARNING] 009/141927 (569) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2023-01-10T14:19:33.377 app[76f8ad74] ord [info] proxy | [WARNING] 009/141933 (569) : Backup Server bk_db/pg is DOWN, reason: Layer7 timeout, check duration: 5000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2023-01-10T14:19:33.558 app[76f8ad74] ord [info] proxy | [WARNING] 009/141933 (569) : Server bk_db/pg1 is DOWN, reason: Layer7 timeout, check duration: 5001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2023-01-10T14:19:33.558 app[76f8ad74] ord [info] proxy | [NOTICE] 009/141933 (569) : haproxy version is 2.2.9-2+deb11u3
2023-01-10T14:19:33.559 app[76f8ad74] ord [info] proxy | [NOTICE] 009/141933 (569) : path to executable is /usr/sbin/haproxy
2023-01-10T14:19:33.559 app[76f8ad74] ord [info] proxy | [ALERT] 009/141933 (569) : backend 'bk_db' has no server available!
2023-01-10T14:20:27.756 app[76f8ad74] ord [info] keeper | 2023-01-10T14:20:27.754Z FATAL cmd/keeper.go:2118 cannot create keeper: cannot create store: cannot create kv store: Put "https://consul-iad-2.fly-shared.net/v1/catalog/register?wait=5000ms": EOF
2023-01-10T14:20:27.758 app[76f8ad74] ord [info] keeper | exit status 1
2023-01-10T14:20:27.758 app[76f8ad74] ord [info] keeper | restarting in 5s [attempt 1]
2023-01-10T14:20:28.132 app[76f8ad74] ord [info] Starting clean up.
2023-01-10T14:20:28.134 app[76f8ad74] ord [info] Umounting /dev/vdc from /data