Postgres DB Dead and Can't Create a New Deployment

My Postgres app went down overnight and I can't recover it. I'd really like to restore it without creating a new app from a snapshot, because that would mean at least 18 hours of data loss. I've tried:

  • pg restart
  • image update
  • scaling up
  • extending the volume (this didn't work; it failed with "Error failed to extend volume: upstream service is unavailable")
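For reference, these are roughly the commands I ran (a sketch; my-db is a placeholder for the actual Postgres app name, and the volume ID comes from fly volumes list):

```shell
# Restart the Postgres instance
fly postgres restart --app my-db

# Update to the latest Postgres image
fly image update --app my-db

# Scale the VM up to a larger size
fly scale vm dedicated-cpu-1x --app my-db

# Extend the data volume (this is the step that failed for me)
fly volumes extend <volume-id> -s 20 --app my-db
```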

Now it’s stuck in a “Dead” state with the message: “Failed deployment v5: Failed due to unhealthy allocations.”

Here are the logs from the last failed deployment:

2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "maintenance_work_mem": "64MB",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "max_connections": "300",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "max_parallel_workers": "8",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "max_parallel_workers_per_gather": "2",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "max_worker_processes": "8",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "random_page_cost": "1.1",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "shared_buffers": "128MB",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "wal_compression": "on",
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "work_mem": "8MB"
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] },
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "maxStandbysPerSender": 50,
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] "deadKeeperRemovalInterval": "1h"
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] }
2023-01-10T14:18:05.154 app[76f8ad74] ord [info] generated new config
2023-01-10T14:18:05.156 app[76f8ad74] ord [info] keeper | Running...
2023-01-10T14:18:05.163 app[76f8ad74] ord [info] sentinel | Running...
2023-01-10T14:18:05.163 app[76f8ad74] ord [info] proxy | Running...
2023-01-10T14:18:05.167 app[76f8ad74] ord [info] exporter | Running...
2023-01-10T14:18:05.386 app[76f8ad74] ord [info] exporter | INFO[0000] Starting Server: :9187 source="postgres_exporter.go:1837"
2023-01-10T14:18:05.431 app[76f8ad74] ord [info] proxy | [WARNING] 009/141805 (545) : parsing [/fly/haproxy.cfg:38]: Missing LF on last line, file might have been truncated at position 96. This will become a hard error in HAProxy 2.3.
2023-01-10T14:18:05.466 app[76f8ad74] ord [info] proxy | [NOTICE] 009/141805 (545) : New worker #1 (569) forked
2023-01-10T14:18:06.156 app[76f8ad74] ord [info] checking stolon status
2023-01-10T14:18:06.488 app[76f8ad74] ord [info] proxy | [WARNING] 009/141806 (569) : bk_db/pg1 changed its IP from (none) to fdaa:0:6d7c:a7b:9ada:1:198f:2 by flydns/dns1.
2023-01-10T14:18:06.488 app[76f8ad74] ord [info] proxy | [WARNING] 009/141806 (569) : Server bk_db/pg1 ('ord.unify-db.internal') is UP/READY (resolves again).
2023-01-10T14:18:06.488 app[76f8ad74] ord [info] proxy | [WARNING] 009/141806 (569) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2023-01-10T14:18:12.248 app[76f8ad74] ord [info] proxy | [WARNING] 009/141812 (569) : Backup Server bk_db/pg is DOWN, reason: Layer7 timeout, check duration: 5001ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2023-01-10T14:18:12.429 app[76f8ad74] ord [info] proxy | [WARNING] 009/141812 (569) : Server bk_db/pg1 is DOWN, reason: Layer7 timeout, check duration: 5000ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2023-01-10T14:18:12.429 app[76f8ad74] ord [info] proxy | [NOTICE] 009/141812 (569) : haproxy version is 2.2.9-2+deb11u3
2023-01-10T14:18:12.429 app[76f8ad74] ord [info] proxy | [NOTICE] 009/141812 (569) : path to executable is /usr/sbin/haproxy
2023-01-10T14:18:12.429 app[76f8ad74] ord [info] proxy | [ALERT] 009/141812 (569) : backend 'bk_db' has no server available!
2023-01-10T14:19:06.622 app[76f8ad74] ord [info] keeper | 2023-01-10T14:19:06.620Z FATAL cmd/keeper.go:2118 cannot create keeper: cannot create store: cannot create kv store: Put "https://consul-iad-2.fly-shared.net/v1/catalog/register?wait=5000ms": EOF
2023-01-10T14:19:06.630 app[76f8ad74] ord [info] panic: error checking stolon status: cannot create kv store: Put "https://consul-iad-2.fly-shared.net/v1/catalog/register?wait=5000ms": EOF
2023-01-10T14:19:06.630 app[76f8ad74] ord [info] : exit status 1
2023-01-10T14:19:06.630 app[76f8ad74] ord [info] goroutine 9 [running]:
2023-01-10T14:19:06.630 app[76f8ad74] ord [info] main.main.func2(0xc0000d0000, 0xc000084a80)
2023-01-10T14:19:06.630 app[76f8ad74] ord [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:81 +0x72c
2023-01-10T14:19:06.630 app[76f8ad74] ord [info] created by main.main
2023-01-10T14:19:06.630 app[76f8ad74] ord [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:72 +0x43b
2023-01-10T14:19:07.080 app[76f8ad74] ord [info] Starting clean up.
2023-01-10T14:19:07.081 app[76f8ad74] ord [info] Umounting /dev/vdc from /data
2023-01-10T14:19:14.178 runner[76f8ad74] ord [info] Starting instance
2023-01-10T14:19:23.801 runner[76f8ad74] ord [info] Configuring virtual machine
2023-01-10T14:19:23.803 runner[76f8ad74] ord [info] Pulling container image
2023-01-10T14:19:24.008 runner[76f8ad74] ord [info] Unpacking image
2023-01-10T14:19:24.021 runner[76f8ad74] ord [info] Preparing kernel init
2023-01-10T14:19:24.412 runner[76f8ad74] ord [info] Setting up volume 'pg_data'
2023-01-10T14:19:25.183 runner[76f8ad74] ord [info] Configuring firecracker
2023-01-10T14:19:25.222 runner[76f8ad74] ord [info] Starting virtual machine
2023-01-10T14:19:25.963 app[76f8ad74] ord [info] Starting init (commit: f447594)...
2023-01-10T14:19:26.020 app[76f8ad74] ord [info] Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755
2023-01-10T14:19:26.033 app[76f8ad74] ord [info] Preparing to run: `docker-entrypoint.sh start` as root
2023-01-10T14:19:26.071 app[76f8ad74] ord [info] 2023/01/10 14:19:26 listening on [fdaa:0:6d7c:a7b:9ada:1:198f:2]:22 (DNS: [fdaa::3]:53)
2023-01-10T14:19:26.218 app[76f8ad74] ord [info] cluster spec filename /fly/cluster-spec.json
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] system memory: 512mb vcpu count: 1
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] {
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "initMode": "existing",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "existingConfig": {
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "keeperUID": "9ada1198f2"
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] },
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "pgParameters": {
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "archive_command": "if [ $ENABLE_WALG ]; then /usr/local/bin/wal-g wal-push \"%p\"; fi",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "archive_mode": "on",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "archive_timeout": "60",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "effective_cache_size": "384MB",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "effective_io_concurrency": "200",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "maintenance_work_mem": "64MB",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "max_connections": "300",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "max_parallel_workers": "8",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "max_parallel_workers_per_gather": "2",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "max_worker_processes": "8",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "random_page_cost": "1.1",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "shared_buffers": "128MB",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "wal_compression": "on",
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "work_mem": "8MB"
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] },
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "maxStandbysPerSender": 50,
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] "deadKeeperRemovalInterval": "1h"
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] }
2023-01-10T14:19:26.219 app[76f8ad74] ord [info] generated new config
2023-01-10T14:19:26.221 app[76f8ad74] ord [info] keeper | Running...
2023-01-10T14:19:26.224 app[76f8ad74] ord [info] sentinel | Running...
2023-01-10T14:19:26.224 app[76f8ad74] ord [info] proxy | Running...
2023-01-10T14:19:26.229 app[76f8ad74] ord [info] exporter | Running...
2023-01-10T14:19:26.526 app[76f8ad74] ord [info] exporter | INFO[0000] Starting Server: :9187 source="postgres_exporter.go:1837"
2023-01-10T14:19:26.557 app[76f8ad74] ord [info] proxy | [WARNING] 009/141926 (545) : parsing [/fly/haproxy.cfg:38]: Missing LF on last line, file might have been truncated at position 96. This will become a hard error in HAProxy 2.3.
2023-01-10T14:19:26.594 app[76f8ad74] ord [info] proxy | [NOTICE] 009/141926 (545) : New worker #1 (569) forked
2023-01-10T14:19:27.220 app[76f8ad74] ord [info] checking stolon status
2023-01-10T14:19:27.610 app[76f8ad74] ord [info] proxy | [WARNING] 009/141927 (569) : bk_db/pg1 changed its IP from (none) to fdaa:0:6d7c:a7b:9ada:1:198f:2 by flydns/dns1.
2023-01-10T14:19:27.610 app[76f8ad74] ord [info] proxy | [WARNING] 009/141927 (569) : Server bk_db/pg1 ('ord.unify-db.internal') is UP/READY (resolves again).
2023-01-10T14:19:27.610 app[76f8ad74] ord [info] proxy | [WARNING] 009/141927 (569) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2023-01-10T14:19:33.377 app[76f8ad74] ord [info] proxy | [WARNING] 009/141933 (569) : Backup Server bk_db/pg is DOWN, reason: Layer7 timeout, check duration: 5000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2023-01-10T14:19:33.558 app[76f8ad74] ord [info] proxy | [WARNING] 009/141933 (569) : Server bk_db/pg1 is DOWN, reason: Layer7 timeout, check duration: 5001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2023-01-10T14:19:33.558 app[76f8ad74] ord [info] proxy | [NOTICE] 009/141933 (569) : haproxy version is 2.2.9-2+deb11u3
2023-01-10T14:19:33.559 app[76f8ad74] ord [info] proxy | [NOTICE] 009/141933 (569) : path to executable is /usr/sbin/haproxy
2023-01-10T14:19:33.559 app[76f8ad74] ord [info] proxy | [ALERT] 009/141933 (569) : backend 'bk_db' has no server available!
2023-01-10T14:20:27.756 app[76f8ad74] ord [info] keeper | 2023-01-10T14:20:27.754Z FATAL cmd/keeper.go:2118 cannot create keeper: cannot create store: cannot create kv store: Put "https://consul-iad-2.fly-shared.net/v1/catalog/register?wait=5000ms": EOF
2023-01-10T14:20:27.758 app[76f8ad74] ord [info] keeper | exit status 1
2023-01-10T14:20:27.758 app[76f8ad74] ord [info] keeper | restarting in 5s [attempt 1]
2023-01-10T14:20:28.132 app[76f8ad74] ord [info] Starting clean up.
2023-01-10T14:20:28.134 app[76f8ad74] ord [info] Umounting /dev/vdc from /data

I tried scaling again to trigger a deployment, and now I’m also seeing this in the logs:

2023-01-10T15:09:52.012 app[f3945c19] ord [info] exporter | ERRO[0013] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:6d7c:a7b:9ada:1:198f:2]:5433/postgres?sslmode=disable): dial tcp [fdaa:0:6d7c:a7b:9ada:1:198f:2]:5433: connect: connection refused source="postgres_exporter.go:1658"

Same problem here.

Have you had any luck fixing it? I haven’t been able to and I haven’t heard back from support yet.

I still have this issue and can’t solve it.

Mine is now working…though I didn’t do anything to fix it.

I’d love to know what happened, both to prevent it from occurring again and to find the best path to remediation.

Here are the last logs from the deployment:

2023-01-10T18:00:11.863 app[8717bc92] ord [info] Starting init (commit: b8364bb)...
2023-01-10T18:00:11.907 app[8717bc92] ord [info] Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755
2023-01-10T18:00:11.935 app[8717bc92] ord [info] Preparing to run: `docker-entrypoint.sh start` as root
2023-01-10T18:00:11.980 app[8717bc92] ord [info] 2023/01/10 18:00:11 listening on [fdaa:0:6d7c:a7b:9ada:1:198f:2]:22 (DNS: [fdaa::3]:53)
2023-01-10T18:00:12.140 app[8717bc92] ord [info] cluster spec filename /fly/cluster-spec.json
2023-01-10T18:00:12.141 app[8717bc92] ord [info] system memory: 512mb vcpu count: 1
2023-01-10T18:00:12.141 app[8717bc92] ord [info] {
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "initMode": "existing",
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "existingConfig": {
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "keeperUID": "9ada1198f2"
2023-01-10T18:00:12.141 app[8717bc92] ord [info] },
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "pgParameters": {
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "archive_command": "if [ $ENABLE_WALG ]; then /usr/local/bin/wal-g wal-push \"%p\"; fi",
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "archive_mode": "on",
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "archive_timeout": "60",
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "effective_cache_size": "384MB",
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "effective_io_concurrency": "200",
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "maintenance_work_mem": "64MB",
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "max_connections": "300",
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "max_parallel_workers": "8",
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "max_parallel_workers_per_gather": "2",
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "max_worker_processes": "8",
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "random_page_cost": "1.1",
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "shared_buffers": "128MB",
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "wal_compression": "on",
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "work_mem": "8MB"
2023-01-10T18:00:12.141 app[8717bc92] ord [info] },
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "maxStandbysPerSender": 50,
2023-01-10T18:00:12.141 app[8717bc92] ord [info] "deadKeeperRemovalInterval": "1h"
2023-01-10T18:00:12.141 app[8717bc92] ord [info] }
2023-01-10T18:00:12.141 app[8717bc92] ord [info] generated new config
2023-01-10T18:00:12.143 app[8717bc92] ord [info] keeper | Running...
2023-01-10T18:00:12.146 app[8717bc92] ord [info] sentinel | Running...
2023-01-10T18:00:12.146 app[8717bc92] ord [info] proxy | Running...
2023-01-10T18:00:12.152 app[8717bc92] ord [info] exporter | Running...
2023-01-10T18:00:12.606 app[8717bc92] ord [info] exporter | INFO[0000] Starting Server: :9187 source="postgres_exporter.go:1837"
2023-01-10T18:00:12.642 app[8717bc92] ord [info] proxy | [WARNING] 009/180012 (545) : parsing [/fly/haproxy.cfg:38]: Missing LF on last line, file might have been truncated at position 96. This will become a hard error in HAProxy 2.3.
2023-01-10T18:00:12.693 app[8717bc92] ord [info] proxy | [NOTICE] 009/180012 (545) : New worker #1 (568) forked
2023-01-10T18:00:13.142 app[8717bc92] ord [info] checking stolon status
2023-01-10T18:00:13.715 app[8717bc92] ord [info] proxy | [WARNING] 009/180013 (568) : bk_db/pg1 changed its IP from (none) to fdaa:0:6d7c:a7b:9ada:1:198f:2 by flydns/dns1.
2023-01-10T18:00:13.715 app[8717bc92] ord [info] proxy | [WARNING] 009/180013 (568) : Server bk_db/pg1 ('ord.unify-db.internal') is UP/READY (resolves again).
2023-01-10T18:00:13.715 app[8717bc92] ord [info] proxy | [WARNING] 009/180013 (568) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.
2023-01-10T18:00:19.461 app[8717bc92] ord [info] proxy | [WARNING] 009/180019 (568) : Backup Server bk_db/pg is DOWN, reason: Layer7 timeout, check duration: 5000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2023-01-10T18:00:19.645 app[8717bc92] ord [info] proxy | [WARNING] 009/180019 (568) : Server bk_db/pg1 is DOWN, reason: Layer7 timeout, check duration: 5001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2023-01-10T18:00:19.645 app[8717bc92] ord [info] proxy | [NOTICE] 009/180019 (568) : haproxy version is 2.2.9-2+deb11u3
2023-01-10T18:00:19.645 app[8717bc92] ord [info] proxy | [NOTICE] 009/180019 (568) : path to executable is /usr/sbin/haproxy
2023-01-10T18:00:19.645 app[8717bc92] ord [info] proxy | [ALERT] 009/180019 (568) : backend 'bk_db' has no server available!
2023-01-10T18:03:53.732 app[8717bc92] ord [info] Starting init (commit: b8364bb)...
2023-01-10T18:03:53.799 app[8717bc92] ord [info] Mounting /dev/vdc at /data w/ uid: 0, gid: 0 and chmod 0755
2023-01-10T18:03:53.823 app[8717bc92] ord [info] Preparing to run: `docker-entrypoint.sh start` as root
2023-01-10T18:07:49.689 app[8717bc92] ord [info] sentinel | 2023-01-10T18:07:49.688Z ERROR cmd/sentinel.go:1895 cannot get keepers info {"error": "unexpected EOF"}
2023-01-10T18:12:16.896 app[8717bc92] ord [info] exporter | INFO[0502] Established new database connection to "fdaa:0:6d7c:a7b:9ada:1:198f:2:5433". source="postgres_exporter.go:970"
2023-01-10T18:12:16.923 app[8717bc92] ord [info] exporter | INFO[0502] Semantic Version Changed on "fdaa:0:6d7c:a7b:9ada:1:198f:2:5433": 0.0.0 -> 14.6.0 source="postgres_exporter.go:1539"
2023-01-10T18:12:16.963 app[8717bc92] ord [info] exporter | INFO[0502] Established new database connection to "fdaa:0:6d7c:a7b:9ada:1:198f:2:5433". source="postgres_exporter.go:970"
2023-01-10T18:12:16.974 app[8717bc92] ord [info] exporter | INFO[0502] Semantic Version Changed on "fdaa:0:6d7c:a7b:9ada:1:198f:2:5433": 0.0.0 -> 14.6.0 source="postgres_exporter.go:1539"
2023-01-10T18:17:03.949 app[8717bc92] ord [info] sentinel | 2023-01-10T18:17:03.948Z WARN cmd/sentinel.go:276 no keeper info available {"db": "97026b62", "keeper": "9ada1198f2"}
2023-01-10T18:24:04.684 app[8717bc92] ord [info] sentinel | 2023-01-10T18:24:04.683Z ERROR cmd/sentinel.go:1895 cannot get keepers info {"error": "unexpected end of JSON input"}
2023-01-10T18:33:26.687 app[8717bc92] ord [info] sentinel | 2023-01-10T18:33:26.686Z WARN cmd/sentinel.go:276 no keeper info available {"db": "97026b62", "keeper": "9ada1198f2"}
2023-01-10T18:43:28.624 app[8717bc92] ord [info] sentinel | 2023-01-10T18:43:28.621Z WARN cmd/sentinel.go:276 no keeper info available {"db": "97026b62", "keeper": "9ada1198f2"}
2023-01-10T18:43:48.777 app[8717bc92] ord [info] sentinel | 2023-01-10T18:43:48.775Z WARN cmd/sentinel.go:276 no keeper info available {"db": "97026b62", "keeper": "9ada1198f2"}
2023-01-10T18:43:48.782 app[8717bc92] ord [info] sentinel | 2023-01-10T18:43:48.781Z ERROR cmd/sentinel.go:1018 no eligible masters

And it’s down again :cry:

Instances should be coming back up. For those of you running single-node clusters, I’d highly recommend spinning up an additional standby member if HA is important to you.

I’d highly recommend spinning up an additional standby member if HA is important to you.

Ok, will do. Thanks

My database is still dead. Do I need to perform some action to bring it back up?

List your machines.

fly machines list --app <app-name>

Start the machine if it’s in a stopped state.

fly machines start <machine-id> --app <app-name>
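Putting those steps together as a quick check sequence (a sketch; the app name is a placeholder, and the machine ID comes from the list output):

```shell
# Placeholder: use your Postgres app's name, not your web app's
APP="my-app-postgres"

# Show the overall state of the app first
fly status --app "$APP"

# List machines and note the ID of any stopped machine
fly machines list --app "$APP"

# Start a stopped machine by its ID
fly machines start <machine-id> --app "$APP"
```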

I’m trying to follow these instructions to resurrect my own randomly dead Postgres instance. When I issue the command fly machines list --app <my-app-name>, the response I get is “No machines are available on this app <my-app-name>”. What am I doing wrong?

Hey @nickmjones, are you specifying the Postgres app name when calling the command?

For example, if you have a web app called my-app and a corresponding Postgres cluster called my-app-postgres, you’d have to run fly machines list --app my-app-postgres as that is the app with the machine running Postgres. Just want to make sure you’re not running the equivalent of fly machines list --app my-app in this example, as that wouldn’t show the machine running Postgres.