Postgres spontaneously died and cannot be started or restarted

A solo machines/v2 postgres database I am running suddenly died a few hours ago, and the replacement machine has hung in such a way that it is impossible to either start or stop it:


❯ bin/fly-db machine list
1 machines have been retrieved.
View them in the UI here (https://fly.io/apps/starfruit-cafe-db/machines/)

starfruit-cafe-db
ID            	NAME              	STATE   	REGION	IMAGE                        	IP ADDRESS                      	VOLUME              	CREATED             	LAST UPDATED         
4d89002b6ede87	delicate-bush-7164	starting	sjc   	flyio/postgres:14.4 (v0.0.32)	fdaa:0:ddc7:a7b:2295:af86:710d:2	vol_8zmjnv8208kvywgx	2022-11-14T21:12:13Z	2022-11-24T02:43:04Z	

❯ bin/fly-db machine stop 4d89002b6ede87
Sending kill signal to machine 4d89002b6ede87...
Error could not stop machine 4d89002b6ede87: failed to stop VM 4d89002b6ede87: unable to stop machine, not currently started


❯ bin/fly-db machine kill 4d89002b6ede87
machine 4d89002b6ede87 was found and is currently in a starting state, attempting to kill...
Error could not kill machine 4d89002b6ede87: failed to kill VM 4d89002b6ede87: context deadline exceeded


❯ bin/fly-db machine remove 4d89002b6ede87
machine 4d89002b6ede87 was found and is currently in starting state, attempting to destroy...
Error could not destroy machine 4d89002b6ede87: failed to destroy VM 4d89002b6ede87: unable to destroy machine, not currently stopped


❯ bin/fly-db machine restart 4d89002b6ede87
Restarting machine 4d89002b6ede87
Error failed to restart machine 4d89002b6ede87: could not stop machine 4d89002b6ede87: failed to restart VM 4d89002b6ede87: failed to wait for machine to be started

This also means that even though my data is theoretically fine on a volume, I can’t get to any of it because the perma-stuck machine has the lease on the only volume with my data in it.

Are there some options here that I’m missing? The only thing I can see right now is to create a new volume from an 8-hour-old snapshot, and I’d really rather not do that if I can avoid it.

Just in case it’s helpful, here’s the output from fly logs, which also looks… weird. It doesn’t seem to have output even a single line for the last 5 hours or so.

fly logs
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:08.952 UTC [692] LOG:  server process (PID 30715) was terminated by signal 9: Killed
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:08.952 UTC [692] DETAIL:  Failed process was running: SELECT "statuses".* FROM "statuses" WHERE "statuses"."deleted_at" IS NULL AND "statuses"."id" = $1 ORDER BY "statuses"."id" DESC LIMIT $2
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.004 UTC [692] LOG:  terminating any other active server processes
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.062 UTC [25837] FATAL:  the database system is in recovery mode
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.063 UTC [25839] FATAL:  the database system is in recovery mode
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.064 UTC [25838] FATAL:  the database system is in recovery mode
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.065 UTC [692] LOG:  all server processes terminated; reinitializing
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]exporter | INFO[754517] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_locks pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23T14:48:09.057Z	ERROR	cmd/keeper.go:719	cannot get configured pg parameters	{"error": "write unix @->/tmp/.s.PGSQL.5433: write: broken pipe"}
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.120 UTC [25843] FATAL:  the database system is in recovery mode
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]exporter | INFO[754517] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_stat_bgwriter pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.123 UTC [25842] LOG:  database system was interrupted; last known up at 2022-11-23 14:47:08 UTC
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.123 UTC [25844] FATAL:  the database system is in recovery mode
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]exporter | INFO[754517] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_stat_database pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.137 UTC [25845] FATAL:  the database system is in recovery mode
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]exporter | INFO[754517] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_stat_replication pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.140 UTC [25846] FATAL:  the database system is in recovery mode
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]exporter | INFO[754517] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_replication_slots pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.145 UTC [25848] FATAL:  the database system is in recovery mode
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]exporter | INFO[754517] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_stat_archiver pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.155 UTC [25849] FATAL:  the database system is in recovery mode
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]exporter | INFO[754517] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_stat_activity pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.164 UTC [25850] FATAL:  the database system is in recovery mode
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]exporter | INFO[754517] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_replication pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]exporter | ERRO[754517] queryNamespaceMappings returned 8 errors      source="postgres_exporter.go:1608"
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.211 UTC [25842] LOG:  database system was not properly shut down; automatic recovery in progress
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.212 UTC [25842] LOG:  redo starts at 25/90000028
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.215 UTC [25842] LOG:  redo done at 25/900219F8 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2022-11-23T14:48:09Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 14:48:09.225 UTC [692] LOG:  database system is ready to accept connections
2022-11-23T14:48:10Z app[4d89002b6ede87] sjc [info]exporter | ERRO[754518] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:ddc7:a7b:2295:af86:710d:2]:5433/starfruit_cafe?sslmode=disable): driver: bad connection  source="postgres_exporter.go:1608"
2022-11-23T14:48:10Z app[4d89002b6ede87] sjc [info]exporter | INFO[754519] Established new database connection to "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433".  source="postgres_exporter.go:970"
2022-11-23T14:48:10Z app[4d89002b6ede87] sjc [info]exporter | INFO[754519] Semantic Version Changed on "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": 0.0.0 -> 14.4.0  source="postgres_exporter.go:1539"
2022-11-23T14:48:11Z app[4d89002b6ede87] sjc [info]proxy    | [WARNING] 326/144811 (562) : Backup Server bk_db/pg is DOWN, reason: Layer7 timeout, check duration: 5205ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2022-11-23T14:48:11Z app[4d89002b6ede87] sjc [info]proxy    | [WARNING] 326/144811 (562) : Server bk_db/pg1 is DOWN, reason: Layer7 timeout, check duration: 5205ms. 0 active and 0 backup servers left. 5 sessions active, 0 requeued, 0 remaining in queue.
2022-11-23T14:48:11Z app[4d89002b6ede87] sjc [info]proxy    | [ALERT] 326/144811 (562) : backend 'bk_db' has no server available!
2022-11-23T14:48:15Z app[4d89002b6ede87] sjc [info]proxy    | [WARNING] 326/144815 (562) : Server bk_db/pg1 is UP, reason: Layer7 check passed, code: 200, check duration: 15ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
2022-11-23T14:48:15Z app[4d89002b6ede87] sjc [info]proxy    | [WARNING] 326/144815 (562) : Backup Server bk_db/pg is UP, reason: Layer7 check passed, code: 200, check duration: 10ms. 1 active and 1 backup servers online. 0 sessions requeued, 0 total in queue.
2022-11-23T18:15:57Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 18:15:57.245 UTC [31328] ERROR:  duplicate key value violates unique constraint "index_status_stats_on_status_id"
2022-11-23T18:15:57Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 18:15:57.245 UTC [31328] DETAIL:  Key (status_id)=(109247993926822370) already exists.
2022-11-23T18:15:57Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 18:15:57.245 UTC [31328] STATEMENT:  INSERT INTO "status_stats" ("status_id", "replies_count", "created_at", "updated_at") VALUES ($1, $2, $3, $4) RETURNING "id"
2022-11-23T19:19:23Z app[4d89002b6ede87] sjc [info]proxy    | [WARNING] 326/191923 (562) : Backup Server bk_db/pg is DOWN, reason: Layer7 timeout, check duration: 5169ms. 0 active and 0 backup servers left. 1 sessions active, 0 requeued, 0 remaining in queue.
2022-11-23T19:19:23Z app[4d89002b6ede87] sjc [info]proxy    | [ALERT] 326/191923 (562) : backend 'bk_db' has no server available!
2022-11-23T19:19:23Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 19:19:23.450 UTC [21368] LOG:  could not send data to client: Connection reset by peer
2022-11-23T19:19:23Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 19:19:23.452 UTC [21368] FATAL:  connection to client lost
2022-11-23T19:19:26Z app[4d89002b6ede87] sjc [info]proxy    | [WARNING] 326/191926 (562) : Server bk_db/pg1 is UP, reason: Layer7 check passed, code: 200, check duration: 8ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
2022-11-23T19:19:27Z app[4d89002b6ede87] sjc [info]proxy    | [WARNING] 326/191927 (562) : Backup Server bk_db/pg is UP, reason: Layer7 check passed, code: 200, check duration: 11ms. 1 active and 1 backup servers online. 0 sessions requeued, 0 total in queue.
2022-11-23T21:13:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 21:13:28.038 UTC [11753] DETAIL:  Key (status_id)=(108842590026449145) already exists.
2022-11-23T21:13:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-23 21:13:28.038 UTC [11753] STATEMENT:  INSERT INTO "status_stats" ("status_id", "replies_count", "created_at", "updated_at") VALUES ($1, $2, $3, $4) RETURNING "id"
2022-11-24T01:32:42Z app[4d89002b6ede87] sjc [info]sentinel | 2022-11-24T01:32:42.027Z	WARN	cmd/sentinel.go:276	no keeper info available	{"db": "279413ef", "keeper": "2295af86710d2"}
2022-11-24T01:33:27Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:27.808 UTC [692] LOG:  server process (PID 25150) was terminated by signal 9: Killed
2022-11-24T01:33:27Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:27.808 UTC [692] DETAIL:  Failed process was running: SELECT "preview_cards".* FROM "preview_cards" INNER JOIN "preview_cards_statuses" ON "preview_cards"."id" = "preview_cards_statuses"."preview_card_id" WHERE "preview_cards_statuses"."status_id" = $1
2022-11-24T01:33:27Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:27.910 UTC [692] LOG:  terminating any other active server processes
2022-11-24T01:33:27Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:27.994 UTC [17543] FATAL:  the database system is in recovery mode
2022-11-24T01:33:27Z app[4d89002b6ede87] sjc [info]exporter | WARN[793235] Proceeding with outdated query maps, as the Postgres version could not be determined: Error scanning version string on "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pq: the database system is in recovery mode  source="postgres_exporter.go:1712"
2022-11-24T01:33:27Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:27.998 UTC [692] LOG:  all server processes terminated; reinitializing
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.074 UTC [17551] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]exporter | INFO[793235] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_stat_database_conflicts pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.076 UTC [17549] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.080 UTC [17548] LOG:  PID 17519 in cancel request did not match any process
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24T01:33:28.081Z	ERROR	cmd/keeper.go:719	cannot get configured pg parameters	{"error": "context deadline exceeded"}
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.083 UTC [17547] LOG:  PID 0 in cancel request did not match any process
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.085 UTC [17552] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]exporter | INFO[793235] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_locks pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.087 UTC [17546] LOG:  database system was interrupted; last known up at 2022-11-24 01:32:35 UTC
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.087 UTC [17553] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.090 UTC [17554] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.092 UTC [17555] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]exporter | INFO[793235] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_stat_bgwriter pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.106 UTC [17556] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]exporter | INFO[793235] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_replication pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.110 UTC [17557] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]exporter | INFO[793235] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_database pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.114 UTC [17558] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24T01:33:28.115Z	ERROR	cmd/keeper.go:1720	failed to check if restart is required	{"error": "pq: the database system is in recovery mode"}
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.119 UTC [17559] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]exporter | INFO[793235] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_stat_database pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.124 UTC [17560] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]exporter | INFO[793235] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_stat_replication pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.128 UTC [17561] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]exporter | INFO[793235] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_replication_slots pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.132 UTC [17562] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]exporter | INFO[793235] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_stat_archiver pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.134 UTC [17563] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]exporter | INFO[793235] Error running query on database "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": pg_stat_activity pq: the database system is in recovery mode  source="postgres_exporter.go:1490"
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]exporter | ERRO[793235] queryNamespaceMappings returned 10 errors     source="postgres_exporter.go:1608"
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.136 UTC [17564] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.149 UTC [17565] FATAL:  the database system is in recovery mode
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.197 UTC [17546] LOG:  database system was not properly shut down; automatic recovery in progress
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.198 UTC [17546] LOG:  redo starts at 27/D2000028
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.211 UTC [17546] LOG:  invalid record length at 27/D3019420: wanted 24, got 0
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.211 UTC [17546] LOG:  redo done at 27/D30193F8 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.01 s
2022-11-24T01:33:28Z app[4d89002b6ede87] sjc [info]keeper   | 2022-11-24 01:33:28.226 UTC [692] LOG:  database system is ready to accept connections
2022-11-24T01:33:29Z app[4d89002b6ede87] sjc [info]exporter | ERRO[793236] Error opening connection to database (postgresql://flypgadmin:PASSWORD_REMOVED@[fdaa:0:ddc7:a7b:2295:af86:710d:2]:5433/starfruit_cafe?sslmode=disable): driver: bad connection  source="postgres_exporter.go:1608"
2022-11-24T01:33:39Z app[4d89002b6ede87] sjc [info]exporter | INFO[793246] Established new database connection to "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433".  source="postgres_exporter.go:970"
2022-11-24T01:33:39Z app[4d89002b6ede87] sjc [info]exporter | INFO[793246] Semantic Version Changed on "fdaa:0:ddc7:a7b:2295:af86:710d:2:5433": 0.0.0 -> 14.4.0  source="postgres_exporter.go:1539"
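
Most of the log above is repeated "recovery mode" noise from the exporter and from rejected connections. As a rough sketch (not an official Fly.io tool), a small filter can pull out just the crash and recovery milestones. The function name `pg_crash_timeline` and the file name `fly-logs.txt` are made up for illustration; the patterns are copied from the log lines above.

```shell
#!/bin/sh
# Hypothetical helper: read saved `fly logs` output on stdin and keep only
# the Postgres crash/recovery milestones seen in the log above.
pg_crash_timeline() {
  # -E enables alternation; each alternative is a literal phrase from the log.
  grep -E 'terminated by signal|automatic recovery in progress|redo (starts|done) at|ready to accept connections'
}

# Example usage, assuming the logs were saved to a file first:
#   pg_crash_timeline < fly-logs.txt
```

On the log above, this would keep the two "terminated by signal 9" events, at 14:48 and 01:33, together with the recovery steps that followed each one.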

Well… I was finally able to get rid of the stuck machine by running fly machine remove --force ID. But now my Postgres cluster has no machines, and the entire README for fly-apps/postgres-ha (Postgres + Stolon for HA clusters as Fly apps) only applies to Nomad Postgres apps, while this is a Machines Postgres app. So… how do I add replicas without a machine to clone?

Hmm, so deleting all the machines is an issue. There’s no centralized store for machine app config, but it may be possible to restore the config and apply it to a new machine.

By pure coincidence, I happened to have the API output describing the machine on hand, so I was able to create another machine and point it at the same volume. Unfortunately, that alone doesn’t work: Postgres never comes back up on the new machine.

Trying to SSH in and start Postgres by hand, I noticed that the volume has hardcoded Postgres settings, like the IPv6 listen address, that belong to the original machine and not the new one. In the end, I hand-fixed the Postgres config files for long enough to start Postgres and run pg_dump.
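
For anyone stuck at the same step, here is a minimal sketch of that hand-fix as a script. It is an assumption-heavy illustration, not a documented recovery procedure: the function name `rewrite_listen_address` is made up, the directory to scan is whatever holds the config files on your volume, and the new address is a placeholder you would replace with the new machine's IP. It only does a plain text substitution of the old address and keeps a `.bak` copy of every file it changes.

```shell
#!/bin/sh
# Hypothetical helper: replace the old machine's IPv6 address with the new
# one in every text file under DIR that mentions it.
# Usage: rewrite_listen_address DIR OLD_ADDR NEW_ADDR
rewrite_listen_address() {
  dir=$1
  old=$2
  new=$3
  # -r recurse, -l list matching files, -I skip binary files,
  # -F treat the address as a fixed string rather than a regex.
  grep -rlIF -- "$old" "$dir" | while IFS= read -r file; do
    cp -p "$file" "$file.bak"   # keep a backup before changing anything
    # IPv6 addresses contain only hex digits and colons, so they can be
    # used unescaped in a sed pattern with '|' as the delimiter.
    sed "s|$old|$new|g" "$file.bak" > "$file"
    echo "rewrote $file"
  done
}

# Example (the directory and the new address are placeholders):
#   rewrite_listen_address /path/to/config/dir \
#     "fdaa:0:ddc7:a7b:2295:af86:710d:2" "<new machine IPv6 address>"
```

Pointing it at a narrow config directory rather than the whole volume keeps it away from data files, and the `.bak` copies make it easy to undo.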

I would love some kind of docs for how to recover from “you force-deleted a broken machine and now you only have a single volume and no VMs”, if that’s something that has a solution.

This sounds almost identical to what happened to me. My app wasn’t able to connect to the postgres database. I logged into the fly console and saw that it was suspended. When I tried to unsuspend the machine it was just hung. I figured if I deleted the machine it would create a new one. Here’s my full terminal log: fly postgres log · GitHub

OK, I was able to recover my database by creating a new database from one of the old database’s volume snapshots:

fly volumes list -a <app>
fly volumes snapshots list <volume id>
fly pg create --region ord --name <new name> --snapshot-id <snapshot id>