Postgres - failed to start VM after service interruption

I have a Postgres machine that refuses to start after a Fly.io service interruption.

The interruption notice (now resolved):

We are performing emergency maintenance on a host some of your apps' instances are running on. Apps may be unavailable until the maintenance is completed.

The error when trying to start via flyctl:

Error: could not start machine 6e82535b70e258: failed to start VM 6e82535b70e258: failed_precondition: machine still active, refusing to start (Request ID: 01HXV9QQFV1E66V3PF17PMHAZW-lhr)

Further logs:

2024-05-14T10:21:26.604 app[6e82535b70e258] lhr [info] Starting init (commit: 08b4c2b)…
2024-05-14T10:21:26.623 app[6e82535b70e258] lhr [info] Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
2024-05-14T10:21:26.628 app[6e82535b70e258] lhr [info] Preparing to run: docker-entrypoint.sh start as root
2024-05-14T10:21:26.640 runner[6e82535b70e258] lhr [info] Machine started in 250ms
2024-05-14T10:21:26.646 app[6e82535b70e258] lhr [info] 2024/05/14 10:21:26 listening on [fdaa:0:4e26:a7b:2809:25dc:1398:2]:22 (DNS: [fdaa::3]:53)
2024-05-14T10:21:26.741 app[6e82535b70e258] lhr [info] cluster spec filename /fly/cluster-spec.json
2024-05-14T10:21:26.743 app[6e82535b70e258] lhr [info] panic: error loading cluster spec: unexpected end of JSON input
2024-05-14T10:21:26.743 app[6e82535b70e258] lhr [info] goroutine 1 [running]:
2024-05-14T10:21:27.637 app[6e82535b70e258] lhr [info] Starting clean up.
2024-05-14T10:21:27.637 app[6e82535b70e258] lhr [info] Umounting /dev/vdb from /data
2024-05-14T10:21:28.642 app[6e82535b70e258] lhr [info] [ 2.155850] reboot: Restarting system


Exactly the same thing is happening here as well:

2024-05-14T19:12:14.159 app[5683040b797e38] lhr [info] Starting init (commit: b8364bb)...
2024-05-14T19:12:14.185 app[5683040b797e38] lhr [info] Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
2024-05-14T19:12:14.191 app[5683040b797e38] lhr [info] Preparing to run: `docker-entrypoint.sh start` as root
2024-05-14T19:12:14.222 app[5683040b797e38] lhr [info] 2024/05/14 19:12:14 listening on [fdaa:0:806b:a7b:2809:babf:db6f:2]:22 (DNS: [fdaa::3]:53)
2024-05-14T19:12:14.243 runner[5683040b797e38] lhr [info] Machine started in 911ms
2024-05-14T19:12:14.317 app[5683040b797e38] lhr [info] cluster spec filename /fly/cluster-spec.json
2024-05-14T19:12:14.319 app[5683040b797e38] lhr [info] panic: error loading cluster spec: unexpected end of JSON input
2024-05-14T19:12:14.319 app[5683040b797e38] lhr [info] goroutine 1 [running]:
2024-05-14T19:12:14.319 app[5683040b797e38] lhr [info] main.main()
2024-05-14T19:12:14.319 app[5683040b797e38] lhr [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:69 +0x1bbb
2024-05-14T19:12:15.206 app[5683040b797e38] lhr [info] Starting clean up.
2024-05-14T19:12:15.207 app[5683040b797e38] lhr [info] Umounting /dev/vdb from /data
2024-05-14T19:12:16.211 app[5683040b797e38] lhr [info] [ 2.148485] reboot: Restarting system
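
For what it's worth, "unexpected end of JSON input" is the message Go's encoding/json returns when the input is empty or stops partway through a value, so one plausible reading of the panic is that /fly/cluster-spec.json was left empty or truncated by the interruption. That is a guess, not something confirmed in this thread. The sketch below reproduces only the error message; `clusterSpec`, `parseSpec`, and the sample inputs are hypothetical and are not taken from the postgres-ha source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clusterSpec is a hypothetical stand-in: the real schema of
// /fly/cluster-spec.json is not shown anywhere in this thread.
type clusterSpec map[string]json.RawMessage

// parseSpec decodes raw bytes with encoding/json and returns the
// decoder's error unchanged, so the message can be compared with the log.
func parseSpec(data []byte) (clusterSpec, error) {
	var spec clusterSpec
	err := json.Unmarshal(data, &spec)
	return spec, err
}

func main() {
	// Made-up inputs: an empty file, a file cut off mid-write, and a
	// complete document for comparison.
	cases := []struct {
		name string
		data string
	}{
		{"empty file", ""},
		{"truncated file", `{"members": [`},
		{"complete file", `{"members": []}`},
	}

	for _, c := range cases {
		_, err := parseSpec([]byte(c.data))
		fmt.Printf("%-15s -> %v\n", c.name, err)
	}
}
```

The empty and truncated inputs both fail with the same message seen in the logs, while the complete document parses without error. If you can still get at the file on the volume, checking whether it is zero bytes long would confirm or rule this out.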

We ended up forking the volume that holds the database data and creating a new machine. This is probably what Fly would recommend, but we'll likely be moving to another provider after this latest incident.

I couldn’t fork the volume, so I ended up creating a new Postgres database from a volume snapshot.

fly volumes list -a {POSTGRES_APP_NAME} # Find the Volume ID
fly volumes snapshots list {VOLUME_ID} # List snapshots of the Volume
fly postgres create --snapshot-id {VOLUME_SNAPSHOT_ID} # Create a new Postgres from the snapshot
