About 5 hours ago, my Fly Postgres app (restless-sea-2562-db) failed its health checks. My web app currently crashes because the database is unavailable. Here are the logs from that time:
2025-02-02T11:30:22Z app[78116dea573d38] iad [info]repmgrd | [2025-02-02 11:30:22] [INFO] monitoring primary node "fdaa:2:657c:a7b:107:b4e0:52f6:2" (ID: 1893110041) in normal state
2025-02-02T11:32:14Z app[78116dea573d38] iad [info]monitor | Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2025-02-02T11:35:22Z app[78116dea573d38] iad [info]repmgrd | [2025-02-02 11:35:22] [INFO] monitoring primary node "fdaa:2:657c:a7b:107:b4e0:52f6:2" (ID: 1893110041) in normal state
2025-02-02T11:37:14Z app[78116dea573d38] iad [info]monitor | Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2025-02-02T12:32:15Z health[78116dea573d38] iad [error]Health check for your postgres database has failed. Your database is malfunctioning.
2025-02-02T12:32:15Z health[78116dea573d38] iad [error]Health check for your postgres vm has failed. Your instance has hit resource limits. Upgrading your instance / volume size or reducing your usage might help.
2025-02-02T12:32:15Z health[78116dea573d38] iad [error]Health check for your postgres role has failed. Your cluster's membership is inconsistent.
2025-02-02T13:16:24Z proxy[78116dea573d38] iad [info]Starting machine
2025-02-02T13:16:24Z proxy[78116dea573d38] iad [error][PM02] could not wake up machine due to an internal communication error
2025-02-02T13:16:24Z proxy[78116dea573d38] iad [info]Starting machine
2025-02-02T13:16:24Z proxy[78116dea573d38] iad [error][PM02] could not wake up machine due to an internal communication error
2025-02-02T13:16:41Z proxy[78116dea573d38] iad [info]Starting machine
2025-02-02T13:16:41Z proxy[78116dea573d38] iad [error][PM02] could not wake up machine due to an internal communication error
2025-02-02T13:16:41Z proxy[78116dea573d38] iad [info]Starting machine
2025-02-02T13:16:41Z proxy[78116dea573d38] iad [error][PM02] could not wake up machine due to an internal communication error
2025-02-02T14:17:47Z proxy[78116dea573d38] iad [info]Starting machine
I just switched from a “Legacy Hobby” plan to “Pay-as-you-go,” thinking that might address the resource limits, but it doesn’t seem to have made a difference.
Neither fly machine stop nor fly machine restart does anything; both fail like this:
$ fly machine stop 78116dea573d38 -a restless-sea-2562-db
Sending kill signal to machine 78116dea573d38...
Error: could not stop machine 78116dea573d38: failed to stop VM 78116dea573d38: aborted: unable to stop machine, current state invalid, starting (Request ID: 01JK3WVTQP27WGE5WCV65E6C00-iad)
$ fly machine restart 78116dea573d38 -a restless-sea-2562-db
Restarting machine 78116dea573d38
Error: failed to restart machine 78116dea573d38: could not stop machine 78116dea573d38: failed to restart VM 78116dea573d38: internal: internal server error (Request ID: 01JK3WVZ6HM8X1RZNSTP15PQM6-iad)
I don’t think I’ve actually exceeded the resource limits – the db says it’s “1GB”, and clicking on the volume shows it has used 129 MB.
Is there anything I can do to get this machine running again and get my app working? Please let me know if you see anything else I can investigate on my side. Thanks!
There might be nothing you can do to resurrect that machine without Fly’s help, since “[PM02] Machine wake internal error” should indicate an internal Fly.io error.
Also, Fly Postgres databases are unmanaged, so it’s up to you to manage them and handle related emergencies.
I gather you only had this single node rather than the recommended HA setup? What’s the status of the volume?
If I were you, I’d try to spin up a new Postgres app and recover the db from the last automated snapshot so that I could get things up again quickly.
Hey @mabis, thanks so much for taking a look. Yes, it was just a single node – it’s an infrequently used app and almost all the data is actually mirrored from elsewhere.
Could you tell me more about what you mean by “DB snapshot”? I checked out that doc about “Backup, Restores, & Snapshots” and it only describes using volume snapshots. Is there somewhere I can download a DB snapshot in the fly dashboard?
For good measure, I tried creating a new Postgres from the volume snapshot, but it failed to start:
2025-02-02T20:07:42.454 app[e7843121cd1338] iad [info] postgres | 2025-02-02 20:07:42.453 UTC [4064] WARNING: database "postgres" has a collation version mismatch
2025-02-02T20:07:42.454 app[e7843121cd1338] iad [info] postgres | 2025-02-02 20:07:42.453 UTC [4064] DETAIL: The database was created using collation version 2.31, but the operating system provides version 2.36.
2025-02-02T20:07:42.454 app[e7843121cd1338] iad [info] postgres | 2025-02-02 20:07:42.453 UTC [4064] HINT: Rebuild all objects in this database that use the default collation and run ALTER DATABASE postgres REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.
2025-02-02T20:07:42.465 app[e7843121cd1338] iad [info] Registering primary
2025-02-02T20:07:42.490 app[e7843121cd1338] iad [info] postgres | 2025-02-02 20:07:42.490 UTC [4064] ERROR: template database "template1" has a collation version mismatch
2025-02-02T20:07:42.490 app[e7843121cd1338] iad [info] postgres | 2025-02-02 20:07:42.490 UTC [4064] DETAIL: The template database was created using collation version 2.31, but the operating system provides version 2.36.
2025-02-02T20:07:42.490 app[e7843121cd1338] iad [info] postgres | 2025-02-02 20:07:42.490 UTC [4064] HINT: Rebuild all objects in the template database that use the default collation and run ALTER DATABASE template1 REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.
2025-02-02T20:07:42.490 app[e7843121cd1338] iad [info] postgres | 2025-02-02 20:07:42.490 UTC [4064] STATEMENT: CREATE DATABASE repmgr OWNER repmgr;
2025-02-02T20:07:42.490 app[e7843121cd1338] iad [info] failed post-init: failed to enable repmgr: failed to create repmgr database: ERROR: template database "template1" has a collation version mismatch (SQLSTATE XX000). Retrying...
(It’s looping that error message :S)
I’m going to prepare a script to recreate everything I can. In the meantime, if you have anything else to share about DB snapshots, I’d appreciate it. Thanks again!
I corrected my initial reply. What you did is correct, but perhaps the latest Postgres image is newer than the one used to create the original Postgres app? There is a paragraph in the docs about using the correct image.
In any case you could try and log into the new postgres machine and run the two ALTER queries suggested in the log:
ALTER DATABASE postgres REFRESH COLLATION VERSION;
ALTER DATABASE template1 REFRESH COLLATION VERSION;
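If you do get a session, a query like the one below might help confirm the mismatch before and after running those statements. It is only a sketch: it assumes PostgreSQL 15 or later (for datcollversion and pg_database_collation_actual_version), and both the collation_check_sql helper and the PGURL connection string are placeholders I made up. As far as I know, REFRESH COLLATION VERSION only updates the recorded version, so indexes on text columns may still need a REINDEX, as the HINT in your log suggests.

```shell
#!/bin/sh
# Sketch: print a query that compares each database's recorded collation
# version with the version the operating system currently provides.
# Assumes PostgreSQL 15+; collation_check_sql is a hypothetical helper name.
collation_check_sql() {
  cat <<'SQL'
SELECT datname,
       datcollversion AS recorded_version,
       pg_database_collation_actual_version(oid) AS os_version
FROM pg_database;
SQL
}

# Example (not run automatically); PGURL is a placeholder connection string:
#   collation_check_sql | psql "$PGURL"
```

Rows where recorded_version and os_version differ are the databases the warning is about.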
Hey, thanks for this suggestion. It makes a lot of sense. But I’m not sure how to get a psql session. I got a console session with fly ssh console, and tried to run psql, but it said:
psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
Is the server running locally and accepting connections on that socket?
So, it’s not running. I tried a couple of other ways I know to start postgres but neither of them did anything:
$ fly ssh console -a db-restored
Connecting to fdaa:2:657c:a7b:375:d2b2:395a:2... complete
root@6e825640ae7428:/# postgres -D /usr/local/pgsql/data
-bash: postgres: command not found
root@6e825640ae7428:/# pg_ctl
-bash: pg_ctl: command not found
root@6e825640ae7428:/#
I also tried fly postgres connect but it said there was no active leader found:
$ fly postgres connect -a db-restored
Error: no active leader found
Is there a better way to get a console where I can run those commands? Thanks so much for your help!
@mabis and @mayailurus, thanks again for your help working through this. With your suggestions and some related links in this forum, I ended up recovering my data using the technique described here: Can't recreate pg from snapshot: "The database was created using collation version 2.31, but the operating system provides version 2.36". Basically, I ran fly pg create but chose a different --image-ref than the one reported by fly image show. (For me, fly image show returned flyio/postgres-flex 15.3, but I was able to recover my data with flyio/postgres-flex:15.1@sha256:4af8e07ae57ff7d31228b32ceebd34bf7508c131bc86f67c2025c669b56eff70, the same ref as in the linked post.)
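For anyone landing here later, a rough sketch of that sequence follows. It is not the exact set of commands that was run: the volume and snapshot IDs are placeholders to fill in from the list output, the run helper is made up for illustration, and the image ref is simply the one from the linked post, so another cluster may need a different one. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: restore a Fly Postgres volume snapshot into a new app while
# pinning the image. Prints each command by default; set DRY_RUN=0 to run.
OLD_APP="restless-sea-2562-db"                 # app name from this thread
NEW_APP="db-restored"                          # app name from this thread
VOLUME_ID="${VOLUME_ID:-vol_PLACEHOLDER}"      # placeholder: from `fly volumes list`
SNAPSHOT_ID="${SNAPSHOT_ID:-vs_PLACEHOLDER}"   # placeholder: from the snapshot list
# Image ref quoted in the post above; an assumption for any other cluster.
IMAGE_REF="flyio/postgres-flex:15.1@sha256:4af8e07ae57ff7d31228b32ceebd34bf7508c131bc86f67c2025c669b56eff70"

# Hypothetical helper: echo the command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

run fly image show -a "$OLD_APP"               # image the old app reports
run fly volumes list -a "$OLD_APP"             # find the volume ID
run fly volumes snapshots list "$VOLUME_ID"    # find a snapshot ID
run fly postgres create --name "$NEW_APP" \
  --snapshot-id "$SNAPSHOT_ID" --image-ref "$IMAGE_REF"
```

Printing first makes it easy to check the IDs and the image ref before anything is created.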
After that, I had a running machine with my data on it. (That snapshot was 18 hours old but the data doesn’t change much.)
I had to do a bit of massaging to get my Rails app to use the new machine:
Update the DATABASE_URL environment variable to use the new machine
Stop the machine … because it worked, but the application was looking in the wrong database
Copy data from one Postgres database to another. I think this was needed because of how I migrated from Heroku in the past: the app was using a database named after the Fly app, but Rails’s default is to use a database named after the Rails app. I probably could have changed a Rails config instead, but I figured I’d be doing myself a favor by using the Rails default. I did this using pgAdmin as described here: postgresql - Copy a table from one database to another in Postgres - Stack Overflow
Restart the machine now that the data was present
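The copy step could probably also be done from the command line with pg_dump and psql rather than pgAdmin. The following is a sketch of that alternative, not the pgAdmin route described above: the copy_database helper, the database names, and the local port are all made up for illustration, and connection settings are taken from the standard PGHOST, PGPORT, PGUSER, and PGPASSWORD environment variables.

```shell
#!/bin/sh
# Sketch: copy one database into a new database on the same server.
# copy_database is a hypothetical helper; nothing runs until it is called.
copy_database() {
  src="$1"   # existing database, e.g. the one named after the Fly app
  dst="$2"   # new database, e.g. the one Rails expects by default
  # Create the target, then stream a plain-SQL dump straight into it.
  createdb "$dst" &&
    pg_dump "$src" | psql -v ON_ERROR_STOP=1 -d "$dst"
}

# Example (not run automatically), with the server reachable locally, for
# instance through `fly proxy 15432:5432 -a db-restored` in another terminal:
#   PGHOST=localhost PGPORT=15432 PGUSER=postgres \
#     copy_database fly_app_db rails_app_db
```

On Fly, DATABASE_URL is usually stored as a secret, so the first step would typically be something like fly secrets set DATABASE_URL=<connection-string> -a <web-app>, with both placeholders filled in for your own apps.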
Now, everything is running smoothly again, with all my data intact.
@mabis and @mayailurus, thanks so much for helping me! I thought I’d be spending the next couple of days trying to scrape data together from previous reports, etc., but now everything is working properly.
Next, I’ll get my fly setup improved so I’ll have redundancy if this happens again!