I see you’re going through some postgres issues. Thanks for taking some troubleshooting steps on your own and letting us know what’s not working. Have you tried to restart your machine by any chance?
I found a post that is similar to what you are going through, and it looks like someone else fixed their issue by restarting their machine. Here’s that post for reference: PostgreSQL DB resource limits reached
Hi Kaelyn, thanks for your response. Restarting the machine was actually the first step I tried (apologies for failing to mention it). Ultimately, I had no success with that approach.
I know you mentioned it seems like it isn’t even picking up or recognizing the postgres db. If you run fly postgres list, does anything show up? Here’s the doc for that as well: Backup, Restores, & Snapshots · Fly Docs
Error: failed to restart machine d568352a1d748e: could not stop machine d568352a1d748e: failed to restart VM d568352a1d748e: failed_precondition: machine still active, refusing to start
fly status -a ***-db
returns
ID STATE ROLE REGION CHECKS IMAGE CREATED UPDATED
d568352a1d748e stopped error ewr 3 total, 3 warning flyio/postgres:14.6 (v0.0.41) 2023-01-23T20:33:53Z 2023-09-13T21:26:36Z
fly machine status d568352a1d748e --app ***-db
returns
Machine ID: d568352a1d748e
Instance ID: 01HA6DFCVF742DS1EW3M3BAN6Y
State: stopped
VM
ID = d568352a1d748e
Instance ID = 01HA6DFCVF742DS1EW3M3BAN6Y
State = stopped
Image = flyio/postgres:14.6 (v0.0.41)
Name = solitary-meadow-2473
Private IP = fdaa:0:aa2b:a7b:ab3:e969:31e8:2
Region = ewr
Process Group = app
CPU Kind = shared
vCPUs = 1
Memory = 256
Created = 2023-01-23T20:33:53Z
Updated = 2023-09-13T21:36:30Z
Entrypoint =
Command =
Volume = vol_52en7r18289vk6yx
Event Logs
STATE EVENT SOURCE TIMESTAMP INFO
stopped exit flyd 2023-09-13T17:36:30.46-04:00 exit_code=2,oom_killed=false,requested_stop=false
started start flyd 2023-09-13T17:36:26.473-04:00
starting start flyd 2023-09-13T17:36:25.858-04:00
stopped exit flyd 2023-09-13T17:36:14.963-04:00 exit_code=2,oom_killed=false,requested_stop=false
started start flyd 2023-09-13T17:36:11.509-04:00
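In the event log above, each start is followed by an exit within about four seconds, with exit_code=2 and oom_killed=false. The following is a minimal sketch for summarizing such a log, assuming the output of fly machine status is piped in on stdin and that the INFO column keeps the key=value format shown above. The function names are made up for illustration.

```shell
# Helpers that read `fly machine status` output from stdin.
# Assumption: exit events carry an INFO field such as
#   exit_code=2,oom_killed=false,requested_stop=false

# Count exits that nobody asked for (a sign of a crash loop).
count_unrequested_exits() {
  grep -c 'requested_stop=false' || true
}

# Count exits caused by the out-of-memory killer.
count_oom_kills() {
  grep -c 'oom_killed=true' || true
}

# List the distinct exit codes seen in the log.
exit_codes() {
  sed -n 's/.*exit_code=\([0-9][0-9]*\),.*/\1/p' | sort -u
}

# Example (not executed here):
#   fly machine status <machine-id> --app <app-name> | count_unrequested_exits
```

If the OOM count is zero while the unrequested exits keep growing, memory is less likely to be the cause, which is consistent with what the log above shows.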
fly postgres list
does return the database…
NAME OWNER STATUS LATEST DEPLOY
***-db personal suspended
however, fly volumes list -a ***-db only returns the volume (vol_5vg87om6pk962kp4) created yesterday when I tried to scale the machines. I cannot list or perform any operations on the older volume (vol_52en7r18289vk6yx) containing the DB data.
I’m only able to see vol_52en7r18289vk6yx via the UI but cannot perform any actions against it or its snapshots.
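The mismatch described here (the machine references a volume that the CLI does not list) can be checked mechanically. This is a rough sketch, assuming fly machine status prints a line of the form Volume = <id> as in the output above, and that fly volumes list prints volume IDs as whitespace-separated words. Both function names are hypothetical.

```shell
# Print the volume ID attached to a machine.
# Reads `fly machine status` output from stdin and looks for "Volume = <id>".
attached_volume() {
  awk '$1 == "Volume" && $2 == "=" { print $3 }'
}

# Succeed if the volume ID given as $1 appears in `fly volumes list`
# output read from stdin; fail otherwise.
volume_is_listed() {
  grep -q -w -- "$1"
}

# Example (not executed here):
#   vol="$(fly machine status <machine-id> --app <app-name> | attached_volume)"
#   fly volumes list -a <app-name> | volume_is_listed "$vol" \
#     || echo "volume $vol is attached but not listed"
```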
Example:
fly volumes create pg_data \
--snapshot-id vs_p02D91PyaD4gwUe1Jb5 \
-a ***-db
returns
Error: failed creating volume: failed to create volume: EOF
There are a few directions I want to go in with this. As I’ve been looking into the issue further and looking at other similar posts… I think the best way to go about it would be from least intrusive to most intrusive when it comes to troubleshooting.
Are you on the latest flyctl version? If not, could you update it?
I have found that others whose postgres db was stuck in a “suspended” state fixed it by running fly machines start <machine-id> --app <app-name>. More info on that here. I know we tried restarting it already, but I’m wondering if just starting it would do something differently (you know technology can be weird sometimes)?
Lastly, going back to what you posted in the beginning, the last error message in the log:
I’m wondering if this is the real culprit, and you just need to scale your postgres db? If the things I listed above don’t help at all, I think this might be the route to take. It’s all documented here for you if so: Scale Postgres VMs · Fly Docs
Starting the machine results in the same crash-loop behavior.
My flyctl version is up to date.
Scaling the machine with more memory had no effect. The DB is currently using 181 MB of 1 GB, so I don’t think the issue is around resource limits. Unless of course the resource limit is lower than the volume size due to being on the hobby plan.
The other consistent error message I’m seeing is:
Health check for your postgres role has failed. Your cluster's membership is inconsistent.
Not sure if this is relevant at all though. I also find it disconcerting that I’m unable to manipulate the volume via flyctl.
Alright, I’ve got a few more things for you to try if you don’t mind. I appreciate you trying the previous steps. Hopefully this will work; I know it’s been a long one. This is going to be quite a bit of info, but here we go!
If you’re still able to access the running database at all, you can import your data into a new app running PG Flex. If connecting with fly postgres connect and running a pg_dump doesn’t work, you can try using the fly postgres import command to pull the data from your existing db to a new one. This might succeed even if the manual dump failed. More details on fly postgres import are available here.
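For the manual dump mentioned above, one possible route is to proxy the database port to your local machine and run pg_dump against it. The sketch below only prints the two commands so they can be reviewed before running anything. The local port 15432, the postgres user, and the function name are assumptions, and the app name, database name, and output file are placeholders. This only helps while the machine stays up long enough to accept connections.

```shell
# Print (do not run) the commands for a manual dump through a local proxy.
#   $1 = Fly app name of the database (placeholder)
#   $2 = database name to dump (placeholder)
#   $3 = output file for the dump (placeholder)
# Assumptions: local port 15432 is free, and the superuser is "postgres".
print_dump_commands() {
  # Terminal 1: forward local port 15432 to port 5432 on the app.
  printf 'fly proxy 15432:5432 -a %s\n' "$1"
  # Terminal 2: dump the database through the forwarded port.
  printf 'pg_dump -h localhost -p 15432 -U postgres -d %s -f %s\n' "$2" "$3"
}

# Example:
#   print_dump_commands <app-name> <database-name> dump.sql
```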
If that doesn’t work, then you can set your database into restoration mode. To do so:
SSH into the instance to remove the consul data and add a FLY_RESTORED_FROM environment variable (this can be set to any value). This triggers the restore logic, which will rebuild your consul configuration data. The consul endpoint can be found within the environment. This isn’t as risk-free as the first option, so I’d recommend creating a backup from your most recent snapshot first.
One thing I’ll note: I have been unable to do anything around the volume, I can’t list it or its snapshots, so I suspect I’ll be unable to restore anything from the volume. The UI shows it but the CLI does not.