Postgres application fails to start

My postgres application, which has been running for close to a year now without major incident, suddenly stopped working.

The monitoring logs show the following errors.

2023-09-13T04:28:20.748 app[d568352a1d748e] ewr [info] exporter | INFO[0000] Starting Server: :9187 source="postgres_exporter.go:1837"

2023-09-13T04:28:20.767 app[d568352a1d748e] ewr [info] proxy | [WARNING] 255/042820 (538) : parsing [/fly/haproxy.cfg:38]: Missing LF on last line, file might have been truncated at position 96. This will become a hard error in HAProxy 2.3.

2023-09-13T04:28:20.776 app[d568352a1d748e] ewr [info] proxy | [NOTICE] 255/042820 (538) : New worker #1 (562) forked

2023-09-13T04:28:20.780 app[d568352a1d748e] ewr [info] proxy | [WARNING] 255/042820 (562) : bk_db/pg1 changed its IP from (none) to fdaa:0:aa2b:a7b:ab3:e969:31e8:2 by flydns/dns1.

2023-09-13T04:28:20.780 app[d568352a1d748e] ewr [info] proxy | [WARNING] 255/042820 (562) : Server bk_db/pg1 ('ewr.xxx-db.internal') is UP/READY (resolves again).

2023-09-13T04:28:20.780 app[d568352a1d748e] ewr [info] proxy | [WARNING] 255/042820 (562) : Server bk_db/pg1 administratively READY thanks to valid DNS answer.

2023-09-13T04:28:20.780 app[d568352a1d748e] ewr [info] proxy | [WARNING] 255/042820 (562) : bk_db/pg2 changed its IP from (none) to fdaa:0:aa2b:a7b:cc:8001:b376:2 by DNS cache.

2023-09-13T04:28:20.780 app[d568352a1d748e] ewr [info] proxy | [WARNING] 255/042820 (562) : Server bk_db/pg2 ('ewr.xxx-db.internal') is UP/READY (resolves again).

2023-09-13T04:28:20.780 app[d568352a1d748e] ewr [info] proxy | [WARNING] 255/042820 (562) : Server bk_db/pg2 administratively READY thanks to valid DNS answer.

2023-09-13T04:28:21.010 app[d568352a1d748e] ewr [info] keeper | 2023-09-13T04:28:21.009Z ERROR	cmd/keeper.go:811 error retrieving cluster data {"error": "invalid character '\\x00' in string literal"}

2023-09-13T04:28:21.017 app[d568352a1d748e] ewr [info] keeper | 2023-09-13T04:28:21.017Z ERROR	cmd/keeper.go:719 cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5433: connect: no such file or directory"}

2023-09-13T04:28:21.061 app[d568352a1d748e] ewr [info] keeper | 2023-09-13T04:28:21.061Z ERROR	cmd/keeper.go:1041 error retrieving cluster data {"error": "invalid character '\\x00' in string literal"}

2023-09-13T04:28:21.066 app[d568352a1d748e] ewr [info] sentinel | 2023-09-13T04:28:21.066Z ERROR	cmd/sentinel.go:1852 error retrieving cluster data {"error": "invalid character '\\x00' in string literal"}

2023-09-13T04:28:21.661 app[d568352a1d748e] ewr [info] checking stolon status

2023-09-13T04:28:21.832 app[d568352a1d748e] ewr [info] panic: error checking stolon status: cannot get cluster data: invalid character '\x00' in string literal

2023-09-13T04:28:21.832 app[d568352a1d748e] ewr [info] : exit status 1

2023-09-13T04:28:21.832 app[d568352a1d748e] ewr [info] goroutine 9 [running]:

2023-09-13T04:28:21.832 app[d568352a1d748e] ewr [info] main.main.func2(0xc0000d0000, 0xc000084a00)

2023-09-13T04:28:21.832 app[d568352a1d748e] ewr [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:81 +0x72c

2023-09-13T04:28:21.832 app[d568352a1d748e] ewr [info] created by main.main

2023-09-13T04:28:21.832 app[d568352a1d748e] ewr [info] /go/src/github.com/fly-examples/postgres-ha/cmd/start/main.go:72 +0x43b

2023-09-13T04:28:22.260 health[d568352a1d748e] ewr [info] Health check for your postgres vm is now passing.

2023-09-13T04:28:22.598 app[d568352a1d748e] ewr [info] Starting clean up.

2023-09-13T04:28:22.598 app[d568352a1d748e] ewr [info] Umounting /dev/vdb from /data

2023-09-13T04:28:23.602 app[d568352a1d748e] ewr [info] [ 3.117484] reboot: Restarting system

2023-09-13T04:28:31.261 health[d568352a1d748e] ewr [error] Health check for your postgres vm has failed. Your instance has hit resource limits. Upgrading your instance / volume size or reducing your usage might help. 

Seems similar to the issue reported in “Postgres looping with panic: error checking stolon status”.

I’ve tried restarting, and I’ve tried creating a new machine and attaching the existing volume (vol_52en7r18289vk6yx); however, the volume apparently cannot be found.
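For reference, the attach attempt was shaped roughly like this (flags approximate; app name redacted; the image tag matches the existing machine):

fly machine run flyio/postgres:14.6 \
  -a app-name -r ewr \
  -v vol_52en7r18289vk6yx:/data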

I’ll add that I cannot perform any action around this volume:

  • Does not appear in fly volumes list -a app-name
  • fly volumes snapshots list vol_52en7r18289vk6yx says there are no snapshots

I can only see this volume in the UI.

Quite at a loss at this point. Any direction or advice would be greatly appreciated. Cheers!

Hey there,

I see you’re going through some postgres issues. Thanks for taking some troubleshooting steps on your own and letting us know what’s not working. Have you tried restarting your machine, by any chance?

Here’s the doc on how to do so: fly machine restart · Fly Docs
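For example, with the machine ID from your logs (the app name is a placeholder):

fly machine restart d568352a1d748e -a <your-app-name>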

I found a post that is similar to what you are going through, and it looks like someone else fixed their issue by restarting their machine. Here’s that post for reference: PostgreSQL DB resource limits reached

Hi Kaelyn, thanks for your response. Restarting the machine was actually the first step I tried (apologies for failing to mention it). Ultimately, I had no success with that approach.

Any other suggestions?

Oh, no worries at all!! When you tried to restart, did it spit out an error?

Below is the requested info:


fly machine restart d568352a1d748e -a ***-db

returns

Error: failed to restart machine d568352a1d748e: could not stop machine d568352a1d748e: failed to restart VM d568352a1d748e: failed_precondition: machine still active, refusing to start

fly status -a ***-db

returns

ID            	STATE  	ROLE 	REGION	CHECKS            	IMAGE                        	CREATED             	UPDATED
d568352a1d748e	stopped	error	ewr   	3 total, 3 warning	flyio/postgres:14.6 (v0.0.41)	2023-01-23T20:33:53Z	2023-09-13T21:26:36Z

fly machine status d568352a1d748e --app ***-db

returns

Machine ID: d568352a1d748e
Instance ID: 01HA6DFCVF742DS1EW3M3BAN6Y
State: stopped

VM
  ID            = d568352a1d748e
  Instance ID   = 01HA6DFCVF742DS1EW3M3BAN6Y
  State         = stopped
  Image         = flyio/postgres:14.6 (v0.0.41)
  Name          = solitary-meadow-2473
  Private IP    = fdaa:0:aa2b:a7b:ab3:e969:31e8:2
  Region        = ewr
  Process Group = app
  CPU Kind      = shared
  vCPUs         = 1
  Memory        = 256
  Created       = 2023-01-23T20:33:53Z
  Updated       = 2023-09-13T21:36:30Z
  Entrypoint    =
  Command       =
  Volume        = vol_52en7r18289vk6yx

Event Logs
STATE   	EVENT	SOURCE	TIMESTAMP                    	INFO
stopped 	exit 	flyd  	2023-09-13T17:36:30.46-04:00 	exit_code=2,oom_killed=false,requested_stop=false
started 	start	flyd  	2023-09-13T17:36:26.473-04:00
starting	start	flyd  	2023-09-13T17:36:25.858-04:00
stopped 	exit 	flyd  	2023-09-13T17:36:14.963-04:00	exit_code=2,oom_killed=false,requested_stop=false
started 	start	flyd  	2023-09-13T17:36:11.509-04:00

fly postgres list

does return the database…

NAME      	OWNER   	STATUS   	LATEST DEPLOY
***-db	personal	suspended

However, fly volumes list -a ***-db only returns the volume (vol_5vg87om6pk962kp4) created yesterday when I tried to scale the machines; I cannot list or perform any operations on the older volume (vol_52en7r18289vk6yx) containing the DB data.

I’m only able to see vol_52en7r18289vk6yx via the UI but cannot perform any actions against it or its snapshots.

Example:

fly volumes create pg_data \
  --snapshot-id vs_p02D91PyaD4gwUe1Jb5 \
  -a ***-db

returns

Error: failed creating volume: failed to create volume: EOF

Thank you so much for providing all that info!

There are a few directions I want to go with this. Having looked into the issue further and read other similar posts, I think the best way to go about troubleshooting is from least intrusive to most intrusive.

  1. Are you up to date on your flyctl version? If not, could you update it?

  2. I have found others whose postgres db was in a “suspended state” fix it by running fly machines start <machine-id> --app <app-name>. More info on that here. I know we already tried restarting, but I’m wondering if there’s something that just “starting it” would do differently (you know technology can be weird sometimes :sweat_smile:)?

  3. Lastly, going back to what you posted in the beginning, the last error message in the log:

Health check for your postgres vm has failed. Your instance has hit resource limits. Upgrading your instance / volume size or reducing your usage might help.

  • I’m wondering if this is the real culprit and you just need to scale your postgres db. If the things I listed above don’t help whatsoever, I think this might be the route to take. It’s all documented here for you if so: Scale Postgres VMs · Fly Docs. There’s a rough sketch of steps 2 and 3 right after this list.
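To sketch steps 2 and 3 using the machine ID from your logs (the memory value below is only an example; the scaling doc covers the supported flags):

fly machines start d568352a1d748e --app <app-name>

fly machine update d568352a1d748e --memory 1024 --app <app-name>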

Hope this helps! If not please do reach back out.

Thanks for all the suggestions, Kaelyn!

  1. Starting the machine results in the same crash-loop behavior.

  2. My flyctl version is up to date.

  3. Scaling the machine with more memory had no effect. The DB is currently using 181 MB of 1 GB, so I don’t think the issue is resource limits, unless of course the resource limit is lower than the volume size due to being on the hobby plan.

The other consistent error message I’m seeing is:

Health check for your postgres role has failed. Your cluster's membership is inconsistent.

Not sure if this is relevant at all though. I also find it disconcerting that I’m unable to manipulate the volume via flyctl.

Alright, I’ve got a few more things for you to try if you don’t mind. I appreciate you trying the previous steps. Hopefully this will work; I know it’s been a long one. This is going to be quite a bit of info, but here we go!

If you’re still able to access the running database at all, you can import your data into a new app running PG Flex. If connecting with fly postgres connect and running a pg_dump doesn’t work, you can try the fly postgres import command to pull the data from your existing db into a new one. This might succeed even if the manual dump failed. More details on fly postgres import are available here.
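A rough sketch of that path (the new app name is hypothetical, and the connection string will depend on your hostname and credentials):

fly postgres create --name my-new-db

fly postgres import -a my-new-db \
  "postgres://postgres:<password>@<old-db-hostname>:5432/postgres"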

If that doesn’t work, you can put your database into restoration mode. To do so:

  1. Scale your app down to a single machine

  2. SSH into the instance to remove the consul data and add a FLY_RESTORED_FROM environment variable (this can be set to any value). This triggers the restore logic, which will rebuild your consul configuration data. The consul endpoint can be found within the environment. This isn’t as risk-free as the first option, so I’d recommend creating a backup from your most recent snapshot first. There’s a rough sketch of these steps right after this list.
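Very roughly, something like this (the env-var value is arbitrary, and the exact consul cleanup depends on the image, so treat it as a sketch rather than a recipe):

fly scale count 1 -a <app-name>

fly ssh console -a <app-name> -C "env | grep -i consul"

fly machine update d568352a1d748e -a <app-name> \
  -e FLY_RESTORED_FROM=manual-restore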

I’ll give this a try and report back. Thank you!

One thing I’ll note: I have been unable to do anything with the volume; I can’t list it or its snapshots, so I suspect I’ll be unable to restore anything from it. The UI shows it but the CLI does not.
