Postgres health checks perpetually failing

atomicsheekz · September 13, 2022, 3:11pm

For context, I noticed yesterday that one of the database replicas had 2/3 health checks in a critical state, and it hasn’t resolved since. I tried restarting the app, but that replica falls back into the same 2/3 failing health checks. I believe some data was also lost yesterday when a user was trying to onboard onto the app.

The app has plenty of memory and volume size, so I don’t think those are the issues. One of the replicas is a few deploy versions behind but has 3 healthy checks. I tried upgrading the Postgres image to the newer Fly Postgres image this morning and then ran into the same health check issue.

Would love some support, if possible.

atomicsheekz · September 14, 2022, 5:07am

I ended up solving this by spinning up a new database container using a volume snapshot and attaching it to the app container. Still not entirely sure how this issue occurred in the first place, but hopefully that’s helpful context for anyone who runs into the same thing.

alexpls · March 2, 2023, 9:02am

I’m seeing the same thing. Running fly checks list -a [your_app] reveals more information about the specific health check that’s failing. In my case it’s:

  NAME | STATUS  | MACHINE        | LAST UPDATED | OUTPUT
-------*---------*----------------*--------------*--------------------------------------------------------
  pg   | passing | MACHINE_ID.... | 8m9s ago     | [✓] transactions: read/write (340.21µs)
       |         |                |              | [✓] connections: 16 used, 3 reserved, 80 max (6.73ms)
-------*---------*----------------*--------------*--------------------------------------------------------
  role | passing | MACHINE_ID.... | 8m3s ago     | leader
-------*---------*----------------*--------------*--------------------------------------------------------
  vm   | warning | MACHINE_ID.... | 8m32s ago    | waiting for status update
-------*---------*----------------*--------------*--------------------------------------------------------

… now to figure out why the vm is “waiting for status update”
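
For anyone who wants to pick out the non-passing rows automatically, here is a minimal sketch that parses pasted output in the table layout shown above. It assumes the columns stay pipe-separated in the order NAME, STATUS, MACHINE, LAST UPDATED, OUTPUT; the function name non_passing_checks and the SAMPLE text are made up for illustration and are not part of flyctl.

```python
def non_passing_checks(table_text):
    """Return (name, status, output) for each check whose status is not 'passing'.

    Assumes the pipe-separated layout pasted in this thread; this is not an
    official flyctl output format guarantee.
    """
    results = []
    for line in table_text.splitlines():
        if "|" not in line:
            continue  # separator rows ("-----*-----") and titles have no pipes
        # Split into at most 5 cells so a pipe inside OUTPUT is preserved.
        cells = [cell.strip() for cell in line.split("|", 4)]
        if len(cells) != 5:
            continue
        name, status, _machine, _updated, output = cells
        if not name or name == "NAME":
            continue  # header row, or a continuation row of the previous check
        if status != "passing":
            results.append((name, status, output))
    return results


# Sample rows copied from the output above (hypothetical variable name).
SAMPLE = """\
  NAME | STATUS  | MACHINE        | LAST UPDATED | OUTPUT
-------*---------*----------------*--------------*------------------------------
  pg   | passing | MACHINE_ID.... | 8m9s ago     | [✓] transactions: read/write (340.21µs)
       |         |                |              | [✓] connections: 16 used, 3 reserved, 80 max (6.73ms)
-------*---------*----------------*--------------*------------------------------
  role | passing | MACHINE_ID.... | 8m3s ago     | leader
-------*---------*----------------*--------------*------------------------------
  vm   | warning | MACHINE_ID.... | 8m32s ago    | waiting for status update
"""

if __name__ == "__main__":
    for name, status, output in non_passing_checks(SAMPLE):
        print(f"{name}: {status} ({output})")
```

With the sample above, only the vm row is reported, which matches what the table shows by eye.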

dangra · March 2, 2023, 4:08pm

Hi @alexpls, you have been bitten by a platform bug related to machine health checks. The good news is that we are working on a fix. The machine was running fine: the vm check was passing behind the scenes, it just wasn't showing its real status to the CLI.

I forced a check update and that helped propagate its real status; now it shows all green:

~$ fly checks list
Health Checks for mailgrip-prod-db
  NAME | STATUS  | MACHINE        | LAST UPDATED | OUTPUT
-------*---------*----------------*--------------*-------------------------------------------------------------------------
  pg   | passing | 69e784776c0834 | 11m16s ago   | [✓] transactions: read/write (547.92µs)
       |         |                |              | [✓] connections: 12 used, 3 reserved, 80 max (5.12ms)
-------*---------*----------------*--------------*-------------------------------------------------------------------------
  role | passing | 69e784776c0834 | 11m17s ago   | leader
-------*---------*----------------*--------------*-------------------------------------------------------------------------
  vm   | passing | 69e784776c0834 | 10m30s ago   | [✓] checkDisk: 36.87 GB (94.2%) free space on /data/ (51.4µs)
       |         |                |              | [✓] checkLoad: load averages: 0.04 0.06 0.02 (120.14µs)
       |         |                |              | [✓] memory: system spent 0s of the last 60s waiting on memory (71.3µs)
       |         |                |              | [✓] cpu: system spent 330ms of the last 60s waiting on cpu (59.34µs)
       |         |                |              | [✓] io: system spent 0s of the last 60s waiting on io (66.02µs)
-------*---------*----------------*--------------*-------------------------------------------------------------------------
