Postgres health checks perpetually failing

For context, I noticed that one of the database replicas had 2/3 health checks that were critical yesterday and it hasn’t resolved since. I tried restarting the app but then it falls back into the same 2/3 health checks that are failing for that replica. I believe that some data was lost yesterday as well when a user was trying to onboard onto the app.

The app has plenty of memory and volume size, so I don’t think those are the issues. One of the replicas is a few deploy versions behind but has 3 healthy checks. I tried upgrading the Postgres image to the newer Fly Postgres image this morning and then ran into the same health check issue.

Would love some support, if possible.

I ended up solving this by spinning up a new database container using a volume snapshot and attaching it to the app container. Still not entirely sure how this issue occurred in the first place, but hopefully that’s helpful context for anyone who runs into the same thing.

I’m seeing the same thing. Running fly checks list -a [your_app] reveals more information about the specific health check that’s failing. In my case it’s:

  NAME | STATUS  | MACHINE        | LAST UPDATED | OUTPUT
-------*---------*----------------*--------------*--------------------------------------------------------
  pg   | passing | MACHINE_ID.... | 8m9s ago     | [✓] transactions: read/write (340.21µs)
       |         |                |              | [✓] connections: 16 used, 3 reserved, 80 max (6.73ms)
-------*---------*----------------*--------------*--------------------------------------------------------
  role | passing | MACHINE_ID.... | 8m3s ago     | leader
-------*---------*----------------*--------------*--------------------------------------------------------
  vm   | warning | MACHINE_ID.... | 8m32s ago    | waiting for status update
-------*---------*----------------*--------------*--------------------------------------------------------

… now to figure out why the vm is “waiting for status update” :slight_smile:

Hi @alexpls , you have been bitten a platform bug related to machine health checks, the good news is that we are working on a fix. The machine was running fine, the vm check was passing behind the scenes just not showing the real status to the CLI.

I forced a check update and that helped propagated its real status, now it shows all green:

~$ fly checks list
Health Checks for mailgrip-prod-db
  NAME | STATUS  | MACHINE        | LAST UPDATED | OUTPUT
-------*---------*----------------*--------------*-------------------------------------------------------------------------
  pg   | passing | 69e784776c0834 | 11m16s ago   | [✓] transactions: read/write (547.92µs)
       |         |                |              | [✓] connections: 12 used, 3 reserved, 80 max (5.12ms)
-------*---------*----------------*--------------*-------------------------------------------------------------------------
  role | passing | 69e784776c0834 | 11m17s ago   | leader
-------*---------*----------------*--------------*-------------------------------------------------------------------------
  vm   | passing | 69e784776c0834 | 10m30s ago   | [✓] checkDisk: 36.87 GB (94.2%) free space on /data/ (51.4µs)
       |         |                |              | [✓] checkLoad: load averages: 0.04 0.06 0.02 (120.14µs)
       |         |                |              | [✓] memory: system spent 0s of the last 60s waiting on memory (71.3µs)
       |         |                |              | [✓] cpu: system spent 330ms of the last 60s waiting on cpu (59.34µs)
       |         |                |              | [✓] io: system spent 0s of the last 60s waiting on io (66.02µs)
-------*---------*----------------*--------------*-------------------------------------------------------------------------
1 Like