Health check for your postgres database has failed. Your database is malfunctioning

We are frequently encountering this issue on our unmanaged PostgreSQL server and would like to identify the root cause of the recurring health check failures.

Fly checks:

Check Status Output (Updated)
pg passing (updated 2026-01-09 07:01:12)
  [✓] connections: 178 used, 3 reserved, 300 max (551.58ms)
  [✓] cluster-locks: No active locks detected (30.22µs)
  [✓] disk-capacity: 20.0% - readonly mode will be enabled at 90.0% (25.12µs)
vm critical 500 Internal Server Error (updated 2026-01-08 08:22:02)
  [✓] checkDisk: 2.29 GB (80.0%) free space on /data/ (45.42µs)
  [✓] checkLoad: load averages: 0.49 0.33 0.25 (228.77µs)
  [✓] memory: system spent 0s of the last 60s waiting on memory (33.84µs)
  [✗] cpu: system spent 1.77s of the last 10 seconds waiting on cpu (23.14µs)
  [✓] io: system spent 0s of the last 60s waiting on io (21.19µs)
role passing primary
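For what it's worth, the passing connections check still leaves plenty of headroom. A quick sanity check on the numbers copied from the check output above:

```shell
# Values taken from the pg health check above.
used=178
reserved=3
max=300

# Connections still available to ordinary clients.
headroom=$((max - used - reserved))
echo "headroom: $headroom of $max"
```

So the failures are unlikely to be connection exhaustion during the passing window, though the later timeout tells a different story.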


I usually encounter this too.

In my case it's usually one of these:

  • shared CPU balance is exhausted
  • out of memory
  • out of disk
  • not enough CPU compute for the workload
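The first cause on that list is easy to spot from inside the guest: on shared-cpu Machines, an exhausted burst balance shows up as steal time. A minimal sketch (assuming a Linux guest, which Fly Machines are) reading the cumulative steal counter from /proc/stat:

```shell
# Fields after the "cpu" label in /proc/stat are:
# user nice system idle iowait irq softirq steal ...
# so the steal counter (time the hypervisor withheld the CPU) is field 9.
steal=$(awk '/^cpu /{print $9}' /proc/stat)
echo "cumulative steal jiffies: $steal"

# Sample this twice a few seconds apart: if the value is growing,
# the host is withholding CPU from this VM.
```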

I’d say that if you have to ask, it would be better to move to a managed service. DBA work is expert-level territory that most engineers (and all vibe-coders) should try to avoid.

For now, I dare say you will want to fix what you have. Consider posting these details in this thread:

  • How many machines are in the cluster
  • The disk space free/consumed on each one
  • Any useful graphs/metrics from Grafana

Your load averages (0.49 0.33 0.25) look fine CPU-wise, but it’s that same vm section that failed the check, and it’s not clear to me what is producing the 500 error.

Thanks for the pointers — sharing the details below:

Cluster size: 3 machines
Region: SIN
Machine size: shared-cpu-1x @ 1024MB on all nodes

Primary machine: 2/3 checks passing, occasionally dropping to 1/3
Other two machines: 3/3 checks passing

Disk space info (ran df -h over SSH):

Machine 1 (Primary):

Filesystem      Size  Used Avail Use% Mounted on
none            7.8G   29M  7.4G   1% /
/dev/vdb        7.8G   29M  7.4G   1% /.fly-upper-layer
shm             482M  1.1M  481M   1% /dev/shm
tmpfs           482M     0  482M   0% /sys/fs/cgroup
/dev/vdc        2.9G  418M  2.3G  16% /data

Machine 2:

Filesystem      Size  Used Avail Use% Mounted on
none            7.8G   29M  7.4G   1% /
/dev/vdb        7.8G   29M  7.4G   1% /.fly-upper-layer
shm             482M  1.1M  481M   1% /dev/shm
tmpfs           482M     0  482M   0% /sys/fs/cgroup
/dev/vdc        2.9G  418M  2.3G  16% /data

Machine 3:

Filesystem      Size  Used Avail Use% Mounted on
none            7.8G   29M  7.4G   1% /
/dev/vdb        7.8G   29M  7.4G   1% /.fly-upper-layer
shm             482M  1.1M  481M   1% /dev/shm
tmpfs           482M     0  482M   0% /sys/fs/cgroup
/dev/vdc        2.9G  385M  2.4G  14% /data
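All three /data volumes are well under the 90% readonly threshold that the disk-capacity check mentions. Parsing one of the pasted df lines confirms it (sample line copied from Machine 1 above):

```shell
# df line for /data, copied from Machine 1's output above.
line="/dev/vdc        2.9G  418M  2.3G  16% /data"

# Extract the Use% column (5th field) and strip the % sign.
pct=$(echo "$line" | awk '{gsub(/%/,"",$5); print $5}')

if [ "$pct" -lt 90 ]; then
  echo "$pct% used - below the 90% readonly threshold"
else
  echo "$pct% used - readonly mode would be enabled"
fi
```

So disk can be ruled out as the cause here.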

Grafana metrics (primary machine) attached below:

Are you using shared-cpu?

Below is the status from when the machine recently dropped to 1/3 checks passing:

Component Status Details
pg critical 500 Internal Server Error
✗ connections: Timed out (316.62ms)
- cluster-locks: Not processed
- disk-capacity: Not processed
vm critical 500 Internal Server Error
✓ checkDisk: 2.29 GB (80.0%) free space on /data
✓ checkLoad: load averages: 0.49 0.33 0.25
✓ memory: system spent 0s of the last 60s waiting on memory
✗ cpu: system spent 1.77s of the last 10 seconds waiting on cpu
✓ io: system spent 0s of the last 60s waiting on io
role passing primary
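The failing cpu line is the telling one: spending 1.77s of a 10s window waiting on CPU is roughly 18% throttling, which on a shared-cpu VM usually points at the burst balance being gone rather than high load (your load averages agree). Computing the fraction from the check text itself (line copied from above):

```shell
# cpu check line copied from the failing status above.
line="cpu: system spent 1.77s of the last 10 seconds waiting on cpu"

# Field 4 is the wait time ("1.77s"), field 8 is the window ("10").
pct=$(echo "$line" | awk '{gsub(/s$/,"",$4); printf "%.1f", 100 * $4 / $8}')
echo "waited on cpu for ${pct}% of the window"
```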

Try performance-cpu-2x first for a couple of minutes, then scale down to shared-cpu-4x.

shared-cpu-1x is really not enough for a real project with real users; for a DB use at least 2x CPU (Redis can get by on 1x). If it's a QA environment on a budget, just enable auto-stop on the machines.
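For the resize itself, Fly Postgres machines are (as far as I know) updated individually with flyctl rather than via fly scale. A sketch, where the app name and machine ID are placeholders you'd read from fly machine list:

```shell
# List the machines to get their IDs (app name is a placeholder):
fly machine list -a my-pg-app

# Resize one machine at a time, replicas before the primary:
fly machine update 1234567890abcd --vm-size shared-cpu-2x --vm-memory 2048 -a my-pg-app
```

Double-check the flag names against your flyctl version before running.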


Thanks… will try this and share the output.

Scaled up to shared-cpu-2x @ 2048MB and it seems like all the checks are passing now. Will keep an eye on the machine for a few more days.
