Health check for your postgres database has failed. Your database is malfunctioning

We are frequently encountering this issue on our unmanaged PostgreSQL server and would like to identify the root cause of the recurring health check failures.

Fly checks:

Check Status Output (Updated)
pg passing (updated 2026-01-09 07:01:12)
  [✓] connections: 178 used, 3 reserved, 300 max (551.58ms)
  [✓] cluster-locks: No active locks detected (30.22µs)
  [✓] disk-capacity: 20.0% - readonly mode will be enabled at 90.0% (25.12µs)
vm critical 500 Internal Server Error (updated 2026-01-08 08:22:02)
  [✓] checkDisk: 2.29 GB (80.0%) free space on /data/ (45.42µs)
  [✓] checkLoad: load averages: 0.49 0.33 0.25 (228.77µs)
  [✓] memory: system spent 0s of the last 60s waiting on memory (33.84µs)
  [✗] cpu: system spent 1.77s of the last 10 seconds waiting on cpu (23.14µs)
  [✓] io: system spent 0s of the last 60s waiting on io (21.19µs)
role passing primary
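For what it's worth, the passing connections check still leaves plenty of headroom. A quick sanity check on the numbers copied from the check output above:

```shell
# Values taken from the pg health check above.
used=178
reserved=3
max=300

# Connections still available to ordinary clients.
headroom=$((max - used - reserved))
echo "headroom: $headroom of $max"
```

So the failures are unlikely to be connection exhaustion during the passing window, though the later timeout tells a different story.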


I usually encounter this too.

In my case it's usually one of these:

  • shared CPU balance is exhausted
  • out of memory
  • out of disk
  • not enough CPU compute for the workload
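The first cause on that list is easy to spot from inside the guest: on shared-cpu Machines, an exhausted burst balance shows up as steal time. A minimal sketch (assuming a Linux guest, which Fly Machines are) reading the cumulative steal counter from /proc/stat:

```shell
# Fields after the "cpu" label in /proc/stat are:
# user nice system idle iowait irq softirq steal ...
# so the steal counter (time the hypervisor withheld the CPU) is field 9.
steal=$(awk '/^cpu /{print $9}' /proc/stat)
echo "cumulative steal jiffies: $steal"

# Sample this twice a few seconds apart: if the value is growing,
# the host is withholding CPU from this VM.
```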

I’d say that if you have to ask, it would be better to move to a managed service. DBA work is expert-level territory that most engineers (and all vibe-coders) should try to avoid.

For now, I dare say you will want to fix what you have. Consider posting these details in this thread:

  • How many machines are in the cluster
  • The disk space free/consumed on each one
  • Any useful graphs/metrics from Grafana

Your load averages (0.49 0.33 0.25) look fine CPU-wise, but it’s that same vm section that failed the check, and it’s not clear to me what is producing the 500 error.

Thanks for the pointers — sharing the details below:

Cluster size: 3 machines
Region: SIN
Machine size: shared-cpu-1x @ 1024MB on all nodes

Primary machine: 2/3 checks passing, occasionally dropping to 1/3
Other two machines: 3/3 checks passing

Disk space info (ran df -h over SSH):

Machine 1 (Primary):

Filesystem      Size  Used Avail Use% Mounted on
none            7.8G   29M  7.4G   1% /
/dev/vdb        7.8G   29M  7.4G   1% /.fly-upper-layer
shm             482M  1.1M  481M   1% /dev/shm
tmpfs           482M     0  482M   0% /sys/fs/cgroup
/dev/vdc        2.9G  418M  2.3G  16% /data

Machine 2:

Filesystem      Size  Used Avail Use% Mounted on
none            7.8G   29M  7.4G   1% /
/dev/vdb        7.8G   29M  7.4G   1% /.fly-upper-layer
shm             482M  1.1M  481M   1% /dev/shm
tmpfs           482M     0  482M   0% /sys/fs/cgroup
/dev/vdc        2.9G  418M  2.3G  16% /data

Machine 3:

Filesystem      Size  Used Avail Use% Mounted on
none            7.8G   29M  7.4G   1% /
/dev/vdb        7.8G   29M  7.4G   1% /.fly-upper-layer
shm             482M  1.1M  481M   1% /dev/shm
tmpfs           482M     0  482M   0% /sys/fs/cgroup
/dev/vdc        2.9G  385M  2.4G  14% /data
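All three /data volumes are well under the 90% readonly threshold that the disk-capacity check mentions. Parsing one of the pasted df lines confirms it (sample line copied from Machine 1 above):

```shell
# df line for /data, copied from Machine 1's output above.
line="/dev/vdc        2.9G  418M  2.3G  16% /data"

# Extract the Use% column (5th field) and strip the % sign.
pct=$(echo "$line" | awk '{gsub(/%/,"",$5); print $5}')

if [ "$pct" -lt 90 ]; then
  echo "$pct% used - below the 90% readonly threshold"
else
  echo "$pct% used - readonly mode would be enabled"
fi
```

So disk can be ruled out as the cause here.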

Grafana metrics (primary machine) attached below:

Are you using shared-cpu?

Below is the status from when the machine recently dropped to 1/3 checks passing:

Component Status Details
pg critical 500 Internal Server Error
✗ connections: Timed out (316.62ms)
- cluster-locks: Not processed
- disk-capacity: Not processed
vm critical 500 Internal Server Error
✓ checkDisk: 2.29 GB (80.0%) free space on /data
✓ checkLoad: load averages: 0.49 0.33 0.25
✓ memory: system spent 0s of the last 60s waiting on memory
✗ cpu: system spent 1.77s of the last 10 seconds waiting on cpu
✓ io: system spent 0s of the last 60s waiting on io
role passing primary
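The failing cpu line is the telling one: spending 1.77s of a 10s window waiting on CPU is roughly 18% throttling, which on a shared-cpu VM usually points at the burst balance being gone rather than high load (your load averages agree). Computing the fraction from the check text itself (line copied from above):

```shell
# cpu check line copied from the failing status above.
line="cpu: system spent 1.77s of the last 10 seconds waiting on cpu"

# Field 4 is the wait time ("1.77s"), field 8 is the window ("10").
pct=$(echo "$line" | awk '{gsub(/s$/,"",$4); printf "%.1f", 100 * $4 / $8}')
echo "waited on cpu for ${pct}% of the window"
```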

Try performance-cpu-2x first for a couple of minutes, then scale down to shared-cpu-4x.

shared-cpu-1x is really not enough for a real project with real users; for a DB use at least 2x CPU (Redis can get by on 1x). If it's a QA environment on a budget, just enable auto-stop on the machines.
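For the resize itself, Fly Postgres machines are (as far as I know) updated individually with flyctl rather than via fly scale. A sketch, where the app name and machine ID are placeholders you'd read from fly machine list:

```shell
# List the machines to get their IDs (app name is a placeholder):
fly machine list -a my-pg-app

# Resize one machine at a time, replicas before the primary:
fly machine update 1234567890abcd --vm-size shared-cpu-2x --vm-memory 2048 -a my-pg-app
```

Double-check the flag names against your flyctl version before running.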


Thanks… will try this and share the output.

Scaled up to shared-cpu-2x @ 2048MB and it seems like all the checks are passing now. Will keep an eye on the machine for a few more days.
