We are frequently encountering this issue on our unmanaged PostgreSQL server and would like to identify the root cause of the recurring health check failures.
Fly checks:

| Check | Status | Output | Updated |
| --- | --- | --- | --- |
| pg | passing | [✓] connections: 178 used, 3 reserved, 300 max (551.58ms) [✓] cluster-locks: No active locks detected (30.22µs) [✓] disk-capacity: 20.0% - readonly mode will be enabled at 90.0% (25.12µs) | 2026-01-09 07:01:12 |
| vm | critical | 500 Internal Server Error [✓] checkDisk: 2.29 GB (80.0%) free space on /data/ (45.42µs) [✓] checkLoad: load averages: 0.49 0.33 0.25 (228.77µs) [✓] memory: system spent 0s of the last 60s waiting on memory (33.84µs) [✗] cpu: system spent 1.77s of the last 10 seconds waiting on cpu (23.14µs) [✓] io: system spent 0s of the last 60s waiting on io (21.19µs) | |
I’d say that if you have to ask, it would be better to move to a managed service. DBA work is expert-level work that most engineers (and all vibe-coders) should try to avoid.
For now, I dare say you will want to fix what you have. Consider posting these details in this thread:
- How many machines are in the cluster
- The disk space free/consumed on each one
- Any useful graphs/metrics from Grafana
Your load averages (0.49 0.33 0.25) look fine, i.e. the CPU does not appear to be busy, yet the failed check is in that same VM section. It’s not clear to me what is producing that 500 error.
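One possible reading, offered as a guess: the only [✗] item in the VM output is cpu, and its wording ("spent 1.77s of the last 10 seconds waiting on cpu") looks like Linux pressure stall information (PSI) rather than load average. Assuming the check is derived from the avg10 field of /proc/pressure/cpu (I have not confirmed that this is how the Fly check computes it), 1.77s out of 10s would correspond to roughly 17.7% stall time. A minimal Python sketch of that arithmetic, using a hypothetical sample line whose avg10 value was chosen to match:

```python
def parse_psi_line(line: str) -> dict:
    """Parse one line of a PSI file such as /proc/pressure/cpu.

    The kernel format is: 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'
    """
    kind, *fields = line.split()
    parsed = {"kind": kind}
    for field in fields:
        key, _, raw = field.partition("=")
        parsed[key] = float(raw)
    return parsed


def stalled_seconds(avg_percent: float, window_seconds: int) -> float:
    """Convert a PSI percentage over a window into seconds spent stalled."""
    return avg_percent / 100.0 * window_seconds


def read_cpu_pressure(path: str = "/proc/pressure/cpu") -> dict:
    """Read the 'some' line from the PSI file (Linux only, PSI must be enabled)."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("some"):
                return parse_psi_line(line)
    raise ValueError("no 'some' line found in " + path)


# Hypothetical sample, NOT taken from the machine in this thread.
# avg10=17.70 is picked so that it reproduces the 1.77s in the check output.
sample = "some avg10=17.70 avg60=3.10 avg300=0.80 total=123456"
psi = parse_psi_line(sample)
print(round(stalled_seconds(psi["avg10"], 10), 2))  # 1.77
```

If that assumption holds, calling read_cpu_pressure() on the machine itself would show the live numbers. A machine could plausibly report CPU pressure like this while its load averages stay low, which would fit the output above, but that is speculation until someone confirms how the check is implemented.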
The status below is from when the machine recently dropped to 1/3 checks:

| Component | Status | Details |
| --- | --- | --- |
| pg | critical | 500 Internal Server Error ✗ connections: Timed out (316.62ms) - cluster-locks: Not processed - disk-capacity: Not processed |
| vm | critical | 500 Internal Server Error ✓ checkDisk: 2.29 GB (80.0%) free space on /data ✓ checkLoad: load averages: 0.49 0.33 0.25 ✓ memory: system spent 0s of the last 60s waiting on memory ✗ cpu: system spent 1.77s of the last 10 seconds waiting on cpu ✓ io: system spent 0s of the last 60s waiting on io |
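Since these check strings are long and easy to misread, here is a small Python sketch that splits one into individual items and lists the ones that did not pass. The `parse_checks` helper and its regular expression are my own illustration, built only from the two output formats pasted in this thread ("[✓] name: ..." and "✓ name: ...", with "- name: ..." for items that were not processed); it is not part of any Fly tooling.

```python
import re

# Matches "[✓] name: ", "✗ name: " or "- name: " at the start of an item.
ITEM = re.compile(r"(\[✓\]|\[✗\]|✓|✗|-)\s+([A-Za-z][\w-]*):\s*")
STATUS = {"✓": "pass", "✗": "fail", "-": "skipped"}


def parse_checks(output: str) -> list:
    """Return (name, status, detail) tuples for each item in a check string."""
    matches = list(ITEM.finditer(output))
    results = []
    for i, match in enumerate(matches):
        # An item's detail runs until the next item begins (or the string ends).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(output)
        marker = match.group(1).strip("[]")
        detail = output[match.end():end].strip()
        results.append((match.group(2), STATUS[marker], detail))
    return results


pg = ("500 Internal Server Error ✗ connections: Timed out (316.62ms) "
      "- cluster-locks: Not processed - disk-capacity: Not processed")
vm = ("500 Internal Server Error ✓ checkDisk: 2.29 GB (80.0%) free space on /data "
      "✓ checkLoad: load averages: 0.49 0.33 0.25 "
      "✓ memory: system spent 0s of the last 60s waiting on memory "
      "✗ cpu: system spent 1.77s of the last 10 seconds waiting on cpu "
      "✓ io: system spent 0s of the last 60s waiting on io")

for label, text in (("pg", pg), ("vm", vm)):
    not_passing = [(n, s) for n, s, _ in parse_checks(text) if s != "pass"]
    print(label, not_passing)
```

Run against the two strings above, this reports connections as failed (with cluster-locks and disk-capacity skipped) for pg, and cpu as the only failed item for vm.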
Try a performance 2x CPU first for a couple of minutes, then scale down to a shared 4x CPU.
A shared 1x CPU is really not enough for a real project with real users; for the database, use at least a 2x CPU.
Redis can stay on a 1x CPU. If this is a QA environment on a budget, just enable machine auto-stop.