When running fly status on our production Postgres database, one of the instances has 1 critical health check.
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
aacdaf1e app 9 cdg run running (replica) 3 total, 3 passing 18 14h1m ago
764e0221 app 9 cdg run running (leader) 3 total, 2 passing, 1 critical 3 2021-12-14T03:07:21Z
➜ fly checks list --app prod-db
Health Checks for prod-db
NAME STATUS ALLOCATION REGION TYPE LAST UPDATED OUTPUT
pg passing 103c54b0 cdg HTTP 4s ago HTTP GET
http://172.19.4.26:5500/flycheck/pg:
200 OK Output: "[✓]
transactions: read/write
(194.39µs)\n[✓] replicationLag:
fdaa:0:2c1d:a7b:5adc:0:69e0:2
is lagging 683ns (191ns)\n[✓]
connections: 130 used, 3 reserved,
300 max (3.4ms)"
vm critical 103c54b0 cdg HTTP 1m50s ago HTTP GET
http://172.19.4.26:5500/flycheck/vm:
500 Internal Server Error
Output: "[✓] checkDisk: 45.26 GB
(92.4%!)(MISSING) free space on
/data/ (32.08µs)\n[✓] checkLoad:
load averages: 0.40 0.61 0.71
(54.51µs)\n[✓] memory: system spent
0s of the last 60s waiting on memory
(23.1µs)\n[✗] cpu: system spent
1.13s of the last 10 seconds waiting
on cpu (15.63µs)\n[✓] io: system
spent 0s of the last 60s waiting on
io (13.03µs)"
role passing 103c54b0 cdg HTTP 15m5s ago leader
vm passing e9c06da7 cdg HTTP 2m8s ago HTTP GET
http://172.19.2.2:5500/flycheck/vm:
200 OK Output: "[✓] checkDisk:
45.31 GB (92.5%) free space on
/data/ (32.36µs)\n[✓] checkLoad:
load averages: 0.00 0.01 0.00
(45.54µs)\n[✓] memory: system
spent 0s of the last 60s waiting on
memory (22.69µs)\n[✓] cpu: system
spent 0s of the last 60s waiting on
cpu (13.95µs)\n[✓] io: system spent
0s of the last 60s waiting on io
(17.12µs)"
pg passing e9c06da7 cdg HTTP 2m36s ago HTTP GET
http://172.19.2.2:5500/flycheck/pg:
200 OK Output: "[✓] transactions:
readonly (291.35µs)\n[✓]
replication: syncing from
fdaa:0:2c1d:a7b:5bd4:0:69da:2
(112.21µs)\n[✓] connections: 10
used, 3 reserved, 300 max (3.09ms)"
role passing e9c06da7 cdg HTTP 11m37s ago replica
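The failing cpu sub-check ("system spent 1.13s of the last 10 seconds waiting on cpu") reads like Linux pressure stall information (PSI), which the kernel exposes in /proc/pressure/cpu. As a hedged sketch only (the PSI source and the ~10% threshold are my assumptions, not the check's documented behavior), a check along those lines could be evaluated like this:

```python
def parse_psi_some(line: str) -> float:
    """Extract avg10 (percent of the last 10s some tasks were stalled)
    from a PSI line such as: some avg10=11.30 avg60=4.20 avg300=1.10 total=..."""
    fields = dict(f.split("=") for f in line.split()[1:])
    return float(fields["avg10"])

def cpu_check(psi_line: str, threshold_pct: float = 10.0) -> bool:
    """Mimic the observed failure: 1.13s waited out of 10s is 11.3%,
    which would trip a 10% threshold. The threshold is hypothetical."""
    return parse_psi_some(psi_line) <= threshold_pct

sample = "some avg10=11.30 avg60=4.20 avg300=1.10 total=123456789"
print(cpu_check(sample))  # False: 11.3% stalled exceeds the assumed 10% cutoff
```

If this mapping is right, the check failing means Postgres tasks were runnable but waiting for a CPU for over a second out of every ten, which matches the load averages of 0.40/0.61/0.71 on the leader versus ~0 on the replica.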
@yannis If you’re experiencing performance issues and you’re looking for a quick fix, then that would be your best bet. However, I would also consider digging into why your CPU utilization is high.
@yannis also worth running fly image show on the postgres cluster to make sure it’s recent. We made some changes to that particular health check, so a fly image update might fix it.
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
f6efc237 app 14 cdg run running (replica) 3 total, 3 passing 0 7m15s ago
cb17d424 app 14 cdg run running (replica) 3 total, 2 passing, 1 critical 0 9m15s ago
Health Checks for prod-db
NAME STATUS ALLOCATION REGION TYPE LAST UPDATED OUTPUT
pg passing f6efc237 cdg HTTP 1m30s ago HTTP GET
http://172.19.1.106:5500/flycheck/pg:
200 OK Output: "[✓] transactions:
readonly (222.74µs)\n[✓]
replication: syncing from
fdaa:0:2c1d:a7b:5bd4:0:69da:2
(127.53µs)\n[✓] connections: 11 used,
3 reserved, 300 max (3.69ms)"
role passing f6efc237 cdg HTTP 5m0s ago replica
vm passing f6efc237 cdg HTTP 6m39s ago HTTP GET
http://172.19.1.106:5500/flycheck/vm:
200 OK Output: "[✓] checkDisk:
46.4 GB (94.8%) free space on
/data/ (39.95µs)\n[✓] checkLoad:
load averages: 0.00 0.00 0.00
(78.28µs)\n[✓] memory: system spent
0s of the last 60s waiting on memory
(34.1µs)\n[✓] cpu: system spent
132ms of the last 60s waiting on
cpu (25.2µs)\n[✓] io: system spent
168ms of the last 60s waiting on io
(32.56µs)"
role passing cb17d424 cdg HTTP 14s ago leader
pg passing cb17d424 cdg HTTP 1m23s ago HTTP GET
http://172.19.4.10:5500/flycheck/pg:
200 OK Output: "[✓]
transactions: read/write
(279.51µs)\n[✓] replicationLag:
fdaa:0:2c1d:a7b:5adc:0:69e0:2
is lagging 942ns (210ns)\n[✓]
connections: 112 used, 3 reserved,
300 max (3.45ms)"
vm critical cb17d424 cdg HTTP 5m16s ago HTTP GET
http://172.19.4.10:5500/flycheck/vm:
500 Internal Server Error
Output: "[✓] checkDisk: 45.23 GB
(92.4%!)(MISSING) free space on
/data/ (28.02µs)\n[✓] checkLoad:
load averages: 0.05 0.16 0.40
(42.56µs)\n[✓] memory: system spent
0s of the last 60s waiting on memory
(22.59µs)\n[✗] cpu: system spent
1.28s of the last 10 seconds waiting
on cpu (15.94µs)\n[✗] io: system
spent 1.13s of the last 10 seconds
waiting on io (14.81µs)"
It seems very strange to have no leader but 2 replicas.
App
Name = prod-db
Owner = ulysse
Version = 14
Status = running
Hostname = prod-db.fly.dev
Deployment Status
ID = 8262a5d4-aed0-ab37-719c-e18bb51a39e6
Version = v14
Status = successful
Description = Deployment completed successfully
Instances = 2 desired, 2 placed, 2 healthy, 0 unhealthy
Instances
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
f6efc237 app 14 cdg run running (replica) 3 total, 3 passing 0 9m56s ago
cb17d424 app 14 cdg run running (leader) 3 total, 2 passing, 1 critical 0 11m56s ago
The upgrade doesn’t fix the health check; we will have to investigate further.
We can probably tune that health check more, but the failure likely means there’s not enough CPU for what Postgres needs to do. If you’re comfortable with the performance, you can ignore it for now.
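One generic way to "tune" a check like this is to debounce it: only flag critical after several consecutive bad windows, so a single 10-second CPU spike doesn't flip the status. This is purely illustrative (the class, the window count, and the states are hypothetical, not how flycheck is actually configured):

```python
from collections import deque

class DebouncedCheck:
    """Report critical only after n consecutive failing samples.
    Illustrative sketch of a debounce pattern, not Fly's implementation."""
    def __init__(self, n: int = 3):
        self.recent = deque(maxlen=n)  # sliding window of recent pass/fail results

    def observe(self, passed: bool) -> str:
        self.recent.append(passed)
        # Critical only when the window is full and every sample failed.
        if len(self.recent) == self.recent.maxlen and not any(self.recent):
            return "critical"
        return "passing"

check = DebouncedCheck(n=3)
results = [check.observe(p) for p in [False, False, True, False, False, False]]
print(results)  # only the final sample, the third consecutive failure, is critical
```

The trade-off is slower detection of genuine sustained CPU starvation in exchange for fewer false alarms from transient load.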