PosgreSQL on Fly: 1 critical health check

Hello !

When running fly status on our production postgres database one of the instance has 1 critical health check.

Instances
ID       PROCESS VERSION REGION DESIRED STATUS            HEALTH CHECKS                  RESTARTS CREATED
aacdaf1e app     9       cdg    run     running (replica) 3 total, 3 passing             18       14h1m ago
764e0221 app     9       cdg    run     running (leader)  3 total, 2 passing, 1 critical 3        2021-12-14T03:07:21Z

But I don’t see anything in the logs. Any ideas?

Hey @yannis,

Could you try running fly checks list --app <app-name>. This will give some more information on what may going on.

➜  fly checks list --app prod-db
Health Checks for prod-db
NAME STATUS   ALLOCATION REGION TYPE LAST UPDATED OUTPUT
pg   passing  103c54b0   cdg    HTTP 4s ago       HTTP GET
                                                  http://172.19.4.26:5500/flycheck/pg:
                                                  200 OK Output: "[✓]
                                                  transactions: read/write
                                                  (194.39µs)\n[✓] replicationLag:
                                                  fdaa:0:2c1d:a7b:5adc:0:69e0:2
                                                  is lagging 683ns (191ns)\n[✓]
                                                  connections: 130 used, 3 reserved,
                                                  300 max (3.4ms)"
vm   critical 103c54b0   cdg    HTTP 1m50s ago    HTTP GET
                                                  http://172.19.4.26:5500/flycheck/vm:
                                                  500 Internal Server Error
                                                  Output: "[✓] checkDisk: 45.26 GB
                                                  (92.4%!)(MISSING) free space on
                                                  /data/ (32.08µs)\n[✓] checkLoad:
                                                  load averages: 0.40 0.61 0.71
                                                  (54.51µs)\n[✓] memory: system spent
                                                  0s of the last 60s waiting on memory
                                                  (23.1µs)\n[✗] cpu: system spent
                                                  1.13s of the last 10 seconds waiting
                                                  on cpu (15.63µs)\n[✓] io: system
                                                  spent 0s of the last 60s waiting on
                                                  io (13.03µs)"
role passing  103c54b0   cdg    HTTP 15m5s ago    leader
vm   passing  e9c06da7   cdg    HTTP 2m8s ago     HTTP GET
                                                  http://172.19.2.2:5500/flycheck/vm:
                                                  200 OK Output: "[✓] checkDisk:
                                                  45.31 GB (92.5%) free space on
                                                  /data/ (32.36µs)\n[✓] checkLoad:
                                                  load averages: 0.00 0.01 0.00
                                                  (45.54µs)\n[✓] memory: system
                                                  spent 0s of the last 60s waiting on
                                                  memory (22.69µs)\n[✓] cpu: system
                                                  spent 0s of the last 60s waiting on
                                                  cpu (13.95µs)\n[✓] io: system spent
                                                  0s of the last 60s waiting on io
                                                  (17.12µs)"
pg   passing  e9c06da7   cdg    HTTP 2m36s ago    HTTP GET
                                                  http://172.19.2.2:5500/flycheck/pg:
                                                  200 OK Output: "[✓] transactions:
                                                  readonly (291.35µs)\n[✓]
                                                  replication: syncing from
                                                  fdaa:0:2c1d:a7b:5bd4:0:69da:2
                                                  (112.21µs)\n[✓] connections: 10
                                                  used, 3 reserved, 300 max (3.09ms)"
role passing  e9c06da7   cdg    HTTP 11m37s ago   replica

@shaun I scaled the volumes from 10 go to 50 go but I still have

500 Internal Server Error
                                                  Output: "[✓] checkDisk: 45.26 GB
                                                  (92.4%!)(MISSING) free space on
                                                  /data/ (32.08µs)\n[✓]

I admit that the check output is not super easy on the eyes, however, it looks like it’s the CPU check that’s failing, not your disk check.

[✗] cpu: system spent 1.13s of the last 10 seconds waiting on cpu (15.63µs)

Yes in the rush I read “MISSING free space on /data/” :sweat_smile:
@shaun Does scaling up to dedicated-cpu-4x can solve the issue?

@yannis If you’re experiencing performance issues and you’re looking for a quick fix, then that would be your best bet. However, I would also consider digging into why your CPU utilization is high.

Some troubleshooting ideas:

1 Like

@yannis also worth running fly image show on the postgres cluster to make sure it’s recent. We made some changes to that particular health check, so a fly image update might fix it.

1 Like

@kurt We just updated the image.

Instances
ID       PROCESS VERSION REGION DESIRED STATUS            HEALTH CHECKS                  RESTARTS CREATED
f6efc237 app     14      cdg    run     running (replica) 3 total, 3 passing             0        7m15s ago
cb17d424 app     14      cdg    run     running (replica) 3 total, 2 passing, 1 critical 0        9m15s ago
Health Checks for prod-db
NAME STATUS   ALLOCATION REGION TYPE LAST UPDATED OUTPUT
pg   passing  f6efc237   cdg    HTTP 1m30s ago    HTTP GET
                                                  http://172.19.1.106:5500/flycheck/pg:
                                                  200 OK Output: "[✓] transactions:
                                                  readonly (222.74µs)\n[✓]
                                                  replication: syncing from
                                                  fdaa:0:2c1d:a7b:5bd4:0:69da:2
                                                  (127.53µs)\n[✓] connections: 11 used,
                                                  3 reserved, 300 max (3.69ms)"
role passing  f6efc237   cdg    HTTP 5m0s ago     replica
vm   passing  f6efc237   cdg    HTTP 6m39s ago    HTTP GET
                                                  http://172.19.1.106:5500/flycheck/vm:
                                                  200 OK Output: "[✓] checkDisk:
                                                  46.4 GB (94.8%) free space on
                                                  /data/ (39.95µs)\n[✓] checkLoad:
                                                  load averages: 0.00 0.00 0.00
                                                  (78.28µs)\n[✓] memory: system spent
                                                  0s of the last 60s waiting on memory
                                                  (34.1µs)\n[✓] cpu: system spent
                                                  132ms of the last 60s waiting on
                                                  cpu (25.2µs)\n[✓] io: system spent
                                                  168ms of the last 60s waiting on io
                                                  (32.56µs)"
role passing  cb17d424   cdg    HTTP 14s ago      leader
pg   passing  cb17d424   cdg    HTTP 1m23s ago    HTTP GET
                                                  http://172.19.4.10:5500/flycheck/pg:
                                                  200 OK Output: "[✓]
                                                  transactions: read/write
                                                  (279.51µs)\n[✓] replicationLag:
                                                  fdaa:0:2c1d:a7b:5adc:0:69e0:2
                                                  is lagging 942ns (210ns)\n[✓]
                                                  connections: 112 used, 3 reserved,
                                                  300 max (3.45ms)"
vm   critical cb17d424   cdg    HTTP 5m16s ago    HTTP GET
                                                  http://172.19.4.10:5500/flycheck/vm:
                                                  500 Internal Server Error
                                                  Output: "[✓] checkDisk: 45.23 GB
                                                  (92.4%!)(MISSING) free space on
                                                  /data/ (28.02µs)\n[✓] checkLoad:
                                                  load averages: 0.05 0.16 0.40
                                                  (42.56µs)\n[✓] memory: system spent
                                                  0s of the last 60s waiting on memory
                                                  (22.59µs)\n[✗] cpu: system spent
                                                  1.28s of the last 10 seconds waiting
                                                  on cpu (15.94µs)\n[✗] io: system
                                                  spent 1.13s of the last 10 seconds
                                                  waiting on io (14.81µs)"

It seems very strange to have no leader but 2 replicas.

Ok we now have a leader.

App
  Name     = prod-db
  Owner    = ulysse
  Version  = 14
  Status   = running
  Hostname = prod-db.fly.dev

Deployment Status
  ID          = 8262a5d4-aed0-ab37-719c-e18bb51a39e6
  Version     = v14
  Status      = successful
  Description = Deployment completed successfully
  Instances   = 2 desired, 2 placed, 2 healthy, 0 unhealthy

Instances
ID       PROCESS VERSION REGION DESIRED STATUS            HEALTH CHECKS                  RESTARTS CREATED
f6efc237 app     14      cdg    run     running (replica) 3 total, 3 passing             0        9m56s ago
cb17d424 app     14      cdg    run     running (leader)  3 total, 2 passing, 1 critical 0        11m56s ago

The upgrade doesn’t fix the health check we will have to investigate further.

We can probably tune that health check more, but it probably means there’s not enough CPU for what postgres needs to do. If you’re comfortable with the performance, you can ignore that for now.