My postgres instance keeps going down, but I can’t see any reason for it. It’s a development configuration (single node), which I appreciate I could change to a highly available configuration to prevent downtime, but I’m still worried about why it’s going down at all.
I have almost zero load and am using less than 20% of the volume capacity.
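(For what it’s worth, I understand that adding a replica for HA would be something along the lines of cloning the existing machine, roughly as sketched below, but I’d still like to understand the root cause first.)
% fly machine clone 4d89169da43038 --region lhr --app blackwell-routines-service-staging-data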
The last logs from the postgres instance are these…
2023-05-12T14:33:26Z app[4d89169da43038] lhr [info]monitor | Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2023-05-12T14:33:29Z app[4d89169da43038] lhr [info]postgres | 2023-05-12 14:33:29.425 UTC [570] LOG: checkpoint starting: time
2023-05-12T14:33:29Z app[4d89169da43038] lhr [info]postgres | 2023-05-12 14:33:29.930 UTC [570] LOG: checkpoint complete: wrote 6 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.502 s, sync=0.001 s, total=0.505 s; sync files=6, longest=0.001 s, average=0.001 s; distance=7 kB, estimate=157 kB
2023-05-12T14:33:35Z app[4d89169da43038] lhr [info]repmgrd | [2023-05-12 14:33:35] [INFO] monitoring primary node "fdaa:2:5df:a7b:13d:673f:af84:2" (ID: 776753358) in normal state
2023-05-12T14:38:26Z app[4d89169da43038] lhr [info]monitor | Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2023-05-12T14:38:30Z app[4d89169da43038] lhr [info]postgres | 2023-05-12 14:38:30.004 UTC [570] LOG: checkpoint starting: time
2023-05-12T14:38:30Z app[4d89169da43038] lhr [info]postgres | 2023-05-12 14:38:30.106 UTC [570] LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.101 s, sync=0.001 s, total=0.103 s; sync files=1, longest=0.001 s, average=0.001 s; distance=2 kB, estimate=141 kB
2023-05-12T14:38:36Z app[4d89169da43038] lhr [info]repmgrd | [2023-05-12 14:38:36] [INFO] monitoring primary node "fdaa:2:5df:a7b:13d:673f:af84:2" (ID: 776753358) in normal state
2023-05-12T14:43:26Z app[4d89169da43038] lhr [info]Current connection count is 1
2023-05-12T14:43:27Z app[4d89169da43038] lhr [info]Starting clean up.
2023-05-12T14:43:27Z app[4d89169da43038] lhr [info]Umounting /dev/vdb from /data
2023-05-12T14:43:27Z app[4d89169da43038] lhr [info]error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-05-12T14:43:28Z app[4d89169da43038] lhr [info]error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-05-12T14:43:28Z app[4d89169da43038] lhr [info]error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-05-12T14:43:29Z app[4d89169da43038] lhr [info]error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-05-12T14:43:31Z app[4d89169da43038] lhr [info][ 3605.187059] reboot: Restarting system
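To me the “Starting clean up” / unmount sequence reads like the machine being stopped deliberately rather than postgres itself crashing. I’m assuming the stop reason would show up in the machine’s event history, i.e. something like the following, if that’s the right place to look:
% fly machine status 4d89169da43038 --app blackwell-routines-service-staging-data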
Looking at the checks just before and after it goes down, I don’t see any sign of hitting any capacity limits:
% fly checks list --config db.toml
Health Checks for blackwell-routines-service-staging-data
NAME | STATUS | MACHINE | LAST UPDATED | OUTPUT
-------*---------*----------------*--------------*------------------------------------------------------------------------------
pg | passing | 4d89169da43038 | 51m39s ago | [✓] connections: 10 used, 3 reserved, 300 max (3.51ms)
| | | | [✓] cluster-locks: No active locks detected (5.86µs)
| | | | [✓] disk-capacity: 14.1% - readonly mode will be enabled at 90.0% (14.95µs)
-------*---------*----------------*--------------*------------------------------------------------------------------------------
role | passing | 4d89169da43038 | 51m28s ago | primary
-------*---------*----------------*--------------*------------------------------------------------------------------------------
vm | passing | 4d89169da43038 | 51m32s ago | [✓] checkDisk: 848.37 MB (85.9%) free space on /data/ (62µs)
| | | | [✓] checkLoad: load averages: 0.00 0.00 0.00 (64.27µs)
| | | | [✓] memory: system spent 0s of the last 60s waiting on memory (26.68µs)
| | | | [✓] cpu: system spent 210ms of the last 60s waiting on cpu (23.36µs)
| | | | [✓] io: system spent 492ms of the last 60s waiting on io (31.04µs)
-------*---------*----------------*--------------*------------------------------------------------------------------------------
% fly checks list --config db.toml
Health Checks for blackwell-routines-service-staging-data
NAME | STATUS | MACHINE | LAST UPDATED | OUTPUT
-------*---------*----------------*--------------*-----------------------------
pg | warning | 4d89169da43038 | 6m42s ago | the machine hasn't started
-------*---------*----------------*--------------*-----------------------------
role | warning | 4d89169da43038 | 6m42s ago | the machine hasn't started
-------*---------*----------------*--------------*-----------------------------
vm | warning | 4d89169da43038 | 6m42s ago | the machine hasn't started
-------*---------*----------------*--------------*-----------------------------
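When the checks are in this state, the machine itself appears to be simply stopped, which fly machine list should confirm:
% fly machine list --app blackwell-routines-service-staging-data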
The instance never restarts by itself, but running fly machine restart resolves the issue for a short period before it goes down again.
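For reference, the exact workaround I run is:
% fly machine restart 4d89169da43038 --app blackwell-routines-service-staging-data
I did wonder about the restart policy as a stopgap (I believe fly machine update accepts a --restart flag, e.g. --restart always), but that still wouldn’t explain why the machine is being stopped in the first place.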
This seems to be the same problem as here: Postgress db going down