Postgres repeatedly going down despite little load

I’m repeatedly seeing my postgres instance go down, but I can’t see any reason for it. It’s a development configuration (single node), and I appreciate I could change to a highly available configuration to prevent downtime, but I’m still worried about why it’s going down at all.

I have almost zero load and am using less than 20% of the volume’s capacity.

The last logs from the postgres instance are these…

2023-05-12T14:33:26Z app[4d89169da43038] lhr [info]monitor  | Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2023-05-12T14:33:29Z app[4d89169da43038] lhr [info]postgres | 2023-05-12 14:33:29.425 UTC [570] LOG:  checkpoint starting: time
2023-05-12T14:33:29Z app[4d89169da43038] lhr [info]postgres | 2023-05-12 14:33:29.930 UTC [570] LOG:  checkpoint complete: wrote 6 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.502 s, sync=0.001 s, total=0.505 s; sync files=6, longest=0.001 s, average=0.001 s; distance=7 kB, estimate=157 kB
2023-05-12T14:33:35Z app[4d89169da43038] lhr [info]repmgrd  | [2023-05-12 14:33:35] [INFO] monitoring primary node "fdaa:2:5df:a7b:13d:673f:af84:2" (ID: 776753358) in normal state
2023-05-12T14:38:26Z app[4d89169da43038] lhr [info]monitor  | Voting member(s): 1, Active: 1, Inactive: 0, Conflicts: 0
2023-05-12T14:38:30Z app[4d89169da43038] lhr [info]postgres | 2023-05-12 14:38:30.004 UTC [570] LOG:  checkpoint starting: time
2023-05-12T14:38:30Z app[4d89169da43038] lhr [info]postgres | 2023-05-12 14:38:30.106 UTC [570] LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.101 s, sync=0.001 s, total=0.103 s; sync files=1, longest=0.001 s, average=0.001 s; distance=2 kB, estimate=141 kB
2023-05-12T14:38:36Z app[4d89169da43038] lhr [info]repmgrd  | [2023-05-12 14:38:36] [INFO] monitoring primary node "fdaa:2:5df:a7b:13d:673f:af84:2" (ID: 776753358) in normal state
2023-05-12T14:43:26Z app[4d89169da43038] lhr [info]Current connection count is 1
2023-05-12T14:43:27Z app[4d89169da43038] lhr [info]Starting clean up.
2023-05-12T14:43:27Z app[4d89169da43038] lhr [info]Umounting /dev/vdb from /data
2023-05-12T14:43:27Z app[4d89169da43038] lhr [info]error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-05-12T14:43:28Z app[4d89169da43038] lhr [info]error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-05-12T14:43:28Z app[4d89169da43038] lhr [info]error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-05-12T14:43:29Z app[4d89169da43038] lhr [info]error umounting /data: EBUSY: Device or resource busy, retrying in a bit
2023-05-12T14:43:31Z app[4d89169da43038] lhr [info][ 3605.187059] reboot: Restarting system

Looking at the checks just before and after it goes down, I don’t see any sign of hitting any load or capacity limits:

% fly checks list --config db.toml
Health Checks for blackwell-routines-service-staging-data
  NAME | STATUS  | MACHINE        | LAST UPDATED | OUTPUT                                                                       
-------*---------*----------------*--------------*------------------------------------------------------------------------------
  pg   | passing | 4d89169da43038 | 51m39s ago   | [✓] connections: 10 used, 3 reserved, 300 max (3.51ms)                       
       |         |                |              | [✓] cluster-locks: No active locks detected (5.86µs)                         
       |         |                |              | [✓] disk-capacity: 14.1% - readonly mode will be enabled at 90.0% (14.95µs)  
-------*---------*----------------*--------------*------------------------------------------------------------------------------
  role | passing | 4d89169da43038 | 51m28s ago   | primary                                                                      
-------*---------*----------------*--------------*------------------------------------------------------------------------------
  vm   | passing | 4d89169da43038 | 51m32s ago   | [✓] checkDisk: 848.37 MB (85.9%) free space on /data/ (62µs)                 
       |         |                |              | [✓] checkLoad: load averages: 0.00 0.00 0.00 (64.27µs)                       
       |         |                |              | [✓] memory: system spent 0s of the last 60s waiting on memory (26.68µs)      
       |         |                |              | [✓] cpu: system spent 210ms of the last 60s waiting on cpu (23.36µs)         
       |         |                |              | [✓] io: system spent 492ms of the last 60s waiting on io (31.04µs)           
-------*---------*----------------*--------------*------------------------------------------------------------------------------
% fly checks list --config db.toml
Health Checks for blackwell-routines-service-staging-data
  NAME | STATUS  | MACHINE        | LAST UPDATED | OUTPUT                      
-------*---------*----------------*--------------*-----------------------------
  pg   | warning | 4d89169da43038 | 6m42s ago    | the machine hasn't started  
-------*---------*----------------*--------------*-----------------------------
  role | warning | 4d89169da43038 | 6m42s ago    | the machine hasn't started  
-------*---------*----------------*--------------*-----------------------------
  vm   | warning | 4d89169da43038 | 6m42s ago    | the machine hasn't started  
-------*---------*----------------*--------------*-----------------------------

The instance doesn’t ever restart by itself, but running fly machine restart resolves the issue for a short period of time before it goes down again.
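
For reference, the restart is just a plain fly machine restart against that node, along these lines (machine ID and app name taken from the checks output above):

% fly machine restart 4d89169da43038 --app blackwell-routines-service-staging-data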

It seems this is probably the same problem as here: Postgres db going down


If you’re using a shared-cpu-1x single-node cluster, the app automatically shuts down if there have been no open connections in the last hour, and it usually starts up again when something tries to connect to it.

It’s possible that a setting on the machine running the postgres app is disabling autostart. Another user ran into a similar issue and got the postgres app starting up on its own by running a machine update with the autostart flag set to true → fly machines update <machine_id> --autostart=true --app <app_name>
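
For example, something along these lines, where <app_name> is your postgres app and the machine ID comes from the list output (both are placeholders here):

% fly machines list --app <app_name>
% fly machines update <machine_id> --autostart=true --app <app_name>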

Alternatively, you could try this fix and override the FLY_SCALE_TO_ZERO= secret so the app never scales to zero.
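
If you go that route, a rough sketch of the commands (again <app_name> is a placeholder; I haven’t verified whether the image wants the secret set to an empty value or removed entirely, so check which applies):

% fly secrets list --app <app_name>
% fly secrets set FLY_SCALE_TO_ZERO= --app <app_name>    # or: fly secrets unset FLY_SCALE_TO_ZERO --app <app_name>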

Thanks a lot!

I’ve decided to opt for a high-availability cluster instead, as having a stable environment is important to us even though it’s only a development environment.
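
For anyone finding this later, I believe the option that matters when creating the new cluster is the initial cluster size (flag name from memory, so double-check fly postgres create --help):

% fly postgres create --initial-cluster-size 3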

I’ll keep this bookmarked, though, in case I set up a new environment, so I can see whether this helps 🙌

