database cluster unstable

We’ve restarted our database app a couple of times. It works for a couple of minutes but then fails health checks. Here are the results of flyctl status:

App
  Name     = database          
  Owner    = paypack           
  Version  = 4                 
  Status   = running           
  Hostname = database.fly.dev  

Instances
ID       VERSION REGION DESIRED STATUS                 HEALTH CHECKS       RESTARTS CREATED              
8ff3cc74 4       lhr    run     running (context dead) 3 total, 3 critical 3        2021-08-12T08:12:37Z 
b487a403 4       lhr    run     running (context dead) 3 total, 3 critical 4        2021-08-12T08:06:32Z 

flyctl vm status b487a403 --app database
Instance
  ID            = b487a403                
  Version       = 4                       
  Region        = lhr                     
  Desired       = run                     
  Status        = running (context dead)  
  Health Checks = 3 total, 3 critical     
  Restarts      = 4                       
  Created       = 2021-08-12T08:06:32Z    

Recent Events
TIMESTAMP            TYPE             MESSAGE                 
2021-08-12T08:06:29Z Received         Task received by client 
2021-08-12T08:11:48Z Task Setup       Building Task Directory 
2021-08-12T08:11:51Z Started          Task started by client  
2021-08-12T13:37:24Z Restart Signaled User requested restart  
2021-08-12T13:37:29Z Terminated       Exit Code: 0            
2021-08-12T13:37:29Z Restarting       Task restarting in 0s   
2021-08-12T13:37:32Z Started          Task started by client  
2021-08-12T15:08:18Z Restart Signaled User requested restart  
2021-08-12T15:09:20Z Terminated       Exit Code: 0            
2021-08-12T15:09:21Z Restarting       Task restarting in 0s   
2021-08-12T15:09:22Z Started          Task started by client  
2021-08-16T14:58:36Z Restart Signaled User requested restart  
2021-08-16T14:58:38Z Terminated       Exit Code: 0            
2021-08-16T14:58:38Z Restarting       Task restarting in 0s   
2021-08-16T14:58:40Z Started          Task started by client  
2021-08-16T15:38:39Z Restart Signaled User requested restart  
2021-08-16T15:43:42Z Terminated       Exit Code: 0            
2021-08-16T15:43:42Z Restarting       Task restarting in 0s   
2021-08-16T15:43:44Z Started          Task started by client  

Checks
ID   SERVICE STATE    OUTPUT                                                      
role app     critical context deadline exceeded                                   
pg   app     critical context deadline exceeded                                   
vm   app     critical [✗] system spent 8.6 of the last 10 seconds waiting for cpu 
                      [✓] 9 GB (92.0%) free space on /data/                       
                      [✓] load averages: 0.19 0.40 0.55                           
                      [✓] memory: 0.0s waiting over the last 60s                  
                      [✓] io: 0.0s waiting over the last 60s

Do you remember what the fly status output showed before you ran restart? In general, fly restart is not a good command to run against a Postgres cluster, since it restarts all the VMs at the same time. If one is misbehaving, fly vm stop <id> is the better bet.
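
For anyone who wants to pick out the misbehaving instances before stopping them one at a time, here is a rough sketch. It only parses status text in the layout pasted earlier in this thread and prints the stop command for each instance that reports a critical check; it does not run anything. The function name and regexes are illustrative assumptions, not part of flyctl.

```python
import re

# Assumption: instance rows look like the ones pasted in this thread,
# i.e. an 8-character hex ID, a version number, then a 3-letter region.
ROW = re.compile(r"^(?P<id>[0-9a-f]{8})\s+\d+\s+(?P<region>[a-z]{3})\s+")
CRITICAL = re.compile(r"(\d+) critical")


def critical_instances(status_text):
    """Return the IDs of instances that report at least one critical check."""
    ids = []
    for line in status_text.splitlines():
        row = ROW.match(line.strip())
        crit = CRITICAL.search(line)
        if row and crit and int(crit.group(1)) > 0:
            ids.append(row.group("id"))
    return ids


SAMPLE = """\
ID       VERSION REGION DESIRED STATUS                 HEALTH CHECKS       RESTARTS CREATED
8ff3cc74 4       lhr    run     running (context dead) 3 total, 3 critical 3        2021-08-12T08:12:37Z
b487a403 4       lhr    run     running (context dead) 3 total, 3 critical 4        2021-08-12T08:06:32Z
"""

# Print one stop command per unhealthy instance. Run them one at a time
# and wait for the instance to come back before moving to the next.
for vm_id in critical_instances(SAMPLE):
    print(f"fly vm stop {vm_id}")
```

The point of printing rather than executing is that each stop stays a deliberate, separate step, which avoids taking every VM down at once.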

I also noticed there are 4 volumes on your db app, but only two VMs running. This makes placement unpredictable, since a VM can launch on any of the volumes. For a Postgres cluster you should run the same VM count as you have volumes.

If you’d like to remove volumes, run fly volumes delete <id> on the ones you want destroyed, then run fly scale count <num> to match the new number of volumes.
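
As a small sketch of that bookkeeping, the snippet below takes the full list of volume IDs and the ones you plan to destroy, then prints the commands in order with a scale count that matches what is left. The volume IDs are made-up placeholders and the helper name is hypothetical; it only builds the command strings and does not call flyctl.

```python
def cleanup_plan(all_volume_ids, delete_ids):
    """Build the delete commands plus a scale count matching the remaining volumes."""
    # Drop duplicate IDs while keeping the order they were given in.
    delete_ids = list(dict.fromkeys(delete_ids))
    unknown = set(delete_ids) - set(all_volume_ids)
    if unknown:
        raise ValueError(f"not volumes of this app: {sorted(unknown)}")
    remaining = len(all_volume_ids) - len(delete_ids)
    commands = [f"fly volumes delete {vol_id}" for vol_id in delete_ids]
    commands.append(f"fly scale count {remaining}")
    return commands


# Placeholder IDs for illustration only; use the IDs shown by fly volumes list.
volumes = ["vol_a", "vol_b", "vol_c", "vol_d"]
for command in cleanup_plan(volumes, ["vol_c", "vol_d"]):
    print(command)
```

With four volumes and two marked for deletion, the final command scales the app to 2, which matches the two VMs currently running.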

Can we delete any vm without any issue?

I would delete replicas first. If you run fly volumes list you can see which volume is being used by which VM.

It looks like it’s safe to delete the newest volumes on that DB.