Early look: PostgreSQL on Fly. We want your opinions.

I believe we had a failure earlier today, and I’m wondering whether there’s anything to be concerned about, especially given our recent experience (Database reset, 2 days of data lost - #8 by kurt).

$ fly status -a production-db --all
App
  Name     = production-db          
  Owner    = enaia                  
  Version  = 11                     
  Status   = running                
  Hostname = production-db.fly.dev  

Instances
ID       PROCESS VERSION REGION DESIRED STATUS            HEALTH CHECKS                  RESTARTS CREATED              
f74e8482 app     11 ⇡    iad    run     running (replica) 3 total, 3 passing             0        4h24m ago            
e12ac95b app     11 ⇡    iad    stop    failed            3 total, 2 passing, 1 critical 0        2021-12-14T23:06:17Z 
3398f650 app     11 ⇡    iad    run     running (leader)  3 total, 3 passing             0        2021-12-14T23:05:13Z 
f88b0035 app     10      iad    stop    failed            3 total                        0        2021-12-14T22:58:45Z 

$ fly vm status f88b0035 -a production-db
Instance
  ID            = f88b0035              
  Process       =                       
  Version       = 10                    
  Region        = iad                   
  Desired       = stop                  
  Status        = failed                
  Health Checks = 3 total               
  Restarts      = 0                     
  Created       = 2021-12-14T22:58:45Z  

Recent Events
TIMESTAMP            TYPE            MESSAGE                                           
2021-12-14T22:58:36Z Received        Task received by client                           
2021-12-14T22:58:54Z Task Setup      Building Task Directory                           
2021-12-14T22:59:03Z Started         Task started by client                            
2021-12-14T22:59:05Z Terminated      Exit Code: 2                                      
2021-12-14T22:59:05Z Not Restarting  Policy allows no restarts                         
2021-12-14T22:59:05Z Alloc Unhealthy Unhealthy because of failed task                  
2021-12-14T22:59:06Z Killing         Sent interrupt. Waiting 5m0s before force killing 

Checks
ID   SERVICE STATE   OUTPUT 
pg   app     warning        
role app     warning        
vm   app     warning        

Recent Logs

$ fly vm status e12ac95b -a production-db
Instance
  ID            = e12ac95b                        
  Process       =                                 
  Version       = 11                              
  Region        = iad                             
  Desired       = stop                            
  Status        = failed                          
  Health Checks = 3 total, 2 passing, 1 critical  
  Restarts      = 0                               
  Created       = 2021-12-14T23:06:17Z            

Recent Events
TIMESTAMP            TYPE             MESSAGE                                           
2021-12-14T23:06:12Z Received         Task received by client                           
2021-12-14T23:06:30Z Task Setup       Building Task Directory                           
2021-12-14T23:06:40Z Started          Task started by client                            
2021-12-16T17:33:52Z Restart Signaled healthcheck: check "vm" unhealthy                 
2021-12-16T17:33:56Z Terminated       Exit Code: 0                                      
2021-12-16T17:33:56Z Not Restarting   Policy allows no restarts                         
2021-12-16T17:33:56Z Killing          Sent interrupt. Waiting 5m0s before force killing 

Checks
ID   SERVICE STATE    OUTPUT
pg   app     passing  HTTP GET http://172.19.0.66:5500/flycheck/pg: 200 OK Output: "[✓] transactions: read/write (3.73ms)\n[✓] replicationLag: fdaa:0:309a:a7b:ab9:0:30e5:2 is lagging 0s (100ns)\n[✓] connections: 29 used, 3 reserved, 300 max (8.86ms)"
vm   app     critical HTTP GET http://172.19.0.66:5500/flycheck/vm: 500 Internal Server Error Output: "[✓] checkDisk: 9.09 GB (92.9%!)(MISSING) free space on /data/ (977.26µs)\n[✓] checkLoad: load averages: 0.14 0.22 0.25 (412.55µs)\n[✗] memory: system spent 1.03s of the last 10 seconds waiting on memory (54.88µs)\n[✗] cpu: system spent 1.09s of the last 10 seconds waiting on cpu (16.35µs)\n[✓] io: system spent 3.95s of the last 60s waiting on io (14.81µs)"
role app     passing  leader

Recent Logs