One region is 3 versions behind

I’m not sure how this happened, but my deploys now all fail healthchecks (so they don’t deploy) and one region is three versions behind the rest :grimacing:

App
  Name     = withered-frost-3196          
  Owner    = personal                     
  Version  = 38                           
  Status   = running                      
  Hostname = withered-frost-3196.fly.dev  

Deployment Status
  ID          = a1b97031-f670-caf5-e733-53b6c7e9ddc3                                                                                   
  Version     = v38                                                                                                                    
  Status      = failed                                                                                                                 
  Description = Failed due to unhealthy allocations - not rolling back to stable job version 38 as current job has same specification  
  Instances   = 6 desired, 5 placed, 4 healthy, 1 unhealthy                                                                            

Instances
ID       PROCESS VERSION REGION DESIRED STATUS  HEALTH CHECKS                  RESTARTS CREATED              
baf13a93 app     38 ⇡    syd    run     running 2 total, 1 passing, 1 critical 0        7m51s ago            
cf02ad82 app     38 ⇡    ams    run     running 2 total, 2 passing             0        8m40s ago            
5a2b8809 app     38 ⇡    dfw    run     running 2 total, 2 passing             0        9m27s ago            
3e6262cd app     38 ⇡    scl    run     running 2 total, 2 passing             0        10m23s ago           
340b3f41 app     38 ⇡    maa    run     running 2 total, 2 passing             0        11m38s ago           
86a2e6ae app     35      hkg    run     running 2 total, 2 passing             1        2021-12-04T18:42:02Z 

Any ideas?

This is likely due to a deploy that failed midway through without rolling back.

Will you try running fly vm stop 86a2e6ae? That should stop the old one and let a new one replace it.

Do you know offhand which healthchecks are failing?

I’ll try that.

I don’t. They’re working in production and they’re working locally. I don’t think the change that was made would impact the healthcheck.

Try running fly vm status baf13a93. That should show you exactly what the healthcheck output is. Is it the Sydney VM that hangs every time?

Got:

Instance
  ID            = baf13a93                        
  Process       =                                 
  Version       = 38                              
  Region        = syd                             
  Desired       = run                             
  Status        = running                         
  Health Checks = 2 total, 1 passing, 1 critical  
  Restarts      = 6                               
  Created       = 7h37m ago                       

Recent Events
TIMESTAMP            TYPE            MESSAGE                                                  
2021-12-08T15:33:57Z Received        Task received by client                                  
2021-12-08T15:34:23Z Task Setup      Building Task Directory                                  
2021-12-08T15:34:37Z Started         Task started by client                                   
2021-12-08T15:39:23Z Alloc Unhealthy Task not running for min_healthy_time of 10s by deadline 
2021-12-08T17:17:09Z Terminated      Exit Code: 1                                             
2021-12-08T17:17:09Z Restarting      Task restarting in 1.087801348s                          
2021-12-08T17:17:18Z Started         Task started by client                                   
2021-12-08T18:57:30Z Terminated      Exit Code: 1                                             
2021-12-08T18:57:30Z Restarting      Task restarting in 1.144576488s                          
2021-12-08T18:57:39Z Started         Task started by client                                   
2021-12-08T19:58:15Z Terminated      Exit Code: 1                                             
2021-12-08T19:58:15Z Restarting      Task restarting in 1.17965937s                           
2021-12-08T19:58:24Z Started         Task started by client                                   
2021-12-08T20:59:02Z Terminated      Exit Code: 1                                             
2021-12-08T20:59:02Z Restarting      Task restarting in 1.142763858s                          
2021-12-08T20:59:11Z Started         Task started by client                                   
2021-12-08T21:59:47Z Terminated      Exit Code: 1                                             
2021-12-08T21:59:47Z Restarting      Task restarting in 1.223401259s                          
2021-12-08T21:59:54Z Started         Task started by client                                   
2021-12-08T23:00:30Z Terminated      Exit Code: 1                                             
2021-12-08T23:00:30Z Restarting      Task restarting in 1.13937512s                           
2021-12-08T23:00:38Z Started         Task started by client                                   

Checks
ID                               SERVICE  STATE    OUTPUT                                                                                                                
d0378389af8ff348b2cfa2beaa464fc7 tcp-8080 passing  TCP connect 172.19.2.42:8080: Success                                                                                 
03833b6def760b24d9962af66e7ec077 tcp-8080 critical Get "http://172.19.2.42:8080/healthcheck": context deadline exceeded (Client.Timeout exceeded while awaiting headers) 

Recent Logs

It looks like the healthcheck URL is timing out for some reason. Does that endpoint connect to any databases? Do you see anything obvious in fly logs -i baf13a93?

Yeah, that’s hitting redis, postgres, and the app itself: healthcheck.tsx at commit 31dacdb37865f874648c9f6ed131c079d7cc5a63 in kentcdodds/kentcdodds.com on GitHub

2021-12-08T23:14:21.341 app[baf13a93] syd [info] REDIS replicaClient (syd.kcd-redis.internal:6379) ERROR: Error: getaddrinfo ENOTFOUND syd.kcd-redis.internal
2021-12-08T23:14:21.341 app[baf13a93] syd [info]     at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:71:26) {
2021-12-08T23:14:21.341 app[baf13a93] syd [info]   errno: -3008,
2021-12-08T23:14:21.341 app[baf13a93] syd [info]   code: 'ENOTFOUND',
2021-12-08T23:14:21.341 app[baf13a93] syd [info]   syscall: 'getaddrinfo',
2021-12-08T23:14:21.341 app[baf13a93] syd [info]   hostname: 'syd.kcd-redis.internal'
2021-12-08T23:14:21.341 app[baf13a93] syd [info] }
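
Since the healthcheck touches several dependencies and the check only reports a generic timeout, one option is to give each dependency its own short deadline so the response can say which one failed. This is only a rough sketch, not the code in the linked healthcheck.tsx: withTimeout, runChecks, and the check names are hypothetical, and the real Redis/Postgres calls would need to be passed in as the check functions.

```typescript
// Hypothetical helper: reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

type CheckResult = { name: string; ok: boolean; error?: string };

// Hypothetical runner: executes every named check concurrently, each with its
// own deadline, and reports per-dependency results instead of one opaque failure.
async function runChecks(
  checks: Record<string, () => Promise<unknown>>,
  timeoutMs: number,
): Promise<CheckResult[]> {
  return Promise.all(
    Object.entries(checks).map(async ([name, check]): Promise<CheckResult> => {
      try {
        // Promise.resolve().then(check) also captures synchronous throws.
        await withTimeout(Promise.resolve().then(check), timeoutMs, name);
        return { name, ok: true };
      } catch (err) {
        return {
          name,
          ok: false,
          error: err instanceof Error ? err.message : String(err),
        };
      }
    }),
  );
}
```

A healthcheck route could call runChecks with one entry per dependency and return a 500 that lists the failing names. With the ENOTFOUND error in the logs above, a result like that would presumably point at the Redis replica rather than surfacing only as "context deadline exceeded".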