Machines error blocking deploys: No responders available for request

We’ve hit the dreaded “no responders” error in the dfw region, and it is blocking fly deploy to our Machines fleet:

fly deploy --config fly.machines.toml -a udns --strategy immediate --image registry.fly.io/udns:6b9a2e9f3c043a80edbddf0665877da974052c7e --verbose --auto-confirm
==> Verifying app config
--> Verified app config
==> Building image
Searching for image 'registry.fly.io/udns:6b9a2e9f3c043a80edbddf0665877da974052c7e' remotely...
image found: img_nlo943m9ylzpwxzd
Deploying with immediate strategy ✓
Error failed to get lease on VM 06e82557bd2987: nats: no responders available for request

Usually, force-removing the affected Machine gets us unstuck, but this time the removal fails with the same error:

fly m remove 06e82557bd2987 -f
machine 06e82557bd2987 was found and is currently in started state, attempting to destroy...
Error could not destroy machine 06e82557bd2987: failed to destroy VM 06e82557bd2987: nats: no responders available for request
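
In case the error turns out to be transient, one thing worth trying before escalating is to retry the removal a few times with a pause in between. Below is a minimal sketch of that idea. The `retry` function is a hypothetical helper written for this post, not part of flyctl, and the attempt count and pause are arbitrary. It won't help if whatever answers these requests on the host stays down.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of flyctl): run a command up to
# <attempts> times, sleeping <pause_seconds> between failed attempts.
# Usage: retry <attempts> <pause_seconds> <command> [args...]
retry() {
  local attempts=$1 pause=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    # Stop as soon as the command succeeds.
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    # No point sleeping after the final attempt.
    if ((i < attempts)); then
      sleep "$pause"
    fi
  done
  return 1
}

# Example invocation (assumed values: 5 attempts, 30 seconds apart):
#   retry 5 30 fly m remove 06e82557bd2987 -f
```

The helper only wraps the same fly m remove command shown above, so it changes nothing about what is sent to the platform; it just saves re-running the command by hand.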

Machine status:

fly m status -d 06e82557bd2987
Machine ID: 06e82557bd2987
Instance ID: 01GHNJZ8C3DWNAG10Q5BWFN2PT
State: started

VM
  ID            = 06e82557bd2987                                 
  Instance ID   = 01GHNJZ8C3DWNAG10Q5BWFN2PT                     
  State         = started                                        
  Image         = udns:94b28531a4bfd38ad5f3a23e355b4f917d33894a  
  Name          = udns-dfw                                       
  Private IP    = fdaa:0:35f3:a7b:2203:b916:9f80:2               
  Region        = dfw                                            
  Process Group = app                                            
  Memory        = 256                                            
  CPUs          = 1                                              
  Created       = 2022-09-18T00:44:50Z                           
  Updated       = 2022-11-25T04:14:14Z                           
  Command       =                                                

Event Logs
STATE   	EVENT	SOURCE	TIMESTAMP                    	INFO 
started 	start	flyd  	2022-11-25T09:44:14.757+05:30	
starting	start	flyd  	2022-11-25T09:44:14.474+05:30	
stopped 	exit 	flyd  	2022-11-25T09:44:14.373+05:30	
started 	start	flyd  	2022-11-25T07:51:29.478+05:30	
starting	start	flyd  	2022-11-25T07:51:29.175+05:30	

I doubt this Machine (06e82557bd2987) will recover on its own, so two questions:

  1. How and when can Machines enter this state?
  2. Is there a way to side-step it (fly m remove -f doesn’t work) or to avoid it altogether?

We’ve hit this before with Machines running in vin, and that time it took manual intervention by Fly’s super-operators to resolve.