We’ve hit the dreaded “no responders” error in the DFW region that prevents fly deploys to our Machines fleet.
fly deploy --config fly.machines.toml -a udns --strategy immediate --image registry.fly.io/udns:6b9a2e9f3c043a80edbddf0665877da974052c7e --verbose --auto-confirm
==> Verifying app config
--> Verified app config
==> Building image
Searching for image 'registry.fly.io/udns:6b9a2e9f3c043a80edbddf0665877da974052c7e' remotely...
image found: img_nlo943m9ylzpwxzd
Deploying with immediate strategy ✓
Error failed to get lease on VM 06e82557bd2987: nats: no responders available for request
Usually, force removing a Machine works, but in this case it doesn’t:
fly m remove 06e82557bd2987 -f
machine 06e82557bd2987 was found and is currently in started state, attempting to destroy...
Error could not destroy machine 06e82557bd2987: failed to destroy VM 06e82557bd2987: nats: no responders available for request
Machine status:
fly m status -d 06e82557bd2987
Machine ID: 06e82557bd2987
Instance ID: 01GHNJZ8C3DWNAG10Q5BWFN2PT
State: started
VM
ID = 06e82557bd2987
Instance ID = 01GHNJZ8C3DWNAG10Q5BWFN2PT
State = started
Image = udns:94b28531a4bfd38ad5f3a23e355b4f917d33894a
Name = udns-dfw
Private IP = fdaa:0:35f3:a7b:2203:b916:9f80:2
Region = dfw
Process Group = app
Memory = 256
CPUs = 1
Created = 2022-09-18T00:44:50Z
Updated = 2022-11-25T04:14:14Z
Command =
Event Logs
STATE EVENT SOURCE TIMESTAMP INFO
started start flyd 2022-11-25T09:44:14.757+05:30
starting start flyd 2022-11-25T09:44:14.474+05:30
stopped exit flyd 2022-11-25T09:44:14.373+05:30
started start flyd 2022-11-25T07:51:29.478+05:30
starting start flyd 2022-11-25T07:51:29.175+05:30
I doubt this Machine (06e82557bd2987) will recover on its own, so two questions:
- How and when can Machines enter this state?
- Is there a way to side-step it (fly m remove -f doesn’t work) or to avoid it altogether?
We’ve hit this before for Machines running in vin, and it required manual intervention by Fly’s super-operators.
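In the meantime, here is a rough sketch of how we might flag this specific failure in our deploy pipeline, so that it is reported as "needs operator intervention" rather than as a generic deploy error. The function name `stuck_machines` and the regular expression are our own assumptions, derived only from the two error lines quoted above and not from any documented flyctl error format.

```python
import re
from typing import List

# Assumption: the pattern mirrors the two error lines quoted in this post.
# It is not based on a documented flyctl output format and may need adjusting.
NO_RESPONDERS = re.compile(
    r"VM (?P<machine_id>[0-9a-f]+): nats: no responders available for request"
)


def stuck_machines(output: str) -> List[str]:
    """Return the unique Machine IDs named in 'no responders' errors, in order of appearance."""
    seen: List[str] = []
    for match in NO_RESPONDERS.finditer(output):
        machine_id = match.group("machine_id")
        if machine_id not in seen:
            seen.append(machine_id)
    return seen
```

Feeding it the captured output of the deploy or remove command would return the IDs of the affected Machines (an empty list if the error did not occur), which can then go into an alert or a support request.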