A zombie machine appears

I run 40+ machines in a single app (udns). Today I noticed that one of them has gone missing without any intervention on my part. I don’t know exactly when it happened, but it is worrying that active machines can disappear like this.

# the zombie machine in question:
24d891dec4d687	udns-yyz 	stopped	   :	1970-01-01T00:00:00Z	0001-01-01T00:00:00Z

This is the second time this has happened. The previous one was in vin, which is not explicitly supported for machines, so that was okay. I thought I’d let Fly engineers know that there’s a latent bug lurking which could have dire consequences, especially for Fly-automated Postgres v2.

cc: @JP_Phillips

The underlying host that machine was on has been decommissioned. We’ll get the machine record updated to reflect this.


Thanks. So could future decommissions cause zombies? Or is the root cause being addressed? I ask because I’d want to factor this in before I begin to move all our prod traffic to Fly.

I don’t know if the issue I am seeing is related, but I can no longer deploy a newer image to any machine (except the ones in maa). I suspect some lease or other is blocking the deploy to udns. If that’s indeed the case, how can I make Fly relinquish those leases?

I can see that the newer machines that I clone in udns are also stuck in created and never transition to started/stopped.

e148e452addd89	udns-yyz3	created	yyz   	udns:deployment-01GE3134FAY3X1AS8F76DZBZ16	fdaa:0:35f3:a7b:88dc:eba1:c3e1:2	      	2022-10-25T15:14:52Z	2022-10-25T15:14:52Z	

Sorry it took a bit to get things cleaned up, but you should no longer see machine 24d891dec4d687. As for machine e148e452addd89, it did eventually start after we resolved an incident with our registry in yyz.


Oh wow, a machine in jnb (73d8d1d7a9d891) that went full zombie (presumably due to some incident or the other) a few days ago has automagically recovered! I needn’t monitor zombies anymore then?
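
For anyone who does still want to watch for this, here is a minimal sketch of a check, not an official Fly tool. It assumes the machine listing is available as tab-separated text shaped like the rows pasted in this thread, and it treats a row as a zombie when any field holds one of the placeholder timestamps seen on 24d891dec4d687 (`1970-01-01T00:00:00Z` or `0001-01-01T00:00:00Z`). The function name `find_zombies` and the choice of signal are my own assumptions; Fly may represent orphaned machines differently.

```python
# Placeholder timestamps observed on the zombie row in this thread.
# Assumption: a real machine never reports either of these values.
PLACEHOLDER_TIMESTAMPS = {"1970-01-01T00:00:00Z", "0001-01-01T00:00:00Z"}


def find_zombies(listing: str) -> list[str]:
    """Return the IDs of rows that carry a placeholder timestamp.

    `listing` is tab-separated text, one machine per line, with the
    machine ID in the first column (hypothetical input format, modelled
    on the rows pasted above).
    """
    zombies = []
    for line in listing.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        if not fields[0]:
            continue  # skip blank lines
        if any(field in PLACEHOLDER_TIMESTAMPS for field in fields):
            zombies.append(fields[0])
    return zombies


if __name__ == "__main__":
    sample = (
        "24d891dec4d687\tudns-yyz \tstopped\t   :\t"
        "1970-01-01T00:00:00Z\t0001-01-01T00:00:00Z\n"
        "e148e452addd89\tudns-yyz3\tcreated\tyyz   \t"
        "2022-10-25T15:14:52Z\t2022-10-25T15:14:52Z\t\n"
    )
    print(find_zombies(sample))
```

Matching on the timestamp values rather than on a fixed column position is deliberate: the two rows in this thread do not have the same number of columns, so a positional check would be brittle.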