I deploy 40+ machines in a single app (udns). Today, I noticed that one of them has gone missing (without my intervention). I don’t know exactly when, but it is worrying that machines that are active may go missing like this.
# the zombie machine in question:
24d891dec4d687 udns-yyz stopped : 1970-01-01T00:00:00Z 0001-01-01T00:00:00Z
This is the second time this has happened (previously it was in vin, which is not explicitly supported for machines, so that was okay). I thought I’d let Fly engs know that there’s a latent bug lurking that could have dire consequences, esp. for Fly-automated Postgres v2.
Thanks. So future decommissions could cause zombies? Or is the root cause being addressed? I ask because I’d want to factor this in before I begin to move all our prod traffic to Fly.
I don’t know if the issue I am seeing is related, but I can’t deploy a newer image to any machine anymore (except the ones in maa). I suspect some lease or other is blocking the deploy to udns. If that’s indeed the case, how can I make Fly relinquish those leases?
Sorry it took a bit to get things cleaned up, but you should no longer see machine 24d891dec4d687. As for machine e148e452addd89, it did eventually start after we resolved an issue with our registry in yyz (incident).
Oh wow, a machine in jnb (73d8d1d7a9d891) that went full zombie (presumably due to some incident or the other) a few days ago has automagically recovered! I needn’t monitor zombies anymore then?
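In case anyone still wants to keep an eye out for these, here is a minimal sketch of how one might flag zombie rows in a saved machine listing. It assumes rows look like the one quoted at the top of the thread (machine ID first, then whitespace-separated fields) and that a zombie shows only placeholder timestamps: the Unix epoch and Go's zero time. The function name `find_zombies` and the second sample row are hypothetical, and the heuristic itself is a guess based on that single row, not a documented Fly behaviour.

```python
# Timestamps that look unset: the Unix epoch and Go's zero time.
PLACEHOLDER_TIMESTAMPS = {"1970-01-01T00:00:00Z", "0001-01-01T00:00:00Z"}


def find_zombies(listing: str) -> list[str]:
    """Return IDs of rows whose timestamps are all placeholders (assumed zombie marker)."""
    zombies = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and comment lines such as "# the zombie machine ...".
        if not fields or fields[0].startswith("#"):
            continue
        # Treat any field shaped like 2006-01-02T15:04:05Z as a timestamp.
        stamps = [f for f in fields if "T" in f and f.endswith("Z")]
        if stamps and all(s in PLACEHOLDER_TIMESTAMPS for s in stamps):
            zombies.append(fields[0])
    return zombies


if __name__ == "__main__":
    sample = "\n".join([
        "# the zombie machine in question:",
        "24d891dec4d687 udns-yyz stopped : 1970-01-01T00:00:00Z 0001-01-01T00:00:00Z",
        # Hypothetical healthy row, made up for illustration.
        "0123456789abcd udns-maa started : 2023-01-01T10:00:00Z 2023-01-02T10:00:00Z",
    ])
    print(find_zombies(sample))
```

Feeding it the listing on a schedule and alerting on a non-empty result would at least surface the problem early, though it cannot tell a genuine zombie from a machine that is merely mid-recovery.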