I have a zombie machine and can't kill it

Hello,
I just noticed that my “worker” machine went into a “zombie” state 3 days ago, and I wasn’t aware of it.
When I run fly machine list I see the machine with ID e784362a436998 in started state, with no image and with creation date 1970-01-01T00:00:00Z.
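
For anyone who wants to spot this symptom automatically, here is a minimal sketch in Python. It assumes the machine list has already been loaded into a list of dicts; the keys "id", "image" and "created", the function name, and the healthy sample record are all hypothetical and chosen for illustration, not taken from flyctl's actual output format. It flags any record that has a placeholder creation date and no image, which matches the listing described above.

```python
# Placeholder timestamps that appear in the listings in this thread
# when a machine's metadata could not be loaded.
PLACEHOLDER_TIMESTAMPS = {"1970-01-01T00:00:00Z", "0001-01-01T00:00:00Z"}


def find_ghost_machines(machines):
    """Return the IDs of records that look like unreachable "ghost" machines.

    `machines` is a list of dicts. The keys "id", "image" and "created"
    are hypothetical names used only in this sketch.
    """
    ghosts = []
    for m in machines:
        has_placeholder_date = m.get("created") in PLACEHOLDER_TIMESTAMPS
        # An empty image column may show up as "" or as a bare ":".
        has_no_image = (m.get("image") or "").strip(" :") == ""
        if has_placeholder_date and has_no_image:
            ghosts.append(m["id"])
    return ghosts


machines = [
    # The machine from this post: no image, epoch creation date.
    {"id": "e784362a436998", "image": "", "created": "1970-01-01T00:00:00Z"},
    # Hypothetical healthy machine, for contrast.
    {"id": "healthy-example", "image": "worker:latest", "created": "2024-04-06T15:22:18Z"},
]

print(find_ghost_machines(machines))
```

Run on a schedule against real data, a check like this could feed whatever alerting you already use. How the records are obtained and how the alert is sent are left out on purpose.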

I just ran a new fly deploy and a new machine was created. Everything works again, but I can’t remove this old machine… Am I currently being charged for this machine?

I would like to understand why this happened, and how to prevent it from happening again.

Thanks!

Hi Gustavo,

I took a look at your account and the server which was hosting that Fly Machine went up in flames and is never coming back. So you’re getting an error because flyctl can’t connect with that server to tell it to destroy your machine.

We’re going to destroy those Fly Machines from the backend in a little while, so you should see it gone soon.

We checked our billing system and confirmed that, from the start of the downtime, the system should NOT have allowed you to accrue any more charges for that Machine. If your bill for this month shows that a charge somehow slipped through anyway, please email billing@fly.io to ask for a refund.

Hope this helps, I’ll be around to take any more questions you have.

Thanks a lot for your fast response!

How could I be notified if this happens again?

If you log in to the dashboard on the Fly Web UI, you should see that there’s a banner up at the top which, in this case, reads

A server hosting some of your apps has suffered irreparable hardware damage. Please migrate your Fly Machines to other hosts and restore volumes from any backups.

That’s been there since this server broke. Any issues which don’t affect the Fly Platform as a whole but may impact individual users’ apps will be published in this fashion.

Is that what you’re looking for about notification?

I think I might have the same problem. I have this status update:

2024-04-01 19:56:16 UTC A server hosting some of your apps has suffered irreparable hardware damage. Please migrate your Fly Machines to other hosts and restore volumes from any backups.

As it happens the app was fine, but perhaps it was booted automatically elsewhere, and I did not notice downtime.

So I did the scale-to-0 and scale-to-2 trick, and now I have three machines:

flyctl machines list
3 machines have been retrieved from app brumstack.
View them in the UI here (https://fly.io/apps/brumstack/machines/)

brumstack
ID              NAME                    STATE   REGION  IMAGE           IP ADDRESS                      VOLUME  CREATED                 LAST UPDATED            APP PLATFORM    PROCESS GROUP   SIZE                
2871753a0e1228  billowing-sound-326     started lhr     brumstack:      fdaa:5:f9ca:a7b:19:8885:35cf:2          2024-04-06T15:22:18Z    2024-04-06T15:22:23Z    v2              app             shared-cpu-1x:256MB     
7843d29a23d728  weathered-pine-2382     started lhr     brumstack:      fdaa:5:f9ca:a7b:19:cee0:72b1:2          2024-04-06T15:22:18Z    2024-04-06T15:22:23Z    v2              app             shared-cpu-1x:256MB     
4d8902da439d18  divine-violet-9658      started         :                                                       1970-01-01T00:00:00Z    0001-01-01T00:00:00Z                                                            

I can neither stop nor kill the last one. If I try to list machines in the GUI, then I get:

There was an error loading machines

I’d expect things to be rather more robust than this in the case of failure. Now I can destroy the app and recreate it, but I wonder if it is better that I am raising it here, so that the pain point can be identified. Specifically, users need to be able to list or remove machines even if some are dead.

No rush on this, thanks for looking into it.

Could we receive an email notification? I assume it is currently in the app only.

If you had two Fly Machines running before, then they would have been on separate hosts, so when one host server died, the platform would just route all traffic to the remaining Fly Machine. This is one of the central design principles of the Fly Platform, and it’s why we encourage everyone to run multiple smaller Machines instead of one big one.
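
As a rough illustration of that idea, here is a toy sketch in Python. It is not Fly's actual proxy logic: the machine and host names, the dict layout, and both function names are hypothetical. It only shows why two Machines on separate hosts keep an app reachable when one host fails.

```python
from itertools import cycle


def healthy_machines(machines, dead_hosts):
    """Keep only machines whose host is not in the set of failed hosts."""
    return [m for m in machines if m["host"] not in dead_hosts]


def route_requests(machines, dead_hosts, n_requests):
    """Spread n_requests round-robin over the machines that are still reachable."""
    alive = healthy_machines(machines, dead_hosts)
    if not alive:
        raise RuntimeError("no reachable machine: the app is down")
    picker = cycle(alive)
    return [next(picker)["id"] for _ in range(n_requests)]


# Hypothetical layout: two machines of one app on two different hosts.
machines = [
    {"id": "machine-a", "host": "host-1"},
    {"id": "machine-b", "host": "host-2"},
]

# Both hosts up: traffic alternates between the two machines.
print(route_requests(machines, dead_hosts=set(), n_requests=4))

# host-1 has failed: every request goes to the remaining machine.
print(route_requests(machines, dead_hosts={"host-1"}, n_requests=4))
```

With a single Machine, the same host failure would leave the list of reachable machines empty, which is the case the sketch reports as an error.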

Yes, that last Machine was the one on the dead host. This is actually the first time this type of hardware failure has happened, so it’s surfacing some bugs that we’re now addressing, and that’s why you can’t destroy this Machine. But we’ve double-checked that you are not being charged for the ghost Machine and have not been since the server failed. (And in the unlikely event that we’ve double-checked wrong and there are extra charges on your statement for this month, please email billing@fly.io and we’ll fix it.)

Correct, at present you cannot receive an email notification, but this is a feature we already had planned, and it’s now been bumped up in priority.


Great stuff, thanks @john-fly. I was hoping I could do some clear-up myself, but it isn’t an important request. I shall await the official tidy-up :relieved:
