This seems to be an issue that a few others have had and posted here about. We have a machine in our app that seems to be dead, which is causing some problems:
Got a status alert banner that won’t clear (22 days now)
Machines page won’t load on web
Can’t be destroyed via fly machine destroy e784434ce03d38
Can this be removed for us or is there another route to remove it?
I have the exact same problem. I didn’t think it was much of an issue as I wasn’t being billed for it, but I’ve now spotted that we have been having intermittent connection errors to the cluster that this machine was part of ever since this happened - as if the LB is still trying it sometimes. We’re also blocked from updating the images (this is a postgres cluster). Would really appreciate some help here from the fly team - our machine image is 91851e1c934483
We currently have three host servers with issues. One of them is dead and will be decommissioned soon from the backend, which will mark all its Machines destroyed. The other two we have not yet given up on, and if we can bring them back their Machines will start back up, so the Machines can’t be marked destroyed.
I do agree that it would be convenient for users to be able to destroy machines while the host server is down, but we haven’t had time to implement that feature and it’s currently not possible, not even from the backend.
Harry is correct, you are not being charged for these Machines while they’re down. There also should be nothing in the platform which is still treating these Machines as if they were online, but this isn’t the first time I’ve heard a user say they were seeing occasional attempted connections. I’ll have to investigate that further, but for right now I’m afraid that I can’t add anything, aside from saying that if it’s happening it must be a bug in some particular codepath; the platform was designed to know when Machines are down and not route traffic to Machines that are down, and that’s what it does normally.
Thanks John - is there any way you can help by removing this machine for us? At the moment the failures are enough to have already lost us some revenue and so we are putting in some increasingly desperate work arounds.
Unfortunately there is no ready-made button I can press from the backend. The way the platform is designed, the source of truth for what Fly Machines are running on a host server is on the host server itself, and our other data stores are secondary to that one. Yes, it would be good to have a queue of Machine destroy commands that the host server can check when it starts up again, but right now that’s not part of the setup. I will look around though to see if there’s something we can do.
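To make the queue idea above concrete, here is a minimal sketch in Python. It is purely illustrative and not how the platform works today: `Host`, `DestroyQueue`, `request_destroy`, and `drain` are hypothetical names, and the only thing taken from the post is the idea that the host stays the source of truth and checks for pending destroy commands when it starts up again.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical host server; its local set of Machines is the source of truth."""
    name: str
    machines: set[str] = field(default_factory=set)
    online: bool = True

    def destroy(self, machine_id: str) -> None:
        self.machines.discard(machine_id)


class DestroyQueue:
    """Hypothetical backend queue of destroy requests, keyed by host name."""

    def __init__(self) -> None:
        self._pending: dict[str, list[str]] = {}

    def request_destroy(self, host: Host, machine_id: str) -> str:
        if host.online:
            host.destroy(machine_id)
            return "destroyed"
        # Host is down: remember the request instead of failing.
        self._pending.setdefault(host.name, []).append(machine_id)
        return "queued"

    def drain(self, host: Host) -> list[str]:
        """Called by the host when it boots; applies and clears queued destroys."""
        queued = self._pending.pop(host.name, [])
        for machine_id in queued:
            host.destroy(machine_id)
        return queued


host = Host("host-a", machines={"e784434ce03d38", "other-machine"}, online=False)
queue = DestroyQueue()
status = queue.request_destroy(host, "e784434ce03d38")  # host is down, so this is queued

host.online = True           # the host comes back
applied = queue.drain(host)  # the host checks the queue during startup
```

In this sketch the backend never edits the host's Machine list directly; it only records intent, and the host applies it on its own when it is able to.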
My bigger concern is:
That is a PROBLEM (our problem). Edges should know that a Fly Machine is down and not attempt to send any traffic their way. If that’s not happening, that’s a bug that needs fixing first. Can you give more details about what you’re seeing? I see you’re on the Launch plan, could you send the relevant excerpts from your logs to your support email? That will help us pin down what’s going wrong.
EDIT: Actually, on closer examination, since this is a Postgres cluster, any traffic coming to it wouldn’t be coming from the edges at all. It’d be really helpful if you could send us some logs to know what you’re seeing.
Hi John,
I did send an email last week, but haven’t heard back. Fortunately we haven’t seen the issue in the last 36 hours - which is a first this month. I did update the vm size to be a bit larger for the accessible machines in this cluster two days ago, which may have helped - perhaps by refreshing some cached machine state somewhere? There was no indication of resource limits requiring this from what I could see (we were usually pretty comfortably below 20% in both RAM and CPU); it was more an effort to poke things and see what happens.
Okay, we’ve done some digging and we’ve figured out what’s going on enough that I can share publicly.
Our proxy servers work fine generally. The event that triggers these connection errors you’re seeing is when the host server powers on, as it did most recently on May 20 ~16:00 UTC when we were trying out a fix. The host powered on and booted up to a certain level of operation, but not to the full health needed to actually service Fly Machines.
Our proxy servers got alerted too early in the boot process that the host was back online, and started routing traffic to its Machines. Even this doesn’t cause a problem for HTTP traffic, as the proxy is HTTP-aware and a lack of response will cause the proxy to retry the request with a Machine on a different host. But this isn’t the case for other types of TCP and UDP traffic, so that traffic would fail to connect.
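The difference between the two traffic types can be sketched roughly like this. It is a toy model in Python, not the actual proxy code: `Backend`, `route_http`, and `route_tcp` are hypothetical names, and the assumption is simply that an HTTP request can be retried on another Machine while an opaque TCP connection is sent to one Machine and either connects or fails.

```python
class Backend:
    """Hypothetical Machine behind the proxy; `healthy` says whether it can answer."""

    def __init__(self, name: str, healthy: bool) -> None:
        self.name = name
        self.healthy = healthy

    def handle(self, payload: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} did not respond")
        return f"{self.name} handled {payload}"


def route_http(backends: list[Backend], request: str) -> str:
    """HTTP-aware routing: on no response, retry the request on the next backend."""
    last_error = None
    for backend in backends:
        try:
            return backend.handle(request)
        except ConnectionError as err:
            last_error = err  # try a Machine on a different host
    raise ConnectionError("no backend responded") from last_error


def route_tcp(backends: list[Backend], stream: str) -> str:
    """Opaque TCP routing in this toy model: one backend is picked, with no retry."""
    return backends[0].handle(stream)


# The half-booted host was announced as online too early, so it is in the list.
backends = [Backend("half-booted-host", healthy=False), Backend("healthy-host", healthy=True)]

http_result = route_http(backends, "GET /")
try:
    tcp_result = route_tcp(backends, "postgres-conn")
except ConnectionError as err:
    tcp_result = f"failed: {err}"
```

Here the HTTP request quietly lands on the healthy host, while the TCP connection (for example, to a Postgres Machine) fails outright, which matches the intermittent errors described earlier in the thread.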
Now that this is known, we are working on a fix to our proxy servers. We need to pick a different moment in the host server boot process to signal to our proxy servers that the host and its Machines are back online. This fix is not yet implemented, but it should not take that long.
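As a rough illustration of what "picking a different moment" means, here is a small Python sketch. The boot stages and the `is_routable` function are hypothetical and invented for this example; the real boot process and signaling mechanism are not described in this thread.

```python
from enum import IntEnum


class BootStage(IntEnum):
    """Hypothetical boot stages, ordered from off to fully healthy."""
    OFFLINE = 0
    POWERED_ON = 1
    NETWORK_UP = 2
    MACHINES_SERVING = 3


def is_routable(stage: BootStage, ready_at: BootStage) -> bool:
    """The proxy routes to a host only once it reaches the `ready_at` stage."""
    return stage >= ready_at


# Signaling too early: a host that has only powered on already receives traffic.
early = is_routable(BootStage.POWERED_ON, ready_at=BootStage.POWERED_ON)

# Signaling at full health: the same half-booted host is skipped.
late = is_routable(BootStage.POWERED_ON, ready_at=BootStage.MACHINES_SERVING)

# Once Machines are actually serving, the host becomes routable again.
healthy = is_routable(BootStage.MACHINES_SERVING, ready_at=BootStage.MACHINES_SERVING)
```

The only change between the first two calls is the threshold, which is the point: the host's behavior is the same, but the proxy waits for a later stage before treating it as online.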
That’s the general situation. Now to the specifics of this offline host: it’s been offline for almost a month now. Long story short, it’s been quite strange. We keep thinking that we should be able to find the problem and repair it, but none of the proposed solutions has fixed it. At this point, if we can’t fix it, we need to call it off. I’m going to do a final check with the rest of our infra team.
Update: It looks like we’re trying one more hardware replacement and the hardware is already in transport to the datacenter. Give us a few more days to try this out. If this doesn’t work, we’ll call the host server dead and delete your machines from the backend.
I’m sorry that no one’s responded to your email. I’m not on the Support team myself, but I’ll draw this to their attention.
I’ve got a similar issue with a dead machine in dfw that I can’t destroy (683dde2a79d398). Went down a couple days ago, came back for a day, and just disappeared again. It’s a replica in a pg cluster and we were able to re-route traffic just fine, but can you confirm that this is a similar (or the same) hardware issue that will be resolved? Thanks!
Hi Elliot, I took a look at that Machine and it looks like its host has not gone down permanently, but it’s developed some type of networking issue (possibly hardware) as it has gone offline and needed infra team attention to come back online a couple times now. I recommend that you add a new Machine to the cluster with fly machine clone 683dde2a79d398 -r dfw and then stop and destroy 683dde2a79d398.