Machine became unresponsive, now impossible to stop/start/kill

At 12:40 pm CET I started getting alerts from my external prober, which tests my own API endpoints, that my Fly Machine was down.
I tried redeploying, which has fixed the issue in the past. No luck.
I then tried the following with the CLI:
stop, start, restart, kill. No luck.
Now the status in `fly machines list` shows `replacing`:

```
ID            	NAME                 	STATE    	CHECKS	REGION	ROLE	IMAGE                                                	IP ADDRESS                     	VOLUME              	CREATED             	LAST UPDATED        	PROCESS GROUP	SIZE
080e693b674778	restless-feather-2130	replacing	      	iad   	    	late-glade-7454:deployment-01HTG9M5HNKA69ZR3PN0FFMHV3	fdaa:2:5e4d:a7b:107:3599:e667:2	vol_8l524yjg75347zmp	2023-06-16T21:07:03Z	2025-02-02T18:23:38Z	app          	shared-cpu-1x:256MB
```

The web logs for the machine show the following:
replacing update user February 2, 2025 6:23PM
starting start flyd February 2, 2025 1:19PM
stopped exit flyd February 2, 2025 1:19PM exit_code=0,oom_killed=false,requested_stop=false

Any tips on how to stop an unstoppable machine?
Thank you

If getting the system back up is the most important thing, I would spin up a new machine before spending time on the old one.

In fact, I’d try to scale the current app up, e.g. `fly scale count 2`. If that works, I’m good and the pressure is off :slight_smile:

Then, once that’s done and the pressure is off, I’d wait a while and try to kill the old machine again, keep an eye on it to see whether it gets unstuck, and contact Fly if it stays stuck.
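For reference, the order of operations suggested here could look roughly like the sketch below. The machine ID is the one from the listing in the original post. The `run` helper is a hypothetical wrapper, not part of flyctl, that only prints each command unless `DRY_RUN=0` is set. Nothing in this thread confirms that `fly machine kill` succeeds on a machine stuck in `replacing`, so treat the last step as something to retry later.

```shell
#!/bin/sh
# Sketch only: scale up first, then come back to the stuck machine.
# Commands are printed, and executed only when DRY_RUN=0.

MACHINE_ID="080e693b674778"   # stuck machine from the listing in this thread

run() {
  # Hypothetical helper: echo the command, run it only if DRY_RUN=0.
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# 1. Bring up a second machine so the service has somewhere to run.
run fly scale count 2

# 2. Check the state of both machines.
run fly machines list

# 3. Later, with the pressure off, retry the kill on the old machine.
run fly machine kill "$MACHINE_ID"
```

Running the script as-is only prints the three commands, which makes it easy to review them before running it again with `DRY_RUN=0`.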

Thank you. Creating a new volume from the most recent snapshot and attaching it to a new machine allowed me to get the service back on track. The old machine is still stuck in “replacing” though.
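The exact commands used for this recovery are not given in the thread, so the following is only a sketch of one way it might be done with flyctl. The volume ID, machine ID and region come from the listing above. The volume name, snapshot ID, new volume ID and mount path are placeholders to be replaced with real values, and `run` is a hypothetical dry-run helper. Whether `fly machine clone` works against a source machine stuck in `replacing` is an assumption.

```shell
#!/bin/sh
# Sketch only: restore a volume from a snapshot and attach it to a new machine.
# Commands are printed, and executed only when DRY_RUN=0.

OLD_VOLUME="vol_8l524yjg75347zmp"   # volume ID from the listing in this thread
OLD_MACHINE="080e693b674778"        # stuck machine from the listing
REGION="iad"
VOLUME_NAME="data"                  # placeholder: your volume name
SNAPSHOT_ID="vs_REPLACE_ME"         # placeholder: pick one from step 1
NEW_VOLUME="vol_REPLACE_ME"         # placeholder: ID printed by step 2
MOUNT_PATH="/data"                  # placeholder: your mount path

run() {
  # Hypothetical helper: echo the command, run it only if DRY_RUN=0.
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# 1. List the snapshots available for the old volume.
run fly volumes snapshots list "$OLD_VOLUME"

# 2. Create a new volume in the same region from the chosen snapshot.
run fly volumes create "$VOLUME_NAME" --snapshot-id "$SNAPSHOT_ID" --region "$REGION"

# 3. Start a new machine with the restored volume attached.
run fly machine clone "$OLD_MACHINE" --region "$REGION" --attach-volume "$NEW_VOLUME:$MOUNT_PATH"
```

In practice the steps would be run one at a time, since the snapshot ID and the new volume ID only become known from the output of the previous step.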

