Machine stuck in replacing state

I just deployed through our GitHub and we suddenly encountered this error:

2023-10-26T15:17:44.904 runner[148e435b1d0489] sin [info] Pulling container image registry.fly.io/supacart-staging:deployment-01HDP8EGKK9EQMBJ996QE1TQWK

could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)

The machine went into suspended mode and seems to be stuck in the replacing state:

replacing	update	user  	2023-10-26T23:09:57.279+08:00

I cannot even restart the machine as it fails, possibly due to its current state:

Restarting machine 148e435b1d0489
Error: could not stop machine 148e435b1d0489: failed to restart VM 148e435b1d0489: failed_precondition: unable to restart machine, not currently started or stopped (Request ID: 01HDP9Z2T19G5M6GKJE65R73Z2-sin)

What can I do apart from recreating the machine?


I took a look and the machine is now attempting to update but taking a bit of time to pull the image.

Thanks for the response. Can you kindly explain what might cause the delay in pulling the image? Also, any way to restore the state of the machine as it seems it cannot be reached by our deployment:

✖ Machine 148e435b1d0489 [app] update failed: failed to update VM 148e435b1d0489: request returned non-2xx status, 504
Error: failed to update VM 148e435b1d0489: request returned non-2xx status, 504 (Request ID: 01HDPB5QSVV0V6GT8P2T7F8EAA-iad)

It’s usually a function of the size of the image but we’ve had some issues lately with image pull taking longer than necessary and have been working to resolve them. Looking at your app logs, it does seem like sometimes the image pull takes an excessive amount of time:

Oct 26, 2023 @ 15:49:48.552000000	Pulling container image registry.fly.io/supacart-staging:deployment-01HDP8EGKK9EQMBJ996QE1TQWK
Oct 26, 2023 @ 16:00:37.255000000	Successfully prepared image registry.fly.io/supacart-staging:deployment-01HDP8EGKK9EQMBJ996QE1TQWK (10m48.703108071s)

Compare that with a pull right afterwards, which was able to take advantage of cached layers:

Oct 26, 2023 @ 16:00:42.775000000	Pulling container image registry.fly.io/supacart-staging:deployment-01HDN14GRZB61DJB79AA21E958
Oct 26, 2023 @ 16:00:44.931000000	Successfully prepared image registry.fly.io/supacart-staging:deployment-01HDN14GRZB61DJB79AA21E958 (2.156492498s)
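To put numbers on the two pulls above, the durations in parentheses can be parsed and compared. The following is a minimal Python sketch, not something flyctl provides: the helper names `parse_duration` and `pull_times` are made up for illustration, and it assumes the durations follow the Go-style format seen in these log lines (for example `10m48.703108071s`).

```python
import re

# Image reference and duration from a "Successfully prepared image" log line.
PREPARED_RE = re.compile(r"Successfully prepared image (\S+) \(([^)]+)\)")
# One component of a Go-style duration, e.g. "10m" or "48.703108071s".
PART_RE = re.compile(r"(\d+(?:\.\d+)?)(h|ms|m|s)")
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_duration(text):
    """Convert a duration such as '10m48.703108071s' to seconds."""
    parts = PART_RE.findall(text)
    # Reject input that contains anything besides number+unit components.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(n) * UNIT_SECONDS[u] for n, u in parts)


def pull_times(lines):
    """Map each prepared image reference to its pull time in seconds."""
    times = {}
    for line in lines:
        match = PREPARED_RE.search(line)
        if match:
            times[match.group(1)] = parse_duration(match.group(2))
    return times


# The two "Successfully prepared" lines quoted above (timestamps omitted).
LOG = [
    "Successfully prepared image registry.fly.io/supacart-staging:deployment-01HDP8EGKK9EQMBJ996QE1TQWK (10m48.703108071s)",
    "Successfully prepared image registry.fly.io/supacart-staging:deployment-01HDN14GRZB61DJB79AA21E958 (2.156492498s)",
]

for image, seconds in pull_times(LOG).items():
    print(f"{seconds:10.3f}s  {image}")
```

On these two lines the uncached pull comes out roughly 300 times slower than the cached one.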

And it looks like another deploy is going through now and waiting on the image to pull:

Oct 26, 2023 @ 16:00:53.449000000	Pulling container image registry.fly.io/supacart-staging:deployment-01HDPB2947A93KJGXSJWGJRT94

I see. Yes, I did try to deploy again but was met with the same error:

Updating existing machines in 'supacart-staging' with rolling strategy
> Updating 148e435b1d0489 [app]
> Updating 148e435b1d0489 [app]
✖ Machine 148e435b1d0489 [app] update failed: failed to update VM 148e435b1d0489: request returned non-2xx status, 504
Error: failed to update VM 148e435b1d0489: request returned non-2xx status, 504 (Request ID: 01HDPBR4E95KZHBPNQ1V93RNJD-iad)

I haven’t looked at the specific 504 request yet, but it’s very possible that the host the machine is currently running on is having network issues, which would also explain the excessively slow image pulls.

Is there a way to manually migrate our app to another machine?

Yes, you can run fly m clone 148e435b1d0489 --app supacart-staging, which will create a new machine on a different underlying host. Once it’s running, you should be able to destroy 148e435b1d0489. I’d probably wait until 148e435b1d0489 is stable and no longer in the replacing state it is in now, so that the cloned machine comes from the latest deploy.
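The "wait until it is stable, then clone" advice can be sketched as a small polling loop. This is only an illustration under stated assumptions: `wait_until_stable`, `clone_command` and the `get_state` callable are hypothetical names, and how the state is actually read (CLI output, Machines API) is left to the caller. "Stable" is taken here to mean started or stopped, the two states the restart error earlier in the thread mentions.

```python
import time

# Assumption: a machine counts as stable once it is "started" or "stopped".
STABLE_STATES = {"started", "stopped"}


def wait_until_stable(get_state, timeout_s=1800, poll_s=15, sleep=time.sleep):
    """Poll get_state() until it reports a stable state.

    get_state is a caller-supplied (hypothetical) function that returns the
    machine's current state as a string, e.g. "replacing" or "started".
    Returns the stable state, or raises TimeoutError after timeout_s seconds.
    """
    waited = 0
    while True:
        state = get_state()
        if state in STABLE_STATES:
            return state
        if waited >= timeout_s:
            raise TimeoutError(f"machine still in state {state!r} after {waited}s")
        sleep(poll_s)
        waited += poll_s


def clone_command(machine_id, app):
    """Build the clone command quoted in the thread as an argument list."""
    return ["fly", "m", "clone", machine_id, "--app", app]


# Example with a fake state source standing in for a real lookup.
fake_states = iter(["replacing", "replacing", "started"])
final = wait_until_stable(lambda: next(fake_states), sleep=lambda s: None)
print(final, clone_command("148e435b1d0489", "supacart-staging"))
```

With a real state lookup plugged in, the returned list could be handed to `subprocess.run(..., check=True)` once the wait completes.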


I have the same error. I can’t deploy my app, and the app is broken now…

What error did you get?

If you’re seeing this error:

Pulling container image registry.fly.io/

Then the image is still being pulled and you need to wait; in my case it can now take 20 minutes.

Otherwise you can try deleting the builder and deploying again. I tried this and still got the same error as above, but you can give it a try to see if it works for you.

 ✖ [1/2] Machine 5683620b67968e [app] update failed: timed out waiting for machine to reach started state: failed to wait for VM 5683620b67968e in started state: Get "https://api.machines.d…
   [2/2] Waiting for job
-------
Error: timed out waiting for machine to reach started state: failed to wait for VM 5683620b67968e in started state: Get "https://api.machines.dev/v1/apps/4homeandsoulnuxt/machines/5683620b67968e/wait?instance_id=01HDR1MGYH9ENVRNBY7F76TPMZ&state=started&timeout=60": net/http: request canceled
You can increase the timeout with the --wait-timeout flag

I tried fly deploy --local-only but it’s not working

I got a blank page and many errors in the console… The local build of my Node app works without problems.

I had to clone the machines and destroy the previous ones… Now I think it’s working, but I’m still testing my app. But the most important question is: WHY?

Yup, I got the same error as you.

It is not an error from our application, it is an error from Fly infrastructure. I think the problem is described in this topic:

In the log I got the same error:

could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)

In this case we cannot do anything except destroy the machine and the builder, then recreate them and redeploy.


Yes, it helps. But I think this situation should not happen :( My app was down for ~2 hours because of this.

hey @JP_Phillips

I’m still experiencing this: the machine is stuck at replacing and I cannot do anything (stopping, restarting). I tried redeploying and reducing the image size, with no luck so far.

Here are the details:

Error: failed to update VM e784665fdd2483: request returned non-2xx status, 504 (Request ID: 01HDVC1YJCK2YXP8TGQW067CKA-dfw)

Please have a look; it’s been down for hours.

thanks!


Looking at the machine events for your app, it appears as though it’s just taking a long time to pull the image with each deploy.

launch	created															01HDVAR6BJ52A59CQPH45FT70A	2023-10-28T14:30:44
update	replacing														01HDV99GCVBWKQK3EYNP4S5QJ5	2023-10-28T14:30:44
start	started															01HDV99GCVBWKQK3EYNP4S5QJ5	2023-10-28T14:30:44
update	replaced														01HDV8HA0F13EXSE7TFNYKB8VT	2023-10-28T14:30:43
crash	stopped	exit code: 0, requested: true, oom: false, signal: 9	01HDV8HA0F13EXSE7TFNYKB8VT	2023-10-28T14:30:43
launch	created															01HDV99GCVBWKQK3EYNP4S5QJ5	2023-10-28T14:17:04
update	replacing														01HDV8HA0F13EXSE7TFNYKB8VT	2023-10-28T14:17:04
start	started															01HDV8HA0F13EXSE7TFNYKB8VT	2023-10-28T14:17:04
update	replaced														01HDV8494HQ609CENR6TJ0774Z	2023-10-28T14:17:03
crash	stopped	exit code: 0, requested: true, oom: false, signal: 9	01HDV8494HQ609CENR6TJ0774Z	2023-10-28T14:17:03
launch	created															01HDV8HA0F13EXSE7TFNYKB8VT	2023-10-28T13:52:09
update	replacing														01HDV8494HQ609CENR6TJ0774Z	2023-10-28T13:52:09
start	started															01HDV8494HQ609CENR6TJ0774Z	2023-10-28T13:52:09
update	replaced														01HDTWV6PPGDBNPGZ2T1S251B9	2023-10-28T13:52:08
crash	stopped	exit code: 0, requested: true, oom: false, signal: 9	01HDTWV6PPGDBNPGZ2T1S251B9	2023-10-28T13:52:08
launch	created															01HDV8494HQ609CENR6TJ0774Z	2023-10-28T13:38:42

We’ve been having network issues in our sin region but looking at the sequence of events, the machine is in the started state right up until the point where it gets stopped to be replaced by the new version.
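One way to read the event list above is to measure, for each instance, how long it took to go from "launch created" to "start started". The sketch below does that in Python with the relevant rows copied by hand from the events; the function name `launch_to_start_seconds` is made up for illustration, and treating that gap as mostly image pull time is an assumption based on the explanation in this post, not something the events state directly.

```python
from datetime import datetime

# (event, status, instance ID, timestamp) rows copied from the machine
# events above, newest first. Only launch/start rows are needed here.
EVENTS = [
    ("launch", "created", "01HDVAR6BJ52A59CQPH45FT70A", "2023-10-28T14:30:44"),
    ("start", "started", "01HDV99GCVBWKQK3EYNP4S5QJ5", "2023-10-28T14:30:44"),
    ("launch", "created", "01HDV99GCVBWKQK3EYNP4S5QJ5", "2023-10-28T14:17:04"),
    ("start", "started", "01HDV8HA0F13EXSE7TFNYKB8VT", "2023-10-28T14:17:04"),
    ("launch", "created", "01HDV8HA0F13EXSE7TFNYKB8VT", "2023-10-28T13:52:09"),
    ("start", "started", "01HDV8494HQ609CENR6TJ0774Z", "2023-10-28T13:52:09"),
    ("launch", "created", "01HDV8494HQ609CENR6TJ0774Z", "2023-10-28T13:38:42"),
]


def launch_to_start_seconds(events):
    """Return {instance ID: seconds from created to started, or None if not started yet}."""
    created, started = {}, {}
    for event, status, instance, timestamp in events:
        when = datetime.fromisoformat(timestamp)
        if (event, status) == ("launch", "created"):
            created[instance] = when
        elif (event, status) == ("start", "started"):
            started[instance] = when
    return {
        instance: (started[instance] - created[instance]).total_seconds()
        if instance in started
        else None
        for instance in created
    }


for instance, seconds in launch_to_start_seconds(EVENTS).items():
    label = "not started yet" if seconds is None else f"{seconds / 60:.1f} min"
    print(instance, label)
```

For these rows the three completed instances took between about 13 and 25 minutes to reach started, and the newest instance has no started event yet.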

Just wanna say that I’m experiencing the same errors, and all my apps are in the sin region. I hope this can be better reflected on the status page :smiley:


I’m investigating it now. Can you try again and see if you’re still having issues?


Not sure if you’re talking to me, but yes, I’m still having issues similar to this:

Error: failed to update VM 148ed192ae1708: request returned non-2xx status, 504 (Request ID: 01HDVFAV86WBY59V122S2CBCE2-sin)

The same happens when trying to scale up, kill, or restart.

Deploys are also stuck at:

Pulling container image registry.fly.io/5m-vm:deployment-01HDVEQB07W1XACRJ8QYVAFG5S