I just deployed through our GitHub and suddenly we encountered this error:
2023-10-26T15:17:44.904 runner[148e435b1d0489] sin [info] Pulling container image registry.fly.io/supacart-staging:deployment-01HDP8EGKK9EQMBJ996QE1TQWK
could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)
The machine went into suspended mode and seems to be stuck in the replacing state:
replacing update user 2023-10-26T23:09:57.279+08:00
I cannot even restart the machine; the restart fails, possibly due to its current state:
Restarting machine 148e435b1d0489
Error: could not stop machine 148e435b1d0489: failed to restart VM 148e435b1d0489: failed_precondition: unable to restart machine, not currently started or stopped (Request ID: 01HDP9Z2T19G5M6GKJE65R73Z2-sin)
Thanks for the response. Can you kindly explain what might cause the delay in pulling the image? Also, is there any way to restore the state of the machine? It seems it cannot be reached by our deployment:
✖ Machine 148e435b1d0489 [app] update failed: failed to update VM 148e435b1d0489: request returned non-2xx status, 504
Error: failed to update VM 148e435b1d0489: request returned non-2xx status, 504 (Request ID: 01HDPB5QSVV0V6GT8P2T7F8EAA-iad)
It’s usually a function of the size of the image, but we’ve had some issues lately with image pulls taking longer than necessary and have been working to resolve them. Looking at your app logs, it does seem like the image pull sometimes takes an excessive amount of time:
Oct 26, 2023 @ 15:49:48.552000000 Pulling container image registry.fly.io/supacart-staging:deployment-01HDP8EGKK9EQMBJ996QE1TQWK
Oct 26, 2023 @ 16:00:37.255000000 Successfully prepared image registry.fly.io/supacart-staging:deployment-01HDP8EGKK9EQMBJ996QE1TQWK (10m48.703108071s)
vs. a pull right after, which was able to take advantage of cached layers:
Oct 26, 2023 @ 16:00:42.775000000 Pulling container image registry.fly.io/supacart-staging:deployment-01HDN14GRZB61DJB79AA21E958
Oct 26, 2023 @ 16:00:44.931000000 Successfully prepared image registry.fly.io/supacart-staging:deployment-01HDN14GRZB61DJB79AA21E958 (2.156492498s)
And it looks like another deploy is going through now, waiting on the image to pull:
Oct 26, 2023 @ 16:00:53.449000000 Pulling container image registry.fly.io/supacart-staging:deployment-01HDPB2947A93KJGXSJWGJRT94
I haven’t looked at the specific 504 request yet, but it’s very possible the host the machine is currently running on is having network issues, which would also explain the excessive image pull times.
Yes, you can run fly m clone 148e435b1d0489 --app supacart-staging, which will create a new machine on a different underlying host; once it’s running, you should be able to destroy 148e435b1d0489 (a rough command sequence is sketched below). I’d probably wait until 148e435b1d0489 is stable and no longer in the replacing state it’s currently in, so that the cloned machine is based on the latest deploy.
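Roughly, the sequence would look like this (IDs and app name are taken from this thread; it's only a sketch, and you'd want to confirm the clone is healthy before destroying the original):

# check the stuck machine's current state and recent events
fly machine status 148e435b1d0489 --app supacart-staging

# clone it; the clone lands on a different underlying host
fly machine clone 148e435b1d0489 --app supacart-staging

# once the new machine is running and healthy, remove the old one
fly machine destroy 148e435b1d0489 --app supacart-staging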
While the image is pulling you just need to wait; in my case it can now take 20 minutes.
Otherwise you can try deleting the builder and deploying again (a rough sketch of the commands follows the log below). I tried this and got the same error as above, but you can give it a try and see if it works for you:
✖ [1/2] Machine 5683620b67968e [app] update failed: timed out waiting for machine to reach started state: failed to wait for VM 5683620b67968e in started state: Get "https://api.machines.d…
[2/2] Waiting for job
-------
Error: timed out waiting for machine to reach started state: failed to wait for VM 5683620b67968e in started state: Get "https://api.machines.dev/v1/apps/4homeandsoulnuxt/machines/5683620b67968e/wait?instance_id=01HDR1MGYH9ENVRNBY7F76TPMZ&state=started&timeout=60": net/http: request canceled
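For reference, the builder deletion I mentioned is roughly this (the builder app name below is made up; yours will be a different fly-builder-* name shown by fly apps list):

# find the remote builder app, named fly-builder-<something>
fly apps list

# destroy it; a fresh builder is created automatically on the next deploy
fly apps destroy fly-builder-example-1234

# deploy again
fly deploy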
You can increase the timeout with the --wait-timeout flag
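For example, something like this (the flag takes a value in seconds; 300 here is just an illustration):

fly deploy --wait-timeout 300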
I tried fly deploy --local-only but it’s not working.
I got a blank page and many errors in the console… the local build of my Node app works without problems.
It is not an error from our application; it is an error from Fly infrastructure. I think the problem is described in this topic:
In the log I got the same error:
could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)
In this case we cannot do anything except destroy the machine and the builder, then recreate them and redeploy again (see the sketch below).
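For the stuck machine itself, the destroy step would look something like this (placeholder ID and app name; --force may be needed since the machine is not in a normal stopped state), followed by deleting the builder and redeploying as sketched earlier:

# force-remove the machine that is stuck in replacing
fly machine destroy <machine-id> --force --app <app-name>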
We’ve been having network issues in our sin region, but looking at the sequence of events, the machine is in the started state right up until the point where it gets stopped to be replaced by the new version.