-------
Updating existing machines in 'my-app' with rolling strategy
-------
✔ [1/3] Machine 080e454a665de8 [app] update succeeded
⠼ [2/3] Waiting for 48ed644a77ee08 [app] to become healthy: 0/1
[3/3] Waiting for job
However, if I curl the endpoint, it returns a successful response:
GET / -> 200 OK [0.1ms]
Eventually, the deployment fails with:
-------
Updating existing machines in 'my-app' with rolling strategy
-------
✔ [1/3] Machine 080e454a665de8 [app] update succeeded
✖ [2/3] Machine 48ed644a77ee08 [app] update failed: timeout reached waiting for health checks to pass for machine 48ed644a77ee08: failed to get VM 48ed644a77ee08: G…
✖ [3/3] Machine 784e696a22d208 [app] update canceled
-------
I have done just that, but the issue persists. This is the output I see while cloning the machine:
Machine 1857599a446708 has been created...
Waiting for Machine 1857599a446708 to start...
Waiting for 1857599a446708 to become healthy (started, 0/1)
Error: error while watching health checks: context deadline exceeded
-------
Updating existing machines in 'my-app' with rolling strategy
-------
✔ [1/3] Machine 080e454a665de8 [app] update succeeded
✔ [2/3] Machine 1857599a446708 [app] update succeeded
✖ [3/3] Machine 784e696a22d208 [app] update failed: timeout reached waiting for health checks to pass for machine 784e696a22d208: failed to get VM 784e696a22d208: G…
-------
Checking DNS configuration for my-app.fly.dev
Error: timeout reached waiting for health checks to pass for machine 784e696a22d208: failed to get VM 784e696a22d208: Get "https://api.machines.dev/v1/apps/my-app/machines/784e696a22d208": net/http: request canceled
Your machine never reached the state "%s".
You can try increasing the timeout with the --wait-timeout flag
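(Side note for anyone else landing here: the flag that message refers to belongs to fly deploy. A retry with a longer timeout looks roughly like the line below; 300 is an arbitrary value, and fly deploy --help has the exact format for your flyctl version.)
-------
fly deploy --wait-timeout 300
-------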
Another thing: I don’t recall seeing the WARN could not unmount /rootfs message in my logs before:
2024-05-03T22:04:28.898 app[1857599a446708] sjc [info] INFO Sending signal SIGINT to main child process w/ PID 313
2024-05-03T22:04:29.569 app[1857599a446708] sjc [info] INFO Main child exited normally with code: 0
2024-05-03T22:04:29.584 app[1857599a446708] sjc [info] INFO Starting clean up.
2024-05-03T22:04:29.585 app[1857599a446708] sjc [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2024-05-03T22:04:29.586 app[1857599a446708] sjc [info] [ 423.892221] reboot: Restarting system
Here’s another observation: when I deploy without health checks, the curl check of the endpoint is nearly instantaneous (~0.01ms). However, when I deploy with health checks, it takes ~1.5s to return. And if I try a few more times (2 or 3), the app gets stuck and this shows up:
[PR03] could not find a good candidate within 90 attempts at load balancing. last error: [PR01] no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)
Solution: remove protocol = "https" from the health check sub-section of fly.toml.
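For anyone who runs into the same thing, this is roughly what the relevant part of my fly.toml looked like; the section names and values below are illustrative rather than copied verbatim from my config:
-------
# Illustrative fly.toml excerpt -- exact sections and values may differ from your setup
[http_service]
  internal_port = 8080
  force_https = true

  [[http_service.checks]]
    grace_period = "5s"
    interval = "10s"
    timeout = "2s"
    method = "GET"
    path = "/"
    protocol = "https"  # removing this line (and letting it fall back to the default) is what made the checks pass
-------
With that one line gone, the health checks passed and the rolling deploy completed normally.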
I don’t know why. Perhaps the fly.io team may want to look into why this is so fragile. It would have been great to see a suggestion that protocol = "https" was unnecessary (or actively harmful) here.
Anyway… problem averted! Thanks a lot for the help!