Unable to perform health checks

I have been reading quite a bit about this but I’m unable to make it work. This is what I have:

app = "my-app"
primary_region = "sjc"

[build]

[deploy]
  strategy = "rolling"

[http_service]
  internal_port = 8443
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 2
  processes = ["app"]

[[services]]
  processes = ["app"]
  internal_port = 8443
  protocol = "tcp"
  
[[services.ports]]
  port = 443
  handlers = ["tls"]

[[http_service.checks]]
  grace_period = "10s"
  interval = "1m0s"
  method = "GET"
  path = "/"
  timeout = "5s"
  protocol = "https"
  tls_skip_verify = true

The deployment is stuck here:

-------
Updating existing machines in 'my-app' with rolling strategy

-------
 ✔ [1/3] Machine 080e454a665de8 [app] update succeeded
 ⠼ [2/3] Waiting for 48ed644a77ee08 [app] to become healthy: 0/1
   [3/3] Waiting for job

However, if I curl the endpoint, it returns a successful response:

GET / -> 200 OK [0.1ms]

Eventually, deployment fails with:

-------
Updating existing machines in 'my-app' with rolling strategy

-------
 ✔ [1/3] Machine 080e454a665de8 [app] update succeeded
 ✖ [2/3] Machine 48ed644a77ee08 [app] update failed: timeout reached waiting for health checks to pass for machine 48ed644a77ee08: failed to get VM 48ed644a77ee08: G…
 ✖ [3/3] Machine 784e696a22d208 [app] update canceled
-------

Any idea what could be wrong? Thanks!


Since the first machine works and the second one doesn’t, maybe there’s a network issue on your second machine? Here’s a tip:

  1. Destroy the second machine with fly m destroy -f 48ed644a77ee08
  2. Clone the good machine with fly m clone 080e454a665de8 -r REGION
  3. Check if you can fly deploy.

This is a wild guess, but hopefully it helps.

Thank you @lubien,

I have done just that, but the issue persists. As part of cloning, this is the output I see:

  Machine 1857599a446708 has been created...
  Waiting for Machine 1857599a446708 to start...
  Waiting for 1857599a446708 to become healthy (started, 0/1)
  Error: error while watching health checks: context deadline exceeded

I wonder what could be causing this…

What’s clear is that if I deploy with this:

app = "my-app"
primary_region = "sjc"

[build]

[http_service]
  internal_port = 8443
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  processes = ["app"]

It works fine. It’s the health check that for some reason times out.

This was very helpful, I believe you might have the same issue as:

Let me know if that fixes it!

Hello @lubien,
I have simplified it like so:

app = "my-app"
primary_region = "sjc"

[build]

[http_service]
  internal_port = 8443
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  processes = ["app"]
  [[http_service.checks]]
    grace_period = "10s"
    interval = "1m0s"
    method = "GET"
    path = "/"
    timeout = "5s"
    protocol = "https"
    tls_skip_verify = true

Still not happy:

-------
Updating existing machines in 'my-app' with rolling strategy

-------
 ✔ [1/3] Machine 080e454a665de8 [app] update succeeded
 ✔ [2/3] Machine 1857599a446708 [app] update succeeded
 ✖ [3/3] Machine 784e696a22d208 [app] update failed: timeout reached waiting for health checks to pass for machine 784e696a22d208: failed to get VM 784e696a22d208: G…
-------
Checking DNS configuration for my-app.fly.dev
Error: timeout reached waiting for health checks to pass for machine 784e696a22d208: failed to get VM 784e696a22d208: Get "https://api.machines.dev/v1/apps/my-app/machines/784e696a22d208": net/http: request canceled
Your machine never reached the state "%s".

You can try increasing the timeout with the --wait-timeout flag

Another thing: I don’t recall seeing "WARN could not unmount /rootfs" in my logs before:

2024-05-03T22:04:28.898 app[1857599a446708] sjc [info] INFO Sending signal SIGINT to main child process w/ PID 313
2024-05-03T22:04:29.569 app[1857599a446708] sjc [info] INFO Main child exited normally with code: 0
2024-05-03T22:04:29.584 app[1857599a446708] sjc [info] INFO Starting clean up.
2024-05-03T22:04:29.585 app[1857599a446708] sjc [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2024-05-03T22:04:29.586 app[1857599a446708] sjc [info] [ 423.892221] reboot: Restarting system

Thanks so much for your help!

Here’s another observation: when I deploy without health checks, the curl command to check the health is nearly instantaneous (~0.01ms). However, when I deploy with health checks, it takes ~1.5s to return. And if I try a few more times, 2 or 3, the app gets stuck and this shows up:

[PR03] could not find a good candidate within 90 attempts at load balancing. last error: [PR01] no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the ‘immediate’ strategy? have your app’s instances all reached their hard limit?)

This never happens if I skip health checks.

@lubien, I found the issue!!!

Broken config:

app = "my-app"
primary_region = "sjc"

[build]

[http_service]
  internal_port = 8443
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  processes = ["app"]
  [[http_service.checks]]
    grace_period = "10s"
    interval = "1m0s"
    method = "GET"
    path = "/"
    timeout = "5s"
    protocol = "https"

Solution: remove protocol = "https" from the sub-section.
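
For reference, a sketch of what the config looks like with that line removed, assuming nothing else changes from the broken config above (why the check behaves differently is not something I have verified):

```toml
app = "my-app"
primary_region = "sjc"

[build]

[http_service]
  internal_port = 8443
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  processes = ["app"]

  [[http_service.checks]]
    grace_period = "10s"
    interval = "1m0s"
    method = "GET"
    path = "/"
    timeout = "5s"
    # protocol = "https" removed here; in this thread the deploy
    # only passed its health checks once this line was gone
```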

I don’t know why this breaks the check. Perhaps the Fly.io team may want to look into why this is so fragile. It would have been great to see a warning stating that protocol = "https" was not needed or not valid here.

Anyway… problem averted! Thanks a lot for the help!


Thanks for the heads up!

We are already looking into improving some of that experience:

Thanks for sharing your feedback on the protocol bit!
