Unable to perform health checks

tito · May 2, 2024, 3:25am

I have been reading quite a bit about this but I’m unable to make it work. This is what I have:

app = "my-app"
primary_region = "sjc"

[build]

[deploy]
  strategy = "rolling"

[http_service]
  internal_port = 8443
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 2
  processes = ["app"]

[[services]]
  processes = ["app"]
  internal_port = 8443
  protocol = "tcp"
  
[[services.ports]]
  port = 443
  handlers = ["tls"]

[[http_service.checks]]
  grace_period = "10s"
  interval = "1m0s"
  method = "GET"
  path = "/"
  timeout = "5s"
  protocol = "https"
  tls_skip_verify = true

The deployment is stuck here:

-------
Updating existing machines in 'my-app' with rolling strategy

-------
 ✔ [1/3] Machine 080e454a665de8 [app] update succeeded
 ⠼ [2/3] Waiting for 48ed644a77ee08 [app] to become healthy: 0/1
   [3/3] Waiting for job

However, if I curl the endpoint, it returns a successful response:

GET / -> 200 OK [0.1ms]

Eventually, deployment fails with:

-------
Updating existing machines in 'my-app' with rolling strategy

-------
 ✔ [1/3] Machine 080e454a665de8 [app] update succeeded
 ✖ [2/3] Machine 48ed644a77ee08 [app] update failed: timeout reached waiting for health checks to pass for machine 48ed644a77ee08: failed to get VM 48ed644a77ee08: G…
 ✖ [3/3] Machine 784e696a22d208 [app] update canceled
-------

Any idea what could be wrong? Thanks!

lubien · May 2, 2024, 11:37am

Since the first machine works and the second one doesn’t, maybe there’s a network issue on the second machine? Here’s a tip:

1. Destroy the second machine with fly m destroy -f 48ed644a77ee08
2. Clone the good machine with fly m clone 080e454a665de8 -r REGION
3. Check if you can fly deploy.

This is a wild guess, but hopefully it helps.

tito · May 3, 2024, 12:27am

Thank you @lubien,

I have done just that, but the issue persists. As part of cloning, this is the output I see:

  Machine 1857599a446708 has been created...
  Waiting for Machine 1857599a446708 to start...
  Waiting for 1857599a446708 to become healthy (started, 0/1)
  Error: error while watching health checks: context deadline exceeded

I wonder what could be causing this…

tito · May 3, 2024, 1:19am

What’s clear is that if I deploy with this:

app = "my-app"
primary_region = "sjc"

[build]

[http_service]
  internal_port = 8443
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  processes = ["app"]

It works fine. It’s the health check that for some reason times out.

lubien · May 3, 2024, 11:59am

This was very helpful, I believe you might have the same issue as:

Let me know if that fixes it!

tito · May 3, 2024, 10:07pm

Hello @lubien,
I have simplified it like so:

app = "my-app"
primary_region = "sjc"

[build]

[http_service]
  internal_port = 8443
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  processes = ["app"]
  [[http_service.checks]]
    grace_period = "10s"
    interval = "1m0s"
    method = "GET"
    path = "/"
    timeout = "5s"
    protocol = "https"
    tls_skip_verify = true

Still not happy:

-------
Updating existing machines in 'my-app' with rolling strategy

-------
 ✔ [1/3] Machine 080e454a665de8 [app] update succeeded
 ✔ [2/3] Machine 1857599a446708 [app] update succeeded
 ✖ [3/3] Machine 784e696a22d208 [app] update failed: timeout reached waiting for health checks to pass for machine 784e696a22d208: failed to get VM 784e696a22d208: G…
-------
Checking DNS configuration for my-app.fly.dev
Error: timeout reached waiting for health checks to pass for machine 784e696a22d208: failed to get VM 784e696a22d208: Get "https://api.machines.dev/v1/apps/my-app/machines/784e696a22d208": net/http: request canceled
Your machine never reached the state "%s".

You can try increasing the timeout with the --wait-timeout flag

Another thing: I don’t recall seeing WARN could not unmount /rootfs in my logs before:

2024-05-03T22:04:28.898 app[1857599a446708] sjc [info] INFO Sending signal SIGINT to main child process w/ PID 313
2024-05-03T22:04:29.569 app[1857599a446708] sjc [info] INFO Main child exited normally with code: 0
2024-05-03T22:04:29.584 app[1857599a446708] sjc [info] INFO Starting clean up.
2024-05-03T22:04:29.585 app[1857599a446708] sjc [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2024-05-03T22:04:29.586 app[1857599a446708] sjc [info] [ 423.892221] reboot: Restarting system

Thanks so much for your help!

tito · May 3, 2024, 10:17pm

Here’s another observation: when I deploy without health checks, the curl command to check the health is nearly instantaneous (~0.01ms). However, when I deploy with health checks, the same request takes ~1.5s to return. And if I try 2 or 3 more times, the app gets stuck and this shows up:

[PR03] could not find a good candidate within 90 attempts at load balancing. last error: [PR01] no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the ‘immediate’ strategy? have your app’s instances all reached their hard limit?)

This never happens if I skip health checks.

tito · May 4, 2024, 8:58pm

@lubien found the issue!!!

Broken config:

app = "my-app"
primary_region = "sjc"

[build]

[http_service]
  internal_port = 8443
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  processes = ["app"]
  [[http_service.checks]]
    grace_period = "10s"
    interval = "1m0s"
    method = "GET"
    path = "/"
    timeout = "5s"
    protocol = "https"

Solution: remove protocol = "https" from the sub-section.
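
For reference, here is a sketch of that config with the one line removed. Everything else is copied from the broken config above. The thread does not establish why the line caused the failure, so the comment in the block only records what was changed, and the remark about the default protocol is an assumption rather than something confirmed here.

```toml
app = "my-app"
primary_region = "sjc"

[build]

[http_service]
  internal_port = 8443
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  processes = ["app"]

  [[http_service.checks]]
    grace_period = "10s"
    interval = "1m0s"
    method = "GET"
    path = "/"
    timeout = "5s"
    # protocol = "https" was removed from this sub-section; the check then
    # presumably uses its default protocol (an assumption, not verified here).
```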

I don’t know why. Perhaps the fly.io team may want to look into why this is so fragile. It would have been great to see a warning stating that protocol = "https" was unnecessary or harmful.

Anyway… problem averted! Thanks a lot for the help!

lubien · May 6, 2024, 5:51pm

Thanks for the heads up!

We are already looking into improving some of that experience:

github.com/superfly/flyctl

Prevent users from deploying fly tomls with duplicate service definitions

superfly:master ← superfly:fix-duplicate-services

opened 11:03PM - 03 May 24 UTC

gwuah

+41 -0

### Change Summary

Users sometimes end up defining multiple services for the same protocol & port. This harmless mistake would cause health-checks to fail because of only one of these services get picked by flaps. To fix this, I'm introducing some client-side validation to flag these issues. They would now see this warning/help text 👇

```
Service [tcp-1738] has 2 duplicate definitions. To resolve this, merge them into 1 service.
```

References

https://community.fly.io/t/health-checks-failing-with-failed-to-get-vm/18077
https://community.fly.io/t/unable-to-perform-health-checks/19604
https://community.fly.io/t/health-checks-always-failing/19641/2
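
To illustrate the duplicate-definition case the PR describes: the first config posted in this thread declared internal port 8443 under both [http_service] and a separate [[services]] block. A minimal sketch of merging them into one definition might look like the following. It assumes that [http_service] on its own already exposes port 443 for the app, and whether the new validation would flag that exact config is also an assumption.

```toml
# Hypothetical merged layout: a single service definition for internal port 8443.
[http_service]
  internal_port = 8443
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 2
  processes = ["app"]

  # Checks stay attached to the one remaining service.
  [[http_service.checks]]
    grace_period = "10s"
    interval = "1m0s"
    method = "GET"
    path = "/"
    timeout = "5s"

# The separate [[services]] block for internal_port 8443 and its
# [[services.ports]] entry for port 443 are intentionally left out,
# so only one service targets that protocol and port.
```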

Thanks for sharing your feedback on the protocol bit!

system · May 13, 2024, 5:52pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
