Custom Health Check on Phoenix fails and creates Zombie Machines

Hey folks,

I have a working Phoenix application running but realised that the default health check doesn’t actually check that the machine is “healthy” (as in: can serve web requests). The other day, I deployed a bug that started the application, but caused every request to fail because of a configuration mistake.

So, I wanted to add a custom [[services.http_checks]] to the fly.toml that checks whether an HTTP request to /healthy returns 200.

TLDR: The health check never passes, so I can’t deploy new versions. When I deploy a version, the new machine never becomes “healthy”, and it keeps running even after I cancel the deployment.

Here is my fly.toml:

app = "redacted"
primary_region = "arn"
kill_signal = "SIGTERM"


[build]

[deploy]
release_command = "/app/bin/migrate"
strategy = "canary"

[env]
PORT = "8080"
DNS_CLUSTER_QUERY = "redacted"

[http_service]
internal_port = 8080
force_https = false
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 1
processes = ["app"]
[http_service.concurrency]
type = "connections"
hard_limit = 1000
soft_limit = 1000

[[services.ports]]
handlers = ["http"]
port = 8080

[[services.http_checks]]
interval = 10000
grace_period = "5s"
method = "get"
path = "/healthy"
protocol = "http"
timeout = 2000
tls_skip_verify = false

and the critical parts of my runtime.exs:

host = get_env!("PHX_HOST")
port = get_env("PORT", 4000, :int)

config :vcp, VcpWeb.Endpoint,
  url: [host: host, port: 443, scheme: "https"],
  http: [
    ip: {0, 0, 0, 0, 0, 0, 0, 0},
    port: port
  ],
  secret_key_base: secret_key_base,
  check_origin: [
    "https://#{host}",
    "https://www.#{host}"
  ]

When I run fly deploy --remote-only, the application is built and deployed correctly, but the step that waits for the machine to become healthy never completes. When I stop the deployment with CTRL + C, it stops, but the leases are not cleared right away. I had to clear them manually with fly machines leases clear.

When I run fly checks list, I don’t see a specific error, just:

➜  fly checks list --debug --verbose
Health Checks for redacted
  NAME                      | STATUS  | MACHINE        | LAST UPDATED | OUTPUT
----------------------------*---------*----------------*--------------*----------------------------
  servicecheck-00-http-8080 | warning | d89dee2c400348 | 45s ago      | waiting for status update
----------------------------*---------*----------------*--------------*----------------------------

I currently have 6 machines that are stopped, but I can’t kill or delete them. :man_shrugging:

Please help :smiley:

Hey Peter, I have a custom health check on my Phoenix app. This is what I do:

app = "my-app"
kill_signal = "SIGTERM"
kill_timeout = 5
mounts = []
primary_region = "my-region"
processes = []
swap_size_mb = 512

[deploy]
  strategy = "bluegreen"
  release_command = "/app/bin/migrate"

[[services]]
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"
  [services.concurrency]
    hard_limit = 10000
    soft_limit = 9000
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

  [[services.http_checks]]
    interval = 10000
    grace_period = "5s"
    method = "get"
    path = "/_healthcheck"
    protocol = "http"
    restart_limit = 0
    timeout = 2000
    tls_skip_verify = false
    [services.http_checks.headers]

My endpoint configuration looks nearly identical to yours (though I don’t check the protocol in check_origin and use a function callback instead).

fly checks list --debug --verbose -a my-app
Health Checks for my-app
  NAME                      | STATUS  | MACHINE        | LAST UPDATED | OUTPUT 
----------------------------*---------*----------------*--------------*------------------
  servicecheck-00-tcp-8080  | passing | 2865659b74e038 | 19h26m ago   | Success
----------------------------*---------*----------------*--------------*------------------
  servicecheck-01-http-8080 | passing | 2865659b74e038 | 19h26m ago   | {"status":"ok"}
----------------------------*---------*----------------*--------------*------------------
  servicecheck-00-tcp-8080  | passing | 4d8969ec937d87 | 19h26m ago   | Success
----------------------------*---------*----------------*--------------*------------------
  servicecheck-01-http-8080 | passing | 4d8969ec937d87 | 19h26m ago   | {"status":"ok"}
----------------------------*---------*----------------*--------------*------------------
  servicecheck-00-tcp-8080  | passing | 4d89d59ef32048 | 19h26m ago   | Success
----------------------------*---------*----------------*--------------*------------------
  servicecheck-01-http-8080 | passing | 4d89d59ef32048 | 19h26m ago   | {"status":"ok"}
----------------------------*---------*----------------*--------------*------------------
  servicecheck-00-tcp-8080  | passing | 90806d4efd1e98 | 19h26m ago   | Success
----------------------------*---------*----------------*--------------*------------------
  servicecheck-01-http-8080 | passing | 90806d4efd1e98 | 19h26m ago   | {"status":"ok"}
----------------------------*---------*----------------*--------------*------------------
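
For anyone wondering what could sit behind a path like /_healthcheck: below is a minimal sketch of a plug that would produce the {"status":"ok"} output shown above. The module name MyAppWeb.HealthCheckPlug is hypothetical and not taken from either app in this thread, and the sketch only proves that the endpoint can answer HTTP requests; it does not check the database or any other dependency.

```elixir
defmodule MyAppWeb.HealthCheckPlug do
  @moduledoc """
  Hypothetical health check plug (an illustration, not code from this thread).
  Answers GET /_healthcheck with 200 and a small JSON body, and passes every
  other request through untouched.
  """
  @behaviour Plug
  import Plug.Conn

  @impl true
  def init(opts), do: opts

  @impl true
  # Match only the health check path; the path is an assumption and must
  # agree with `path` under [[services.http_checks]] in fly.toml.
  def call(%Plug.Conn{request_path: "/_healthcheck"} = conn, _opts) do
    conn
    |> put_resp_content_type("application/json")
    |> send_resp(200, ~s({"status":"ok"}))
    # Stop the plug pipeline so the router never sees this request.
    |> halt()
  end

  # Any other request continues down the pipeline unchanged.
  def call(conn, _opts), do: conn
end
```

One way to use it would be to add plug MyAppWeb.HealthCheckPlug near the top of the endpoint module, before the router, so the check still answers if something later in the pipeline is misconfigured.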