Unrecoverable error: timeout reached waiting for health checks to pass

Hi everyone,

I’ve been running into deployment failures with timeout errors (details below). A few days ago I was able to work around it by adding a new app machine and removing the old one, but now that approach doesn’t work anymore.

Has anyone seen something like this before? Any ideas on what could be causing the problem, or things I should try checking?

Thanks in advance!

Scaling:

Groups
NAME COUNT KIND CPUS MEMORY REGIONS
app 2 shared 2 512 MB fra(2)
cron 1 shared 1 512 MB fra

Deployment log:

[1/3] Running smoke checks on machine e829341f74d938
[2/3] Running machine checks on machine 9185902a133dd8
[2/3] Checking health of machine 9185902a133dd8
✔ [2/3] Machine 9185902a133dd8 is now in a good state
[1/3] Running machine checks on machine e829341f74d938
[1/3] Checking health of machine e829341f74d938
✖ [1/3] Unrecoverable error: timeout reached waiting for health checks to pass for machine e829341f74d938: failed to get VM e829341f74d938: Get "https://api.machines.dev/v1/apps/**HIDDEN**/machines/e829341f74d938": net/http: request canceled
✖ [3/3] skipping machine update due to earlier failure
[3/3] Clearing lease for d8d3107b2d7118
[1/3] Clearing lease for e829341f74d938
[2/3] Clearing lease for 9185902a133dd8
✔ [2/3] Cleared lease for 9185902a133dd8
✔ [3/3] Cleared lease for d8d3107b2d7118
✔ [1/3] Cleared lease for e829341f74d938
Error: failed to update machine e829341f74d938: Unrecoverable error: timeout reached waiting for health checks to pass for machine e829341f74d938: failed to get VM e829341f74d938: Get "https://api.machines.dev/v1/apps/**HIDDEN**/machines/e829341f74d938": net/http: request canceled

App log:

2025-09-30 17:47:39.786
Health check on port 8080 has failed. Your app is not responding properly. Services exposed on ports [80, 443] will have intermittent failures until the health check passes.
2025-09-30 17:47:39.753
2025/09/30 14:47:39 INFO SSH listening listen_address=[fdaa:0:a63d:a7b:504:ca99:253e:2]:22
2025-09-30 17:47:39.578
Machine created and started in 4.391s
2025-09-30 17:47:39.536
INFO [fly api proxy] listening at /.fly/api
2025-09-30 17:47:39.530
INFO Preparing to run: /entrypoint as root

2025-09-30 17:47:39.424
INFO Starting init (commit: 1cd134d4)…
2025-09-30 17:47:38.724
2025-09-30T14:47:38.724136944 [01K6DHFNF55H4YNHQDPEVQQFSA:main] Listening on API socket (“/fc.sock”).
2025-09-30 17:47:38.724
2025-09-30T14:47:38.723923366 [01K6DHFNF55H4YNHQDPEVQQFSA:main] Running Firecracker v1.12.1
2025-09-30 17:47:37.655
[ 2220.531564] reboot: Restarting system
2025-09-30 17:47:37.652
INFO Starting clean up.
2025-09-30 17:47:37.628
INFO Main child exited normally with code: 0
2025-09-30 17:47:36.675
2025-09-30 14:47:36,675 INFO stopped: php (exit status 0)
2025-09-30 17:47:36.647
[30-Sep-2025 14:47:36] NOTICE: exiting, bye-bye!
2025-09-30 17:47:36.645
[30-Sep-2025 14:47:36] NOTICE: Terminating …
2025-09-30 17:47:36.641
2025-09-30 14:47:36,641 INFO stopped: nginx (exit status 0)
2025-09-30 17:47:36.627
2025-09-30 14:47:36,626 INFO waiting for php, nginx to die
2025-09-30 17:47:36.626
2025-09-30 14:47:36,626 WARN received SIGINT indicating exit request
2025-09-30 17:47:36.458
INFO Sending signal SIGINT to main child process w/ PID 649
2025-09-30 17:47:36.421
Configuring firecracker
2025-09-30 17:47:35.467
Container image registry.fly.io/HIDDEN@sha256:0baad0cc2144f318e414aa2091a4bac0a7298a12b140c21520230f0774cfd859 already prepared
2025-09-30 17:47:35.466
Pulling container image registry.fly.io/HIDDEN@sha256:0baad0cc2144f318e414aa2091a4bac0a7298a12b140c21520230f0774cfd859

fly.toml

app = "***"
primary_region = "fra"
kill_signal = "SIGINT"
kill_timeout = "5s"

[build]
  [build.args]
    NODE_VERSION = "18"
    PHP_VERSION = "8.2"

[env]
  APP_DEBUG = "false"
  APP_ENV = "production"
  APP_NAME = "HH ***"
  APP_URL = "https://***.***.com"
  AWS_BUCKET = "***.***"
  AWS_DEFAULT_REGION = "eu-central-1"
  AWS_URL = "apigateway.eu-central-1.amazonaws.com"
  CACHE_DRIVER = "database"
  DB_CONNECTION = "pgsql"
  LOG_CHANNEL = "stderr"
  LOG_LEVEL = "info"
  LOG_STDERR_FORMATTER = "Monolog\\Formatter\\LineFormatter"
  MAIL_ENCRYPTION = "TLS"
  MAIL_FROM_ADDRESS = "***@***"
  MAIL_FROM_NAME = "HH ***"
  MAIL_HOST = "email-smtp.eu-central-1.amazonaws.com"
  MAIL_MAILER = "smtp"
  MAIL_PORT = "587"
  QUEUE_CONNECTION = "database"
  RECORD_JOB_EXECUTION_TIME = "true"
  SESSION_DRIVER = "cookie"
  SESSION_LIFETIME = "4320"
  SESSION_SECURE_COOKIE = "true"

[processes]
  app = ""
  cron = "cron -f"

[[services]]
  protocol = "tcp"
  internal_port = 8080
  processes = ["app"]

  [[services.ports]]
    port = 80
    handlers = ["http"]
    force_https = true

  [[services.ports]]
    port = 443
    handlers = ["tls", "http"]
  [services.concurrency]
    type = "connections"
    hard_limit = 25
    soft_limit = 20

  [[services.tcp_checks]]
    interval = "15s"
    timeout = "2s"
    grace_period = "1s"
    restart_limit = 0

The timeout error means the machine didn’t pass its health checks within the deploy’s wait timeout, which defaults to 5 minutes.

This can be normal if your app takes a long time to be ready to serve requests after a machine starts. I’d try raising the wait timeout with fly deploy --wait-timeout 10m0s, which gives each machine 10 minutes to become healthy. You may also want to increase the grace period for the service check: your fly.toml sets grace_period = "1s", which leaves the app very little time to start listening on port 8080 before checks begin.
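
As a rough sketch of the grace period suggestion, the TCP check in the posted fly.toml could be adjusted along these lines. The 30s value is only an assumption for illustration; the right number depends on how long nginx and PHP actually take to start listening on port 8080 in this app.

```toml
[[services]]
  protocol = "tcp"
  internal_port = 8080
  processes = ["app"]

  [[services.tcp_checks]]
    interval = "15s"
    timeout = "2s"
    # Assumed value, not a recommendation: was "1s" in the posted config.
    # Check failures during the grace period right after boot are not
    # counted, so a slow-starting app is not marked unhealthy too early.
    grace_period = "30s"
```

If the deploy still times out with a longer grace period, the app is probably not coming up on port 8080 at all, and the app logs around the "Health check on port 8080 has failed" message are the next place to look.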