flyctl deploy stuck: health checks frozen, app unreachable (nrt)

# Health checks frozen / `net/http: request canceled` on machine status polling — app unreachable for 1h45m+ (nrt)

**App**: `keimimansai-shift`

**Machine**: `e78451d2a02d78` (region: `nrt`, shared-cpu-1x, 512mb)

**flyctl**: v0.4.59 windows/amd64 (Commit d10482182142f259db338dcef34556a67702290c, BuildDate 2026-06-09)

**Org**: personal

## Summary

Since ~15:49 JST today, this app has been unreachable via its public URL (`https://keimimansai-shift.fly.dev/`), and `flyctl deploy` consistently fails at the final health-check confirmation step. The condition has not changed at all across 6 separate `flyctl deploy` runs and 2 different `http_service` bind configurations over ~1h45m.

## Symptom 1: `flyctl deploy` fails at health-check wait

Every deploy succeeds at build / push / machine-config-update and reaches “started” state (currently on release version 6), but then fails:

```
> Waiting for machine e78451d2a02d78 to reach a good state
> Machine e78451d2a02d78 reached started state
> Running smoke checks on machine e78451d2a02d78
> Running machine checks on machine e78451d2a02d78
> Checking health of machine e78451d2a02d78
✖ Unrecoverable error: timeout reached waiting for health checks to pass for machine e78451d2a02d78: failed to get VM e78451d2a02d78: Get "https://api.machines.dev/v1/apps/keimimansai-shift/machines/e78451d2a02d78": net/http: request canceled
> Clearing lease for e78451d2a02d78
✔ Cleared lease for e78451d2a02d78
Error: failed to update machine e78451d2a02d78: Unrecoverable error: timeout reached waiting for health checks to pass for machine e78451d2a02d78: failed to get VM e78451d2a02d78: Get "https://api.machines.dev/v1/apps/keimimansai-shift/machines/e78451d2a02d78": net/http: request canceled
```

## Symptom 2: `flyctl checks list` output is frozen/stale

```
Health Checks for keimimansai-shift
NAME                      │ STATUS   │ MACHINE        │ LAST UPDATED │ OUTPUT
──────────────────────────┼──────────┼────────────────┼──────────────┼─────────────────────────────
servicecheck-00-http-3000 │ critical │ e78451d2a02d78 │ 1h45m ago    │ connect: connection refused
```

The `STATUS`/`OUTPUT` here have not changed across all 6 deploy attempts and 2 different `http_service.checks` configurations. Only the “ago” duration advances by real elapsed time, suggesting this health-check record is stuck and not being re-evaluated.

## Symptom 3: Public URL times out completely

```
curl -v --max-time 20 https://keimimansai-shift.fly.dev/
```

TLS handshake completes, request is sent, but 0 bytes are received before a 20s timeout. fly-proxy does not appear to be routing traffic to the machine.

## What I’ve ruled out

- **Not a local network issue**: `curl https://api.machines.dev/v1/apps/keimimansai-shift` (unauthenticated) from the same machine/network returns HTTP 401 in ~0.6s. General connectivity to `api.machines.dev` is fine.

- **Not an app-level issue**: via `flyctl ssh console`, confirmed the app process is listening on `0.0.0.0:3000` (via `/proc/net/tcp`, local address `00000000:0BB8`, state `0A`/LISTEN) and responds `HTTP 307` to `http://127.0.0.1:3000/`.

- **Not an `http_service` config issue**: `fly.toml` is standard: `internal_port = 3000`, `force_https = true`, one `[[http_service.checks]]` with `method = "GET"`, `path = "/"`, `interval = "30s"`, `timeout = "5s"`, `grace_period = "10s"`. Tried both `-H 0.0.0.0` and `-H ::` in the Dockerfile CMD, with the same result either way (currently reverted to `-H 0.0.0.0`, which is the address Fly’s own socket-scan warning recommends).

- **Not a broad platform incident**: status.flyio.net shows all systems operational, NRT region 100% uptime over 90 days, no related incidents for Machines API / health checks / fly-proxy.
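For readers who have not decoded `/proc/net/tcp` by hand, the listening-socket evidence in the list above (`00000000:0BB8`, state `0A`) can be checked with a small sketch like the following. The function and type names are illustrative only and not part of any Fly.io or Linux tooling; it handles IPv4 entries only and assumes a little-endian host.

```typescript
// Decode one IPv4 socket entry from /proc/net/tcp.
// All names here are illustrative; this is a sketch, not an existing tool.

// Subset of kernel TCP state codes as printed in the "st" column.
const TCP_STATES: Record<string, string> = {
  "01": "ESTABLISHED",
  "06": "TIME_WAIT",
  "0A": "LISTEN",
};

interface SocketEntry {
  ip: string;
  port: number;
  state: string;
}

function decodeProcNetTcp(localAddress: string, st: string): SocketEntry {
  const [hexIp, hexPort] = localAddress.split(":");
  // The address is a 32-bit value printed in host byte order, so on
  // little-endian machines the octets read right to left.
  const octets = (hexIp.match(/../g) ?? [])
    .map((h) => parseInt(h, 16))
    .reverse();
  return {
    ip: octets.join("."),
    port: parseInt(hexPort, 16), // the port is plain hex: 0x0BB8 = 3000
    state: TCP_STATES[st.toUpperCase()] ?? `UNKNOWN(${st})`,
  };
}

const entry = decodeProcNetTcp("00000000:0BB8", "0A");
console.log(`${entry.ip}:${entry.port} ${entry.state}`);
```

A wildcard address of `0.0.0.0` means the process accepts connections on every IPv4 interface, which is why a listening socket alone does not explain a failing check.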

## Timeline

- ~15:49 JST: health check first observed `critical` / `connection refused`, has not changed since.

- Since then: 6× `flyctl deploy` (full image rebuild + rolling update each time, now at release v6) and several `flyctl machine restart`. Each restart succeeds (machine restarts, file timestamps update), but the CLI’s health-wait never returns (had to be cancelled).

- Across this whole window, the app has remained internally healthy (verified via SSH + direct HTTP request to 127.0.0.1:3000 each time).

## Question

Could this be a stuck health-check evaluator or fly-proxy route registration specific to this machine (`e78451d2a02d78`)? Is there a way to force fly-proxy / the health-check system to re-register/re-evaluate for this machine without recreating it (recreating risks duplicating the attached volume `shift_app_data` mounted at `/data`, which holds a SQLite database I’d rather not fork)?

Any guidance on unsticking this would be appreciated.

Update: I tried the most invasive self-service fix I could think of: destroying the machine entirely and letting `flyctl deploy` recreate it from scratch (volume `vol_r1jy25z7gl0yxowr` survives independently of the machine, so the SQLite data on `/data` wasn’t at risk).

flyctl machine destroy e78451d2a02d78 --force
# volume vol_r1jy25z7gl0yxowr now shows ATTACHED VM: (empty)
flyctl deploy

Result: a brand-new machine 9080d396c77268 (release v7) was created, the volume auto-reattached correctly (matched on [[mounts]] source = "shift_app_data"), and /data/dev.db is intact (409600 bytes, confirmed via SSH).

But the exact same failure reproduced within ~2 minutes on this brand-new machine ID:

2026-06-12T10:02:40Z health[9080d396c77268] nrt [error]Health check 'servicecheck-00-http-3000' on port 3000 has failed. Your app is not responding properly. ...
2026-06-12T10:02:42Z app[9080d396c77268] nrt [info]✓ Ready in 313ms

The health check fired and recorded critical / connection refused before the app even logged “Ready” (by ~2 seconds), and then — same as the old machine — it never re-evaluated again:

 NAME                      │ STATUS   │ MACHINE        │ LAST UPDATED │ OUTPUT
 servicecheck-00-http-3000 │ critical │ 9080d396c77268 │ 6m12s ago    │ connect: connection refused

flyctl deploy failed with the identical error, just with the new machine ID:

✖ Failed: timeout reached waiting for health checks to pass for machine 9080d396c77268: failed to get VM 9080d396c77268: Get "https://api.machines.dev/v1/apps/keimimansai-shift/machines/9080d396c77268": net/http: request canceled

Public URL is still 100% unreachable (0 bytes / 15s timeout), while SSH + a local http://127.0.0.1:3000/ request from inside the new machine still returns HTTP 307 as expected.

Given this reproduced on a completely fresh machine ID within ~2 minutes, I don’t think this is a stuck-machine-record issue anymore. It looks like one or both of the following is happening: the health check is only ever evaluated once (at the moment the machine starts, before the app is ready to accept connections) and is never retried for this app; or the authenticated `GET /v1/apps/keimimansai-shift/machines/{id}` endpoint is failing for this app/org specifically (an unauthenticated `GET /v1/apps/keimimansai-shift` from the same network succeeds in <1s).

App: keimimansai-shift, current machine: 9080d396c77268 (nrt). Happy to provide any additional logs/IDs — at this point this feels like it needs a look from the platform side.

Would you add some code/quote formatting to this? You may find you’re more likely to get answers if it is easier to see where your voice ends and AI debug starts. Markdown is the formatting flavour here, and it is pretty easy to use.

This is not the expected response for a health check, though:

> The health check will not automatically follow any HTTP 301 or 302 redirect, so it will fail if it receives anything other than a 200 OK response.

(I’m surprised that situation is getting lumped under “connection refused”, however 🤔. Possibly there are other problems simultaneously.)

This part is weird but normal… The LAST UPDATED column says when the situation last changed, not when the check was last attempted.

(It’s a name that confuses nearly everyone.)

Update: resolved. The cause was the health-check path returning a redirect

Thank you to both of you who replied. The second reply in particular was the decisive hint that led me to the answer.

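Root cause confirmed: the health check in my `fly.toml` was set to `path = "/"`, but `/` in this app is a Next.js page that always issues a **server-side `redirect()` (HTTP 307)** to `/login` or a similar route depending on session state, and never returns 200 OK. Because Fly's health checks do not follow redirects, this check could never pass no matter how many times it was re-evaluated. That is why fly-proxy never routed any public traffic to the machine, even though the app itself was working correctly on `127.0.0.1:3000`.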
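To make the failure mode concrete, here is a minimal sketch of a probe that, like the documented behaviour quoted earlier in the thread, does not follow redirects and treats anything other than 200 OK as a failure. It is not Fly's implementation. `evaluateStatus` and `probe` are hypothetical names, and the `redirect: "manual"` behaviour described in the comments is that of Node.js `fetch` (browsers return an opaque response instead).

```typescript
// Illustrative model of an HTTP check that does not follow redirects.
// NOT Fly's implementation; it only mirrors the documented rule that
// anything other than 200 OK counts as a failure.

type CheckResult = { passing: boolean; reason: string };

function evaluateStatus(status: number, location?: string | null): CheckResult {
  if (status === 200) {
    return { passing: true, reason: "200 OK" };
  }
  if (status >= 300 && status < 400) {
    const target = location ?? "(no Location header)";
    return { passing: false, reason: `redirect ${status} to ${target} not followed` };
  }
  return { passing: false, reason: `unexpected status ${status}` };
}

async function probe(url: string): Promise<CheckResult> {
  // In Node.js, redirect: "manual" hands back the 3xx response itself
  // instead of silently following it to the final page.
  const res = await fetch(url, { redirect: "manual" });
  return evaluateStatus(res.status, res.headers.get("location"));
}

// The situation described above: "/" always answers 307 to /login.
console.log(evaluateStatus(307, "/login"));
// A dedicated health route that answers 200 would pass.
console.log(evaluateStatus(200));
```

Running something like `probe("http://127.0.0.1:3000/")` from inside the machine would be expected to report the redirect rather than a pass, given the HTTP 307 observed over SSH. A plain `curl` without `-v` or `-I` makes this easy to miss, since it prints only the response body and hides the status line.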

Fix: I added a dedicated `/api/health` route, with no auth or database dependency, that simply returns `{"status":"ok"}` with 200 OK, and changed the health check `path` to point to it.
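A route of this kind can be very small. The sketch below assumes the Next.js App Router and a hypothetical file location of `app/api/health/route.ts`; the post does not show the actual handler, so treat this as one possible shape rather than the code that was deployed.

```typescript
// Hypothetical file: app/api/health/route.ts (assumes the Next.js App Router).
// No auth and no database access: the handler always answers 200 with a
// fixed JSON body, so a check that does not follow redirects can pass.

export function GET(): Response {
  return new Response(JSON.stringify({ status: "ok" }), {
    status: 200,
    headers: { "Content-Type": "application/json" },
  });
}
```

With a handler like this in place, the check's `path` in `fly.toml` would point at `/api/health` instead of `/`. Keeping the route free of session and database logic means the check reports only whether the server process can answer requests.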

Result: `flyctl deploy` now succeeds normally (it prints `✔ Machine ... is now in a good state` instead of the earlier health-check timeout with `net/http: request canceled`), `flyctl checks list` shows `servicecheck-00-http-3000 | passing`, and the public URL is reachable again.

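In the end, this was not a bug in the platform's health-check evaluation engine but a simple configuration mistake on the app side: the health-check path pointed at a route that always returns a redirect. I had not connected those two facts until it was pointed out that Fly's health checks do not follow 301/302/307 redirects.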

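The `critical` / `connection refused` display is what sent me in the wrong direction. I had assumed it meant "the proxy cannot reach the machine at all" (a connectivity or platform-side problem), when in fact it meant "a non-200 response (a redirect) was received and reported under a generic failure category." If the platform team could make this error classification a little clearer (for example, by distinguishing "connection refused" from "non-2xx/redirect response"), the next person who hits the same situation might avoid the hours-long detour I took. As for my own case, the problem is solved. Thank you again for taking the time to look at this!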