Machines freeze while health checks pass - “Can’t reach app” + webhook failures (AMS/EWR)

Hi Fly team,

I’m running a FastAPI webhook service on port 8080. For the past week I’ve been experiencing major instability:

---

What’s happening:

1. Health checks pass every time

The app is healthy and responds correctly on the health endpoint.

2. But the app “freezes” and stops processing incoming POST webhook requests

During these freezes, no payloads are delivered to the app, even though the machine is alive.

3. Fly sometimes shows this in machine events (every 10 seconds):

Can’t reach your app

Fly proxy waited too much for your machine to become reachable then gave up.

4. After some minutes, it magically becomes reachable again

The issue fixes itself without a deploy or restart.

---

Key points:

App listens on port 8080

Health checks ALWAYS pass

App does NOT crash

Traffic is extremely low

I have 10 machines, same issue on all

Tried multiple regions (AMS, EWR)

Same problem in both regions

No resource exhaustion, no load issues

---

This looks like Fly proxy or internal networking issues.

Can you help investigate why:

Machines are “healthy”

But Fly’s proxy cannot consistently reach the app port

And webhook traffic freezes even while health checks pass?

What diagnostics should I provide?

Thanks,

Ayobami

Does this happen only when all of the machines are stopped at the time of a request? Looking at the logs, your app takes quite a while to boot and start accepting connections, so the proxy gives up waiting for the machine to boot and tries another one.

If the request has a body, we may not be able to retry it on another machine. We do buffer the request body up to a limit so that we can retry/replay it, but only for a certain amount of time.
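To put a number on the boot delay described above, you could time how long the app port takes to start accepting TCP connections after the process launches. The following is only a minimal sketch using the Python standard library. The function name `seconds_until_listening` and its default timeout and interval are illustrative assumptions, not part of any Fly tooling, and port 8080 is taken from the original post.

```python
import socket
import time
from typing import Optional


def seconds_until_listening(host: str, port: int, timeout: float = 120.0,
                            interval: float = 0.5) -> Optional[float]:
    """Poll a TCP port until it accepts a connection.

    Returns the elapsed seconds until the first successful connect,
    or None if the port did not accept a connection within `timeout`.
    (Hypothetical helper for illustration only.)
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            # Connection refused or timed out: not listening yet, try again.
            time.sleep(interval)
    return None


# Example usage (run inside the machine right after starting the app):
#   took = seconds_until_listening("127.0.0.1", 8080)
#   print("never became reachable" if took is None else f"listening after {took:.1f}s")
```

Running this next to the app right after it starts gives a rough time-to-listen figure that can be shared as a diagnostic and compared with how long the proxy waits before giving up.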

Thanks a lot for the heads-up.

  1. Yes, mostly when all machines are stopped. I kept one machine always running, with sensible hard and soft limits for load balancing, and traffic does not even come near the soft limit.
  2. Yes, the request has a body most of the time, but I kept a health check running every 10 seconds. Sometimes, even without any external request, the health check freezes and reports that the app is unreachable.
  3. Yes, my app takes a while to boot. That is why the soft limit is set within reach, to allow booting in time before the hard limit. Maximum boot time is around 30 seconds to 1.5 minutes.

I thought it might be the region or something else, so I scaled to a new region and found the same issues. I have now scaled down to a single region (AMS), and since then there have been no issues; everything is working fine.