Cold start and health check failures on deployed APIs [repost]

Hi Fly.io Support Team,

My deployed applications have an issue: the APIs show significant cold start delays. Specifically:

  1. After a deployment, the first request to the API takes a long time to respond and occasionally fails.

  2. The health check for the app also experiences delays and sometimes fails on the first run.

  3. Subsequent requests (2nd, 3rd, etc.) work fine.

  4. If the API is idle for a few minutes (around 5 minutes), the first request again experiences a delay or fails, showing a repeated cold start behavior.

This behavior is impacting the reliability of our application, especially for endpoints that need to be responsive immediately after deployment or during idle periods.

Could you please advise:

  • Whether this is expected behavior on Fly.io?

  • Any recommended configuration changes to reduce cold start times?

  • Ways to ensure health checks pass consistently even after deployment?

App Details:

  • Application Name: mediamagic-api

  • Region: Ashburn, Virginia (US)

  • Deployment Method: GitHub

  • Observed Health Check Path: /_health

Hi there!

Everything you’re seeing is normal.

> After a deployment, the first request to the API takes a long time to respond and occasionally fails.

This is likely either because the machine takes a bit to spin up and be ready for requests (solution: tune the grace period in your health check) or because on first deploy the machine is stopped; in this case the image is updated but stopped machines aren’t started (let sleeping machines lie).
Also, things definitely work better if you define an explicit HTTP health check as described in that link rather than relying on the implicit health check. The implicit check is a trivial TCP "is the process up?" test that doesn't take app startup time into account, so requests are subject to some exponential back-off from the Fly proxy, which results in the slow first request you see.

Also, your app is extremely slow to reach the point where it starts serving requests, likely because you're starting a boatload of services on app startup (I spotted Redis, NATS, and LaunchDarkly at a quick glance). I would recommend splitting these off into separate Fly apps and running only your main web process on this Fly app, for faster boot times.
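If it helps with picking a grace period: one rough way to see how long the app takes before it answers its health check is to poll the endpoint right after a deploy (or after a wake-up) and time it. Here is a minimal Python sketch of that idea. The function names `http_probe` and `seconds_until_healthy` are made up for illustration, and nothing here is Fly-specific; it only measures what a client sees from the outside, so the number includes proxy and network time.

```python
import http.client
import time
import urllib.request
from typing import Callable, Optional


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and 4xx/5xx responses all count
        # as "not healthy yet" (urllib's URLError/HTTPError are OSErrors).
        return False


def seconds_until_healthy(
    probe: Callable[[], bool],
    max_wait: float = 120.0,
    poll_interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `probe` until it succeeds; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= max_wait:
            return None
        sleep(poll_interval)
```

For example, `seconds_until_healthy(lambda: http_probe("https://mediamagic-api.fly.dev/_health"))` would time the health path mentioned above, assuming the app is reachable at the default `<app>.fly.dev` hostname (not confirmed in this thread). A grace period somewhat above the slowest time you observe is a reasonable starting point.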

> If the API is idle for a few minutes (around 5 minutes), the first request again experiences a delay or fails, showing a repeated cold start behavior.

This is because your machines are configured to auto-stop after a few minutes when they are not serving any requests. This saves money because machines aren’t running when nobody is using them, with the downside that indeed, when a request comes in, the machine has to wake up (cold-boot) before it can serve the request.

Properly tuning the grace period helps here, because a well-tuned grace period shortens the wait until the request is answered. Also, auto-stop/auto-start can be turned off if you absolutely need fast first responses all the time. More information here.

Hope this helps!

  • Daniel

Related (duplicate post): https://community.fly.io/t/cold-start-and-health-check-failures-on-deployed-apis/25963

Just my opinion: you probably don’t need AI text generation tools to compose a post here. Use your own voice if you can :relieved_face:

What’s the spec of your machine? I wonder if your lengthy boot-up process is overwhelming the CPU, and thus you’re experiencing throttling, which is slowing down your app’s readiness.

Also, what stack are you using? We have seen some feedback here that Java APIs go through some kind of bytecode caching on first boot, which pushes the CPU hard on start-up and makes the throttling problem worse.
