Health checks on launch fail quicker than the app can launch

Hey all,

I’m running into an issue where the health checks on a newly booted instance fail faster than my app can actually boot. I have several boxes running, most of them offline. So when the health check fails, it spins up another instance and I run into the same issue, which comes down to being stuck in a loop until a timeout is hit. I have looked through the documentation, but either I can’t find the right option or something has changed on Fly.io’s end. We have been using this system for the last 1.5 years without issues, but as of about 24 hours ago this problem started popping up.

Example logs from one of the instances:

2025-07-19T03:34:53.784 app[0801565b570398] ams [info] INFO Starting init (commit: 6c3309ba)...
2025-07-19T03:34:53.897 app[0801565b570398] ams [info] INFO Preparing to run: `docker-entrypoint.sh bash ./entrypoint.sh` as root
2025-07-19T03:34:53.900 app[0801565b570398] ams [info] INFO [fly api proxy] listening at /.fly/api
2025-07-19T03:34:53.926 runner[0801565b570398] ams [info] Machine started in 1.16s
2025-07-19T03:34:53.928 proxy[0801565b570398] ams [info] machine started in 1.169904221s
2025-07-19T03:34:54.131 app[0801565b570398] ams [info] 2025/07/19 03:34:54 INFO SSH listening listen_address=[xyz]:22
2025-07-19T03:34:59.171 proxy[0801565b570398] ams [info] waiting for machine to be reachable on 0.0.0.0:3000 (waited 5.242988688s so far)
2025-07-19T03:35:02.179 proxy[0801565b570398] ams [error] [PM05] failed to connect to machine: gave up after 15 attempts (in 8.251361929s)
2025-07-19T03:35:02.730 app[0801565b570398] ams [info] Listening: http://0.0.0.0:3000

As you can see, it failed to connect, and within a second the server was online and available. Whether the server is online in time is hit or miss, and sadly it’s much more often a miss than a hit, consistency-wise.

The TOML config of the app:

app = 'my-obfuscated-app'
primary_region = 'ams'

[build]

[http_service]
  internal_port = 3000
  force_https = true
  auto_stop_machines = 'off'
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']
  grace_period = "15s"

[checks]
  [checks.api_v1_health]
    port = 3000
    type = 'http'
    interval = '20s'
    timeout = '10s'
    grace_period = '10s'
    method = 'get'
    path = '/api/v1'

[[vm]]
  size = 'performance-2x'

Any clues or ideas would be more than welcome; I’m quite lost on this one right now.

Kind regards,
Pollie

In the short term, can you just remove the health check? At least that would get you operational again, even if it’s not ideal.

Not really, unfortunately; we’re using the servers in a special way. You could see it as jobs: if a server is processing something, we mark the server as “busy” (via health checks), whereby the load balancer will forward requests to other boxes. So the health checks are quite crucial in our case.
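For illustration, a rough sketch of what such a busy-aware check endpoint can look like in Express. The busy flag and job handler below are placeholders rather than our actual code, and whether a failing check affects routing depends on where the check is defined, as pointed out further down:

// Sketch only: report "busy" through the health check path from the fly.toml
// above, so a failing check signals that this box is occupied with a job.
import express from "express";

const app = express();
let busy = false; // flipped by whatever marks a job as running

app.get("/api/v1", (_req, res) => {
  // 200 = idle/healthy, 503 = busy: the check fails while a job is in progress.
  res.status(busy ? 503 : 200).send(busy ? "busy" : "ok");
});

app.post("/api/v1/jobs", async (_req, res) => {
  busy = true;
  try {
    await doWork(); // placeholder for the actual processing
    res.send("done");
  } finally {
    busy = false;
  }
});

async function doWork(): Promise<void> {
  /* the actual job */
}

app.listen(3000, "0.0.0.0");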

Ah, interesting. My only worry about that arrangement is that you’re marking an instance as unhealthy, not busy, and I wonder whether, in devops terms, that would generally result in a reboot (I don’t know the Fly-specific behaviour here). What happened previously when an instance went busy? Would it still carry on running?

Broadly related to @halfer’s take, the alternative/niche mechanism that you’re using is explicitly intended for situations where the failures don’t affect routing and load balancing:

If your app doesn’t have public-facing services, or you want independent health checks that don’t affect request routing, use this top-level checks section instead of [[services.checks]].

Perhaps Fly-Replay: elsewhere=true would be more in line with what you want, conceptually. :thought_balloon:
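Roughly, a hedged sketch (the busy flag and paths are illustrative, not necessarily how your app is structured):

// Sketch: while this instance is busy, hand the request back to the Fly proxy
// so it gets replayed on another machine instead of being handled here.
import express from "express";

const app = express();
let busy = false; // illustrative; set while a job is running

app.use((req, res, next) => {
  if (busy && req.path !== "/api/v1") { // let the health check itself through
    res.set("Fly-Replay", "elsewhere=true"); // the proxy acts on this header
    return res.status(204).end();
  }
  next();
});

app.listen(3000, "0.0.0.0");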

The logs provided aren’t the health check failing; starting a machine with an HTTP request doesn’t wait for the check to pass before sending the request to the machine. I’m not sure if we want to (or can) fix this.

Yes, and if Fly advanced networking doesn’t cover the requirement, a Fly machine running Traefik, or some other programmable proxy, would probably do it. Instances that are busy can send a message to remove themselves from the pool, and then send another one to re-add themselves once the work is done.

Hey, yeah it would have. The load balancer would just not have routed new requests to it, but the instance itself was left as is. It is a pain during deployments, though, as those instances would get cycled.

We do this as well. From what I’ve experienced, the load balancer routes requests to “busy” instances until the health check has been performed. For that reason we also have Fly-Replay: elsewhere=true in place to prevent a busy instance from receiving a new job.

However, we can have up to hundreds of instances at the same time, and Fly.io only redirects a maximum of 15 times. So purely relying on these redirects wouldn’t work.

Correct me if I’m wrong, but I don’t think this is exactly the cause; that functionality has been working flawlessly. My issue at the moment is that when an instance starts, it starts an Express API, but for some reason Fly marks the instance as unreachable before the Express server has even started. Or I’m missing something.

But the load balancer waits until the instance is online, correct? It tries to reach the server around 15 times within 8 seconds, but the server only comes online after ~9 seconds. Can I extend that to 20 attempts, or can I put a delay on when it should start checking?

From what I can see, the cause isn’t exactly the niche setup we use. It’s purely an API request coming in to an Express server while the load balancer says the server is unreachable because it’s still booting.

So the “busy” mechanism has nothing to do with the issue specifically; that part is still working as expected. It’s just a booting issue, which could happen to a general REST API instance as well, I presume?

Hope this clarifies a little what’s blocking me. I’ll gladly provide more info if required.

It waits until the machine is online; it doesn’t know anything about your app. I don’t believe it’s possible to customize that behaviour, unfortunately.

Yes, unfortunately.

But am I missing something then? Sorry for my lack of experience with the platform. My “app” is just a Node.js Express instance, in that sense a pretty basic setup with only about 3 routes. Not many services start up besides Express, so this could almost be any REST API setup. Is this not supported on Fly.io then?

How are instances on Fly.io supposed to work then? Not as API endpoints but just as runnable node scripts?

I’m not sure I understand this question, could you clarify?

The vast majority of apps deployed on Fly are HTTP servers, whether used for serving HTML or JSON or something else (we don’t keep track, that I know of), and many of them use Node.js, so it is definitely possible to run API endpoints in a Fly app.

All I’m trying to say is that there is a limitation with autostart where it ignores checks. You can work around this either by making your server start up quicker, or by disabling autostart and keeping your machines always online (and/or manually starting new instances when necessary).
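For the last option, a rough sketch of starting a stopped Machine over the Machines REST API (the app name, machine ID, and token below are placeholders; a token could come from something like fly tokens create deploy, and global fetch needs Node 18+):

// Hedged sketch: start a specific stopped Machine via the Machines API
// instead of relying on proxy autostart.
const FLY_API = "https://api.machines.dev/v1";
const APP_NAME = "my-obfuscated-app";          // app name from the fly.toml above
const TOKEN = process.env.FLY_API_TOKEN ?? ""; // a Fly API token with access to the app

async function startMachine(machineId: string): Promise<void> {
  const res = await fetch(`${FLY_API}/apps/${APP_NAME}/machines/${machineId}/start`, {
    method: "POST",
    headers: { Authorization: `Bearer ${TOKEN}` },
  });
  if (!res.ok) {
    throw new Error(`failed to start machine ${machineId}: HTTP ${res.status}`);
  }
}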

Entirely guessing here, but maybe you are thinking of a lambda-style API endpoint where a new instance starts up per request? It seems unusual to me that a lambda would take 15 seconds to start up, but it’s plausible… :thinking:
We don’t support lambdas natively, though; you’d need to build it out yourself. It’s not too complicated, and there are a few different ways one could do that.

If that’s not it either, let me know; I’m happy to help you work through whatever you’re trying to do (or let you know if it’s not a good fit for Fly).

It’ll be down to slow server startup, from what I understand. I’m not sure why it’d be slow, as it’s a simple setup on our side, but I’ll investigate and see what I can find.
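The first thing I’ll try is binding the port before any slow initialisation, so the proxy can reach the machine within its window. A rough sketch, assuming the delay is in our own startup code:

// Sketch: bind port 3000 immediately so the Fly proxy can connect, and report
// "starting" from the check until the slow initialisation has finished.
import express from "express";

const app = express();
let ready = false;

app.get("/api/v1", (_req, res) => {
  res.status(ready ? 200 : 503).send(ready ? "ok" : "starting");
});

app.listen(3000, "0.0.0.0", () => console.log("Listening: http://0.0.0.0:3000"));

// Finish the slow parts (DB connections, warm-up, etc.) after the socket is open.
void (async () => {
  await slowInit(); // placeholder for whatever currently runs before listen()
  ready = true;
})();

async function slowInit(): Promise<void> {
  /* e.g. open database connections, load config */
}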

Appreciate your time and information.

I suggest looking at the “Fly Instance” dashboard at https://fly-metrics.net. A common reason why apps take a long time to start is CPU throttling. Good luck!
