502 on first request awakening a stopped Rails app

I have a simple Rails app with scale-to-zero on a single machine. When the machine is started by a request, that first request always fails with a 502, seemingly because the app doesn’t start fast enough (even though it’s a simple Rails app that doesn’t do anything special on startup). Any subsequent request succeeds, as the machine is already running.

Is there anything I can do to prevent this? For example, would it be possible to increase the maximum number of attempts at load balancing? Any help would be greatly appreciated!

Here are the logs for when this happens:

2025-04-21T07:21:00Z proxy[e825104f015398] mad [info]Starting machine
2025-04-21T07:21:01Z app[e825104f015398] mad [info]2025-04-21T07:21:01.056784460 [01JSBJHRQKNHK885M9XD9NEMCN:main] Running Firecracker v1.7.0
2025-04-21T07:21:01Z health[e825104f015398] mad [warn]Health check on port 3000 is in a 'warning' state. Your app may not be responding properly.
2025-04-21T07:21:01Z app[e825104f015398] mad [info] INFO Starting init (commit: d15e62a13)...
2025-04-21T07:21:01Z app[e825104f015398] mad [info] INFO Checking filesystem on /data
2025-04-21T07:21:01Z app[e825104f015398] mad [info]/dev/vdc: clean, 12/64512 files, 8866/258048 blocks
2025-04-21T07:21:01Z app[e825104f015398] mad [info] INFO Mounting /dev/vdc at /data w/ uid: 1000, gid: 1000 and chmod 0755
2025-04-21T07:21:01Z app[e825104f015398] mad [info] INFO Resized /data to 1056964608 bytes
2025-04-21T07:21:01Z app[e825104f015398] mad [info] INFO starting statics vsock server
2025-04-21T07:21:01Z app[e825104f015398] mad [info] INFO Preparing to run: `/rails/bin/docker-entrypoint ./bin/rails server` as 1000
2025-04-21T07:21:01Z app[e825104f015398] mad [info] INFO [fly api proxy] listening at /.fly/api
2025-04-21T07:21:01Z runner[e825104f015398] mad [info]Machine started in 1.034s
2025-04-21T07:21:01Z proxy[e825104f015398] mad [info]machine started in 1.049074467s
2025-04-21T07:21:01Z proxy[e825104f015398] mad [info]machine became reachable in 7.569973ms
2025-04-21T07:21:02Z proxy[e825104f015398] mad [error][PC01] instance refused connection. is your app listening on 0.0.0.0:3000? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2025-04-21T07:21:02Z app[e825104f015398] mad [info]2025/04/21 07:21:02 INFO SSH listening listen_address=[fdaa:a:54e5:a7b:49:367c:cc78:2]:22
2025-04-21T07:21:02Z health[e825104f015398] mad [error]Health check on port 3000 has failed. Your app is not responding properly.
2025-04-21T07:21:08Z app[e825104f015398] mad [info]=> Booting Puma
2025-04-21T07:21:08Z app[e825104f015398] mad [info]=> Rails 7.2.2.1 application starting in production
2025-04-21T07:21:08Z app[e825104f015398] mad [info]=> Run `bin/rails server --help` for more startup options
2025-04-21T07:21:10Z app[e825104f015398] mad [info]Puma starting in single mode...
2025-04-21T07:21:10Z app[e825104f015398] mad [info]* Puma version: 6.4.3 (ruby 3.3.5-p100) ("The Eagle of Durango")
2025-04-21T07:21:10Z app[e825104f015398] mad [info]*  Min threads: 3
2025-04-21T07:21:10Z app[e825104f015398] mad [info]*  Max threads: 3
2025-04-21T07:21:10Z app[e825104f015398] mad [info]*  Environment: production
2025-04-21T07:21:10Z app[e825104f015398] mad [info]*          PID: 668
2025-04-21T07:21:10Z app[e825104f015398] mad [info]* Listening on http://0.0.0.0:3000
2025-04-21T07:21:10Z app[e825104f015398] mad [info]Use Ctrl-C to stop
2025-04-21T07:21:12Z app[e825104f015398] mad [info]I, [2025-04-21T07:21:12.642042 #668]  INFO -- : [71bf9302-ce04-4972-b86c-f5a66bd39260] Started GET "/up" for 172.19.25.177 at 2025-04-21 07:21:12 +0000
2025-04-21T07:21:12Z app[e825104f015398] mad [info]I, [2025-04-21T07:21:12.644409 #668]  INFO -- : [71bf9302-ce04-4972-b86c-f5a66bd39260] Processing by Rails::HealthController#show as HTML
2025-04-21T07:21:12Z app[e825104f015398] mad [info]I, [2025-04-21T07:21:12.645447 #668]  INFO -- : [71bf9302-ce04-4972-b86c-f5a66bd39260] Completed 200 OK in 1ms (Views: 0.3ms | ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
2025-04-21T07:21:13Z health[e825104f015398] mad [info]Health check on port 3000 is now passing.
2025-04-21T07:21:16Z proxy[e825104f015398] mad [error][PR04] could not find a good candidate within 20 attempts at load balancing
2025-04-21T07:21:19Z app[e825104f015398] mad [info]I, [2025-04-21T07:21:19.037429 #668]  INFO -- : [05d11e95-b94e-4c51-bb69-0bbfda6ab7f5] Started GET "/up" for 172.19.25.177 at 2025-04-21 07:21:19 +0000
2025-04-21T07:21:19Z app[e825104f015398] mad [info]I, [2025-04-21T07:21:19.038500 #668]  INFO -- : [05d11e95-b94e-4c51-bb69-0bbfda6ab7f5] Processing by Rails::HealthController#show as HTML
2025-04-21T07:21:19Z app[e825104f015398] mad [info]I, [2025-04-21T07:21:19.039100 #668]  INFO -- : [05d11e95-b94e-4c51-bb69-0bbfda6ab7f5] Completed 200 OK in 0ms (Views: 0.2ms | ActiveRecord: 0.0ms (0 queries, 0 cached) | GC: 0.0ms)
2025-04-21T07:21:19Z health[e825104f015398] mad [info]Health check on port 3000 is now passing.

(despite the health check finally passing, the client has already received a 502 by then)

Is the health check one you wrote, or is it a default one from Fly? Could you tell us more about it, and show us the code? (Some health checks only report that the container is running and don’t take the listener into account, though I acknowledge that doesn’t seem to be the case here.)

Thank you for your reply. The problem is not the health check (which is the default from Fly), but the app returning a 502 to the request that wakes up the server. I suppose the relevant line is this one:

2025-04-21T07:21:16Z proxy[e825104f015398] mad [error][PR04] could not find a good candidate within 20 attempts at load balancing

I have fixed the issue for this app by setting it to suspend instead of stop. However, I’m now having the same issue with a GPU service that can’t be suspended, only stopped. The service is just a simple FastAPI app with Ollama. Because it takes a long time to start (around 10 seconds), it sometimes fails with the same PR04 error after 20 attempts at load balancing and sends a 502. Once loaded, it doesn’t fail anymore.
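For reference, stop vs. suspend is controlled by `auto_stop_machines` in fly.toml; a minimal sketch, assuming the standard [http_service] section and the port 3000 seen in the logs above (other values are illustrative):

```toml
[http_service]
  internal_port = 3000
  auto_start_machines = true
  # "stop" means a cold boot on the next request;
  # "suspend" resumes from a memory snapshot, so the app is ready much sooner
  auto_stop_machines = "suspend"
  min_machines_running = 0
```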

As you can see here, the code for this endpoint is extremely simple, so I don’t think it’s a problem with the code:

import ollama
from fastapi import FastAPI

app = FastAPI()

@app.post("/awaken")
async def awaken_ollama():
    # MODEL is configured elsewhere in the app
    return ollama.chat(model=MODEL)

(this loads the model if it isn’t already loaded)

This issue is difficult to reproduce, because it only happens intermittently.

Edit: sometimes this GPU app fails after apparently only one attempt at load balancing:

proxy[908064e9a570d8] ord [error][PR04] could not find a good candidate within 1 attempts at load balancing

My first suspicion is that it’s the health check. I would add a custom one (assuming Fly will use a HEALTHCHECK from your Dockerfile instead of the default), and in that check you can do a much deeper assessment of your system’s health, including the HTTP listener and database connectivity.

Of course I cannot promise this will solve your problem, but it is the first thing I would try.
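On Fly, a custom HTTP check can also be declared directly in fly.toml rather than the Dockerfile; a hedged sketch (the path matches Rails’ built-in `/up` endpoint seen in the logs, but the timings are illustrative, not from the original config):

```toml
[[http_service.checks]]
  method = "GET"
  path = "/up"          # Rails health endpoint
  interval = "5s"
  timeout = "4s"
  grace_period = "30s"  # give the app time to boot before failures count
```

A longer `grace_period` in particular can keep the proxy from routing traffic to a machine whose app server hasn’t finished booting.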

You were right. Setting my own health check fixed this issue. Thanks a lot!
