Fly Proxy not connected when gunicorn ready?

jasonhd · March 28, 2026, 1:34pm

I’ve been having a lot of trouble troubleshooting this. For some reason gunicorn reports that it’s ready to accept requests, but the Fly Proxy fails to connect to the machine. This causes a resource cascade where more machines start booting.

13:31:43 [2026-03-28 09:31:43 -0400] [645] [DEBUG] Arbiter booted
13:31:43 [2026-03-28 09:31:43 -0400] [645] [INFO] Listening at: http://0.0.0.0:8080 (645)
13:31:43 [2026-03-28 09:31:43 -0400] [645] [INFO] Using worker: sync
13:31:43 [2026-03-28 09:31:43 -0400] [645] [INFO] Gunicorn ready to serve requests at 1774704703.436851
13:31:43 [2026-03-28 09:31:43 -0400] [656] [INFO] Booting worker with pid: 656
13:31:43 [2026-03-28 09:31:43 -0400] [657] [INFO] Booting worker with pid: 657
13:31:43 [2026-03-28 09:31:43 -0400] [656] [DEBUG] GET /health/
13:31:43 [2026-03-28 09:31:43 -0400] [645] [DEBUG] 2 workers
13:31:45 [2026-03-28 09:31:45 -0400] [657] [DEBUG] GET /health/
13:31:46 waiting for machine to be reachable on 0.0.0.0:8080 (waited 5.475607079s so far)
13:31:49 [PM05] failed to connect to machine: gave up after 15 attempts (in 8.479125072s)

This is my current fly.toml:

app = "pb-app-web"
primary_region = "iad"
swap_size_mb = 1024

[deploy]
strategy = "rolling"
release_command = "python manage.py migrate" # Run migrations once per deploy, not per container
release_command_timeout = "30m"
wait_timeout = "20m"
max_unavailable = 0

[build]
dockerfile = "Dockerfile.web"

[env]
DJANGO_SETTINGS_MODULE = "core.settings.web"

# Keep your explicit process group so we can bind the HTTP service to it

[processes]
web = "/start-web.sh"

[http_service]
internal_port = 8080
processes = ["web"] # bind service to the web process group
force_https = true # 80 → 443 redirect at the proxy

# autosuspend / autostart (Machines v2 way)

auto_stop_machines = "suspend" # "off" | "stop" | "suspend"
auto_start_machines = true
min_machines_running = 0 # set to 0 if you want full auto-off

[http_service.concurrency]
type = "requests" # better for HTTP than "connections"
soft_limit = 2 # Reduced to match worker count
hard_limit = 4

[[http_service.checks]]
interval = "10s"
timeout = "2s"
grace_period = "60s"
method = "GET"
path = "/health/"
[http_service.checks.headers]
Host = "######.fly.dev"
X-Forwarded-Proto = "https"

[[vm]]
cpu_kind = "shared"
cpus = 1
memory = "1gb"

lubien · March 28, 2026, 1:42pm

Howdy! Your internal port and logs indicate that it’s not a port issue, so my next guess is that your health check endpoint is not returning 200.

Can you comment out or remove the health check to debug this, and see if the deploy is successful? If it is, that indicates the issue is either in the health check configuration or in the health check action itself (where I would suggest adding more logging).
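
For the extra logging, here is a minimal sketch, assuming the app is served as a standard WSGI callable (which is what gunicorn runs). It records the status and duration of each request to the health path. `log_health_requests` and the `healthcheck` logger name are hypothetical, not part of gunicorn or Django:

```python
import logging
import time

# Hypothetical logger name; route it wherever your other app logs go.
logger = logging.getLogger("healthcheck")


def log_health_requests(app, path="/health/"):
    """Wrap a WSGI app and log status and timing for requests to `path`."""

    def wrapper(environ, start_response):
        if environ.get("PATH_INFO") != path:
            return app(environ, start_response)

        started = time.perf_counter()
        seen = {}

        def recording_start_response(status, headers, exc_info=None):
            # Remember the status line so it can be logged afterwards.
            seen["status"] = status
            return start_response(status, headers, exc_info)

        try:
            return app(environ, recording_start_response)
        finally:
            # Measures the time until the app returns its response iterable.
            elapsed_ms = (time.perf_counter() - started) * 1000
            logger.info(
                "health check: status=%s elapsed_ms=%.1f host=%s",
                seen.get("status"),
                elapsed_ms,
                environ.get("HTTP_HOST"),
            )

    return wrapper
```

If the logged duration gets close to the check timeout, or the status is a redirect or an error rather than 200, that narrows down where to look.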

jasonhd · March 28, 2026, 2:12pm

Thank you for this! Removing the health check does appear to have resolved the issue. Interestingly, before I removed it, the logs showed the health check returning a 200 a few seconds after the proxy had already given up.

13:56:13 [2026-03-28 09:56:13 -0400] [642] [DEBUG] 2 workers
13:56:14 [2026-03-28 09:56:14 -0400] [658] [DEBUG] GET /health/
13:56:15 waiting for machine to be reachable on 0.0.0.0:8080 (waited 5.53290463s so far)
13:56:18 [PM05] failed to connect to machine: gave up after 15 attempts (in 8.536627937s)
13:56:24 172.19.8.121 - - [28/Mar/2026:09:56:24 -0400] "GET /health/ HTTP/1.1" 200 16 "-" "Consul Health Check"

lubien · March 28, 2026, 2:22pm

I wonder if the health check is taking too long to return a 200.

Can you try simplifying your health action so that it just sends a 200 with no real checks, and see if the deploy goes through quickly? If it does, something your health check does is slow (fix it or tweak your timeouts). If it doesn’t, that could indicate the web server itself is at fault.
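
As a rough sketch of what "just a 200" could look like, assuming a WSGI app behind gunicorn: the wrapper below answers the health path itself and never calls into the framework, so any slow work in the real view is taken out of the picture. `always_healthy` is a hypothetical name, and the `core/wsgi.py` location in the comment is only a guess based on your settings module:

```python
def always_healthy(app, path="/health/"):
    """Wrap a WSGI app so `path` returns 200 without running the app."""

    def wrapper(environ, start_response):
        if environ.get("PATH_INFO") == path:
            body = b"ok"
            start_response(
                "200 OK",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Everything else goes to the real application untouched.
        return app(environ, start_response)

    return wrapper


# Assumed usage in a Django project (for example core/wsgi.py):
#
#   from django.core.wsgi import get_wsgi_application
#   application = always_healthy(get_wsgi_application())
```

Because this runs before the framework, it also skips any host or HTTPS handling the framework would apply, which helps separate "the check is slow" from "the server is unreachable".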

halfer · March 28, 2026, 3:30pm

A good tip for healthcheck issues is to spin up your service locally in Docker. You can then make the healthcheck request yourself and see how long it takes to reply. Of course, if you don’t already have a working local environment (including a local database and other downstream dependencies), then I’d say getting one operational is a priority anyway.
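
To time the request, something like the following would do. It is a minimal sketch using only the Python standard library, assuming the container's port 8080 is published on localhost. `time_health_check` is a hypothetical helper, and the 2-second default mirrors the `timeout = "2s"` in the posted check config:

```python
import http.client
import time


def time_health_check(host, port, path="/health/", headers=None, timeout=2.0):
    """Send one GET request and return (status_code, elapsed_seconds).

    Redirects are not followed, so a 301 or 302 is reported as-is.
    A reply slower than `timeout` raises a timeout error instead.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    started = time.perf_counter()
    try:
        conn.request("GET", path, headers=headers or {})
        response = conn.getresponse()
        response.read()  # include the time to receive the body
        return response.status, time.perf_counter() - started
    finally:
        conn.close()
```

For example, `time_health_check("127.0.0.1", 8080, headers={"Host": "your-app.fly.dev", "X-Forwarded-Proto": "https"})` sends the same headers the Fly check is configured with (the hostname here is a placeholder). Running it a few times right after the container starts shows whether the first replies are slower than later ones.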