Worker machine (no HTTP service) silently stops processing after ~12-18 hours

Hello!

I’ve been hitting a recurring issue with a background worker machine that silently stops working after roughly 12-18 hours. I’ve already filed a support ticket, but wanted to post here as well in case others have encountered this or have ideas.

Setup:

  • Two process groups: app (HTTP via uvicorn) and worker (Python background job processor)
  • The worker has no [http_service] — it just polls a database and processes jobs
  • Region: fra, shared-cpu-1x, 2GB RAM
  • Restart policy: always with 5 retries

What happens:

  • After a deploy, the worker starts fine and processes jobs normally
  • ~12-18 hours later, it silently stops doing anything
  • No crash logs, no error output, no OOM — the last log entry is a normal periodic heartbeat, then silence
  • fly machine status shows the machine as started
  • SSHing in confirms the process (PID) is still alive

Diagnostics from inside the machine while it was stuck:

$ cat /proc/<PID>/syscall
7 0x… 0x1 0xffffffffffffffff …    # syscall 7 = poll(), timeout=-1 (infinite)

$ cat /proc/<PID>/stack
[<0>] do_poll.constprop.0+0x231/0x360
[<0>] do_sys_poll+0x164/0x240
[<0>] __x64_sys_poll+0x41/0x140

$ grep -E "State|Threads|VmRSS" /proc/<PID>/status
State:    S (sleeping)
VmRSS:     198060 kB       # ~193MB of 2048MB
Threads:    1

The process is stuck in poll() with an infinite timeout. It hasn't crashed and it wasn't OOM-killed — it's simply frozen. Meanwhile, a fresh Python process started via SSH on the same machine can query the database without issue. Restarting the machine fixes it until the hang recurs, again within 12-18 hours.
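For the next occurrence, I'm planning to add an in-process watchdog so the hang at least leaves a stack trace behind. A minimal sketch (names are illustrative, not the actual worker code):

```python
import faulthandler
import sys
import threading
import time

class Watchdog:
    """Dump all thread stacks if the main loop stops beating.

    Illustrative sketch: the worker loop calls beat() once per
    iteration; a daemon thread dumps tracebacks to stderr when no
    beat has been seen for `timeout` seconds.
    """

    def __init__(self, timeout=120.0, check_every=5.0):
        self.timeout = timeout
        self.check_every = check_every
        self.last_beat = time.monotonic()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def beat(self):
        self.last_beat = time.monotonic()

    def _run(self):
        while True:
            time.sleep(self.check_every)
            if time.monotonic() - self.last_beat > self.timeout:
                # Leaves evidence of where the process was blocked.
                faulthandler.dump_traceback(file=sys.stderr)
                self.last_beat = time.monotonic()  # don't spam stderr
```

The idea is that next time the process goes quiet, the logs will at least contain a Python-level traceback showing which frame is sitting inside the blocked poll().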

What I’ve ruled out:

  • Application crash — the entire poll loop is wrapped in try/except that logs errors with full stack traces. No errors were logged.
  • OOM — 193MB used out of 2048MB
  • Stale DB connections — using pool_pre_ping=True and pool_recycle=300. Fresh process on same machine connects fine.
  • Auto-stop — auto_stop_machines = 'off' on the app process; the worker has no service definition, so Fly Proxy shouldn't be involved at all (per the docs)
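One thing I haven't fully ruled out is a half-dead TCP connection: pool_pre_ping only checks a connection before handing it out, so it can't help once a read is already blocked on a socket whose peer silently vanished. I'm considering enabling TCP keepalives on the DB socket so the kernel eventually errors out the blocked read instead of poll()-ing forever. A sketch using only stdlib sockets (the socket options are real Linux ones; how you pass them through to your DB driver depends on the driver):

```python
import socket

def enable_keepalive(sock, idle=30, interval=10, count=3):
    """Make the kernel probe an idle connection and reset it if the
    peer is gone, instead of letting a recv() block forever.

    idle:     seconds of inactivity before the first probe
    interval: seconds between probes
    count:    failed probes before the connection is reset
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```

If you're on Postgres via libpq/psycopg2, the equivalent connection parameters are keepalives, keepalives_idle, keepalives_interval, and keepalives_count, which can be passed via connect_args on the SQLAlchemy engine.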

What I’m wondering:

  1. Could something at the Firecracker/VM level be suspending or freezing the process?
  2. Is there a known issue with long-running processes on shared-cpu machines going into this state?
  3. Is there anything specific about machines without a [service] definition that could cause this?
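Independent of the root cause, I may also wrap each poll iteration in a hard timeout so the worker raises (and the restart policy kicks in) instead of blocking forever. A minimal sketch using SIGALRM — main thread only, whole-second granularity, and the helper name is mine:

```python
import signal

def with_timeout(seconds, fn, *args, **kwargs):
    """Run fn(*args, **kwargs), raising TimeoutError if it blocks for
    longer than `seconds`. SIGALRM-based: main thread only, and the
    timeout is rounded to whole seconds."""
    def _handler(signum, frame):
        raise TimeoutError(f"call exceeded {seconds}s")
    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        return fn(*args, **kwargs)
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

In the loop this would look like with_timeout(60, poll_for_jobs): if a DB read wedges, the process dies noisily and the [[restart]] policy brings it back, rather than sitting silently in poll().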

Relevant fly.toml config:

[processes]
  app = "uvicorn main:app --host 0.0.0.0 --port 8000"
  worker = "python -m worker"

[http_service]
  auto_stop_machines = 'off'
  auto_start_machines = true
  min_machines_running = 1
  processes = ['app']

[[restart]]
  policy = "always"
  retries = 5
  processes = ["worker"]

[[vm]]
  memory = '2gb'
  cpu_kind = 'shared'
  cpus = 1
  processes = ['worker']

Any pointers appreciated. Happy to provide more diagnostics if needed.

Have a look at the Grafana logs for the affected time period; maybe there is something in there. This will nearly always be an application-level crash or an OOM — I appreciate you've ruled those out, but there's no harm in also checking your memory graphs in Grafana.