Worker machine (no HTTP service) silently stops processing after ~12-18 hours

Hello!

I’ve been hitting a recurring issue with a background worker machine that silently stops working after roughly 12-18 hours. I’ve already filed a support ticket, but wanted to post here as well in case others have encountered this or have ideas.

Setup:

  • Two process groups: app (HTTP via uvicorn) and worker (Python background job processor)
  • The worker has no [http_service] — it just polls a database and processes jobs
  • Region: fra, shared-cpu-1x, 2GB RAM
  • Restart policy: always with 5 retries

What happens:

  • After a deploy, the worker starts fine and processes jobs normally
  • ~12-18 hours later, it silently stops doing anything
  • No crash logs, no error output, no OOM — the last log entry is a normal periodic heartbeat, then silence
  • fly machine status shows the machine as started
  • SSHing in confirms the process (PID) is still alive

Diagnostics from inside the machine while it was stuck:

$ cat /proc/<PID>/syscall
7 0x… 0x1 0xffffffffffffffff …    # syscall 7 = poll(), timeout=-1 (infinite)

$ cat /proc/<PID>/stack
[<0>] do_poll.constprop.0+0x231/0x360
[<0>] do_sys_poll+0x164/0x240
[<0>] __x64_sys_poll+0x41/0x140

$ grep -E "State|Threads|VmRSS" /proc/<PID>/status
State:    S (sleeping)
VmRSS:     198060 kB       # ~193MB of 2048MB
Threads:    1

The process is stuck in poll() with an infinite timeout. It hasn't crashed and it wasn't OOM-killed — it's simply frozen. Meanwhile, a fresh Python process started via SSH on the same machine can query the database without issue. Restarting the machine fixes it until the hang recurs, again within 12-18 hours.
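For the next occurrence, I'm planning to add an in-process watchdog so the hang at least leaves a stack trace behind. A minimal sketch (names are illustrative, not the actual worker code):

```python
import faulthandler
import sys
import threading
import time

class Watchdog:
    """Dump all thread stacks if the main loop stops beating.

    Illustrative sketch: the worker loop calls beat() once per
    iteration; a daemon thread dumps tracebacks to stderr when no
    beat has been seen for `timeout` seconds.
    """

    def __init__(self, timeout=120.0, check_every=5.0):
        self.timeout = timeout
        self.check_every = check_every
        self.last_beat = time.monotonic()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def beat(self):
        self.last_beat = time.monotonic()

    def _run(self):
        while True:
            time.sleep(self.check_every)
            if time.monotonic() - self.last_beat > self.timeout:
                # Leaves evidence of where the process was blocked.
                faulthandler.dump_traceback(file=sys.stderr)
                self.last_beat = time.monotonic()  # don't spam stderr
```

The idea is that next time the process goes quiet, the logs will at least contain a Python-level traceback showing which frame is sitting inside the blocked poll().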

What I’ve ruled out:

  • Application crash — the entire poll loop is wrapped in try/except that logs errors with full stack traces. No errors were logged.
  • OOM — 193MB used out of 2048MB
  • Stale DB connections — using pool_pre_ping=True and pool_recycle=300. Fresh process on same machine connects fine.
  • Auto-stop — auto_stop_machines = 'off' on the app process; the worker has no service definition, so Fly Proxy shouldn't be involved at all (per the docs)
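One thing I haven't fully ruled out is a half-dead TCP connection: pool_pre_ping only checks a connection before handing it out, so it can't help once a read is already blocked on a socket whose peer silently vanished. I'm considering enabling TCP keepalives on the DB socket so the kernel eventually errors out the blocked read instead of poll()-ing forever. A sketch using only stdlib sockets (the socket options are real Linux ones; how you pass them through to your DB driver depends on the driver):

```python
import socket

def enable_keepalive(sock, idle=30, interval=10, count=3):
    """Make the kernel probe an idle connection and reset it if the
    peer is gone, instead of letting a recv() block forever.

    idle:     seconds of inactivity before the first probe
    interval: seconds between probes
    count:    failed probes before the connection is reset
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```

If you're on Postgres via libpq/psycopg2, the equivalent connection parameters are keepalives, keepalives_idle, keepalives_interval, and keepalives_count, which can be passed via connect_args on the SQLAlchemy engine.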

What I’m wondering:

  1. Could something at the Firecracker/VM level be suspending or freezing the process?
  2. Is there a known issue with long-running processes on shared-cpu machines going into this state?
  3. Is there anything specific about machines without a [service] definition that could cause this?
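Independent of the root cause, I may also wrap each poll iteration in a hard timeout so the worker raises (and the restart policy kicks in) instead of blocking forever. A minimal sketch using SIGALRM — main thread only, whole-second granularity, and the helper name is mine:

```python
import signal

def with_timeout(seconds, fn, *args, **kwargs):
    """Run fn(*args, **kwargs), raising TimeoutError if it blocks for
    longer than `seconds`. SIGALRM-based: main thread only, and the
    timeout is rounded to whole seconds."""
    def _handler(signum, frame):
        raise TimeoutError(f"call exceeded {seconds}s")
    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        return fn(*args, **kwargs)
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

In the loop this would look like with_timeout(60, poll_for_jobs): if a DB read wedges, the process dies noisily and the [[restart]] policy brings it back, rather than sitting silently in poll().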

Relevant fly.toml config:

[processes]
  app = "uvicorn main:app --host 0.0.0.0 --port 8000"
  worker = "python -m worker"

[http_service]
  auto_stop_machines = 'off'
  auto_start_machines = true
  min_machines_running = 1
  processes = ['app']

[[restart]]
  policy = "always"
  retries = 5
  processes = ["worker"]

[[vm]]
  memory = '2gb'
  cpu_kind = 'shared'
  cpus = 1
  processes = ['worker']

Any pointers appreciated. Happy to provide more diagnostics if needed.

Have a look at the Grafana logs for the affected time period; maybe there is something in there. This will nearly always be an application-level crash or an OOM — I appreciate you've ruled those out, but there's no harm in also checking your memory graphs in Grafana.