Hello!
I’ve been hitting a recurring issue with a background worker machine that silently stops working after roughly 12-18 hours. I’ve already filed a support ticket, but wanted to post here as well in case others have encountered this or have ideas.
Setup:
- Two process groups: app (HTTP via uvicorn) and worker (Python background job processor)
- The worker has no [http_service] — it just polls a database and processes jobs
- Region: fra, shared-cpu-1x, 2GB RAM
- Restart policy: always with 5 retries
What happens:
- After a deploy, the worker starts fine and processes jobs normally
- ~12-18 hours later, it silently stops doing anything
- No crash logs, no error output, no OOM — the last log entry is a normal periodic heartbeat, then silence
- fly machine status shows the machine as started
- SSHing in confirms the process (PID) is still alive
Diagnostics from inside the machine while it was stuck:
$ cat /proc/<PID>/syscall
7 0x… 0x1 0xffffffffffffffff … # syscall 7 = poll(), timeout=-1 (infinite)
$ cat /proc/<PID>/stack
[<0>] do_poll.constprop.0+0x231/0x360
[<0>] do_sys_poll+0x164/0x240
[<0>] __x64_sys_poll+0x41/0x140
$ grep -E 'State|Threads|VmRSS' /proc/<PID>/status
State: S (sleeping)
VmRSS: 198060 kB # ~193MB of 2048MB
Threads: 1
The process is stuck in poll() with an infinite timeout. It’s not crashed, not OOM’d, just frozen. Meanwhile, a fresh Python process started via SSH on the same machine can query the
database without issues. Restarting the machine fixes it until it happens again within the next 12-18 hours.
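In case anyone wants to capture the same snapshot in one go, here's the small script I run over SSH (the helper name `proc_snapshot` is just mine; the files and fields are the standard ones from proc(5)):

```python
#!/usr/bin/env python3
"""Snapshot /proc diagnostics for a (possibly stuck) process."""
import json
import os
import sys


def proc_snapshot(pid: int) -> dict:
    """Read syscall, kernel stack, and selected status fields for `pid`."""
    snap = {}
    for name in ("syscall", "stack"):
        # /proc/<pid>/stack may need root; record the error instead of dying
        try:
            with open(f"/proc/{pid}/{name}") as f:
                snap[name] = f.read().strip()
        except OSError as e:
            snap[name] = f"<unreadable: {e}>"
    wanted = ("State", "VmRSS", "Threads")
    snap["status"] = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key in wanted:
                snap["status"][key] = value.strip()
    return snap


if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    print(json.dumps(proc_snapshot(pid), indent=2))
```

Run it with the worker's PID as the argument; with no argument it inspects itself, which is a quick sanity check that the script works on the machine.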
What I’ve ruled out:
- Application crash — try/except around the entire poll loop, logs errors with full stack trace. No errors logged.
- OOM — 193MB used out of 2048MB
- Stale DB connections — using pool_pre_ping=True and pool_recycle=300. A fresh process on the same machine connects fine.
- Auto-stop — auto_stop_machines = 'off' on the app process; the worker has no service definition, so Fly Proxy shouldn't be involved at all (per the docs)
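As a stopgap I'm considering an in-process watchdog so the restart policy can actually do its job: if the poll loop stops updating a heartbeat, a side thread exits the process non-zero and Fly restarts the machine instead of leaving a frozen PID. A minimal sketch (all names and thresholds here are mine, not from my worker):

```python
"""Watchdog sketch: self-terminate if the main poll loop stops heartbeating."""
import os
import threading
import time

HEARTBEAT_INTERVAL = 30   # seconds between watchdog checks (arbitrary)
STALL_THRESHOLD = 180     # consider the loop dead after this long (arbitrary)

_last_beat = time.monotonic()
_lock = threading.Lock()


def beat() -> None:
    """Call this from the poll loop on every iteration."""
    global _last_beat
    with _lock:
        _last_beat = time.monotonic()


def is_stalled(now=None) -> bool:
    """True if the loop has not beaten within STALL_THRESHOLD seconds."""
    if now is None:
        now = time.monotonic()
    with _lock:
        return (now - _last_beat) > STALL_THRESHOLD


def watchdog() -> None:
    while True:
        time.sleep(HEARTBEAT_INTERVAL)
        if is_stalled():
            # os._exit skips cleanup on purpose: the main thread may be
            # frozen inside poll() and would never run atexit handlers.
            os._exit(1)


if __name__ == "__main__":
    # In the real worker, start this before entering the poll loop.
    # daemon=True so the thread dies with the process on a normal exit.
    threading.Thread(target=watchdog, daemon=True, name="watchdog").start()
```

This doesn't explain the freeze, of course, but combined with policy = "always" it would at least turn an 18-hour silent stall into a self-healing restart.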
What I’m wondering:
- Could something at the Firecracker/VM level be suspending or freezing the process?
- Is there a known issue with long-running processes on shared-cpu machines going into this state?
- Is there anything specific about machines without a [service] definition that could cause this?
Relevant fly.toml config:
[processes]
app = "uvicorn main:app --host 0.0.0.0 --port 8000"
worker = "python -m worker"
[http_service]
auto_stop_machines = 'off'
auto_start_machines = true
min_machines_running = 1
processes = ['app']
[[restart]]
policy = "always"
retries = 5
processes = ["worker"]
[[vm]]
memory = '2gb'
cpu_kind = 'shared'
cpus = 1
processes = ['worker']
Any pointers appreciated. Happy to provide more diagnostics if needed.