My website has been going down somewhat periodically. The logs say:
2025-06-04T16:52:24Z proxy[2865100f961158] sjc [info]App naagm has excess capacity, autostopping machine 2865100f961158. 0 out of 1 machines left running (region=sjc, process group=app)
2025-06-04T16:52:24Z app[2865100f961158] sjc [info] INFO Sending signal SIGTERM to main child process w/ PID 649
2025-06-04T16:52:24Z app[2865100f961158] sjc [info]16:52:24.691 [notice] SIGTERM received - shutting down
2025-06-04T16:52:25Z app[2865100f961158] sjc [info] WARN Reaped child process with pid: 718 and signal: SIGUSR1, core dumped? false
2025-06-04T16:52:26Z app[2865100f961158] sjc [info] INFO Main child exited normally with code: 0
2025-06-04T16:52:26Z app[2865100f961158] sjc [info] INFO Starting clean up.
2025-06-04T16:52:26Z app[2865100f961158] sjc [info] INFO Umounting /dev/vdc from /mnt/db
2025-06-04T16:52:26Z app[2865100f961158] sjc [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2025-06-04T16:52:26Z app[2865100f961158] sjc [info][11060.842739] reboot: Restarting system
2025-06-04T17:21:57Z proxy[2865100f961158] sjc [info]Starting machine
2025-06-04T17:21:57Z proxy[2865100f961158] sjc [error][PM01] machines API returned an error: "could not reserve resource for machine: insufficient memory available to fulfill request"
2025-06-04T17:21:57Z proxy[2865100f961158] sjc [info]Starting machine
When I look in the dashboard it says my machine is ‘suspended’.
I don’t really know how to proceed. I can fix it by redeploying the app, but then it happens again later. The app is deployed in SJC.
The app doesn’t get much traffic or do anything resource-intensive. I shouldn’t need to scale it.
Ok, got it. I get that it auto stops when it’s idle to save money. But why can’t it start back up when it gets traffic? auto_start_machines is set to true in my toml.
It’s trying to, but you can see the problem near the end of the log snippet: "could not reserve resource for machine: insufficient memory available to fulfill request".
This is one of the reasons why it’s unwise to run just a single Machine on the Fly.io platform.
(Another is the high risk of permanent data loss on the volume.)
The Machine is pinned to a single underlying physical host, and although Fly.io does migrate Machines sometimes these days, you can’t rely on that happening on the timescale of auto-start. If that host is short on capacity when the proxy tries to start your Machine, your site is down for a while.
I’d suggest rethinking your architecture to match the platform’s strengths and limitations. E.g., a managed Postgres database + 2 Elixir app Machines, instead.
(Possibly a different hosting service entirely, if you really do just want 1 of everything…)
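For the two-Machine suggestion, here is a rough sketch of what the relevant part of fly.toml might look like. It assumes the app is served through an [http_service] section and listens on port 4000; both are guesses, since the original config wasn’t posted. The settings shown (auto_stop_machines, auto_start_machines, min_machines_running) are standard fly.toml options, but the values are only one reasonable choice.

```toml
# fly.toml (fragment): a sketch, not the poster's actual config.
# Assumption: traffic goes through [http_service] and the app listens on
# port 4000 (hypothetical; use whatever port the app really binds to).

[http_service]
  internal_port = 4000
  auto_stop_machines = "stop"   # or "suspend"; older configs use true/false
  auto_start_machines = true
  min_machines_running = 1      # keep one Machine running in the primary region
  processes = ["app"]

# A second Machine can be added with `fly scale count 2`. Each Machine needs
# its own volume, so this fits best once the data has moved off /mnt/db
# (e.g. to managed Postgres, as suggested above).
```

With min_machines_running = 1, one Machine is never auto-stopped, so it doesn’t depend on its host having free memory at the moment traffic arrives. The second Machine gives the proxy something else to start if the first can’t be started, though where it gets placed is up to the platform.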