Machine in “stopped” state instead of “suspended” after deploy - auto-start not working

0xCAFE · April 1, 2026, 11:36pm

Hi Fly.io support,

We’re experiencing an issue where one of our machines consistently gets stuck in “stopped” state instead of “suspended” after deployments, making it unavailable for auto_start scaling.

Region: nrt (Tokyo)

Configuration:

[http_service]
  auto_stop_machines = 'suspend'
  auto_start_machines = true
  min_machines_running = 3

[http_service.concurrency]
  type = 'requests'
  soft_limit = 25

Problem:

We have 4 machines. After every deployment (fly deploy), this specific machine ends up in “stopped” state instead of “suspended”.
Since auto_start_machines only works with “suspended” machines (not “stopped”), this machine is excluded from the autoscaling pool.
Even when we manually start it (fly machine start), it gets stopped again within 5 minutes by Fly Proxy, instead of being suspended.
This causes 502 errors during traffic spikes because only 3 machines handle load instead of 4.

Machine event log:

STATE    EVENT   SOURCE  TIMESTAMP
stopped  update  flyd    2026-04-02T00:03:32
created  launch  user    2026-04-02T00:03:29
pending  launch  flyd    2026-04-02T00:03:29

Note: No “started” event after creation — the machine was created and immediately stopped.

Workaround attempted:
We added a post-deploy CI step that starts all stopped machines:

- name: Start stopped machines
  run: |
    for id in $(flyctl machine list -a hypots-server --json | jq -r '.[] | select(.state == "stopped") | .id'); do
      flyctl machine start "$id" -a hypots-server || true
    done

But the machine gets stopped again within minutes instead of being suspended.
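
To check which state each machine actually landed in after a deploy, the JSON from `flyctl machine list --json` can be grouped by state. This is a minimal sketch, not an official tool: it assumes only the `id` and `state` fields that the jq filter above already relies on, and the helper name `machines_by_state` and the sample payload are made up for illustration.

```python
import json
from collections import defaultdict


def machines_by_state(list_json: str) -> dict[str, list[str]]:
    """Group machine IDs by state from `flyctl machine list --json` output.

    Assumes the output is a JSON array of objects with `id` and `state`
    keys, as the jq filter in the CI step does.
    """
    groups = defaultdict(list)
    for machine in json.loads(list_json):
        groups[machine["state"]].append(machine["id"])
    return dict(groups)


# Hypothetical sample payload; real output carries many more fields.
sample = (
    '[{"id": "a1", "state": "started"},'
    ' {"id": "b2", "state": "stopped"},'
    ' {"id": "c3", "state": "suspended"}]'
)
print(machines_by_state(sample))
```

Running this right after `fly deploy` and again a few minutes later would show whether a machine moved from started to stopped or to suspended.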

Expected behavior:

After deploy, excess machines should be in “suspended” state (not “stopped”)
auto_stop_machines = 'suspend' should suspend machines, not stop them
Manually started machines should be suspended (not stopped) when excess capacity is detected

Impact:

502 errors during peak traffic (08:00-09:00 KST, ~300K requests/30min)
Autoscaling pool effectively limited to 3 machines instead of 4

We found a related community discussion: Suspended machines are stopped on new deploy

Could you help us understand why this machine is being stopped instead of suspended, and if there’s a fix or workaround?

Thank you.

FlorianRegaz · April 6, 2026, 5:49pm

I’m having a similar issue, not the same, but it might be of help.

I spin up machines and then want to suspend them so I can use them later. However, when I suspend them, they often reach “stopped” instead of “suspended”.

Here is a small snippet of my logs:


2026-04-06T17:30:58Z app[781e453c9739d8] fra [info]2026-04-06 17:30:58.375 | INFO     | apps.services.worker_manager:create_worker:199 - voice_event=worker_create_requested region=fra version=deployment-01KNHTEBJWCAQS98D8HB843BRX
2026-04-06T17:30:58Z app[781e453c9739d8] fra [info]INFO:     172.16.4.186:42684 - "POST /connect HTTP/1.1" 200 OK
2026-04-06T17:30:58Z app[781e453c9739d8] fra [info]INFO:     172.16.4.186:42696 - "POST /offer HTTP/1.1" 200 OK
2026-04-06T17:30:59Z app[781e453c9739d8] fra [info]2026-04-06 17:30:59.050 | INFO     | apps.control_plane_app:mark_session_connected:240 - voice_event=session_connected session_id=14ac42e4-e89b-4f6f-9142-d499ef5024f2 worker_machine_id=287e64df36e208 allocation_to_connected_ms=1000
2026-04-06T17:30:59Z app[781e453c9739d8] fra [info]INFO:     172.16.4.186:42700 - "POST /internal/session-events/connected HTTP/1.1" 200 OK
2026-04-06T17:30:59Z app[781e453c9739d8] fra [info]INFO:     172.16.4.186:42710 - "PATCH /offer HTTP/1.1" 200 OK
2026-04-06T17:31:00Z app[781e453c9739d8] fra [info]2026-04-06 17:31:00.211 | INFO     | apps.services.worker_manager:create_worker:224 - voice_event=worker_created machine_id=2872619f027628 region=fra version=deployment-01KNHTEBJWCAQS98D8HB843BRX fly_state=created
2026-04-06T17:31:08Z app[781e453c9739d8] fra [info]INFO:     172.19.4.185:52246 - "GET /metrics HTTP/1.1" 200 OK
2026-04-06T17:31:15Z app[781e453c9739d8] fra [info]2026-04-06 17:31:15.214 | WARNING  | apps.services.worker_manager:_with_retries:876 - Worker operation wait_for_machine_started.wait_for_machine_state failed on attempt 1/3: Fly Machines API request failed: . Retrying in 0.50s.
2026-04-06T17:31:23Z app[781e453c9739d8] fra [info]INFO:     172.19.4.185:53022 - "GET /metrics HTTP/1.1" 200 OK
2026-04-06T17:31:38Z app[781e453c9739d8] fra [info]INFO:     172.19.4.185:49924 - "GET /metrics HTTP/1.1" 200 OK
2026-04-06T17:31:53Z app[781e453c9739d8] fra [info]INFO:     172.19.4.185:35408 - "GET /metrics HTTP/1.1" 200 OK
2026-04-06T17:31:56Z app[781e453c9739d8] fra [info]2026-04-06 17:31:56.540 | INFO     | apps.services.worker_manager:suspend_worker:403 - voice_event=worker_suspend_requested machine_id=2872619f027628 previous_fly_state=started
2026-04-06T17:31:59Z app[781e453c9739d8] fra [info]2026-04-06 17:31:59.501 | WARNING  | apps.services.worker_manager:_with_retries:876 - Worker operation suspend_worker.wait_for_machine_state failed on attempt 1/3: Fly Machines API request failed with status 409: {"error":"aborted: machine reached stopped state instead of suspended state"} [status=409] [body={"error":"aborted: machine reached stopped state instead of suspended state"}]. Retrying in 0.50s.
2026-04-06T17:32:00Z app[781e453c9739d8] fra [info]2026-04-06 17:32:00.019 | WARNING  | apps.services.worker_manager:_with_retries:876 - Worker operation suspend_worker.wait_for_machine_state failed on attempt 2/3: Fly Machines API request failed with status 409: {"error":"aborted: machine reached stopped state instead of suspended state"} [status=409] [body={"error":"aborted: machine reached stopped state instead of suspended state"}]. Retrying in 1.00s.

mayailurus · April 6, 2026, 7:49pm

Interesting, I hadn’t seen that particular log pattern before…

As far as I can tell, the Fly.io platform is allowed at any time to leave your Machines in the stopped state instead of suspended, so your code should be ready for both outcomes:

https://fly.io/docs/reference/suspend-resume/#snapshot-behavior-with-suspend

https://fly.io/docs/reference/suspend-resume/#limitations-and-considerations

(I read that as implied by the above, anyway.)
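
As a rough illustration of "ready for both outcomes", here is a minimal sketch that treats the 409 seen in the logs above as a valid terminal result rather than something to retry. The function name `classify_suspend_wait` is hypothetical, and treating any 2xx response as "suspended" is an assumption; the only thing taken from this thread is the 409 status and its error text.

```python
import json

# Error text copied from the 409 response body in the logs above.
STOPPED_INSTEAD = "machine reached stopped state instead of suspended state"


def classify_suspend_wait(status: int, body: str) -> str:
    """Map a wait-for-suspended response to 'suspended', 'stopped', or 'error'.

    Assumption: a 2xx status means the machine reached the suspended state.
    A 409 carrying the message above means it ended up stopped, which the
    caller can accept instead of retrying.
    """
    if 200 <= status < 300:
        return "suspended"
    if status == 409:
        try:
            message = json.loads(body).get("error", "")
        except (ValueError, AttributeError):
            # Body was not JSON, or not a JSON object.
            message = ""
        if STOPPED_INSTEAD in message:
            return "stopped"
    return "error"
```

A caller could then record the worker as either suspended or stopped and retry only on "error", since a stopped machine can still be started later (a cold start rather than a resume).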

Many people are particularly surprised by the way in which this behavior interacts with deploys. (The above doc notes that deploys result in a “cold start”, but a large fraction of users seem to interpret that as meaning that every Machine will be restarted immediately, as part of the deploy.) This comes up in the forum every few months.

https://community.fly.io/t/machines-still-stopping-instead-of-suspending/27115/7

It doesn’t look like yours was associated with a deploy, but perhaps it fell under the “space reclamation” clause mentioned in the doc…

(Some of the underlying physical host machines have been really short on capacity lately.)

PeterCxy · April 6, 2026, 7:59pm

No, auto_start_machines works with both suspended and stopped machines. Suspension was only added later; the option should work just fine with stopped machines.

Do you mind sharing the name of your app? It really doesn’t sound like this was an issue related to stopping vs suspending, but probably something else.