Machines are autostopped even though the number of remaining machines is below `min_machines_running`

I have the following config set in my fly.toml:

[http_service]
  # [snip]
  min_machines_running = 2
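For context, here is a minimal sketch of how the settings discussed in this thread usually sit together in one `[http_service]` section. It is not my actual file: only `min_machines_running = 2` and the `soft_limit` of 70 come from this post, and every other value is an assumed placeholder.

```toml
# Illustrative sketch only; values marked "assumed" are not from the post.
[http_service]
  internal_port = 8080           # assumed placeholder port
  auto_stop_machines = "stop"    # assumed; autostop has to be enabled for the proxy to stop machines
  auto_start_machines = true     # assumed
  min_machines_running = 2       # from the post

  [http_service.concurrency]
    type = "connections"         # assumed; the post counts concurrent connections
    soft_limit = 70              # from the post
    hard_limit = 100             # hypothetical value, not stated in the post
```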

However, I still see that Fly’s autoscaler will occasionally bring down one of the machines then immediately bring it back up:

2026-02-05 12:08:00.255 App transit-tracker-api has excess capacity, autostopping machine 1857590b352408. 1 out of 2 machines left running (region=sjc, process group=app)
2026-02-05 12:08:03.237 Starting machine

This is even stranger because there is no excess capacity. The `soft_limit` is 70 connections per machine (140 across the two machines), and there were a total of 163 concurrent connections to the app when the autoscaler decided to scale down. So it was actually over capacity! All of the stopped machine’s connections are then handled by the other running machine, and the load becomes incredibly unbalanced.

My app mostly involves long-running WebSocket connections, so the load remains unbalanced until enough clients reconnect or the over-capacity machine starts CPU-throttling and dropping connections on its own.

Why does this happen? Is this Fly reallocating the machine to a new physical host?

Hm… There’s an ongoing incident mentioning bad “networking configuration”, so perhaps this is another symptom?

https://status.flyio.net/incidents/c5btnv5wkqd8

You wouldn’t see the auto-scaler mentioned in that case, I wouldn’t think. (Also, you can verify by looking in `fly m status` under Events, in the Info column. This doesn’t always show a lot of history, unfortunately, but if it was very recent, then you should see it.)

This has happened intermittently for at least the past month (maybe more), so I doubt it. Usually at least once per day.

Intuitively you’d think so, but log messages have misled me in the past 🙂

I couldn’t think of any other situation where this would happen other than a bug in the autoscaler.

Over the last few days we’ve had this exact issue occur, as well as issues with traffic not being routed to certain machines or machines from certain regions not reporting metrics.

@elliotdickison I’m having issues with metrics as well. I made this thread for it: “CPU and RAM metrics missing from new machines”