Autoscaler destroys machines that are still processing work

I have a background worker that reads from a Redis stream and processes messages every 15 seconds.

I exposed a redis_stream_total_work metric for Prometheus.
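The thread doesn't include the worker code, but for context, a minimal stdlib-only sketch of exposing such a gauge in the Prometheus text format could look like this. The `get_pending_count` helper is hypothetical; a real worker would derive the value from the stream (e.g. via `XLEN` or `XPENDING`):

```python
# Hedged sketch: serve a Prometheus-style gauge for pending Redis stream work.
# get_pending_count() is a placeholder; wire it to your Redis client in practice.
from http.server import BaseHTTPRequestHandler, HTTPServer

def get_pending_count():
    # Placeholder value; a real worker would query the stream here.
    return 42

def render_metrics():
    # Prometheus text exposition format for a single gauge.
    return (
        "# HELP redis_stream_total_work Messages waiting in the stream\n"
        "# TYPE redis_stream_total_work gauge\n"
        f"redis_stream_total_work {get_pending_count()}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve: HTTPServer(("0.0.0.0", 9091), MetricsHandler).serve_forever()
```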

Currently it autoscales the background workers properly using the autoscaler's fly.toml configuration.

However, when the total-work metric goes down, machines are scaled down, and machines that are still processing work get destroyed.

How can I control which machines get destroyed or maybe delay the destruction of the machines?

I would use a different architecture. I use an app, which I call a Distributor, which is small and always on (I use a pair of machines here, but you can use one machine for simplicity if you want). This app receives requests from the web, and then creates new machines using the API, which do work and then exit when they finish.
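As a rough sketch of that run-to-completion pattern, assuming the Fly Machines API (`POST /v1/apps/{app}/machines`) and example app/image names, the Distributor could build a one-shot machine config like this; `auto_destroy` asks Fly to delete the machine once its process exits:

```python
import json

# Hedged sketch of a Machines API create-machine payload for a worker that
# does its job and exits. Field names follow the Fly Machines API; the app,
# image, and env values below are illustrative assumptions.
MACHINES_API = "https://api.machines.dev/v1"

def build_create_machine_request(image, env):
    return {
        "config": {
            "image": image,
            "env": env,
            "auto_destroy": True,         # delete the machine when the process exits
            "restart": {"policy": "no"},  # do not restart a finished worker
        }
    }

payload = build_create_machine_request(
    "registry.fly.io/my-worker:latest",  # example image
    {"STREAM_KEY": "jobs"},              # example env var
)
body = json.dumps(payload)
# The Distributor would POST `body` to f"{MACHINES_API}/apps/<app>/machines"
# with an Authorization: Bearer <token> header.
```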

I thought of something similar and would go for it if there is no other option.

Is this the preferred way of doing it on Fly?

I would say that, in my opinion, it is a good general approach to architecture. Whether Fly has special features in its networking/auto-scaling that offer a quicker/better approach, I could not say.

At one stage I think Fly was offering Fly-related architectural advice, but I don’t know if they still do that.

Paid email support can answer questions like this, and we also offer real-time solutions architecture sessions!


Hi @056xyz. What’s happening here is that, when using the auto-scaler in “create new machines when needed, destroy them when not needed” mode (that is, defining FAS_CREATED_MACHINE_COUNT), the destroy machine operation uses the equivalent of --force=true. This means it brutally kills the machine, no questions asked, without honoring kill_timeout (which I notice you’ve set to a large value to allow your workers to complete their tasks).

What you could do is use FAS_STARTED_MACHINE_COUNT instead; manually create the maximum number of machines you think you’ll need, and let the auto-scaler stop and start them instead of creating and destroying them.
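A hedged sketch of what the autoscaler's fly.toml `[env]` section might look like in this mode follows; the metric name comes from the original post, while the app name, Prometheus address, scaling formula, and pool size of 5 are all illustrative assumptions:

```toml
# fly-autoscaler configuration (sketch): scale by starting/stopping a fixed
# pool of pre-created machines instead of creating/destroying them.
[env]
  FAS_APP_NAME = "my-worker-app"    # example target app name
  FAS_STARTED_MACHINE_COUNT = "min(ceil(redis_stream_total_work / 10), 5)"
  FAS_PROMETHEUS_ADDRESS = "https://api.fly.io/prometheus/my-org"
  FAS_PROMETHEUS_METRIC_NAME = "redis_stream_total_work"
  FAS_PROMETHEUS_QUERY = "sum(redis_stream_total_work)"
```

With this shape, the autoscaler never deletes machines; it only stops idle ones from the pre-created pool and starts them again when the metric rises.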

When operating in this mode, the machines are stopped in the usual, graceful shutdown way:

  1. Your defined kill_signal is sent to the machine. This should signal your workers to finish what they’re doing and not pick up any more work.
  2. If your app hasn’t exited after kill_timeout seconds, SIGKILL is sent and the machine is forcibly stopped at this point.

Let me know if this helps!

  • Daniel
