Autoscaler destroys machines that are still processing work

I have a background worker that reads from a Redis stream and processes messages every 15 seconds.

I exposed a redis_stream_total_work metric for Prometheus.
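The thread doesn't include the worker code, but for context, a minimal stdlib-only sketch of exposing such a gauge in the Prometheus text format could look like this. The `get_pending_count` helper is hypothetical; a real worker would derive the value from the stream (e.g. via `XLEN` or `XPENDING`):

```python
# Hedged sketch: serve a Prometheus-style gauge for pending Redis stream work.
# get_pending_count() is a placeholder; wire it to your Redis client in practice.
from http.server import BaseHTTPRequestHandler, HTTPServer

def get_pending_count():
    # Placeholder value; a real worker would query the stream here.
    return 42

def render_metrics():
    # Prometheus text exposition format for a single gauge.
    return (
        "# HELP redis_stream_total_work Messages waiting in the stream\n"
        "# TYPE redis_stream_total_work gauge\n"
        f"redis_stream_total_work {get_pending_count()}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve: HTTPServer(("0.0.0.0", 9091), MetricsHandler).serve_forever()
```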

Currently it autoscales the background workers properly using the autoscaler's fly.toml configuration.

However, when the total-work metric goes down, machines are scaled down, and machines that are still processing work get destroyed.

How can I control which machines get destroyed or maybe delay the destruction of the machines?

I would use a different architecture. I use an app, which I call a Distributor, which is small and always on (I use a pair of machines here, but you can use one machine for simplicity if you want). This app receives requests from the web, and then creates new machines using the API, which do work and then exit when they finish.
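As a rough sketch of that run-to-completion pattern, assuming the Fly Machines API (`POST /v1/apps/{app}/machines`) and example app/image names, the Distributor could build a one-shot machine config like this; `auto_destroy` asks Fly to delete the machine once its process exits:

```python
import json

# Hedged sketch of a Machines API create-machine payload for a worker that
# does its job and exits. Field names follow the Fly Machines API; the app,
# image, and env values below are illustrative assumptions.
MACHINES_API = "https://api.machines.dev/v1"

def build_create_machine_request(image, env):
    return {
        "config": {
            "image": image,
            "env": env,
            "auto_destroy": True,         # delete the machine when the process exits
            "restart": {"policy": "no"},  # do not restart a finished worker
        }
    }

payload = build_create_machine_request(
    "registry.fly.io/my-worker:latest",  # example image
    {"STREAM_KEY": "jobs"},              # example env var
)
body = json.dumps(payload)
# The Distributor would POST `body` to f"{MACHINES_API}/apps/<app>/machines"
# with an Authorization: Bearer <token> header.
```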

I thought of something similar and would go for it if there is no other option.

Is this the preferred way of doing it on Fly?

I would say that, in my opinion, it is a good general approach to architecture. Whether Fly has special features in its networking/auto-scaling that offer a quicker/better approach, I could not say.

At one stage I think Fly was offering Fly-related architectural advice, but I don’t know if they still do that.

Paid email support can answer questions like this, and we also offer real-time solutions architecture sessions!


Hi @056xyz. What’s happening here is that, when using the auto-scaler in “create new machines when needed, destroy them when not needed” mode (that is, defining FAS_CREATED_MACHINE_COUNT), the destroy machine operation uses the equivalent of --force=true. This means it brutally kills the machine, no questions asked, without honoring kill_timeout (which I notice you’ve set to a large value to allow your workers to complete their tasks).

What you could do is use FAS_STARTED_MACHINE_COUNT instead; manually create the maximum number of machines you think you’ll need, and let the auto-scaler stop and start them instead of creating and destroying them.
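A hedged sketch of what the autoscaler's fly.toml `[env]` section might look like in this mode follows; the metric name comes from the original post, while the app name, Prometheus address, scaling formula, and pool size of 5 are all illustrative assumptions:

```toml
# fly-autoscaler configuration (sketch): scale by starting/stopping a fixed
# pool of pre-created machines instead of creating/destroying them.
[env]
  FAS_APP_NAME = "my-worker-app"    # example target app name
  FAS_STARTED_MACHINE_COUNT = "min(ceil(redis_stream_total_work / 10), 5)"
  FAS_PROMETHEUS_ADDRESS = "https://api.fly.io/prometheus/my-org"
  FAS_PROMETHEUS_METRIC_NAME = "redis_stream_total_work"
  FAS_PROMETHEUS_QUERY = "sum(redis_stream_total_work)"
```

With this shape, the autoscaler never deletes machines; it only stops idle ones from the pre-created pool and starts them again when the metric rises.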

When operating in this mode, the machines are stopped in the usual, graceful shutdown way:

  1. Your defined kill_signal is sent to the machine. This should signal your workers to finish what they’re doing and not pick up any more work.
  2. If your app hasn’t exited after kill_timeout seconds, SIGKILL is sent and the machine is forcibly stopped at this point.

Let me know if this helps!

  • Daniel
