I’ve recently encountered a challenge while deploying a FastAPI backend service on Fly.io. My goals are:
Machines should automatically start up when requests come in.
Machines should automatically stop or suspend when idle to save costs (and support auto-scaling nicely!).
Long-running jobs triggered by users should run reliably until completion.
Here’s the catch
Fly.io determines whether a machine is “idle” based on active HTTP connections. This works perfectly for short requests, but it becomes problematic when dealing with long-running background tasks (triggered by an HTTP request).
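A minimal sketch of the setup in question (simplified; the endpoint shape and names are illustrative):

```python
import asyncio
from fastapi import FastAPI

app = FastAPI()

# Keep references so the tasks aren't garbage-collected mid-run.
running_tasks: set[asyncio.Task] = set()

async def my_long_running_task(job_id: str) -> None:
    """Placeholder for a job that runs for many minutes."""
    for _ in range(100):
        await asyncio.sleep(60)  # one unit of work
        # ... write intermediate results to the database here ...

@app.post("/jobs/{job_id}")
async def start_job(job_id: str):
    task = asyncio.create_task(my_long_running_task(job_id))
    running_tasks.add(task)
    task.add_done_callback(running_tasks.discard)
    # The response returns immediately; the task keeps running.
    return {"status": "started", "job_id": job_id}
```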
where my_long_running_task() is iteratively writing its results to a database.
If the HTTP request returns immediately (e.g., after spawning an asyncio background task), Fly.io sees no active connections and stops the machine after a few minutes, interrupting ongoing tasks. On the other hand, since the frontend app may start several such runs at the same time, I would rather not keep each request open until its job completes.
My question for the community
How do you handle long-running tasks in FastAPI (or similar frameworks) on Fly.io, while still leveraging auto-scaling and auto-shutdown capabilities?
Are there elegant solutions or patterns that you’ve successfully used to balance cost efficiency with reliability?
I’m looking forward to your insights, workarounds, and recommendations!
You could have a machine start up on incoming requests, and then send itself periodic HTTP requests while its internal job is still running. Once the machine has finished its task, it could either stop itself via the API, or cease the periodic requests and let the autoscaler stop it in the normal fashion.
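A rough sketch of that self-ping loop, under a few assumptions: the ping has to travel through fly-proxy to count as activity, so it targets the app’s public hostname rather than localhost; FLY_APP_NAME and FLY_MACHINE_ID are set by the Fly runtime; and the /keep_alive endpoint plus its token header are inventions of this example (an endpoint sketch follows in a later reply):

```python
import asyncio
import os

import httpx  # any async HTTP client works here

APP_URL = f"https://{os.environ['FLY_APP_NAME']}.fly.dev"
PING_INTERVAL = 60  # seconds, comfortably under the proxy's idle window

async def keep_alive_until(done: asyncio.Event) -> None:
    headers = {
        "X-Keep-Alive-Token": os.environ["KEEP_ALIVE_TOKEN"],
        # Pin routing back to this very machine instead of a sibling.
        "fly-force-instance-id": os.environ["FLY_MACHINE_ID"],
    }
    async with httpx.AsyncClient(base_url=APP_URL, timeout=10) as client:
        while not done.is_set():
            try:
                await client.get("/keep_alive", headers=headers)
            except httpx.HTTPError:
                pass  # best effort; try again on the next tick
            await asyncio.sleep(PING_INTERVAL)
```

The job would set `done` when it finishes, after which fly-proxy sees the machine go idle and stops it in the normal fashion.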
Thank you so much for the very fast reply. This is indeed a workaround I also had in mind; it just feels a bit too “hacky”, because we either introduce a new FastAPI endpoint (e.g. /keep_alive) that needs an authentication dependency, or we call a non-existent endpoint and throw an error.
As an alternative, I also thought of having two machines (a FastAPI machine and a worker machine), where the first machine opens a connection to the second in the background and holds it until the job is done. But I am not sure whether this would actually keep both machines alive until the job completes, and it doesn’t feel like the “best practice” way of doing it either!
I wouldn’t worry about it being hacky in the first cut; just get it working, and improve it later if the solution bothers you. The new endpoint doesn’t have to be authenticated as such; just use a hardwired string that the machine itself knows about. Keeping a machine alive is a low-security issue.
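A minimal sketch of such an endpoint, assuming the shared string lives in an environment variable (KEEP_ALIVE_TOKEN is a name made up for this example):

```python
import os

from fastapi import FastAPI, Header, HTTPException

app = FastAPI()

# A string only the machine itself knows; deliberately lightweight,
# since keeping a machine alive is a low-security concern.
KEEP_ALIVE_TOKEN = os.environ.get("KEEP_ALIVE_TOKEN", "change-me")

@app.get("/keep_alive")
async def keep_alive(x_keep_alive_token: str = Header(default="")) -> dict:
    if x_keep_alive_token != KEEP_ALIVE_TOKEN:
        # 404 rather than 401/403, so the endpoint stays invisible.
        raise HTTPException(status_code=404)
    return {"alive": True}
```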
A machine to monitor other machines is fine, and I have that arrangement myself. But in your case I’d say it was overkill.
Another idea is to use Fly’s auto-wake-up system, but disable the automatic spin-down. Does Fly support that configuration? If so, a machine stopping or deleting itself would be very simple, and you’d not need a keep-alive device.
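For reference, fly.toml does expose these as independent knobs, so something along these lines should work (a sketch; check the current docs for the exact accepted values):

```toml
[http_service]
  internal_port = 8080
  auto_start_machines = true   # proxy wakes a stopped machine on request
  auto_stop_machines = "off"   # never stop automatically; the app exits itself
  min_machines_running = 0
```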
Ok, thank you very much, I’ll go with the keep-myself-alive option!
If I may ask just out of personal interest, for what kind of use case do you use a machine that monitors other machines? A very brief answer is fully sufficient!
@Berthold Apologies if I missed a requirement of your architecture, but I’d say this is the way to go. When your app’s main process halts, the Machine it’s running on shuts down. If your app can decide for itself when it’s done and shut itself off, then you can dispense with the fly-proxy concurrency-based autostop.
As the application could be called multiple times in the async setup, this means we would count the number of running jobs in a global variable and call process.exit(0) once there is no active job anymore?!
Yes. The only thing I would do differently is that when the count reaches zero, I would call setTimeout to schedule a shutdown of the process, and call clearTimeout if a new job comes in before the shutdown actually occurs.
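In the asyncio world of the original question, a rough Python equivalent might look like this (names are illustrative; `loop.call_later` plays the role of setTimeout and `.cancel()` that of clearTimeout):

```python
import asyncio
import os
import signal

SHUTDOWN_GRACE = 120  # seconds at zero jobs before the process exits

active_jobs = 0
_shutdown_handle: asyncio.TimerHandle | None = None

def job_started() -> None:
    global active_jobs, _shutdown_handle
    active_jobs += 1
    if _shutdown_handle is not None:  # a job arrived in time:
        _shutdown_handle.cancel()     # the clearTimeout equivalent
        _shutdown_handle = None

def job_finished() -> None:
    global active_jobs, _shutdown_handle
    active_jobs -= 1
    if active_jobs == 0:
        # Schedule the shutdown instead of exiting immediately
        # (the setTimeout equivalent).
        loop = asyncio.get_running_loop()
        _shutdown_handle = loop.call_later(SHUTDOWN_GRACE, _exit)

def _exit() -> None:
    # SIGTERM lets the server shut down gracefully; once the main
    # process exits, the Machine stops.
    os.kill(os.getpid(), signal.SIGTERM)
```

Each background task would call job_started() when it begins and job_finished() in a finally block, so a crash in one job still lets the machine wind down.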