Deploys and Job Queue Workers

danalloway · August 25, 2022, 12:19pm

We have a Fly App that subscribes to and works a job queue in our platform. It handles a great many jobs and is, for the most part, always working on something.

How do we handle deploying this App? Inevitably the deploy interrupts and shuts down the current machine, and most of the time the machine is in the middle of working a job from the queue, so that job terminates abruptly.

Any thoughts on how to architect this in a better way?
Thanks!

greg · August 25, 2022, 12:59pm

Good question. Off the top of my head, you could explore different deployment strategies so that the new VM is started and running before the old one is shut down. That would keep messages being processed continuously.

You would still be left with the problem of the old VM being shut down, possibly mid-job, resulting in a failed job. I guess you could either listen for some kind of signal that the machine is being shut down and put the message back onto the queue (for the new VM to pick up and retry), or send a message pre-deploy. If you are deploying as part of some kind of CI (or even if not), a prior step could send a message to your queue (or directly to the VM?) telling it not to take on any new jobs (and to abandon any current ones) because it is about to be killed. A heads-up. That would put it into some kind of idle state, so when the deploy happens seconds later, it is not in the middle of a job.
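To make the "put the message back onto the queue" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the RequeueingWorker class is hypothetical, an in-memory queue.Queue stands in for your real queue, and a job is modelled as a few short steps so the worker has somewhere to notice the shutdown flag. A real broker would use its own nack or requeue call instead of jobs.put.

```python
import queue
import signal
import threading


class RequeueingWorker:
    """Hypothetical worker that hands an unfinished job back on shutdown."""

    def __init__(self, jobs):
        self.jobs = jobs
        self.stop = threading.Event()

    def handle_signal(self, signum, frame):
        # Signal handler: only record that a shutdown was requested.
        self.stop.set()

    def run(self, work_step, steps_per_job=3):
        """Work jobs until the queue is empty or shutdown is requested.

        Returns the jobs that completed. A job interrupted by the shutdown
        flag is put back on the queue for another worker to retry.
        """
        completed = []
        while not self.stop.is_set():
            try:
                job = self.jobs.get_nowait()
            except queue.Empty:
                break
            for step in range(steps_per_job):
                if self.stop.is_set():
                    self.jobs.put(job)  # hand the unfinished job back
                    break
                work_step(job, step)
            else:
                # The loop ran every step without being interrupted.
                completed.append(job)
        return completed
```

In a real process you would register the handler once at startup, for example signal.signal(signal.SIGTERM, worker.handle_signal). Note that a requeued job restarts from its first step, so this only works if the steps are safe to repeat.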

ignoramous · August 25, 2022, 6:38pm

Distributed workflow and task management are hard problems.

If I were you, I’d either deploy Apache Airflow (or any other such workflow management software, really) to Fly, or use a managed solution like Temporal, AWS Step Functions, etc.

charsleysa · August 26, 2022, 12:08am

Depends on how long the jobs take.

If the jobs take hours, then you should look at designing resumable jobs, so that if one does get interrupted partway another worker can take over from the last checkpoint.
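As a rough sketch of what a resumable job could look like in Python: the function names and the JSON checkpoint format below are made up for illustration, and a local file stands in for whatever store both workers can reach (for another worker to take over, that would more likely be a database or object storage).

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next item to process (0 if no checkpoint)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Write the checkpoint atomically so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)  # atomic rename over the old checkpoint


def run_job(items, process, checkpoint_path, should_stop=lambda: False):
    """Process items in order, resuming from the last checkpoint.

    Returns True if the job finished, False if it stopped early.
    """
    start = load_checkpoint(checkpoint_path)
    for index in range(start, len(items)):
        if should_stop():
            return False
        process(items[index])
        save_checkpoint(checkpoint_path, index + 1)
    return True
```

If the worker dies between process and save_checkpoint, that one item gets processed again on resume, so each item needs to be safe to repeat.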

If the jobs only take a few minutes, then you can look at using the kill_signal and kill_timeout options in fly.toml. Set kill_signal = "SIGTERM" and the timeout to something like 5 minutes, kill_timeout = 300, then listen for the SIGTERM signal inside your app. This tells fly deploy to wait until your app shuts down gracefully or until the kill_timeout is reached.

When your app receives the SIGTERM signal you know it’s time to start shutting down: stop new jobs from starting, and wait for any running jobs to finish. Once all your jobs are finished you can exit gracefully and the deploy continues. If for whatever reason you don’t finish the jobs within the timeout, the deploy will forcefully kill your instance and continue the deployment.
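For reference, the app side of this might look something like the Python sketch below. It assumes kill_signal = "SIGTERM" is set as described above; the drain function and the in-memory queue are placeholders for your real queue client. To keep the example short the loop also exits when the queue is empty, whereas a real worker would keep polling.

```python
import queue
import signal
import threading
import time

stopping = threading.Event()


def on_sigterm(signum, frame):
    # Keep the handler tiny: just flag that shutdown has begun.
    stopping.set()


def drain(jobs, handle):
    """Run jobs until the queue is empty or SIGTERM arrives.

    A job that has already started always runs to completion; the flag is
    only checked between jobs. Returns the number of jobs handled.
    """
    handled = 0
    while not stopping.is_set():
        try:
            job = jobs.get(timeout=0.1)
        except queue.Empty:
            break
        handle(job)
        handled += 1
    return handled


def main():
    signal.signal(signal.SIGTERM, on_sigterm)
    jobs = queue.Queue()
    for n in range(3):
        jobs.put(n)
    count = drain(jobs, lambda job: time.sleep(0.01))
    print(f"handled {count} job(s); exiting cleanly")


if __name__ == "__main__":
    main()
```

The flag is only checked between jobs, so the longest job has to fit inside kill_timeout or it will still be cut off.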

danalloway · August 27, 2022, 8:02pm

thanks @greg this gives us some things to think through, appreciate the tips!

danalloway · August 27, 2022, 8:03pm

thanks @charsleysa, super fascinating approach, may try this one first and see how it works

N81 · July 11, 2024, 7:55pm

Did you ever arrive at a good solution?
