We have a Fly App that subscribes to and works a job queue in our platform. It handles a great many jobs and is, for the most part, always working on something.
How should we handle deploying this App? A deploy inevitably interrupts and shuts down the current machine, and most of the time that machine is in the middle of working a job from the queue, so the job terminates abruptly.
Any thoughts on how to architect this in a better way?
Thanks!
Good question. Off the top of my head, you could explore the different deployment strategies to at least be sure the new VM is started and running before the old one is shut down. That would keep messages being processed continuously.
You would still be left with the problem of the old VM being shut down, possibly mid-job, resulting in a failed job. I guess you could either listen for some kind of signal that the machine is being shut down and put the message back onto the queue (for the new VM to pick up and retry), or send a message pre-deploy. If you are deploying as part of some kind of CI (or even if not), you could have a prior step that sends a message to your queue (or directly to the VM?) telling it not to take on any new jobs, and to abandon any current ones, because it is about to be killed. A heads-up, basically. That would put it into some kind of idle state, so when the deploy does happen seconds later, it is not in the middle of a job.
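To make the "put it back on the queue" idea a bit more concrete, here is a minimal sketch. It assumes an in-memory queue.Queue standing in for your real queue, and the names stopping, work_one and run_step are made up for illustration. It also assumes your jobs are safe to retry from the start, since a requeued job will be run again by the new VM.

```python
import queue
import threading

# Set by a shutdown signal handler or by a pre-deploy "drain" message.
# (Hypothetical name; wire it to whatever heads-up mechanism you use.)
stopping = threading.Event()

def work_one(jobs, run_step, steps_per_job=3):
    """Take one job off the queue and run it step by step.

    Returns "done", "requeued", or "idle".
    """
    if stopping.is_set():
        return "idle"  # draining: do not pick up new work
    try:
        job = jobs.get_nowait()
    except queue.Empty:
        return "idle"
    for step in range(steps_per_job):
        if stopping.is_set():
            # Abandon the job and hand it back for another worker to retry.
            jobs.put(job)
            return "requeued"
        run_step(job, step)
    return "done"
```

With a real broker you would usually get the same effect by not acknowledging the message (or explicitly rejecting it) rather than putting it back yourself.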
Distributed workflow and task management are hard problems.
If I were you, I’d either deploy Apache Airflow (or any other such workflow management software, really) to Fly, or use a managed solution like Temporal, AWS Step Functions, etc.
If the jobs take hours, then you should look at designing resumable jobs, so that if one does get interrupted partway through, another worker can take over from the last checkpoint.
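A rough sketch of what a resumable job could look like, assuming the job is a list of items that can be processed one at a time. The function names and the JSON checkpoint format are just for illustration. The checkpoint is written to a local file here to keep the example self-contained; for another worker to pick it up after a deploy, you would keep it somewhere shared instead (a database row, object storage, and so on).

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    # Write to a temp file, then atomically replace the old checkpoint,
    # so being killed mid-write never leaves a half-written file behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)

def load_checkpoint(path):
    # No checkpoint yet means the job starts from the first item.
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_index": 0}

def run_job(items, handle_item, checkpoint_path, should_stop=lambda: False):
    """Process items from the last checkpoint; return True if the job finished."""
    state = load_checkpoint(checkpoint_path)
    for i in range(state["next_index"], len(items)):
        if should_stop():
            return False  # interrupted; progress so far is already saved
        handle_item(items[i])
        save_checkpoint(checkpoint_path, {"next_index": i + 1})
    return True
```

If the worker dies between handle_item and save_checkpoint, that one item gets processed again on resume, so handle_item should be safe to repeat.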
If the jobs only take a few minutes, then you can look at using Fly's kill_signal and kill_timeout options in fly.toml. You can set kill_signal = "SIGTERM" and the timeout to something like 5 minutes with kill_timeout = 300, then listen for the SIGTERM signal inside your app. This tells fly deploy to wait until your app shuts down gracefully or until the kill_timeout is reached.
When your app receives the SIGTERM signal, you know it’s time to start shutting down: stop new jobs from starting and wait for any running jobs to finish. Once all your jobs are finished, you can exit gracefully and the deploy continues. If for whatever reason you don’t finish the jobs within the timeout, the deploy will forcefully kill your instance and continue the deployment.
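On the app side, that could look something like the sketch below. It assumes a single-threaded worker pulling from an in-memory queue.Queue (a stand-in for your real queue client), and run_worker, handle_sigterm and main are names I made up. The key point is that the signal handler only sets a flag, and the loop checks that flag between jobs, so the job in progress always runs to completion.

```python
import queue
import signal
import threading

# Set once SIGTERM arrives; the worker loop checks it between jobs.
shutdown_requested = threading.Event()

def handle_sigterm(signum, frame):
    # Do as little as possible in the handler: just record the request.
    shutdown_requested.set()

def run_worker(jobs, process, poll_seconds=1.0):
    """Process jobs until shutdown is requested; return how many completed."""
    completed = 0
    while not shutdown_requested.is_set():
        try:
            job = jobs.get(timeout=poll_seconds)
        except queue.Empty:
            continue  # nothing to do; loop around and re-check the flag
        process(job)  # the in-flight job is never interrupted by the flag
        completed += 1
    return completed

def main(jobs, process):
    # Register the handler, then work until Fly sends SIGTERM.
    signal.signal(signal.SIGTERM, handle_sigterm)
    completed = run_worker(jobs, process)
    print(f"shutting down cleanly after {completed} jobs")
```

Returning from main lets the process exit normally, which is what allows the deploy to move on before kill_timeout runs out. Pick a kill_timeout comfortably longer than your slowest job.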