Recommended pattern for upgrading workers with long-running (5-10 minute) jobs?

infinitydeltax · March 8, 2025, 1:17am

Hi there,

I have some Fly Machines that handle inference workloads that take a very long time (5-10+ minutes) to complete. I’d like to upgrade these machines in such a way that the old machines continue to serve clients with in-progress inferences, while new requests go to the new machines.

Is there a recommended deploy pattern here? Maybe somehow mark old machines as “old” and have them start fly-replaying to new machines once the upgrade has started, then stop themselves when they’re done with their last request?

Rolling deploys seem to kill the machines after a maximum of five minutes, which is too short. Plus I would need the new machine to be taking new requests in that time.

mayailurus · March 8, 2025, 3:10am

Hi… It looks like there are a couple of good suggestions (including the little-known elsewhere variant of Fly-Replay) in the following older thread:

https://community.fly.io/t/dynamic-request-router-hinting/20543

The cordon alternative is the one I would try first, I think.
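
For illustration, here is a rough Python sketch of how the elsewhere variant might look inside the worker. This is only my guess at the shape, not code from that thread: DrainState, handle_new_request, and the 409 status are hypothetical names and choices, and how the machine learns it is "old" (a signal, an admin endpoint, etc.) is left out. The only Fly-specific piece is the fly-replay: elsewhere=true response header, which asks the proxy to retry the request on a different machine.

```python
import threading


class DrainState:
    """Hypothetical helper: tracks in-flight jobs and whether this machine is 'old'."""

    def __init__(self):
        self._lock = threading.Lock()
        self.draining = False
        self.in_flight = 0

    def start_draining(self):
        """Mark this machine as old; return True if it is already idle and can exit."""
        with self._lock:
            self.draining = True
            return self.in_flight == 0

    def try_accept(self):
        """Count a new job and return True, unless the machine is draining."""
        with self._lock:
            if self.draining:
                return False
            self.in_flight += 1
            return True

    def finish(self):
        """Mark one job done; return True once a draining machine has no jobs left."""
        with self._lock:
            self.in_flight -= 1
            return self.draining and self.in_flight == 0


def handle_new_request(state):
    """Return (status, headers) for a request that would start a new inference."""
    if not state.try_accept():
        # Assumption: the status code here is arbitrary; the header is what
        # tells the Fly proxy to replay the request on another machine.
        return 409, {"fly-replay": "elsewhere=true"}
    return 202, {}
```

When finish() (or start_draining()) returns True, the old machine has nothing left to do and can stop itself. Requests belonging to inferences already in progress would be served as usual and never go through handle_new_request.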

Hope this helps a little!

system · March 15, 2025, 3:11am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
