Recommended pattern for upgrading workers with long running (5-10 minute) jobs?

Hi there,

I have some fly machines that are managing inference workloads that take a very long time (5-10+ minutes) to complete. I’d like to be able to upgrade these machines, so that the old machines can continue to interact with clients that have in-progress inferences, but new requests will go to the new machines.

Is there a recommended deploy pattern here? Maybe somehow mark old machines as “old” and have them start fly-replaying to new machines once the upgrade has started, then stop themselves when they’re done with their last request?

Rolling deploys seem to kill the machines after a maximum of five minutes, which is too short. Plus I would need the new machine to be taking new requests in that time.

Hi… It looks like there a couple good suggestions (including the little-known elsewhere variant of Fly-Replay) in the following older thread:

https://community.fly.io/t/dynamic-request-router-hinting/20543

The cordon alternative is the one I would try first, I think.

Hope this helps a little!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.