Does Fly support Elixir hot code reloading?

I haven’t been able to find any answers to this online or in this forum. There is one unanswered post from 2021 (see the Fly.io forum search results for “hot code”).

I’m assuming the answer is “no” due to the containerised nature of Fly, but it would be nice to know if I’m missing something. My app has long-lived live sessions (chat/gaming) that don’t play nicely with frequent traditional deployments.

(To avoid an XY problem, I would also consider alternatives that allow for blue-green handovers of GenServer state during deploys.)

Thanks for any advice

I think it’s a no for now too, unfortunately. But I’m very interested in solutions for your problem, which is an interesting one :slight_smile:

We do support bluegreen btw!

So here’s a few random ideas ranging from “kinda easy” to “we are building a platform on top of Fly”, so feel free to dismiss any or all.

Application logic to reconnect

This is a wild guess since I don’t know your frontend specifically, but could you in theory just reconnect all the folks in the same channel to a new machine?
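
If that’s viable, a minimal sketch could be the old machine broadcasting a “reconnect” event that the frontend handles by dropping its socket and dialing back in, landing on a new machine. Hedging here: MyApp, MyAppWeb.Endpoint and the "game:*" topic are placeholders, not anything from your app.

# Hypothetical: tell every client in a game channel to reconnect
# (e.g. right before the old machine shuts down).
defmodule MyApp.Reconnector do
  def reconnect_all(game_id) do
    MyAppWeb.Endpoint.broadcast("game:#{game_id}", "reconnect", %{reason: "deploy"})
  end
end

On the client side you’d listen for the "reconnect" event and re-establish the socket, which the Fly proxy would then route to a healthy (new) machine.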

Processes gracefully handover work to newly deployed machines

If you launched your app recently Phoenix would have generated an env.sh.eex roughly like this:

#!/bin/sh

# configure node for distributed erlang with IPV6 support
export ERL_AFLAGS="-proto_dist inet6_tcp"
export DNS_CLUSTER_QUERY="${FLY_APP_NAME}.internal"
export RELEASE_DISTRIBUTION="name"
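# name the node after the image ref so old and new machines don't cluster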
export RELEASE_NODE="${FLY_APP_NAME}-${FLY_IMAGE_REF##*-}@${FLY_PRIVATE_IP}"

RELEASE_NODE is the important bit here: we make it left@right, where left is a combination of the app name and the deployment image ref, because we don’t want your updated machines to cluster with older machines.

You could remove that and add some application logic to hand over state to new GenServers? That definitely requires application code.
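
To make that concrete, here’s a hedged sketch of what a handover could look like once old and new machines are allowed to cluster. MyApp.GameServer is an assumption, and the hard part (re-registering names and routing) is omitted.

# Hypothetical handoff: the process on the OLD node pushes its state to a
# replacement on the new node, then stops.
defmodule MyApp.GameServer do
  use GenServer

  def init(state), do: {:ok, state}

  # Called on the old node when a new deploy is detected.
  def handoff(server, new_node) do
    GenServer.cast(server, {:handoff, new_node})
  end

  def handle_cast({:handoff, new_node}, state) do
    # Start a replacement process on the new node with our current state...
    {:ok, _pid} = :rpc.call(new_node, GenServer, :start, [__MODULE__, state])
    # ...then shut down gracefully.
    {:stop, :normal, state}
  end
end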

Releases with volumes

That’s something I’d love to try myself someday, but the gist is: each machine would have a small volume (say 1GB), and instead of updating the machine with the regular fly deploy command, we’d just scp the new code onto the machine and do whatever it takes for hot code reloading.

Machines would only need to be fly deployed when the Dockerfile changes (say, to install new packages).
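
For the reload step itself, a rough sketch (the /data/patches path is an assumption): scp freshly compiled .beam files onto the volume, then load them over the running code.

# Hypothetical: load patched .beam files from the machine's volume.
defmodule MyApp.HotReload do
  def apply_patches(dir \\ "/data/patches") do
    # Prepend the patch dir so its .beam files shadow the release's copies.
    true = :code.add_patha(String.to_charlist(dir))

    for file <- File.ls!(dir), String.ends_with?(file, ".beam") do
      module = file |> Path.rootname() |> String.to_atom()
      # Drop any lingering old code, then load the new version.
      :code.purge(module)
      {:module, ^module} = :code.load_file(module)
    end
  end
end

That skips all the hard parts of a real hot upgrade (appups, supervision, code_change callbacks), but it’s enough to experiment with.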


Would love input from other folks on these.


Thanks for the detailed reply!

This is a wild guess since I don’t know your frontend specifically, but could you in theory just reconnect all the folks in the same channel to a new machine?

That works for clients connected to LiveViews (ephemeral processes), but not for the long-lived process backing their game state. (In this case, yes, we could explore things like persisting the game state and reloading it on a new machine, but that gets complex.)

Digging into bluegreen a little more:

  1. It’s my understanding that health checks are at its core. Are there other ways to control the blue-green process? Because I’d need to do something like: “launch new machines, wait for health, transfer process state to the new servers, and only mark an old machine as blue once all processes are migrated off it”.

  2. Given point 1, are there timeouts or other weirdness we would hit if a server takes a long time to hand off?

  3. What would happen if a machine was never marked as blue? E.g. it has a problem and the processes won’t hand off.

  4. Are there other ways to achieve something a bit like blue-green without using your specific bluegreen pipeline? E.g. scale from 5 machines to 10, with the new machines running the new code, then scale back down to 5 when ready, culling the old machines?

To reply to all of them at once: our blue-green strategy will do its best to ensure all new machines are passing health checks before routing traffic to them. You can even include your own health checks if you feel you should.

Maybe you could sync new machines with their old counterparts and only pass the check once that’s done? I’m not sure how that would go, just throwing an idea out there to see if it helps.
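
A sketch of that idea, assuming a fly.toml HTTP check pointed at /health; MyApp.Sync.done?/0 is a function you’d have to implement yourself.

# Hypothetical plug: the health check only goes green once this machine
# has finished pulling state from its old counterpart.
defmodule MyAppWeb.HealthPlug do
  import Plug.Conn

  def init(opts), do: opts

  def call(%Plug.Conn{request_path: "/health"} = conn, _opts) do
    if MyApp.Sync.done?() do
      conn |> send_resp(200, "ok") |> halt()
    else
      # bluegreen keeps the old machines serving while this is non-2xx
      conn |> send_resp(503, "syncing") |> halt()
    end
  end

  def call(conn, _opts), do: conn
end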

If the new machines never pass their health checks, we are not going to route to them :slight_smile:


Another approach: our Machines API is fully public. You could deploy with a strategy like:

  1. Build the image with fly deploy --build-only and store the image ref.
  2. Clone machines (either via flyctl or the Machines API), just changing the image ref.
  3. Your app would cluster all the machines, and the machines would know how to take over state.

I feel like flyctl can handle steps 1 and 2 neatly (there’s a rough sketch below). Step 3 is specific to your application code, though.
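
For flavor, here’s a hedged Elixir take on steps 1 and 2 against the public Machines API, using Req as the HTTP client; the app name, token, and bare-bones config are illustrative.

# Hypothetical: create a machine running the image ref produced by
# `fly deploy --build-only`.
defmodule MyApp.Deployer do
  @api "https://api.machines.dev/v1"

  def spawn_with_image(app, image, token) do
    Req.post!("#{@api}/apps/#{app}/machines",
      auth: {:bearer, token},
      json: %{config: %{image: image}}
    ).body
  end
end

# e.g. MyApp.Deployer.spawn_with_image("my-app",
#        "registry.fly.io/my-app:deployment-xyz",
#        System.fetch_env!("FLY_API_TOKEN"))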

Cool, thanks Lubien, we’ll look into these.

Sorry, 1 more thing!

Specifically, we need something more fine-grained than a health check.

A timeline:

  • Server1 is healthy, and active.
  • Server2 starts up. It is now primary, and healthy.
  • Server1 is no longer primary, but should NOT be killed until all connections have ended (this might be hours, or days).
  • We want Fly to route all new connections to Server2, and none to Server1.
  • Server1 takes 7 hours for all connections to finish.
  • Now, Server1 is reporting as “ready to be pruned”.
  • Fly kills Server1.

This is more nuanced than a strict pass/fail health check, and I can’t see how to do it using fly.toml config.

I see what you’re getting at!

I believe to make it work you’ll need to handle more of the deployment yourself. Here’s an example of something I’d experiment with first:

Sidenote: everything below can be done with flyctl or the Machines API; feel free to choose whichever feels more comfortable to you.

  1. Spawn Server2 using flyctl machine clone (changing the build image) or POST /apps/x/machines.
  2. Somehow let Server1 know it’s no longer the current one (a flag in the DB? a flag in a GenServer? this will require application code).
  3. If traffic goes to Server1, just use Fly-Replay to send it to Server2 (see the sketch after this list): Dynamic Request Routing · Fly Docs
  4. Server1 knows that when there are 0 players it can exit gracefully. You could even make the machine DELETE itself, bonkers right?
  5. Server2 is the only server alive.
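
Step 3 could look roughly like this. Hedging: MyApp.Deploy.retired?/0 and successor_id/0 are assumed application code you’d write; the fly-replay header format is from the docs linked above.

# Hypothetical plug on Server1: once retired, replay NEW requests to Server2.
defmodule MyAppWeb.ReplayPlug do
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    if MyApp.Deploy.retired?() do
      conn
      # Fly's proxy intercepts this response and re-runs the request on
      # the target machine; the body here is discarded.
      |> put_resp_header("fly-replay", "instance=#{MyApp.Deploy.successor_id()}")
      |> send_resp(200, "")
      |> halt()
    else
      conn
    end
  end
end

Existing connections on Server1 keep running; only new requests get replayed.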

@alex1 I kinda created a POC: https://twitter.com/joao_lubien/status/1763341930367189345

The reply alone was super valuable, the POC is next level. Thanks :slight_smile:

