We are currently developing a Phoenix application on Fly. It’s a multiplayer backend that keeps state in memory and pushes updates to clients over WebSockets, clustered across the globe to keep ping times low for clients everywhere.
The in-memory state is persisted to S3. We use the bluegreen deployment strategy, and it’s important to save all the state to S3 before routing traffic to the new cluster, so we don’t lose anything. The process we came up with goes roughly like this:
1. release_command adds an entry to the DB to signal to the old cluster that a deployment is happening
2. the new cluster boots, but its /health responds with a 503 as long as the DB entry is present, so traffic does not get routed to it yet
3. the old cluster sees the DB entry and starts a “pre shutdown” sequence:
   a. close all channels
   b. save all projects to S3
   c. remove the DB entry
4. the new cluster’s /health responds with a 200, traffic is routed to the new cluster, and clients start reconnecting
5. old cluster nodes get a SIGTERM and finish shutting down
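To make the gating explicit, here is a minimal sketch of the flow above as a small in-memory simulation. It is not our actual Phoenix code: `DeployFlag`, `OldCluster`, and the `saved` dict standing in for S3 are all hypothetical names used for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class DeployFlag:
    """Hypothetical stand-in for the DB entry written by release_command."""
    present: bool = False


def health_status(flag: DeployFlag) -> int:
    """New cluster's /health: 503 while the deploy entry exists, else 200."""
    return 503 if flag.present else 200


@dataclass
class OldCluster:
    channels: list = field(default_factory=list)
    projects: dict = field(default_factory=dict)
    saved: dict = field(default_factory=dict)  # hypothetical stand-in for S3

    def pre_shutdown(self, flag: DeployFlag) -> None:
        """Run steps a-c in order; the entry is removed only after saving."""
        if not flag.present:
            return  # no deployment in progress, nothing to do
        self.channels.clear()             # a. close all channels
        self.saved.update(self.projects)  # b. save all projects to S3
        flag.present = False              # c. remove the DB entry
```

The ordering inside `pre_shutdown` is the important part: the new cluster can only turn healthy after the state has been saved, because the entry is removed last.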
This is a bit too complex for our taste. We would like to get rid of the pre-shutdown sequence and the DB state. Our ideal deployment would go like this:
1. boot the new cluster
2. (optional) run some quick integration tests on the new cluster via a private tunnel
3. gracefully shut down the old cluster with SIGTERM
4. route traffic to the new cluster
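As a rough sketch of that ideal sequence, the steps could be driven by a small orchestrator like the one below. The four callables are hypothetical hooks, not existing Fly or flyctl APIs; the point is only the ordering and the early abort when the tests fail.

```python
from typing import Callable, List


def ideal_deploy(
    boot_new: Callable[[], None],
    smoke_test: Callable[[], bool],
    shutdown_old: Callable[[], None],
    route_to_new: Callable[[], None],
) -> List[str]:
    """Run the four steps in order and return the list of steps performed."""
    steps: List[str] = []
    boot_new()
    steps.append("boot_new")
    if not smoke_test():
        # Tests failed: stop here, the old cluster keeps serving traffic.
        steps.append("abort")
        return steps
    steps.append("smoke_test")
    shutdown_old()  # SIGTERM; the old cluster saves its state before exiting
    steps.append("shutdown_old")
    route_to_new()
    steps.append("route_to_new")
    return steps
```

For example, `ideal_deploy(lambda: None, lambda: False, lambda: None, lambda: None)` stops after the failed test and never touches the old cluster.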
Is it something that could be achieved with the current APIs? If not, we would be happy to start a discussion around this subject.
Have you considered storing the data in Fly-managed Redis instead, to keep the clusters in sync? It’s quite expensive and still in preview, but using it might simplify the pre/post-deploy ceremonies.
The problem with failing the health check (steps 2 to 4) for a long time is that Fly might roll back the deployment (which is another scenario the app would have to handle).
Gotcha, but I’d avoid it if I were you (there is a bunch that can go wrong, as you know).
> Is it something that could be achieved with the current APIs?
Anyways, for Machine apps, both release_command (code) and the rolling strategy (there’s no blue-green) are driven client-side by flyctl (code). So, if you want to implement a new strategy or customize an existing one, it is pretty straightforward to do so (don’t quote me on it, I’ve never had to do it ;)).
For regular apps, the deployments are handled server-side by Fly, and so short of them implementing a new strat or modifying an existing one, I don’t see how it would be a worthwhile endeavour…
I’m also looking into a custom deploy strategy, which has 2 goals:
- more extensive tests on real-life infra, including 6PN
- keeping old deploys around, scaled to 0, for “skew protection”
Let’s say we have an app named foo, which has custom domains, certs, DNS and everything set up.
Ideally, I think it’d be amazing to be able to deploy to a randomly generated app name bar, do some testing etc., then move all the traffic from foo over to bar. We would keep some internal mapping of git versions to deployment names/ids, so that if bar receives a request that originated from a foo front-end, it can replay the request to foo via the replay header, which should boot up the (possibly already scaled to 0) app foo. So I think this might require something like renaming an existing, running app, which doesn’t seem to be possible from what I read.
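A minimal sketch of the version-to-app lookup I have in mind is below. The mapping contents, the git versions, and the way the client version reaches the server (for example a custom request header) are all assumptions for illustration. The `fly-replay: app=...` response header is how I understand Fly’s replay mechanism to work, so please double-check it against the docs.

```python
from typing import Dict, Optional

# Hypothetical mapping of front-end git versions to Fly app names.
VERSION_TO_APP: Dict[str, str] = {"abc123": "foo", "def456": "bar"}


def replay_header(current_app: str, client_version: Optional[str]) -> Dict[str, str]:
    """Return the response headers needed to replay a request elsewhere.

    An empty dict means the current app should handle the request itself.
    """
    target = VERSION_TO_APP.get(client_version or "")
    if target is None or target == current_app:
        # Unknown version, or the request already reached the right app.
        return {}
    return {"fly-replay": f"app={target}"}
```

With this mapping, a request from a foo front-end (version `abc123`) that lands on bar would get `{"fly-replay": "app=foo"}` back, while requests with an unknown or missing version are served locally.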
As the app already needs inter-instance/app communication to forward WebSocket traffic etc., it’d be nice to do the same in this scenario, though from what I gathered so far this might not really be possible at the moment? Please correct me if I’m wrong.
The alternative I can think of, which I’d probably like to avoid, would be a separate api-gateway app that needs fewer (potentially breaking) deploys and just orchestrates where to proxy incoming requests. That would keep all the domain & DNS stuff stable, and the name of the actual app wouldn’t matter to the outside world. This might also help with the custom user domains part.
Another option might be programmatically changing DNS records to move traffic from foo over to bar.
Also note that we plan on supporting custom user domains soonish. I don’t have the definitive setup fleshed out yet: it might be purely on Fly or via a CDN in front. I was looking into bunny or potentially CloudFront as alternatives; Cloudflare seems to be a bit pricey for this use case. The custom domain thing makes me think that the last option, updating DNS records, might be a little less convenient overall, and potentially everything leads to a separate api-gateway app in front.