How do I achieve high availability?

In Explain, how high availability works - #4 by lillian it's explained how high availability works. But when I run flyctl deploy, it updates both of my Machines (in the IAD region) simultaneously, making the site inaccessible.

How do I get high availability for the system I'm shipping, in practice?

Hi… It would help if you could post your full fly.toml. As @lillian said, the default rolling strategy should only update one Machine at a time.

Here's my .toml file.

I didn't specify a [deploy] section. Although App configuration (fly.toml) · Fly Docs says rolling is the default, the rolling strategy doesn't seem to have been chosen; it appears to follow the immediate strategy instead. What's wrong?
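One thing worth trying: pin the strategy explicitly rather than relying on the default. A minimal sketch, using the `[deploy]` section and `strategy` key from the fly.toml reference:

```toml
[deploy]
  strategy = "rolling"
```

You can also force it for a single run with `fly deploy --strategy rolling`.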

Hm… If you fly deploy from the command line, what does it output? It should explicitly state the strategy that it's using. E.g.,

Updating existing machines in 'uc8po' with rolling strategy

-------
┘  [1/2] Checking that 7840e95a9344d8 [app] is up and running
   [2/2] Waiting for job

(The update of the second Machine there is still pending; only the first one has been taken offline so far.)

You can use the </> button in the toolbar to get an area suitable for pasting output, etc.

Hm:

Updating existing machines in 'meritocracy' with rolling strategy
> [1/2] Acquiring lease for 56837e5ea1e068
> [1/2] Acquired lease for 56837e5ea1e068
> [2/2] Acquiring lease for d8d501fe03e068
> [2/2] Acquired lease for d8d501fe03e068
> [2/2] Updating machine config for d8d501fe03e068
> [1/2] Updating machine config for 56837e5ea1e068
> [2/2] Updating d8d501fe03e068 [app]
> [1/2] Updating 56837e5ea1e068 [app]
> [2/2] Updated machine config for d8d501fe03e068
✔ [2/2] Machine d8d501fe03e068 is now in a good state
> [1/2] Updated machine config for 56837e5ea1e068
✔ [1/2] Machine 56837e5ea1e068 is now in a good state
> [2/2] Clearing lease for d8d501fe03e068
> [1/2] Clearing lease for 56837e5ea1e068
✔ [2/2] Cleared lease for d8d501fe03e068
✔ [1/2] Cleared lease for 56837e5ea1e068
Checking DNS configuration for meritocracy.fly.dev
✓ DNS configuration verified

So it is indeed rolling. Why, then, does it leave gaps in availability? Maybe my timeouts are wrong? (And how would I measure the right ones?)

Also:

19:00:04
Machine started in 209ms
19:00:05
machine started in 458.28983ms

My two Machines started almost simultaneously. Is that a bug?


If you're aiming for gapless availability, you need to use the bluegreen deployment strategy. I tried rolling before, and it caused drops in service availability even when multiple Machines were seemingly available to "roll".
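Switching is a one-line change in fly.toml (a sketch; note that bluegreen requires health checks to be defined, since it only cuts traffic over once the new Machines pass them):

```toml
[deploy]
  strategy = "bluegreen"
```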

I tried bluegreen, and the gaps still remained. Either it's a bug in Fly.io, or maybe my timeouts are wrong or something.

You mentioned a long startup time in another thread. How many milliseconds is that, roughly?

(The Machines platform generally doesn’t handle really long boots well, regardless of the timeouts set.)
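(If you do want to tune those timeouts anyway, they live in the health-check settings in fly.toml. A sketch assuming an [http_service] app; the values are illustrative, and grace_period is the one that usually needs to exceed the boot time:)

```toml
[[http_service.checks]]
  grace_period = "30s"  # time allowed for the app to boot before checks count
  interval = "15s"
  timeout = "5s"
  method = "GET"
  path = "/"
```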

About 20 seconds, I think.

24 seconds on my testing machine between launching npm start (the Docker CMD) and both ports becoming available. The time for the production Machines may differ a little, because testing uses SQLite while production uses a PostgreSQL server in the same region.

Yikes, :weary_cat:, that's way outside what the Machines orchestration really expects. (That's my understanding, anyway.)

You might be able to partly work around this by making sure that all Machines are already in the started state before deploying, in which case one will be running and serving requests while the other completes its slow climb back to the land of the living.
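A rough sketch of that, assuming the `fly` CLI and `jq` are installed, and using `meritocracy` as a stand-in for your app name:

```shell
APP="meritocracy"  # stand-in; substitute your app name

# Start every Machine by ID first, so at least one is up and
# serving while the other is taken offline for its update.
for id in $(fly m list --app "$APP" --json | jq -r '.[].id'); do
  fly m start "$id" --app "$APP"
done

fly deploy --app "$APP"
```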

(I tried that out a couple times on a test app with an artificial 20 second delay in the start script.)

It would really be best to find out what is taking so long during boot, though, since there are situations where you can’t actually avoid a Machine getting stopped…

With the rolling strategy, both Machines begin to reload simultaneously, so neither is available. Is this a bug?

19:44:47
Successfully prepared image registry.fly.io/meritocracy@sha256:39e22e6e782c510c436c81ee3993b4e85a7857604eb2f73454fde0cf6cbd48a7 (24.352688583s)
19:44:48
Configuring firecracker
19:45:02
Successfully prepared image registry.fly.io/meritocracy@sha256:39e22e6e782c510c436c81ee3993b4e85a7857604eb2f73454fde0cf6cbd48a7 (39.454128621s)
19:45:04
Configuring firecracker

Were both of those already in the started state, before the deploy?

No, they were in the suspended state. Does that matter?

BTW, the startup time for one of the two production Machines was 28 seconds last time.

I think so. If they’re suspended, then what you just posted looks like a sequential update.

(It doesn’t actually try to start them in that case.)

That's bad behavior; I'd call it a bug.

There should be a way for my npm start to finish immediately after my Machine is updated, but it seems there isn't one.

Even bluegreen seems not to solve this problem :frowning:

The user community is sharply divided on this, with roughly half agreeing with you, it seems.

(The other half won out, though; they're the ones with the biggest fleets, and all that auto-scaling to wait through, :sweat_smile:.)

You can manually fly m start right before each deploy, although perhaps I’m misunderstanding you here…

That would be a solution, but I suspect the Machines will just be suspended again if I execute this from my .github/workflows before fly deploy.

So, I will instead execute it after each deploy.

But it does not work:

could not start machine XXX: failed to start VM XXX: failed_precondition: machine getting replaced, refusing to start

Bug?

Their CRDT can take a few seconds to converge, which is my guess at what that is. This is inconvenient but not a bug.

fly m list --json will let a script see the current state, and the lower-level Machines API has a nicer way: waiting for a particular state, with a timeout. (Thus avoiding a polling loop, in most cases.)
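To illustrate the `fly m list --json` route: the command emits a JSON array of Machine objects, each carrying a `state` field. The sample below is fabricated to mirror that shape; a deploy script could poll until the count of not-yet-started Machines reaches zero:

```shell
# Fabricated sample mirroring the shape of `fly m list --json` output.
cat > /tmp/machines.json <<'EOF'
[
  {"id": "56837e5ea1e068", "state": "started"},
  {"id": "d8d501fe03e068", "state": "suspended"}
]
EOF

# Count Machines that are not yet in the "started" state.
jq '[.[] | select(.state != "started")] | length' /tmp/machines.json
# → 1
```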

That latter one requires more logic, though, and might not be a good fit for GitHub Actions. (I don’t use those, myself, and hence don’t know the limitations there.)
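For completeness, a hedged sketch of that lower-level call, assuming a `FLY_API_TOKEN` in the environment; treat the exact endpoint shape as something to verify against the Machines API docs:

```shell
# Block until the Machine reaches the "started" state, or 60 s pass.
curl -s \
  -H "Authorization: Bearer $FLY_API_TOKEN" \
  "https://api.machines.dev/v1/apps/meritocracy/machines/$MACHINE_ID/wait?state=started&timeout=60"
```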