How do I achieve high availability?

In Explain, how high availability works - #4 by lillian it's explained how high availability works. But when I run flyctl deploy, it updates both of my Machines (in the IAD region) simultaneously, making the site inaccessible.

How do I get high availability for the system I'm shipping, in practice?

Hi… It would help if you could post your full fly.toml. As @lillian said, the default rolling strategy should only update one Machine at a time.

Here's my .toml file.

I didn't specify a [deploy] section. Although App configuration (fly.toml) · Fly Docs says rolling is the default, the rolling strategy doesn't seem to have been chosen; it appears to follow the immediate strategy instead. What's wrong?
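One thing worth trying: pin the strategy explicitly rather than relying on the default. A minimal sketch, using the `[deploy]` section and `strategy` key from the fly.toml reference:

```toml
[deploy]
  strategy = "rolling"
```

You can also force it for a single run with `fly deploy --strategy rolling`.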

Hm… If you fly deploy from the command line, what does it output? It should explicitly state the strategy that it's using. E.g.,

Updating existing machines in 'uc8po' with rolling strategy

-------
┘  [1/2] Checking that 7840e95a9344d8 [app] is up and running
   [2/2] Waiting for job

(The update of the second Machine there is still pending; only the first one has been taken offline so far.)

You can use the </> button in the toolbar to get an area suitable for pasting output, etc.

Hm:

Updating existing machines in 'meritocracy' with rolling strategy
> [1/2] Acquiring lease for 56837e5ea1e068
> [1/2] Acquired lease for 56837e5ea1e068
> [2/2] Acquiring lease for d8d501fe03e068
> [2/2] Acquired lease for d8d501fe03e068
> [2/2] Updating machine config for d8d501fe03e068
> [1/2] Updating machine config for 56837e5ea1e068
> [2/2] Updating d8d501fe03e068 [app]
> [1/2] Updating 56837e5ea1e068 [app]
> [2/2] Updated machine config for d8d501fe03e068
✔ [2/2] Machine d8d501fe03e068 is now in a good state
> [1/2] Updated machine config for 56837e5ea1e068
✔ [1/2] Machine 56837e5ea1e068 is now in a good state
> [2/2] Clearing lease for d8d501fe03e068
> [1/2] Clearing lease for 56837e5ea1e068
✔ [2/2] Cleared lease for d8d501fe03e068
✔ [1/2] Cleared lease for 56837e5ea1e068
Checking DNS configuration for meritocracy.fly.dev
✓ DNS configuration verified

So it is indeed rolling. Why, then, does it leave gaps in availability? Maybe my timeouts are wrong? (And how would I measure the right ones?)

Also:

19:00:04
Machine started in 209ms
19:00:05
machine started in 458.28983ms

My two Machines started almost simultaneously. Is that a bug?


If you're aiming for gapless availability, you need to use the bluegreen deployment strategy. I tried rolling before, and it caused drops in service availability even when multiple Machines were seemingly available to "roll".
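Switching is a one-line change in fly.toml (a sketch; note that bluegreen requires health checks to be defined, since it only cuts traffic over once the new Machines pass them):

```toml
[deploy]
  strategy = "bluegreen"
```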

I tried bluegreen, and the gaps still remained. Either it's a bug in Fly.io, or maybe my timeouts are wrong or something.

You mentioned a long startup time in another thread. How many milliseconds is that, roughly?

(The Machines platform generally doesn’t handle really long boots well, regardless of the timeouts set.)
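(If you do want to tune those timeouts anyway, they live in the health-check settings in fly.toml. A sketch assuming an [http_service] app; the values are illustrative, and grace_period is the one that usually needs to exceed the boot time:)

```toml
[[http_service.checks]]
  grace_period = "30s"  # time allowed for the app to boot before checks count
  interval = "15s"
  timeout = "5s"
  method = "GET"
  path = "/"
```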

About 20 seconds, I think.

24 seconds on my testing machine between launching npm start (the Docker CMD) and both ports becoming available. The time for the production Machines may differ a little, because testing uses SQLite while production uses a PostgreSQL server in the same region.

Yikes, :weary_cat:, that's way outside what the Machines orchestration really expects. (That's my understanding, anyway.)

You might be able to partly work around this by making sure that all Machines are already in the started state before deploying, in which case one will be running and serving requests while the other completes its slow climb back to the land of the living.
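A rough sketch of that, assuming the `fly` CLI and `jq` are installed, and using `meritocracy` as a stand-in for your app name:

```shell
APP="meritocracy"  # stand-in; substitute your app name

# Start every Machine by ID first, so at least one is up and
# serving while the other is taken offline for its update.
for id in $(fly m list --app "$APP" --json | jq -r '.[].id'); do
  fly m start "$id" --app "$APP"
done

fly deploy --app "$APP"
```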

(I tried that out a couple times on a test app with an artificial 20 second delay in the start script.)

It would really be best to find out what is taking so long during boot, though, since there are situations where you can’t actually avoid a Machine getting stopped…

With the rolling strategy, both Machines begin to reload simultaneously, so neither is available. Is this a bug?

19:44:47
Successfully prepared image registry.fly.io/meritocracy@sha256:39e22e6e782c510c436c81ee3993b4e85a7857604eb2f73454fde0cf6cbd48a7 (24.352688583s)
19:44:48
Configuring firecracker
19:45:02
Successfully prepared image registry.fly.io/meritocracy@sha256:39e22e6e782c510c436c81ee3993b4e85a7857604eb2f73454fde0cf6cbd48a7 (39.454128621s)
19:45:04
Configuring firecracker

Were both of those already in the started state, before the deploy?

No, they were in the suspended state. Does that matter?

BTW, the startup time for one of the two production Machines was 28 seconds last time.

I think so. If they’re suspended, then what you just posted looks like a sequential update.

(It doesn’t actually try to start them in that case.)

That's bad behavior; I'd call it a bug.

There should be a way for my npm start to finish immediately after my Machine is updated, but it seems there isn't one.

Even bluegreen seems not to solve this problem :frowning:

The user community is sharply divided on this, with roughly half agreeing with you, it seems.

(The other half won out, though; they're the ones with the biggest fleets, and all that auto-scaling to wait through, :sweat_smile:.)

You can manually fly m start right before each deploy, although perhaps I’m misunderstanding you here…

That would be a solution, but I suspect the Machines will just be suspended again if I execute this from my .github/workflows before fly deploy.

So, I will instead execute it after each deploy.

But it does not work:

could not start machine XXX: failed to start VM XXX: failed_precondition: machine getting replaced, refusing to start

Bug?

Their CRDT can take a few seconds to converge, which is my guess at what that is. This is inconvenient but not a bug.

fly m list --json will let a script see the current state, and the lower-level Machines API has a nicer way: waiting for a particular state, with a timeout. (Thus avoiding a polling loop, in most cases.)
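To illustrate the `fly m list --json` route: the command emits a JSON array of Machine objects, each carrying a `state` field. The sample below is fabricated to mirror that shape; a deploy script could poll until the count of not-yet-started Machines reaches zero:

```shell
# Fabricated sample mirroring the shape of `fly m list --json` output.
cat > /tmp/machines.json <<'EOF'
[
  {"id": "56837e5ea1e068", "state": "started"},
  {"id": "d8d501fe03e068", "state": "suspended"}
]
EOF

# Count Machines that are not yet in the "started" state.
jq '[.[] | select(.state != "started")] | length' /tmp/machines.json
# → 1
```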

That latter one requires more logic, though, and might not be a good fit for GitHub Actions. (I don’t use those, myself, and hence don’t know the limitations there.)
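For completeness, a hedged sketch of that lower-level call, assuming a `FLY_API_TOKEN` in the environment; treat the exact endpoint shape as something to verify against the Machines API docs:

```shell
# Block until the Machine reaches the "started" state, or 60 s pass.
curl -s \
  -H "Authorization: Bearer $FLY_API_TOKEN" \
  "https://api.machines.dev/v1/apps/meritocracy/machines/$MACHINE_ID/wait?state=started&timeout=60"
```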