Explain, how high availability works - #4 by lillian explains how high availability works. But when I run flyctl deploy, both of my machines (in the IAD region) are updated simultaneously, making the site inaccessible during the deploy.
How do I achieve high availability in practice when shipping updates?
I didn’t specify [deploy], and although App configuration (fly.toml) · Fly Docs says rolling is the default strategy, it doesn’t seem to have been used; the deploy appears to follow the immediate strategy instead. What’s wrong?
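For what it’s worth, the strategy can also be pinned explicitly in fly.toml rather than relying on the default. A minimal sketch (the [deploy] section and strategy key are documented; whether pinning it changes anything here is an assumption):

```toml
[deploy]
  # Ask flyctl to update machines one at a time instead of all at once.
  strategy = "rolling"
```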
Updating existing machines in 'meritocracy' with rolling strategy
> [1/2] Acquiring lease for 56837e5ea1e068
> [1/2] Acquired lease for 56837e5ea1e068
> [2/2] Acquiring lease for d8d501fe03e068
> [2/2] Acquired lease for d8d501fe03e068
> [2/2] Updating machine config for d8d501fe03e068
> [1/2] Updating machine config for 56837e5ea1e068
> [2/2] Updating d8d501fe03e068 [app]
> [1/2] Updating 56837e5ea1e068 [app]
> [2/2] Updated machine config for d8d501fe03e068
✔ [2/2] Machine d8d501fe03e068 is now in a good state
> [1/2] Updated machine config for 56837e5ea1e068
✔ [1/2] Machine 56837e5ea1e068 is now in a good state
> [2/2] Clearing lease for d8d501fe03e068
> [1/2] Clearing lease for 56837e5ea1e068
✔ [2/2] Cleared lease for d8d501fe03e068
✔ [1/2] Cleared lease for 56837e5ea1e068
Checking DNS configuration for meritocracy.fly.dev
✓ DNS configuration verified
So it is indeed rolling. Why, then, are there gaps in availability? Maybe my timeouts are wrong? (And how do I measure the right ones?)
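Timeouts like these usually live in the service’s health-check configuration: a rolling deploy only helps if flyctl waits for the new machine to pass its checks before touching the next one. A hedged sketch of what that might look like in fly.toml (the check keys are documented; the path and durations are made-up values for illustration):

```toml
[[http_service.checks]]
  # Give the app time to boot before failed checks count against it.
  grace_period = "30s"
  interval = "15s"
  timeout = "10s"
  method = "GET"
  path = "/healthz"   # hypothetical health endpoint
```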
Also:
19:00:04
Machine started in 209ms
19:00:05
machine started in 458.28983ms
My two machines started almost simultaneously. Is that a bug?
If you’re aiming for gapless availability, you need to use the bluegreen deployment strategy. I tried rolling before, and it causes drops in service availability even when multiple machines are seemingly available to “roll”.
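Switching strategies can be done in fly.toml as well; a minimal sketch (assuming the app has health checks defined, which bluegreen relies on to decide when to cut traffic over):

```toml
[deploy]
  # Boot a full replacement set of machines, wait for them to pass health
  # checks, then switch traffic over and retire the old set.
  strategy = "bluegreen"
```

The same can be selected for a single deploy with fly deploy --strategy bluegreen.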
It takes 24 seconds on my testing machine between launching npm start (the Docker CMD) and both ports becoming available. The time on the production machines may differ a little, because testing uses SQLite while production uses a PostgreSQL server in the same region.
Yikes, that’s way outside of what the Machines orchestration really expects. (That is my understanding, anyway.)
You might be able to partly work around this by making sure that all Machines are already in the started state before deploying, in which case one will be running and serving requests while the other completes its slow climb back to the land of the living.
(I tried that out a couple times on a test app with an artificial 20 second delay in the start script.)
It would really be best to find out what is taking so long during boot, though, since there are situations where you can’t actually avoid a Machine getting stopped…
That would be a solution, but I suspect the machines will be suspended again if I execute this from my .github/workflows before fly deploy runs.
Their CRDT can take a few seconds to converge; my guess is that’s what you’re seeing. It’s inconvenient, but not a bug.
fly m list --json will let a script see the current state, and the lower-level Machines API has a nicer way, which is to wait for a particular state, with a timeout. (Thus avoiding a polling loop, in most cases.)
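A hypothetical sketch of the first approach: deciding, from the output of fly m list --json, which Machines still need starting. The JSON shape assumed here (a list of objects with "id" and "state" fields) matches what flyctl prints in my experience, but treat the field names as assumptions:

```python
import json

def machines_not_started(machines_json: str) -> list[str]:
    """Return the IDs of Machines whose state is not 'started'."""
    machines = json.loads(machines_json)
    return [m["id"] for m in machines if m.get("state") != "started"]

# Abridged, made-up sample of `fly m list --json` output:
sample = (
    '[{"id": "56837e5ea1e068", "state": "started"},'
    ' {"id": "d8d501fe03e068", "state": "stopped"}]'
)
print(machines_not_started(sample))  # → ['d8d501fe03e068']
```

A pre-deploy script could feed each returned ID to fly machine start before running fly deploy.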
That latter approach requires more logic, though, and might not be a good fit for GitHub Actions. (I don’t use those myself, and hence don’t know the limitations there.)