Zero downtime deployment

Reading these threads, it seems there were some issues with deployments causing intermittent downtime:

However, trying it today, it seems no better :frowning: Am I missing something?

I’ve used this guide to get a simple Go app: Run a Go App · Fly Docs

When deploying (using the default, rolling or canary strategy) I get spotty service for about 30-60s (how bad it is and how long it lasts depends on the strategy).

Is zero downtime deployment possible currently?

It’s pretty hard deploying anything production worthy if downtime is expected on every deployment.

I also noticed this with another application, where I have very aggressive health checks and large grace periods, but a deployment always results in downtime. I assumed that was because I had configured it wrong, so testing with the Go app seemed like a good way to make sure it was Fly and not me.

This is a log of me sending a request every second:

HTTP/2 200
date: Tue, 18 Oct 2022 18:34:55 GMT
content-length: 137
content-type: text/html; charset=utf-8
server: Fly/af1875645 (2022-10-18)
via: 2 fly.io
fly-request-id: 01GFP614XR2P7Z12EAES7HHS07-ams

curl: (28) Operation timed out after 803 milliseconds with 0 bytes received
curl: (28) Operation timed out after 803 milliseconds with 0 bytes received
curl: (28) Operation timed out after 806 milliseconds with 0 bytes received
curl: (28) Operation timed out after 800 milliseconds with 0 bytes received
curl: (28) Operation timed out after 804 milliseconds with 0 bytes received
curl: (28) Operation timed out after 804 milliseconds with 0 bytes received
curl: (28) Operation timed out after 806 milliseconds with 0 bytes received
curl: (28) Operation timed out after 803 milliseconds with 0 bytes received
curl: (28) Operation timed out after 802 milliseconds with 0 bytes received
curl: (28) Operation timed out after 800 milliseconds with 0 bytes received
curl: (28) Operation timed out after 805 milliseconds with 0 bytes received
curl: (28) Operation timed out after 802 milliseconds with 0 bytes received
curl: (28) Operation timed out after 806 milliseconds with 0 bytes received
curl: (28) Operation timed out after 802 milliseconds with 0 bytes received
curl: (28) Operation timed out after 803 milliseconds with 0 bytes received
curl: (28) Operation timed out after 800 milliseconds with 0 bytes received

HTTP/2 200
date: Tue, 18 Oct 2022 18:35:26 GMT
content-length: 138
content-type: text/html; charset=utf-8
server: Fly/af1875645 (2022-10-18)
via: 2 fly.io
fly-request-id: 01GFP622J88D0VZ8HE2JWTWHZ3-ams

There are ~30 seconds between the last successful request on the previous version and the first one on the new version. However, timeouts continue intermittently until 18:35:55, so it takes a full minute before service is stable again.
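
For anyone who wants to reproduce a log like the one above, here is a minimal sketch of a one-request-per-second probe. It is not the exact command used for the log: the function names (probe_once, probe_loop, longest_outage) are made up for illustration, the URL is a placeholder, and the 0.8 second timeout is only a guess based on the ~800 ms timeouts shown above.

```shell
#!/usr/bin/env bash
# Sketch of a deploy-time availability probe (names are hypothetical).

# Print the HTTP status code for one request, or FAIL if curl itself
# errors out (timeout, connection refused, DNS failure, ...).
probe_once() {
  local url="$1" code
  if code=$(curl --silent --output /dev/null --max-time 0.8 \
      --write-out '%{http_code}' "$url"); then
    echo "$code"
  else
    echo "FAIL"
  fi
}

# Probe the URL once per second, printing "HH:MM:SS <result>" per request.
probe_loop() {
  local url="$1" count="${2:-60}" i
  for ((i = 0; i < count; i++)); do
    printf '%s %s\n' "$(date -u +%H:%M:%S)" "$(probe_once "$url")"
    sleep 1
  done
}

# Read probe_loop output on stdin and print the longest run of
# consecutive FAIL lines (roughly the longest outage in seconds).
# Non-2xx status codes are not counted as failures here.
longest_outage() {
  awk '$2 == "FAIL" { run++; if (run > max) max = run; next }
       { run = 0 }
       END { print max + 0 }'
}

# Example (placeholder hostname), run while a deploy is in progress:
#   probe_loop "https://<app-name>.fly.dev/" 120 | tee probe.log
#   longest_outage < probe.log
```

Starting the loop shortly before running the deploy and keeping it going for a minute or two afterwards makes it easier to compare strategies by their longest failing streak rather than by eyeballing the output.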

I was just about to post something similar. I see the original box running, but as soon as I start a deployment my calls fail. They don’t succeed again until well after the new box is up and running.

Maybe this post has some more insights (even if it could be outdated): High availability on Fly.io - #3 by kurt

Does fly deploy --strategy rolling <other-args> prevent outages from the client’s point of view only if one is running multi-region Fly apps (ref)?

I am compelled to point out that Machine app instances get fixed 6pn (private) IPs across deploys (unlike regular Fly apps, unless they ‘anchor’ themselves to a host), so they are likely not subject to deploy-induced downtime. If you can, you should move to Machines (after evaluating them for a round or two).

Interesting, thanks for your detailed reply. Reading between the lines a little, it seems like v1 apps can’t guarantee any high availability / zero downtime because the backhaul updates too slowly to make this happen.

I must admit that Machines sound like they are supposed to become v2 apps at some point and could be treated as such, although I have a hard time figuring out how to deploy the example as a v2/Machines app instead of a v1/Nomad app.

Maybe I’m just not as smart as I think, or I’m missing some doc page, but I would have expected a way to specify in my fly.toml that I want to use Machines. I haven’t found any docs explaining this well yet, and all examples still seem to be v1 examples. Reading the Machines doc (Machines · Fly Docs) gives me more questions than answers, to be honest, because it mostly tells you how to do things manually.

Is there some doc I’m missing?

@stayallive This morning, when I pushed some new code and then ran flyctl deploy, the rolling update worked as I expected: the first instance stayed up until the second was ready.

Interestingly, changing an ENV did NOT work this way (which is what I was doing yesterday when I first noticed this). In that case, the old instance is terminated immediately and there is downtime before the second is available to handle incoming requests.

You mean fly secrets set ...? That’s interesting.


Wait for Fly engs to confirm this one way or the other. We need zero-downtime deploys ourselves, so I use Machines, but I haven’t yet moved all our prod traffic to Fly. To my untrained eye, Machines might provide this, but I haven’t put ours under any duress yet.

Join me in writing a Haiku, won’t you: Write us a haiku and we'll put it on the login screen - #13 by ignoramous (:

# create a machines app
fly apps create <app-name> --machines

# then run it
# services.port and services.internal_port with handlers go in -p
fly m run . \
  -p <443:8080/tcp:tls?> \
  --dockerfile </path/to/dockerfile?> \
  --memory <256?> \
  --region <iad?> \
  --name <uniq-name?> \
  -a <app-name?>

After a Machine is run once, you can clone it to other regions:

# list machines
fly m list -a <app-name>

# note down alloc-id from above, and clone like so:
fly m clone <machine-alloc-id> --config fly.toml --name <uniq-name> --region <aws>

You can also use fly deploy (once the app’s created) to deploy to all Machines (but it is rough around the edges). I fixed a few issues with fly deploy and Machines, but I maintain those fixes in a local fork.

See also: Ephemeral fly machine? - #2 by ignoramous

Ref this Fly blog post.

Yep.
