Rolling deployments sequence

pims · January 20, 2021, 6:47am

Hi

I have a question (or two) about rolling deployments.
Imagine an application with 2 running/healthy instances:

2021-01-20T06:15:22.518Z eac02266 sjc [info] [2021-01-20T06:15:21.461Z] "GET / HTTP/1.1" 200 
2021-01-20T06:15:24.127Z 6997dc52 sjc [info] [2021-01-20T06:15:24.067Z] "GET /ready HTTP/1.1"

I manually deploy a new version, and I see a new VM come online.

2021-01-20T06:15:26.229Z a2ab8699 sjc [info] Starting instance
2021-01-20T06:15:26.253Z a2ab8699 sjc [info] Configuring virtual machine
2021-01-20T06:15:26.254Z a2ab8699 sjc [info] Pulling container image
2021-01-20T06:15:28.584Z a2ab8699 sjc [info] Unpacking image
...
2021-01-20T06:15:29.929Z a2ab8699 sjc [info] [runner] starting

and I see the 2 healtchecks pass for this new VM:

2021-01-20T06:15:34.539Z a2ab8699 sjc [info] [2021-01-20T06:15:34.529Z] "GET /ready HTTP/1.1" 200 - 0 2 0 - "136.144.55.141" "Consul Health Check" "cec8c8af-99e9-4da1-8cd2-b2c99c6515e7" "172.19.2.130:9903" "-"
2021-01-20T06:15:44.543Z a2ab8699 sjc [info] [2021-01-20T06:15:44.530Z] "GET /ready HTTP/1.1" 200 - 0 2 0 - "136.144.55.141" "Consul Health Check" "97d0e784-0c1c-4978-9cef-7a46ea1c7812" "172.19.2.130:9903" "-"

the first timestamp is the timestamp from fly logs while the second one between brackets is Envoy’s access log timestamp

As soon as 2 Health Checks pass but without any Healthcheck status changed to passing log message, I see my two original VMs get terminated.

2021-01-20T06:15:44.965Z eac02266 sjc [info] Shutting down virtual machine
2021-01-20T06:15:45.089Z 6997dc52 sjc [info] Shutting down virtual machine

I’ve added a kill_timeout of 120 seconds, so the VM should stick around for a bit, but there doesn’t seem to be any logs after the SIGINT is sent.

A new VM starts (the 2nd one):

2021-01-20T06:15:56.620Z 74d94cfa sjc [info] Starting instance
2021-01-20T06:15:56.640Z 74d94cfa sjc [info] Configuring virtual machine
2021-01-20T06:15:56.641Z 74d94cfa sjc [info] Pulling container

but only one healthcheck is logged:

2021-01-20T06:16:08.808Z 74d94cfa sjc [info] [2021-01-20T06:16:08.799Z] "GET /ready HTTP/1.1" 200 - 0 2 0 - "136.144.55.143" "Consul Health Check" "433ba8df-6888-4477-ae34-ed4c0522385c" "172.19.3.74:9903" "-"
2021-01-20T06:16:12.972Z 74d94cfa sjc [info] Health check status changed to 'passing'

My questions:

Is it normal that after only one new VMs comes online, the previous 2 are scheduled for termination?
Is it normal to not have any log messages after SIGINT for the running VMs? I catch SIGINT and then gracefully drain http listeners, but I’m not seeing any log messages for this
Would the behavior be any different if I had 3 running VMs?

ps: @kurt I ended up writing a tiny go binary that handles the graceful shutdown because doing it in bash was too painful

kurt · January 21, 2021, 1:54am

Whoops! Missed this earlier.

By default, a deploy does a canary + rolling restart. This means it’ll boot the first instance, wait for health checks, and then promote the canary. When this happens, it goes to a rolling process. The rolling deploy stops one instance, boots another, stops another instance, etc.

What you’re seeing is basically this sequence:

Canary boots, health checks pass
Canary gets promoted, replacing old instance #1
Old instance #2 stops (rolling deploy)
New instance #2 starts

If you had three instances:

Canary boots, health checks pass
Canary gets promoted, replacing old instance #1
Old instance #2 stops (rolling deploy)
New instance #2 starts
Old instance #3 stops
New instance #3 starts

Now that I’ve written all that out, I realize it’s mildly confusing for 2 instance apps. It might be more predictable to do a blue/green deploy, or repeat the canary process all l the way through.

Regarding logs + SIGINT, you should see log messages after SIGINT if you’re still producing them. Is it possible the process is shutting down very quickly on SIGINT?

I ended up replacing a bunch of bash with Go as well. It’s so handy and … compiles.

pims · January 21, 2021, 3:16am

Thanks for the detailed explanation regarding the deploy sequence. I’ll try again with 3 instances.

As for the logs not appearing after SIGINT, the script waits ~45s before shutting down. It should be enough time for logs to make it to Fly’s backend before being terminated.

I’ll add more logs and share my script.

kurt · January 21, 2021, 3:44am

If you want to share the app name, and you’re cool with me firing off random SIGINTs to your instances, I can see what’s going on too.

pims · January 21, 2021, 3:59am

Definitely. App name envoy

kill_signal = "SIGINT"
# allow 2 minutes to cleanly shutdown
kill_timeout = 120

and the runner script is: runner.go · GitHub

kurt · January 21, 2021, 4:26am

The kill timeout on your app is at the default 5s, so I think the setting isn’t taking. Can you post your whole fly.toml?

pims · January 21, 2021, 4:28am

# fly.toml file generated for envoy on 2020-12-22T22:49:56-08:00

app = "envoy"

[experimental]
  private_network = true


kill_signal = "SIGINT"
# allow 2 minutes to cleanly shutdown
kill_timeout = 120

[[services]]
  internal_port = 10000
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20

  [[services.ports]]
    port = "443"

[[services]]
  internal_port = 9903

  [[services.http_checks]]
    interval = 10000
    method = "get"
    path = "/ready"
    protocol = "http"
    timeout = 2000
    tls_skip_verify = false
    restart_limit = 5

kurt · January 21, 2021, 4:29am

Oh, toml strikes again. Those are technically getting attached to the [experimental] block. If you move them to the top it should work.

kurt · January 21, 2021, 4:32am

Incidentally, this answers a question we had about whether we should use top level config settings or not. The answer is “no”. We’ll probably shift those to a [shutdown] block to make it less ambiguous.

pims · January 21, 2021, 4:32am

That would explain it.
Got it from Graceful VM exits, some dials · The Fly Blog

Just add these options to your fly.toml configuration file:

kill_timeout : Number of seconds to wait before killing a VM. Shared CPU VMs allow up to 5 minutes and dedicated CPU VMs allow up to 24 hours. The default is 5 seconds.

kill_signal : defaults to SIGINT . Also accepts SIGTERM , SIGQUIT , SIGUSR1 , SIGUSR2 , SIGKILL , or SIGSTOP .

Thanks for looking into this

pims · January 21, 2021, 4:36am

~~still not seeing the logs though, I’ll try to repro with a simpler example.~~

This might be because of the shell script and nothing to do with fly.

Edit: confirmed it works

amithm7 · December 27, 2021, 7:37am

In practice, I have seen the default as canary + immediate.

This is what happens:

Canary boots, health checks pass
Canary gets promoted, replacing old instance #1
Old instance #2 stops
Old instance #3 stops
New instance #2 starts
New instance #3 starts

After first instance boots up, rest of the instances shut down together like this:

2021-12-27T05:55:15.088 app[0e49f967] mia [info] Star child exited with signal (with signal 'SIGINT', core dumped? false)
2021-12-27T05:55:15.088 app[0e49f967] mia [info] Starting clean up.
2021-12-27T05:55:15.158 app[d9752c87] dfw [info] Main child exited with signal (with signal 'SIGINT', core dumped? false)
2021-12-27T05:55:15.158 app[d9752c87] dfw [info] Starting clean up.
2021-12-27T05:55:15.241 app[0eefd5ed] mad [info] Main child exited with signal (with signal 'SIGINT', core dumped? false)
2021-12-27T05:55:15.242 app[0eefd5ed] mad [info] Starting clean up.

This is a bug?