Rolling deployments sequence

Hi :wave:

I have a question (or two) about rolling deployments.
Imagine an application with 2 running/healthy instances:

2021-01-20T06:15:22.518Z eac02266 sjc [info] [2021-01-20T06:15:21.461Z] "GET / HTTP/1.1" 200 
2021-01-20T06:15:24.127Z 6997dc52 sjc [info] [2021-01-20T06:15:24.067Z] "GET /ready HTTP/1.1" 

I manually deploy a new version, and I see a new VM come online.

2021-01-20T06:15:26.229Z a2ab8699 sjc [info] Starting instance
2021-01-20T06:15:26.253Z a2ab8699 sjc [info] Configuring virtual machine
2021-01-20T06:15:26.254Z a2ab8699 sjc [info] Pulling container image
2021-01-20T06:15:28.584Z a2ab8699 sjc [info] Unpacking image
...
2021-01-20T06:15:29.929Z a2ab8699 sjc [info] [runner] starting

and I see 2 health checks pass for this new VM:

2021-01-20T06:15:34.539Z a2ab8699 sjc [info] [2021-01-20T06:15:34.529Z] "GET /ready HTTP/1.1" 200 - 0 2 0 - "136.144.55.141" "Consul Health Check" "cec8c8af-99e9-4da1-8cd2-b2c99c6515e7" "172.19.2.130:9903" "-"
2021-01-20T06:15:44.543Z a2ab8699 sjc [info] [2021-01-20T06:15:44.530Z] "GET /ready HTTP/1.1" 200 - 0 2 0 - "136.144.55.141" "Consul Health Check" "97d0e784-0c1c-4978-9cef-7a46ea1c7812" "172.19.2.130:9903" "-"

:spiral_notepad: the first timestamp is from fly logs, while the second one in brackets is Envoy’s access log timestamp

As soon as the 2 health checks pass (but without any "Health check status changed to 'passing'" log message), I see my two original VMs get terminated.

2021-01-20T06:15:44.965Z eac02266 sjc [info] Shutting down virtual machine
2021-01-20T06:15:45.089Z 6997dc52 sjc [info] Shutting down virtual machine

:spiral_notepad: I’ve added a kill_timeout of 120 seconds, so the VM should stick around for a bit, but there don’t seem to be any logs after the SIGINT is sent.

A new VM starts (the 2nd one):

2021-01-20T06:15:56.620Z 74d94cfa sjc [info] Starting instance
2021-01-20T06:15:56.640Z 74d94cfa sjc [info] Configuring virtual machine
2021-01-20T06:15:56.641Z 74d94cfa sjc [info] Pulling container

but only one health check is logged:

2021-01-20T06:16:08.808Z 74d94cfa sjc [info] [2021-01-20T06:16:08.799Z] "GET /ready HTTP/1.1" 200 - 0 2 0 - "136.144.55.143" "Consul Health Check" "433ba8df-6888-4477-ae34-ed4c0522385c" "172.19.3.74:9903" "-"
2021-01-20T06:16:12.972Z 74d94cfa sjc [info] Health check status changed to 'passing'

My questions:

  1. Is it normal that after only one new VM comes online, the previous 2 are scheduled for termination?
  2. Is it normal to not have any log messages after SIGINT for the running VMs? I catch SIGINT and then gracefully drain HTTP listeners, but I’m not seeing any log messages for this.
  3. Would the behavior be any different if I had 3 running VMs?

ps: @kurt I ended up writing a tiny Go binary that handles the graceful shutdown (rough sketch below) because doing it in bash was too painful :see_no_evil:
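For reference, the shutdown handling is roughly shaped like this. It’s a trimmed-down, illustrative sketch (a plain net/http stand-in, not the actual runner.go; port and timings are made up), just to show where I’d expect log lines to appear after the SIGINT:

package main

import (
    "context"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    srv := &http.Server{Addr: ":10000"} // illustrative port

    go func() {
        if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
            log.Fatalf("listen: %v", err)
        }
    }()

    // Fly sends kill_signal (SIGINT here); catch it instead of dying immediately.
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
    sig := <-sigCh
    log.Printf("received %v, draining listeners", sig)

    // Give in-flight requests time to finish; kill_timeout needs to be
    // longer than this deadline, or the VM is killed first.
    ctx, cancel := context.WithTimeout(context.Background(), 45*time.Second)
    defer cancel()
    if err := srv.Shutdown(ctx); err != nil {
        log.Printf("drain error: %v", err)
    }
    log.Println("drained, exiting")
}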

Whoops! Missed this earlier.

By default, a deploy does a canary + rolling restart. This means it’ll boot the first instance, wait for health checks, and then promote the canary. Once the canary is promoted, the deploy moves on to a rolling process: stop one instance, boot another, stop the next, and so on.

What you’re seeing is basically this sequence:

  1. Canary boots, health checks pass
  2. Canary gets promoted, replacing old instance #1
  3. Old instance #2 stops (rolling deploy)
  4. New instance #2 starts

If you had three instances:

  1. Canary boots, health checks pass
  2. Canary gets promoted, replacing old instance #1
  3. Old instance #2 stops (rolling deploy)
  4. New instance #2 starts
  5. Old instance #3 stops
  6. New instance #3 starts

Now that I’ve written all that out, I realize it’s mildly confusing for 2-instance apps. It might be more predictable to do a blue/green deploy, or repeat the canary process all the way through.
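If you’d rather not rely on the default, flyctl also lets you pick the strategy per deploy (assuming your flyctl version has the flag), e.g.:

fly deploy --strategy bluegreen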

Regarding logs + SIGINT, you should see log messages after SIGINT if you’re still producing them. Is it possible the process is shutting down very quickly on SIGINT?

I ended up replacing a bunch of bash with Go as well. It’s so handy and … compiles.

Thanks for the detailed explanation regarding the deploy sequence. I’ll try again with 3 instances.

As for the logs not appearing after SIGINT: the script waits ~45s before shutting down, which should be enough time for logs to make it to Fly’s backend before the VM is terminated.

I’ll add more logs and share my script.

If you want to share the app name, and you’re cool with me firing off random SIGINTs to your instances, I can see what’s going on too.

Definitely. App name is envoy.

kill_signal = "SIGINT"
# allow 2 minutes to cleanly shutdown
kill_timeout = 120

and the runner script is: runner.go · GitHub
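The gist isn’t inlined here, but the core of it is a small wrapper that starts the child process, forwards the signal, and waits for it to exit. A stripped-down sketch (the envoy command and flags below are placeholders, not the real invocation):

package main

import (
    "log"
    "os"
    "os/exec"
    "os/signal"
    "syscall"
)

func main() {
    // Start the real server as a child process (placeholder command).
    cmd := exec.Command("envoy", "-c", "/etc/envoy/envoy.yaml")
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    if err := cmd.Start(); err != nil {
        log.Fatalf("start: %v", err)
    }

    // Forward SIGINT/SIGTERM so the child can drain before exiting.
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
    go func() {
        sig := <-sigCh
        log.Printf("forwarding %v to child", sig)
        if err := cmd.Process.Signal(sig); err != nil {
            log.Printf("signal: %v", err)
        }
    }()

    // Wait for the child; this log line should show up before the VM goes away.
    if err := cmd.Wait(); err != nil {
        log.Printf("child exited: %v", err)
    }
    log.Println("runner exiting")
}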

The kill timeout on your app is at the default 5s, so I think the setting isn’t taking. Can you post your whole fly.toml?

# fly.toml file generated for envoy on 2020-12-22T22:49:56-08:00

app = "envoy"

[experimental]
  private_network = true


kill_signal = "SIGINT"
# allow 2 minutes to cleanly shutdown
kill_timeout = 120

[[services]]
  internal_port = 10000
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20

  [[services.ports]]
    port = "443"

[[services]]
  internal_port = 9903

  [[services.http_checks]]
    interval = 10000
    method = "get"
    path = "/ready"
    protocol = "http"
    timeout = 2000
    tls_skip_verify = false
    restart_limit = 5

Oh, toml strikes again. Those are technically getting attached to the [experimental] block. If you move them to the top it should work.

Incidentally, this answers a question we had about whether we should use top level config settings or not. The answer is “no”. We’ll probably shift those to a [shutdown] block to make it less ambiguous.
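Concretely, moving kill_signal / kill_timeout above the first table should do it:

app = "envoy"

kill_signal = "SIGINT"
# allow 2 minutes to cleanly shutdown
kill_timeout = 120

[experimental]
  private_network = true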

That would explain it.
Got it from Graceful VM exits, some dials · The Fly Blog

Just add these options to your fly.toml configuration file:

  • kill_timeout: Number of seconds to wait before killing a VM. Shared CPU VMs allow up to 5 minutes and dedicated CPU VMs allow up to 24 hours. The default is 5 seconds.
  • kill_signal: defaults to SIGINT. Also accepts SIGTERM, SIGQUIT, SIGUSR1, SIGUSR2, SIGKILL, or SIGSTOP.

Thanks for looking into this :pray:

Still not seeing the logs, though. I’ll try to repro with a simpler example.

This might be because of the shell script and nothing to do with fly.

Edit: confirmed it works :partying_face:


In practice, I have seen the default behave as canary + immediate.

This is what happens:

  1. Canary boots, health checks pass
  2. Canary gets promoted, replacing old instance #1
  3. Old instance #2 stops
  4. Old instance #3 stops
  5. New instance #2 starts
  6. New instance #3 starts

After the first instance boots up, the rest of the instances shut down together, like this:

2021-12-27T05:55:15.088 app[0e49f967] mia [info] Main child exited with signal (with signal 'SIGINT', core dumped? false)
2021-12-27T05:55:15.088 app[0e49f967] mia [info] Starting clean up.
2021-12-27T05:55:15.158 app[d9752c87] dfw [info] Main child exited with signal (with signal 'SIGINT', core dumped? false)
2021-12-27T05:55:15.158 app[d9752c87] dfw [info] Starting clean up.
2021-12-27T05:55:15.241 app[0eefd5ed] mad [info] Main child exited with signal (with signal 'SIGINT', core dumped? false)
2021-12-27T05:55:15.242 app[0eefd5ed] mad [info] Starting clean up.

Is this a bug?
