Old instance stopped before new one was healthy on bluegreen deploy

I’m deploying an app with the bluegreen strategy and this happened today: the running, healthy instance stopped before the new one even started. Here are the logs from the monitoring panel:

2022-12-07T21:01:15.405 runner[9c2bfbc9] iad [info] Starting instance
2022-12-07T21:01:49.848 runner[59ff838f] iad [info] Shutting down virtual machine
2022-12-07T21:01:49.882 app[59ff838f] iad [info] Sending signal SIGINT to main child process w/ PID 521
2022-12-07T21:01:49.883 app[59ff838f] iad [info] 21:01:49 system | SIGINT received
2022-12-07T21:01:49.884 app[59ff838f] iad [info] 21:01:49 system | sending SIGTERM to webserver.1 (pid 570)
2022-12-07T21:01:49.888 app[59ff838f] iad [info] 21:01:49 system | sending SIGTERM to papertrail.1 (pid 571)
2022-12-07T21:01:49.891 app[59ff838f] iad [info] 21:01:49 webserver.1 | [2022-12-07 21:01:49 +0000] [588] [INFO] Worker exiting (pid: 588)
2022-12-07T21:01:49.891 app[59ff838f] iad [info] 21:01:49 webserver.1 | [2022-12-07 21:01:49 +0000] [589] [INFO] Worker exiting (pid: 589)
2022-12-07T21:01:49.898 app[59ff838f] iad [info] 21:01:49 webserver.1 | [2022-12-07 21:01:49 +0000] [580] [INFO] Handling signal: term
2022-12-07T21:01:49.908 app[59ff838f] iad [info] 21:01:49 system | papertrail.1 stopped (rc=-15)
2022-12-07T21:01:49.993 app[59ff838f] iad [info] 21:01:49 webserver.1 | Sentry is attempting to send 1 pending error messages
2022-12-07T21:01:49.994 app[59ff838f] iad [info] 21:01:49 webserver.1 | Sentry is attempting to send 1 pending error messages
2022-12-07T21:01:49.994 app[59ff838f] iad [info] 21:01:49 webserver.1 | Waiting up to 2 seconds
2022-12-07T21:01:49.994 app[59ff838f] iad [info] 21:01:49 webserver.1 | Waiting up to 2 seconds
2022-12-07T21:01:49.994 app[59ff838f] iad [info] 21:01:49 webserver.1 | Press Ctrl-C to quit
2022-12-07T21:01:49.994 app[59ff838f] iad [info] 21:01:49 webserver.1 | Press Ctrl-C to quit
2022-12-07T21:01:51.502 app[59ff838f] iad [info] 21:01:51 webserver.1 | [2022-12-07 21:01:51 +0000] [580] [INFO] Shutting down: Master
2022-12-07T21:01:51.529 app[59ff838f] iad [info] 21:01:51 system | webserver.1 stopped (rc=-15)
2022-12-07T21:01:52.296 app[59ff838f] iad [info] Starting clean up.
Error: could not find an instance to route to
Error: could not find an instance to route to
Error: could not find an instance to route to
Error: could not find an instance to route to
2022-12-07T21:11:45.347 runner[9c2bfbc9] iad [info] Configuring virtual machine
2022-12-07T21:11:45.348 runner[9c2bfbc9] iad [info] Pulling container image
2022-12-07T21:11:45.448 runner[9c2bfbc9] iad [info] Unpacking image
2022-12-07T21:11:45.471 runner[9c2bfbc9] iad [info] Preparing kernel init
2022-12-07T21:11:54.524 runner[9c2bfbc9] iad [info] Configuring firecracker
2022-12-07T21:11:54.609 runner[9c2bfbc9] iad [info] Starting virtual machine
2022-12-07T21:11:55.469 app[9c2bfbc9] iad [info] Starting init (commit: f447594)...
2022-12-07T21:11:55.527 app[9c2bfbc9] iad [info] Preparing to run: `/app/entrypoint honcho -f /app/fly/Procfile.webserver start` as root
2022-12-07T21:11:55.611 app[9c2bfbc9] iad [info] 2022/12/07 21:11:55 listening on [fdaa:0:b246:a7b:93:9c2b:fbc9:2]:22 (DNS: [fdaa::3]:53)

From the logs I can see that instance 9c2bfbc9 started (or tried to start) and then instance 59ff838f began shutting down, but with bluegreen the old instance should only stop once the new one is healthy.
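For comparison, this is roughly what I'd expect to be able to observe mid-deploy if bluegreen were behaving as documented (regular flyctl commands; the comments describe what I'd expect to see, so treat this as a sketch):

# While the new VM (9c2bfbc9) is coming up, the old one (59ff838f)
# should still be listed as running:
fly status
# And the old VM should only get the shutdown signal after the new
# instance's health checks are passing:
fly checks list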

This is the fly.toml file:

# fly.toml file generated for sunflower-webserver-stg on 2022-10-05T19:13:17-03:00

kill_signal = "SIGINT"
kill_timeout = 120
processes = []

[deploy]
  release_command = "/app/scripts/migrate"
  strategy = "bluegreen"

[env]
  AWS_DEFAULT_REGION = "us-east-1"
  DJANGO_SETTINGS_MODULE = "config.settings.production"

[experimental]
  allowed_public_ports = []
  auto_rollback = true
  cmd = ["honcho", "-f", "/app/fly/Procfile.webserver", "start"]

[[services]]
  internal_port = 5000
  processes = ["app"]
  protocol = "tcp"
  script_checks = []
  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.http_checks]]
    interval = "30s"
    grace_period = "60s"
    method = "get"
    path = "/auth/login/"
    protocol = "http"
    restart_limit = 0
    timeout = "10s"
    tls_skip_verify = false
    [services.http_checks.headers]
      "X-Forwarded-Proto" = "https"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

Does this behave differently if you pass the strategy on the command line with fly deploy --strategy bluegreen? Because from the logs it looks like it actually did a rolling deploy.

I run the deploy on CI using fly deploy (no strategy option, since it's in the config file) and it has always worked, and is working now; just that one deploy failed and we ended up with no running instance.
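For reference, the CI step is essentially just the following (a sketch; the real pipeline also sets FLY_API_TOKEN, but there is nothing deploy-related beyond this):

# What CI runs; the strategy comes from the [deploy] section of fly.toml:
fly deploy

# The explicit form I'm asking about above:
fly deploy --strategy bluegreen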
It could be just a hiccup on Fly's side, but I don't see anything on the status page.