How do I eliminate the `~20s` downtime when running `flyctl deploy --strategy bluegreen`?

I’m running an Express server with an attached PostgreSQL database. Whenever I deploy with --strategy bluegreen and keep refreshing the URL during the deploy, I see a short downtime of ~20s. I’m building an API, so this is unacceptable, and I’d love to learn how to eliminate it. I’m also using a custom Dockerfile.

First, I deploy a new version of my server with flyctl deploy --strategy bluegreen. I watch flyctl status --watch while it’s deploying, and eventually I see the new instance start up. It passes the health checks, and then both instances run simultaneously for a few seconds, after which the old instance shuts down. What I’ve noticed in the browser is that my requests keep hitting the old instance even after the new instance has passed its health checks. When the old instance shuts down, new requests to the URL hang for ~20s before being processed by the new instance. I’d like to eliminate these 20 seconds: switch traffic over to the new instance as soon as it’s passing health checks, and only then shut down the old instance. How do I accomplish this?
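For anyone wanting to put a number on the gap described above rather than refreshing by hand, here is a minimal sketch of a probe script. It is an illustration under stated assumptions, not anything Fly provides: the names `longestOutageMs`, `probeOnce`, and `main`, the `APP_URL` environment variable, the 500 ms interval, and the 2 s timeout are all hypothetical choices, and it assumes Node 18+ for the built-in `fetch`.

```javascript
// Hypothetical deploy-gap probe. All names and thresholds are illustrative.

// Given probe samples ({ t: start time in ms, ok: boolean }) sorted by t,
// return the longest span in ms between the last successful probe before a
// run of failures and the first successful probe after it.
// Failure runs with no success before or after them are ignored.
function longestOutageMs(samples) {
  let longest = 0;
  let lastOk = null; // start time of the most recent successful probe
  let sawFailure = false; // true while we are inside a run of failures
  for (const { t, ok } of samples) {
    if (ok) {
      if (sawFailure && lastOk !== null) {
        longest = Math.max(longest, t - lastOk);
      }
      lastOk = t;
      sawFailure = false;
    } else {
      sawFailure = true;
    }
  }
  return longest;
}

// Send one request. A non-2xx status, a network error, or no response
// within timeoutMs all count as a failed probe.
async function probeOnce(url, timeoutMs) {
  const t = Date.now();
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return { t, ok: res.ok };
  } catch {
    return { t, ok: false };
  }
}

// Start a probe every 500 ms until Ctrl-C, then print the longest gap.
function main(url) {
  const samples = [];
  const timer = setInterval(() => {
    probeOnce(url, 2000).then((s) => samples.push(s));
  }, 500);
  process.on("SIGINT", () => {
    clearInterval(timer);
    // Probes finish out of order, so sort by start time first.
    samples.sort((a, b) => a.t - b.t);
    console.log(`probes: ${samples.length}, longest outage: ${longestOutageMs(samples)} ms`);
    process.exit(0);
  });
}

// Only runs when a target URL is supplied via the APP_URL variable.
if (process.env.APP_URL) {
  main(process.env.APP_URL);
}
```

It could be started in one terminal with `APP_URL` set to the app's URL, the deploy run in another, and the probe stopped with Ctrl-C once the old instance is gone. Because a hanging request counts as a failure after 2 s, a ~20s hang should show up as an outage of roughly that length.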

I’ve attached my fly.toml file below:

app = "my-fly-app"

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]
PORT = "8080"

[deploy]
  release_command = "npx prisma migrate deploy"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "5s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

Hey, sorry for the delay. I caught you on Discord, but I’ll respond here as well for the benefit of the community.

I tested this in AMS with a production Rails app with the following config:

  • dedicated-cpu-1x VM size
  • fly scale count 4 --max-per-region 8
  • fly deploy --strategy bluegreen

I saw no downtime while hitting the app with a load testing tool, nor did I see anything in metrics to suggest a change in response times.

What kind of request volume do you have?