How do I eliminate the `~20s` downtime when running `flyctl deploy --strategy bluegreen`?

I’m running an Express server with an attached PostgreSQL database. Whenever I deploy with --strategy bluegreen and keep refreshing the URL during the deploy, I notice a short downtime of ~20s. I’m building an API, so that’s unacceptable for me. I’d love to learn how to eliminate this downtime. Oh yeah, I’m using a custom Dockerfile.

First, I deploy a new version of my server with flyctl deploy --strategy bluegreen. While it’s deploying I watch flyctl status --watch, and eventually I see the new instance start up. It passes the health checks, both instances run simultaneously for a few seconds, and then the old instance shuts down. What I’ve noticed in the browser is that my requests keep hitting the old instance even after the new instance has passed its health checks. When the old instance shuts down, new requests to the URL hang for about ~20s before being processed by the new instance. I’d like to eliminate those 20 seconds and simply switch traffic over to the new instance as soon as it passes health checks, and only then shut down the old instance. How do I accomplish this?
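Is there anything I should be doing on the Express side to help here, e.g. draining connections when the old instance receives SIGINT? Something along these lines is what I have in mind (a simplified sketch, not my actual code; the /healthz path is just an example):

const express = require("express");

const app = express();
app.get("/healthz", (_req, res) => res.sendStatus(200)); // example health endpoint

const server = app.listen(process.env.PORT || 8080);

// On SIGINT (the signal my fly.toml sends on shutdown), stop accepting new
// connections and let in-flight requests finish before exiting.
process.on("SIGINT", () => {
  server.close(() => process.exit(0));
});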

I’ve attached my fly.toml file below:

app = "my-fly-app"

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]
  PORT = "8080"

[deploy]
  release_command = "npx prisma migrate deploy"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "5s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"
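One thing I’ve been wondering: would replacing the empty http_checks = [] with an actual HTTP check make the proxy cut traffic over sooner? Something like this is what I had in mind, though I haven’t verified the exact fields and the /healthz path is just a placeholder for whatever health endpoint the app exposes:

  [[services.http_checks]]
    grace_period = "5s"
    interval = "5s"
    method = "get"
    path = "/healthz"
    protocol = "http"
    timeout = "2s"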

Hey, sorry for the delay. I caught you on Discord, but will respond here as well for the benefit of the community.

I tested this in AMS with a production Rails app using the following setup:

  • dedicated-cpu-1x VM size
  • fly scale count 4 --max-per-region 8
  • fly deploy --strategy bluegreen

I saw no downtime while hitting the app with a load testing tool, nor did I see anything in metrics to suggest a change in response times.
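If you want to reproduce this yourself, even a simple curl loop against the app URL while a deploy is running will make any gap visible, e.g. (the app name here is a placeholder):

while true; do curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" https://my-fly-app.fly.dev/; sleep 1; done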

What kind of request volume do you have?

Hi, sorry for reviving this old thread, but I’m experiencing the same issue today, exactly as OP described.

  1. Deploy w/ strategy = “bluegreen”
  2. Wait until the “green” deployment is successful
  3. Fly destroys the “blue” deployment
  4. Reloading the browser causes the app to hang for ~20-30 seconds because it’s still trying to communicate with the “blue” deployment

Error in the logs:
machine is in a non-startable state: destroyed
After about 30 seconds, the browser request will route to the new “green” deployment.


Same here. In my case I have a Cloudflare Worker in between that uses the fly.dev domain to fetch the origin server.

The reproduction shows ~8 seconds of downtime / slow responses for each deployment, even with the canary or bluegreen strategy.

Fork with bluegreen strategy: GitHub - remorses/fly-deployment-latency
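For context, the Worker does little more than forward the incoming request to the fly.dev origin, roughly along these lines (simplified; the app name is a placeholder):

export default {
  async fetch(request) {
    // Point the incoming request at the Fly origin and pass it through unchanged.
    const url = new URL(request.url);
    url.hostname = "my-fly-app.fly.dev"; // placeholder origin
    return fetch(new Request(url, request));
  },
};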

Is that downtime or slow responses? I can’t quite tell from the chart here.

This should have been much improved in the past few weeks as we’ve moved to a new state propagation system.

We’re still working on the various components in the deployment path to further reduce the slowness experienced during deploys. Ideally there would be none, but a large refactor is required to bundle deployments at a higher level.

Right now, deployments are orchestrated by flyctl (the client), which updates machines according to the deployment strategy. From our systems’ standpoint, machines are being stopped and started and their services are being deleted and re-added. It’s possible for a node to receive all the “stops” and “deletes” before receiving any of the “starts” and “creates”. If we had a higher-level structure for deployments, or if these changes were all bundled into a “transaction”, we could probably do better.
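To illustrate the ordering problem (this is not our actual code, just the shape of it): if a node applies service updates in the order they arrive, a reordered stream can leave a window with no routable instance at all:

// Illustrative only: a node applying machine/service updates in arrival order.
const backends = new Set(["blue"]);

function apply(event) {
  if (event.op === "delete") backends.delete(event.machine);
  if (event.op === "create") backends.add(event.machine);
  console.log(`after ${event.op} ${event.machine}:`, [...backends]);
}

// Intended order: create "green", then delete "blue".
// A node can observe the opposite order:
apply({ op: "delete", machine: "blue" });  // zero backends -> requests queue or time out
apply({ op: "create", machine: "green" }); // routing recovers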

If you are getting errors (that aren’t timeouts), then that’s another matter. Timeouts aren’t great either, but at least they’d be in line with my explanation :sweat_smile:

Requests get very slow, sometimes 20+ seconds. Thank you for the explanation.