How do I eliminate the `~20s` downtime when running `flyctl deploy --strategy bluegreen`?

I’m running an Express server with an attached PostgreSQL database. Whenever I deploy with --strategy bluegreen and keep refreshing the URL during the deploy, I notice a short downtime of ~20s. I’m building an API, so that’s unacceptable for me. I’d love to learn how to eliminate this downtime. Oh yeah, I’m using a custom Dockerfile.

First, I deploy a new version of my server with flyctl deploy --strategy bluegreen. While it’s deploying I watch flyctl status --watch, and eventually I see the new instance start up. It passes the health checks, both instances run simultaneously for a few seconds, and then the old instance shuts down. What I’ve noticed in the browser is that my requests keep hitting the old instance even after the new instance has passed its health checks. When the old instance shuts down, new requests to the URL hang for about ~20s before being processed by the new instance. I’d like to eliminate those 20 seconds and simply switch traffic over to the new instance as soon as it passes health checks, and only then shut down the old instance. How do I accomplish this?
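Is there anything I should be doing on the Express side to help here, e.g. draining connections when the old instance receives SIGINT? Something along these lines is what I have in mind (a simplified sketch, not my actual code; the /healthz path is just an example):

const express = require("express");

const app = express();
app.get("/healthz", (_req, res) => res.sendStatus(200)); // example health endpoint

const server = app.listen(process.env.PORT || 8080);

// On SIGINT (the signal my fly.toml sends on shutdown), stop accepting new
// connections and let in-flight requests finish before exiting.
process.on("SIGINT", () => {
  server.close(() => process.exit(0));
});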

I’ve attached my fly.toml file below:

app = "my-fly-app"

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[env]
  PORT = "8080"

[deploy]
  release_command = "npx prisma migrate deploy"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "5s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"
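One thing I’ve been wondering: would replacing the empty http_checks = [] with an actual HTTP check make the proxy cut traffic over sooner? Something like this is what I had in mind, though I haven’t verified the exact fields and the /healthz path is just a placeholder for whatever health endpoint the app exposes:

  [[services.http_checks]]
    grace_period = "5s"
    interval = "5s"
    method = "get"
    path = "/healthz"
    protocol = "http"
    timeout = "2s"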

Hey, sorry for the delay. I caught you on Discord, but will respond here as well for the benefit of the community.

I tested this in AMS with a production Rails app using the following setup:

  • dedicated-cpu-1x VM size
  • fly scale count 4 --max-per-region 8
  • fly deploy --strategy bluegreen

I saw no downtime while hitting the app with a load testing tool, nor did I see anything in metrics to suggest a change in response times.
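If you want to reproduce this yourself, even a simple curl loop against the app URL while a deploy is running will make any gap visible, e.g. (the app name here is a placeholder):

while true; do curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" https://my-fly-app.fly.dev/; sleep 1; done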

What kind of request volume do you have?

Hi, sorry for reviving this old thread, but I’m experiencing the same issue today, exactly as OP described.

  1. Deploy w/ strategy = “bluegreen”
  2. Wait until the “green” deployment is successful
  3. Fly destroys the “blue” deployment
  4. Reloading the browser causes the app to hang for ~20-30 seconds because it’s still trying to communicate with the “blue” deployment

Error in the logs:
machine is in a non-startable state: destroyed
After about 30 seconds, the browser request will route to the new “green” deployment.


Same here. In my case I have a Cloudflare Worker in between that uses the fly.dev domain to fetch the origin server.

The reproduction shows ~8 seconds of downtime / slow responses for each deployment, even with the canary or bluegreen strategy.

Fork with bluegreen strategy: GitHub - remorses/fly-deployment-latency
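For context, the Worker does little more than forward the incoming request to the fly.dev origin, roughly along these lines (simplified; the app name is a placeholder):

export default {
  async fetch(request) {
    // Point the incoming request at the Fly origin and pass it through unchanged.
    const url = new URL(request.url);
    url.hostname = "my-fly-app.fly.dev"; // placeholder origin
    return fetch(new Request(url, request));
  },
};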

Is that downtime or slow responses? I can’t quite tell from the chart here.

This should have been much improved in the past few weeks as we’ve moved to a new state propagation system.

We’re still working on the various components in the deployment path to further reduce the slowness experienced during deploys. Ideally there would be none, but a large refactor is required to bundle deployments at a higher level.

Right now, deployments are orchestrated by flyctl (the client), which updates machines according to the deployment strategy. From our systems’ standpoint, machines are being stopped and started and their services are being deleted and re-added. It’s possible for a node to receive all the “stops” and “deletes” before receiving any of the “starts” and “creates”. If we had a higher-level structure for deployments, or if these changes were all bundled into a “transaction”, we could probably do better.
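To illustrate the ordering problem (this is not our actual code, just the shape of it): if a node applies service updates in the order they arrive, a reordered stream can leave a window with no routable instance at all:

// Illustrative only: a node applying machine/service updates in arrival order.
const backends = new Set(["blue"]);

function apply(event) {
  if (event.op === "delete") backends.delete(event.machine);
  if (event.op === "create") backends.add(event.machine);
  console.log(`after ${event.op} ${event.machine}:`, [...backends]);
}

// Intended order: create "green", then delete "blue".
// A node can observe the opposite order:
apply({ op: "delete", machine: "blue" });  // zero backends -> requests queue or time out
apply({ op: "create", machine: "green" }); // routing recovers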

If you are getting errors (that aren’t timeouts), then that’s another matter. Timeouts aren’t great either, but at least they’d be in line with my explanation :sweat_smile:

Requests get very slow, sometimes 20+ seconds. Thank you for the explanation.