I’ve been trying to push a new deployment with a blue-green strategy for the last two hours. The change I’m deploying is literally a one-line change, so I’m fairly confident this issue is unrelated to my change.
The status page says everything is cool, but that isn’t unusual when an outage has started. Anyone else hitting this?
Here are my logs:
Creating green machines
Created machine 2874e1dc397978 [app]
Created machine d891247b490238 [app]
Waiting for all green machines to start
Machine 2874e1dc397978 [app] - created
Machine d891247b490238 [app] - created
Machine 2874e1dc397978 [app] - created
Machine d891247b490238 [app] - started
Machine 2874e1dc397978 [app] - started
Machine d891247b490238 [app] - started
Waiting for all green machines to be healthy
Machine 2874e1dc397978 [app] - unchecked
Machine d891247b490238 [app] - unchecked
Machine 2874e1dc397978 [app] - unchecked
Machine d891247b490238 [app] - 0/1 passing
Machine 2874e1dc397978 [app] - unchecked
Machine d891247b490238 [app] - 1/1 passing
Machine 2874e1dc397978 [app] - 0/1 passing
Machine d891247b490238 [app] - 1/1 passing
Rolling back failed deployment
Deleted machine 2874e1dc397978 [app]
Deleted machine d891247b490238 [app]
Error: wait timeout
could not get all green machines to be healthy
Error: Process completed with exit code 1.
We are experiencing exactly the same issue with blue-green deployments. We noticed this behaviour around 9am UTC. We haven’t made any significant changes to code or config that would affect deployments. Deployments are failing due to health-check timeouts.
For testing purposes, we tried deploying with the immediate strategy, and that worked fairly smoothly; the health checks went green after some delay.
Hi @andrewmcgrath and everyone, thanks for sharing that. I’ve declared an incident and we are investigating this right now. As a workaround, you can temporarily deploy using a different strategy with fly deploy --strategy NAME, if your app can tolerate a short window while your machines switch between versions.
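For example, switching strategy just for this deploy looks something like this (the strategy names shown are the standard flyctl ones; pick whichever fits your app):

# Roll machines one at a time instead of blue-green
fly deploy --strategy rolling

# Or, if a brief gap between versions is acceptable, replace machines immediately
fly deploy --strategy immediate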
Last night support responded to a ticket I opened, suggesting this was due to capacity issues at ewr and that I should migrate part of my deployment to bos. Is this guidance still advised?
Same issue. Looking at the logs, the problem appeared to be that the health check kicked in about a second after the machine launched, well before my service was up, and then kept restarting the server in the same pattern, hitting the health-check endpoint only once at the very beginning of each restart cycle, before the service had a chance to become healthy.
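In case it helps anyone stuck in the same loop: a health-check grace period longer than the app’s boot time should, in principle, stop the checker from failing the machine before the service is even listening. Roughly something like this in fly.toml (the path and timings below are placeholders, not my actual config):

[[http_service.checks]]
  # don't count checks against the machine until the app has had time to boot
  grace_period = "30s"
  interval = "15s"
  timeout = "5s"
  method = "GET"
  path = "/healthz"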
Hey there, I’m getting health check timeouts regardless of the strategy (rolling, canary, etc.).
--> Build Summary: ()
--> Building image done
image: registry.fly.io/xxx:deployment-01JHNKJ0HZHN273P89AWR7X50D
image size: 274 MB
Watch your deployment at https://fly.io/apps/xxx/monitoring
Running xxx release_command: bin/rails fly:release
Starting machine
-------
✔ release_command e784457f07d438 completed successfully
-------
-------
Updating existing machines in 'xxx' with rolling strategy
-------
⠧ [1/2] Checking health of machine d891310c0d9628
⠧ [2/2] Acquired lease for 1857935b271568
This invariably leads to:
✖ [1/2] Unrecoverable error: timeout reached waiting for health checks to pass for machine d891310c0d9628: failed to get VM d891310c0d9628: Get "https://api.machines.dev/v1/apps/xxx-…
⠙ [2/2] Checking health of machine 1857935b271568
I was able to deploy once this morning, but I’ve been blocked ever since by this.
I’m still experiencing health-check issues for blue-green deployments in Narita (nrt):
bg_deployments_http critical
What’s funny is that with the rolling strategy it says the deploy failed, but it actually deploys.
It times out after hundreds of lease errors like:
2025-01-15T21:04:56Z proxy[4d89449f464958] nrt [error][PM01] machines API returned an error: "machine ID 4d89449f464958 lease currently held by e72dfff2-bdf4-5390-b013-778476e67d89@tokens.fly.io, expires at 2025-01-15T21:05:08Z"
Deploys are still down, and live machines are exhibiting abnormal CPU usage in some regions.
Error: failed to update machine 178159edce53e8: Unrecoverable error: timeout reached waiting for health checks to pass for machine 178159edce53e8: failed to get VM 178159edce53e8: Get "https://api.machines.dev/v1/apps/dawn-night-7975/machines/178159edce53e8": net/http: request canceled
fly machine restart 178159edce53e8
Restarting machine 178159edce53e8
Waiting for 178159edce53e8 to become healthy (started, 0/1)
Error: failed to restart machine 178159edce53e8: failed to wait for health checks to pass: context deadline exceeded
Hey folks, just wanted to add some details to provide a bit more clarity to this thread:
We identified a specific platform issue that had been causing health checks in blue-green deployments to fail, which we tracked down to a change that we had been slowly rolling out to a handful of regions over the last couple of days (specifically: scl, mia, bom, gig, bog, eze, gdl, yul, otp, and a small portion of sin at 2025-01-13T21:30:00Z, [edit: additionally followed by ewr, lax, lhr, hkg, jnb, arn, atl, bos, cdg, den, dfw at 2025-01-14T21:30:00Z]). We reverted the change in these regions and confirmed this fixed the issue. Other regions were not affected by this issue.
That said, I highly suspect that the majority of reported [edit: any remaining] issues in this thread are actually more directly related to the CPU Quotas Update that we initially announced last October and that we just completed rolling out yesterday. If your app is running on shared instances and uses a heavy amount of CPU on startup (more than the 1/16th or 6.25% of a core we allocate to each shared vCPU), your app may be taking longer to boot as a result of limited performance, which may impact deployments with health checks (since they may take longer to pass). You may need to adjust your health-check or deploy-wait timeouts, and/or scale up your instances to match your app’s workload.
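Concretely, that might look something like the examples below (the values and VM sizes are only illustrative, and flag spellings can differ between flyctl versions, so check fly deploy --help and fly scale --help):

# allow more time for health checks to pass during a deploy
fly deploy --wait-timeout 300

# or give the app more CPU headroom so it boots within the existing timeouts
fly scale vm shared-cpu-2x     # larger shared size
fly scale vm performance-1x    # dedicated vCPU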
Hope this extra info is helpful and unblocks those still experiencing various issues.
So I beg you to either keep looking or correct this post. I reported this to support last night and started this thread. I’m in yyz and ewr.
Mine wasn’t working; now it is. Nothing changed on my end. Later today I did switch to performance CPU units, but only after everything started working again; I didn’t want to make that change earlier in case the resize failed too.
Sorry about that. We followed up by double-checking our audit logs, and yesterday there was indeed another segment of the rollout to a broader batch of regions, which included ewr and 10 other regions. I’ll update my earlier post with these details.