I’ve been trying to push a new deployment with a blue-green strategy for the last two hours. The change I’m deploying is literally a one-line change, so I’m fairly confident this issue is unrelated to my change.
The status page says everything is cool, but that isn’t unusual when an outage has started. Anyone else hitting this?
Here are my logs:
Creating green machines
Created machine 2874e1dc397978 [app]
Created machine d891247b490238 [app]
Waiting for all green machines to start
Machine 2874e1dc397978 [app] - created
Machine d891247b490238 [app] - created
Machine 2874e1dc397978 [app] - created
Machine d891247b490238 [app] - started
Machine 2874e1dc397978 [app] - started
Machine d891247b490238 [app] - started
Waiting for all green machines to be healthy
Machine 2874e1dc397978 [app] - unchecked
Machine d891247b490238 [app] - unchecked
Machine 2874e1dc397978 [app] - unchecked
Machine d891247b490238 [app] - 0/1 passing
Machine 2874e1dc397978 [app] - unchecked
Machine d891247b490238 [app] - 1/1 passing
Machine 2874e1dc397978 [app] - 0/1 passing
Machine d891247b490238 [app] - 1/1 passing
Rolling back failed deployment
Deleted machine 2874e1dc397978 [app]
Deleted machine d891247b490238 [app]
Error: wait timeout
could not get all green machines to be healthy
Error: Process completed with exit code 1.
We are experiencing exactly the same issue with blue-green deployments. We noticed this behaviour around 9am UTC. We haven’t made any significant changes to code or config that would affect deployments. Deployments are failing due to health-check timeouts.
For testing purposes, we tried deploying with the immediate strategy, and that worked fairly smoothly; the health checks went green after some delay.
Hi @andrewmcgrath and everyone, thanks for sharing that. I’ve declared an incident and we are investigating this right now. As a workaround, you can temporarily deploy using a different strategy with fly deploy --strategy NAME, if your app can tolerate a short window while your machines switch between versions.
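For example, switching strategy just for this deploy looks something like this (the strategy names shown are the standard flyctl ones; pick whichever fits your app):

# Roll machines one at a time instead of blue-green
fly deploy --strategy rolling

# Or, if a brief gap between versions is acceptable, replace machines immediately
fly deploy --strategy immediate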
Last night support responded to a ticket I opened, suggesting this was due to capacity issues at ewr and that I should migrate part of my deployment to bos. Is this guidance still advised?
Same issue. Looking at the logs, the problem appeared to be that the health check kicked in about a second after the machine launched, well before my service was up, and then kept restarting the server in the same pattern, hitting the health-check endpoint only once at the very beginning of each restart cycle, before the service had a chance to become healthy.
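In case it helps anyone stuck in the same loop: a health-check grace period longer than the app’s boot time should, in principle, stop the checker from failing the machine before the service is even listening. Roughly something like this in fly.toml (the path and timings below are placeholders, not my actual config):

[[http_service.checks]]
  # don't count checks against the machine until the app has had time to boot
  grace_period = "30s"
  interval = "15s"
  timeout = "5s"
  method = "GET"
  path = "/healthz"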
Hey there, I’m getting health check timeouts regardless of the strategy (rolling, canary, etc.).
--> Build Summary: ()
--> Building image done
image: registry.fly.io/xxx:deployment-01JHNKJ0HZHN273P89AWR7X50D
image size: 274 MB
Watch your deployment at https://fly.io/apps/xxx/monitoring
Running xxx release_command: bin/rails fly:release
Starting machine
-------
✔ release_command e784457f07d438 completed successfully
-------
-------
Updating existing machines in 'xxx' with rolling strategy
-------
⠧ [1/2] Checking health of machine d891310c0d9628
⠧ [2/2] Acquired lease for 1857935b271568
This invariably leads to:
✖ [1/2] Unrecoverable error: timeout reached waiting for health checks to pass for machine d891310c0d9628: failed to get VM d891310c0d9628: Get "https://api.machines.dev/v1/apps/xxx-…
⠙ [2/2] Checking health of machine 1857935b271568
I was able to deploy once this morning, but I’ve been blocked ever since by this.
I’m still experiencing health-check issues for blue-green deployments in Narita (nrt):
bg_deployments_http critical
What’s funny is that with the rolling strategy it says the deploy failed, but it actually deploys.
It times out after hundreds of lease errors like:
2025-01-15T21:04:56Z proxy[4d89449f464958] nrt [error][PM01] machines API returned an error: "machine ID 4d89449f464958 lease currently held by e72dfff2-bdf4-5390-b013-778476e67d89@tokens.fly.io, expires at 2025-01-15T21:05:08Z"
Deploys are still down, and live machines are exhibiting abnormal CPU usage in some regions.
Error: failed to update machine 178159edce53e8: Unrecoverable error: timeout reached waiting for health checks to pass for machine 178159edce53e8: failed to get VM 178159edce53e8: Get "https://api.machines.dev/v1/apps/dawn-night-7975/machines/178159edce53e8": net/http: request canceled
fly machine restart 178159edce53e8
Restarting machine 178159edce53e8
Waiting for 178159edce53e8 to become healthy (started, 0/1)
Error: failed to restart machine 178159edce53e8: failed to wait for health checks to pass: context deadline exceeded
Hey folks, just wanted to add some details to provide a bit more clarity to this thread:
We identified a specific platform issue that had been causing health checks in blue-green deployments to fail, which we tracked down to a change that we had been slowly rolling out to a handful of regions over the last couple of days (specifically: scl, mia, bom, gig, bog, eze, gdl, yul, otp, and a small portion of sin at 2025-01-13T21:30:00Z, [edit: additionally followed by ewr, lax, lhr, hkg, jnb, arn, atl, bos, cdg, den, dfw at 2025-01-14T21:30:00Z]). We reverted the change in these regions and confirmed this fixed the issue. Other regions were not affected by this issue.
That said, I highly suspect that the majority of reported [edit: any remaining] issues in this thread are actually more directly related to the CPU Quotas Update that we initially announced last October and that we just completed rolling out yesterday. If your app is running on shared instances and uses a heavy amount of CPU on startup (more than the 1/16th or 6.25% of a core we allocate to each shared vCPU), your app may be taking longer to boot as a result of limited performance, which may impact deployments with health checks (since they may take longer to pass). You may need to adjust your health-check or deploy-wait timeouts, and/or scale up your instances to match your app’s workload.
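Concretely, that might look something like the examples below (the values and VM sizes are only illustrative, and flag spellings can differ between flyctl versions, so check fly deploy --help and fly scale --help):

# allow more time for health checks to pass during a deploy
fly deploy --wait-timeout 300

# or give the app more CPU headroom so it boots within the existing timeouts
fly scale vm shared-cpu-2x     # larger shared size
fly scale vm performance-1x    # dedicated vCPU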
Hope this extra info is helpful and unblocks those still experiencing various issues.
So I beg you to either keep looking or correct this post. I reported this to support last night and started this thread. I’m in yyz and ewr.
Mine wasn’t working; now it is. Nothing changed on my end. Later today I did switch to performance CPU units, but only after everything started working again; I didn’t want to make that change earlier in case the resize failed too.
Sorry about that. We followed up by double-checking our audit logs, and yesterday there was indeed another segment of the rollout to a broader batch of regions, which included ewr and 10 other regions. I’ll update my earlier post with these details.