App broken: could not find a good candidate within 90 attempts at load balancing.

jan-hesters · September 5, 2023, 4:22pm

Our app has also been broken in Amsterdam for almost 24 hours now:

2023-09-05T16:13:27.348 runner[286560eae70de8] ams [info] machine exited with exit code 0, not restarting

could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shutdown? is there an ongoing deployment with a volume or using the 'immediate' strategy? has your app's instances all reached their hard limit?)

And trying to redeploy the app fails, too:

$ fly deploy
==> Verifying app config
Validating /Users/foo/dev/app/remix/fly.toml
Platform: machines
✓ Configuration is valid
--> Verified app config
WARN DATABASE_URL may be a potentially sensitive environment variable. Consider setting it as a secret, and removing it from the [env] section: https://fly.io/docs/reference/secrets/

==> Building image
Waiting for remote builder fly-builder-falling-smoke-8492... 🌎
WARN The running flyctl agent (v0.1.81) is older than the current flyctl (v0.1.83).
WARN The out-of-date agent will be shut down along with existing wireguard connections. The new agent will start automatically as needed.
WARN Failed to start remote builder heartbeat: failed building options: agent: failed to start

Error: failed to fetch an image or build from source: error connecting to docker: failed building options: agent: failed to start
The agent failed to start with the following error log:



A copy of this log has been saved at /Users/foo/.fly/agent-logs/339000168.log

leslie · September 5, 2023, 7:08pm

I looked up the Machine ID from your logs and see that it’s on a host in ams that is down due to emergency maintenance. Unfortunately, apps with just one Machine running on the affected host will not be reachable until maintenance is complete. Running 2+ Machines is our recommendation to prevent app downtime in the event of host-side failures.

Let’s see if scaling up helps to bring your app back: fly scale count 2. This should start a new Machine on a different, healthy host in the region and hopefully get your app up and running again. Feel free to follow up with error logs if this doesn’t work for you as expected.
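As a rough sketch of that suggestion: the snippet below scales to two Machines and then prints the app status so you can check where the new Machine landed. The app name `my-app` is a placeholder, not something from this thread, and the `FLY` variable is an assumption added only so the script defaults to a dry run that prints the commands instead of running them.

```shell
# Sketch only: add a second Machine so one host outage does not take the app down.
# APP is a placeholder app name; replace it with your own.
APP="${APP:-my-app}"
# Dry run by default: commands are printed, not executed.
# Set FLY=fly in the environment to actually run them.
FLY="${FLY:-echo fly}"

scale_up() {
  # Ask for two Machines in total for the app.
  $FLY scale count 2 --app "$APP"
  # Show the Machines and their states afterwards.
  $FLY status --app "$APP"
}

scale_up
```

Run it once as-is to see the commands it would issue, then again with `FLY=fly` once they look right.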

A note to others reading this: if you’re seeing similar errors and want to know whether a downed host is affecting your app, check out your personal status page (also accessible from your org’s dashboard).

jan-hesters · September 6, 2023, 6:04am

Hi @leslie,

Thanks a ton for looking into it! I’m going to try and run fly scale count 2 now and report back to you what happens.

The machine’s status right now is still “stopping” for some reason.

If I run fly deploy today, I get this error:

$ fly deploy
==> Verifying app config
Validating /Users/jan/dev/summarai/remix-app/fly.toml
Platform: machines
✓ Configuration is valid
--> Verified app config
WARN DATABASE_URL may be a potentially sensitive environment variable. Consider setting it as a secret, and removing it from the [env] section: https://fly.io/docs/reference/secrets/

==> Building image
Remote builder fly-builder-falling-smoke-6210 ready
==> Building image with Docker
--> docker host: 20.10.12 linux x86_64
[+] Building 126.4s (23/23) FINISHED                                                                        
 => [internal] load build definition from Dockerfile                                                   0.1s
 => => transferring dockerfile: 1.16kB                                                                 0.1s
 => [internal] load .dockerignore                                                                      0.1s
 => => transferring context: 113B                                                                      0.1s
 => [internal] load metadata for docker.io/library/node:16-bullseye-slim                               0.7s
 => [internal] load build context                                                                     22.1s
 => => transferring context: 157.59MB                                                                 22.0s
 => [base 1/2] FROM docker.io/library/node:16-bullseye-slim@sha256:924a8c6672fcc9f1c3c91294733db26baf  3.0s
 => => resolve docker.io/library/node:16-bullseye-slim@sha256:924a8c6672fcc9f1c3c91294733db26baf58f71  0.0s
 => => sha256:14726c8f78342865030f97a8d3492e2d1a68fbd22778f9a31dc6be4b4f12a9bc 31.42MB / 31.42MB       0.4s
 => => sha256:52c04d0581ff2d21f10df4ba2d6a91cbea206133a6a43d526673b9d5570489e3 4.18kB / 4.18kB         0.0s
 => => sha256:2252379302474e3eda085e6d49c5a499927ae44f9e973bbea87a61a966ab50e9 35.27MB / 35.27MB       0.5s
 => => sha256:6412e873831329aa688a7c3e97f91a26e87ef80047d1f60646279c13ce6bdded 2.76MB / 2.76MB         0.1s
 => => sha256:596f295c67de9aa4ff45c03b3cda63d77be529ffa692eb5f7eb7f060eec9821c 448B / 448B             0.0s
 => => sha256:924a8c6672fcc9f1c3c91294733db26baf58f719d4b976480a65f58b7979ece6 1.21kB / 1.21kB         0.0s
 => => sha256:2bfb0ddaf13161e22e609c443ed312425446b225e97fba59e5e9aab63d3a7c07 1.37kB / 1.37kB         0.0s
 => => sha256:8bc82c955adeda4d02ba1e335ce720f0deca75090bb951e18b48f6dc236ceabe 7.02kB / 7.02kB         0.0s
 => => extracting sha256:14726c8f78342865030f97a8d3492e2d1a68fbd22778f9a31dc6be4b4f12a9bc              1.0s
 => => extracting sha256:52c04d0581ff2d21f10df4ba2d6a91cbea206133a6a43d526673b9d5570489e3              0.0s
 => => extracting sha256:2252379302474e3eda085e6d49c5a499927ae44f9e973bbea87a61a966ab50e9              1.1s
 => => extracting sha256:6412e873831329aa688a7c3e97f91a26e87ef80047d1f60646279c13ce6bdded              0.1s
 => => extracting sha256:596f295c67de9aa4ff45c03b3cda63d77be529ffa692eb5f7eb7f060eec9821c              0.0s
 => [base 2/2] RUN apt-get update && apt-get install -y openssl                                        6.0s
 => [build 1/7] RUN mkdir /app                                                                         0.4s 
 => [build 2/7] WORKDIR /app                                                                           0.0s 
 => [deps 3/4] ADD package.json package-lock.json ./                                                   0.3s 
 => [deps 4/4] RUN npm install --production=false                                                     68.6s 
 => [build 3/7] COPY --from=deps /app/node_modules /app/node_modules                                   7.8s 
 => [build 4/7] ADD prisma .                                                                           0.0s 
 => [production-deps 4/5] ADD package.json package-lock.json ./                                        0.0s 
 => [production-deps 5/5] RUN npm prune --production                                                   7.0s 
 => [build 5/7] RUN npx prisma generate                                                                3.1s 
 => [build 6/7] ADD . .                                                                                1.9s 
 => [build 7/7] RUN npm run build                                                                      6.8s
 => [stage-4 3/7] COPY --from=production-deps /app/node_modules /app/node_modules                      3.7s
 => [stage-4 4/7] COPY --from=build /app/node_modules/.prisma /app/node_modules/.prisma                0.0s 
 => [stage-4 5/7] COPY --from=build /app/build /app/build                                              0.0s 
 => [stage-4 6/7] COPY --from=build /app/public /app/public                                            0.1s 
 => [stage-4 7/7] ADD . .                                                                              1.3s 
 => exporting to image                                                                                 4.3s 
 => => exporting layers                                                                                4.3s 
 => => writing image sha256:0c18f2bedbf1b12a32765035ed800c45ac6108abef27a69e6799de854d51a8f5           0.0s 
 => => naming to registry.fly.io/summarai:deployment-01H9MH1XQ4722TDENF8QCKWWYS                        0.0s
--> Building image done
==> Pushing image to fly
The push refers to repository [registry.fly.io/summarai]
ca841d4b4e28: Pushed 
089d8f6ceb69: Pushed 
8928a95dd17c: Pushed 
6009194b5cc8: Pushed 
0f91c446d63a: Pushed 
5f70bf18a086: Pushed 
782b965d8e0e: Pushed 
62d891678503: Pushed 
f258e1ea3220: Pushed 
24f5da2e29b4: Pushed 
dfe128697ddb: Pushed 
b4a44411fb50: Pushed 
63290f9c9e52: Pushed 
deployment-01H9MH1XQ4722TDENF8QCKWWYS: digest: sha256:ca7acc88a9a0ffc9b99eb4a22fbb71df1a665ab61fec7a6a652a31ae123b70f8 size: 3051
--> Pushing image done
image: registry.fly.io/summarai:deployment-01H9MH1XQ4722TDENF8QCKWWYS
image size: 704 MB

Watch your deployment at https://fly.io/apps/summarai/monitoring

Updating existing machines in 'summarai' with rolling strategy
  [1/1] Updating 286560eae70de8 [app]
Error: failed to update VM 286560eae70de8: unknown: deploys to this host are temporarily disabled, please try again later or check the status page: https://status.flyio.net

Afterwards, the machine’s status was still “stopping”.

jan-hesters · September 11, 2023, 3:34pm

Running fly scale count 2 worked, but the other machine is stuck in a continuous “stopping” state and prevents deployments of new versions.

Could you please shut down the one stuck in that state? Or help me shut it down via the command line?

Beaux · September 12, 2023, 9:37am

I’m having the exact same issue. My app in the ams region has been completely broken for the past couple of days now, as you can see from my Uptime Kuma graph. I didn’t change anything myself, and it has been running fine for months now.

I’ve tried the following:

fly machine restart: didn’t fix anything
fly ssh console: it says Error: error connecting to SSH server: connect tcp ... operation timed out
fly scale count 0 and then fly scale count 1: didn’t fix anything
fly scale count 2: didn’t fix anything, both machines are now in a broken state

Any ideas? I need this to be up and running soon.

Update: found the problem. It turns out my app is waiting for my Postgres database, but the connection between the app and the database is broken for some reason…

I still can’t ssh into my main app though. I’ve tried fly wireguard reset, but it says Error: upstream service is unavailable

Update 2: It seems this was an outage in the ams region, which has now been resolved. See the topic “Can't reach database” (#ams #flycast).

system · September 19, 2023, 9:37am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
