Getting 502 with "could not find a good candidate within 90 attempts at load balancing" despite healthcheck passing

gitnik · March 31, 2024, 11:46am

I am running a remix app. Using the prod artifacts locally works fine.

No component on fly.io reports an error. The machine is working fine, the logs are fine, the healthchecks are passing.

And yet I’m getting a 502 when accessing the page:
https://dealday-dev.fly.dev/

In the logs I can see the following:

could not find a good candidate within 90 attempts at load balancing

The 502 happens after a timeout, so it can’t be my app failing.

I also tried scaling to 2 instances but that didn’t change anything.

Here’s the fly.toml:

app = 'dealday-dev'
primary_region = 'ams'
kill_signal = 'SIGINT'
kill_timeout = '5s'

[experimental]
  cmd = ['start.sh']
  entrypoint = ['sh']
  auto_rollback = true

[build]

[env]
  PORT = '8080'
  SITE_URL = 'https://dealday-dev.fly.dev'

[[mounts]]
  source = 'data'
  destination = '/data'

[http_service]
  internal_port = 3000
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']

[[services]]
  protocol = 'tcp'
  internal_port = 8080
  processes = ['app']

[[services.ports]]
    port = 80
    handlers = ['http']
    force_https = true

[[services.ports]]
    port = 443
    handlers = ['tls', 'http']

  [services.concurrency]
    type = 'connections'
    hard_limit = 25
    soft_limit = 20

[[services.tcp_checks]]
    interval = '15s'
    timeout = '2s'
    grace_period = '1s'

[[services.http_checks]]
    interval = '10s'
    timeout = '2s'
    grace_period = '5s'
    method = 'get'
    path = '/healthcheck'
    protocol = 'http'
    tls_skip_verify = false

[[vm]]
  size = 'shared-cpu-1x'
  memory = '512mb'
  cpu_kind = 'shared'
  cpus = 1

andie · March 31, 2024, 3:32pm

hi @gitnik

It looks like you have 2 services configured that are listening on ports 80 and 443: one in the [http_service] section using internal port 3000 and another in [[services]] using internal port 8080. The [http_service] section is like a shortcut for services that listen on ports 80 and 443, so you don’t need both. See Fly Launch configuration (fly.toml) · Fly Docs.

You can delete the [http_service] section and make sure you set the [[services]] internal_port to whatever your app’s port should be. If you change it from 8080, make sure you change the [env] section too.

I’m curious how you got your fly.toml file? Did you use a fly.toml from another source or did you modify a fly.toml that was generated when you ran fly launch? We want to make sure this works better for you!
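
To make the conflict concrete, here is a minimal TypeScript sketch that flags the two problems in the first fly.toml above. The `FlyPorts` shape and the `findPortProblems` helper are hypothetical, invented for illustration; they are not part of flyctl, and the values have to be copied from fly.toml by hand.

```typescript
// Assumed shape: only the fly.toml fields relevant to this check.
interface FlyPorts {
  envPort?: string;         // [env] PORT
  httpServicePort?: number; // [http_service] internal_port
  servicesPorts: number[];  // internal_port of each [[services]] entry
}

// Hypothetical helper: list disagreements between the configured ports.
export function findPortProblems(cfg: FlyPorts): string[] {
  const problems: string[] = [];

  // [http_service] is a shortcut for a service on ports 80 and 443,
  // so defining it alongside [[services]] duplicates the definition.
  if (cfg.httpServicePort !== undefined && cfg.servicesPorts.length > 0) {
    problems.push("both [http_service] and [[services]] are defined; keep one");
  }

  // Every internal_port should be the port the app actually listens on.
  const internalPorts = [
    ...(cfg.httpServicePort !== undefined ? [cfg.httpServicePort] : []),
    ...cfg.servicesPorts,
  ];
  for (const port of internalPorts) {
    if (cfg.envPort !== undefined && Number(cfg.envPort) !== port) {
      problems.push(`internal_port ${port} does not match PORT=${cfg.envPort}`);
    }
  }
  return problems;
}

// Values from the first fly.toml in this thread.
console.log(
  findPortProblems({ envPort: "8080", httpServicePort: 3000, servicesPorts: [8080] })
);
```

For that first config the helper reports two problems: the duplicated service definition, and the [http_service] internal_port of 3000 disagreeing with PORT=8080.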

gitnik · April 1, 2024, 11:32am

Hi @andie,

I got it from another repo.

Whenever I delete http_service, flyctl will add it back automatically.

I changed the app to listen on 3000 but still get the same result (502). And with this config my healthchecks are not even being called anymore. But when I ssh into the machine, I can curl the app just fine.

app = 'dealday'
primary_region = 'ams'
kill_signal = 'SIGINT'
kill_timeout = '5s'

[experimental]
  cmd = ['start.sh']
  entrypoint = ['sh']
  auto_rollback = true

[build]

[env]
  PORT = '3000'
  SITE_URL = 'https://dealday.fly.dev'

[[mounts]]
  source = 'data'
  destination = '/data'

[http_service]
  internal_port = 3000
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']

[[services]]
  protocol = 'tcp'
  internal_port = 3000
  processes = ['app']

[[services.ports]]
    port = 80
    handlers = ['http']
    force_https = true

[[services.ports]]
    port = 443
    handlers = ['tls', 'http']

  [services.concurrency]
    type = 'connections'
    hard_limit = 25
    soft_limit = 20

[[services.http_checks]]
    interval = '10s'
    timeout = '2s'
    grace_period = '5s'
    method = 'get'
    path = '/healthcheck'
    protocol = 'http'
    tls_skip_verify = false

[[vm]]
  size = 'shared-cpu-1x'
  memory = '512mb'
  cpu_kind = 'shared'
  cpus = 1

I also tried the opposite, where I only use the http_service, but that doesn’t work either, except that now I’m getting a 503. The app automatically scales down, so my requests don’t seem to be reaching the app at all:

app = 'dealday'
primary_region = 'ams'
kill_signal = 'SIGINT'
kill_timeout = '5s'

[experimental]
  cmd = ['start.sh']
  entrypoint = ['sh']
  auto_rollback = true

[build]

[env]
  PORT = '3000'
  SITE_URL = 'https://dealday.fly.dev'

[[mounts]]
  source = 'data'
  destination = '/data'

[http_service]
  internal_port = 3000
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']

  [http_service.concurrency]
    type = 'connections'
    hard_limit = 25
    soft_limit = 20

[[http_service.checks]]
    interval = '30s'
    timeout = '2s'
    grace_period = '10s'
    method = 'GET'
    path = '/healthcheck'

[[vm]]
  size = 'shared-cpu-1x'
  memory = '512mb'
  cpu_kind = 'shared'
  cpus = 1

andie · April 1, 2024, 1:58pm

That second fly.toml you posted should work! Looking at the logs, I see an Error: Invariant failed. Digging around, it looks like this error can pop up in remix apps on Fly.io when the SESSION_SECRET isn’t set.

You can set secrets with fly secrets set [flags] NAME=VALUE NAME=VALUE ...

Check the readme for the stack you’re using and see if that step was missed or is missing from the readme. For example, the indie stack readme has the following:

Add a SESSION_SECRET to your fly app secrets, to do this you can run the following commands:
fly secrets set SESSION_SECRET=$(openssl rand -hex 32) --app indie-stack-template
fly secrets set SESSION_SECRET=$(openssl rand -hex 32) --app indie-stack-template-staging

gitnik · April 1, 2024, 9:23pm

Before adding all the secrets, I wanted to get a working deployment first (verifying via the healthcheck, which requires no env vars), which so far I haven’t been able to do. The error you’re seeing only pops up when I’m inside the machine and curl a localhost API.

This is me curling the healthcheck from inside the machine, which produces no errors

After 90 seconds or so, the downscaling kicks in:

But subsequent requests never trigger a scaling event, just as they previously didn’t seem to hit the deployment at all.

rubys · April 1, 2024, 11:10pm

It also looks good to me. Can you try again, there was some sort of network overload incident that appears to have been addressed: Fly.io Status - App Logs Delayed

rubys · April 2, 2024, 12:52am

I just tried a fresh remix app. Worked the first time.

npx create-remix@latest
cd my-remix-app
fly launch --name remix-$USER-$RANDOM

Here’s my fly.toml:

app = 'remix-rubys-24058'
primary_region = 'iad'

[build]

[http_service]
  internal_port = 3000
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']

[[vm]]
  memory = '1gb'
  cpu_kind = 'shared'
  cpus = 1

I’ll take it down eventually, but for now, you can access the app at https://remix-rubys-24058.fly.dev/

moishinetzer · April 2, 2024, 12:56am

My machines are having the same issue, can someone from support please contact me? It’s also a Remix app, and it was working until this morning UK time.

andie · April 2, 2024, 1:18am

Hi @moishinetzer
In your case, you have an app affected by an active host issue. You should be able to see this in your dashboard. This page has some info about some steps you could try: Troubleshoot apps when a host is unavailable · Fly Docs

moishinetzer · April 2, 2024, 6:38am

I cannot see the snapshots for my volume, or scale my machine at all:

Error: failed retrieving snapshots: failed to get volume vol_XXXX snapshots: json: cannot unmarshal object into Go value of type []fly.VolumeSnapshot

My app has no healthy deploys or volumes. Please can I get support for this? It’s really urgent.

dedsec · April 2, 2024, 7:41am

Hello @andie ,

I’m having the same problem. Here are the logs with --debug:

Error: failed retrieving snapshots: failed to get volume vol_e628r6p873nvwmnp snapshots: json: cannot unmarshal object into Go value of type []fly.VolumeSnapshot
Stacktrace:
goroutine 1 [running]:
runtime/debug.Stack()
	/opt/hostedtoolcache/go/1.21.7/x64/src/runtime/debug/stack.go:24 +0x5e
github.com/superfly/flyctl/internal/cli.printError(0xc000403040, 0xc000fc9b3e, 0x1d59200?, {0x25984c0, 0xc000d98180})
	/home/runner/work/flyctl/flyctl/internal/cli/cli.go:162 +0x4db
github.com/superfly/flyctl/internal/cli.Run({0x25b56c8?, 0xc0009bf240?}, 0xc000403040, {0xc00010e190?, 0x5, 0x5})
	/home/runner/work/flyctl/flyctl/internal/cli/cli.go:110 +0x928
main.run()
	/home/runner/work/flyctl/flyctl/main.go:47 +0x156
main.main()
	/home/runner/work/flyctl/flyctl/main.go:26 +0x18

I’m using the latest version: flyctl v0.2.26 linux/amd64 Commit: 32f7fb3048a12c6552332ebb06c2c1db3987445e BuildDate: 2024-04-01T16:35:40Z

gitnik · April 2, 2024, 7:45am

No change. I deployed a fresh app with a new name, but the issue persists. Healthchecks work, but the site remains unreachable.

moishinetzer · April 2, 2024, 11:58am

My site is still critically down. I cannot access volume snapshots, volumes or machines from either the CLI or the web interface.

andie · April 2, 2024, 12:49pm

@moishinetzer

Suggest emailing support using the support email that you can find in your dashboard under Support. We can’t really help you in this thread (which I think is a different issue from yours).

rubys · April 2, 2024, 12:52pm

Can I get you to try the following?

If that fails for you, something weird is going on. If it succeeds, then we can proceed to determine what is different between your app and this app.

andie · April 2, 2024, 1:01pm

hi @dedsec

I think your affected app is also in an org with a plan that includes email support. Suggest emailing support at the email address you can find in your dashboard under Support.

dedsec · April 2, 2024, 3:05pm

Hi @andie ,
Thanks for the help, but the application I’m trying to restore belongs to an org that doesn’t have email support. I also tried to fork the volume attached to it instead of using a snapshot, since there has been no good news on the snapshots so far, and now I’m getting this error:

fly vol fork vol_e628r6p873nvwmnp -r cdg -a homegas
Error: failed to get volume: failed to get volume vol_e628r6p873nvwmnp: deadline_exceeded: Post "http://[fc01:a7b:28df::]:3593/flyd.v1.VolumeService/Get": dial tcp [fc01:a7b:28df::]:3593: i/o timeout (Request ID: 01HTFK5EJ8AFACC84SYH6GHE50-cdg)

andie · April 2, 2024, 8:21pm

@dedsec could you try listing your snapshots now? The issue preventing that should be fixed now. Although your host is still down, you may be able to follow Troubleshoot apps when a host is unavailable · Fly Docs if you can get your volume snapshots now.

gitnik · April 2, 2024, 9:56pm

That ended up working for me and led me to something else. Previously I had a bash script as the entrypoint of my docker image, which would call npm run start (which calls remix-serve under the hood). Now I’m calling npm run start as the entrypoint directly and it seems to work. It seems like running it the way I was previously messes with the port detection or something.

rubys · April 2, 2024, 10:09pm

It probably wasn’t your port, but rather your listen/bind address. If you end up only listening to localhost, you will only be able to accept connections originating from that machine. Listening to either 0.0.0.0 or [::] will enable you to accept connections from outside of the machine. I’m not entirely clear why healthchecks would work (they would originate from within the network, but I didn’t think they would originate from your actual machine).
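
A minimal sketch of the bind-address difference described above, using Node’s built-in http module rather than remix-serve (how the original start.sh ended up binding is not shown in this thread). The helper names `resolveBind`, `isLoopbackOnly` and `startServer`, and the `HOST` variable, are assumptions made for illustration; `PORT` mirrors the [env] section of the fly.toml files above.

```typescript
import { createServer, Server } from "node:http";

type Env = Record<string, string | undefined>;

// Hypothetical helper: pick the bind address and port from the environment.
// HOST is an assumed variable name; PORT mirrors [env] in fly.toml.
export function resolveBind(env: Env): { host: string; port: number } {
  const host = env.HOST ?? "0.0.0.0"; // all IPv4 interfaces; "::" is the IPv6 equivalent
  const port = Number(env.PORT ?? "3000");
  return { host, port };
}

// True when the address only accepts connections from the same machine.
export function isLoopbackOnly(host: string): boolean {
  return host === "localhost" || host === "::1" || host.startsWith("127.");
}

// Start a plain-text server on the resolved address and warn when that
// address would be unreachable from outside the machine.
export function startServer(env: Env = process.env): Server {
  const { host, port } = resolveBind(env);
  if (isLoopbackOnly(host)) {
    console.warn(`Binding to ${host}: connections from outside the machine will fail`);
  }
  const server = createServer((_req, res) => {
    res.writeHead(200, { "Content-Type": "text/plain" });
    res.end("OK");
  });
  server.listen(port, host, () => console.log(`listening on ${host}:${port}`));
  return server;
}
```

Calling `startServer()` with no `HOST` set binds to 0.0.0.0, which accepts connections from outside the machine. With `HOST=127.0.0.1`, a curl from inside the machine still succeeds while outside connections fail, which would be consistent with the symptoms reported earlier in the thread.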
