Remix app, two machines - health checks work fine, but one is refusing connections

Hoping someone can help me get to the bottom of something that is confusing me.

I am deploying an app based on the Remix Blues Stack template.

The app is running fine, but it is only utilising one of the two machines. When I start the app, I get the same startup messages on both machines:

2024-02-15T08:40:58Z runner[d8d9e13a090d18] lhr [info]Machine started in 568ms
2024-02-15T08:40:59Z app[d8d9e13a090d18] lhr [info]> start
2024-02-15T08:40:59Z app[d8d9e13a090d18] lhr [info]> cross-env NODE_ENV=production node ./build/server.js
2024-02-15T08:41:02Z app[d8d9e13a090d18] lhr [info]✅ app ready: http://localhost:8080
2024-02-15T08:41:02Z app[d8d9e13a090d18] lhr [info]✅ metrics ready: http://localhost:8081/metrics
2024-02-15T08:41:04Z app[d8d9e13a090d18] lhr [info]HEAD / 200 - - 80.845 ms
2024-02-15T08:41:04Z app[d8d9e13a090d18] lhr [info]GET /healthcheck 200 - - 111.974 ms
2024-02-15T08:41:04Z app[e286033c6e39d8] lhr [info]HEAD / 200 - - 22.307 ms
2024-02-15T08:41:04Z app[e286033c6e39d8] lhr [info]GET /healthcheck 200 - - 26.911 ms
2024-02-15T08:41:14Z app[d8d9e13a090d18] lhr [info]HEAD / 200 - - 42.826 ms
2024-02-15T08:41:14Z app[d8d9e13a090d18] lhr [info]GET /healthcheck 200 - - 54.623 ms
2024-02-15T08:41:14Z app[e286033c6e39d8] lhr [info]HEAD / 200 - - 25.254 ms
2024-02-15T08:41:14Z app[e286033c6e39d8] lhr [info]GET /healthcheck 200 - - 28.525 ms
2024-02-15T08:41:18Z app[e286033c6e39d8] lhr [info] INFO Sending signal SIGINT to main child process w/ PID 306
2024-02-15T08:41:23Z app[e286033c6e39d8] lhr [info] INFO Sending signal SIGTERM to main child process w/ PID 306
2024-02-15T08:41:24Z app[d8d9e13a090d18] lhr [info]HEAD / 200 - - 49.803 ms
2024-02-15T08:41:24Z app[d8d9e13a090d18] lhr [info]GET /healthcheck 200 - - 59.357 ms
2024-02-15T08:41:24Z app[e286033c6e39d8] lhr [info]HEAD / 200 - - 23.789 ms
2024-02-15T08:41:24Z app[e286033c6e39d8] lhr [info]GET /healthcheck 200 - - 29.048 ms
2024-02-15T08:41:28Z app[e286033c6e39d8] lhr [warn]Virtual machine exited abruptly
2024-02-15T08:41:29Z app[e286033c6e39d8] lhr [info][    0.057523] PCI: Fatal: No config space access function found
2024-02-15T08:41:29Z app[e286033c6e39d8] lhr [info] INFO Starting init (commit: bfa79be)...
2024-02-15T08:41:29Z app[e286033c6e39d8] lhr [info] INFO Preparing to run: `docker-entrypoint.sh npm start` as root
2024-02-15T08:41:29Z app[e286033c6e39d8] lhr [info] INFO [fly api proxy] listening at /.fly/api
2024-02-15T08:41:29Z app[e286033c6e39d8] lhr [info]2024/02/15 08:41:29 listening on [fdaa:3:e4fa:a7b:be65:79c8:89d7:2]:22 (DNS: [fdaa::3]:53)
2024-02-15T08:41:29Z runner[e286033c6e39d8] lhr [info]Machine started in 627ms
2024-02-15T08:41:30Z app[e286033c6e39d8] lhr [info]> start
2024-02-15T08:41:30Z app[e286033c6e39d8] lhr [info]> cross-env NODE_ENV=production node ./build/server.js
2024-02-15T08:41:33Z app[e286033c6e39d8] lhr [info]✅ app ready: http://localhost:8080
2024-02-15T08:41:33Z app[e286033c6e39d8] lhr [info]✅ metrics ready: http://localhost:8081/metrics
2024-02-15T08:41:34Z app[d8d9e13a090d18] lhr [info]HEAD / 200 - - 34.636 ms
2024-02-15T08:41:34Z app[d8d9e13a090d18] lhr [info]GET /healthcheck 200 - - 42.649 ms
2024-02-15T08:41:34Z app[e286033c6e39d8] lhr [info]HEAD / 200 - - 86.321 ms
2024-02-15T08:41:34Z app[e286033c6e39d8] lhr [info]GET /healthcheck 200 - - 129.445 ms
2024-02-15T08:41:44Z app[d8d9e13a090d18] lhr [info]HEAD / 200 - - 38.961 ms

But when I make a request, I get this error message:

2024-02-15T08:41:59Z proxy[e286033c6e39d8] lhr [error]instance refused connection. is your app listening on 0.0.0.0:3000? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)

Looking at the collected metrics, it looks like all requests are being handled by one instance (d8d9e13a090d18 in this case). The very odd thing is that the health checks are working fine. I guess those must come from within the Fly internal network, so it seems likely to me that this is a routing issue, but tbh I'm completely stuck.

My fly.toml file is below for reference:

# fly.toml app configuration file generated for ticketing-remix-fbe1 on 2023-12-14T15:19:16Z
#
# See https://fly.io/docs/reference/configuration/ for information about how to use this file.
#

app = "ticketing-remix-fbe1"
primary_region = "lhr"
kill_signal = "SIGINT"
kill_timeout = "5s"

[experimental]
  auto_rollback = true

[build]

[deploy]
  release_command = "bash ./scripts/migrate.sh"

[env]
  METRICS_PORT = "8081"
  PORT = "8080"

[http_service]
  internal_port = 3000
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ["app"]

[[services]]
  protocol = "tcp"
  internal_port = 8080
  processes = ["app"]

  [[services.ports]]
    port = 80
    handlers = ["http"]
    force_https = true

  [[services.ports]]
    port = 443
    handlers = ["tls", "http"]
  [services.concurrency]
    type = "connections"
    hard_limit = 25
    soft_limit = 20

  [[services.tcp_checks]]
    interval = "15s"
    timeout = "2s"
    grace_period = "1s"

  [[services.http_checks]]
    interval = "10s"
    timeout = "2s"
    grace_period = "5s"
    method = "get"
    path = "/healthcheck"
    protocol = "http"
    tls_skip_verify = false

[[vm]]
  cpu_kind = "shared"
  cpus = 1
  memory_mb = 1024

[[metrics]]
  port = 8081
  path = "/metrics"

Hi… As an additional data point, I did see a response from the other little guy (e286033c6e39d8) when I tried:

$ curl -H 'flyio-debug: doit' --head 'https://ticketing-remix-fbe1.fly.dev/'
HTTP/2 200 
x-fly-region: lhr
strict-transport-security: max-age=3153600000
content-type: text/html; charset=utf-8
set-cookie: toast-session=e30%3D.; Path=/; HttpOnly; SameSite=Lax
vary: Accept-Encoding
date: Thu, 15 Feb 2024 23:50:27 GMT
server: Fly/17d0263d (2024-02-15)
via: 2 fly.io
flyio-debug: {"n":"worker-cf-ewr1-ed97","nr":"ewr","ra":"2605:<elided-ipv6>","rf":"Verbatim",
  "sr":"lhr","sdc":"lon1","sid":"e286033c6e39d8","st":0,"nrtt":99,"bn":"worker-cf-lon1-d88f"}
                                 ^^^^^^^^^^^^^^
fly-request-id: 01HPQJVF35X74Y07PN7JP3PHSF-ewr

(Formatting altered slightly and emphasis added.)


The Virtual machine exited abruptly part might be related; in at least one past episode, it turned up when an underlying physical host was approaching overload, :dragon:

https://community.fly.io/t/fly-machine-becoming-unresponsive-and-then-stopping-without-explanation/10502/3

Thanks so much for taking a look at it. Very interesting that you got a response from the other machine.

I tried again, and it looks like it's now the other way around (e286033c6e39d8 appears to be handling all the requests):

2024-02-16T08:47:05.698 app[d8d9e13a090d18] lhr [info] GET /healthcheck 200 - - 27.190 ms
2024-02-16T08:47:07.150 app[e286033c6e39d8] lhr [info] HEAD / 200 - - 17.175 ms
2024-02-16T08:47:07.153 app[e286033c6e39d8] lhr [info] GET /healthcheck 200 - - 21.335 ms
2024-02-16T08:47:13.457 proxy[d8d9e13a090d18] lhr [error] instance refused connection. is your app listening on 0.0.0.0:3000? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
2024-02-16T08:47:13.694 app[e286033c6e39d8] lhr [info] GET /events/e3935215-4ff5-496b-9652-dabcaeee2a3e/view?checkoutStep=reserve 200 - - 44.986 ms

Thanks for the link to the post - I think the Virtual machine exited abruptly message might be a bit of a red herring. I restarted both machines just before generating that set of log output for the post, because I wanted logs that showed the three salient things, namely:

  • the VMs start up fine
  • the health checks work fine
  • one machine refuses connections

So I think the Virtual machine exited abruptly message comes from the previous shutdown. I've played around with a few different things in my fly.toml on a staging site and I get the same thing - it looks like one VM is refusing connections. So I think it's probably a config issue, but I don't know exactly what - I'll do some more looking over the weekend.

One question that might help though: does anyone know if one VM is privileged / treated differently in any way? It looks to me like the first VM to start up accepts connections fine and the second one doesn't (but that's not based on a large sample, so it could just be supposition).

hi @mjms3

I think the problem in the fly.toml might be that you have two services configured for the same public ports (80 and 443) but forwarding to different internal ports: [http_service] points at 3000 while [[services]] points at 8080, and your app is actually listening on 8080 (per PORT in [env]), so requests routed via the [http_service] definition get refused. My understanding is that Remix apps use port 3000 by default, so when you ran fly launch the launcher added the [http_service] section with internal_port = 3000.

The [http_service] section is like a shortcut for services that listen on ports 80 and 443. See Fly Launch configuration (fly.toml) · Fly Docs.

I’m not sure where the port is set in the remix stack code, but you’ll need to check that and then choose a port to use; either one will work as long as it’s consistent in fly.toml (service and env) and in the app itself. You might want to check your Dockerfile as well, since I’m not sure if you’re using the one from the repo or the one generated by fly launch.

(Note that you can either move the health checks into [http_service] and use that section, or you can delete [http_service] and use [[services]]. Your metrics setup is fine.)
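
Roughly, the first option would look something like this - an untested sketch off the top of my head, assuming you keep the app on port 8080; the [[http_service.checks]] syntax in particular is from memory, so double-check it against the fly.toml docs:

[env]
  METRICS_PORT = "8081"
  PORT = "8080"

[http_service]
  # must match PORT above and whatever the server actually binds
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ["app"]

  [http_service.concurrency]
    type = "connections"
    hard_limit = 25
    soft_limit = 20

  # replaces the [[services.http_checks]] block
  [[http_service.checks]]
    interval = "10s"
    timeout = "2s"
    grace_period = "5s"
    method = "GET"
    path = "/healthcheck"

The [[services]] block then goes away entirely, and [deploy], [[vm]], and [[metrics]] stay as they are.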

@andie

Thanks so much for your help. You were 100% right! In retrospect I should have done a diff against the Remix Blues Stack template to make sure what I had matched it, or read the docs more carefully and really understood what was going on in the fly.toml. I guess I just assumed that, since I hadn't edited the fly.toml, it would be fine - I didn't think about what fly launch had added.

You've saved me hours of headscratching, thanks again.

M

EDIT: hours more headscratching that is!
