I’m not sure exactly what happened, but my site crashed. In an effort to reduce complexity while bringing it back up, I scaled down to a single instance and updated my LiteFS Consul key. I’m now getting this on startup:
2023-10-26T01:47:43.904 app[5683777fd0008e] den [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2023-10-26T01:47:43.905 app[5683777fd0008e] den [info] 2023/10/26 01:47:43 listening on [fdaa:0:23df:a7b:d828:8baf:fb5b:2]:22 (DNS: [fdaa::3]:53)
2023-10-26T01:47:44.899 app[5683777fd0008e] den [info] [ 472.917276] reboot: Restarting system
2023-10-26T01:47:45.371 app[5683777fd0008e] den [info] [ 0.050595] PCI: Fatal: No config space access function found
2023-10-26T01:47:45.595 app[5683777fd0008e] den [info] INFO Starting init (commit: 15238e9)...
2023-10-26T01:47:45.618 app[5683777fd0008e] den [info] INFO Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
2023-10-26T01:47:45.624 app[5683777fd0008e] den [info] INFO Resized /data to 3217031168 bytes
2023-10-26T01:47:45.625 app[5683777fd0008e] den [info] INFO Preparing to run: `docker-entrypoint.sh litefs mount` as root
2023-10-26T01:47:45.639 app[5683777fd0008e] den [info] INFO [fly api proxy] listening at /.fly/api
2023-10-26T01:47:45.646 app[5683777fd0008e] den [info] 2023/10/26 01:47:45 listening on [fdaa:0:23df:a7b:d828:8baf:fb5b:2]:22 (DNS: [fdaa::3]:53)
2023-10-26T01:47:45.692 app[5683777fd0008e] den [info] config file read from /etc/litefs.yml
2023-10-26T01:47:45.692 app[5683777fd0008e] den [info] LiteFS v0.5.4, commit=9173accf2f0c0e5288383c2706cf8d132ad27f2d
2023-10-26T01:47:45.692 app[5683777fd0008e] den [info] level=INFO msg="host environment detected" type=fly.io
2023-10-26T01:47:45.692 app[5683777fd0008e] den [info] level=INFO msg="litefs cloud backup client configured: https://litefs.fly.io"
2023-10-26T01:47:45.692 app[5683777fd0008e] den [info] level=INFO msg="Using Consul to determine primary"
could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)
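For context, the Consul key change I mentioned is in the lease section of my litefs.yml. It looks roughly like this (a sketch from memory; the key name and advertise URL here are illustrative, not my exact values):

  lease:
    type: "consul"
    advertise-url: "http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202"
    candidate: ${FLY_REGION == PRIMARY_REGION}
    promote: true
    consul:
      url: "${FLY_CONSUL_URL}"
      key: "litefs/${FLY_APP_NAME}-single"   # bumped the key so the single instance starts a fresh lease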
I’m surprised there are no additional logs after that. Could you try removing LITEFS_CLOUD_TOKEN from your secrets and restarting to see if that helps? I’m not sure what LiteFS is waiting for based on the logs so far.
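For reference, removing the secret is something like this (app name is a placeholder; unsetting a secret should also restage the machines):

  fly secrets unset LITEFS_CLOUD_TOKEN -a <your-app>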
Trying that now. When I tried again later, I noticed there are more logs this time:
2023-10-26T02:03:41.609 runner[5683777fd0008e] den [info] Pulling container image registry.fly.io/kcd:deployment-01HDMTDBSAWHATWMXGJRHA1RWQ
2023-10-26T02:03:42.142 runner[5683777fd0008e] den [info] Successfully prepared image registry.fly.io/kcd:deployment-01HDMTDBSAWHATWMXGJRHA1RWQ (533.090957ms)
2023-10-26T02:03:42.175 runner[5683777fd0008e] den [info] Setting up volume 'data_machines'
2023-10-26T02:03:42.175 runner[5683777fd0008e] den [info] Opening encrypted volume
2023-10-26T02:03:42.804 runner[5683777fd0008e] den [info] Configuring firecracker
2023-10-26T02:03:42.891 app[5683777fd0008e] den [info] signal received, litefs shutting down
2023-10-26T02:03:42.891 app[5683777fd0008e] den [info] litefs shut down complete
2023-10-26T02:03:42.891 app[5683777fd0008e] den [info] INFO Sending signal SIGINT to main child process w/ PID 314
2023-10-26T02:03:43.791 app[5683777fd0008e] den [info] INFO Main child exited normally with code: 0
2023-10-26T02:03:43.792 app[5683777fd0008e] den [info] INFO Starting clean up.
2023-10-26T02:03:43.792 app[5683777fd0008e] den [info] INFO Umounting /dev/vdb from /data
2023-10-26T02:03:43.797 app[5683777fd0008e] den [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2023-10-26T02:03:43.801 app[5683777fd0008e] den [info] 2023/10/26 02:03:43 listening on [fdaa:0:23df:a7b:d828:8baf:fb5b:2]:22 (DNS: [fdaa::3]:53)
2023-10-26T02:03:44.796 app[5683777fd0008e] den [info] [ 959.421864] reboot: Restarting system
2023-10-26T02:03:45.261 app[5683777fd0008e] den [info] [ 0.048188] PCI: Fatal: No config space access function found
2023-10-26T02:03:45.504 app[5683777fd0008e] den [info] INFO Starting init (commit: 15238e9)...
2023-10-26T02:03:45.527 app[5683777fd0008e] den [info] INFO Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
2023-10-26T02:03:45.534 app[5683777fd0008e] den [info] INFO Resized /data to 3217031168 bytes
2023-10-26T02:03:45.535 app[5683777fd0008e] den [info] INFO Preparing to run: `docker-entrypoint.sh litefs mount` as root
2023-10-26T02:03:45.546 app[5683777fd0008e] den [info] INFO [fly api proxy] listening at /.fly/api
2023-10-26T02:03:45.556 app[5683777fd0008e] den [info] 2023/10/26 02:03:45 listening on [fdaa:0:23df:a7b:d828:8baf:fb5b:2]:22 (DNS: [fdaa::3]:53)
2023-10-26T02:03:45.604 app[5683777fd0008e] den [info] config file read from /etc/litefs.yml
2023-10-26T02:03:45.604 app[5683777fd0008e] den [info] LiteFS v0.5.4, commit=9173accf2f0c0e5288383c2706cf8d132ad27f2d
2023-10-26T02:03:45.604 app[5683777fd0008e] den [info] level=INFO msg="host environment detected" type=fly.io
2023-10-26T02:03:45.604 app[5683777fd0008e] den [info] level=INFO msg="no backup client configured, skipping"
2023-10-26T02:03:45.605 app[5683777fd0008e] den [info] level=INFO msg="Using Consul to determine primary"
could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)
2023-10-26T02:03:52.884 app[5683777fd0008e] den [info] ERROR: cannot init consul: cannot connect to consul: register node "kcd-g3zmqx5x3y49dlp4/litefs": Unexpected response code: 500 (No cluster leader)
2023-10-26T02:03:52.884 app[5683777fd0008e] den [info] waiting for signal or subprocess to exit
could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)
2023-10-26T02:06:07.218 runner[5683777fd0008e] den [info] Pulling container image registry.fly.io/kcd:deployment-01HDMVPGMRJ7KSYZ4NAF4ZPGD6
2023-10-26T02:06:07.799 runner[5683777fd0008e] den [info] Successfully prepared image registry.fly.io/kcd:deployment-01HDMVPGMRJ7KSYZ4NAF4ZPGD6 (581.470121ms)
2023-10-26T02:06:07.823 runner[5683777fd0008e] den [info] Setting up volume 'data_machines'
2023-10-26T02:06:07.823 runner[5683777fd0008e] den [info] Opening encrypted volume
2023-10-26T02:06:08.466 runner[5683777fd0008e] den [info] Configuring firecracker
2023-10-26T02:06:08.544 app[5683777fd0008e] den [info] signal received, litefs shutting down
2023-10-26T02:06:08.544 app[5683777fd0008e] den [info] litefs shut down complete
2023-10-26T02:06:08.545 app[5683777fd0008e] den [info] INFO Sending signal SIGINT to main child process w/ PID 314
2023-10-26T02:06:08.722 app[5683777fd0008e] den [info] INFO Main child exited normally with code: 0
2023-10-26T02:06:08.723 app[5683777fd0008e] den [info] INFO Starting clean up.
2023-10-26T02:06:08.723 app[5683777fd0008e] den [info] INFO Umounting /dev/vdb from /data
2023-10-26T02:06:08.730 app[5683777fd0008e] den [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2023-10-26T02:06:08.731 app[5683777fd0008e] den [info] 2023/10/26 02:06:08 listening on [fdaa:0:23df:a7b:d828:8baf:fb5b:2]:22 (DNS: [fdaa::3]:53)
2023-10-26T02:06:09.725 app[5683777fd0008e] den [info] [ 144.504948] reboot: Restarting system
2023-10-26T02:06:10.066 app[5683777fd0008e] den [info] [ 0.046984] PCI: Fatal: No config space access function found
2023-10-26T02:06:10.294 app[5683777fd0008e] den [info] INFO Starting init (commit: 15238e9)...
2023-10-26T02:06:10.317 app[5683777fd0008e] den [info] INFO Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
2023-10-26T02:06:10.324 app[5683777fd0008e] den [info] INFO Resized /data to 3217031168 bytes
2023-10-26T02:06:10.325 app[5683777fd0008e] den [info] INFO Preparing to run: `docker-entrypoint.sh litefs mount` as root
2023-10-26T02:06:10.339 app[5683777fd0008e] den [info] INFO [fly api proxy] listening at /.fly/api
2023-10-26T02:06:10.348 app[5683777fd0008e] den [info] 2023/10/26 02:06:10 listening on [fdaa:0:23df:a7b:d828:8baf:fb5b:2]:22 (DNS: [fdaa::3]:53)
2023-10-26T02:06:10.394 app[5683777fd0008e] den [info] config file read from /etc/litefs.yml
2023-10-26T02:06:10.394 app[5683777fd0008e] den [info] LiteFS v0.5.4, commit=9173accf2f0c0e5288383c2706cf8d132ad27f2d
2023-10-26T02:06:10.394 app[5683777fd0008e] den [info] level=INFO msg="host environment detected" type=fly.io
2023-10-26T02:06:10.394 app[5683777fd0008e] den [info] level=INFO msg="no backup client configured, skipping"
2023-10-26T02:06:10.395 app[5683777fd0008e] den [info] level=INFO msg="Using Consul to determine primary"
2023-10-26T02:06:17.618 app[5683777fd0008e] den [info] ERROR: cannot init consul: cannot connect to consul: register node "kcd-g3zmqx5x3y49dlp4/litefs": Unexpected response code: 500 (No cluster leader)
2023-10-26T02:06:17.618 app[5683777fd0008e] den [info] waiting for signal or subprocess to exit
could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)
I believe I hit a related error today as well. After performing a fly deploy, I began to see an error similar to this in my logs: ERROR: cannot init consul: cannot connect to consul: register node "kcd-g3zmqx5x3y49dlp4/litefs": Unexpected response code: 500 (No cluster leader). (Of course, the error message contained my LiteFS mount directory, not “kcd”.)
There were a few things different about today’s fly deploy, compared to prior working deploys:
Today flyctl automatically updated itself (possibly to 0.1.104, but I’m not sure)
I modified my litefs.yml file, adding a command to run migrations when is-candidate: true (see the sketch after this list).
My app server would have failed to start because it required some environment variables that I had not set
(whether any of these were factors in the consul issue, I don’t know)
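For reference, the litefs.yml change was roughly this (a sketch; field names are as I understand them from the LiteFS docs, and the migration command is just an example, not my exact one):

  exec:
    # run migrations only on a node that is a candidate to become primary
    - cmd: "npx prisma migrate deploy"
      if-candidate: true
    # then start the app server on every node
    - cmd: "npm run start"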
In case it’s useful, my app is prod-nb-site under the neuronbench organization.
Looks like Consul went through a migration from apps v1 to v2 and there was an issue on one of the clusters. We shouldn’t see that same issue again, but I’m looking into why we didn’t get an alert about it sooner.
We’re still having some issues getting consul-fra-6 back into a good state. I’m surprised that fly consul attach isn’t setting the environment variable. You can also try creating a new app, running fly consul attach there, and then copying the value of the FLY_CONSUL_URL secret back to your original app.
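Roughly, that workaround looks like this (app names are placeholders; since secret values aren’t shown in plaintext, you’d need to read FLY_CONSUL_URL from inside a running machine of the temporary app):

  fly apps create temp-consul-probe
  fly consul attach -a temp-consul-probe
  # read the value from a machine in the temp app, e.g.:
  #   fly ssh console -a temp-consul-probe -C "printenv FLY_CONSUL_URL"
  # then copy it back to the original app:
  fly secrets set FLY_CONSUL_URL="<value from above>" -a <your-original-app>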
So you mean I could just change the fra-6 part of the hostname to fra-5 and it would work?
edit: that does not work; I’m getting:
cannot connect to consul: register node "app-name-redacted-p7vx1jj24gr1k3z5/litefs": Unexpected response code: 403 (ACL not found)
This is not a super-critical service and we can tolerate some downtime; changing the Consul URL to satellite hosts is something we do not want to do. So the question is: do you have any estimate for when fra-6 might be back up?