My site is in a really bad state

I’m not sure exactly what happened, but my site crashed, and in an effort to reduce complexity while bringing it back up, I scaled down to a single instance and updated my LiteFS Consul key. I’m now getting this on startup:

2023-10-26T01:47:43.904 app[5683777fd0008e] den [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2023-10-26T01:47:43.905 app[5683777fd0008e] den [info] 2023/10/26 01:47:43 listening on [fdaa:0:23df:a7b:d828:8baf:fb5b:2]:22 (DNS: [fdaa::3]:53)
2023-10-26T01:47:44.899 app[5683777fd0008e] den [info] [ 472.917276] reboot: Restarting system
2023-10-26T01:47:45.371 app[5683777fd0008e] den [info] [ 0.050595] PCI: Fatal: No config space access function found
2023-10-26T01:47:45.595 app[5683777fd0008e] den [info] INFO Starting init (commit: 15238e9)...
2023-10-26T01:47:45.618 app[5683777fd0008e] den [info] INFO Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
2023-10-26T01:47:45.624 app[5683777fd0008e] den [info] INFO Resized /data to 3217031168 bytes
2023-10-26T01:47:45.625 app[5683777fd0008e] den [info] INFO Preparing to run: `docker-entrypoint.sh litefs mount` as root
2023-10-26T01:47:45.639 app[5683777fd0008e] den [info] INFO [fly api proxy] listening at /.fly/api
2023-10-26T01:47:45.646 app[5683777fd0008e] den [info] 2023/10/26 01:47:45 listening on [fdaa:0:23df:a7b:d828:8baf:fb5b:2]:22 (DNS: [fdaa::3]:53)
2023-10-26T01:47:45.692 app[5683777fd0008e] den [info] config file read from /etc/litefs.yml
2023-10-26T01:47:45.692 app[5683777fd0008e] den [info] LiteFS v0.5.4, commit=9173accf2f0c0e5288383c2706cf8d132ad27f2d
2023-10-26T01:47:45.692 app[5683777fd0008e] den [info] level=INFO msg="host environment detected" type=fly.io
2023-10-26T01:47:45.692 app[5683777fd0008e] den [info] level=INFO msg="litefs cloud backup client configured: https://litefs.fly.io"
2023-10-26T01:47:45.692 app[5683777fd0008e] den [info] level=INFO msg="Using Consul to determine primary"
could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)

Here’s the litefs config: https://github.com/kentcdodds/kentcdodds.com/blob/main/other/litefs.yml

The app name is kcd. I have no idea what to try at this point.
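
For context, the lease section of that config looks roughly like this (a paraphrased sketch, not a verbatim copy; the linked file is authoritative):

lease:
  type: "consul"
  advertise-url: "http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202"
  candidate: true                 # single instance, so this node may become primary
  consul:
    url: "${FLY_CONSUL_URL}"      # set by `fly consul attach`
    key: "litefs/${FLY_APP_NAME}" # the key I updated

LiteFS uses that Consul key to acquire the primary lease, which would explain why startup stalls right after “Using Consul to determine primary” whenever Consul is unreachable.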

I’m surprised there are no additional logs after that. You could try removing LITEFS_CLOUD_TOKEN from secrets and restarting to see if that works? I’m not sure what LiteFS is waiting for based on the logs so far.

Trying that now. Edit: when I tried again later, there were more logs this time:

2023-10-26T02:03:41.609 runner[5683777fd0008e] den [info] Pulling container image registry.fly.io/kcd:deployment-01HDMTDBSAWHATWMXGJRHA1RWQ
2023-10-26T02:03:42.142 runner[5683777fd0008e] den [info] Successfully prepared image registry.fly.io/kcd:deployment-01HDMTDBSAWHATWMXGJRHA1RWQ (533.090957ms)
2023-10-26T02:03:42.175 runner[5683777fd0008e] den [info] Setting up volume 'data_machines'
2023-10-26T02:03:42.175 runner[5683777fd0008e] den [info] Opening encrypted volume
2023-10-26T02:03:42.804 runner[5683777fd0008e] den [info] Configuring firecracker
2023-10-26T02:03:42.891 app[5683777fd0008e] den [info] signal received, litefs shutting down
2023-10-26T02:03:42.891 app[5683777fd0008e] den [info] litefs shut down complete
2023-10-26T02:03:42.891 app[5683777fd0008e] den [info] INFO Sending signal SIGINT to main child process w/ PID 314
2023-10-26T02:03:43.791 app[5683777fd0008e] den [info] INFO Main child exited normally with code: 0
2023-10-26T02:03:43.792 app[5683777fd0008e] den [info] INFO Starting clean up.
2023-10-26T02:03:43.792 app[5683777fd0008e] den [info] INFO Umounting /dev/vdb from /data
2023-10-26T02:03:43.797 app[5683777fd0008e] den [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2023-10-26T02:03:43.801 app[5683777fd0008e] den [info] 2023/10/26 02:03:43 listening on [fdaa:0:23df:a7b:d828:8baf:fb5b:2]:22 (DNS: [fdaa::3]:53)
2023-10-26T02:03:44.796 app[5683777fd0008e] den [info] [ 959.421864] reboot: Restarting system
2023-10-26T02:03:45.261 app[5683777fd0008e] den [info] [ 0.048188] PCI: Fatal: No config space access function found
2023-10-26T02:03:45.504 app[5683777fd0008e] den [info] INFO Starting init (commit: 15238e9)...
2023-10-26T02:03:45.527 app[5683777fd0008e] den [info] INFO Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
2023-10-26T02:03:45.534 app[5683777fd0008e] den [info] INFO Resized /data to 3217031168 bytes
2023-10-26T02:03:45.535 app[5683777fd0008e] den [info] INFO Preparing to run: `docker-entrypoint.sh litefs mount` as root
2023-10-26T02:03:45.546 app[5683777fd0008e] den [info] INFO [fly api proxy] listening at /.fly/api
2023-10-26T02:03:45.556 app[5683777fd0008e] den [info] 2023/10/26 02:03:45 listening on [fdaa:0:23df:a7b:d828:8baf:fb5b:2]:22 (DNS: [fdaa::3]:53)
2023-10-26T02:03:45.604 app[5683777fd0008e] den [info] config file read from /etc/litefs.yml
2023-10-26T02:03:45.604 app[5683777fd0008e] den [info] LiteFS v0.5.4, commit=9173accf2f0c0e5288383c2706cf8d132ad27f2d
2023-10-26T02:03:45.604 app[5683777fd0008e] den [info] level=INFO msg="host environment detected" type=fly.io
2023-10-26T02:03:45.604 app[5683777fd0008e] den [info] level=INFO msg="no backup client configured, skipping"
2023-10-26T02:03:45.605 app[5683777fd0008e] den [info] level=INFO msg="Using Consul to determine primary"
could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)
2023-10-26T02:03:52.884 app[5683777fd0008e] den [info] ERROR: cannot init consul: cannot connect to consul: register node "kcd-g3zmqx5x3y49dlp4/litefs": Unexpected response code: 500 (No cluster leader)
2023-10-26T02:03:52.884 app[5683777fd0008e] den [info] waiting for signal or subprocess to exit
could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)

Here are the logs after doing that:

2023-10-26T02:06:07.218 runner[5683777fd0008e] den [info] Pulling container image registry.fly.io/kcd:deployment-01HDMVPGMRJ7KSYZ4NAF4ZPGD6
2023-10-26T02:06:07.799 runner[5683777fd0008e] den [info] Successfully prepared image registry.fly.io/kcd:deployment-01HDMVPGMRJ7KSYZ4NAF4ZPGD6 (581.470121ms)
2023-10-26T02:06:07.823 runner[5683777fd0008e] den [info] Setting up volume 'data_machines'
2023-10-26T02:06:07.823 runner[5683777fd0008e] den [info] Opening encrypted volume
2023-10-26T02:06:08.466 runner[5683777fd0008e] den [info] Configuring firecracker
2023-10-26T02:06:08.544 app[5683777fd0008e] den [info] signal received, litefs shutting down
2023-10-26T02:06:08.544 app[5683777fd0008e] den [info] litefs shut down complete
2023-10-26T02:06:08.545 app[5683777fd0008e] den [info] INFO Sending signal SIGINT to main child process w/ PID 314
2023-10-26T02:06:08.722 app[5683777fd0008e] den [info] INFO Main child exited normally with code: 0
2023-10-26T02:06:08.723 app[5683777fd0008e] den [info] INFO Starting clean up.
2023-10-26T02:06:08.723 app[5683777fd0008e] den [info] INFO Umounting /dev/vdb from /data
2023-10-26T02:06:08.730 app[5683777fd0008e] den [info] WARN hallpass exited, pid: 315, status: signal: 15 (SIGTERM)
2023-10-26T02:06:08.731 app[5683777fd0008e] den [info] 2023/10/26 02:06:08 listening on [fdaa:0:23df:a7b:d828:8baf:fb5b:2]:22 (DNS: [fdaa::3]:53)
2023-10-26T02:06:09.725 app[5683777fd0008e] den [info] [ 144.504948] reboot: Restarting system
2023-10-26T02:06:10.066 app[5683777fd0008e] den [info] [ 0.046984] PCI: Fatal: No config space access function found
2023-10-26T02:06:10.294 app[5683777fd0008e] den [info] INFO Starting init (commit: 15238e9)...
2023-10-26T02:06:10.317 app[5683777fd0008e] den [info] INFO Mounting /dev/vdb at /data w/ uid: 0, gid: 0 and chmod 0755
2023-10-26T02:06:10.324 app[5683777fd0008e] den [info] INFO Resized /data to 3217031168 bytes
2023-10-26T02:06:10.325 app[5683777fd0008e] den [info] INFO Preparing to run: `docker-entrypoint.sh litefs mount` as root
2023-10-26T02:06:10.339 app[5683777fd0008e] den [info] INFO [fly api proxy] listening at /.fly/api
2023-10-26T02:06:10.348 app[5683777fd0008e] den [info] 2023/10/26 02:06:10 listening on [fdaa:0:23df:a7b:d828:8baf:fb5b:2]:22 (DNS: [fdaa::3]:53)
2023-10-26T02:06:10.394 app[5683777fd0008e] den [info] config file read from /etc/litefs.yml
2023-10-26T02:06:10.394 app[5683777fd0008e] den [info] LiteFS v0.5.4, commit=9173accf2f0c0e5288383c2706cf8d132ad27f2d
2023-10-26T02:06:10.394 app[5683777fd0008e] den [info] level=INFO msg="host environment detected" type=fly.io
2023-10-26T02:06:10.394 app[5683777fd0008e] den [info] level=INFO msg="no backup client configured, skipping"
2023-10-26T02:06:10.395 app[5683777fd0008e] den [info] level=INFO msg="Using Consul to determine primary"
2023-10-26T02:06:17.618 app[5683777fd0008e] den [info] ERROR: cannot init consul: cannot connect to consul: register node "kcd-g3zmqx5x3y49dlp4/litefs": Unexpected response code: 500 (No cluster leader)
2023-10-26T02:06:17.618 app[5683777fd0008e] den [info] waiting for signal or subprocess to exit
could not find a good candidate within 90 attempts at load balancing. last error: no known healthy instances found for route tcp/443. (hint: is your app shut down? is there an ongoing deployment with a volume or are you using the 'immediate' strategy? have your app's instances all reached their hard limit?)

I believe I hit a related error today as well. After performing a fly deploy, I began to see a similar error in my logs: ERROR: cannot init consul: cannot connect to consul: register node "kcd-g3zmqx5x3y49dlp4/litefs": Unexpected response code: 500 (No cluster leader). (Of course, the error message contained my LiteFS mount directory, not “kcd”.)

There were a few things different about today’s fly deploy, compared to prior working deploys:

  • Today flyctl automatically updated itself (possibly to 0.1.104? But I’m not sure)
  • I modified my litefs.yml file, adding a command to run migrations when is-candidate: true (see the sketch after this list).
  • My app server would have failed to start because it required some environment variables that I had not set.

(whether any of these were factors in the consul issue, I don’t know)
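
For what it’s worth, the migration hook follows the exec pattern from the LiteFS docs, roughly like this (the commands are placeholders for my actual scripts, and the docs spell the field if-candidate):

exec:
  # run migrations only on a node that can become primary
  - cmd: "npm run db:migrate"   # placeholder command
    if-candidate: true
  # then start the app server on every node
  - cmd: "npm start"            # placeholder command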

In case it’s useful, my app is prod-nb-site under the neuronbench organization.


Looks like there was a hiccup in our Consul cluster. Can you try restarting/redeploying?
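
For example (using the app name from above; a restart reuses the current image, while a redeploy builds and releases a fresh one):

fly apps restart kcd
# or, from the project directory:
fly deploy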


It’s back up. Thank you.

Anything I can do to prevent this from happening again?

Looks like Consul went through a migration from apps v1 to v2, and it had an issue on one of the clusters. We shouldn’t see that same issue again, but I’m looking into why we didn’t get an alert about it sooner.


My site is back up too. I had to run fly consul attach after a redeploy. Thanks for the help!

I will continue here, as something happened to our Consul also.
We have one application which was running perfectly, but last night it just lost Consul.

Our Consul host was consul-fra-6.fly-shared.net, and now the app cannot connect there anymore:

ERROR: cannot init consul: cannot connect to consul: register node "app-name-redacted-p7vx1jj24gr1k3z5/litefs": Put "https://consul-fra-6.fly-shared.net/v1/catalog/register": read tcp [2604:1380:4601:d609:0:7a12:a6f3:1]:52408->[2a09:8280:1::e05c]:443: read: connection reset by peer

We tried fly consul attach, but the app does not even get the FLY_CONSUL_URL env setting.
We also tried detaching first. No luck.
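
For reference, this is roughly the sequence we ran (app name is a placeholder here); fly secrets list only shows names and digests, but it should at least confirm whether FLY_CONSUL_URL was set:

fly consul detach -a app-name-redacted
fly consul attach -a app-name-redacted
fly secrets list -a app-name-redacted   # FLY_CONSUL_URL should appear in this list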

In our fly.toml, we have

[experimental]
auto_rollback = true 
enable_consul = true

These are probably not needed anymore, but they are still there.

Anything we should try?

We’re still having some issues getting consul-fra-6 back in a good state. I’m surprised that fly consul attach isn’t setting an environment variable. You can also try creating a new app, running fly consul attach there, and then copying the value of the FLY_CONSUL_URL secret back to your original app.
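
Something like this, for example (app names are placeholders, and since the CLI never prints secret values, you would read FLY_CONSUL_URL from inside a machine of the new app, e.g. via its logs):

fly apps create consul-probe                # hypothetical throwaway app
fly consul attach -a consul-probe           # provisions a fresh FLY_CONSUL_URL secret
fly machine run alpine env -a consul-probe  # one-off machine that prints its env
# copy the FLY_CONSUL_URL value from that output, then set it on the original app:
fly secrets set FLY_CONSUL_URL="<copied value>" -a your-original-app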

Ok, will try creating another app.

The Consul URL is also set on other hosts, and it is a little work to change it on all of them.

Do you have any estimate for when Frankfurt could be alive again?

Our fra-6 consul cluster is back up and running now. Let me know if you have any more issues.


I can confirm that our app is now back online and has a connection to the previous Consul URL.

All good again, thanks!

And thanks, again, for creating LiteFS. We are still using the 0.3 version of it and are more than happy with what it does.


Seems that Consul is down again:

ERROR: cannot init consul: cannot connect to consul: register node "app-name-redacted-p7vx1jj24gr1k3z5/litefs": Put "https://consul-fra-6.fly-shared.net/v1/catalog/register": read tcp [2604:1380:4601:d609:0:7a12:a6f3:1]:36464->[2a09:8280:1::e05c]:443: read: connection reset by peer

We’re having trouble again with the fra-6 cluster. For now, it would be best to switch to another one if possible, such as fra-5.

So you mean I could change just the fra-6 part of the hostname to fra-5, and it would work then?

edit: that does not work; I’m getting the error below (presumably our Consul credentials are only valid on the original cluster):

cannot connect to consul: register node "app-name-redacted-p7vx1jj24gr1k3z5/litefs": Unexpected response code: 403 (ACL not found)

This is not a super-critical service and we can digest some downtime. Changing the Consul URL on the satellite hosts is something we do not want to do, so the question is: do you have any schedule for when fra-6 could be alive again?

It seems that fra-6 is alive again.

Yes, I’ve turned a few knobs to revive it! Let us know if you have any more problems.

