LiteFS with consul leasing: Cannot connect to consul due to failed cert verification

Trying to get LiteFS working with Consul leasing, following @benbjohnson's example repo.

I enabled consul via fly consul attach, but I get the following error message:

ERROR: cannot init consul: cannot connect to consul: register node "litefs-test-yk4g9kmxepz1zom5/litefs":
Put "https://consul-fra.fly-shared.net/v1/catalog/register": tls: failed to verify certificate: x509: certificate signed by unknown authority

Lease type “static” works.

The strange thing is that my machine is in AMS while the Consul URL seems to be coming from FRA. Not sure if this is a potential issue.

Can you try installing ca-certificates? On Debian, it’s "apt install ca-certificates". On Alpine, it’s "apk add ca-certificates". I’ll fix that in the docs.
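
For reference, the runtime image just needs those root certificates installed. A minimal sketch of the Debian install step (fuse3 and sqlite3 shown as the usual LiteFS runtime packages; adjust to whatever your image actually needs):

# Debian-based runtime stage: ca-certificates provides the root CAs that the
# Consul TLS verification above was failing on. fuse3/sqlite3 are the usual
# LiteFS runtime dependencies; adjust to your own image.
RUN apt-get update -y && \
    apt-get install -y ca-certificates fuse3 sqlite3 && \
    rm -rf /var/lib/apt/lists/*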

That does the job, thanks Ben! I had overlooked it in the Dockerfile even though I added fuse3 and sqlite3. If you are about to fix the docs, please also mention the issue with FUSE permissions, since Fly's default Dockerfile (e.g., for Phoenix apps) has the following line that must be commented out:

# FIXME: Fix permissions to run as non-root
# https://community.fly.io/t/litefs-v0-4-0-released/12278/2
# USER nobody

Good suggestion! I updated our docs to recommend running LiteFS as root in the Dockerfile; you can then run your application as another user with the su command. The docs changes should be visible in a few minutes.
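
Roughly, the exec section can end up looking like this (the user name and binary path here are placeholders, not the docs verbatim):

exec:
  # LiteFS itself stays root so the FUSE mount works; the app is dropped to an
  # unprivileged user with su. "nobody" and /app/bin/server are placeholders.
  - cmd: "su -s /bin/sh -c '/app/bin/server' nobody"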


As I have recently been experimenting more with LiteFS on Fly.io, where should I post my findings? Should I use this forum?

This forum works great for general feedback. If you hit bugs, could you post them as issues in the LiteFS repository? That makes them easier to track.

OK. Still, it might be difficult to assess whether an issue is caused by LiteFS or by Fly.io in general. For example, I just observed the following while using the consul lease:

  1. Started a machine, it booted as primary, r/w works.
  2. Cloned it; the clone came up as a replica. Reads work, writes fail with an I/O error as expected. cat /litefs/.primary points to the parent machine, which is expected.
  3. After a while, I suddenly see Fly.io downscaling the app deployment by stopping the parent machine (primary) and keeping the replica running.
  4. I log in to the replica via fly ssh console and check /litefs/.primary, which is still there, pointing to the stopped parent.
  5. The replica, which was the only machine running at that time, was never promoted to primary. After a while, the parent machine started again and Consul worked things out, setting the primary back on the parent.

My expectation was that Consul would detect the situation in step 3 quickly and promote the replica machine to primary. That was not the case, and all writes were failing.

Is that normal? Does the consul lease handle going from 2 machines to 1? I thought it did.

You can reproduce it by cloning a machine and then stopping one with fly m stop --select. I just did that; it is deterministic.

Thanks for the feedback, @rsas. I’ll give it a try. Do you have a litefs.yml config you can share?

The replica will attempt to become primary when it is disconnected from the primary, but it sounds like the replica still thinks it's connected. Once it notices the disconnect, it'll attempt to become primary itself (assuming it's configured as a candidate).

tl;dr: that's definitely a bug. It could be Fly's proxy holding onto the connection. I'll test it out and see what's going on.

Thanks @benbjohnson, here is my litefs.yml:

fuse:
  # Mount dir for apps to access the SQLite db.
  dir: "/litefs"

data:
  # Internal LiteFS storage.
  dir: "/data/litefs"

# This flag ensures that LiteFS continues to run if there is an issue on startup.
# It makes it easy to ssh in and debug any issues you might be having rather
# than continually restarting on initialization failure.
exit-on-error: true

# This section defines a list of commands to run after LiteFS has connected
# and sync'd with the cluster. You can run multiple commands but LiteFS expects
# the last command to be long-running (e.g. an application server). When the
# last command exits, LiteFS is shut down.
exec:
  - cmd: "/app/bin/migrate"
    if-candidate: true

  - cmd: "/app/bin/server"

# The lease section specifies how the cluster will be managed. We're using the
# "consul" lease type so that our application can dynamically change the primary.
#
# These environment variables will be available in your Fly.io application.
lease:
  type: "consul"
  advertise-url: "http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:20202"
  candidate: ${FLY_REGION == PRIMARY_REGION}
  promote: true

  consul:
    url: "${FLY_CONSUL_URL}"
    key: "litefs/${FLY_APP_NAME}"

Here is what I do to reproduce (I tried two regions, ams and ord; the equivalent commands are sketched after the list):

  1. Start from scratch with fly deploy, the litefs volume will be created automatically.
  2. Clone the machine with fly m clone --select
  3. Figure out the primary with cat /litefs/.primary on the machines via fly ssh console.
  4. Destroy the primary machine with fly m destroy <id> --force to simulate outage/deployment/downscaling.
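
In CLI terms, that is roughly the following (the machine ID is a placeholder):

# 1. Deploy from scratch; the LiteFS volume is created automatically.
fly deploy

# 2. Clone the machine (pick the existing one when prompted).
fly m clone --select

# 3. Check which machine is the primary.
fly ssh console --select -C "cat /litefs/.primary"

# 4. Destroy the primary to simulate an outage / deployment / downscaling.
fly m destroy <machine-id> --force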

I see the following in the logs. This time it worked! But it took almost 2 minutes to disconnect.

2023-05-25T04:38:34Z app[4d891699f06128] ams [info]04:38:34.780 request_id=F2JJCUa4EU5HMf0AAAJx [info] GET /
2023-05-25T04:38:34Z app[4d891699f06128] ams [info]fuse: write(): wal error: read only replica
....
2023-05-25T04:40:26Z app[4d891699f06128] ams [info]5A7C44D5B755B7BB: disconnected from primary with error, retrying: next frame: read tcp [fdaa:0:43a4:a7b:141:7c8b:a0b9:2]:43470->[fdaa:0:43a4:a7b:142:b654:a6cd:2]:20202: read: connection timed out
2023-05-25T04:40:27Z app[4d891699f06128] ams [info]5A7C44D5B755B7BB: primary lease acquired, advertising as http://4d891699f06128.vm.litefs-test.internal:20202

Hmm, strangely, I’m not seeing it hang when I test it. I see the other node immediately acquire the primary lease when the old primary goes down. I tried it with both fly m stop and fly m destroy.

Regardless, it's a good idea to have a ping/keep-alive, so I added an issue and marked it for the next release. Thanks for reporting the bug!

As for the machines auto-stopping, that's a flag we recently added in fly.toml and I haven't done extensive testing with it on LiteFS. You're probably OK to use auto_stop_machines & auto_start_machines as long as you set min_machines_running to 2 and keep your candidates in your primary region.

So something like:

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 2

The issue you can run into with only one machine is that it wouldn't have anyone to replicate to. If it fails and we start up the other machine, that one will have an old state of the database.

It is weird that you cannot reproduce the issue. Maybe it is something OS-image specific; I use Debian. I just tried deleting the app and starting from scratch. Same behaviour: after two minutes the disconnection is detected.

That’s a good point. I was using Alpine. I’ll give it a try with Debian.

Here are the image details auto-generated by fly launch:

ARG ELIXIR_VERSION=1.14.4
ARG OTP_VERSION=25.3.1
ARG DEBIAN_VERSION=bullseye-20230227-slim

ARG BUILDER_IMAGE="hexpm/elixir:${ELIXIR_VERSION}-erlang-${OTP_VERSION}-debian-${DEBIAN_VERSION}"
ARG RUNNER_IMAGE="debian:${DEBIAN_VERSION}"
