Small consul cluster + litefs failover

tj1 · March 27, 2023, 3:00am

I’m hoping to get some input on setting up a small consul cluster for litefs on v2 apps, since FLY_CONSUL_URL is no longer set there, even with the experimental flag. Litefs seems to work fine on a static lease, but I don’t want to fail over manually. App is at https://litefs-liveview2.fly.dev/

Consul questions

I’m pretty sure multiple people here must have some serious expertise on consul at this point, so here is where I’ve gotten so far:
https://git.sr.ht/~sheertj/fly_consul

The bind address is set from the fly-local-6pn entry in /etc/hosts, and the entrypoint is overridden.

FROM hashicorp/consul:1.15.1 as consul
COPY ./docker-entrypoint-ubi.sh /usr/local/bin/docker-entrypoint.sh
ENTRYPOINT ["docker-entrypoint.sh"]
CMD ["agent", "-server", "-client", "0.0.0.0", "-bootstrap-expect=3", "-ui"]
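For anyone reproducing this, the bind-address step could look roughly like the sketch below. This is not the script from the linked repo: the function name lookup_hosts_entry is made up for illustration, and it assumes /etc/hosts contains a line mapping fly-local-6pn to the machine's 6PN address.

```shell
# Hypothetical helper (not from the linked repo): print the first address
# that a hosts-format file maps to the given name.
lookup_hosts_entry() {
  name="$1"
  hosts_file="${2:-/etc/hosts}"
  awk -v name="$name" '
    $1 ~ /^#/ { next }                 # skip comment lines
    {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^#/) break           # ignore trailing comments
        if ($i == name) { print $1; exit }
      }
    }
  ' "$hosts_file"
}

# Possible use in an entrypoint (commented out so the sketch has no side effects):
#   bind_addr="$(lookup_hosts_entry fly-local-6pn)"
#   exec consul agent -server -bind "$bind_addr" -client 0.0.0.0 -bootstrap-expect=3 -ui
```

The function prints nothing when the name is missing, so an entrypoint should probably check for an empty result before starting the agent.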

I set up 3 machines in one region, and I have to log in and run consul join $6pn_host1 $6pn_host2 $6pn_host3 by hand.

Is there an easy way to have these join and bootstrap automatically? I see there’s a -retry-join=$ip flag that could potentially be used, and there are a myriad of options for private networking lookups on the Private Networking page of the Fly docs. Has anyone glued these two together? Can I just use -retry-join=${FLY_APP_NAME}.internal and it will work eventually?
Are persistent volumes required for consul?
Is FLY_CONSUL_URL going to be coming back?
Are any health checks required for this, and what would they look like in fly.toml?
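On the auto-join question, one thing to try is sketched below: pass the app's .internal name to -retry-join instead of running consul join by hand. -retry-join is a real agent flag that keeps retrying until a join succeeds. Whether ${FLY_APP_NAME}.internal resolves to all three peers early enough during boot is an assumption that has not been verified in this thread. consul_agent_args is a hypothetical helper name.

```shell
# Hypothetical helper: print one agent argument per line for a server cluster
# that joins via a DNS name rather than a manual `consul join`.
consul_agent_args() {
  bind_addr="$1"       # this machine's 6PN address
  join_host="$2"       # e.g. "${FLY_APP_NAME}.internal" (assumed to list all peers)
  expect="${3:-3}"     # number of servers to wait for before bootstrapping
  printf '%s\n' \
    agent -server \
    -bind "$bind_addr" \
    -client 0.0.0.0 \
    "-bootstrap-expect=$expect" \
    "-retry-join=$join_host" \
    -ui
}

# Possible use in an entrypoint (commented out so the sketch has no side effects):
#   exec consul $(consul_agent_args "$bind_addr" "${FLY_APP_NAME}.internal")
```

The unquoted command substitution in the usage line relies on none of the values containing whitespace, which holds for addresses and host names.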

Litefs questions

The lease config is below. HOSTNAME is set in the entrypoint via export HOSTNAME=$(hostname --fqdn) before starting litefs, and LITEFS_PORT is set up… somewhere.

lease:
  type: "consul"
  advertise-url: "http://${HOSTNAME}.vm.${FLY_APP_NAME}.internal:${LITEFS_PORT}"

  consul:
    url: "http://${CONSUL_APP_NAME}.internal:8500"
    key: "litefs/${FLY_APP_NAME}"
    ttl: "10s"
    lock-delay: "3s"

Is there a better advertise-url that doesn’t require an entrypoint setup? “http://[fly-local-6pn]:${LITEFS_PORT}” makes everyone connect to themselves. I see there’s a FLY_PUBLIC_IP, but I’m not sure if it’s public and I also couldn’t connect on it from the other nodes.
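On earlier versions, failover happened within 15 seconds. Now it takes about 2.5 min for the replicas to realize the primary is down (in the log below, ord reboots at 01:18:03 and the replicas only report the disconnect at 01:20:28). I could not find the appropriate setting in the docs.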

 2023-03-27T01:18:03Z app[e2865521f06486] ord [info][  400.560970] reboot: Restarting system
 2023-03-27T01:20:28Z app[e286de5ce55d86] mia [info]C932729BE3E04D2E310583FE: disconnected from primary with error, retrying: next frame: read tcp [fdaa:0:9144:a7b:88:1f45:114c:2]:48164->[fdaa:0:9144:a7b:f4:7496:8fb4:2]:20202: read: connection timed out
 2023-03-27T01:20:28Z app[6e82993f094687] sjc [info]B3772CDA138D39DEA5702727: disconnected from primary with error, retrying: next frame: read tcp [fdaa:0:9144:a7b:b2e2:ec98:1dd4:2]:36830->[fdaa:0:9144:a7b:f4:7496:8fb4:2]:20202: read: connection timed out
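To put a number on the gap in the log above, here is a quick calculation. The helper names are made up, and it assumes both timestamps fall on the same day, as they do here.

```shell
# Convert an HH:MM:SS clock time to seconds since midnight.
hms_to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Seconds between two clock times on the same day (start first, end second).
elapsed_seconds() {
  start="$(hms_to_seconds "$1")"
  end="$(hms_to_seconds "$2")"
  echo $(( end - start ))
}

# ord reboot at 01:18:03, replicas report the disconnect at 01:20:28
elapsed_seconds 01:18:03 01:20:28   # prints 145
```

145 seconds is far longer than the 10s ttl plus 3s lock-delay in the lease config, so the delay does not look like a lease setting.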

Machines are incredibly fast to boot up. The proxy started a stopped machine in 300ms, but the app only comes up about 10 seconds after that. I was hoping to start the app and litefs concurrently, but then the app creates a file in the mount dir and litefs refuses to mount over it. Is there any way to force this? I could not find anything in the docs.
Is there any other easy failover technique? Consul is an extra dependency.

tj1 · March 28, 2023, 12:00pm

@benbjohnson1 @benbjohnson - any thoughts regarding failover?

Also, on apps v2 is there a way to run a static lease in a single region, have two machines in that region, and have them fail over to each other somehow? I can see how that is theoretically possible on v1, but don’t see how that can be done in v2 without consul.

Thanks.

benbjohnson · March 28, 2023, 4:01pm

I can try to answer some of these questions.

Yes, FLY_CONSUL_URL is coming to apps v2 soon. We wanted to get the other existing functionality out first but we’ll be adding this back in.

I haven’t tried running Consul without persistent volumes but I would assume it does require persistence. It uses the Raft consensus protocol for its strongly consistent parts (e.g. leases) and Raft requires persistence.

Currently, LiteFS needs to be the entrypoint because it needs to start before the app. Otherwise the application will try to create the database on the normal file system and then LiteFS tries to mount on top of that, which doesn’t work.

Hmm, a 2.5 minute failover doesn’t sound right. Is your node in ord taking a long time to shut down? The lease TTL is still short (~10s) so the ord node must still be running and renewing its lease.

FUSE provides a flag to mount over an existing directory but there are all kinds of possible footguns when you do that. Do you know if your application’s process is exiting quickly? LiteFS should be quite fast to shut down.

We’re working on long-term, durable backup storage for LiteFS (similar to Litestream) and we’re hoping to use that to enforce the primary as well so you won’t need Consul in the future.

How’re you thinking the static lease option would work on apps v1? The main issue is that you need a consensus to safely determine which node is up and running and then determine that it’s the primary. There’s not really a great way to do that when you only have two nodes.
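To make the two-node problem concrete: a majority-based protocol such as Raft needs floor(n/2) + 1 nodes to agree. The sketch below just works through that arithmetic, and the helper names are made up for illustration.

```shell
# Smallest majority of n voting nodes.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# How many nodes can fail while a majority is still reachable.
failures_tolerated() {
  echo $(( $1 - $(quorum_size "$1") ))
}

for n in 1 2 3 5; do
  echo "nodes=$n quorum=$(quorum_size "$n") tolerated=$(failures_tolerated "$n")"
done
```

With two nodes the quorum is two, so losing either node loses the majority and the survivor cannot safely promote itself. Three is the smallest size that tolerates one failure, which matches the -bootstrap-expect=3 in the consul setup earlier in the thread.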

tj1 · March 28, 2023, 10:23pm

Thanks for the additional information.

So, in theory, what I’m looking for is a floating DNS name that can be moved to the appropriate host on failover. However, this doesn’t fully solve the problem, as litefs also has to decide that it is now the primary. (There are a few old-school ways of doing this; VeritasFS for Oracle comes to mind, where it actually killed all the IO on the primary.)

We have two possibilities right now for advertise-url, but neither will work with a static lease:
<alloc_id>.vm.<appname>.internal
<region>.<appname>.internal

I know it’s rather silly, but even if litefs queried a static url for the master name / advertise url to failover, that could work ok with some decent monitoring.

Anyway, it looks like for failover, we’re going to need a small consul cluster atm for v2.

Did some more testing. When the litefs process is killed on the machine, it fails over immediately. When fly machines stop is used, it hangs for a bit. Perhaps a TCP socket is lingering?