Can't get a CouchDB Cluster working (connection_closed)

I’m trying to set up a CouchDB Cluster using the semi-official Docker image, though I’ve been struggling for days now trying different setups/options.

As CouchDB doesn’t seem to support IPv6 so far, and I also need to define a NODENAME, I’m creating individual apps with a single node each.

But when I try to add the other nodes (apps), using the hostname app.fly.dev or the app’s public IPv4, I get a “connection_closed” error. I can access Fauxton (the dashboard) on all apps, and the logs aren’t really helpful.

Also tried app.internal which results in {conn_failed,{error,nxdomain}}

And I just tried, just in case, setting up custom domains, though I get the same connection_closed error :cry:

I’m kinda lost. I tried searching for more Erlang-specific solutions but couldn’t get anything working, and I’m not at all familiar with that language.

Oh, and I forgot to mention that it’s working locally using [this example](GitHub - cacois/couchdb-docker-clustering-examples: Examples of successfully clustering CouchDB 2.1.1 in Docker containers), slightly tweaked to set the Erlang cookie and node name via its env vars. That’s why I’m asking here, as it seems to be a networking issue rather than a problem with the Docker image or CouchDB itself :tipping_hand_man:t2:

Update: luckily, normal replication does work correctly, so for now I’m just replicating between 2 apps (nodes) to have at least some resilience, so I don’t lose data if a node (volume) crashes. This is sufficient for now while I build my app’s MVP, but I’d still like to get clustering working too, for a great future :innocent:

And I just tried, just in case, setting up custom domains, though I get the same connection_closed error

Without a full error log, reference code, and fly.toml config entries, it’s hard to tell what’s at fault here. I can imagine that the service ports aren’t set up or listened on as expected, causing this error (see also: tcp/udp services on Fly).
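
A quick sanity check might be something like this (hypothetical app name; adjust the port to whatever your service actually exposes, and note that /_up is CouchDB’s own health endpoint):

# from your own machine, over the public edge
curl -sv "https://<appname>.fly.dev:5984/_up"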

Also, note that Fly load balancers time out idle TCP connections after 60s: Increasing idle timeout

But when I try to add the other nodes (apps), using the hostname app.fly.dev or the app’s public IPv4, I get a “connection_closed” error.

Going over the public internet for clustering is less than ideal, one would think.

As CouchDB doesn’t seem to support IPv6 so far…

For clustering on Fly, you’d definitely need IPv6 support to use 6pn. Also, this blog post on building a NATS cluster on Fly is a handy reference.

1 Like

Yea, I just realized a couple of minutes ago, when I wasn’t at the laptop, that the fly.toml would be useful :sweat_smile: Thanks a lot for your comment tho

Not sure the idle timeout is a factor here, because the error shows up pretty quickly, like after 1 or 2 seconds.

And yeah, I’d definitely prefer using 6pn; I’m going to comment with my use case in the CouchDB repo soon-ish. It would also be fantastic if I didn’t have to specify NODENAME on startup, so that nodes could auto-discover and connect and I wouldn’t need to create individual apps for each node, though that’s another point I’m going to bring up in their repo even though I doubt it’s “easily fixable”.

# app = "couchdb-template"

kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[build]
  # image = "apache/couchdb"
  dockerfile = "Dockerfile"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[mounts]
  destination = "/opt/couchdb/data"
  source = "data"

[[services]]
  http_checks = []
  internal_port = 5984
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 5984

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

[[services]]
  http_checks = []
  internal_port = 4369
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    handlers = ["http"]
    port = 4369

[[services]]
  http_checks = []
  internal_port = 9100
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    handlers = ["http"]
    port = 9100

As stated the logs didn’t show much, though I’m going to replicate the setup and see what I can bring here.

fly.toml looks okay to me.

It would also be fantastic if I didn’t have to specify NODENAME on startup, so that nodes could auto-discover and connect and I wouldn’t need to create individual apps for each node

You don’t really need to: each node of your Fly app is already 6pn DNS and IPv6 addressable:

<fly-alloc-id>.vm.<app-name>.internal

Where <fly-alloc-id> is the first 4 bytes (8 hex chars, e.g. eda70bcd) of the FLY_ALLOC_ID env var (e.g. eda70bcd-2f09-45b7-b1d9-1f8f084c62ea) preset for that instance of your Fly app’s VM. Though, I believe this information is only accessible at runtime.

See also: Send request to a specific VM - #2 by greg (via 6pn) and Is it possibly for a client to connect to a particular instance - #4 by jerome (with http over fly-proxy).

You can grab an app’s 6pn IP address like so:

flyctl dig <appname>.internal -a <appname>

Or, programmatically too, ref: Best way to internally send a request to ALL running vms? - #8 by kurt / Specify instance-id in fly-replay header - #10 by ignoramous

Make sure <appname> listens on :: or fly-local-6pn or _local_ip.internal to respond to incoming 6pn requests: Private networking not working - #4 by kurt
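
For CouchDB that roughly translates to binding its HTTP interface to an IPv6 address. A minimal sketch, assuming the stock apache/couchdb image layout (config drop-ins under /opt/couchdb/etc/local.d) and that the chttpd bind_address setting accepts "::" (see dch’s note on IPv6 support further down):

# e.g. baked into the image or run before CouchDB starts
cat > /opt/couchdb/etc/local.d/10-bind-ipv6.ini <<'EOF'
[chttpd]
bind_address = ::
EOF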

NB: Just like the <fly-alloc-id>, the allocated 6pn addresses also change between deploys/restarts, unless you use volumes: Can an instance have a persistent network identity? - #7 by kurt


All in all, you’re in for quite a wild ride (:

Hmm yea thanks for the feedback on the fly.toml :pray:

I’m not sure about the NODENAME, I probably should’ve explained more what it is; here’s a quote from the couchdb-docker repo:

NODENAME will set the name of the CouchDB node inside the container to couchdb@${NODENAME}, in the file /opt/couchdb/etc/vm.args. This is used for clustering purposes and can be ignored for single-node setups.

So I’m not sure if that can be “dynamic”, or whether it’s okay to change it later on :thinking::man_shrugging:t2:

Or maybe in a custom startup.sh I might be able to do something like export NODENAME="$FLY_ALLOC_ID.vm.$FLY_APP_NAME.internal" and then start CouchDB :hugs:
Or write it directly to the config myself like here
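
Roughly, an untested sketch of that startup.sh idea (caveat, per later in this thread: only the first 8 hex chars of FLY_ALLOC_ID form the resolvable hostname, not the full UUID):

# startup.sh (sketch)
#!/bin/bash
set -e

# derive the node name from Fly's env vars at boot
export NODENAME="$FLY_ALLOC_ID.vm.$FLY_APP_NAME.internal"

# hand off to the image's normal entrypoint and command
exec tini -- /docker-entrypoint.sh /opt/couchdb/bin/couchdb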

Though so far it still seems like it doesn’t support IPv6. I think I might give it a try a little later, as from comments on Twitter it seems it might actually support it, just not in the URL-validating regexes, and I’m not sure where those are being used :sweat_smile:

Because having the cluster in one app would be superior, I believe.

1 Like

CouchDB probably doesn’t support IPv6. IPv6 in Erlang is a little weird, and I don’t think they’ve done anything to make it work properly. It’s possible that you can append "-proto_dist inet6_tcp" >> "/opt/couchdb/etc/vm.args", then refer to nodes by IP address (not hostname) and it’ll work. But it’s a stretch: ignored "-proto_dist inet6_tcp" · Issue #2791 · apache/couchdb · GitHub

I do not think you’ll be able to run a CouchDB cluster on Fly.io if the IPv6 stuff doesn’t work.

One thing you could try is running Tailscale in each VM. That’ll give you a private IPv4 network you can use for CouchDB clustering.

I would only use [[services]] to expose CouchDB to the internet for clients. There’s no great way to use that for clustering.

Hey @kurt thanks for chiming in.

I stumbled upon "-proto_dist inet6_tcp", though I don’t exactly remember how I tested it; it might have been with hostnames only.

I’ll give that a try a little later and will also look into Tailscale. Thanks a lot :pray:

Also, a Fly-managed CouchDB cluster like fly couchdb create would be tremendous, although Couch feels way less mainstream than other DBs, even though it seems like one of the “easiest” ways to get a fully synced, multi-leader, offline-first experience.

1 Like

The problem I’ve got so far is that I can’t get the IPv6 address of the instance within my custom entrypoint:

# entrypoint.sh
#!/bin/bash

echo "-------------------------"
echo "-------------------------"
echo "-------------------------"
echo "-------------------------"
echo "RUNNING CUSTOM ENTRYPOINT entrypoint.sh"

export ERL_FLAGS="-proto_dist inet6_tcp"
echo "ERL_FLAGS $ERL_FLAGS"

export INTERNAL_ADDR="$FLY_ALLOC_ID.vm.$FLY_APP_NAME.internal"
echo "INTERNAL_ADDR: $INTERNAL_ADDR"

dig +short aaaa "$INTERNAL_ADDR" @fdaa::3

echo $(dig +short aaaa "$INTERNAL_ADDR" @fdaa::3)

ping -6 $INTERNAL_ADDR

echo $(ping -6 $INTERNAL_ADDR)

ping -6 "$FLY_ALLOC_ID.vm.$FLY_APP_NAME.internal"

export NODENAME=$(dig +short aaaa "$INTERNAL_ADDR" @fdaa::3)

echo $NODENAME

export NODENAME=$INTERNAL_ADDR
echo "-------------------------"
echo "-------------------------"
echo "-------------------------"
echo "-------------------------"

# original entrypoint and cmd
tini -- /docker-entrypoint.sh /opt/couchdb/bin/couchdb

A little chaotic, I know; I’m just debugging and trying to make the output stand out visually, especially as CouchDB starts its error-log madness about the _users db not existing until you create it (that’s expected, tho pretty annoying when trying to debug via the logs 😅)

# Dockerfile
FROM apache/couchdb:3.2.2

COPY local.ini /opt/couchdb/etc/
COPY entrypoint.sh /usr/local/bin/

RUN apt-get update && apt-get install -y -q nano dnsutils iputils-ping

ENTRYPOINT ["/bin/sh","-c"]
CMD ["/usr/local/bin/entrypoint.sh"]

trying

export INTERNAL_ADDR="$FLY_ALLOC_ID.vm.$FLY_APP_NAME.internal"
ping -6 $INTERNAL_ADDR

results in

ping: 06d8fc4b-d779-2f0c-6b92-919f68352364.vm.crcouchdb1.internal: Name or service not known

Just to be sure, I also tried ping $INTERNAL_ADDR without -6, with the same result

and

echo $(dig +short aaaa "$INTERNAL_ADDR" @fdaa::3)

just prints nothing :thinking:

Just in case, I tried the hostname (export NODENAME=$INTERNAL_ADDR), because that’s all I had available so far, which results in this when the CouchDB Fauxton UI tries to add a second node:

{conn_failed,{error,nxdomain}}

What am I doing wrong? Why can’t I get the IPv6 of the instance via dig or ping like this?

By the way, I followed this comment on GitHub, which suggests it might work; I followed the steps and, as mentioned, tried the hostname as suggested there :sweat_smile:

Also, I’m not using WireGuard, just to make sure we’re on the same page. And because I could query the IPv6 from my Deno app last time, I gave it a shot and made a Deno-compiled entrypoint:

const FLY_ALLOC_ID = Deno.env.get("FLY_ALLOC_ID")
const FLY_APP_NAME = Deno.env.get("FLY_APP_NAME")
const internalAddress = `${FLY_ALLOC_ID}.vm.${FLY_APP_NAME}.internal`
console.log(await Deno.resolveDns(internalAddress, "AAAA"))

// result
// error: Uncaught (in promise) NotFound: no record found for name:
// 28aa063f-2614-14f9-9684-11bd8c05f82a.vm.crcouchdb1.internal. type: AAAA class: IN

Just reread your post @ignoramous and was surprised I didn’t really take note of this part, as from the docs it seemed to me it’s just the env $FLY_ALLOC_ID as is :grimacing:

So my Deno entrypoint is now resolving the IPv6; I’m trying to get bash working now so I don’t have to rely on an 80 MB entrypoint :sweat_smile:

This does the trick

FLY_ALLOC_ID_SUB=$(cut -c 1-8 <<< $FLY_ALLOC_ID)
export INTERNAL_ADDR="$FLY_ALLOC_ID_SUB.vm.$FLY_APP_NAME.internal"

Though at least via the Couch UI I’m still unable to add the second instance as a node.
Also realised that using one app would probably still not work anyway, as those addresses change on every re-deploy. I’m not sure there’s a way to automate adjusting the changed IPs somehow, or if it’s still necessary to use independent apps for each cluster node :thinking: though if I can get it working with IPv6 at all, then separate apps wouldn’t be an issue :tipping_hand_man:t2:

Just tried via UI and API with the same errors

1 Like

Glad you’re making progress. Are you eventually planning to tunnel 4 in 6 (with tailscale, for example), to make the clustering setup work?

Just reread your post @ignoramous and was surprised I didn’t really take note of this part, as from the docs it seemed to me it’s just the env $FLY_ALLOC_ID as is

@thomas how about adding a 6pn DNS entry for fly-alloc-id.vm.app-name.internal as is? The UUID is 36 chars, well within the 63-char limit for a DNS label. It should work out nicely.

1 Like

Not sure; I probably should give it a try, though I have to look more into how it works.

This, or make it clearer in the private networking docs, or add another env var like $FLY_ALLOC_ID_SHORT or something, so people don’t have to figure out how to substring it in bash; it was kinda confusing to Google :sweat_smile:
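
For reference, the substring bit boils down to a one-liner either way (plain bash parameter expansion, or the cut variant from above):

# both yield the 8-char short id, e.g. eda70bcd
FLY_ALLOC_ID_SHORT="${FLY_ALLOC_ID:0:8}"
FLY_ALLOC_ID_SHORT=$(cut -c 1-8 <<< "$FLY_ALLOC_ID")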

1 Like

FWIW CouchDB does support ipv6 (both cluster config and as the user-facing http port). I’ve been using this over zerotier ipv6 for several years without issue.

Getting this working shouldn’t be hard, but it may require a bit more erlang skill than you’re expecting.

A quick checklist:

  • confirm DNS AAAA resolution for *.internal works in the couchdb container
  • confirm DNS AAAA resolution for *.internal works from an erlang shell in the container (quick-check sketch after this list)
  • ensure “normal” erlang/elixir clustering works in the container (epmd, port restrictions)
  • get 1 couchdb node running with ipv6 setup
  • add the rest in manually
  • figure out the rest of the owl (dynamic fly internal hostnames when nodes are restarted)
  • profit
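
For the first two items, a rough quick check (hypothetical app name; assumes dnsutils is installed and that an erl binary is reachable, which for the official image may mean pointing PATH at the bundled Erlang under /opt/couchdb):

# 1) AAAA resolution from the container's shell
dig +short aaaa myapp.internal @fdaa::3

# 2) the same lookup from an Erlang shell, to rule out resolver differences
erl -noshell -eval 'io:format("~p~n", [inet:getaddr("myapp.internal", inet6)]), halt().'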

wrt epmd, it’s not an http service so that fly.toml above looks a bit off. You should see:

  • epmd 4369 TCP connection
  • 5984 on whatever ipv6 port you have set
  • and once couchdb is up and clustered you have /_up as a JSON HTTP endpoint

To deal with the dynamic nature of the fly ipv6 addresses in containers: so long as the erlang shell succeeds in resolving those names, you can change the underlying IP without issue.

What will be tricky is that those node names will need to be visible & consistent across all containers. I’d usually use DNS CNAMEs for this, but this sounds like it could be pretty fiddly to get working consistently.

I think being able to add CNAMEd DNS records inside the *.internal zone would be a neat feature for fly to add, not just for couchdb.

I didn’t see anything quite like this in _apps.internal docs yet.

2 Likes

There are a couple ways to achieve this on Fly short of Fly letting orgs muck with internal DNS zones.

  1. Pinning a VM to a host (by mounting a volume) so its alloc-id doesn’t change as often.
  2. If <appname> only ever deploys one VM per region, then <region>.<appname>.internal is as good as a CNAME (alias) to that singular VM.
1 Like

Wow, thanks a lot for the insights and tips :pray:

I’m still a little busy but plan on looking more into this when I find the time.
Though not sure about digging more into erlang at the moment :sweat_smile:

1 Like

Is there any further development on this?

I would also love to implement a multi-region CouchDB cluster.
I think CouchDB is a great fit for the way the fly system works, for many use cases.

I’m currently not pursuing anything CouchDB-related, sorry.
For my current needs I’m using PlanetScale as the main DB, and I just got started with KeyDB multi-leader, which works fantastically on Fly. Still struggling to get RediSearch working with it though :sweat_smile: