CockroachDB example won't scale to multiple instances

Hey,

I’m seeing quite a few deployments failing with an error similar to this:

Failed due to unhealthy allocations - not rolling back to stable job version 23 as current job has same specification and deploying as v24

I’m also seeing some strange behaviour with scaling. I have created 3 volumes across 3 regions (ams, fra, cdg), but when I run fly scale count 3, I only see 1 instance running and the others don’t show up at all:

ID       VERSION REGION DESIRED STATUS  HEALTH CHECKS RESTARTS CREATED
9d223a09 24      ams    run     running               0        26m46s ago
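
For reference, this is roughly what I ran to create the volumes and scale up (the volume name and size here are placeholders, and the regions are the ones above):

# one volume per region so each instance gets its own disk
fly volumes create cockroach_data --region ams --size 10
fly volumes create cockroach_data --region fra --size 10
fly volumes create cockroach_data --region cdg --size 10

# then ask for three instances
fly scale count 3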

Is this potentially related to the same issue? Thanks.

I think this is different!

It looks like the instances are exiting. I took a look at 3dc15a0b and see an event log like this:

Recent Events:
Time                  Type            Description
2021-07-22T22:13:43Z  Killing         Sent interrupt. Waiting 2m0s before force killing
2021-07-22T22:13:42Z  Not Restarting  Exceeded allowed attempts 2 in interval 5m0s and mode is "fail"
2021-07-22T22:13:42Z  Terminated      Exit Code: 7
2021-07-22T22:13:38Z  Started         Task started by client
2021-07-22T22:13:35Z  Restarting      Task restarting in 1.097628048s
2021-07-22T22:13:35Z  Terminated      Exit Code: 7
2021-07-22T22:13:31Z  Started         Task started by client
2021-07-22T22:13:28Z  Restarting      Task restarting in 1.017842215s
2021-07-22T22:13:28Z  Terminated      Exit Code: 7
2021-07-22T22:13:24Z  Started         Task started by client

It seems like the process started, exited with code 7, and we restarted it a few times before finally giving up.

You can run flyctl vm status 3dc15a0b to see these. You can also run fly status --all to see the history of instances; the ones marked failed seem to just be crashing.

Thanks for that @kurt!

Does the exit code 7 have any significance? I am trying to run the fly-apps/cockroachdb example from GitHub across multiple regions but am struggling to get the nodes connected to each other.

I couldn’t find the Main child exited normally with code: 7 message in the CockroachDB repo, so I assume this is something coming from Fly? I found another reference to it in this thread: Dockerfile for Rails issue with - #7 by joshua

I’m using the same fly.toml from that repo and I notice there aren’t any health checks defined. How/why would Fly decide to restart this container or mark it as failing?

Exit code 7 is probably coming from Cockroach; our supervisor will log that as Main child exited normally with code: X.

When the process exits, we try and start it back up. As best I can tell, we’re not causing the process to exit.

Let me dig on this app a little and see what I can find.


Wow this is pretty weird. It seems like Cockroach isn’t detecting the cluster it’s supposed to join, but I’m not sure why!

I’ve removed some of the disks from your app and left one in ams and one in cdg. I’ll see if I can get this working with 2 members for now.

Thanks @kurt!

I did end up making a couple of tweaks to the startup script in my attempts to get it working.

I noticed a warning like this in the logs:

WARNING: neither --listen-addr nor --advertise-addr was specified.
The server will advertise "950a9219" to other nodes, is this routable?

I tried using --advertise-addr=$FLY_REGION.$FLY_APP_NAME.internal as an option to /cockroach/cockroach start, but I’m not sure if that would work. Should the VM ID, e.g. 950a9219, also be resolvable by DNS? I couldn’t see any references to it in the Private Networking docs on Fly.
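
For context, this is roughly what the start command looked like with that change (the other flags are paraphrased from the example repo’s start script, so treat this as a sketch rather than a known-good config):

exec /cockroach/cockroach start \
  --insecure \
  --advertise-addr=$FLY_REGION.$FLY_APP_NAME.internal \
  --join=$FLY_APP_NAME.internal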

Another question: is the experimental.private_network = true property in fly.toml still required to make full use of the 6PN features, or is this now enabled by default? I can see it’s enabled in the Postgres HA config (fly.toml in the fly-apps/postgres-ha repo on GitHub).
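
For what it’s worth, this is how I’ve been checking what actually resolves over the private network from inside an instance (assuming fly ssh console gives you a shell on this image and dig is installed; <app-name> and <region> are placeholders):

fly ssh console
# then, inside the VM:
dig +short aaaa <app-name>.internal            # IPv6 addresses of every instance
dig +short aaaa <region>.<app-name>.internal   # instances in one region only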

Thanks again for looking into this, I really appreciate it.

That warning is a problem. I think something changed about the way we expose hostnames (and the way Cockroach detects them) since I built that demo. The fix is somewhat simple, but it’s still crashing and I haven’t figured out why.

Half the problem here is that Cockroach isn’t logging to stdout, it’s putting logs into files by the data directories. And their log config option is some kind of voodoo!


I managed to get some more detailed logs just by adding the --logtostderr flag. I’ve pushed the changes I made to a fork of the repo: cockroachdb/start_fly.sh at main · jamesmbourne/cockroachdb · GitHub

I’ll keep trying to see if I can get this working with some additional information from the logs available.

Update:
Looks like this might be some sort of clock sync issue! I see this error just before the process exits with code 7.

clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)

This was with just 2 nodes (one in ams and one in lhr), so this sounds more like clock skew than network latency.
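
For reference, a crude way to eyeball the skew is to print the time on each node in quick succession and compare the output by hand (assuming fly ssh console gives you a shell and GNU date is in the image):

# run on each instance, one right after the other
date -u +"%H:%M:%S.%N"

This is obviously rough, but skew on the order of hundreds of milliseconds shows up clearly.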

Oh that’s helpful. It seems ntp wasn’t running on a couple of hosts in Europe. All fixed, and my test cluster is working great now.

If you change your “advertise” argument to this you should be good to go (I’ll update the example app):

exec /cockroach/cockroach start \
  --insecure \
  --logtostderr \
  --advertise-addr=$(hostname -f) \
  --locality=region=$FLY_REGION \
  --cluster-name=$FLY_APP_NAME \
  --join=$FLY_APP_NAME.internal

Great, thanks for that @kurt! I’ve managed to get a 3-node cluster up and running without any trouble. It took a couple of minutes to boot up and resolve the DNS records for the other nodes, but it was very straightforward other than that.

I’ve been using this in conjunction with Elixir/Phoenix/Ecto, which works using the standard Postgres driver. The only tweak I had to make was passing migration_lock: false to my Ecto repo, as CockroachDB does not support LOCK TABLE:

config :fly_test, FlyTest.Repo, migration_lock: false
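
For anyone who wants to poke at the cluster directly, plain psql also works since CockroachDB speaks the Postgres wire protocol. A sketch, assuming the default SQL port 26257, the --insecure flag from the start command above, and a made-up app name (run it from another instance on the private network or over a WireGuard peer so the .internal name resolves):

psql "postgresql://root@my-cockroach-app.internal:26257/defaultdb?sslmode=disable"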

Very cool! I’ve wanted to play with cockroach for an Elixir project too! Thanks for sharing the config tweak.