CockroachDB example won't scale to multiple instances

Hey,

I’m seeing quite a few deployments failing with an error similar to this:

Failed due to unhealthy allocations - not rolling back to stable job version 23 as current job has same specification and deploying as v24

I’m also seeing some strange behaviour with scaling. I have created 3 volumes across 3 regions (ams, fra, cdg), but when I run fly scale count 3, I only see 1 instance running and the others don’t show up at all:

ID       VERSION REGION DESIRED STATUS  HEALTH CHECKS RESTARTS CREATED
9d223a09 24      ams    run     running               0        26m46s ago
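
For reference, this is roughly what I ran to create the volumes and scale up (the volume name and size here are placeholders, and the regions are the ones above):

# one volume per region so each instance gets its own disk
fly volumes create cockroach_data --region ams --size 10
fly volumes create cockroach_data --region fra --size 10
fly volumes create cockroach_data --region cdg --size 10

# then ask for three instances
fly scale count 3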

Is this potentially related to the same issue? Thanks.

I think this is different!

It looks like the instances are exiting. I took a look at 3dc15a0b and see an event log like this:

Recent Events:
Time                  Type            Description
2021-07-22T22:13:43Z  Killing         Sent interrupt. Waiting 2m0s before force killing
2021-07-22T22:13:42Z  Not Restarting  Exceeded allowed attempts 2 in interval 5m0s and mode is "fail"
2021-07-22T22:13:42Z  Terminated      Exit Code: 7
2021-07-22T22:13:38Z  Started         Task started by client
2021-07-22T22:13:35Z  Restarting      Task restarting in 1.097628048s
2021-07-22T22:13:35Z  Terminated      Exit Code: 7
2021-07-22T22:13:31Z  Started         Task started by client
2021-07-22T22:13:28Z  Restarting      Task restarting in 1.017842215s
2021-07-22T22:13:28Z  Terminated      Exit Code: 7
2021-07-22T22:13:24Z  Started         Task started by client

It seems like the process started, exited with code 7, and we restarted it a few times before finally giving up.

You can run flyctl vm status 3dc15a0b to see these. You can also run fly status --all to see the history of instances; the ones marked failed seem to just be crashing.

Thanks for that @kurt!

Does the exit code 7 have any significance? I am trying to run the fly-apps/cockroachdb example from GitHub across multiple regions but am struggling to get the nodes connected to each other.

I couldn’t find the Main child exited normally with code: 7 message in the CockroachDB repo, so I assume this is something coming from Fly? I found another reference to it in this thread: Dockerfile for Rails issue with - #7 by joshua

I’m using the same fly.toml from that repo and I notice there aren’t any health checks defined. How/why would Fly decide to restart this container or mark it as failing?

Exit code 7 is probably coming from Cockroach; our supervisor will log that as Main child exited normally with code: X.

When the process exits, we try and start it back up. As best I can tell, we’re not causing the process to exit.

Let me dig on this app a little and see what I can find.


Wow this is pretty weird. It seems like Cockroach isn’t detecting the cluster it’s supposed to join, but I’m not sure why!

I’ve removed some of the disks from your app and left one in ams and one in cdg. I’ll see if I can get this working with 2 members for now.

Thanks @kurt!

I did end up making a couple of tweaks to the startup script in my attempts to get it working.

I noticed a warning like this in the logs:

WARNING: neither --listen-addr nor --advertise-addr was specified.
The server will advertise "950a9219" to other nodes, is this routable?

I tried using --advertise-addr=$FLY_REGION.$FLY_APP_NAME.internal as an option to /cockroach/cockroach start, but I’m not sure if that would work. Should the VM ID, e.g. 950a9219, also be resolvable by DNS? I couldn’t see any references to it in the Private Networking docs on Fly.
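
For context, this is roughly what the start command looked like with that change (the other flags are paraphrased from the example repo’s start script, so treat this as a sketch rather than a known-good config):

exec /cockroach/cockroach start \
  --insecure \
  --advertise-addr=$FLY_REGION.$FLY_APP_NAME.internal \
  --join=$FLY_APP_NAME.internal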

Another question: is the experimental.private_network = true property in fly.toml still required to make full use of the 6PN features, or is this now enabled by default? I can see it’s enabled in the Postgres HA config (fly.toml in the fly-apps/postgres-ha repo on GitHub).
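
For what it’s worth, this is how I’ve been checking what actually resolves over the private network from inside an instance (assuming fly ssh console gives you a shell on this image and dig is installed; <app-name> and <region> are placeholders):

fly ssh console
# then, inside the VM:
dig +short aaaa <app-name>.internal            # IPv6 addresses of every instance
dig +short aaaa <region>.<app-name>.internal   # instances in one region only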

Thanks again for looking into this, I really appreciate it.

That warning is a problem. I think something changed about the way we expose hostnames (and the way Cockroach detects them) since I built that demo. The fix is somewhat simple, but it’s still crashing and I haven’t figured out why.

Half the problem here is that Cockroach isn’t logging to stdout, it’s putting logs into files by the data directories. And their log config option is some kind of voodoo!


I managed to get some more detailed logs just by adding the --logtostderr flag. I’ve pushed the changes I made to a fork of the repo: cockroachdb/start_fly.sh at main · jamesmbourne/cockroachdb · GitHub

I’ll keep trying to see if I can get this working with some additional information from the logs available.

Update:
Looks like this might be some sort of clock sync issue! I see this error just before the process exits with code 7.

clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)

This was with just 2 nodes (one in ams and one in lhr), so this sounds more like clock skew than network latency.
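
For reference, a crude way to eyeball the skew is to print the time on each node in quick succession and compare the output by hand (assuming fly ssh console gives you a shell and GNU date is in the image):

# run on each instance, one right after the other
date -u +"%H:%M:%S.%N"

This is obviously rough, but skew on the order of hundreds of milliseconds shows up clearly.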

Oh that’s helpful. It seems ntp wasn’t running on a couple of hosts in Europe. All fixed, and my test cluster is working great now.

If you change your “advertise” argument to this you should be good to go (I’ll update the example app):

exec /cockroach/cockroach start \
  --insecure \
  --logtostderr \
  --advertise-addr=$(hostname -f) \
  --locality=region=$FLY_REGION \
  --cluster-name=$FLY_APP_NAME \
  --join=$FLY_APP_NAME.internal

Great, thanks for that @kurt! I’ve managed to get a 3-node cluster up and running without any trouble. It took a couple of minutes to boot up and resolve the DNS records for the other nodes, but it was very straightforward other than that.

I’ve been using this in conjunction with Elixir/Phoenix/Ecto, which works using the standard Postgres driver. The only tweak I had to make was passing migration_lock: false to my Ecto repo, as CockroachDB does not support LOCK TABLE:

config :fly_test, FlyTest.Repo, migration_lock: false
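
For anyone who wants to poke at the cluster directly, plain psql also works since CockroachDB speaks the Postgres wire protocol. A sketch, assuming the default SQL port 26257, the --insecure flag from the start command above, and a made-up app name (run it from another instance on the private network or over a WireGuard peer so the .internal name resolves):

psql "postgresql://root@my-cockroach-app.internal:26257/defaultdb?sslmode=disable"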

Very cool! I’ve wanted to play with cockroach for an Elixir project too! Thanks for sharing the config tweak.