I’m seeing quite a few deployments failing with an error similar to this:
Failed due to unhealthy allocations - not rolling back to stable job version 23 as current job has same specification and deploying as v24
I’m also seeing some strange behaviour with scaling. I have created 3 volumes across 3 regions (ams, fra, cdg), but when I run fly scale count 3, I only see 1 instance running and the others don’t show up at all:
ID        VERSION  REGION  DESIRED  STATUS   HEALTH CHECKS  RESTARTS  CREATED
9d223a09  24       ams     run      running                 0         26m46s ago
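For context, this is roughly how I set things up; the volume name and size below are placeholders rather than my exact values:

# Sketch of the setup described above (volume name and size are placeholders)
fly volumes create cockroach_data --region ams --size 10
fly volumes create cockroach_data --region fra --size 10
fly volumes create cockroach_data --region cdg --size 10
fly scale count 3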
Is this potentially related to the same issue? Thanks.
It looks like the instances are exiting. I took a look at 3dc15a0b and saw an event log like this:
Recent Events:
Time                  Type            Description
2021-07-22T22:13:43Z  Killing         Sent interrupt. Waiting 2m0s before force killing
2021-07-22T22:13:42Z  Not Restarting  Exceeded allowed attempts 2 in interval 5m0s and mode is "fail"
2021-07-22T22:13:42Z  Terminated      Exit Code: 7
2021-07-22T22:13:38Z  Started         Task started by client
2021-07-22T22:13:35Z  Restarting      Task restarting in 1.097628048s
2021-07-22T22:13:35Z  Terminated      Exit Code: 7
2021-07-22T22:13:31Z  Started         Task started by client
2021-07-22T22:13:28Z  Restarting      Task restarting in 1.017842215s
2021-07-22T22:13:28Z  Terminated      Exit Code: 7
2021-07-22T22:13:24Z  Started         Task started by client
It seems like the process started and exited with code 7; we restarted it a few times and then finally gave up.
You can run flyctl vm status 3dc15a0b to see these. You can also run fly status --all to see the history of instances; the ones marked failed seem to just be crashing.
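For reference, the exact commands (using the instance ID from this thread):

flyctl vm status 3dc15a0b
fly status --all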
Does the exit code 7 have any significance? I am trying to run fly-apps/cockroachdb (GitHub) across multiple regions but am struggling to get the nodes connected to each other.
I couldn’t find the Main child exited normally with code: 7 message in the CockroachDB repo, so I assume this is something coming from Fly? I found another reference to it in this thread: Dockerfile for Rails issue with - #7 by joshua
I’m using the same fly.toml from that repo, and I notice there aren’t any health checks defined, so how/why would Fly decide to restart this container or mark it as failing?
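For what it’s worth, my understanding is that checks would normally be declared in fly.toml under the service. A rough sketch of what that might look like is below; the port and timings are my own guesses at CockroachDB’s SQL port, not anything taken from the repo:

# Sketch only, not from the fly-apps/cockroachdb repo; port and timings are assumptions
[[services]]
  internal_port = 26257
  protocol = "tcp"

  [[services.tcp_checks]]
    grace_period = "5s"
    interval = "15s"
    timeout = "2s"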
I did end up making a couple of tweaks to the startup script in my attempts to get it working.
I noticed this warning in the logs, e.g.:
WARNING: neither --listen-addr nor --advertise-addr was specified.
The server will advertise "950a9219" to other nodes, is this routable?
I tried using --advertise-addr=$FLY_REGION.$FLY_APP_NAME.internal as an option to /cockroach/cockroach start, but I’m not sure if that would work. Should the VM ID, e.g. 950a9219, also be resolvable by DNS? I couldn’t see any references to it in Private Networking · Fly
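For reference, this is the shape of the start command I was experimenting with; the --join list and the other flags are my own guesses rather than a known-good configuration:

# Sketch only: flag values are assumptions, not a confirmed working setup
/cockroach/cockroach start \
  --insecure \
  --advertise-addr="$FLY_REGION.$FLY_APP_NAME.internal" \
  --join="ams.$FLY_APP_NAME.internal,fra.$FLY_APP_NAME.internal,cdg.$FLY_APP_NAME.internal" \
  --store=/cockroach/cockroach-data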
Another question: is the experimental.private_network = true property in fly.toml still required to make full use of the 6PN features, or is this now enabled by default? I can see it is enabled in the Postgres HA config (postgres-ha/fly.toml at main · fly-apps/postgres-ha · GitHub).
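For reference, the property in that config looks like this (reproduced from memory, so treat it as approximate):

[experimental]
  private_network = true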
Thanks again for looking into this, I really appreciate it.
That warning is a problem. I think something changed since I built that demo in the way we expose hostnames and the way Cockroach detects them. The fix is somewhat simple, but it’s still crashing and I haven’t figured out why.
Half the problem here is that Cockroach isn’t logging to stdout; it’s putting logs into files next to the data directories. And their log config option is some kind of voodoo!
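Something like this might coax the logs onto stderr so they show up in fly logs, assuming a Cockroach version new enough to accept a YAML spec via --log (untested here, so just a sketch):

# Sketch: send Cockroach logs to stderr instead of files; needs the YAML --log flag (v21.1+)
/cockroach/cockroach start --insecure \
  --log="sinks: {stderr: {filter: INFO}}" \
  --store=/cockroach/cockroach-data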
Great, thanks for that @kurt! I’ve managed to get a 3-node cluster up and running without any trouble. It took a couple of minutes to boot up and resolve the DNS records for the other nodes, but it was very straightforward other than that.
I’ve been using this in conjunction with Elixir/Phoenix/Ecto, which works using the standard Postgres driver. The only tweak I had to make was passing migration_lock: false to my Ecto repo, as CockroachDB does not support LOCK TABLE.
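A minimal sketch of that change, with the app and repo module names as placeholders:

# config/config.exs
# Placeholder names; the relevant option is migration_lock: false
config :my_app, MyApp.Repo,
  migration_lock: false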