Deploy hanging at "Monitoring deployment" when using Overmind to provide DB access through Tailscale to CrunchyBridge

In order to set up Tailscale-based access to a CrunchyBridge Postgres database, I’ve taken a long journey to understand and test the approach here on community.fly.io (Dockerfile for elixir/phx umbrella app w/ tailscale, overmind, honeymarker) by @ryansch (I also consulted the approach in the comments there by @zachallaun).

Though that approach doesn’t explicitly include a database migration step, I added a release_command to handle it using the same Overmind/Procfile approach. For that migration step, connecting to the db works (verified independently in Release.ex using Postgrex, and with psql), right up to the point where the Migrator module apparently can’t acquire the connection it needs to lock the schema_migrations table. But that’s not the main issue.
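
For context, the Myapp.Release.migrate invoked below follows the standard mix phx.gen.release pattern; roughly this sketch (my actual module adds debugging output):

defmodule Myapp.Release do
  @app :myapp

  def migrate do
    load_app()

    for repo <- repos() do
      # with_repo/2 starts the repo just long enough to run the migrations
      {:ok, _, _} = Ecto.Migrator.with_repo(repo, &Ecto.Migrator.run(&1, :up, all: true))
    end
  end

  defp repos, do: Application.fetch_env!(@app, :ecto_repos)

  defp load_app, do: Application.load(@app)
end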

Setting the migration issue aside, I removed the release command and expected the deploy to work, given that the ENTRYPOINT, CMD, app-specific (not migrate) Procfile, and entrypoint/scripts had all tested OK for the migrate-specific Procfile.

However, with fly deploy --verbose --remote-only, the deploys just hang at the “Monitoring deployment” step, without any useful log messages.

Even a separate, simple Elixir/Phoenix app using the same files/approach hangs the same way. And for that simple app, when I check out an earlier commit with a known-good fly.io / Dockerfile configuration (the one generated by fly launch, which deployed successfully earlier), I get the same hang, unlike the earlier success.

That leads me to wonder whether some state persists across deploys, once Overmind has executed at least once in a given app’s deploy release_command.

I’ll keep taking a close look, but I seem to be at my wit’s end at the moment.

I should add that I kept Debian rather than switching to Ubuntu as in the link above. Below I’ll post my files where they differ from those in the linked example. Mine also include some debugging statements that might be useful to others looking at this.

Dockerfile:

# This also enables https://tailscale.com/kb/1193/tailscale-ssh/

# In the fly-tailscale solution below, note that `USER nobody` was removed,
# to retain root access for some operations, but using `su-exec` when
# running `nobody` commands.
#
# See how `overmind` here works with `Procfile` and fly.io:
# https://fly.io/docs/app-guides/multiple-processes/#use-a-procfile-manager
# https://github.com/DarthSim/overmind
#
# For CrunchyBridge - Tailscale - Fly.io - Phoenix integration, see also:
# - https://www.crunchydata.com/blog/crunchy-bridge-with-tailscale
# - https://tailscale.com/kb/1132/flydotio/
# - https://tailscale.com/blog/ephemeral-logout/

# Find eligible builder and runner images on Docker Hub. We use Ubuntu/Debian instead of
# Alpine to avoid DNS resolution issues in production...

ARG ELIXIR_VERSION=1.14.0
ARG OTP_VERSION=25.1

ARG DEBIAN_VERSION=bullseye-20220801-slim

ARG BUILDER_IMAGE="hexpm/elixir:${ELIXIR_VERSION}-erlang-${OTP_VERSION}-debian-${DEBIAN_VERSION}"
ARG RUNNER_IMAGE="debian:${DEBIAN_VERSION}"

ARG APP_REVISION_ARG

# FROM outstand/su-exec:latest as su-exec

#######################################
# Stage 1 => Builder stage - not in final release
#######################################

FROM ${BUILDER_IMAGE} as builder

# for Fly.io with Tailscale - from https://github.com/zachallaun/flytail/blob/main/Dockerfile
# and https://community.fly.io/t/dockerfile-for-elixir-phx-umbrella-app-w-tailscale-overmind-honeymarker/5763
SHELL ["/bin/bash", "-o", "pipefail", "-c"]
ENV DEBIAN_FRONTEND=noninteractive

# install build dependencies
RUN set -eux; \
  \
  apt-get update -y; \
  apt-get install -y \
    curl \
    ca-certificates \
  ; \
  apt-get update -y; \
  apt-get install -y \
    build-essential \
    git \
    nodejs \
    npm \
  ; \
  apt-get clean; \
  rm -f /var/lib/apt/lists/*_*

# prepare build dir
WORKDIR /app

# install hex + rebar
RUN mix local.hex --force && \
    mix local.rebar --force

# set build ENV
ENV MIX_ENV="prod"

# install mix dependencies
COPY mix.exs mix.lock ./
RUN mix deps.get --only $MIX_ENV
RUN mkdir config

# copy compile-time config files before we compile dependencies
# to ensure any relevant config change will trigger the dependencies
# to be re-compiled.
COPY config/config.exs config/appsignal.exs config/${MIX_ENV}.exs config/
RUN mix deps.compile

# compile assets
COPY lib lib
COPY priv priv
COPY assets assets

RUN cd assets && npm install
RUN mix assets.deploy

# Compile the release
RUN mix compile

# Changes to config/runtime.exs don't require recompiling the code
COPY config/runtime.exs config/

COPY rel rel
RUN mix release

####################################################################
# Stage 2 => __Start a new build stage__, so that the final image will only contain
# the compiled release and other runtime necessities
####################################################################

FROM ${RUNNER_IMAGE}

# latest seems to be 1.34.2:
ARG TAILSCALE_VERSION=1.34.2
# latest is 2.3.0:
ARG OVERMIND_VERSION=2.2.2
# COPY --from=su-exec /sbin/su-exec /sbin/su-exec
ENV TAILSCALE_VERSION=${TAILSCALE_VERSION}

SHELL ["/bin/bash", "-o", "pipefail", "-c"]
ENV DEBIAN_FRONTEND=noninteractive

# Deps for Phoenix, Tailscale for Fly.io
RUN set -eux; \
  \
  apt-get update -y; \
  apt-get install -y \
    curl \
    ca-certificates \
  ; \
  curl -fsSL https://pkgs.tailscale.com/stable/debian/bullseye.noarmor.gpg | tee /usr/share/keyrings/tailscale-archive-keyring.gpg >/dev/null; \
  curl -fsSL https://pkgs.tailscale.com/stable/debian/bullseye.tailscale-keyring.list | tee /etc/apt/sources.list.d/tailscale.list; \
  \
  apt-get update -y; \
  apt-get install -y \
    libstdc++6 \
    openssl \
    libncurses5 \
    locales \
    iptables \
    postgresql-client \
    procps \
    tailscale=${TAILSCALE_VERSION} \
    tmux \
  ; \
  apt-get clean; \
  rm -f /var/lib/apt/lists/*_*;

# See what `iptables` really is in Debian Buster here:
# - https://github.com/tailscale/tailscale/issues/391#issuecomment-1244687027
# RUN update-alternatives --set iptables /usr/sbin/iptables-legacy
# RUN mkdir -p /var/run/tailscale /var/cache/tailscale /var/lib/tailscale

ENV OVERMIND_VERSION=${OVERMIND_VERSION}
SHELL ["/bin/bash", "-o", "pipefail", "-c"]
ENV DEBIAN_FRONTEND=noninteractive

# Install Overmind for Procfile usage (see top of this file)
RUN set -eux; \
      \
      mkdir -p /tmp/build; \
      cd /tmp/build; \
      curl -fsSL https://github.com/DarthSim/overmind/releases/download/v${OVERMIND_VERSION}/overmind-v${OVERMIND_VERSION}-linux-amd64.gz | gunzip > overmind; \
      mv overmind /usr/bin/overmind; \
      chmod +x /usr/bin/overmind; \
      cd; \
      rm -rf /tmp/build

# su-exec

SHELL ["/bin/bash", "-o", "pipefail", "-c"]
ENV DEBIAN_FRONTEND=noninteractive

RUN set -eux; \
      \
    apt-get update -y; \
    apt-get -y --no-install-recommends install \
      build-essential \
    ; \
    curl -L https://github.com/ncopa/su-exec/archive/master.tar.gz \
      | tar zxfv - -C /tmp --strip-components=1 && \
    make --directory /tmp  && \
    mv /tmp/su-exec /usr/local/bin \
    ; \
    apt-get -y purge build-essential; \
    apt-get clean; \
    rm -f /var/lib/apt/lists/*_*;

# Set the locale
RUN sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && locale-gen

ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8

WORKDIR "/app"

# `-R` added as per https://community.fly.io/t/dockerfile-for-elixir-phx-umbrella-app-w-tailscale-overmind-honeymarker/5763
# (Note: ownership of the docker/ files is not changed, since the su-exec call in
#  the Procfile is the only place where processes downgrade to the nobody user.)
RUN chown -R nobody /app

# set runner ENV
ENV MIX_ENV="prod"

# Copy binaries to production image
# Only copy the final release from the build stage
COPY --from=builder --chown=nobody:root /app/_build/${MIX_ENV}/rel/myapp ./

COPY docker/Procfile-app.fly /app/Procfile-app.fly
COPY docker/Procfile-migrate.fly /app/Procfile-migrate.fly

COPY docker/tailscale-up.sh docker/wait-for-tailscale.sh /app/docker/
COPY docker/fly-entrypoint.sh /docker-entrypoint.sh

# see also OVERMIND_AUTO_RESTART, OVERMIND_ANY_CAN_DIE, OVERMIND_FORMATION,
# OVERMIND_TMUX_CONFIG, OVERMIND_SHELL, all described here:
# https://github.com/DarthSim/overmind/blob/8a5f4e2270b66c37055a7968d9909b00af859494/main.go

ENV OVERMIND_NO_PORT=1
ENV OVERMIND_CAN_DIE="tailscaleup"
ENV OVERMIND_STOP_SIGNALS="app=TERM"

# seconds (defaults to 5):
ENV OVERMIND_TIMEOUT=20

# specifying root is redundant, but explicit:
USER root

ENTRYPOINT ["/docker-entrypoint.sh"]

CMD ["overmind", "start", "-f", "/app/Procfile-app.fly"]
# CMD ["/app/bin/server"]

# Appended by flyctl
ENV ECTO_IPV6 false
# uncomment, when `ECTO_IPV6` above is `true`
# ENV ERL_AFLAGS "-proto_dist inet6_tcp"
ENV DATABASE_SSL true
ENV DATABASE_POOL_SIZE 10

# From https://github.com/zachallaun/flytail/blob/main/Dockerfile
ARG vcs_ref
LABEL org.label-schema.vcs-ref=$vcs_ref \
  org.label-schema.vcs-url="${REPOSITORY}" \
  SERVICE_TAGS=$vcs_ref
ENV VCS_REF ${vcs_ref}
ENV APP_REVISION ${vcs_ref}
RUN echo $VCS_REF >lib/myapp-0.1.0/priv/static/vcs_ref.txt

docker/fly-entrypoint.sh:

#!/bin/bash

set -euo pipefail

echo 'net.ipv4.ip_forward = 1' | tee -a /etc/sysctl.conf
echo 'net.ipv6.conf.all.forwarding = 1' | tee -a /etc/sysctl.conf

# Debugging: this `ls` showed that `sysctl` was not present (before `procps` was added), hence also trying to execute this command from the Procfile scripts
ls -l /sbin/sysctl

# On Debian, requires `procps` package to be installed (see Dockerfile)
sysctl -p /etc/sysctl.conf

exec "$@"

docker/Procfile-migrate.fly:

tailscaled: tailscaled --verbose=1 --port 41641
tailscaleup: /app/docker/tailscale-up.sh
migrate: /app/docker/wait-for-tailscale.sh su-exec nobody /app/bin/migrate

docker/Procfile-app.fly:

tailscaled: tailscaled --verbose=1 --port 41641
tailscaleup: /app/docker/tailscale-up.sh
app: /app/docker/wait-for-tailscale.sh su-exec nobody /app/bin/server
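
(docker/wait-for-tailscale.sh is unchanged from the linked example, so I’m not posting it; conceptually it blocks until tailscaled is up and then execs its arguments. A minimal sketch, assuming tailscale status exits nonzero until the node is connected:)

#!/bin/bash
set -euo pipefail

# Poll until tailscaled reports an established connection,
# then exec the wrapped command (e.g. `su-exec nobody /app/bin/server`)
until tailscale status >/dev/null 2>&1; do
  echo "waiting for tailscale..."
  sleep 1
done

exec "$@"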

docker/tailscale-up.sh:

(Section below was modified, to rule out Tailscale-side issues with machines going up/down)

.
.

# a timestamp is also added to the hostname, to deduplicate machine names as necessary
tailscale up \
  "--authkey=${TAILSCALE_AUTHKEY}" \
  "--hostname=${TAILSCALE_MACHINE_NAME:-changeme}-${FLY_REGION}-$(date +%s)" \
  --accept-routes=true \
  --ssh

rel/overlays/bin/migrate:

#!/bin/bash

echo "Migrating Phoenix app database"

# Phoenix app
cd -P -- "$(dirname -- "$0")"

# echo "Executing migration"
# original file's version:
#exec ./myapp eval Myapp.Release.migrate

echo "Executing migration (without exec, since already in su-exec context)"
./myapp eval Myapp.Release.migrate

rel/overlays/bin/server:

#!/bin/sh

echo "Starting Phoenix app server"

# Phoenix app
cd -P -- "$(dirname -- "$0")"

# echo "Executing server start"
# original file's version:
# PHX_SERVER=true exec ./myapp start

echo "Executing server start (without exec, since already in su-exec context)"
PHX_SERVER=true ./myapp start

config/runtime.exs:

(NOTE: DATABASE_SSL must be true)

.
.
.
  # not verifying, until rest is working
  ssl_opts = [verify: :verify_none, log_level: :error]

  config :myapp, Myapp.Repo,
    hostname: env!.("DATABASE_HOST"),
    # Postgrex expects an integer port:
    port: String.to_integer(env.("DATABASE_PORT") || "5432"),
    username: env!.("DATABASE_USER"),
    password: env!.("DATABASE_PASSWORD"),
    database: env.("MYAPP_DB") || "myapp",
    # must be `true` for CrunchyBridge DB connections:
    ssl: (env.("DATABASE_SSL") || "false") == "true",
    verify_ssl: false,
    ssl_opts: ssl_opts,
    pool_size: String.to_integer(env.("DATABASE_POOL_SIZE") || "10"),
    socket_options: maybe_ipv6
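
(The env/env! calls above are anonymous-function captures defined in the elided portion, presumably along the lines of:)

env = &System.get_env/1
env! = &System.fetch_env!/1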

I’m reading up now!

@modellurgist Could you share your fly.toml?

Here’s an example from one of our apps:

app = "myapp"

kill_signal = "SIGTERM"
kill_timeout = 7
processes = []

[build]
  dockerfile = "Dockerfile.fly"

[deploy]
  strategy = "canary"
  release_command = "/app/bin/mark_deploy"

[env]
  PHX_HOST = "something.fly.dev"
  PORT = "8080"

statics = []

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"
  tcp_checks = []
  script_checks = []

  [services.concurrency]
    hard_limit = 2500
    soft_limit = 2000
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  # [[services.tcp_checks]]
  #   grace_period = "1s"
  #   interval = "15s"
  #   restart_limit = 0
  #   timeout = "2s"

  [[services.http_checks]]
    grace_period = "10s"
    interval = "5s"
    restart_limit = 0
    method = "get"
    path = "/health_check"
    protocol = "http"
    timeout = "2s"
    tls_skip_verify = true

And here’s an example plug for http health checking:
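A minimal sketch of such a plug (the module name MyAppWeb.HealthCheck is a placeholder; it short-circuits the request before the router runs):

defmodule MyAppWeb.HealthCheck do
  import Plug.Conn

  def init(opts), do: opts

  # Answer the health check directly and halt the plug pipeline
  def call(%Plug.Conn{request_path: "/health_check"} = conn, _opts) do
    conn
    |> send_resp(200, "ok")
    |> halt()
  end

  def call(conn, _opts), do: conn
end

Plug it into endpoint.ex ahead of the router, e.g. plug MyAppWeb.HealthCheck, so the /health_check path in http_checks above always gets a fast 200.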

You may also want to try running fly apps destroy <builder name> to force fly to give you a fresh builder.

Hi @modellurgist! I don’t know if you’ve checked whether your app actually got deployed or not. Worth a check…

You can use another terminal to run fly status and tail logs with fly logs, to get a better idea where things aren’t going as expected.

Something else to try is to see if a simple app deploys OK for you right now.

Things like IP addresses and secrets that you set through flyctl, as well as anything you’ve written to an attached Fly Volume or a separate database, will persist across deployments. The VM’s rootfs gets built fresh on each deployment.

@ryansch thanks for taking a look. A classic, silly blocker explains the “deploy hanging” problem: I had run fly scale count 0 to bring down the server several days ago and forgot to scale back up to 1 before running fly deploy. Current status: the app is back up, but the db schema migrations seem to need some additional permissions for the db user they run as.
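
In other words, the fix was simply:

fly scale count 1
fly deploy --verbose --remote-only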

(After realizing I hadn’t yet searched for the specific problem, “hanging at Monitoring Deployment”, another community post triggered an “aha” moment, and I was then able to iterate to success on the app deploy stage.)

After fully resolving the rest of the issues affecting schema migrations, I’ll summarize the remaining changes that were necessary to the above files.

Here’s my fly.toml. (While testing the deploy, I removed the release_command, which is used below just for migrations, until I had fixed all the problems with that stage. Status: I think I’m down to just a few more db permissions issues, since this work is being done in the context of migrating data to a new db service, Crunchy Bridge.) Your file has some changes that I’ll study and consider in the near future.

# fly.toml file generated for myapp on 2022-02-21T12:05:43Z
# - added `force_https = true` to _both_ `services.ports` sections

app = "myapp"

kill_signal = "SIGTERM"
kill_timeout = 5
processes = []

[deploy]
  release_command = "overmind start -f /app/Procfile-migrate.fly"

[env]
  PHX_HOST = "app.myapp.com"
  PORT = "8080"

[experimental]
  allowed_public_ports = []
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "30s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

I’ve totally done that! I’m glad you got the scaling issue figured out.

Here is the complete solution, covering both the blocker and the remaining issues I had to resolve to restore full migrate and deploy capability:

  • to unblock and get the expected output during deploy, ran fly scale count 1 once (since the count was at 0) before running the deploy command above
  • uncommented the Dockerfile line ENV ERL_AFLAGS "-proto_dist inet6_tcp", since it seemed to be needed for the Fly.io network (even though the database is outside that network)
  • set DATABASE_URL to the CrunchyBridge DB’s IPv6 address, which then required changing ECTO_IPV6 to true in the Dockerfile
  • added after_connect: {EctoConnectionCheck, :log_connection_success, [Myapp.Repo]} to the Repo config options in runtime.exs above, in order to confirm that the connection succeeded (ruling out some possible issues under consideration; see the sketch after this list)
  • granted / altered some database permissions (not necessary in general, if things are already set up correctly)
  • added back to the fly.toml the deploy release_command above, which calls the migrate-specific Procfile above via overmind
  • in Myapp.Release, attempted to add :debug in the options to Ecto.Migrator.run/3, but they were ineffective: log_level, log_migrations_sql, log_migrator_sql (will re-test and file a bug report on ecto_sql as necessary)
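
(EctoConnectionCheck is a small helper of my own; a minimal sketch of the idea, relying on Postgrex prepending the connection to the configured args when it invokes the after_connect MFA:)

defmodule EctoConnectionCheck do
  require Logger

  # Invoked by Postgrex as log_connection_success(conn, Myapp.Repo)
  # after each successful database connection
  def log_connection_success(conn, repo) do
    Logger.info("#{inspect(repo)}: database connection established (#{inspect(conn)})")
  end
end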