Having trouble getting initial app deployment to succeed, hangs at "Monitoring deployment"

Hey folks,

I’m migrating a Django app from Heroku and am having trouble getting a first successful deployment. My deploys hang at the “Monitoring deployment” stage and eventually fail after several minutes. The only way I’ve managed to complete the process is with --strategy immediate, but the app doesn’t run when deployed that way.

This is my fly.toml (I’ve commented out the release_command to try to isolate the problem):

app = "<myapp>-web"
kill_signal = "SIGTERM"
kill_timeout = 30

[env]
  DJANGO_SETTINGS_MODULE = "<myapp>.config.production.settings"

[deploy]
#   release_command = "django-admin migrate --noinput"

[[services]]
  internal_port = 8080
  protocol = "tcp"

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "requests"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = "80"

  [[services.ports]]
    handlers = ["tls", "http"]
    port = "443"

  [[services.http_checks]]
    grace_period = "30s"
    interval = "15s"
    method = "get"
    path = "/"
    protocol = "http"
    restart_limit = 0
    timeout = "2s"
    tls_skip_verify = false

Note that I added the http_checks section later because I wondered whether having no health checks meant my app could never be marked healthy. However, that doesn’t seem to have made any difference.
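For anyone wanting to rule out the app itself, here is a minimal sketch of a local probe that issues the same kind of request the http_checks section describes: a GET on "/" with a 2-second timeout. It assumes the container is running locally with port 8080 published, and it treats any 2xx response as healthy, which is my assumption about how the check is scored. The function name `probe` is made up for illustration.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    # Assumption: the image was started locally with port 8080 published.
    print("healthy" if probe("http://127.0.0.1:8080/") else "unhealthy")
```

If this reports "healthy" locally while the deploy still fails, the problem is more likely on the platform side than in the check configuration.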

This is my Dockerfile:

# syntax=docker/dockerfile:1

FROM python:3.8-slim as base

# Set up environment

# Start a new build stage to build dependencies
FROM base AS python-deps

# Install pipenv and compilation dependencies, including those for mysqlclient and pillow
RUN pip install pipenv
RUN apt-get update && apt-get install -y --no-install-recommends gcc build-essential python3-dev default-libmysqlclient-dev zlib1g-dev libjpeg62-turbo-dev libopenjp2-7-dev libtiff5-dev libwebp-dev libfreetype6 libfreetype6-dev
RUN apt-get clean

# Install python dependencies in /.venv
COPY Pipfile .
COPY Pipfile.lock .
RUN pipenv install --deploy

# Start a new build stage to receive the result of dependency installation
FROM base AS runtime

# Create and switch to a new user
RUN useradd --create-home --user-group appuser
WORKDIR /home/appuser
USER appuser

# Copy virtual env from python-deps stage
COPY --from=python-deps --chown=appuser:appuser /.venv /.venv

# Configure PATH to include venv bin directory, and PYTHONPATH to include app directory
# This mimics our previous use of `python setup.py develop`
ENV PATH="/.venv/bin:$PATH"

# Copy application into container
COPY . .

# Collect static files
RUN django-admin collectstatic --noinput --settings <myapp>.config.production.settings

# Expose port
EXPOSE 8080

# Run gunicorn
# note for readers – this was previously an ipv4-only binding
CMD ["gunicorn", "<myapp>.wsgi:application", "--bind", ":8080", "--workers", "4", "--access-logfile", "-", "--error-logfile", "-"]

The deploy process stops at the monitoring step for several minutes before failing:

==> Monitoring deployment
Logs: https://fly.io/apps/climb-bandits-api-web/monitoring

 1 desired, 1 placed, 0 healthy, 1 unhealthy
--> v7 failed - Failed due to unhealthy allocations - not rolling back to stable job version 7 as current job has same specification and deploying as v8

--> Troubleshooting guide at https://fly.io/docs/getting-started/troubleshooting/
Error abort

I’ve tried building and running the Dockerfile locally and it seems to work fine. Additionally, fly doctor reports no issues. Despite the release being marked as “unhealthy”, there are no checks listed in fly checks:

Health Checks for <myapp>-web

Most maddeningly, I don’t get any logs in the dashboard nor through fly logs.

What am I doing wrong? 😃 Thanks!

Quick update, as I learn more fly commands. Here is the output from fly vm status for the most recent instance of the app:

  ID            = 1d045f7f
  Process       = app
  Version       = 26
  Region        = lhr
  Desired       = run
  Status        = pending
  Health Checks =
  Restarts      = 0
  Created       = 1m22s ago

2022-12-22T14:06:39Z	Received  	Task received by client
2022-12-22T14:06:39Z	Task Setup	Building Task Directory
2022-12-22T14:06:43Z	Template  	Missing: vault.read(apps/data/555851/870811)


Recent Logs

I’ve tried making gunicorn bind on IPv6 as well as IPv4, and I’ve added a TCP health check to the config, but I get the same results and, crucially, no log output anywhere.
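As an illustration of the binding change, here is a sketch of a gunicorn config file that asks for an explicit IPv6 wildcard bind on port 8080. The file name and its use via `gunicorn -c gunicorn.conf.py <myapp>.wsgi:application` are assumptions, not what the Dockerfile above does. On many Linux hosts an IPv6 wildcard socket also accepts IPv4 connections, but that depends on the host's network settings, so treat it as something to verify rather than a guarantee.

```python
# gunicorn.conf.py (hypothetical file; not part of the original setup)

# Bind to the IPv6 wildcard address on the port named in fly.toml's internal_port.
bind = "[::]:8080"

# Same values as the command-line flags in the Dockerfile's CMD.
workers = 4
accesslog = "-"  # write access logs to stdout
errorlog = "-"   # write error logs to stderr
```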

It looks like, somehow, your secret didn’t make it to our database. The scheduler thinks it needs to fetch a secret but it can’t find it.

I have removed the non-existent one (that’s a weird thing to say) and it looks like the deploy went through.

Did you set 2 secrets? I see you hit an error trying to deploy an Upstash Redis instance; I wonder if that added a secret, then rolled back, and left your app in this weird state.
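For readers hitting the same symptom, the telltale sign was the `Template` event with `Missing: vault.read(...)` in the fly vm status output earlier in the thread. A small sketch for spotting it in saved status output follows. The function name is made up, and the pattern is based only on the single event line shown in this thread, so other output formats may not match.

```python
import re
from typing import List

# Matches the path inside an event such as
# "Template   Missing: vault.read(apps/data/555851/870811)".
MISSING_SECRET = re.compile(r"Missing:\s*vault\.read\(([^)]+)\)")


def find_missing_secrets(status_output: str) -> List[str]:
    """Return the vault paths reported as missing, in order of appearance."""
    return MISSING_SECRET.findall(status_output)


if __name__ == "__main__":
    # Event lines copied from the vm status output in this thread.
    sample = (
        "2022-12-22T14:06:39Z\tTask Setup\tBuilding Task Directory\n"
        "2022-12-22T14:06:43Z\tTemplate  \tMissing: vault.read(apps/data/555851/870811)\n"
    )
    print(find_missing_secrets(sample))  # ['apps/data/555851/870811']
```

An empty list means no such event was found in the text; a non-empty one suggests the instance is stuck waiting on a secret that the scheduler cannot read.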

Hey @jerome, thanks for the quick reply. AFAIK I’ve only ever had the DATABASE_URL secret, so I assume this ghost secret arose from the issues with provisioning Upstash Redis yesterday.

Unfortunately I still can’t seem to successfully deploy from this end – despite commenting out both of my custom health checks and reducing the number of gunicorn workers to 1 to try to avoid possible OOM errors. Running vm status on my latest vm suggests the “missing: vault.read” error relating to secrets is still occurring.

I still can’t see any other logs from deployment – any insights from your side?

Thanks again.

Just tried another deployment having left it for a day – seems OK now 🙂 thanks for your help with this!