timeout reached waiting for health checks to pass for machine

I’m facing this error when deploying:

Deployment logs

> [1/2] Clearing lease for 3d8dd959b70068
✔ [2/2] Cleared lease for 7811099c940318
✔ [1/2] Cleared lease for 3d8dd959b70068
Error: failed to update machine 7811099c940318: Unrecoverable error: timeout reached waiting for health checks to pass for machine 7811099c940318: failed to get VM 7811099c940318: Get "https://api.machines.dev/v1/apps/riley-api/machines/7811099c940318": net/http: request canceled
Error: Process completed with exit code 1.

Machine logs

2024-08-21T20:29:53Z runner[7811099c940318] sjc [info]Pulling container image registry.fly.io/riley-api:deployment-01J5VA2S6TGY3CZV607CJTAFS1
2024-08-21T20:29:54Z runner[3d8dd959b70068] sjc [info]Pulling container image registry.fly.io/riley-api:deployment-01J5VA2S6TGY3CZV607CJTAFS1
2024-08-21T20:30:06Z runner[7811099c940318] sjc [info]Configuring firecracker
2024-08-21T20:30:06Z runner[3d8dd959b70068] sjc [info]Successfully prepared image registry.fly.io/riley-api:deployment-01J5VA2S6TGY3CZV607CJTAFS1 (12.169936357s)
2024-08-21T20:30:06Z app[7811099c940318] sjc [info] INFO Sending signal SIGINT to main child process w/ PID 318
2024-08-21T20:30:06Z app[7811099c940318] sjc [info][2024-08-21 20:30:06 +0000] [318] [INFO] Handling signal: int
2024-08-21T20:30:06Z app[7811099c940318] sjc [info][2024-08-21 20:30:06 +0000] [318] [WARNING] Worker with pid 324 was terminated due to signal 3
2024-08-21T20:30:06Z app[7811099c940318] sjc [info][2024-08-21 20:30:06 +0000] [318] [WARNING] Worker with pid 326 was terminated due to signal 3
2024-08-21T20:30:06Z app[7811099c940318] sjc [info][2024-08-21 20:30:06 +0000] [318] [WARNING] Worker with pid 325 was terminated due to signal 3
2024-08-21T20:30:06Z app[7811099c940318] sjc [info][2024-08-21 20:30:06 +0000] [318] [WARNING] Worker with pid 323 was terminated due to signal 3
2024-08-21T20:30:06Z app[7811099c940318] sjc [info][2024-08-21 20:30:06 +0000] [318] [INFO] Shutting down: Master
2024-08-21T20:30:06Z app[7811099c940318] sjc [info] INFO Main child exited normally with code: 0
2024-08-21T20:30:06Z app[7811099c940318] sjc [info] INFO Starting clean up.
2024-08-21T20:30:06Z app[7811099c940318] sjc [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2024-08-21T20:30:06Z app[7811099c940318] sjc [info][ 5731.812371] reboot: Restarting system
2024-08-21T20:30:10Z app[7811099c940318] sjc [info] INFO Starting init (commit: 20f21dc5f)...
2024-08-21T20:30:10Z app[7811099c940318] sjc [info] INFO Preparing to run: `gunicorn riley_backend.main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000 --timeout 300` as root
2024-08-21T20:30:10Z app[7811099c940318] sjc [info] INFO [fly api proxy] listening at /.fly/api
2024-08-21T20:30:10Z runner[7811099c940318] sjc [info]Machine created and started in 17.087s
2024-08-21T20:30:10Z app[7811099c940318] sjc [info]2024/08/21 20:30:10 INFO SSH listening listen_address=[fdaa:9:8686:a7b:181:53a4:e1f2:2]:22 dns_server=[fdaa::3]:53
2024-08-21T20:30:11Z app[7811099c940318] sjc [info][2024-08-21 20:30:11 +0000] [318] [INFO] Starting gunicorn 20.1.0
2024-08-21T20:30:11Z app[7811099c940318] sjc [info][2024-08-21 20:30:11 +0000] [318] [INFO] Listening at: http://0.0.0.0:8000 (318)
2024-08-21T20:30:11Z app[7811099c940318] sjc [info][2024-08-21 20:30:11 +0000] [318] [INFO] Using worker: uvicorn.workers.UvicornWorker
2024-08-21T20:30:11Z app[7811099c940318] sjc [info][2024-08-21 20:30:11 +0000] [323] [INFO] Booting worker with pid: 323
2024-08-21T20:30:11Z app[7811099c940318] sjc [info][2024-08-21 20:30:11 +0000] [324] [INFO] Booting worker with pid: 324
2024-08-21T20:30:11Z app[7811099c940318] sjc [info][2024-08-21 20:30:11 +0000] [325] [INFO] Booting worker with pid: 325
2024-08-21T20:30:11Z app[7811099c940318] sjc [info][2024-08-21 20:30:11 +0000] [326] [INFO] Booting worker with pid: 326
2024-08-21T20:30:42Z app[7811099c940318] sjc [info]DEBUG:filelock:Attempting to acquire lock 140641993491232 on /root/.cache/huggingface/hub/.locks/models--gpt2/be4d21d94f3b4687e5a54d84bf6ab46ed0f8defd.lock
2024-08-21T20:30:42Z app[7811099c940318] sjc [info]DEBUG:filelock:Lock 140641993491232 acquired on /root/.cache/huggingface/hub/.locks/models--gpt2/be4d21d94f3b4687e5a54d84bf6ab46ed0f8defd.lock
2024-08-21T20:30:42Z app[7811099c940318] sjc [info]DEBUG:filelock:Attempting to acquire lock 140641993745824 on /root/.cache/huggingface/hub/.locks/models--

I’ve tried everything. None of the following suggested solutions worked:

  1. Assigning more memory
  2. Adding a timeout
  3. Implementing graceful shutdown to handle long-running tasks more effectively

This is my Dockerfile and fly.toml:

# Build stage
FROM python:3.12-slim AS builder

WORKDIR /build

# Copy only the files needed for installing dependencies
COPY xxxx/pyproject.toml xxxx/poetry.lock* ./

RUN pip install --no-cache-dir poetry

# Export runtime dependencies
RUN poetry export -f requirements.txt --output requirements.txt --without-hashes

# Copy the entire xxxx directory
COPY xxxx ./xxxx

# Final stage
FROM python:3.12-slim

WORKDIR /app

ENV MODAL_TOKEN_ID=${MODAL_TOKEN_ID}
ENV MODAL_TOKEN_SECRET=${MODAL_TOKEN_SECRET}

# Install ffmpeg and other necessary packages
RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/*

# Copy only runtime requirements
COPY --from=builder /build/requirements.txt .

# Install runtime dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code from the builder stage
COPY --from=builder /build/xxxxx /app

# Set Python path
ENV PYTHONPATH=/app

EXPOSE 8000

CMD ["gunicorn", "xxxxx.main:app", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8000", "--timeout", "300"]

fly.toml

app = 'xxxxxx'
primary_region = 'sjc'

[build]
dockerfile = "./Dockerfile"

[http_service]
  internal_port = 8000
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  processes = ['app'] 

[[vm]]
  memory = '2gb'
  cpu_kind = 'shared'
  cpus = 1

[scale]
  min = 1
  max = 2

[[services]]
  http_checks = []
  internal_port = 8000
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "60s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

No success yet!

You have duplicate services listening on the same internal port: both [http_service] and [[services]] in your fly.toml set internal_port = 8000.
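
One way to act on this, as a rough sketch rather than a confirmed fix: keep a single service definition for port 8000 and delete the other. The fragment below assumes `[http_service]` is the one kept, the whole `[[services]]` block (ports, concurrency, and `tcp_checks`) is removed, and the old TCP check is swapped for an HTTP check. The `/health` path is hypothetical; point it at whatever route your app actually serves. The `app`, `[build]`, and `[[vm]]` sections stay as posted.

```toml
# Sketch only: one service definition for internal port 8000.
# The [[services]] block from the original fly.toml is deleted entirely.
[http_service]
  internal_port = 8000
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1
  processes = ['app']

  # Same limits the removed [[services]] block used.
  [http_service.concurrency]
    type = "connections"
    soft_limit = 20
    hard_limit = 25

  # HTTP check in place of the removed TCP check; timings copied from the
  # original [[services.tcp_checks]] block.
  [[http_service.checks]]
    grace_period = "60s"
    interval = "15s"
    timeout = "2s"
    method = "GET"
    path = "/health"  # hypothetical route, not taken from the original post
```

If the app needs a long time to start (the machine logs show model files being fetched more than 30 seconds after the workers boot), a longer `grace_period` may also be worth trying, though that is a guess and not something the reply states.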
